hi, I'm very, very new to Hadoop, Hive, etc. and I have to import data into hive tables.
Environment: Amazon EMR, S3, etc. The input file is on S3 and I copied it into my HDFS. 1. flat table with one column and loaded data into it: CREATE TABLE mdmp_raw_data (json_record STRING); LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO TABLE `mdmp_raw_data`; That worked, I can access some data, like this: SELECT d.carrier, d.language, d.country FROM mdmp_raw_data a LATERAL VIEW json_tuple(a.data, 'requestTimestamp', 'context') b AS requestTimestamp, context LATERAL VIEW json_tuple(b.context, 'locale') c AS locale LATERAL VIEW json_tuple(c.locale, 'carrier', 'language', 'country') d AS carrier, language, country LIMIT 1; Result: o2 - de Deutsch Deutschland I can also select the array at once: SELECT b.requestTimestamp, b.batch FROM mdmp_raw_data a LATERAL VIEW json_tuple(a.data, 'requestTimestamp', 'batch') b AS requestTimestamp, batch LIMIT 1; This will give me: [{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"2ca08247-5542-4cb4-be7e-4a8574fb77a8","sessionId":"f29ec175ca6b7d10","event":"TEST Doge Comments","userId":"doge96514016ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Much joy."}}, ...] This "batch" may contain n events will a structure like above. I want to put all events in a table where each "element" will be stored in a unique column: timestamp, requestId, sessionId, event, userId, action, context, properties 2. explode the "batch" I read a lot about SerDe, etc. - but I don't get it. - I tried to create a table with an array and load the data into it - several errors use explode in query but it doesn't accept "batch" as array - integrated several SerDes but get things like "unknown function jspilt" - I'm lost in too many documents, howtos, etc. and could need some advices... Thank you in advance! Best, Chris