I think you can just take a look at jsonserde It does take care of nested json documents. (though you will need to know entire json structure upfront)
Here is example of using it http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/ On Mon, Jun 23, 2014 at 2:28 PM, Christian Link <christian.l...@mdmp.com> wrote: > Hi Jerome, > > thanks...I've already found "Brickhouse" and the Hive UDFs, but it didn't > help. > > Today I'll try again to process the json file after going through all my > mails...maybe I'll find a solution. > > Best, > Chris > > > On Fri, Jun 20, 2014 at 7:16 PM, Jerome Banks <jba...@tagged.com> wrote: > >> Christian, >> Sorry to spam this newsgroup, and this is not a commercial >> endorsement, but check out the Hive UDFs in the Brickhouse project ( >> http://github.com/klout/brickhouse ) ( >> http://brickhouseconfessions.wordpress.com/2014/02/07/hive-and-json-made-simple/ >> ) >> >> You can convert arbitrary complex Hive structures to an from json with >> it's to_json and from_json UDF's. See the blog posting for an explanation. >> >> -- jerome >> >> >> On Fri, Jun 20, 2014 at 8:26 AM, Christian Link <christian.l...@mdmp.com> >> wrote: >> >>> hi, >>> >>> I'm very, very new to Hadoop, Hive, etc. and I have to import data into >>> hive tables. >>> >>> Environment: Amazon EMR, S3, etc. >>> >>> The input file is on S3 and I copied it into my HDFS. >>> >>> 1. flat table with one column and loaded data into it: >>> >>> CREATE TABLE mdmp_raw_data (json_record STRING); >>> LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO >>> TABLE `mdmp_raw_data`; >>> That worked, I can access some data, like this: >>> >>> SELECT d.carrier, d.language, d.country >>> FROM mdmp_raw_data a LATERAL VIEW json_tuple(a.data, >>> 'requestTimestamp', 'context') b AS requestTimestamp, context >>> LATERAL VIEW json_tuple(b.context, 'locale') c AS locale >>> LATERAL VIEW json_tuple(c.locale, 'carrier', 'language', 'country') d >>> AS carrier, language, country >>> LIMIT 1; >>> >>> Result: o2 - de Deutsch Deutschland >>> >>> I can also select the array at once: >>> >>> SELECT b.requestTimestamp, b.batch >>> FROM mdmp_raw_data a >>> LATERAL VIEW json_tuple(a.data, 'requestTimestamp', 'batch') b AS >>> requestTimestamp, batch >>> LIMIT 1; >>> This will give me: >>> >>> >>> [{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"2ca08247-5542-4cb4-be7e-4a8574fb77a8","sessionId":"f29ec175ca6b7d10","event":"TEST >>> Doge >>> Comments","userId":"doge96514016ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Much >>> joy."}}, ...] >>> >>> This "batch" may contain n events will a structure like above. >>> >>> I want to put all events in a table where each "element" will be stored >>> in a unique column: timestamp, requestId, sessionId, event, userId, action, >>> context, properties >>> >>> 2. explode the "batch" I read a lot about SerDe, etc. - but I don't get >>> it. >>> >>> - I tried to create a table with an array and load the data into it - >>> several errors >>> use explode in query but it doesn't accept "batch" as array >>> - integrated several SerDes but get things like "unknown function jspilt" >>> - I'm lost in too many documents, howtos, etc. and could need some >>> advices... >>> >>> Thank you in advance! >>> >>> Best, Chris >>> >> >> > -- Nitin Pawar