I think you can just take a look at JsonSerDe.

It takes care of nested JSON documents (though you will need to know the
entire JSON structure upfront).

Here is an example of using it:
http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
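For the nested record quoted further down in this thread, a table definition could look roughly like the sketch below. This is only a sketch: the jar path, the S3 location, and the exact field layout are assumptions based on the sample event in Christian's mail, so adjust them to the real schema before using it.

```sql
-- Sketch: register a JSON SerDe (jar path is an assumption)
ADD JAR /path/to/hive-hcatalog-core.jar;

-- Each top-level request holds a "batch" array of event structs
CREATE EXTERNAL TABLE mdmp_events_raw (
  requestTimestamp STRING,
  batch ARRAY<STRUCT<
    `timestamp`: STRING,
    requestId:   STRING,
    sessionId:   STRING,
    event:       STRING,
    userId:      STRING,
    action:      STRING,
    context:     MAP<STRING, STRING>,
    properties:  MAP<STRING, STRING>
  >>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://your-bucket/input-api/';

-- One row per event, via explode() on the batch array
SELECT e.`timestamp`, e.requestId, e.sessionId,
       e.event, e.userId, e.action
FROM mdmp_events_raw
LATERAL VIEW explode(batch) b AS e;
```

Once the SerDe maps "batch" to a real ARRAY type, explode() works on it directly, which is what the json_tuple approach below cannot give you (json_tuple only returns strings).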




On Mon, Jun 23, 2014 at 2:28 PM, Christian Link <christian.l...@mdmp.com>
wrote:

> Hi Jerome,
>
> thanks...I've already found "Brickhouse" and the Hive UDFs, but it didn't
> help.
>
> Today I'll try again to process the json file after going through all my
> mails...maybe I'll find a solution.
>
> Best,
> Chris
>
>
> On Fri, Jun 20, 2014 at 7:16 PM, Jerome Banks <jba...@tagged.com> wrote:
>
>> Christian,
>>    Sorry to spam this newsgroup, and this is not a commercial
>> endorsement, but check out the Hive UDFs in the Brickhouse project (
>> http://github.com/klout/brickhouse ) (
>> http://brickhouseconfessions.wordpress.com/2014/02/07/hive-and-json-made-simple/
>> )
>>
>> You can convert arbitrarily complex Hive structures to and from JSON with
>> its to_json and from_json UDFs. See the blog post for an explanation.
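>> A rough sketch of how those UDFs get wired up (the jar path here is an
>> assumption; the class names are the ones the Brickhouse project ships):

```sql
-- Register the Brickhouse UDFs (jar path is an assumption)
ADD JAR /path/to/brickhouse.jar;
CREATE TEMPORARY FUNCTION from_json AS 'brickhouse.udf.json.FromJsonUDF';
CREATE TEMPORARY FUNCTION to_json   AS 'brickhouse.udf.json.ToJsonUDF';

-- from_json takes the JSON string plus a template describing
-- the target Hive type; here each event is read as a loose map
SELECT from_json(json_record, 'array<map<string,string>>') AS events
FROM mdmp_raw_data;
```

>> The template string above is a simplification; nested structs can be
>> spelled out in the template as well if you need typed fields.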
>>
>> -- jerome
>>
>>
>> On Fri, Jun 20, 2014 at 8:26 AM, Christian Link <christian.l...@mdmp.com>
>> wrote:
>>
>>> hi,
>>>
>>> I'm very, very new to Hadoop, Hive, etc., and I have to import data into
>>> Hive tables.
>>>
>>> Environment: Amazon EMR, S3, etc.
>>>
>>> The input file is on S3 and I copied it into my HDFS.
>>>
>>> 1. I created a flat table with one column and loaded data into it:
>>>
>>>   CREATE TABLE mdmp_raw_data (json_record STRING);
>>>   LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO
>>> TABLE `mdmp_raw_data`;
>>> That worked; I can access some data, like this:
>>>
>>> SELECT d.carrier, d.language, d.country
>>>   FROM mdmp_raw_data a
>>>   LATERAL VIEW json_tuple(a.json_record, 'requestTimestamp', 'context') b
>>>     AS requestTimestamp, context
>>>   LATERAL VIEW json_tuple(b.context, 'locale') c AS locale
>>>   LATERAL VIEW json_tuple(c.locale, 'carrier', 'language', 'country') d
>>>     AS carrier, language, country
>>>   LIMIT 1;
>>>
>>> Result: o2 - de Deutsch Deutschland
>>>
>>> I can also select the array at once:
>>>
>>> SELECT b.requestTimestamp, b.batch
>>>   FROM mdmp_raw_data a
>>>   LATERAL VIEW json_tuple(a.json_record, 'requestTimestamp', 'batch') b
>>>     AS requestTimestamp, batch
>>>   LIMIT 1;
>>> This will give me:
>>>
>>> [{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"2ca08247-5542-4cb4-be7e-4a8574fb77a8","sessionId":"f29ec175ca6b7d10","event":"TEST Doge Comments","userId":"doge96514016ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Much joy."}}, ...]
>>>
>>> This "batch" may contain n events will a structure like above.
>>>
>>> I want to put all events in a table where each element is stored in its
>>> own column: timestamp, requestId, sessionId, event, userId, action,
>>> context, properties.
>>>
>>> 2. Explode the "batch": I read a lot about SerDes, etc., but I don't get
>>> it.
>>>
>>> - I tried to create a table with an array and load the data into it -
>>> several errors
>>> - I used explode in a query, but it doesn't accept "batch" as an array
>>> - I integrated several SerDes but get things like "unknown function jspilt"
>>> - I'm lost in too many documents, howtos, etc. and could use some
>>> advice...
>>>
>>> Thank you in advance!
>>>
>>> Best, Chris
>>>
>>
>>
>


-- 
Nitin Pawar
