I've got 1.1.9.2 (thanks to Roberto) and the data is looking better... I'll test the "double/int" thingie now.
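[Editor's note: for anyone following along, here is a minimal sketch in Python (not the SerDe's actual code) of the kind of type mismatch the "double/int" issue refers to. JSON number literals come in both integer and floating-point form, and in the sample record below the same display struct mixes both, so a column declared `double` can receive a bare integer literal.]

```python
import json

# The sample record mixes integer and floating-point literals in one struct:
# "display":{"density":1.5,"width":540,"height":960}. A strict deserializer
# that maps JSON number types one-to-one onto Hive types can choke when a
# `double` column receives an integer literal (or vice versa).
sample = '{"display": {"density": 1.5, "width": 540, "height": 960}}'
display = json.loads(sample)["display"]

print(type(display["density"]).__name__)  # float -> maps cleanly to Hive double
print(type(display["width"]).__name__)    # int   -> needs widening for a double column
```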
Best,
Chris

On Mon, Jun 23, 2014 at 8:51 PM, Swagatika Tripathy <swagatikat...@gmail.com> wrote:
> Hi,
> Use the 1.9.3 JSON serde with dependencies jar. It's the latest one, I suppose.
>
> Thanks
> Swagatika
> On Jun 23, 2014 11:57 PM, "Roberto Congiu" <roberto.con...@openx.com> wrote:
>
>> Hi,
>> 1.1.4 is an oldish version of the JSON serde; have you tried the most
>> recent one from the master branch?
>>
>>
>> On Mon, Jun 23, 2014 at 10:23 AM, Christian Link <christian.l...@mdmp.com> wrote:
>>
>>> Hi,
>>>
>>> thanks...but I need to sort things out with ONE SerDe/strategy...
>>> I've started with André's idea: using Roberto Congiu's SerDe and
>>> André's template to create a table with the right schema and load the
>>> data afterwards.
>>>
>>> But it's not completely working...
>>>
>>> I did the following (sorry for spamming...):
>>>
>>> 1. create table and load data
>>>
>>> -- create database (if not exists)
>>> CREATE DATABASE IF NOT EXISTS mdmp_api_dump;
>>>
>>> -- connect to database
>>> USE mdmp_api_dump;
>>>
>>> -- add SerDe for JSON processing
>>> ADD JAR /home/hadoop/lib/hive/json-serde-1.1.4-jar-with-dependencies.jar;
>>>
>>> -- drop old raw data
>>> DROP TABLE IF EXISTS mdmp_raw_data;
>>>
>>> -- create raw data table
>>> CREATE TABLE mdmp_raw_data (
>>>   action string,
>>>   batch array<
>>>     struct<
>>>       timestamp:string,
>>>       traits:map<string,string>,
>>>       requestId:string,
>>>       sessionId:string,
>>>       event:string,
>>>       userId:string,
>>>       action:string,
>>>       context:map<string,string>,
>>>       properties:map<string,string>
>>>     >
>>>   >,
>>>   context struct<
>>>     build:map<string,string>,
>>>     device:struct<
>>>       brand:string,
>>>       manufacturer:string,
>>>       model:string,
>>>       release:string,
>>>       sdk:int
>>>     >,
>>>     display:struct<
>>>       density:double,
>>>       height:int,
>>>       width:int
>>>     >,
>>>     integrations:map<string,boolean>,
>>>     library:string,
>>>     libraryVersion:string,
>>>     locale:map<string,string>,
>>>     location:map<string,string>,
>>>     telephony:map<string,string>,
>>>     wifi:map<string,boolean>
>>>   >,
>>>   received_at string,
>>>   requestTimestamp string,
>>>   writeKey string
>>> )
>>> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
>>> STORED AS TEXTFILE;
>>>
>>> -- load data
>>> LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO
>>> TABLE `mdmp_raw_data`;
>>>
>>> 2. run query against the "raw data" and create "formatted table":
>>>
>>> ADD JAR /home/hadoop/lib/hive/json-serde-1.1.4-jar-with-dependencies.jar;
>>>
>>> USE mdmp_api_dump;
>>>
>>> DROP TABLE IF EXISTS mdmp_api_data;
>>>
>>> CREATE TABLE mdmp_api_data AS
>>> SELECT DISTINCT
>>>   a.action,
>>>   a.received_at,
>>>   a.requestTimestamp,
>>>   a.writeKey,
>>>   a.context.device.brand AS brand,
>>>   a.context.device.manufacturer AS manufacturer,
>>>   a.context.device.model AS model,
>>>   a.context.device.release AS release,
>>>   a.context.device.sdk AS sdk,
>>>   -- a.context.display.density AS density,
>>>   a.context.display.height AS height,
>>>   a.context.display.width AS width,
>>>   a.context.telephony['radio'] AS tel_radio,
>>>   a.context.telephony['carrier'] AS tel_carrier,
>>>   a.context.wifi['connected'] AS wifi_connected,
>>>   a.context.wifi['available'] AS wifi_available,
>>>   a.context.locale['carrier'] AS loce_carrier,
>>>   a.context.locale['language'] AS loce_language,
>>>   a.context.locale['country'] AS loce_country,
>>>   a.context.integrations['Tapstream'] AS int_tapstream,
>>>   a.context.integrations['Amplitude'] AS int_amplitude,
>>>   a.context.integrations['Localytics'] AS int_localytics,
>>>   a.context.integrations['Flurry'] AS int_flurry,
>>>   a.context.integrations['Countly'] AS int_countly,
>>>   a.context.integrations['Quantcast'] AS int_quantcast,
>>>   a.context.integrations['Crittercism'] AS int_crittercism,
>>>   a.context.integrations['Google Analytics'] AS int_googleanalytics,
>>>   a.context.integrations['Mixpanel'] AS int_mixpanel,
>>>   b.batch.action AS b_action,
>>>   b.batch.context,
>>>   b.batch.event,
>>>   b.batch.properties,
>>>   b.batch.requestId,
>>>   b.batch.sessionId,
>>>   b.batch.timestamp,
>>>   b.batch.traits,
>>>   b.batch.userId
>>> FROM mdmp_raw_data a
>>> LATERAL VIEW explode(a.batch) b AS batch;
>>>
>>> So far, so good (besides a silly double/int bug in the outdated SerDe),
>>> I thought.
>>>
>>> But it turned out that some fields are NULL - within all records.
>>>
>>> Affected fields are:
>>> b.batch.event,
>>> b.batch.requestId,
>>> b.batch.sessionId,
>>> b.batch.userId
>>>
>>> I can see the values in the JSON file, but neither in the "raw table" nor
>>> in the final table...that's really strange.
>>>
>>> An example record:
>>>
>>> {"requestTimestamp":"2014-06-19T14:25:26+02:00",
>>>  "context":{"libraryVersion":"0.6.13",
>>>    "telephony":{"radio":"gsm","carrier":"o2 - de"},
>>>    "wifi":{"connected":true,"available":true},
>>>    "location":{},
>>>    "locale":{"carrier":"o2 - de","language":"Deutsch","country":"Deutschland"},
>>>    "library":"analytics-android",
>>>    "device":{"brand":"htc","model":"HTC One S","sdk":16,"release":"4.1.1","manufacturer":"HTC"},
>>>    "display":{"density":1.5,"width":540,"height":960},
>>>    "build":{"name":"1.0","code":1},
>>>    "integrations":{"Tapstream":false,"Amplitude":false,"Localytics":false,"Flurry":false,"Countly":false,"Bugsnag":false,"Quantcast":false,"Crittercism":false,"Google Analytics":false,"Mixpanel":false}},
>>>  "batch":[
>>>    {"timestamp":"2014-06-19T14:25:17+02:00","requestId":"32377337-3f99-4ac5-bfc6-d3654584655b","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Ruff ruff!"}},
>>>    {"timestamp":"2014-06-19T14:25:18+02:00","requestId":"fbfd45c9-cf0f-4cb3-955c-85c65220a5bd","sessionId":"75cd18db8a364c2","event":"TEST Doge Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,08"}},
>>>    {"timestamp":"2014-06-19T14:25:18+02:00","requestId":"3a643b12-64e5-4a7c-b44b-e3e09dbc5b66","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Wow..."}},
>>>    {"action":"identify","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"timestamp":"2014-06-19T14:25:19+02:00","traits":{"email":"do...@mdmp.com","name":"Carmelo Doge"},"requestId":"ef2910f4-cd4f-4175-89d0-7d91b35c229f","sessionId":"75cd18db8a364c2","userId":"doge74167705ruffruff"},
>>>    {"timestamp":"2014-06-19T14:25:19+02:00","requestId":"1676bb06-abee-4135-a206-d57c4a1bc24d","sessionId":"75cd18db8a364c2","event":"TEST Doge App Usage","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{}},
>>>    {"timestamp":"2014-06-19T14:25:20+02:00","requestId":"66532c8a-c5da-4852-b8b6-04df8f3052d5","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Many data."}},
>>>    {"timestamp":"2014-06-19T14:25:21+02:00","requestId":"a1a79d8c-fe58-4567-8dec-a8d1d2ae2713","sessionId":"75cd18db8a364c2","event":"TEST Doge Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,87"}},
>>>    {"timestamp":"2014-06-19T14:25:21+02:00","requestId":"259209ac-b135-4d5f-bdac-535eccc0400e","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Wow..."}},
>>>    {"timestamp":"2014-06-19T14:25:23+02:00","requestId":"59b6d57c-c7a5-4b2a-af6d-fa10ae0de60c","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Such App!"}},
>>>    {"timestamp":"2014-06-19T14:25:24+02:00","requestId":"8b05226f-bdf5-4af8-bb91-84da1b874c6e","sessionId":"75cd18db8a364c2","event":"TEST Doge Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,50"}},
>>>    {"timestamp":"2014-06-19T14:25:24+02:00","requestId":"0f366675-5641-4238-b2a9-176735de6edd","sessionId":"75cd18db8a364c2","event":"TEST Doge Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Ruff ruff!"}},
>>>    {"timestamp":"2014-06-19T14:25:26+02:00","requestId":"9e832534-5114-4ec1-bc20-1dcf1c354d0c","sessionId":"75cd18db8a364c2","event":"Session end","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"start":"14:25:09","end":"14:25:26"}}],
>>>  "writeKey":"a8RCFSAVjmT5qyxLKMzt12kcXWOIusvw","action":"import","received_at":"2014-06-19T12:25:29.790+00:00"}
>>>
>>> Funny thing is, I'm sure that I've seen these values earlier
>>> today...I've reloaded the data/tables several times to see if this is
>>> still working...well. :)
>>>
>>> I'm gonna stop for today...another try tomorrow.
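[Editor's note: one way to narrow down the NULL issue is to replay the explode outside Hive. The sketch below, plain Python on a trimmed copy of one batch element from the record above, flattens the array the way LATERAL VIEW explode(a.batch) emits one row per element, and checks that the fields coming back NULL in Hive are actually populated in the raw JSON. If they are, the data is fine and suspicion shifts to the SerDe version or the schema/key matching; that is a debugging aid, not a confirmed diagnosis.]

```python
import json

# Trimmed copy of the first batch element from the example record.
record = json.loads('''
{"batch": [
  {"timestamp": "2014-06-19T14:25:17+02:00",
   "requestId": "32377337-3f99-4ac5-bfc6-d3654584655b",
   "sessionId": "75cd18db8a364c2",
   "event": "TEST Doge Comments",
   "userId": "doge74167705ruffruff",
   "action": "track",
   "context": {"library": "analytics-android", "libraryVersion": "0.6.13"},
   "properties": {"comment": "Ruff ruff!"}}
]}
''')

# LATERAL VIEW explode(a.batch) yields one row per array element; do the
# same here and verify the fields that are NULL in Hive exist in the JSON.
for row in record["batch"]:
    for field in ("event", "requestId", "sessionId", "userId"):
        print(field, "->", row.get(field))
```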
>>>
>>> Thanks so far, and many greetings from Berlin,
>>> Chris
>>>
>>>
>>> On Mon, Jun 23, 2014 at 6:57 PM, Sachin Goyal <sgo...@walmartlabs.com> wrote:
>>>
>>>>
>>>> You can also use hive-json-schema to automate Hive schema generation
>>>> from JSON:
>>>> https://github.com/quux00/hive-json-schema
>>>>
>>>>
>>>> From: Nitin Pawar <nitinpawar...@gmail.com>
>>>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>>>> Date: Monday, June 23, 2014 at 2:25 AM
>>>> To: "user@hive.apache.org" <user@hive.apache.org>
>>>> Subject: Re: how to load json with nested array into hive?
>>>>
>>>> I think you can just take a look at jsonserde.
>>>>
>>>> It does take care of nested JSON documents (though you will need to
>>>> know the entire JSON structure upfront).
>>>>
>>>> Here is an example of using it:
>>>> http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
>>>>
>>>>
>>>> On Mon, Jun 23, 2014 at 2:28 PM, Christian Link <christian.l...@mdmp.com> wrote:
>>>> Hi Jerome,
>>>>
>>>> thanks...I've already found "Brickhouse" and the Hive UDFs, but it
>>>> didn't help.
>>>>
>>>> Today I'll try again to process the JSON file after going through all
>>>> my mails...maybe I'll find a solution.
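[Editor's note: for readers wondering what hive-json-schema automates, here is a toy Python version of the idea: walk one sample record and emit a Hive type string per field. Real tools handle nesting, unions, and conflicting samples far more carefully; this sketch is only for intuition and all names in it are made up.]

```python
import json

def hive_type(value):
    """Map a parsed JSON value to a rough Hive type string."""
    if isinstance(value, bool):      # bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        inner = hive_type(value[0]) if value else "string"
        return f"array<{inner}>"
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{hive_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    return "string"                  # null and anything else

sample = json.loads('{"sdk": 16, "density": 1.5, "connected": true, "batch": [{"event": "x"}]}')
for name, value in sample.items():
    print(name, hive_type(value))
# sdk int
# density double
# connected boolean
# batch array<struct<event:string>>
```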
>>>>
>>>> Best,
>>>> Chris
>>>>
>>>>
>>>> On Fri, Jun 20, 2014 at 7:16 PM, Jerome Banks <jba...@tagged.com> wrote:
>>>> Christian,
>>>> Sorry to spam this newsgroup, and this is not a commercial
>>>> endorsement, but check out the Hive UDFs in the Brickhouse project
>>>> ( http://github.com/klout/brickhouse )
>>>> ( http://brickhouseconfessions.wordpress.com/2014/02/07/hive-and-json-made-simple/ )
>>>>
>>>> You can convert arbitrarily complex Hive structures to and from JSON
>>>> with its to_json and from_json UDFs. See the blog posting for an
>>>> explanation.
>>>>
>>>> -- jerome
>>>>
>>>>
>>>> On Fri, Jun 20, 2014 at 8:26 AM, Christian Link <christian.l...@mdmp.com> wrote:
>>>> hi,
>>>>
>>>> I'm very, very new to Hadoop, Hive, etc., and I have to import data into
>>>> Hive tables.
>>>>
>>>> Environment: Amazon EMR, S3, etc.
>>>>
>>>> The input file is on S3, and I copied it into my HDFS.
>>>>
>>>> 1. flat table with one column, and loaded data into it:
>>>>
>>>> CREATE TABLE mdmp_raw_data (json_record STRING);
>>>> LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO
>>>> TABLE `mdmp_raw_data`;
>>>>
>>>> That worked; I can access some data, like this:
>>>>
>>>> SELECT d.carrier, d.language, d.country
>>>> FROM mdmp_raw_data a
>>>> LATERAL VIEW json_tuple(a.json_record, 'requestTimestamp', 'context') b AS requestTimestamp, context
>>>> LATERAL VIEW json_tuple(b.context, 'locale') c AS locale
>>>> LATERAL VIEW json_tuple(c.locale, 'carrier', 'language', 'country') d AS carrier, language, country
>>>> LIMIT 1;
>>>>
>>>> Result: o2 - de  Deutsch  Deutschland
>>>>
>>>> I can also select the array at once:
>>>>
>>>> SELECT b.requestTimestamp, b.batch
>>>> FROM mdmp_raw_data a
>>>> LATERAL VIEW json_tuple(a.json_record, 'requestTimestamp', 'batch') b AS requestTimestamp, batch
>>>> LIMIT 1;
>>>>
>>>> This will give me:
>>>> [{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"2ca08247-5542-4cb4-be7e-4a8574fb77a8","sessionId":"f29ec175ca6b7d10","event":"TEST Doge Comments","userId":"doge96514016ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Much joy."}}, ...]
>>>>
>>>> This "batch" may contain n events with a structure like the above.
>>>>
>>>> I want to put all events in a table where each "element" will be stored
>>>> in a unique column: timestamp, requestId, sessionId, event, userId,
>>>> action, context, properties
>>>>
>>>> 2. explode the "batch"
>>>>
>>>> I read a lot about SerDes, etc., but I don't get it:
>>>>
>>>> - I tried to create a table with an array and load the data into it -
>>>>   several errors
>>>> - I used explode in a query, but it doesn't accept "batch" as an array
>>>> - I integrated several SerDes but get things like "unknown function
>>>>   jspilt"
>>>> - I'm lost in too many documents, howtos, etc. and could use some
>>>>   advice...
>>>>
>>>> Thank you in advance!
>>>>
>>>> Best, Chris
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>
>>>
>>
>>
>> --
>> ----------------------------------------------------------
>> Good judgement comes with experience.
>> Experience comes with bad judgement.
>> ----------------------------------------------------------
>> Roberto Congiu - Data Engineer - OpenX
>> tel: +1 626 466 1141
>
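[Editor's note: a closing illustration of what the chained json_tuple LATERAL VIEWs in the thread compute, redone in plain Python on a trimmed record. json_tuple hands nested objects back as JSON strings, which is why the Hive query has to peel off one level per LATERAL VIEW; in Python the nesting is just dictionary access.]

```python
import json

# Trimmed version of the thread's sample record: only the path the
# json_tuple query extracts (context.locale.{carrier,language,country}).
record = json.loads(
    '{"requestTimestamp": "2014-06-19T14:25:26+02:00",'
    ' "context": {"locale": {"carrier": "o2 - de",'
    ' "language": "Deutsch", "country": "Deutschland"}}}'
)

locale = record["context"]["locale"]
print(locale["carrier"], locale["language"], locale["country"])
# o2 - de Deutsch Deutschland  (matches the Result line quoted above)
```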