Re: Using json_tuple for Nested json Arrays

2015-10-27 Thread Sam Joe
I tried using EXPLODE function on the nested json array but it doesn't work
and throws following error:

FAILED: UDFArgumentException explode() takes an array or a map as a
parameter


Thanks,
Joel

On Tue, Oct 27, 2015 at 3:20 PM, Sam Joe  wrote:

> Hi,
>
> Is it possible to use json_tuple function to extract data from json arrays
> (nested too). I am trying to process json data as string and avoid using
> serdes since user data may be malformed.
>
> Please see a sample json data given below:
>
>
> {
>
> "filter_level": "low",
>
> "retweeted": false,
>
> "in_reply_to_screen_name": null,
>
> "possibly_sensitive": false,
>
> "truncated": false,
>
> "lang": "en",
>
> "in_reply_to_status_id_str": null,
>
> "id": 654395184428515332,
>
> "extended_entities": {
>
> "media": [{
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 16864598,
>
> "media_url": "http://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
>
> "display_url": "pic.twitter.com/i3004WyF4g",
>
> "type": "photo",
>
> "url": "http://t.co/i3004WyF4g;,
>
> "id": 654301608990388224,
>
> "media_url_https": "https://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
>
> "expanded_url": "
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>
> "source_user_id_str": "16864598",
>
> "indices": [143,
>
> 144],
>
> "source_status_id_str": "654301626665189376",
>
> "source_status_id": 654301626665189376,
>
> "id_str": "654301608990388224"
>
> },
>
> {
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 16864598,
>
> "media_url": "http://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg;,
>
> "display_url": "pic.twitter.com/i3004WyF4g",
>
> "type": "photo",
>
> "url": "http://t.co/i3004WyF4g;,
>
> "id": 654301608998764544,
>
> "media_url_https": "https://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg;,
>
> "expanded_url": "
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>
> "source_user_id_str": "16864598",
>
> "indices": [143,
>
> 144],
>
> "source_status_id_str": "654301626665189376",
>
> "source_status_id": 654301626665189376,
>
> "id_str": "654301608998764544"
>
> },
>
> {
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 16864598,
>
> "media_url": "http://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg;,
>
> "display_url": "pic.twitter.com/i3004WyF4g",
>
> "type": "photo",
>
> "url": "http://t.co/i3004WyF4g;,
>
> "id": 654301608994586624,
>
> "media_url_https": "https://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg;,
>
> "expanded_url": "
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>
> "source_user_id_str": "16864598",
>
> "indices": [143,
>
> 144],
>
> "source_status_id_str": "654301626665189376",
>
> "source_status_id": 654301626665189376,
>
> "id_str": "654301608994586624"
>
> },
>
> {
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 16864598,
>
> "media_url": "http://pbs.twimg.com/media/CRSL2M8WcAEXowZ.jpg;,
>
> "display_url": "pic.twitter.com/i3004WyF4g",
>
> "type": "photo",
>
> "url": "http://t.co/i3004WyF4g;,
>
> "id": 654301609179115521,
>
> "media_url_https": "https://pbs.twimg.com/media/CRSL2M8WcAEXowZ.jpg;,
>
> "expanded_url": "
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>
> "source_user_id_str": "16864598",
>
> "indices": [143,
>
> 144],
>
> "source_status_id_str": "654301626665189376",
>
> "source_status_id": 654301626665189376,
>
> "id_str": "654301609179115521"
>
> }]
>
> }
>
> }
>
>
> Appreciate any help!
>
>
> *Thanks,*
>
> *Joel*
>
>
>
>


RE: Using json_tuple for Nested json Arrays

2015-10-27 Thread Ryan Harris
looking at your sample data, you shouldn't need to use lateral view explode 
unless you are trying to get 1 entry per row for  your media sizes (thumb, 
small, large, medium, etc) ...

Try starting with something simple like :

SELECT get_json_object(text_col, '$.id') as id FROM tweets_raw limit 10;

You should also be able to use json_tuple(), but start simple

From: Sam Joe [mailto:games2013@gmail.com]
Sent: Tuesday, October 27, 2015 1:43 PM
To: user@hive.apache.org
Subject: Re: Using json_tuple for Nested json Arrays

Hi Ryan,

Thanks for your reply! I didn't try using  get_json_object() UDF. I will try 
using that and let you know the results.

I tried using the following script which failed :

  SELECT tr2.id<http://tr2.id>, tr2.possibly_sensitive, tr3.media, 
media_object.source_user_id
  FROM tweets_raw tr1
  LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities', 
'possibly_sensitive') tr2 as id, extended_entities, possibly_sensitive
  LATERAL VIEW json_tuple(tr2.extended_entities, 'media') tr3 as media
  LATERAL VIEW EXPLODE(tr3.media) media_exploded as media_object
  where tr2.id<http://tr2.id>='654395184428515332'
  LIMIT 1;

FAILED: UDFArgumentException explode() takes an array or a map as a parameter


Thanks,
Joel

On Tue, Oct 27, 2015 at 3:37 PM, Ryan Harris 
<ryan.har...@zionsbancorp.com<mailto:ryan.har...@zionsbancorp.com>> wrote:
Do you have an example of the query that you tried (which failed).
In short, you probably want to use the get_json_object() UDF:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object

if you need the JSON array broken into individual records, you might require 
lateral view explode as in these examples:
http://mechanics.flite.com/blog/2014/04/16/using-explode-and-lateral-view-in-hive/
http://stackoverflow.com/questions/28716165/how-to-query-struct-array-with-hive-get-json-object


From: Sam Joe [mailto:games2013@gmail.com<mailto:games2013@gmail.com>]
Sent: Tuesday, October 27, 2015 1:29 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Using json_tuple for Nested json Arrays

I tried using EXPLODE function on the nested json array but it doesn't work and 
throws following error:

FAILED: UDFArgumentException explode() takes an array or a map as a parameter


Thanks,
Joel

On Tue, Oct 27, 2015 at 3:20 PM, Sam Joe 
<games2013@gmail.com<mailto:games2013@gmail.com>> wrote:
Hi,

Is it possible to use json_tuple function to extract data from json arrays 
(nested too). I am trying to process json data as string and avoid using serdes 
since user data may be malformed.

Please see a sample json data given below:


{
"filter_level": "low",
"retweeted": false,
"in_reply_to_screen_name": null,
"possibly_sensitive": false,
"truncated": false,
"lang": "en",
"in_reply_to_status_id_str": null,
"id": 654395184428515332,
"extended_entities": {
"media": [{
"sizes": {
"thumb": {
"w": 150,
"resize": "crop",
"h": 150
},
"small": {
"w": 340,
"resize": "fit",
"h": 255
},
"large": {
"w": 1024,
"resize": "fit",
"h": 768
},
"medium": {
"w": 600,
"resize": "fit",
"h": 450
}
},
"source_user_id": 16864598,
"media_url": "http://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
"display_url": "pic.twitter.com/i3004WyF4g<http://pic.twitter.com/i3004WyF4g>",
"type": "photo",
"url": "http://t.co/i3004WyF4g;,
"id": 654301608990388224,
"media_url_https": "https://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
"expanded_url": 
"http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
"source_user_id_str": "16864598",
"indices": [143,
144],
"source_status_id_str": "654301626665189376",
"source_status_id": 654301626665189376,
"id_str": "654301608990388224"
},
{
"sizes": {
"thumb": {
"w": 150,
"resize": "crop",
"h": 150
},
"small": {
"w": 340,
"resize": "fit",
"h": 255
},
"large": {
"w": 1024,
"resize": "fit",
"h": 768
},
"medium": {
"w": 600,
"resize": "fit",
"h": 450
}
},
"source_user_id": 16864598,
"media_url": "http://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg;,
"display_url": "pic.twitter.com/i3004WyF4g<http://pic.twitter.com/i3004WyF4g>",
"type": "phot

RE: Using json_tuple for Nested json Arrays

2015-10-27 Thread Ryan Harris
I see where you are going with this now

Not sure if you might be bumping into this bug:
https://issues.apache.org/jira/browse/HIVE-1575

since this line
  LATERAL VIEW json_tuple(tr2.extended_entities, 'media') tr3 as media
pulls the JSON array as a "top-level" object...

does this not work?
  LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities', 
'possibly_sensitive', 'extended_entities.media') tr2 as id, extended_entities, 
possibly_sensitive, media


From: Sam Joe [mailto:games2013@gmail.com]
Sent: Tuesday, October 27, 2015 2:25 PM
To: user@hive.apache.org
Subject: Re: Using json_tuple for Nested json Arrays

Hi Ryan,

The simple query is running fine as shown below:

hive>   SELECT tr2.id<http://tr2.id>, tr2.possibly_sensitive
>   FROM tweets_raw tr1
>   LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities', 
'possibly_sensitive') tr2 as id, extended_entities, possibly_sensitive
>   where tr2.id<http://tr2.id>='654395184428515332'
>   LIMIT 1;
OK
654395184428515332  false
Time taken: 1.813 seconds, Fetched: 1 row(s)
hive>


However, if I try to get any data from the json array it's failing.

Thanks,
Joel

On Tue, Oct 27, 2015 at 4:21 PM, Ryan Harris 
<ryan.har...@zionsbancorp.com<mailto:ryan.har...@zionsbancorp.com>> wrote:
looking at your sample data, you shouldn't need to use lateral view explode 
unless you are trying to get 1 entry per row for  your media sizes (thumb, 
small, large, medium, etc) ...

Try starting with something simple like :

SELECT get_json_object(text_col, '$.id') as id FROM tweets_raw limit 10;

You should also be able to use json_tuple(), but start simple

From: Sam Joe [mailto:games2013@gmail.com<mailto:games2013@gmail.com>]
Sent: Tuesday, October 27, 2015 1:43 PM

To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Using json_tuple for Nested json Arrays

Hi Ryan,

Thanks for your reply! I didn't try using  get_json_object() UDF. I will try 
using that and let you know the results.

I tried using the following script which failed :

  SELECT tr2.id<http://tr2.id>, tr2.possibly_sensitive, tr3.media, 
media_object.source_user_id
  FROM tweets_raw tr1
  LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities', 
'possibly_sensitive') tr2 as id, extended_entities, possibly_sensitive
  LATERAL VIEW json_tuple(tr2.extended_entities, 'media') tr3 as media
  LATERAL VIEW EXPLODE(tr3.media) media_exploded as media_object
  where tr2.id<http://tr2.id>='654395184428515332'
  LIMIT 1;

FAILED: UDFArgumentException explode() takes an array or a map as a parameter


Thanks,
Joel

On Tue, Oct 27, 2015 at 3:37 PM, Ryan Harris 
<ryan.har...@zionsbancorp.com<mailto:ryan.har...@zionsbancorp.com>> wrote:
Do you have an example of the query that you tried (which failed).
In short, you probably want to use the get_json_object() UDF:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object

if you need the JSON array broken into individual records, you might require 
lateral view explode as in these examples:
http://mechanics.flite.com/blog/2014/04/16/using-explode-and-lateral-view-in-hive/
http://stackoverflow.com/questions/28716165/how-to-query-struct-array-with-hive-get-json-object


From: Sam Joe [mailto:games2013@gmail.com<mailto:games2013@gmail.com>]
Sent: Tuesday, October 27, 2015 1:29 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Using json_tuple for Nested json Arrays

I tried using EXPLODE function on the nested json array but it doesn't work and 
throws following error:

FAILED: UDFArgumentException explode() takes an array or a map as a parameter


Thanks,
Joel

On Tue, Oct 27, 2015 at 3:20 PM, Sam Joe 
<games2013@gmail.com<mailto:games2013@gmail.com>> wrote:
Hi,

Is it possible to use json_tuple function to extract data from json arrays 
(nested too). I am trying to process json data as string and avoid using serdes 
since user data may be malformed.

Please see a sample json data given below:


{
"filter_level": "low",
"retweeted": false,
"in_reply_to_screen_name": null,
"possibly_sensitive": false,
"truncated": false,
"lang": "en",
"in_reply_to_status_id_str": null,
"id": 654395184428515332,
"extended_entities": {
"media": [{
"sizes": {
"thumb": {
"w": 150,
"resize": "crop",
"h": 150
},
"small": {
"w": 340,
"resize": "fit",
"h": 255
},
"large": {
"w": 1024,
"resize": "fit",
"h": 768
},
"medium": {
"w": 600,
"resize": "fit",
"h": 450
}
},
"source_user_id": 16864598,
&q

RE: Using json_tuple for Nested json Arrays

2015-10-27 Thread Ryan Harris
Do you have an example of the query that you tried (which failed).
In short, you probably want to use the get_json_object() UDF:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object

if you need the JSON array broken into individual records, you might require 
lateral view explode as in these examples:
http://mechanics.flite.com/blog/2014/04/16/using-explode-and-lateral-view-in-hive/
http://stackoverflow.com/questions/28716165/how-to-query-struct-array-with-hive-get-json-object


From: Sam Joe [mailto:games2013@gmail.com]
Sent: Tuesday, October 27, 2015 1:29 PM
To: user@hive.apache.org
Subject: Re: Using json_tuple for Nested json Arrays

I tried using EXPLODE function on the nested json array but it doesn't work and 
throws following error:

FAILED: UDFArgumentException explode() takes an array or a map as a parameter


Thanks,
Joel

On Tue, Oct 27, 2015 at 3:20 PM, Sam Joe 
<games2013@gmail.com<mailto:games2013@gmail.com>> wrote:
Hi,

Is it possible to use json_tuple function to extract data from json arrays 
(nested too). I am trying to process json data as string and avoid using serdes 
since user data may be malformed.

Please see a sample json data given below:


{
"filter_level": "low",
"retweeted": false,
"in_reply_to_screen_name": null,
"possibly_sensitive": false,
"truncated": false,
"lang": "en",
"in_reply_to_status_id_str": null,
"id": 654395184428515332,
"extended_entities": {
"media": [{
"sizes": {
"thumb": {
"w": 150,
"resize": "crop",
"h": 150
},
"small": {
"w": 340,
"resize": "fit",
"h": 255
},
"large": {
"w": 1024,
"resize": "fit",
"h": 768
},
"medium": {
"w": 600,
"resize": "fit",
"h": 450
}
},
"source_user_id": 16864598,
"media_url": "http://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
"display_url": "pic.twitter.com/i3004WyF4g<http://pic.twitter.com/i3004WyF4g>",
"type": "photo",
"url": "http://t.co/i3004WyF4g;,
"id": 654301608990388224,
"media_url_https": "https://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
"expanded_url": 
"http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
"source_user_id_str": "16864598",
"indices": [143,
144],
"source_status_id_str": "654301626665189376",
"source_status_id": 654301626665189376,
"id_str": "654301608990388224"
},
{
"sizes": {
"thumb": {
"w": 150,
"resize": "crop",
"h": 150
},
"small": {
"w": 340,
"resize": "fit",
"h": 255
},
"large": {
"w": 1024,
"resize": "fit",
"h": 768
},
"medium": {
"w": 600,
"resize": "fit",
"h": 450
}
},
"source_user_id": 16864598,
"media_url": "http://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg;,
"display_url": "pic.twitter.com/i3004WyF4g<http://pic.twitter.com/i3004WyF4g>",
"type": "photo",
"url": "http://t.co/i3004WyF4g;,
"id": 654301608998764544,
"media_url_https": "https://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg;,
"expanded_url": 
"http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
"source_user_id_str": "16864598",
"indices": [143,
144],
"source_status_id_str": "654301626665189376",
"source_status_id": 654301626665189376,
"id_str": "654301608998764544"
},
{
"sizes": {
"thumb": {
"w": 150,
"resize": "crop",
"h": 150
},
"small": {
"w": 340,
"resize": "fit",
"h": 255
},
"large": {
"w": 1024,
"resize": "fit",
"h": 768
},
"medium": {
"w": 600,
"resize": "fit",
"h": 450
}
},
"source_user_id": 16864598,
"media_url": "http://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg;,
"display_url": "pic.twitter.com/i3004WyF4g<http://pic.twitter.com/i3004WyF4g>",
"type": "photo",
"url": "http://t.co/i3004WyF4g;,
"id": 654301608994586624,
"media_url_https": "https://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg;,
"expanded_url": 
"http://twitter.com/lordlancaster/status/654301626665189

Re: Using json_tuple for Nested json Arrays

2015-10-27 Thread Sam Joe
Hi Ryan,

The simple query is running fine as shown below:

hive>   SELECT tr2.id, tr2.possibly_sensitive
>   FROM tweets_raw tr1
>   LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities',
'possibly_sensitive') tr2 as id, extended_entities, possibly_sensitive
>   where tr2.id='654395184428515332'
>   LIMIT 1;
OK
654395184428515332  false
Time taken: 1.813 seconds, Fetched: 1 row(s)
hive>


However, if I try to get any data from the json array it's failing.

Thanks,
Joel

On Tue, Oct 27, 2015 at 4:21 PM, Ryan Harris <ryan.har...@zionsbancorp.com>
wrote:

> looking at your sample data, you shouldn't need to use lateral view
> explode unless you are trying to get 1 entry per row for  your media sizes
> (thumb, small, large, medium, etc) ...
>
>
>
> Try starting with something simple like :
>
>
>
> SELECT get_json_object(text_col, '$.id') as id FROM tweets_raw limit 10;
>
>
>
> You should also be able to use json_tuple(), but start simple
>
>
>
> *From:* Sam Joe [mailto:games2013@gmail.com]
> *Sent:* Tuesday, October 27, 2015 1:43 PM
>
> *To:* user@hive.apache.org
> *Subject:* Re: Using json_tuple for Nested json Arrays
>
>
>
> Hi Ryan,
>
>
>
> Thanks for your reply! I didn't try using  get_json_object() UDF. I will
> try using that and let you know the results.
>
>
>
> I tried using the following script which failed :
>
>
>
>   SELECT tr2.id, tr2.possibly_sensitive, tr3.media,
> media_object.source_user_id
>
>   FROM tweets_raw tr1
>
>   LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities',
> 'possibly_sensitive') tr2 as id, extended_entities, possibly_sensitive
>
>   LATERAL VIEW json_tuple(tr2.extended_entities, 'media') tr3 as media
>
>   LATERAL VIEW EXPLODE(tr3.media) media_exploded as media_object
>
>   where tr2.id='654395184428515332'
>
>   LIMIT 1;
>
>
>
> FAILED: UDFArgumentException explode() takes an array or a map as a
> parameter
>
>
>
>
>
> Thanks,
>
> Joel
>
>
>
> On Tue, Oct 27, 2015 at 3:37 PM, Ryan Harris <ryan.har...@zionsbancorp.com>
> wrote:
>
> Do you have an example of the query that you tried (which failed).
>
> In short, you probably want to use the get_json_object() UDF:
>
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object
>
>
>
> if you need the JSON array broken into individual records, you might
> require lateral view explode as in these examples:
>
>
> http://mechanics.flite.com/blog/2014/04/16/using-explode-and-lateral-view-in-hive/
>
>
> http://stackoverflow.com/questions/28716165/how-to-query-struct-array-with-hive-get-json-object
>
>
>
>
>
> *From:* Sam Joe [mailto:games2013@gmail.com]
> *Sent:* Tuesday, October 27, 2015 1:29 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Using json_tuple for Nested json Arrays
>
>
>
> I tried using EXPLODE function on the nested json array but it doesn't
> work and throws following error:
>
>
>
> FAILED: UDFArgumentException explode() takes an array or a map as a
> parameter
>
>
>
>
>
> Thanks,
>
> Joel
>
>
>
> On Tue, Oct 27, 2015 at 3:20 PM, Sam Joe <games2013@gmail.com> wrote:
>
> Hi,
>
>
>
> Is it possible to use json_tuple function to extract data from json arrays
> (nested too). I am trying to process json data as string and avoid using
> serdes since user data may be malformed.
>
>
>
> Please see a sample json data given below:
>
>
>
>
>
> {
>
> "filter_level": "low",
>
> "retweeted": false,
>
> "in_reply_to_screen_name": null,
>
> "possibly_sensitive": false,
>
> "truncated": false,
>
> "lang": "en",
>
> "in_reply_to_status_id_str": null,
>
> "id": 654395184428515332,
>
> "extended_entities": {
>
> "media": [{
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 1

Re: Using json_tuple for Nested json Arrays

2015-10-27 Thread Sam Joe
Hi Ryan,

Thanks for your reply! I didn't try using  get_json_object() UDF. I will
try using that and let you know the results.

I tried using the following script which failed :

  SELECT tr2.id, tr2.possibly_sensitive, tr3.media,
media_object.source_user_id
  FROM tweets_raw tr1
  LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities',
'possibly_sensitive') tr2 as id, extended_entities, possibly_sensitive
  LATERAL VIEW json_tuple(tr2.extended_entities, 'media') tr3 as media
  LATERAL VIEW EXPLODE(tr3.media) media_exploded as media_object
  where tr2.id='654395184428515332'
  LIMIT 1;

FAILED: UDFArgumentException explode() takes an array or a map as a
parameter


Thanks,
Joel

On Tue, Oct 27, 2015 at 3:37 PM, Ryan Harris <ryan.har...@zionsbancorp.com>
wrote:

> Do you have an example of the query that you tried (which failed).
>
> In short, you probably want to use the get_json_object() UDF:
>
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object
>
>
>
> if you need the JSON array broken into individual records, you might
> require lateral view explode as in these examples:
>
>
> http://mechanics.flite.com/blog/2014/04/16/using-explode-and-lateral-view-in-hive/
>
>
> http://stackoverflow.com/questions/28716165/how-to-query-struct-array-with-hive-get-json-object
>
>
>
>
>
> *From:* Sam Joe [mailto:games2013@gmail.com]
> *Sent:* Tuesday, October 27, 2015 1:29 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Using json_tuple for Nested json Arrays
>
>
>
> I tried using EXPLODE function on the nested json array but it doesn't
> work and throws following error:
>
>
>
> FAILED: UDFArgumentException explode() takes an array or a map as a
> parameter
>
>
>
>
>
> Thanks,
>
> Joel
>
>
>
> On Tue, Oct 27, 2015 at 3:20 PM, Sam Joe <games2013@gmail.com> wrote:
>
> Hi,
>
>
>
> Is it possible to use json_tuple function to extract data from json arrays
> (nested too). I am trying to process json data as string and avoid using
> serdes since user data may be malformed.
>
>
>
> Please see a sample json data given below:
>
>
>
>
>
> {
>
> "filter_level": "low",
>
> "retweeted": false,
>
> "in_reply_to_screen_name": null,
>
> "possibly_sensitive": false,
>
> "truncated": false,
>
> "lang": "en",
>
> "in_reply_to_status_id_str": null,
>
> "id": 654395184428515332,
>
> "extended_entities": {
>
> "media": [{
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 16864598,
>
> "media_url": "http://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
>
> "display_url": "pic.twitter.com/i3004WyF4g",
>
> "type": "photo",
>
> "url": "http://t.co/i3004WyF4g;,
>
> "id": 654301608990388224,
>
> "media_url_https": "https://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
>
> "expanded_url": "
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>
> "source_user_id_str": "16864598",
>
> "indices": [143,
>
> 144],
>
> "source_status_id_str": "654301626665189376",
>
> "source_status_id": 654301626665189376,
>
> "id_str": "654301608990388224"
>
> },
>
> {
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> }

Re: Using json_tuple for Nested json Arrays

2015-10-27 Thread Sam Joe
Hi Ryan,

The statement returns null for media as shown below:

hive> SELECT tr2.id, tr2.possibly_sensitive, tr2.media
>   FROM tweets_raw tr1
>   LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities',
'possibly_sensitive', 'extended_entities.media') tr2 as id,
extended_entities, possibly_sensitive, media
>   where tr2.id='654395184428515332'
>   LIMIT 1;
OK
654395184428515332  false   NULL


I am able to return the media json array using the following script:

hive>   SELECT tr2.id, tr2.possibly_sensitive, tr3.media
>   FROM tweets_raw tr1
>   LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities',
'possibly_sensitive') tr2 as id, extended_entities, possibly_sensitive
>   LATERAL VIEW json_tuple(tr2.extended_entities, 'media') tr3 as media
>   where tr2.id='654395184428515332'
>   LIMIT 1;
OK
654395184428515332  false
[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"
http://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg","display_url":;
pic.twitter.com/i3004WyF4g","type":"photo","url":"http://t.co/i3004WyF4g
","id":654301608990388224,"media_url_https":"
https://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg","expanded_url":;
http://twitter.com/lordlancaster/status/654301626665189376/photo/1
","source_user_id_str":"16864598","indices":[143,144],"source_status_id_str":"654301626665189376","source_status_id":654301626665189376,"id_str":"654301608990388224"},{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"
http://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg","display_url":;
pic.twitter.com/i3004WyF4g","type":"photo","url":"http://t.co/i3004WyF4g
","id":654301608998764544,"media_url_https":"
https://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg","expanded_url":;
http://twitter.com/lordlancaster/status/654301626665189376/photo/1
","source_user_id_str":"16864598","indices":[143,144],"source_status_id_str":"654301626665189376","source_status_id":654301626665189376,"id_str":"654301608998764544"},{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"
http://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg","display_url":;
pic.twitter.com/i3004WyF4g","type":"photo","url":"http://t.co/i3004WyF4g
","id":654301608994586624,"media_url_https":"
https://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg","expanded_url":;
http://twitter.com/lordlancaster/status/654301626665189376/photo/1
","source_user_id_str":"16864598","indices":[143,144],"source_status_id_str":"654301626665189376","source_status_id":654301626665189376,"id_str":"654301608994586624"},{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"
http://pbs.twimg.com/media/CRSL2M8WcAEXowZ.jpg","display_url":;
pic.twitter.com/i3004WyF4g","type":"photo","url":"http://t.co/i3004WyF4g
","id":654301609179115521,"media_url_

RE: Using json_tuple for Nested json Arrays

2015-10-27 Thread Ryan Harris
hmmm...I'm not sure what the return value type of json_tuple is...
I'd probably try creating a temporary table from your working query below and 
then work on getting the lateral view explode to work against the temp table.

FAILED: UDFArgumentException explode() takes an array or a map as a parameter
 Apparently, hive doesn't think tr3.media is an array or map..so what is it?

From: Sam Joe [mailto:games2013@gmail.com]
Sent: Tuesday, October 27, 2015 2:56 PM
To: user@hive.apache.org
Subject: Re: Using json_tuple for Nested json Arrays

Hi Ryan,

The statement returns null for media as shown below:

hive> SELECT tr2.id<http://tr2.id>, tr2.possibly_sensitive, tr2.media
>   FROM tweets_raw tr1
>   LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities', 
'possibly_sensitive', 'extended_entities.media') tr2 as id, extended_entities, 
possibly_sensitive, media
>   where tr2.id<http://tr2.id>='654395184428515332'
>   LIMIT 1;
OK
654395184428515332  false   NULL


I am able to return the media json array using the following script:

hive>   SELECT tr2.id<http://tr2.id>, tr2.possibly_sensitive, tr3.media
>   FROM tweets_raw tr1
>   LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities', 
'possibly_sensitive') tr2 as id, extended_entities, possibly_sensitive
>   LATERAL VIEW json_tuple(tr2.extended_entities, 'media') tr3 as media
>   where tr2.id<http://tr2.id>='654395184428515332'
>   LIMIT 1;
OK
654395184428515332  false   
[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"http://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg","display_url":"pic.twitter.com/i3004WyF4g<http://pic.twitter.com/i3004WyF4g>","type":"photo","url":"http://t.co/i3004WyF4g","id":654301608990388224,"media_url_https":"https://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg","expanded_url":"http://twitter.com/lordlancaster/status/654301626665189376/photo/1","source_user_id_str":"16864598","indices":[143,144],"source_status_id_str":"654301626665189376","source_status_id":654301626665189376,"id_str":"654301608990388224"},{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"http://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg","display_url":"pic.twitter.com/i3004WyF4g<http://pic.twitter.com/i3004WyF4g>","type":"photo","url":"http://t.co/i3004WyF4g","id":654301608998764544,"media_url_https":"https://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg","expanded_url":"http://twitter.com/lordlancaster/status/654301626665189376/photo/1","source_user_id_str":"16864598","indices":[143,144],"source_status_id_str":"654301626665189376","source_status_id":654301626665189376,"id_str":"654301608998764544"},{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"http://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg","display_url":"pic.twitter.com/i3004WyF4g<http://pic.twitter.com/i3004WyF4g>","type":"photo","url":"http://t.co/i3004WyF4g","id":654301608994586624,"media_url_https":"https://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg","expanded_url":"http://twitter.com/lordlancaster/status/654301626665189376/photo/1","source_user_id_str":"16864598","indices":[143,144],"source_status_id_str":"654301626665189376","source_status_id":6

Re: Using json_tuple for Nested json Arrays

2015-10-27 Thread Sam Joe
Hi Ryan,

I think tr3.media  a complex json array having nested json tuple objects.
For example, sizes is a json tuple object present inside the array which I
think the function EXPLODE is not expecting. May be the explode function is
expecting a closing brace '}' corresponding to the first brace '{' instead
of another opening brace.

So, could this be a bug?

Thanks,
Joel

On Tue, Oct 27, 2015 at 5:22 PM, Ryan Harris <ryan.har...@zionsbancorp.com>
wrote:

> hmmm...I'm not sure what the return value type of json_tuple is...
>
> I'd probably try creating a temporary table from your working query below
> and then work on getting the lateral view explode to work against the temp
> table.
>
>
>
> FAILED: UDFArgumentException explode() takes an array or a map as a
> parameter
>
>  Apparently, hive doesn't think tr3.media is an array or map..so what is
> it?
>
>
>
> *From:* Sam Joe [mailto:games2013@gmail.com]
> *Sent:* Tuesday, October 27, 2015 2:56 PM
>
> *To:* user@hive.apache.org
> *Subject:* Re: Using json_tuple for Nested json Arrays
>
>
>
> Hi Ryan,
>
>
>
> The statement returns null for media as shown below:
>
>
>
> hive> SELECT tr2.id, tr2.possibly_sensitive, tr2.media
>
> >   FROM tweets_raw tr1
>
> >   LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities',
> 'possibly_sensitive', 'extended_entities.media') tr2 as id,
> extended_entities, possibly_sensitive, media
>
> >   where tr2.id='654395184428515332'
>
> >   LIMIT 1;
>
> OK
>
> 654395184428515332  false   NULL
>
>
>
>
>
> I am able to return the media json array using the following script:
>
>
>
> hive>   SELECT tr2.id, tr2.possibly_sensitive, tr3.media
>
> >   FROM tweets_raw tr1
>
> >   LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities',
> 'possibly_sensitive') tr2 as id, extended_entities, possibly_sensitive
>
> >   LATERAL VIEW json_tuple(tr2.extended_entities, 'media') tr3 as
> media
>
> >   where tr2.id='654395184428515332'
>
> >   LIMIT 1;
>
> OK
>
> 654395184428515332  false
> [{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"
> http://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg","display_url":;
> pic.twitter.com/i3004WyF4g","type":"photo","url":"http://t.co/i3004WyF4g
> ","id":654301608990388224,"media_url_https":"
> https://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg","expanded_url":;
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1
> ","source_user_id_str":"16864598","indices":[143,144],"source_status_id_str":"654301626665189376","source_status_id":654301626665189376,"id_str":"654301608990388224"},{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"
> http://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg","display_url":;
> pic.twitter.com/i3004WyF4g","type":"photo","url":"http://t.co/i3004WyF4g
> ","id":654301608998764544,"media_url_https":"
> https://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg","expanded_url":;
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1
> ","source_user_id_str":"16864598","indices":[143,144],"source_status_id_str":"654301626665189376","source_status_id":654301626665189376,"id_str":"654301608998764544"},{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h"

Re: Using json_tuple for Nested json Arrays

2015-10-27 Thread Nishant Aggarwal
Hello Sam,
You can easily achieve this by using elephant-bird.jars in pig. We are also
caturing tweets via flume and filter them using pig and elephant-jars. You
can find the related jars over internet.

Cheers,
Nishant Aggarwal
On 28 Oct 2015 00:50, "Sam Joe"  wrote:

> Hi,
>
> Is it possible to use json_tuple function to extract data from json arrays
> (nested too). I am trying to process json data as string and avoid using
> serdes since user data may be malformed.
>
> Please see a sample json data given below:
>
>
> {
>
> "filter_level": "low",
>
> "retweeted": false,
>
> "in_reply_to_screen_name": null,
>
> "possibly_sensitive": false,
>
> "truncated": false,
>
> "lang": "en",
>
> "in_reply_to_status_id_str": null,
>
> "id": 654395184428515332,
>
> "extended_entities": {
>
> "media": [{
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 16864598,
>
> "media_url": "http://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
>
> "display_url": "pic.twitter.com/i3004WyF4g",
>
> "type": "photo",
>
> "url": "http://t.co/i3004WyF4g;,
>
> "id": 654301608990388224,
>
> "media_url_https": "https://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
>
> "expanded_url": "
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>
> "source_user_id_str": "16864598",
>
> "indices": [143,
>
> 144],
>
> "source_status_id_str": "654301626665189376",
>
> "source_status_id": 654301626665189376,
>
> "id_str": "654301608990388224"
>
> },
>
> {
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 16864598,
>
> "media_url": "http://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg;,
>
> "display_url": "pic.twitter.com/i3004WyF4g",
>
> "type": "photo",
>
> "url": "http://t.co/i3004WyF4g;,
>
> "id": 654301608998764544,
>
> "media_url_https": "https://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg;,
>
> "expanded_url": "
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>
> "source_user_id_str": "16864598",
>
> "indices": [143,
>
> 144],
>
> "source_status_id_str": "654301626665189376",
>
> "source_status_id": 654301626665189376,
>
> "id_str": "654301608998764544"
>
> },
>
> {
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 16864598,
>
> "media_url": "http://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg;,
>
> "display_url": "pic.twitter.com/i3004WyF4g",
>
> "type": "photo",
>
> "url": "http://t.co/i3004WyF4g;,
>
> "id": 654301608994586624,
>
> "media_url_https": "https://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg;,
>
> "expanded_url": "
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>
> "source_user_id_str": "16864598",
>
> "indices": [143,
>
> 144],
>
> "source_status_id_str": "654301626665189376",
>
> "source_status_id": 654301626665189376,
>
> "id_str": "654301608994586624"
>
> },
>
> {
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 16864598,
>
> "media_url": "http://pbs.twimg.com/media/CRSL2M8WcAEXowZ.jpg;,
>
> "display_url": "pic.twitter.com/i3004WyF4g",
>
> "type": "photo",
>
> "url": "http://t.co/i3004WyF4g;,
>
> "id": 654301609179115521,
>
> "media_url_https": "https://pbs.twimg.com/media/CRSL2M8WcAEXowZ.jpg;,
>
> "expanded_url": "
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>
> "source_user_id_str": "16864598",
>
> "indices": [143,
>
> 144],
>
> "source_status_id_str": "654301626665189376",
>
> "source_status_id": 654301626665189376,
>
> "id_str": "654301609179115521"
>
> }]
>
> }
>
> }
>
>
> Appreciate any help!
>
>
> *Thanks,*
>
> *Joel*
>
>
>
>


RE: Using json_tuple for Nested json Arrays

2015-10-27 Thread Ryan Harris
I mean that I don't know what the return hive datatype of json_tuple is.  If it 
is returning a json array based on the hive string datatype, explode() isn't 
going to be happy with that and it will throw the error that you are getting.

if you do a create table as statement, you can then check the resulting 
datatype of that table to see exactly what is returned from json_tuple().

If json_tuple() is returning a hive array datatype, then I'm not sure why your 
original query isn't working correctly unless you possibly need to wrap it in a 
sub-query...either way, the next step I would take in troubleshooting would be 
to use CTAS to
a) check the datatype of the 'media' column if it isn't an array, you might 
checkout 
https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFJsonAsArray.java
b) if 'media' turns out to be an array datatype you just need to troubleshoot 
the explode()

From: Sam Joe [mailto:games2013@gmail.com]
Sent: Tuesday, October 27, 2015 7:37 PM
To: user@hive.apache.org
Subject: Re: Using json_tuple for Nested json Arrays

Hi Ryan,

I think tr3.media  a complex json array having nested json tuple objects. For 
example, sizes is a json tuple object present inside the array which I think 
the function EXPLODE is not expecting. May be the explode function is expecting 
a closing brace '}' corresponding to the first brace '{' instead of another 
opening brace.

So, could this be a bug?

Thanks,
Joel

On Tue, Oct 27, 2015 at 5:22 PM, Ryan Harris 
<ryan.har...@zionsbancorp.com<mailto:ryan.har...@zionsbancorp.com>> wrote:
hmmm...I'm not sure what the return value type of json_tuple is...
I'd probably try creating a temporary table from your working query below and 
then work on getting the lateral view explode to work against the temp table.

FAILED: UDFArgumentException explode() takes an array or a map as a parameter
 Apparently, hive doesn't think tr3.media is an array or map..so what is it?

From: Sam Joe [mailto:games2013@gmail.com<mailto:games2013@gmail.com>]
Sent: Tuesday, October 27, 2015 2:56 PM

To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Using json_tuple for Nested json Arrays

Hi Ryan,

The statement returns null for media as shown below:

hive> SELECT tr2.id<http://tr2.id>, tr2.possibly_sensitive, tr2.media
>   FROM tweets_raw tr1
>   LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities', 
'possibly_sensitive', 'extended_entities.media') tr2 as id, extended_entities, 
possibly_sensitive, media
>   where tr2.id<http://tr2.id>='654395184428515332'
>   LIMIT 1;
OK
654395184428515332  false   NULL


I am able to return the media json array using the following script:

hive>   SELECT tr2.id<http://tr2.id>, tr2.possibly_sensitive, tr3.media
>   FROM tweets_raw tr1
>   LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities', 
'possibly_sensitive') tr2 as id, extended_entities, possibly_sensitive
>   LATERAL VIEW json_tuple(tr2.extended_entities, 'media') tr3 as media
>   where tr2.id<http://tr2.id>='654395184428515332'
>   LIMIT 1;
OK
654395184428515332  false   
[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"http://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg","display_url":"pic.twitter.com/i3004WyF4g<http://pic.twitter.com/i3004WyF4g>","type":"photo","url":"http://t.co/i3004WyF4g","id":654301608990388224,"media_url_https":"https://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg","expanded_url":"http://twitter.com/lordlancaster/status/654301626665189376/photo/1","source_user_id_str":"16864598","indices":[143,144],"source_status_id_str":"654301626665189376","source_status_id":654301626665189376,"id_str":"654301608990388224"},{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"http://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg","display_url":"pic.twitter.com/i3004WyF4g

Re: Using json_tuple for Nested json Arrays

2015-10-27 Thread Sam Joe
Thanks Nishant! Will try using Pig json loader too to achieve this
requirement. If you have any tutorial for extracting data from complex
nested json arrays (as the example given in my previous email), please send
it.

Appreciate your help!

Thanks,
Joel

On Tue, Oct 27, 2015 at 10:20 PM, Nishant Aggarwal 
wrote:

> Hello Sam,
> You can easily achieve this by using elephant-bird.jars in pig. We are
> also caturing tweets via flume and filter them using pig and elephant-jars.
> You can find the related jars over internet.
>
> Cheers,
> Nishant Aggarwal
> On 28 Oct 2015 00:50, "Sam Joe"  wrote:
>
>> Hi,
>>
>> Is it possible to use json_tuple function to extract data from json
>> arrays (nested too). I am trying to process json data as string and avoid
>> using serdes since user data may be malformed.
>>
>> Please see a sample json data given below:
>>
>>
>> {
>>
>> "filter_level": "low",
>>
>> "retweeted": false,
>>
>> "in_reply_to_screen_name": null,
>>
>> "possibly_sensitive": false,
>>
>> "truncated": false,
>>
>> "lang": "en",
>>
>> "in_reply_to_status_id_str": null,
>>
>> "id": 654395184428515332,
>>
>> "extended_entities": {
>>
>> "media": [{
>>
>> "sizes": {
>>
>> "thumb": {
>>
>> "w": 150,
>>
>> "resize": "crop",
>>
>> "h": 150
>>
>> },
>>
>> "small": {
>>
>> "w": 340,
>>
>> "resize": "fit",
>>
>> "h": 255
>>
>> },
>>
>> "large": {
>>
>> "w": 1024,
>>
>> "resize": "fit",
>>
>> "h": 768
>>
>> },
>>
>> "medium": {
>>
>> "w": 600,
>>
>> "resize": "fit",
>>
>> "h": 450
>>
>> }
>>
>> },
>>
>> "source_user_id": 16864598,
>>
>> "media_url": "http://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
>>
>> "display_url": "pic.twitter.com/i3004WyF4g",
>>
>> "type": "photo",
>>
>> "url": "http://t.co/i3004WyF4g;,
>>
>> "id": 654301608990388224,
>>
>> "media_url_https": "https://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
>>
>> "expanded_url": "
>> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>>
>> "source_user_id_str": "16864598",
>>
>> "indices": [143,
>>
>> 144],
>>
>> "source_status_id_str": "654301626665189376",
>>
>> "source_status_id": 654301626665189376,
>>
>> "id_str": "654301608990388224"
>>
>> },
>>
>> {
>>
>> "sizes": {
>>
>> "thumb": {
>>
>> "w": 150,
>>
>> "resize": "crop",
>>
>> "h": 150
>>
>> },
>>
>> "small": {
>>
>> "w": 340,
>>
>> "resize": "fit",
>>
>> "h": 255
>>
>> },
>>
>> "large": {
>>
>> "w": 1024,
>>
>> "resize": "fit",
>>
>> "h": 768
>>
>> },
>>
>> "medium": {
>>
>> "w": 600,
>>
>> "resize": "fit",
>>
>> "h": 450
>>
>> }
>>
>> },
>>
>> "source_user_id": 16864598,
>>
>> "media_url": "http://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg;,
>>
>> "display_url": "pic.twitter.com/i3004WyF4g",
>>
>> "type": "photo",
>>
>> "url": "http://t.co/i3004WyF4g;,
>>
>> "id": 654301608998764544,
>>
>> "media_url_https": "https://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg;,
>>
>> "expanded_url": "
>> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>>
>> "source_user_id_str": "16864598",
>>
>> "indices": [143,
>>
>> 144],
>>
>> "source_status_id_str": "654301626665189376",
>>
>> "source_status_id": 654301626665189376,
>>
>> "id_str": "654301608998764544"
>>
>> },
>>
>> {
>>
>> "sizes": {
>>
>> "thumb": {
>>
>> "w": 150,
>>
>> "resize": "crop",
>>
>> "h": 150
>>
>> },
>>
>> "small": {
>>
>> "w": 340,
>>
>> "resize": "fit",
>>
>> "h": 255
>>
>> },
>>
>> "large": {
>>
>> "w": 1024,
>>
>> "resize": "fit",
>>
>> "h": 768
>>
>> },
>>
>> "medium": {
>>
>> "w": 600,
>>
>> "resize": "fit",
>>
>> "h": 450
>>
>> }
>>
>> },
>>
>> "source_user_id": 16864598,
>>
>> "media_url": "http://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg;,
>>
>> "display_url": "pic.twitter.com/i3004WyF4g",
>>
>> "type": "photo",
>>
>> "url": "http://t.co/i3004WyF4g;,
>>
>> "id": 654301608994586624,
>>
>> "media_url_https": "https://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg;,
>>
>> "expanded_url": "
>> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>>
>> "source_user_id_str": "16864598",
>>
>> "indices": [143,
>>
>> 144],
>>
>> "source_status_id_str": "654301626665189376",
>>
>> "source_status_id": 654301626665189376,
>>
>> "id_str": "654301608994586624"
>>
>> },
>>
>> {
>>
>> "sizes": {
>>
>> "thumb": {
>>
>> "w": 150,
>>
>> "resize": "crop",
>>
>> "h": 150
>>
>> },
>>
>> "small": {
>>
>> "w": 340,
>>
>> "resize": "fit",
>>
>> "h": 255
>>
>> },
>>
>> "large": {
>>
>> "w": 1024,
>>
>> "resize": "fit",
>>
>> "h": 768
>>
>> },
>>
>> "medium": {
>>
>> "w": 600,
>>
>> "resize": "fit",
>>
>> "h": 450
>>
>> }
>>
>> },
>>
>> "source_user_id": 16864598,
>>
>> "media_url": "http://pbs.twimg.com/media/CRSL2M8WcAEXowZ.jpg;,
>>
>> "display_url": "pic.twitter.com/i3004WyF4g",
>>
>> "type": "photo",
>>
>> "url": "http://t.co/i3004WyF4g;,
>>
>> "id": 654301609179115521,
>>
>> "media_url_https": "https://pbs.twimg.com/media/CRSL2M8WcAEXowZ.jpg;,
>>
>> "expanded_url": "
>> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>>
>> "source_user_id_str": 

Re: Using json_tuple for Nested json Arrays

2015-10-27 Thread Gopal Vijayaraghavan
Hi,

> If you have any tutorial for extracting data from complex nested json
>arrays (as the example given in my previous email), please send it.

90% of working with the real world is cleansing bad data. People
under-sell hive's flexibility in situations like this.


This is what I do 

hive> compile `
import org.apache.hadoop.hive.ql.exec.UDF \;
import groovy.json.JsonSlurper \;
import org.apache.hadoop.io.Text \;
public class JsonExtract extends UDF {
  public int evaluate(Text a){
def jsonSlurper = new JsonSlurper() \;
def obj = jsonSlurper.parseText(a.toString())\;
return  obj.val1\;
  }
} ` AS GROOVY NAMED json_extract.groovy;


hive> CREATE TEMPORARY FUNCTION json_extract as 'JsonExtract';


hive> select json_extract('{"val1": 2}') from date_dim limit 1;

select json_extract('{"val1": 2}') from date_dim limit 1
OK
2
Time taken: 0.13 seconds, Fetched: 1 row(s)


Caveats - this generates bytecode at runtime, so keep an eye on the

hive> list jars;

Because there's no real namespacing, naming your classes/functions the
same while developing can drive you crazy (a little).

Cheers,
Gopal










Re: Using json_tuple for Nested json Arrays

2015-10-27 Thread Nishant Aggarwal
Hello Sam,

Please find attached PIG script for the same. You may find the necessary
jars below.

http://mvnrepository.com/artifact/com.twitter.elephantbird/elephant-bird-pig

Note: Same functionality can be achieved in Hive as well.




Thanks and Regards
Nishant Aggarwal, PMP
Cell No:- +91 99588 94305
http://in.linkedin.com/pub/nishant-aggarwal/53/698/11b


On Wed, Oct 28, 2015 at 8:30 AM, Sam Joe  wrote:

> Thanks Nishant! Will try using Pig json loader too to achieve this
> requirement. If you have any tutorial for extracting data from complex
> nested json arrays (as the example given in my previous email), please send
> it.
>
> Appreciate your help!
>
> Thanks,
> Joel
>
> On Tue, Oct 27, 2015 at 10:20 PM, Nishant Aggarwal 
> wrote:
>
>> Hello Sam,
>> You can easily achieve this by using elephant-bird.jars in pig. We are
>> also caturing tweets via flume and filter them using pig and elephant-jars.
>> You can find the related jars over internet.
>>
>> Cheers,
>> Nishant Aggarwal
>> On 28 Oct 2015 00:50, "Sam Joe"  wrote:
>>
>>> Hi,
>>>
>>> Is it possible to use json_tuple function to extract data from json
>>> arrays (nested too). I am trying to process json data as string and avoid
>>> using serdes since user data may be malformed.
>>>
>>> Please see a sample json data given below:
>>>
>>>
>>> {
>>>
>>> "filter_level": "low",
>>>
>>> "retweeted": false,
>>>
>>> "in_reply_to_screen_name": null,
>>>
>>> "possibly_sensitive": false,
>>>
>>> "truncated": false,
>>>
>>> "lang": "en",
>>>
>>> "in_reply_to_status_id_str": null,
>>>
>>> "id": 654395184428515332,
>>>
>>> "extended_entities": {
>>>
>>> "media": [{
>>>
>>> "sizes": {
>>>
>>> "thumb": {
>>>
>>> "w": 150,
>>>
>>> "resize": "crop",
>>>
>>> "h": 150
>>>
>>> },
>>>
>>> "small": {
>>>
>>> "w": 340,
>>>
>>> "resize": "fit",
>>>
>>> "h": 255
>>>
>>> },
>>>
>>> "large": {
>>>
>>> "w": 1024,
>>>
>>> "resize": "fit",
>>>
>>> "h": 768
>>>
>>> },
>>>
>>> "medium": {
>>>
>>> "w": 600,
>>>
>>> "resize": "fit",
>>>
>>> "h": 450
>>>
>>> }
>>>
>>> },
>>>
>>> "source_user_id": 16864598,
>>>
>>> "media_url": "http://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
>>>
>>> "display_url": "pic.twitter.com/i3004WyF4g",
>>>
>>> "type": "photo",
>>>
>>> "url": "http://t.co/i3004WyF4g;,
>>>
>>> "id": 654301608990388224,
>>>
>>> "media_url_https": "https://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg;,
>>>
>>> "expanded_url": "
>>> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>>>
>>> "source_user_id_str": "16864598",
>>>
>>> "indices": [143,
>>>
>>> 144],
>>>
>>> "source_status_id_str": "654301626665189376",
>>>
>>> "source_status_id": 654301626665189376,
>>>
>>> "id_str": "654301608990388224"
>>>
>>> },
>>>
>>> {
>>>
>>> "sizes": {
>>>
>>> "thumb": {
>>>
>>> "w": 150,
>>>
>>> "resize": "crop",
>>>
>>> "h": 150
>>>
>>> },
>>>
>>> "small": {
>>>
>>> "w": 340,
>>>
>>> "resize": "fit",
>>>
>>> "h": 255
>>>
>>> },
>>>
>>> "large": {
>>>
>>> "w": 1024,
>>>
>>> "resize": "fit",
>>>
>>> "h": 768
>>>
>>> },
>>>
>>> "medium": {
>>>
>>> "w": 600,
>>>
>>> "resize": "fit",
>>>
>>> "h": 450
>>>
>>> }
>>>
>>> },
>>>
>>> "source_user_id": 16864598,
>>>
>>> "media_url": "http://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg;,
>>>
>>> "display_url": "pic.twitter.com/i3004WyF4g",
>>>
>>> "type": "photo",
>>>
>>> "url": "http://t.co/i3004WyF4g;,
>>>
>>> "id": 654301608998764544,
>>>
>>> "media_url_https": "https://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg;,
>>>
>>> "expanded_url": "
>>> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>>>
>>> "source_user_id_str": "16864598",
>>>
>>> "indices": [143,
>>>
>>> 144],
>>>
>>> "source_status_id_str": "654301626665189376",
>>>
>>> "source_status_id": 654301626665189376,
>>>
>>> "id_str": "654301608998764544"
>>>
>>> },
>>>
>>> {
>>>
>>> "sizes": {
>>>
>>> "thumb": {
>>>
>>> "w": 150,
>>>
>>> "resize": "crop",
>>>
>>> "h": 150
>>>
>>> },
>>>
>>> "small": {
>>>
>>> "w": 340,
>>>
>>> "resize": "fit",
>>>
>>> "h": 255
>>>
>>> },
>>>
>>> "large": {
>>>
>>> "w": 1024,
>>>
>>> "resize": "fit",
>>>
>>> "h": 768
>>>
>>> },
>>>
>>> "medium": {
>>>
>>> "w": 600,
>>>
>>> "resize": "fit",
>>>
>>> "h": 450
>>>
>>> }
>>>
>>> },
>>>
>>> "source_user_id": 16864598,
>>>
>>> "media_url": "http://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg;,
>>>
>>> "display_url": "pic.twitter.com/i3004WyF4g",
>>>
>>> "type": "photo",
>>>
>>> "url": "http://t.co/i3004WyF4g;,
>>>
>>> "id": 654301608994586624,
>>>
>>> "media_url_https": "https://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg;,
>>>
>>> "expanded_url": "
>>> http://twitter.com/lordlancaster/status/654301626665189376/photo/1;,
>>>
>>> "source_user_id_str": "16864598",
>>>
>>> "indices": [143,
>>>
>>> 144],
>>>
>>> "source_status_id_str": "654301626665189376",
>>>
>>> "source_status_id": 654301626665189376,
>>>
>>> "id_str": "654301608994586624"
>>>
>>> },
>>>
>>> {
>>>
>>> "sizes": {
>>>
>>> "thumb": {
>>>
>>> "w": 150,
>>>
>>>