Hi Ryan,
The statement returns null for media as shown below:
hive> SELECT tr2.id, tr2.possibly_sensitive, tr2.media
> FROM tweets_raw tr1
> LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities',
'possibly_sensitive', 'extended_entities.media') tr2 as id,
extended_entities, possibly_sensitive, media
> where tr2.id='654395184428515332'
> LIMIT 1;
OK
654395184428515332 false NULL
I am able to return the media json array using the following script:
hive> SELECT tr2.id, tr2.possibly_sensitive, tr3.media
> FROM tweets_raw tr1
> LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities',
'possibly_sensitive') tr2 as id, extended_entities, possibly_sensitive
> LATERAL VIEW json_tuple(tr2.extended_entities, 'media') tr3 as media
> where tr2.id='654395184428515332'
> LIMIT 1;
OK
654395184428515332 false
[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"
http://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg","display_url":"
pic.twitter.com/i3004WyF4g","type":"photo","url":"http://t.co/i3004WyF4g
","id":654301608990388224,"media_url_https":"
https://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg","expanded_url":"
http://twitter.com/lordlancaster/status/654301626665189376/photo/1
","source_user_id_str":"16864598","indices":[143,144],"source_status_id_str":"654301626665189376","source_status_id":654301626665189376,"id_str":"654301608990388224"},{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"
http://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg","display_url":"
pic.twitter.com/i3004WyF4g","type":"photo","url":"http://t.co/i3004WyF4g
","id":654301608998764544,"media_url_https":"
https://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg","expanded_url":"
http://twitter.com/lordlancaster/status/654301626665189376/photo/1
","source_user_id_str":"16864598","indices":[143,144],"source_status_id_str":"654301626665189376","source_status_id":654301626665189376,"id_str":"654301608998764544"},{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"
http://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg","display_url":"
pic.twitter.com/i3004WyF4g","type":"photo","url":"http://t.co/i3004WyF4g
","id":654301608994586624,"media_url_https":"
https://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg","expanded_url":"
http://twitter.com/lordlancaster/status/654301626665189376/photo/1
","source_user_id_str":"16864598","indices":[143,144],"source_status_id_str":"654301626665189376","source_status_id":654301626665189376,"id_str":"654301608994586624"},{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":255},"large":{"w":1024,"resize":"fit","h":768},"medium":{"w":600,"resize":"fit","h":450}},"source_user_id":16864598,"media_url":"
http://pbs.twimg.com/media/CRSL2M8WcAEXowZ.jpg","display_url":"
pic.twitter.com/i3004WyF4g","type":"photo","url":"http://t.co/i3004WyF4g
","id":654301609179115521,"media_url_https":"
https://pbs.twimg.com/media/CRSL2M8WcAEXowZ.jpg","expanded_url":"
http://twitter.com/lordlancaster/status/654301626665189376/photo/1
","source_user_id_str":"16864598","indices":[143,144],"source_status_id_str":"654301626665189376","source_status_id":654301626665189376,"id_str":"654301609179115521"}]
Time taken: 0.191 seconds, Fetched: 1 row(s)
hive>
However, I would like to know how I can access the json tuples present
inside the media json array.
Appreciate your help!
Thanks,
Joel
On Tue, Oct 27, 2015 at 4:44 PM, Ryan Harris <[email protected]>
wrote:
> I see where you are going with this now....
>
>
>
> Not sure if you might be bumping into this bug:
>
> https://issues.apache.org/jira/browse/HIVE-1575
>
>
>
> since this line
>
> LATERAL VIEW json_tuple(tr2.extended_entities, 'media') tr3 as media
>
> pulls the JSON array as a "top-level" object...
>
>
>
> does this not work?
>
> LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities',
> 'possibly_sensitive', 'extended_entities.media') tr2 as id,
> extended_entities, possibly_sensitive, media
>
>
>
>
>
> *From:* Sam Joe [mailto:[email protected]]
> *Sent:* Tuesday, October 27, 2015 2:25 PM
>
> *To:* [email protected]
> *Subject:* Re: Using json_tuple for Nested json Arrays
>
>
>
> Hi Ryan,
>
>
>
> The simple query is running fine as shown below:
>
>
>
> hive> SELECT tr2.id, tr2.possibly_sensitive
>
> > FROM tweets_raw tr1
>
> > LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities',
> 'possibly_sensitive') tr2 as id, extended_entities, possibly_sensitive
>
> > where tr2.id='654395184428515332'
>
> > LIMIT 1;
>
> OK
>
> 654395184428515332 false
>
> Time taken: 1.813 seconds, Fetched: 1 row(s)
>
> hive>
>
>
>
>
>
> However, if I try to get any data from the json array it's failing.
>
>
>
> Thanks,
>
> Joel
>
>
>
> On Tue, Oct 27, 2015 at 4:21 PM, Ryan Harris <[email protected]>
> wrote:
>
> looking at your sample data, you shouldn't need to use lateral view
> explode unless you are trying to get 1 entry per row for your media sizes
> (thumb, small, large, medium, etc) ...
>
>
>
> Try starting with something simple like :
>
>
>
> SELECT get_json_object(text_col, '$.id') as id FROM tweets_raw limit 10;
>
>
>
> You should also be able to use json_tuple(), but start simple....
>
>
>
> *From:* Sam Joe [mailto:[email protected]]
> *Sent:* Tuesday, October 27, 2015 1:43 PM
>
>
> *To:* [email protected]
> *Subject:* Re: Using json_tuple for Nested json Arrays
>
>
>
> Hi Ryan,
>
>
>
> Thanks for your reply! I didn't try using get_json_object() UDF. I will
> try using that and let you know the results.
>
>
>
> I tried using the following script which failed :
>
>
>
> SELECT tr2.id, tr2.possibly_sensitive, tr3.media,
> media_object.source_user_id
>
> FROM tweets_raw tr1
>
> LATERAL VIEW json_tuple(tr1.text_col, 'id', 'extended_entities',
> 'possibly_sensitive') tr2 as id, extended_entities, possibly_sensitive
>
> LATERAL VIEW json_tuple(tr2.extended_entities, 'media') tr3 as media
>
> LATERAL VIEW EXPLODE(tr3.media) media_exploded as media_object
>
> where tr2.id='654395184428515332'
>
> LIMIT 1;
>
>
>
> FAILED: UDFArgumentException explode() takes an array or a map as a
> parameter
>
>
>
>
>
> Thanks,
>
> Joel
>
>
>
> On Tue, Oct 27, 2015 at 3:37 PM, Ryan Harris <[email protected]>
> wrote:
>
> Do you have an example of the query that you tried (which failed).
>
> In short, you probably want to use the get_json_object() UDF:
>
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object
>
>
>
> if you need the JSON array broken into individual records, you might
> require lateral view explode as in these examples:
>
>
> http://mechanics.flite.com/blog/2014/04/16/using-explode-and-lateral-view-in-hive/
>
>
> http://stackoverflow.com/questions/28716165/how-to-query-struct-array-with-hive-get-json-object
>
>
>
>
>
> *From:* Sam Joe [mailto:[email protected]]
> *Sent:* Tuesday, October 27, 2015 1:29 PM
> *To:* [email protected]
> *Subject:* Re: Using json_tuple for Nested json Arrays
>
>
>
> I tried using EXPLODE function on the nested json array but it doesn't
> work and throws following error:
>
>
>
> FAILED: UDFArgumentException explode() takes an array or a map as a
> parameter
>
>
>
>
>
> Thanks,
>
> Joel
>
>
>
> On Tue, Oct 27, 2015 at 3:20 PM, Sam Joe <[email protected]> wrote:
>
> Hi,
>
>
>
> Is it possible to use json_tuple function to extract data from json arrays
> (nested too). I am trying to process json data as string and avoid using
> serdes since user data may be malformed.
>
>
>
> Please see a sample json data given below:
>
>
>
>
>
> {
>
> "filter_level": "low",
>
> "retweeted": false,
>
> "in_reply_to_screen_name": null,
>
> "possibly_sensitive": false,
>
> "truncated": false,
>
> "lang": "en",
>
> "in_reply_to_status_id_str": null,
>
> "id": 654395184428515332,
>
> "extended_entities": {
>
> "media": [{
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 16864598,
>
> "media_url": "http://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg",
>
> "display_url": "pic.twitter.com/i3004WyF4g",
>
> "type": "photo",
>
> "url": "http://t.co/i3004WyF4g",
>
> "id": 654301608990388224,
>
> "media_url_https": "https://pbs.twimg.com/media/CRSL2MPWsAAOnZo.jpg",
>
> "expanded_url": "
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1",
>
> "source_user_id_str": "16864598",
>
> "indices": [143,
>
> 144],
>
> "source_status_id_str": "654301626665189376",
>
> "source_status_id": 654301626665189376,
>
> "id_str": "654301608990388224"
>
> },
>
> {
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 16864598,
>
> "media_url": "http://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg",
>
> "display_url": "pic.twitter.com/i3004WyF4g",
>
> "type": "photo",
>
> "url": "http://t.co/i3004WyF4g",
>
> "id": 654301608998764544,
>
> "media_url_https": "https://pbs.twimg.com/media/CRSL2MRWgAAGOcj.jpg",
>
> "expanded_url": "
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1",
>
> "source_user_id_str": "16864598",
>
> "indices": [143,
>
> 144],
>
> "source_status_id_str": "654301626665189376",
>
> "source_status_id": 654301626665189376,
>
> "id_str": "654301608998764544"
>
> },
>
> {
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 16864598,
>
> "media_url": "http://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg",
>
> "display_url": "pic.twitter.com/i3004WyF4g",
>
> "type": "photo",
>
> "url": "http://t.co/i3004WyF4g",
>
> "id": 654301608994586624,
>
> "media_url_https": "https://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg",
>
> "expanded_url": "
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1",
>
> "source_user_id_str": "16864598",
>
> "indices": [143,
>
> 144],
>
> "source_status_id_str": "654301626665189376",
>
> "source_status_id": 654301626665189376,
>
> "id_str": "654301608994586624"
>
> },
>
> {
>
> "sizes": {
>
> "thumb": {
>
> "w": 150,
>
> "resize": "crop",
>
> "h": 150
>
> },
>
> "small": {
>
> "w": 340,
>
> "resize": "fit",
>
> "h": 255
>
> },
>
> "large": {
>
> "w": 1024,
>
> "resize": "fit",
>
> "h": 768
>
> },
>
> "medium": {
>
> "w": 600,
>
> "resize": "fit",
>
> "h": 450
>
> }
>
> },
>
> "source_user_id": 16864598,
>
> "media_url": "http://pbs.twimg.com/media/CRSL2M8WcAEXowZ.jpg",
>
> "display_url": "pic.twitter.com/i3004WyF4g",
>
> "type": "photo",
>
> "url": "http://t.co/i3004WyF4g",
>
> "id": 654301609179115521,
>
> "media_url_https": "https://pbs.twimg.com/media/CRSL2M8WcAEXowZ.jpg",
>
> "expanded_url": "
> http://twitter.com/lordlancaster/status/654301626665189376/photo/1",
>
> "source_user_id_str": "16864598",
>
> "indices": [143,
>
> 144],
>
> "source_status_id_str": "654301626665189376",
>
> "source_status_id": 654301626665189376,
>
> "id_str": "654301609179115521"
>
> }]
>
> }
>
> }
>
>
>
> Appreciate any help!
>
>
>
> *Thanks,*
>
> *Joel*
>
>
>
>
>
>
> ------------------------------
>
> THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS
> CONFIDENTIAL and may contain information that is privileged and exempt from
> disclosure under applicable law. If you are neither the intended recipient
> nor responsible for delivering the message to the intended recipient,
> please note that any dissemination, distribution, copying or the taking of
> any action in reliance upon the message is strictly prohibited. If you have
> received this communication in error, please notify the sender immediately.
> Thank you.
>
>
> ------------------------------
>
> THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS
> CONFIDENTIAL and may contain information that is privileged and exempt from
> disclosure under applicable law. If you are neither the intended recipient
> nor responsible for delivering the message to the intended recipient,
> please note that any dissemination, distribution, copying or the taking of
> any action in reliance upon the message is strictly prohibited. If you have
> received this communication in error, please notify the sender immediately.
> Thank you.
>
>
> ------------------------------
> THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS
> CONFIDENTIAL and may contain information that is privileged and exempt from
> disclosure under applicable law. If you are neither the intended recipient
> nor responsible for delivering the message to the intended recipient,
> please note that any dissemination, distribution, copying or the taking of
> any action in reliance upon the message is strictly prohibited. If you have
> received this communication in error, please notify the sender immediately.
> Thank you.
>