In my case it returns the empty records when flatten is not used. 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as hashtags from `twitter.json` t limit 10; +------------+ | hashtags | +------------+ | [] | | [{"text":"SportsNews","indices":[0,11]}] | | [] | | [{"text":"SportsNews","indices":[0,11]}] | | [] | | [] | | [] | | [] | | [] | | [{"text":"CARvsSEA","indices":[36,45]}] | +------------+
On Jan 21, 2015, at 4:26 PM, Hao Zhu <h...@maprtech.com> wrote: > Actually not due to flatten, if you directly query the file, it will only > show the non-null values. > > 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t; > +------------+ > | EXPR$0 | > +------------+ > | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] | > +------------+ > 1 row selected (0.123 seconds) > > Thanks, > Hao > > On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht < > aengelbre...@maprtech.com> wrote: > >> Very interesting, flatten seems to bypass empty records. Not sure if that >> is an ideal result for all use cases, but certainly usable in this case. >> >> Thanks >> —Andries >> >> >> On Jan 21, 2015, at 3:45 PM, Hao Zhu <h...@maprtech.com> wrote: >> >>> I can also fetch non-null values for attached json file which contains 6 >> "entities", 3 of them are null, 3 of them have "text" value. >>> >>> Could you share your complete "twitter.json"? >>> >>> 0: jdbc:drill:> with tmp as >>> . . . . . . . > ( >>> . . . . . . . > select flatten(t.entities.hashtags) as c from >> dfs.tmp.`z.json` t >>> . . . . . . . > ) >>> . . . . . . . > select tmp.c.text from tmp; >>> +------------+ >>> | EXPR$0 | >>> +------------+ >>> | GoPatriots | >>> | aaa | >>> | bbb | >>> +------------+ >>> 3 rows selected (0.122 seconds) >>> >>> Thanks, >>> Hao >>> >>> >>> >>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht < >> aengelbre...@maprtech.com> wrote: >>> The sample data I posted only has 1 element, but some records have >> multiple elements in them. >>> >>> Interestingly enough though >>> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10; >>> >>> Produces >>> +------------+ >>> | EXPR$0 | >>> +------------+ >>> | null | >>> | SportsNews | >>> | null | >>> | SportsNews | >>> | null | >>> | null | >>> | null | >>> | null | >>> | null | >>> | CARvsSEA | >>> +——————+ >>> >>> >>> And >>> >>> select t.entities.hashtags[0] from `twitter.json` t limit 10; >>> >>> +------------+ >>> | EXPR$0 | >>> +------------+ >>> | {"indices":[]} | >>> | {"text":"SportsNews","indices":[90,99]} | >>> | {"indices":[]} | >>> | {"text":"SportsNews","indices":[90,99]} | >>> | {"indices":[]} | >>> | {"indices":[]} | >>> | {"indices":[]} | >>> | {"indices":[]} | >>> | {"indices":[]} | >>> | {"text":"CARvsSEA","indices":[90,99]} | >>> +——————+ >>> >>> Strange part is that there is no indices map in the hashtags array, so >> no idea why it shows up when pointing to the first lament in an empty array. >>> >>> >>> —Andries >>> >>> >>> >>> >>> >>> >>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <h...@maprtech.com> wrote: >>> >>>> I just noticed that the result of "hashtags" is just an array with >> only 1 >>>> element. >>>> So take your example: >>>> [root@maprdemo tmp]# cat d.json >>>> { >>>> "entities": { >>>> "trends": [], >>>> "symbols": [], >>>> "urls": [], >>>> "hashtags": [], >>>> "user_mentions": [] >>>> }, >>>> "entities": { >>>> "trends": [1,2,3], >>>> "symbols": [4,5,6], >>>> "urls": [7,8,9], >>>> "hashtags": [ >>>> { >>>> "text": "GoPatriots", >>>> "indices": [] >>>> } >>>> ], >>>> "user_mentions": [] >>>> } >>>> } >>>> >>>> Now we can do this to achieve the results: >>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ; >>>> +------------+ >>>> | EXPR$0 | >>>> +------------+ >>>> | [{"text":"GoPatriots"}] | >>>> +------------+ >>>> 1 row selected (0.09 seconds) >>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from >> dfs.tmp.`d.json` t ; >>>> +------------+ >>>> | EXPR$0 | >>>> +------------+ >>>> | GoPatriots | >>>> +------------+ >>>> 1 row selected (0.108 seconds) >>>> >>>> Thanks, >>>> Hao >>>> >>>> >>>> >>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht < >>>> aengelbre...@maprtech.com> wrote: >>>> >>>>> When I run the query on a larger dataset it actually show the empty >>>>> records. >>>>> >>>>> select t.entities.hashtags from `twitter.json` t limit 10; >>>>> >>>>> +------------+ >>>>> | EXPR$0 | >>>>> +------------+ >>>>> | [] | >>>>> | [{"text":"SportsNews","indices":[0,11]}] | >>>>> | [] | >>>>> | [{"text":"SportsNews","indices":[0,11]}] | >>>>> | [] | >>>>> | [] | >>>>> | [] | >>>>> | [] | >>>>> | [] | >>>>> | [{"text":"CARvsSEA","indices":[36,45]}] | >>>>> +------------+ >>>>> 10 rows selected (2.899 seconds) >>>>> >>>>> >>>>> >>>>> However having the output as maps is not very useful, unless i can >> filter >>>>> out the records with empty arrays and then drill deeper into the ones >> with >>>>> data in the arrays. >>>>> >>>>> BTW: Hao I would have expiated your query to return both rows, one >> with an >>>>> empty array as above and the other with the array data. >>>>> >>>>> >>>>> —Andries >>>>> >>>>> >>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <h...@maprtech.com> wrote: >>>>> >>>>>> I am not sure if below is expected behavior. >>>>>> If we only select "hashtags", and it will return only 1 row ignoring >> the >>>>>> null value. >>>>>> However then if we try to get "hashtags.text", it fails...which >> means it >>>>> is >>>>>> still trying to read the NULL value. >>>>>> I am thinking it may confuse the SQL developers. >>>>>> >>>>>> >>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ; >>>>>> +------------+ >>>>>> | EXPR$0 | >>>>>> +------------+ >>>>>> | [{"text":"GoPatriots"}] | >>>>>> +------------+ >>>>>> 1 row selected (0.109 seconds) >>>>>> >>>>>> >>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from >> dfs.tmp.`d.json` t ; >>>>>> >>>>>> Query failed: Query failed: Failure while running fragment., >>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be >> cast to >>>>>> org.apache.drill.exec.vector.complex.MapVector [ >>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ] >>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ] >>>>>> >>>>>> >>>>>> Error: exception while executing query: Failure while executing >> query. >>>>>> (state=,code=0) >>>>>> >>>>>> Thanks, >>>>>> Hao >>>>>> >>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht < >>>>>> aengelbre...@maprtech.com> wrote: >>>>>> >>>>>>> Now try on hashtags with the following: >>>>>>> >>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from >> `/twitter.json` t >>>>>>> where t.entities.hashtags is not null limit 10; >>>>>>> >>>>>>> Query failed: Query failed: Failure while running fragment., >>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be >> cast to >>>>>>> org.apache.drill.exec.vector.complex.MapVector [ >>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ] >>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ] >>>>>>> >>>>>>> >>>>>>> Error: exception while executing query: Failure while executing >> query. >>>>>>> (state=,code=0) >>>>>>> >>>>>>> >>>>>>> { >>>>>>> "entities": { >>>>>>> "trends": [], >>>>>>> "symbols": [], >>>>>>> "urls": [], >>>>>>> "hashtags": [], >>>>>>> "user_mentions": [] >>>>>>> }, >>>>>>> "entities": { >>>>>>> "trends": [1,2,3], >>>>>>> "symbols": [4,5,6], >>>>>>> "urls": [7,8,9], >>>>>>> "hashtags": [ >>>>>>> { >>>>>>> "text": "GoPatriots", >>>>>>> "indices": [] >>>>>>> } >>>>>>> ], >>>>>>> "user_mentions": [] >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> The issue seems to be that if some records have arrays with maps in >> them >>>>>>> and others are empty. >>>>>>> >>>>>>> —Andries >>>>>>> >>>>>>> >>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <h...@maprtech.com> wrote: >>>>>>> >>>>>>>> Seems it works for below json file: >>>>>>>> { >>>>>>>> "entities": { >>>>>>>> "trends": [], >>>>>>>> "symbols": [], >>>>>>>> "urls": [], >>>>>>>> "hashtags": [ >>>>>>>> { >>>>>>>> "text": "GoPatriots", >>>>>>>> "indices": [ >>>>>>>> 83, >>>>>>>> 94 >>>>>>>> ] >>>>>>>> } >>>>>>>> ], >>>>>>>> "user_mentions": [] >>>>>>>> }, >>>>>>>> "entities": { >>>>>>>> "trends": [1,2,3], >>>>>>>> "symbols": [4,5,6], >>>>>>>> "urls": [7,8,9], >>>>>>>> "hashtags": [ >>>>>>>> { >>>>>>>> "text": "GoPatriots", >>>>>>>> "indices": [] >>>>>>>> } >>>>>>>> ], >>>>>>>> "user_mentions": [] >>>>>>>> } >>>>>>>> } >>>>>>>> >>>>>>>> >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t >> where >>>>>>>> t.entities.urls is not null; >>>>>>>> +------------+ >>>>>>>> | EXPR$0 | >>>>>>>> +------------+ >>>>>>>> | [7,8,9] | >>>>>>>> +------------+ >>>>>>>> 1 row selected (0.139 seconds) >>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t >> where >>>>>>>> t.entities.urls is null; >>>>>>>> +------------+ >>>>>>>> | EXPR$0 | >>>>>>>> +------------+ >>>>>>>> +------------+ >>>>>>>> No rows selected (0.158 seconds) >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Hao >>>>>>>> >>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <adityakish...@gmail.com> >>>>> wrote: >>>>>>>> >>>>>>>>> I believe that this works if the array contains homogeneous >> primitive >>>>>>>>> types. In your example, it appears from the error, the array field >>>>>>> 'member' >>>>>>>>> contained maps for at least one record. >>>>>>>>> >>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta < >> cma...@mapr.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Trying that locally did not work for me (drill 0.7.0): >>>>>>>>>> >>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from >>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0; >>>>>>>>>> Query failed: Query stopped., Failure while trying to materialize >>>>>>>>> incoming schema. Errors: >>>>>>>>>> >>>>>>>>>> Error in expression at index -1. Error: Missing function >>>>>>>>> implementation: [repeated_count(MAP-REPEATED)]. Full expression: >>>>>>> --UNKNOWN >>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on >>>>>>> bullseye-3:31010 ] >>>>>>>>>> >>>>>>>>>> Error: exception while executing query: Failure while executing >>>>> query. >>>>>>>>> (state=,code=0) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Chris Matta >>>>>>>>>> cma...@mapr.com >>>>>>>>>> 215-701-3146 >>>>>>>>>> >>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <adityakish...@gmail.com >>> >>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> repeated_count('entities.urls') > 0 >>>>>>>>>>> >>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht < >>>>>>>>>>> aengelbre...@maprtech.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> How do you filter out records with an empty array in drill? >>>>>>>>>>>> i.e some records have "url":[] and some will have an array >> with >>>>> data >>>>>>>>> in >>>>>>>>>>>> it. When trying to read records with data in the array drill >> fails >>>>>>> due >>>>>>>>>>> to >>>>>>>>>>>> records missing any data in the array. Trying a filter with/* >> where >>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url is >> not >>>>>>>>> null. >>>>>>>>>>>> >>>>>>>>>>>> Note some of the arrays contains maps, using twitter data as an >>>>>>>>> example >>>>>>>>>>>> below. Some records have an empty array with “hashtags”:[] and >>>>>>> others >>>>>>>>>>> will >>>>>>>>>>>> look similar to what is listed below. >>>>>>>>>>>> >>>>>>>>>>>> "entities": { >>>>>>>>>>>> "trends": [], >>>>>>>>>>>> "symbols": [], >>>>>>>>>>>> "urls": [], >>>>>>>>>>>> "hashtags": [ >>>>>>>>>>>> { >>>>>>>>>>>> "text": "GoPatriots", >>>>>>>>>>>> "indices": [ >>>>>>>>>>>> 83, >>>>>>>>>>>> 94 >>>>>>>>>>>> ] >>>>>>>>>>>> } >>>>>>>>>>>> ], >>>>>>>>>>>> "user_mentions": [] >>>>>>>>>>>> }, >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Thanks >>>>>>>>>>>> —Andries >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>> >>>>>>> >>>>> >>>>> >>> >>> >>> <z.json> >> >>