Re: Filter out empty arrays in JSON

Andries Engelbrecht Wed, 21 Jan 2015 16:36:44 -0800

In my case it returns the empty records when flatten is not used.

0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as hashtags from 
`twitter.json` t limit 10;
+------------+
|  hashtags  |
+------------+
| []         |
| [{"text":"SportsNews","indices":[0,11]}] |
| []         |
| [{"text":"SportsNews","indices":[0,11]}] |
| []         |
| []         |
| []         |
| []         |
| []         |
| [{"text":"CARvsSEA","indices":[36,45]}] |
+------------+


On Jan 21, 2015, at 4:26 PM, Hao Zhu <h...@maprtech.com> wrote:

> Actually not due to flatten, if you directly query the file, it will only
> show the non-null values.
> 
> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
> +------------+
> |   EXPR$0   |
> +------------+
> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
> +------------+
> 1 row selected (0.123 seconds)
> 
> Thanks,
> Hao
> 
> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
> aengelbre...@maprtech.com> wrote:
> 
>> Very interesting, flatten seems to bypass empty records. Not sure if that
>> is an ideal result for all use cases, but certainly usable in this case.
>> 
>> Thanks
>> —Andries
>> 
>> 
>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <h...@maprtech.com> wrote:
>> 
>>> I can also fetch non-null values for attached json file which contains 6
>> "entities", 3 of them are null, 3 of them have "text" value.
>>> 
>>> Could you share your complete "twitter.json"?
>>> 
>>> 0: jdbc:drill:> with tmp as
>>> . . . . . . . > (
>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>> dfs.tmp.`z.json` t
>>> . . . . . . . > )
>>> . . . . . . . > select tmp.c.text from tmp;
>>> +------------+
>>> |   EXPR$0   |
>>> +------------+
>>> | GoPatriots |
>>> | aaa        |
>>> | bbb        |
>>> +------------+
>>> 3 rows selected (0.122 seconds)
>>> 
>>> Thanks,
>>> Hao
>>> 
>>> 
>>> 
>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
>> aengelbre...@maprtech.com> wrote:
>>> The sample data I posted only has 1 element, but some records have
>> multiple elements in them.
>>> 
>>> Interestingly enough though
>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit 10;
>>> 
>>> Produces
>>> +------------+
>>> |   EXPR$0   |
>>> +------------+
>>> | null       |
>>> | SportsNews |
>>> | null       |
>>> | SportsNews |
>>> | null       |
>>> | null       |
>>> | null       |
>>> | null       |
>>> | null       |
>>> | CARvsSEA   |
>>> +——————+
>>> 
>>> 
>>> And
>>> 
>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
>>> 
>>> +------------+
>>> |   EXPR$0   |
>>> +------------+
>>> | {"indices":[]} |
>>> | {"text":"SportsNews","indices":[90,99]} |
>>> | {"indices":[]} |
>>> | {"text":"SportsNews","indices":[90,99]} |
>>> | {"indices":[]} |
>>> | {"indices":[]} |
>>> | {"indices":[]} |
>>> | {"indices":[]} |
>>> | {"indices":[]} |
>>> | {"text":"CARvsSEA","indices":[90,99]} |
>>> +——————+
>>> 
>>> Strange part is that there is no indices map in the hashtags array, so
>> no idea why it shows up when pointing to the first lament in an empty array.
>>> 
>>> 
>>> —Andries
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <h...@maprtech.com> wrote:
>>> 
>>>> I just noticed that the result of "hashtags" is just an array with
>> only 1
>>>> element.
>>>> So take your example:
>>>> [root@maprdemo tmp]# cat d.json
>>>> {
>>>> "entities": {
>>>>  "trends": [],
>>>>  "symbols": [],
>>>>  "urls": [],
>>>>  "hashtags": [],
>>>>  "user_mentions": []
>>>> },
>>>> "entities": {
>>>>  "trends": [1,2,3],
>>>>  "symbols": [4,5,6],
>>>>  "urls": [7,8,9],
>>>>  "hashtags": [
>>>>    {
>>>>      "text": "GoPatriots",
>>>>      "indices": []
>>>>    }
>>>>  ],
>>>>  "user_mentions": []
>>>> }
>>>> }
>>>> 
>>>> Now we can do this to achieve the results:
>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
>>>> +------------+
>>>> |   EXPR$0   |
>>>> +------------+
>>>> | [{"text":"GoPatriots"}] |
>>>> +------------+
>>>> 1 row selected (0.09 seconds)
>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
>> dfs.tmp.`d.json` t ;
>>>> +------------+
>>>> |   EXPR$0   |
>>>> +------------+
>>>> | GoPatriots |
>>>> +------------+
>>>> 1 row selected (0.108 seconds)
>>>> 
>>>> Thanks,
>>>> Hao
>>>> 
>>>> 
>>>> 
>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
>>>> aengelbre...@maprtech.com> wrote:
>>>> 
>>>>> When I run the query on a larger dataset it actually show the empty
>>>>> records.
>>>>> 
>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
>>>>> 
>>>>> +------------+
>>>>> |   EXPR$0   |
>>>>> +------------+
>>>>> | []         |
>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>> | []         |
>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>> | []         |
>>>>> | []         |
>>>>> | []         |
>>>>> | []         |
>>>>> | []         |
>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>>>> +------------+
>>>>> 10 rows selected (2.899 seconds)
>>>>> 
>>>>> 
>>>>> 
>>>>> However having the output as maps is not very useful, unless i can
>> filter
>>>>> out the records with empty arrays and then drill deeper into the ones
>> with
>>>>> data in the arrays.
>>>>> 
>>>>> BTW: Hao I would have expiated your query to return both rows, one
>> with an
>>>>> empty array as above and the other with the array data.
>>>>> 
>>>>> 
>>>>> —Andries
>>>>> 
>>>>> 
>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <h...@maprtech.com> wrote:
>>>>> 
>>>>>> I am not sure if below is expected behavior.
>>>>>> If we only select "hashtags", and it will return only 1 row ignoring
>> the
>>>>>> null value.
>>>>>> However then if we try to get "hashtags.text", it fails...which
>> means it
>>>>> is
>>>>>> still trying to read the NULL value.
>>>>>> I am thinking it may confuse the SQL developers.
>>>>>> 
>>>>>> 
>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json` t ;
>>>>>> +------------+
>>>>>> |   EXPR$0   |
>>>>>> +------------+
>>>>>> | [{"text":"GoPatriots"}] |
>>>>>> +------------+
>>>>>> 1 row selected (0.109 seconds)
>>>>>> 
>>>>>> 
>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
>> dfs.tmp.`d.json` t ;
>>>>>> 
>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
>> cast to
>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>> 
>>>>>> 
>>>>>> Error: exception while executing query: Failure while executing
>> query.
>>>>>> (state=,code=0)
>>>>>> 
>>>>>> Thanks,
>>>>>> Hao
>>>>>> 
>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
>>>>>> aengelbre...@maprtech.com> wrote:
>>>>>> 
>>>>>>> Now try on hashtags with the following:
>>>>>>> 
>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
>> `/twitter.json` t
>>>>>>> where t.entities.hashtags is not null limit 10;
>>>>>>> 
>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot be
>> cast to
>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>> 
>>>>>>> 
>>>>>>> Error: exception while executing query: Failure while executing
>> query.
>>>>>>> (state=,code=0)
>>>>>>> 
>>>>>>> 
>>>>>>> {
>>>>>>> "entities": {
>>>>>>> "trends": [],
>>>>>>> "symbols": [],
>>>>>>> "urls": [],
>>>>>>> "hashtags": [],
>>>>>>> "user_mentions": []
>>>>>>> },
>>>>>>> "entities": {
>>>>>>> "trends": [1,2,3],
>>>>>>> "symbols": [4,5,6],
>>>>>>> "urls": [7,8,9],
>>>>>>> "hashtags": [
>>>>>>>   {
>>>>>>>     "text": "GoPatriots",
>>>>>>>     "indices": []
>>>>>>>   }
>>>>>>> ],
>>>>>>> "user_mentions": []
>>>>>>> }
>>>>>>> }
>>>>>>> 
>>>>>>> The issue seems to be that if some records have arrays with maps in
>> them
>>>>>>> and others are empty.
>>>>>>> 
>>>>>>> —Andries
>>>>>>> 
>>>>>>> 
>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <h...@maprtech.com> wrote:
>>>>>>> 
>>>>>>>> Seems it works for below json file:
>>>>>>>> {
>>>>>>>> "entities": {
>>>>>>>> "trends": [],
>>>>>>>> "symbols": [],
>>>>>>>> "urls": [],
>>>>>>>> "hashtags": [
>>>>>>>>   {
>>>>>>>>     "text": "GoPatriots",
>>>>>>>>     "indices": [
>>>>>>>>       83,
>>>>>>>>       94
>>>>>>>>     ]
>>>>>>>>   }
>>>>>>>> ],
>>>>>>>> "user_mentions": []
>>>>>>>> },
>>>>>>>> "entities": {
>>>>>>>> "trends": [1,2,3],
>>>>>>>> "symbols": [4,5,6],
>>>>>>>> "urls": [7,8,9],
>>>>>>>> "hashtags": [
>>>>>>>>   {
>>>>>>>>     "text": "GoPatriots",
>>>>>>>>     "indices": []
>>>>>>>>   }
>>>>>>>> ],
>>>>>>>> "user_mentions": []
>>>>>>>> }
>>>>>>>> }
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
>> where
>>>>>>>> t.entities.urls is not null;
>>>>>>>> +------------+
>>>>>>>> |   EXPR$0   |
>>>>>>>> +------------+
>>>>>>>> | [7,8,9]    |
>>>>>>>> +------------+
>>>>>>>> 1 row selected (0.139 seconds)
>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json` as t
>> where
>>>>>>>> t.entities.urls is null;
>>>>>>>> +------------+
>>>>>>>> |   EXPR$0   |
>>>>>>>> +------------+
>>>>>>>> +------------+
>>>>>>>> No rows selected (0.158 seconds)
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Hao
>>>>>>>> 
>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <adityakish...@gmail.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> I believe that this works if the array contains homogeneous
>> primitive
>>>>>>>>> types. In your example, it appears from the error, the array field
>>>>>>> 'member'
>>>>>>>>> contained maps for at least one record.
>>>>>>>>> 
>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
>> cma...@mapr.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
>>>>>>>>>> 
>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members` from
>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
>>>>>>>>>> Query failed: Query stopped., Failure while trying to materialize
>>>>>>>>> incoming schema.  Errors:
>>>>>>>>>> 
>>>>>>>>>> Error in expression at index -1.  Error: Missing function
>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full expression:
>>>>>>> --UNKNOWN
>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
>>>>>>> bullseye-3:31010 ]
>>>>>>>>>> 
>>>>>>>>>> Error: exception while executing query: Failure while executing
>>>>> query.
>>>>>>>>> (state=,code=0)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Chris Matta
>>>>>>>>>> cma...@mapr.com
>>>>>>>>>> 215-701-3146
>>>>>>>>>> 
>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <adityakish...@gmail.com
>>> 
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> repeated_count('entities.urls') > 0
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>>>>>>>>>>> aengelbre...@maprtech.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> How do you filter out records with an empty array in drill?
>>>>>>>>>>>> i.e some records have "url":[]  and some will have an array
>> with
>>>>> data
>>>>>>>>> in
>>>>>>>>>>>> it. When trying to read records with data in the array drill
>> fails
>>>>>>> due
>>>>>>>>>>> to
>>>>>>>>>>>> records missing any data in the array. Trying a filter with/*
>> where
>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying url is
>> not
>>>>>>>>> null.
>>>>>>>>>>>> 
>>>>>>>>>>>> Note some of the arrays contains maps, using twitter data as an
>>>>>>>>> example
>>>>>>>>>>>> below. Some records have an empty array with “hashtags”:[]  and
>>>>>>> others
>>>>>>>>>>> will
>>>>>>>>>>>> look similar to what is listed below.
>>>>>>>>>>>> 
>>>>>>>>>>>> "entities": {
>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>   {
>>>>>>>>>>>>     "text": "GoPatriots",
>>>>>>>>>>>>     "indices": [
>>>>>>>>>>>>       83,
>>>>>>>>>>>>       94
>>>>>>>>>>>>     ]
>>>>>>>>>>>>   }
>>>>>>>>>>>> ],
>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>> },
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> —Andries
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>> <z.json>
>> 
>>

Re: Filter out empty arrays in JSON

Reply via email to