Re: Filter out empty arrays in JSON

Andries Engelbrecht Thu, 05 Feb 2015 23:32:18 -0800

Good catch, seem to work when pointing to a specific map in the array.

Strange thing is that I have actually been using that to filter out retweets, 
by using using a specific map to retweeted_status.id is not null.


Good find though.

Also flatten on hashtags works to filter out the empty arrays, but you may not 
always want to use flatten.

When pointing to a specific map Drill seems to consistently produce a null if 
the map does not exist,which is very handy.

--Andries 



> On Feb 5, 2015, at 7:36 PM, Christopher Matta <[email protected]> wrote:
> 
> You missed something:
> 
> select t.entities.hashtags from dfs.twitter./nflt where
> t.entities.hashtags[0].*text* is not null limit 10;
> 
> Add the .text to the end of the where.
> 
> 
> Chris Matta
> [email protected]
> 215-701-3146
> 
> On Thu, Feb 5, 2015 at 10:05 PM, Andries Engelbrecht <
> [email protected]> wrote:
> 
>> Chris
>> 
>> Thanks, but I tried it before without success.
>> 
>> I still get this on Drill 0.7.
>> 
>> select t.entities.hashtags from dfs.twitter.`/nfl` t where
>> t.entities.hashtags[0] is not null limit 10;
>> 
>> Error: exception while executing query: Failure while executing query.
>> (state=,code=0)
>> Query failed: Query failed: Failure while running fragment., Failure while
>> trying to materialize incoming schema.  Errors:
>> 
>> Error in expression at index 0.  Error: Missing function implementation:
>> [isnotnull(MAP-REQUIRED)].  Full expression: null.. [
>> 0dab71b6-1860-4739-b752-1458ed29b671 on drilldemo:31010 ]
>> [ 0dab71b6-1860-4739-b752-1458ed29b671 on drilldemo:31010 ]
>> 
>> 
>> Mine seems finicky about even using the first element in the array
>> 
>> select t.entities.hashtags[0] from dfs.twitter.`/nfl` t limit 10;
>> 
>> Query failed: Query failed: Failure while running fragment., index: 4,
>> length: 4 (expected: range(0, 4)) [ 62cd2d3a-546e-41c8-a0c5-4da2b81c2ef8 on
>> drilldemo:31010 ]
>> [ 62cd2d3a-546e-41c8-a0c5-4da2b81c2ef8 on drilldemo:31010 ]
>> 
>> 
>> However flatten seems to work. (as long as you don’t have thousands of
>> small json files)
>> 
>> select flatten(t.entities.hashtags) from dfs.twitter.`/nfl` t limit 10;
>> 
>> +------------+
>> |   EXPR$0   |
>> +------------+
>> | {"text":"PantherPride","indices":["12","25"]} |
>> | {"text":"KeepPounding","indices":["48","61"]} |
>> | {"text":"vegas","indices":["107","113"]} |
>> | {"text":"nfl","indices":["23","27"]} |
>> | {"text":"Cowboys","indices":["7","15"]} |
>> | {"text":"Cowboys","indices":["7","15"]} |
>> | {"text":"Ebay","indices":["52","57"]} |
>> | {"text":"NFL","indices":["58","62"]} |
>> | {"text":"Christmas","indices":["63","73"]} |
>> | {"text":"Gift","indices":["74","79"]} |
>> +------------+
>> 
>> 
>> —Andries
>> 
>> 
>> 
>> 
>>> On Feb 5, 2015, at 6:41 PM, Christopher Matta <[email protected]> wrote:
>>> 
>>> Andreas, I think I may have solved your issue:
>>> 
>>>> select t.entities.hashtags FROM mfs.cmatta.`tweets/blackfriday` t WHERE
>> t.entities.hashtags[0].text is not null limit 10;
>>> +------------+
>>> |   EXPR$0   |
>>> +------------+
>>> |
>> [{"text":"Blackfriday","indices":[11,23]},{"text":"facebook","indices":[56,65]},{"text":"christmas","indices":[66,76]}]
>>> |
>>> |
>> [{"text":"BlackFriday","indices":[65,77]},{"text":"ViernesNegro","indices":[105,118]},{"text":"ofertas","indices":[122,130]}]
>>> |
>>> |
>> [{"text":"StarWars","indices":[21,30]},{"text":"BlackFriday","indices":[71,83]}]
>>> |
>>> | [{"text":"NotOneDime","indices":[14,25]}] |
>>> | [{"text":"BlackFriday","indices":[43,55]}] |
>>> | [{"text":"JRStudio","indices":[93,102]}] |
>>> |
>> [{"text":"luv","indices":[38,42]},{"text":"luvmanicures","indices":[43,56]},{"text":"luvpedicures","indices":[57,70]},{"text":"luvroyaloak","indices":[71,83]},{"text":"woodwardave","indices":[84,96]}]
>>> |
>>> | [{"text":"BlackFriday","indices":[0,12]}] |
>>> |
>> [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
>>> |
>>> |
>> [{"text":"BlackFriday","indices":[18,30]},{"text":"SOE","indices":[65,69]},{"text":"DoubleSC","indices":[82,91]}]
>>> |
>>> +------------+
>>> 10 rows selected (1.697 seconds)
>>> 
>>> I noticed when only extracting the first text item in the hashtag array (
>>> t.entities.hashtags[0].text) it was returning null for those tweets that
>>> didn’t have hashtags, so filtering out that seems to work.
>>> 
>>> 
>>> Chris Matta
>>> [email protected]
>>> 215-701-3146
>>> 
>>> On Mon, Jan 26, 2015 at 6:43 PM, Jason Altekruse <
>> [email protected]>
>>> wrote:
>>> 
>>>> As Aditya commented before this will work if the lists only contain
>> scalars
>>>> 
>>>> repeated_count('entities.urls') > 0
>>>> 
>>>> If the lists contain maps unfortunately this is not available today.
>> There
>>>> is an enhancement request open for this feature. I have marked it for a
>> fix
>>>> in 0.9 as it is more of a feature request than a bug and we are working
>> on
>>>> closing a large number of bugs for 0.8 before we get to issues like
>> this.
>>>> 
>>>> https://issues.apache.org/jira/browse/DRILL-1650
>>>> 
>>>> -Jason
>>>> 
>>>> On Mon, Jan 26, 2015 at 3:26 PM, Andries Engelbrecht <
>>>> [email protected]> wrote:
>>>> 
>>>>> Unfortunately it seems that with larger data sets the use of flatten
>>>> seems
>>>>> to produce an error.
>>>>> 
>>>>> Any other options to filter out JSON records (objects) where an array
>> is
>>>>> empty?
>>>>> 
>>>>> Thanks
>>>>> —Andries
>>>>> 
>>>>> 
>>>>> On Jan 21, 2015, at 5:18 PM, Andries Engelbrecht <
>>>>> [email protected]> wrote:
>>>>> 
>>>>>> It was interesting to see flatten bypass the records with an empty
>>>> array.
>>>>>> 
>>>>>> Unfortunately some arrays are much more complex than the example here,
>>>>> and it is still useful to have the ability to filter out ones that are
>>>>> empty.
>>>>>> 
>>>>>> —Andries
>>>>>> 
>>>>>>> On Jan 21, 2015, at 5:04 PM, Hao Zhu <[email protected]> wrote:
>>>>>>> 
>>>>>>> I figured out the differences after getting the json file from
>>>> Andries.
>>>>>>> 
>>>>>>> My json file is like:
>>>>>>> {“entities”:{xxx},“entities”:{yyy}... }
>>>>>>> 
>>>>>>> Andries' json file is like:
>>>>>>> {“entities”:{xxx}}
>>>>>>> {“entities”:{yyy}}
>>>>>>> ...
>>>>>>> 
>>>>>>> So basically Andries' json file is contains multiple json files.
>>>>>>> 
>>>>>>> Fortunately, we can use the same SQL to get the same results:
>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z2.json` t;
>>>>>>> +------------+
>>>>>>> |   EXPR$0   |
>>>>>>> +------------+
>>>>>>> | []         |
>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>> | [{"text":"aaa"}] |
>>>>>>> | []         |
>>>>>>> | [{"text":"bbb"}] |
>>>>>>> +------------+
>>>>>>> 5 rows selected (0.136 seconds)
>>>>>>> 0: jdbc:drill:> with tmp as
>>>>>>> . . . . . . . > (
>>>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>>>>>>> dfs.tmp.`z2.json` t
>>>>>>> . . . . . . . > )
>>>>>>> . . . . . . . > select tmp.c.text from tmp;
>>>>>>> +------------+
>>>>>>> |   EXPR$0   |
>>>>>>> +------------+
>>>>>>> | GoPatriots |
>>>>>>> | aaa        |
>>>>>>> | bbb        |
>>>>>>> +------------+
>>>>>>> 3 rows selected (0.115 seconds)
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Hao
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Jan 21, 2015 at 4:34 PM, Andries Engelbrecht <
>>>>>>> [email protected]> wrote:
>>>>>>> 
>>>>>>>> In my case it returns the empty records when flatten is not used.
>>>>>>>> 
>>>>>>>> 0: jdbc:drill:zk=drilldemo:5181> select t.entities.hashtags as
>>>> hashtags
>>>>>>>> from `twitter.json` t limit 10;
>>>>>>>> +------------+
>>>>>>>> |  hashtags  |
>>>>>>>> +------------+
>>>>>>>> | []         |
>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>> | []         |
>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | []         |
>>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>>>>>>> +------------+
>>>>>>>> 
>>>>>>>>> On Jan 21, 2015, at 4:26 PM, Hao Zhu <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Actually not due to flatten, if you directly query the file, it
>> will
>>>>> only
>>>>>>>>> show the non-null values.
>>>>>>>>> 
>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`z.json` t;
>>>>>>>>> +------------+
>>>>>>>>> |   EXPR$0   |
>>>>>>>>> +------------+
>>>>>>>>> | [{"text":"GoPatriots"},{"text":"aaa"},{"text":"bbb"}] |
>>>>>>>>> +------------+
>>>>>>>>> 1 row selected (0.123 seconds)
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Hao
>>>>>>>>> 
>>>>>>>>> On Wed, Jan 21, 2015 at 4:18 PM, Andries Engelbrecht <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Very interesting, flatten seems to bypass empty records. Not sure
>>>> if
>>>>>>>> that
>>>>>>>>>> is an ideal result for all use cases, but certainly usable in this
>>>>> case.
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> —Andries
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Jan 21, 2015, at 3:45 PM, Hao Zhu <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I can also fetch non-null values for attached json file which
>>>>> contains
>>>>>>>> 6
>>>>>>>>>> "entities", 3 of them are null, 3 of them have "text" value.
>>>>>>>>>>> 
>>>>>>>>>>> Could you share your complete "twitter.json"?
>>>>>>>>>>> 
>>>>>>>>>>> 0: jdbc:drill:> with tmp as
>>>>>>>>>>> . . . . . . . > (
>>>>>>>>>>> . . . . . . . > select flatten(t.entities.hashtags) as c from
>>>>>>>>>> dfs.tmp.`z.json` t
>>>>>>>>>>> . . . . . . . > )
>>>>>>>>>>> . . . . . . . > select tmp.c.text from tmp;
>>>>>>>>>>> +------------+
>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | GoPatriots |
>>>>>>>>>>> | aaa        |
>>>>>>>>>>> | bbb        |
>>>>>>>>>>> +------------+
>>>>>>>>>>> 3 rows selected (0.122 seconds)
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Hao
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Jan 21, 2015 at 3:35 PM, Andries Engelbrecht <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>> The sample data I posted only has 1 element, but some records
>> have
>>>>>>>>>> multiple elements in them.
>>>>>>>>>>> 
>>>>>>>>>>> Interestingly enough though
>>>>>>>>>>> select t.entities.hashtags[0].`text` from `twitter.json` t limit
>>>> 10;
>>>>>>>>>>> 
>>>>>>>>>>> Produces
>>>>>>>>>>> +------------+
>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | null       |
>>>>>>>>>>> | SportsNews |
>>>>>>>>>>> | null       |
>>>>>>>>>>> | SportsNews |
>>>>>>>>>>> | null       |
>>>>>>>>>>> | null       |
>>>>>>>>>>> | null       |
>>>>>>>>>>> | null       |
>>>>>>>>>>> | null       |
>>>>>>>>>>> | CARvsSEA   |
>>>>>>>>>>> +——————+
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> And
>>>>>>>>>>> 
>>>>>>>>>>> select t.entities.hashtags[0] from `twitter.json` t limit 10;
>>>>>>>>>>> 
>>>>>>>>>>> +------------+
>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>> +------------+
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"text":"SportsNews","indices":[90,99]} |
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"indices":[]} |
>>>>>>>>>>> | {"text":"CARvsSEA","indices":[90,99]} |
>>>>>>>>>>> +——————+
>>>>>>>>>>> 
>>>>>>>>>>> Strange part is that there is no indices map in the hashtags
>>>> array,
>>>>> so
>>>>>>>>>> no idea why it shows up when pointing to the first lament in an
>>>> empty
>>>>>>>> array.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> —Andries
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Jan 21, 2015, at 3:24 PM, Hao Zhu <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> I just noticed that the result of "hashtags" is just an array
>>>> with
>>>>>>>>>> only 1
>>>>>>>>>>>> element.
>>>>>>>>>>>> So take your example:
>>>>>>>>>>>> [root@maprdemo tmp]# cat d.json
>>>>>>>>>>>> {
>>>>>>>>>>>> "entities": {
>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>> "hashtags": [],
>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>> },
>>>>>>>>>>>> "entities": {
>>>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>> {
>>>>>>>>>>>>  "text": "GoPatriots",
>>>>>>>>>>>>  "indices": []
>>>>>>>>>>>> }
>>>>>>>>>>>> ],
>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> Now we can do this to achieve the results:
>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from dfs.tmp.`d.json`
>>>> t
>>>>> ;
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> 1 row selected (0.09 seconds)
>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags[0].text from
>>>>>>>>>> dfs.tmp.`d.json` t ;
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> | GoPatriots |
>>>>>>>>>>>> +------------+
>>>>>>>>>>>> 1 row selected (0.108 seconds)
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Hao
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Jan 21, 2015 at 3:01 PM, Andries Engelbrecht <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> When I run the query on a larger dataset it actually show the
>>>>> empty
>>>>>>>>>>>>> records.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> select t.entities.hashtags from `twitter.json` t limit 10;
>>>>>>>>>>>>> 
>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | [{"text":"SportsNews","indices":[0,11]}] |
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | []         |
>>>>>>>>>>>>> | [{"text":"CARvsSEA","indices":[36,45]}] |
>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>> 10 rows selected (2.899 seconds)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> However having the output as maps is not very useful, unless i
>>>> can
>>>>>>>>>> filter
>>>>>>>>>>>>> out the records with empty arrays and then drill deeper into
>> the
>>>>> ones
>>>>>>>>>> with
>>>>>>>>>>>>> data in the arrays.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> BTW: Hao I would have expiated your query to return both rows,
>>>> one
>>>>>>>>>> with an
>>>>>>>>>>>>> empty array as above and the other with the array data.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> —Andries
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jan 21, 2015, at 2:56 PM, Hao Zhu <[email protected]>
>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am not sure if below is expected behavior.
>>>>>>>>>>>>>> If we only select "hashtags", and it will return only 1 row
>>>>> ignoring
>>>>>>>>>> the
>>>>>>>>>>>>>> null value.
>>>>>>>>>>>>>> However then if we try to get "hashtags.text", it
>> fails...which
>>>>>>>>>> means it
>>>>>>>>>>>>> is
>>>>>>>>>>>>>> still trying to read the NULL value.
>>>>>>>>>>>>>> I am thinking it may confuse the SQL developers.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags from
>>>> dfs.tmp.`d.json`
>>>>> t ;
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> | [{"text":"GoPatriots"}] |
>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>> 1 row selected (0.109 seconds)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.hashtags.text from
>>>>>>>>>> dfs.tmp.`d.json` t ;
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
>>>> be
>>>>>>>>>> cast to
>>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>>>>>>>> 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>>>>>>>> [ 7ab63d4e-8a1d-4e23-8853-a879db7e8a5f on maprdemo:31010 ]
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Error: exception while executing query: Failure while
>> executing
>>>>>>>>>> query.
>>>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Hao
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:43 PM, Andries Engelbrecht <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Now try on hashtags with the following:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> drilldemo:5181> select t.entities.hashtags.`text` from
>>>>>>>>>> `/twitter.json` t
>>>>>>>>>>>>>>> where t.entities.hashtags is not null limit 10;
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Query failed: Query failed: Failure while running fragment.,
>>>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.RepeatedMapVector cannot
>>>> be
>>>>>>>>>> cast to
>>>>>>>>>>>>>>> org.apache.drill.exec.vector.complex.MapVector [
>>>>>>>>>>>>>>> 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>>>>>>>> [ 6fe7f918-d1a7-4fc6-b24d-44ff9186f59e on drilldemo:31010 ]
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Error: exception while executing query: Failure while
>>>> executing
>>>>>>>>>> query.
>>>>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>>> "hashtags": [],
>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>> "text": "GoPatriots",
>>>>>>>>>>>>>>> "indices": []
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The issue seems to be that if some records have arrays with
>>>>> maps in
>>>>>>>>>> them
>>>>>>>>>>>>>>> and others are empty.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> —Andries
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Jan 21, 2015, at 2:34 PM, Hao Zhu <[email protected]>
>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Seems it works for below json file:
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>> "text": "GoPatriots",
>>>>>>>>>>>>>>>> "indices": [
>>>>>>>>>>>>>>>>   83,
>>>>>>>>>>>>>>>>   94
>>>>>>>>>>>>>>>> ]
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>>> "trends": [1,2,3],
>>>>>>>>>>>>>>>> "symbols": [4,5,6],
>>>>>>>>>>>>>>>> "urls": [7,8,9],
>>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>> "text": "GoPatriots",
>>>>>>>>>>>>>>>> "indices": []
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
>>>>> as t
>>>>>>>>>> where
>>>>>>>>>>>>>>>> t.entities.urls is not null;
>>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>>> | [7,8,9]    |
>>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>>> 1 row selected (0.139 seconds)
>>>>>>>>>>>>>>>> 0: jdbc:drill:> select t.entities.urls from dfs.tmp.`a.json`
>>>>> as t
>>>>>>>>>> where
>>>>>>>>>>>>>>>> t.entities.urls is null;
>>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>>> |   EXPR$0   |
>>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>>> +------------+
>>>>>>>>>>>>>>>> No rows selected (0.158 seconds)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Hao
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 2:01 PM, Aditya <
>>>>> [email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I believe that this works if the array contains homogeneous
>>>>>>>>>> primitive
>>>>>>>>>>>>>>>>> types. In your example, it appears from the error, the
>> array
>>>>>>>> field
>>>>>>>>>>>>>>> 'member'
>>>>>>>>>>>>>>>>> contained maps for at least one record.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:57 PM, Christopher Matta <
>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Trying that locally did not work for me (drill 0.7.0):
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 0: jdbc:drill:zk=local> select `id`, `name`, `members`
>> from
>>>>>>>>>>>>>>>>> `Downloads/test.json` where repeated_count(`members`) > 0;
>>>>>>>>>>>>>>>>>> Query failed: Query stopped., Failure while trying to
>>>>>>>> materialize
>>>>>>>>>>>>>>>>> incoming schema.  Errors:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Error in expression at index -1.  Error: Missing function
>>>>>>>>>>>>>>>>> implementation: [repeated_count(MAP-REPEATED)].  Full
>>>>> expression:
>>>>>>>>>>>>>>> --UNKNOWN
>>>>>>>>>>>>>>>>> EXPRESSION--.. [ 47142fa4-7e6a-48cb-be6a-676e885ede11 on
>>>>>>>>>>>>>>> bullseye-3:31010 ]
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Error: exception while executing query: Failure while
>>>>> executing
>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>>>> (state=,code=0)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Chris Matta
>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>> 215-701-3146
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 4:50 PM, Aditya <
>>>>>>>> [email protected]
>>>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> repeated_count('entities.urls') > 0
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Wed, Jan 21, 2015 at 1:46 PM, Andries Engelbrecht <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> How do you filter out records with an empty array in
>>>> drill?
>>>>>>>>>>>>>>>>>>>> i.e some records have "url":[]  and some will have an
>>>> array
>>>>>>>>>> with
>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> it. When trying to read records with data in the array
>>>>> drill
>>>>>>>>>> fails
>>>>>>>>>>>>>>> due
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> records missing any data in the array. Trying a filter
>>>>> with/*
>>>>>>>>>> where
>>>>>>>>>>>>>>>>>>>> "url":[0] is not null */ fails, also fails if applying
>>>> url
>>>>> is
>>>>>>>>>> not
>>>>>>>>>>>>>>>>> null.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Note some of the arrays contains maps, using twitter
>> data
>>>>> as
>>>>>>>> an
>>>>>>>>>>>>>>>>> example
>>>>>>>>>>>>>>>>>>>> below. Some records have an empty array with
>>>> “hashtags”:[]
>>>>>>>> and
>>>>>>>>>>>>>>> others
>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>> look similar to what is listed below.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> "entities": {
>>>>>>>>>>>>>>>>>>>> "trends": [],
>>>>>>>>>>>>>>>>>>>> "symbols": [],
>>>>>>>>>>>>>>>>>>>> "urls": [],
>>>>>>>>>>>>>>>>>>>> "hashtags": [
>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>> "text": "GoPatriots",
>>>>>>>>>>>>>>>>>>>> "indices": [
>>>>>>>>>>>>>>>>>>>>   83,
>>>>>>>>>>>>>>>>>>>>   94
>>>>>>>>>>>>>>>>>>>> ]
>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>> ],
>>>>>>>>>>>>>>>>>>>> "user_mentions": []
>>>>>>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> —Andries
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> <z.json>
>> 
>>

Re: Filter out empty arrays in JSON

Reply via email to