Joe, Not sure I fully get the picture on your data, but you may want to see if pointing at the first element in an array and looking for nulls works in your case. I have used that in the past to eliminate records or elements from the result set in a subquery.
Perhaps try something like this SELECT par.check FROM <workspace.dir> WHERE par.check.detections.list[0] IS NOT NULL FLATTEN may be very expensive for sparsely populated data. Might be good to capture statistics and file a JIRA to see if the function can be optimized for this use case. --Andries > On May 17, 2016, at 5:54 PM, Joe Harris <joeharri...@gmail.com> wrote: > > I'm querying parquet data which contains large numbers of empty maps/arrays > that I'd like to eliminate from the query as early as possible. The events > I'm querying are very sparse in the data. > >> SELECT par.check.detections.list AS detections FROM >> `/my.parquet/year=2016/month=5` par LIMIT 10; > 10 rows selected (3.157 seconds) > +-----------------------------+ > | {"detections":{"list":[]}} | > +-----------------------------+ > > > I've tried to use the map directly but the query fails. > >> SELECT par.check AS check >> FROM `/my.parquet/year=2016/month=5` par >> WHERE par.check <> '{"detections":{"list":[]}}' ; > Error in expression at index -1. Error: Missing function implementation: > [equal(MAP-REQUIRED, VARCHAR-REQUIRED)]. Full expression: --UNKNOWN > EXPRESSION--.. > > > I get a similar result if I reference down to the empty array. > >> SELECT par.check.detections.list AS detections >> FROM `/my.parquet/year=2016/month=5` par >> WHERE par.check.detections.list <> '[]' ; > Error in expression at index -1. Error: Missing function implementation: > [equal(MAP-REPEATED, VARCHAR-REQUIRED)]. Full expression: --UNKNOWN > EXPRESSION--.. > > > I'm also unable to refer to an array element without using FLATTEN(). > >> SELECT par.check.detections.list[0] >> FROM `/my.parquet/year=2016/month=5` par; > Error: Unexpected RuntimeException: java.lang.NullPointerException > (state=,code=0) > > > When I use FLATTEN() it automatically eliminates empty arrays and query time > goes way up. It seems to be doing quite a bit of work on each empty array > before eliminating it. > >> SELECT FLATTEN(par.check.detections.list) AS detections >> FROM `/my.parquet/year=2016/month=5` par >> LIMIT 10; > No rows selected (290.475 seconds) > > > I can see in the docs that map and array are internal only data types that > aren't exposed to the user > (https://drill.apache.org/docs/supported-data-types/#composite-types). Is > this on the roadmap somewhere to be added? > > > > > Thanks, > Joe > >