I've noticed that Drill offers a REPEATED_CONTAINS function, which can be
applied to fields that are arrays.

https://drill.apache.org/docs/repeated-contains/

I have a schema stored in Parquet files that contains a repeated field
whose elements each hold a key and a value. However, such structures can't
be queried using REPEATED_CONTAINS. I was thinking of writing a user-defined
function to look through it.

My question is: is it worth it? Will it be faster than flattening the array
and then filtering it, as in this example?

{"name":"classic","fillings":[ {"name":"sugar","cal":500} ,
{"name":"flour","cal":300} ] }

SELECT flat.fill FROM (SELECT FLATTEN(t.fillings) AS fill FROM
dfs.flatten.`test.json` t) flat WHERE flat.fill.name LIKE 'sug%';

Specifically, what's the cost of using FLATTEN compared to iterating over
the array right in a UDF?
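
To make the comparison concrete, here is a rough sketch of the two approaches
I have in mind, written in plain Python rather than as Drill code. The function
names are made up for illustration, and the sketch says nothing about how Drill
actually executes FLATTEN; it only shows the difference in shape between
"expand the array into rows, then filter" and "scan the array in place".

```python
# Plain-Python illustration only; this is NOT Drill code or a Drill UDF.
record = {
    "name": "classic",
    "fillings": [
        {"name": "sugar", "cal": 500},
        {"name": "flour", "cal": 300},
    ],
}


def flatten_then_filter(records, prefix):
    """Mimic FLATTEN + WHERE: one row per array element, then filter the rows."""
    rows = [fill for rec in records for fill in rec["fillings"]]
    return [fill for fill in rows if fill["name"].startswith(prefix)]


def any_filling_matches(rec, prefix):
    """Mimic a hypothetical UDF: scan the array in place, stop at the first hit."""
    return any(fill["name"].startswith(prefix) for fill in rec["fillings"])
```

Note that the two are not strictly equivalent: the first returns the matching
elements, while the second only says whether a record has a match, which is
closer to what REPEATED_CONTAINS does for simple arrays.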

Thanks
Jean-Claude
