Re: REPEATED_CONTAINS

Jacques Nadeau Tue, 29 Mar 2016 21:23:44 -0700

I think the best answer is to test it and share your findings.
Hypothesizing about performance in complicated systems is also suspect :)

That said, I'll make a guess...

In general, I would expect the flatten to be faster in your example since a
flatten without a cartesian is trivial operation and can be done in
vectorized fashion because of the shape of how data is held in memory. This
is different than how complex UDFs are written today (using the FieldReader
model). These UDFs are object-based execution, record by record. So,
vectorized and full runtime code generation

That being said, if you changed your code to be something more like [select
a,b,c,d,e,f,g, flatten(t.fillings) as fill], you might see the two be
closer together. This is because this would then require a cartesian copy
of all the fields abcdefg, which then have to be filtered out. In this
case, the extra cost of the copies might be more expensive than the object
overhead required for traversing the complex object structure.

In general, start with the methodology that works. If we don't see the
performance to satisfy your usecase, we can see if we can suggest some
things. (For example, supporting operation pushdowns that push through
FLATTEN would probably be very helpful.)

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Mar 29, 2016 at 6:37 PM, Jean-Claude Cote <jcc...@gmail.com> wrote:

> I've noticed drill offers a REPEATED_CONTAINS which can be applied to
> fields which are arrays.
>
> https://drill.apache.org/docs/repeated-contains/
>
> I have a schema stored in parquet files which contain a repeated field
> containing a key and a value. However such structures can't be queried
> using the REPEATED_CONTAINS. I was thinking of writing a user defined
> function to look through it.
>
> My question is: is it worth it? Will it be faster than doing this?
>
> {"name":"classic","fillings":[ {"name":"sugar","cal":500} ,
> {"name":"flour","cal":300} ] }
>
> SELECT flat.fill FROM (SELECT FLATTEN(t.fillings) AS fill FROM
> dfs.flatten.`test.json` t) flat WHERE flat.fill.name like 'sug%';
>
> Specifically what's the cost of using FLATTEN compared to iterating over
> the array right in a UDF?
>
> Thanks
> Jean-Claude
>

Re: REPEATED_CONTAINS

Reply via email to