adriangb commented on issue #19241:
URL: https://github.com/apache/datafusion/issues/19241#issuecomment-3633749886

   A related bit of work to untangle is the `ScalarValue` references that 
`InListExpr` is forced to use even when we start from an array. It uses these 
to look up values in bloom filters, do predicate pruning, etc. We could make 
all of the relevant APIs work with an enum of `Vec<ScalarValue>` 
(heterogeneous lists) or `ArrayRef` (homogeneous lists). That would avoid 
converting array -> `ScalarValue` and then, in some places, back to an array 
(deep in the pruning code, iirc). Some of this is planning-time / build-time 
work, so it is amortized over the data scans, but some of it happens for each 
file opened. That is not as costly as per-row work, but it adds up.
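
   To make the enum idea concrete, here is a minimal sketch. It uses stand-in 
types rather than the real `ScalarValue` / `ArrayRef`, and every name in it 
(`Scalar`, `TypedArray`, `InListValues`, `int64_min_max`) is hypothetical and 
not an existing DataFusion API. The point is only that the homogeneous variant 
can compute something pruning needs (here, min/max bounds) directly on the 
typed buffer, without a per-value round trip.

```rust
// Stand-in for a dynamically typed scalar (the role `ScalarValue` plays).
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(i64),
    Utf8(String),
}

// Stand-in for a typed, homogeneous array (the role `ArrayRef` plays).
#[derive(Debug, Clone, PartialEq)]
enum TypedArray {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
}

// Hypothetical enum: the IN list is either a list of scalars that may
// differ in type, or a single typed array.
#[derive(Debug, Clone, PartialEq)]
enum InListValues {
    Heterogeneous(Vec<Scalar>),
    Homogeneous(TypedArray),
}

impl InListValues {
    fn len(&self) -> usize {
        match self {
            InListValues::Heterogeneous(v) => v.len(),
            InListValues::Homogeneous(TypedArray::Int64(v)) => v.len(),
            InListValues::Homogeneous(TypedArray::Utf8(v)) => v.len(),
        }
    }

    // Min/max bounds for an all-Int64 list, as a pruning step might want.
    // Returns None for empty lists and for lists that are not all Int64.
    fn int64_min_max(&self) -> Option<(i64, i64)> {
        match self {
            // Fast path: work on the typed buffer directly.
            InListValues::Homogeneous(TypedArray::Int64(v)) => {
                let min = *v.iter().min()?;
                let max = *v.iter().max()?;
                Some((min, max))
            }
            // Slow path: inspect every scalar's type one by one.
            InListValues::Heterogeneous(v) => {
                let mut bounds: Option<(i64, i64)> = None;
                for s in v {
                    match s {
                        Scalar::Int64(x) => {
                            bounds = Some(match bounds {
                                None => (*x, *x),
                                Some((lo, hi)) => (lo.min(*x), hi.max(*x)),
                            });
                        }
                        _ => return None,
                    }
                }
                bounds
            }
            _ => None,
        }
    }
}
```

   A caller that starts from an array would construct 
`InListValues::Homogeneous(...)` and never materialize individual scalars; 
the heterogeneous variant stays available for lists of mixed literals.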
   
   
   A second thing is that `col IN (...)` is inefficient when it hits a bloom 
filter on `col` and the list is large, because it loops over the values in the 
list and probes the filter once per value. I'm not sure how or where we would 
do this, but in theory we could build a bloom filter out of the `InListExpr` 
and then do a binary operation between that bloom filter and Parquet's bloom 
filter, instead of looping over each item in the list and looking it up in the 
column's bloom filter. At the very least, if we did the point above and pushed 
down an array, we could probably be more efficient about converting all of the 
values into something we can look up in the bloom filter (currently it goes 
through `ScalarValue`).
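
   As a rough illustration of the "binary operation between two bloom 
filters" idea, here is a toy sketch. It assumes both filters share the same 
size and hash functions, which is what makes a bitwise AND meaningful: if a 
value is in both sets, all of its bits are set in both filters, so an all-zero 
AND proves the sets are disjoint. `ToyBloom` and its methods are hypothetical; 
this is not Parquet's split-block layout or hashing, and a real version would 
have to build the list-side filter to match each file's filter.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NUM_BITS: usize = 1024;
const NUM_HASHES: usize = 3;

// Toy bloom filter: a fixed-size bitset with NUM_HASHES seeded hashes.
#[derive(Clone)]
struct ToyBloom {
    words: [u64; NUM_BITS / 64],
}

impl ToyBloom {
    fn new() -> Self {
        Self { words: [0; NUM_BITS / 64] }
    }

    // Bit positions for a value; identical across all ToyBloom instances.
    fn bit_positions<T: Hash>(value: &T) -> [usize; NUM_HASHES] {
        let mut out = [0usize; NUM_HASHES];
        for (seed, slot) in out.iter_mut().enumerate() {
            let mut h = DefaultHasher::new();
            (seed as u64).hash(&mut h);
            value.hash(&mut h);
            *slot = (h.finish() % NUM_BITS as u64) as usize;
        }
        out
    }

    fn insert<T: Hash>(&mut self, value: &T) {
        for pos in Self::bit_positions(value) {
            self.words[pos / 64] |= 1u64 << (pos % 64);
        }
    }

    // Per-value probe: what the current per-item loop does conceptually.
    fn might_contain<T: Hash>(&self, value: &T) -> bool {
        Self::bit_positions(value)
            .iter()
            .all(|&pos| self.words[pos / 64] & (1u64 << (pos % 64)) != 0)
    }

    // Filter-vs-filter check: false means the two sets are definitely
    // disjoint; true means they may share a value (false positives possible).
    fn may_intersect(&self, other: &ToyBloom) -> bool {
        self.words
            .iter()
            .zip(other.words.iter())
            .any(|(a, b)| a & b != 0)
    }
}
```

   The trade-off is that `may_intersect` costs a fixed number of word 
operations regardless of list length, but it is a weaker test than probing 
each value: unrelated values can share individual bits, so it reports "may 
intersect" more often than the per-value loop would.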


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


