[GitHub] [arrow-rs] thinkharderdev commented on issue #2270: Changes to ParquetRecordBatchStream to support row filtering in DataFusion

GitBox Mon, 01 Aug 2022 16:14:47 -0700


thinkharderdev commented on issue #2270:
URL: https://github.com/apache/arrow-rs/issues/2270#issuecomment-1201828410


   > This system will only be able to pushdown to eliminate decode overheads, 
i.e. it will be unable to eliminate IO to fetch data (which is fine we have the 
page index for that)
   
   Agreed, the ideal scenario would be to use the page index to avoid the IO 
altogether. I think the tradeoff there will be around when we fetch the data. 
Using buffered prefetch has been a big performance improvement for us, and 
avoiding the eager IO in order to possibly fetch less data could be a net loss 
depending on how much you can reduce the data fetched. 
   
   > I wonder if it would be simpler to push the predicate into the 
ParquetRecordBatchReader that way you don't need to futz around with async, and 
would also potentially eventually allow predicate evaluation on encoded data
   
   Yeah, it would be nice to be able to just pass a `FnOnce(RecordBatch) -> 
Result<(RecordBatch, VecDeque<RowSelection>)>` into `ParquetRecordBatchStream` 
and let it handle all the details. 
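
   As a rough sketch of what such a closure might hand back, a per-row 
predicate result can be collapsed into alternating read/skip runs. The 
`RowSelection` struct and `mask_to_selection` helper below are hypothetical 
stand-ins for illustration, not the actual arrow-rs types or API:

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in: a run of consecutive rows to either read or skip.
#[derive(Debug, Clone, PartialEq, Eq)]
struct RowSelection {
    row_count: usize,
    skip: bool,
}

/// Collapse a per-row predicate mask (true = keep the row) into
/// alternating runs of rows to read and rows to skip.
fn mask_to_selection(mask: &[bool]) -> VecDeque<RowSelection> {
    let mut runs: VecDeque<RowSelection> = VecDeque::new();
    for &keep in mask {
        let skip = !keep;
        match runs.back_mut() {
            // Same kind of run as the previous row: extend it.
            Some(last) if last.skip == skip => last.row_count += 1,
            // First row, or the kind of run changed: start a new one.
            _ => runs.push_back(RowSelection { row_count: 1, skip }),
        }
    }
    runs
}
```

   The resulting queue could then drive decoding of the remaining projected 
columns, reading only the runs where `skip` is false.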
   
   > We probably need to work on a way to represent predicates within parquet 
directly so that all the various pruning, skipping logic can be centralised and 
not spread across two repos
   
   Agreed, it would be great to just be able to pass a predicate in an 
expression language defined in `arrow-rs`. That said, it would be good to 
expose an arbitrary computation as well, so you can use things like scalar 
functions for filtering. 
   
   > The nature of parquet is such that skipping runs of rows less than the 
normal batch_size may in fact be slower than just reading them normally. This 
means if we don't determine the ranges up front, we'll need some way to bail 
out if it gets too expensive
   
   Interesting, when is this the case? Would there be situations where the 
decoder couldn't optimize based on the number of values to skip? 
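
   If short skips do turn out to be costly, one possible way to bail out 
would be to treat any skip run below some threshold as a read and filter those 
rows after decoding. A minimal sketch, where `RowSelection`, 
`coalesce_short_skips` and `min_skip` are all hypothetical names and the 
threshold would need to be found by benchmarking:

```rust
/// Hypothetical stand-in: a run of consecutive rows to either read or skip.
#[derive(Debug, Clone, PartialEq, Eq)]
struct RowSelection {
    row_count: usize,
    skip: bool,
}

/// Turn skip runs shorter than `min_skip` into reads and merge adjacent
/// runs of the same kind, so the decoder only sees long skips.
fn coalesce_short_skips(runs: &[RowSelection], min_skip: usize) -> Vec<RowSelection> {
    let mut out: Vec<RowSelection> = Vec::new();
    for run in runs {
        // A skip survives only if it is long enough to be worth it.
        let skip = run.skip && run.row_count >= min_skip;
        match out.last_mut() {
            // Same kind as the previous run: merge the row counts.
            Some(last) if last.skip == skip => last.row_count += run.row_count,
            _ => out.push(RowSelection { row_count: run.row_count, skip }),
        }
    }
    out
}
```

   Note that the sketch does not record which rows were read only because 
their skip was too short; a real implementation would still have to filter 
those out of the decoded batch.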


