thinkharderdev commented on issue #2270: URL: https://github.com/apache/arrow-rs/issues/2270#issuecomment-1201828410
> This system will only be able to pushdown to eliminate decode overheads, i.e. it will be unable to eliminate IO to fetch data (which is fine we have the page index for that)

Agreed, the ideal scenario would be to use the page index to avoid the IO altogether. I think the tradeoff there will be around when we fetch the data. Using buffered prefetch has been a big performance improvement for us, and avoiding the eager IO in order to possibly fetch less data could be a net loss, depending on how much you can reduce the data fetched.

> I wonder if it would be simpler to push the predicate into the ParquetRecordBatchReader that way you don't need to futz around with async, and would also potentially eventually allow predicate evaluation on encoded data

Yeah, it would be nice to be able to just pass a `FnOnce(RecordBatch) -> Result<(RecordBatch, VecDeque<RowSelection>)>` into `ParquetRecordBatchStream` and let it handle all the details.

> We probably need to work on a way to represent predicates within parquet directly so that all the various pruning, skipping logic can be centralised and not spread across two repos

Agreed, it would be great to just be able to pass a predicate in an expression language defined in `arrow-rs`. That said, it would be good to expose an arbitrary computation as well, so you can use things like scalar functions for filtering.

> The nature of parquet is such that skipping runs of rows less than the normal batch_size may in fact be slower than just reading them normally. This means if we don't determine the ranges up front, we'll need some way to bail out if it gets too expensive

Interesting, when is this the case? Would there be situations where the decoder couldn't optimize based on the number of values to skip?
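To make the closure idea above concrete, here is a minimal sketch of what such a callback could look like. It deliberately does not use the real `arrow-rs` or `parquet` types: `Batch`, `Selection`, `filter_batch`, and `apply_predicate` are hypothetical stand-ins invented for illustration, and the error type is a plain `String`. The closure receives a decoded batch and returns the filtered batch together with a run-length list of which rows were kept or skipped.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for an Arrow `RecordBatch`: a single i64 column.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    values: Vec<i64>,
}

/// Hypothetical stand-in for a row selection: a run of rows that is
/// either skipped or read.
#[derive(Debug, Clone, PartialEq)]
struct Selection {
    row_count: usize,
    skip: bool,
}

type FilterResult = Result<(Batch, VecDeque<Selection>), String>;

/// Evaluate `keep` on every row and return the surviving rows plus
/// run-length-encoded selections describing which rows survived.
fn filter_batch(batch: Batch, keep: impl Fn(i64) -> bool) -> FilterResult {
    let mut kept = Vec::new();
    let mut selections: VecDeque<Selection> = VecDeque::new();
    for v in batch.values {
        let skip = !keep(v);
        if !skip {
            kept.push(v);
        }
        // Extend the current run if this row has the same outcome.
        if let Some(last) = selections.back_mut() {
            if last.skip == skip {
                last.row_count += 1;
                continue;
            }
        }
        // Otherwise start a new run.
        selections.push_back(Selection { row_count: 1, skip });
    }
    Ok((Batch { values: kept }, selections))
}

/// Hypothetical reader hook: hands one decoded batch to a caller-supplied
/// closure, mirroring the `FnOnce(batch) -> Result<(batch, selections)>` shape.
fn apply_predicate<F>(batch: Batch, predicate: F) -> FilterResult
where
    F: FnOnce(Batch) -> FilterResult,
{
    predicate(batch)
}

fn main() {
    let batch = Batch { values: vec![1, 12, 15, 3, 4, 20] };
    let (filtered, selections) =
        apply_predicate(batch, |b| filter_batch(b, |v| v > 10)).unwrap();
    println!("kept rows: {:?}", filtered.values);
    for s in &selections {
        println!("{} row(s), skip = {}", s.row_count, s.skip);
    }
}
```

Because the closure is arbitrary code, it could just as well call a scalar function as compare against a constant. A real stream would invoke the callback once per batch, so it would likely need `FnMut` rather than `FnOnce`.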