tustvold commented on issue #1191: URL: https://github.com/apache/arrow-rs/issues/1191#issuecomment-1015431173
> I probably don't fully understand what "evaluating predicates against encoded data" means compared to what you have proposed in this ticket.

In this ticket I propose passing a bitmask down to the scan. The parquet crate would have no involvement in generating this mask, nor would it understand the predicates involved in generating it. Think of it like the take kernel: you give it a mask and it returns those rows, but it has no idea why the query engine is requesting those rows.

This would mean streaming out data as an arrow RecordBatch (or Array) in order to evaluate predicates. However, my hope is that with the dictionary preservation this will be relatively cheap, and successive predicate evaluations will retrieve successively fewer rows.

A further optimisation would then be to evaluate the predicates directly on the underlying parquet data, without first decoding to an arrow representation. This is what I'm wondering whether I should create a ticket for. It still needs the "take" support in order for the generated masks to be useful, but it would likely speed up their generation compared with decoding to an arrow representation first.
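To make the division of responsibilities concrete, here is a minimal sketch in plain Rust, with no arrow or parquet dependency. Everything in it is hypothetical and only illustrates the idea: `Column` stands in for a decoded column, `scan_with_mask` for the proposed mask-aware scan, and `refine_mask` for the query engine evaluating one predicate. None of these names exist in the parquet or arrow crates.

```rust
/// Hypothetical stand-in for a decoded column; not a parquet or arrow type.
pub struct Column {
    pub values: Vec<i64>,
}

/// Sketch of the scan side: return only the rows whose mask entry is set.
/// It knows nothing about the predicates that produced the mask.
pub fn scan_with_mask(column: &Column, mask: &[bool]) -> Vec<i64> {
    assert_eq!(column.values.len(), mask.len(), "mask must cover every row");
    column
        .values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| *value)
        .collect()
}

/// Sketch of the query-engine side: narrow an existing mask with one more
/// predicate, so each successive scan has fewer rows to return.
pub fn refine_mask(
    column: &Column,
    mask: &[bool],
    predicate: impl Fn(i64) -> bool,
) -> Vec<bool> {
    assert_eq!(column.values.len(), mask.len(), "mask must cover every row");
    column
        .values
        .iter()
        .zip(mask)
        .map(|(value, keep)| *keep && predicate(*value))
        .collect()
}
```

A query engine would start from an all-true mask, call `refine_mask` once per predicate column, and finally pass the resulting mask to `scan_with_mask` for each projected column. Evaluating predicates directly on encoded parquet data would only change how the mask is produced, not how the scan consumes it.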
