[GitHub] [arrow-rs] tustvold commented on issue #1191: Parquet Scan Filter

GitBox Tue, 18 Jan 2022 05:51:42 -0800


tustvold commented on issue #1191:
URL: https://github.com/apache/arrow-rs/issues/1191#issuecomment-1015431173



   > I probably don't fully understand what "evaluating predicates against 
encoded data" means compared to what you have proposed in this ticket.
   
   In this ticket I propose passing a bitmask down to the scan. The parquet 
crate would have no involvement in generating this mask, nor would it 
understand the predicates involved in generating it. Think of it like the take 
kernel: you give it a mask and it returns those rows, but it has no idea why 
the query engine is requesting those rows.
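As a rough illustration of the "mask in, rows out" contract described above, here is a minimal sketch in plain Rust. The `select_rows` helper is hypothetical and uses slices rather than arrow arrays or parquet readers; it only shows that the selecting side needs no knowledge of the predicate that produced the mask. (In arrow's compute module, selection by boolean mask is what the `filter` kernel does, while `take` selects by index.)

```rust
/// Return the values whose mask entry is `true`, preserving row order.
/// Hypothetical helper for illustration; not part of the parquet or arrow crates.
fn select_rows<T: Clone>(values: &[T], mask: &[bool]) -> Vec<T> {
    assert_eq!(values.len(), mask.len(), "mask must cover every row");
    values
        .iter()
        .zip(mask)
        // Keep only the rows the caller asked for; the reason is opaque here.
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.clone())
        .collect()
}
```

For example, calling `select_rows(&[10, 20, 30, 40], &[true, false, false, true])` keeps the first and last rows.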
   
   This would mean streaming out data as an arrow RecordBatch (or Array) in 
order to evaluate predicates. However, my hope is that with dictionary 
preservation this will be relatively cheap, and successive predicate 
evaluations will retrieve successively fewer rows.
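The shrinking-row-count idea can be sketched as follows, assuming each predicate applies to a single column and the columns are already available as slices. The `refine_mask` helper is hypothetical; a real scan would use the mask to skip decoding, which this sketch only approximates by counting the rows it inspects.

```rust
/// Evaluate `predicate` only on rows still selected by `mask`, clearing the
/// mask entry of every row that fails. Returns how many rows were inspected.
/// Hypothetical helper for illustration; not an arrow or parquet API.
fn refine_mask<T>(column: &[T], mask: &mut [bool], predicate: impl Fn(&T) -> bool) -> usize {
    assert_eq!(column.len(), mask.len(), "mask must cover every row");
    let mut inspected = 0;
    for (value, keep) in column.iter().zip(mask.iter_mut()) {
        // Rows already rejected by an earlier predicate are skipped entirely.
        if *keep {
            inspected += 1;
            *keep = predicate(value);
        }
    }
    inspected
}
```

Starting from an all-true mask, the first call inspects every row, and each later call inspects only the rows that survived the earlier predicates.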
   
   A further optimisation would then be to evaluate the predicates 
directly on the underlying parquet data, without first decoding to an arrow 
representation. Should I create a separate ticket for this? 
It still needs the "take" support in order for the generated masks to be 
useful, but it would likely speed up their generation compared with decoding 
to an arrow representation first.
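One way to picture evaluating a predicate on encoded data is the dictionary case, sketched below under simplifying assumptions: the column is fully dictionary-encoded, `dictionary` holds the distinct values, and `keys` holds one dictionary index per row. These names and the `mask_from_dictionary` helper are hypothetical stand-ins, not the parquet crate's API, and real pages also involve RLE/bit-packed keys and nulls, which are ignored here.

```rust
/// Build a row mask for a dictionary-encoded column by evaluating `predicate`
/// once per distinct dictionary value rather than once per row.
/// Hypothetical helper for illustration; panics if a key is out of range.
fn mask_from_dictionary<T>(
    dictionary: &[T],
    keys: &[usize],
    predicate: impl Fn(&T) -> bool,
) -> Vec<bool> {
    // One predicate call per dictionary entry.
    let matches: Vec<bool> = dictionary.iter().map(|v| predicate(v)).collect();
    // One table lookup per row; the row values are never materialised.
    keys.iter().map(|&key| matches[key]).collect()
}
```

The resulting mask could then be handed to the scan in the same way as a mask produced from decoded arrow data.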


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

