hhhizzz commented on issue #10140:
URL: https://github.com/apache/arrow-rs/issues/10140#issuecomment-4732963433

   thanks @alamb.  I think the change is useful, and the synthetic numbers 
shared earlier look plausible to me. They match the expected behavior for a 
caller that already has a bitmap-style row mask: fragmented selections can 
explode into a very large `Vec<RowSelector>`, while keeping the selection as a 
bitmap avoids that cost.
   
   However, after checking the current DataFusion paths, I think this is hard 
to exercise with the existing TPC-DS / ClickBench SQL benchmarks. Those 
benchmarks can exercise Parquet row filtering, but they do not appear to 
construct a bitmap-backed `RowSelection`. The row-filter path still goes 
through `RowSelection::from_filters`, and page/access-plan selections generally 
use `RowSelection::from(Vec<RowSelector>)`. So I was not able to make TPC-DS or 
ClickBench naturally hit the core `RowSelection::from_boolean_buffer` / 
`RowSelectionInner::Mask -> new_mask_from_buffer` path.
   
   My understanding is that the main trigger condition is a more specific 
caller shape: an external index / FTS / bitmap-index layer that already has a 
row-level bitmap and passes it directly into Parquet as a `RowSelection`.
   
   Given that, I think it would be helpful to add targeted coverage for the 
intended path, for example:
   
   - a microbenchmark comparing `from_filters` vs `from_boolean_buffer` on 
fragmented, clustered, and random masks;
   - an Arrow reader benchmark using 
`with_row_selection(RowSelection::from_boolean_buffer(mask))`;
   - possibly a DataFusion external `ParquetAccessPlan` test/benchmark that 
passes a bitmap-backed `RowSelection` directly.
   
   That would make the practical benefit easier to validate. I would expect 
fragmented masks with many short runs to benefit the most, while clustered 
masks may still favor the existing selector/RLE representation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to