hhhizzz commented on issue #10140:
URL: https://github.com/apache/arrow-rs/issues/10140#issuecomment-4768760679

   I think the benchmark results are directionally useful, but I want to 
separate two questions:
   
   1. Is bitmap-backed `RowSelection` faster when the bitmap already exists?
   2. Where does that bitmap-backed `RowSelection` come from in real execution 
paths?
   
   For (1), the PR benchmark does show clear benefits in the intended case. For 
fragmented/random masks at common selectivities, bitmap-backed selection can be 
much faster. But the result is shape-dependent: very sparse or clustered 
selections can still favor selector/RLE representation.
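
As a rough illustration of the shape dependence (a std-only sketch, not taken from the PR or its benchmark), the number of select/skip runs an RLE-style selection needs depends on how often the mask flips, while a packed bitmap always costs one bit per row. The helper names `count_runs` and `bitmap_bytes` are hypothetical and exist only for this example.

```rust
/// Count maximal runs of equal values in a row-level mask.
/// In an RLE-style selection, each run would become one select/skip entry.
fn count_runs(mask: &[bool]) -> usize {
    if mask.is_empty() {
        return 0;
    }
    // One run to start, plus one more for every position where the value flips.
    1 + mask.windows(2).filter(|w| w[0] != w[1]).count()
}

/// Bytes needed for a packed bitmap covering `rows` rows (one bit per row),
/// independent of the shape of the mask.
fn bitmap_bytes(rows: usize) -> usize {
    (rows + 7) / 8
}
```

For 1024 rows, an alternating mask yields 1024 runs while a mask selecting only the first half yields 2; the bitmap is 128 bytes in both cases. This only models representation size, not decode cost, so it suggests why fragmented masks may favor a bitmap and clustered ones may favor selectors, without measuring either.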
   
   For (2), this is where I see the current practical limitation. In my 
DataFusion checks, I could not find a common SQL path that naturally produces 
`RowSelection::from_boolean_buffer`:
   
   - row-filter predicates produce `BooleanArray`, but the first predicate 
selection still goes through `RowSelection::from_filters`;
   - page/access-plan pruning generally produces selector/RLE-style 
`RowSelection`;
   - TPC-DS / ClickBench do not seem to naturally construct bitmap-backed 
`RowSelection`.
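
To make the first bullet concrete, here is a minimal std-only model of what it means for a boolean mask to end up as a selector/RLE-style selection. It is a sketch under assumptions, not the parquet crate's `RowSelection::from_filters` implementation: the `Selector` struct and `mask_to_selectors` function are hypothetical stand-ins, and a plain `&[bool]` stands in for a `BooleanArray`.

```rust
/// Hypothetical stand-in for one RLE entry: `row_count` consecutive rows
/// that are either all skipped (`skip == true`) or all selected.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Collapse a row-level mask into alternating select/skip runs.
/// After this step the original bitmap is no longer available to the consumer.
fn mask_to_selectors(mask: &[bool]) -> Vec<Selector> {
    let mut out: Vec<Selector> = Vec::new();
    for &selected in mask {
        let skip = !selected;
        // Extend the current run if this row has the same kind as the last one.
        if let Some(last) = out.last_mut() {
            if last.skip == skip {
                last.row_count += 1;
                continue;
            }
        }
        // Otherwise start a new run of length 1.
        out.push(Selector { row_count: 1, skip });
    }
    out
}
```

For example, the mask `[true, true, false, false, false, true]` becomes select 2, skip 3, select 1. Once a producer has collapsed the mask this way, a bitmap-optimized consumer has nothing to pick up unless the bitmap is rebuilt or preserved upstream.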
   
   I also tried an Arrow-side experiment where I explicitly 
constructed/preserved bitmap-backed row selections to force this path. Even 
then, the broad end-to-end result was limited: full 99-query TPC-DS SF10 runs 
were basically neutral, with a geomean difference under roughly 0.5%. Some 
targeted Arrow row-filter microbenchmarks showed small single-digit wins, for 
example around 6%-7% in a couple of sync row-filter cases, but I would not 
treat those as evidence that current DataFusion workloads will naturally 
benefit from this PR.
   
   So my current understanding is:
   
   - this PR is useful when an upstream caller already has a row-level bitmap, 
such as an external index / FTS / bitmap-index integration;
   - for current DataFusion TPC-DS / ClickBench style workloads, the optimized 
path is very hard to trigger;
   - from a performance perspective, **the more important missing piece may be 
the producer side: how to create or preserve bitmap-backed `RowSelection` in 
real scan paths.**
   
   It may be worth documenting this scope, and possibly adding an 
integration-style benchmark that starts from an actual bitmap-producing access 
path. That would make it clearer that the PR optimizes preserving/consuming an 
existing bitmap, rather than making the common DataFusion SQL path faster by 
itself.

