hhhizzz commented on issue #10140: URL: https://github.com/apache/arrow-rs/issues/10140#issuecomment-4732963433
thanks @alamb. I think the change is useful, and the synthetic numbers shared earlier look plausible to me. They match the expected behavior for a caller that already has a bitmap-style row mask: fragmented selections can explode into a very large `Vec<RowSelector>`, while keeping the selection as a bitmap avoids that cost. However, after checking the current DataFusion paths, I think this is hard to exercise with the existing TPC-DS / ClickBench SQL benchmarks. Those benchmarks can exercise Parquet row filtering, but they do not appear to construct a bitmap-backed `RowSelection`. The row-filter path still goes through `RowSelection::from_filters`, and page/access-plan selections generally use `RowSelection::from(Vec<RowSelector>)`. So I was not able to make TPC-DS or ClickBench naturally hit the core `RowSelection::from_boolean_buffer` / `RowSelectionInner::Mask -> new_mask_from_buffer` path. My understanding is that the main trigger condition is a more specific caller shape: an external index / FTS / bitmap-index layer that already has a row-level bitmap and passes it directly into Parquet as a `RowSelection`. Given that, I think it would be helpful to add targeted coverage for the intended path, for example: - a microbenchmark comparing `from_filters` vs `from_boolean_buffer` on fragmented, clustered, and random masks; - an Arrow reader benchmark using `with_row_selection(RowSelection::from_boolean_buffer(mask))`; - possibly a DataFusion external `ParquetAccessPlan` test/benchmark that passes a bitmap-backed `RowSelection` directly. That would make the practical benefit easier to validate. I would expect fragmented masks with many short runs to benefit the most, while clustered masks may still favor the existing selector/RLE representation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
