alamb opened a new issue, #8846: URL: https://github.com/apache/arrow-rs/issues/8846
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

After the great work from @hhhizzz in https://github.com/apache/arrow-rs/pull/8733, we will (finally) have the ability to use a Bitmask filter representation when applying filters *during* Parquet decode.

At the moment, the code relies on a simple threshold strategy to pick between representations:

https://github.com/apache/arrow-rs/blob/911331aafa13f5e230440cf5d02feb245985c64e/parquet/src/arrow/arrow_reader/read_plan.rs#L107-L130

However, as @hhhizzz mentions in https://github.com/apache/arrow-rs/pull/8733#discussion_r2506343981

> Yes, my charts indicate that there are many rules for setting the RowSelectionStrategy, like the column type, column count, string length, and their combinations... We can create tickets and collaborate on improving these over time.

**Describe the solution you'd like**

I would like better heuristics for selecting between the strategies.

**Describe alternatives you've considered**

@hhhizzz has some good suggestions, and the charts from https://github.com/apache/arrow-rs/pull/8733#issuecomment-3468441165 offer some good ideas:

> For how I get the average length to use the mask, here are some statistics. You can check out (https://github.com/hhhizzz/arrow-rs/tree/rowselectionempty-charts) and run `python3 dev/row_selection_analysis.py` on your local machine. These are the results on my x86 PC:
>
> # One column `int32`, different distribution type:
> <img width="768" height="576" alt="scenario-dense80-dense80" src="https://github.com/user-attachments/assets/30582512-ba85-43d7-9c20-02632020a08b" />
> <img width="768" height="576" alt="scenario-sparse20-sparse20" src="https://github.com/user-attachments/assets/dcbcf7ff-2a4a-4039-bcf4-3d821f8d39a1" />
> <img width="768" height="576" alt="scenario-spread50-spread50" src="https://github.com/user-attachments/assets/ff585f4e-959a-40cf-a580-911720c5fef8" />
> <img width="768" height="576" alt="scenario-uniform50-uniform50" src="https://github.com/user-attachments/assets/a581877a-15e4-445c-9b04-f4b9d25a4749" />
>
> # Different column type:
> <img width="768" height="576" alt="dtype-int32-uniform50" src="https://github.com/user-attachments/assets/732befae-e6d2-4d01-9c88-f7d1e0a31269" />
> <img width="768" height="576" alt="dtype-utf8view-uniform50" src="https://github.com/user-attachments/assets/d097c6e2-3099-4001-8ec7-3e6b4ec9be79" />
> <img width="768" height="576" alt="dtype-float64-uniform50" src="https://github.com/user-attachments/assets/ca738a33-750a-472e-ae65-627a4e8387ef" />
>
> # Different column counts:
> <img width="768" height="576" alt="columns-C02-uniform50" src="https://github.com/user-attachments/assets/d0230ed3-a80e-470c-b9f8-72d012486430" />
> <img width="768" height="576" alt="columns-C04-uniform50" src="https://github.com/user-attachments/assets/7f5a252d-5ce3-44b8-a7af-c98bc9426dd5" />
> <img width="768" height="576" alt="columns-C08-uniform50" src="https://github.com/user-attachments/assets/ff658cec-6f6f-41ea-87fd-599096ab6d41" />
> <img width="768" height="576" alt="columns-C16-uniform50" src="https://github.com/user-attachments/assets/d903cf45-0336-4fdd-a1f0-c3b6bd331bdd" />
> <img width="768" height="576" alt="columns-C32-uniform50" src="https://github.com/user-attachments/assets/4bde5045-4391-446e-a678-675109b5e193" />

**Additional context**
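
For readers unfamiliar with the idea, here is a minimal sketch of what an "average run length" threshold heuristic could look like. It is not the code in `read_plan.rs`: the `Run` and `Representation` types, the `choose_representation` function, and the threshold parameter are all hypothetical names invented for this illustration, and the sketch deliberately ignores the other factors mentioned above (column type, column count, string length).

```rust
/// Hypothetical stand-in for one run of consecutive rows that is
/// either skipped or selected (not the real arrow-rs type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Run {
    pub row_count: usize,
    pub skip: bool,
}

/// Hypothetical name for the two filter representations discussed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Representation {
    /// Keep the selection as a list of skip/select runs.
    Selectors,
    /// Expand the selection into one bit per row.
    Mask,
}

/// Pick a representation from the average run length alone.
///
/// Assumption for this sketch: short, fragmented runs favor a bitmask,
/// while long runs favor run-based selectors. `avg_run_threshold` is a
/// tunable parameter, not a value taken from arrow-rs.
pub fn choose_representation(runs: &[Run], avg_run_threshold: usize) -> Representation {
    if runs.is_empty() {
        // Nothing to filter, so the run-based form is trivially cheap.
        return Representation::Selectors;
    }
    let total_rows: usize = runs.iter().map(|r| r.row_count).sum();
    // Equivalent to `total_rows / runs.len() < avg_run_threshold`,
    // written without integer division to avoid rounding.
    if total_rows < avg_run_threshold.saturating_mul(runs.len()) {
        Representation::Mask
    } else {
        Representation::Selectors
    }
}
```

With an arbitrary threshold of 32, three runs covering 6 rows in total would pick `Representation::Mask`, while two runs covering 6000 rows would pick `Representation::Selectors`. Better heuristics would presumably replace the single threshold with inputs such as the column types and the number of projected columns.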
