alamb opened a new issue, #8846:
URL: https://github.com/apache/arrow-rs/issues/8846

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   After the great work from @hhhizzz  in 
https://github.com/apache/arrow-rs/pull/8733, we will (finally) have the 
ability to use a Bitmask filter representation when applying filters *during* 
Parquet decode. 
   
   At the moment, the code relies on a simple threshold strategy to pick 
between the two representations:
   
   
https://github.com/apache/arrow-rs/blob/911331aafa13f5e230440cf5d02feb245985c64e/parquet/src/arrow/arrow_reader/read_plan.rs#L107-L130
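
For readers who do not want to follow the link, here is a minimal sketch of what a threshold strategy of this kind can look like. It is not the code in `read_plan.rs`: `Strategy`, `choose_strategy`, and the representation of a selection as a slice of run lengths are hypothetical names chosen for illustration, and the only assumption taken from the discussion is that the decision is driven by the average run length of the selection.

```rust
/// Hypothetical stand-in for the two filter representations.
#[derive(Debug, PartialEq, Eq)]
enum Strategy {
    /// Run-length encoded select/skip runs.
    Selectors,
    /// One bit per row.
    Mask,
}

/// Pick a representation from the lengths of the alternating select/skip runs.
///
/// Assumption: short average runs favor the bitmask, long runs favor
/// the run-length encoded selectors.
fn choose_strategy(run_lengths: &[usize], threshold: usize) -> Strategy {
    let total_rows: usize = run_lengths.iter().sum();
    if run_lengths.is_empty() || total_rows == 0 {
        // Nothing to decide on; fall back to selectors.
        return Strategy::Selectors;
    }
    // Equivalent to `total_rows / run_lengths.len() < threshold`
    // but without the rounding of integer division.
    if total_rows < threshold.saturating_mul(run_lengths.len()) {
        Strategy::Mask
    } else {
        Strategy::Selectors
    }
}
```

With a placeholder threshold of 32 (picked for this example only), `choose_strategy(&[1, 2, 1, 3, 1], 32)` selects `Strategy::Mask`, while `choose_strategy(&[1000, 5000], 32)` selects `Strategy::Selectors`. A single scalar threshold like this cannot account for column type, column count, or string length, which is the limitation this issue is about.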
   
   However, as @hhhizzz mentions in 
https://github.com/apache/arrow-rs/pull/8733#discussion_r2506343981
   
   > Yes, my charts indicate that there are many rules for setting the 
RowSelectionStrategy, like the column type, column count, string length, and 
their combinations... We can create tickets and collaborate on improving these 
over time.
   
   **Describe the solution you'd like**
   I would like better heuristics for selecting between the strategies.
   
   **Describe alternatives you've considered**
   @hhhizzz  has some good suggestions, and the charts from 
https://github.com/apache/arrow-rs/pull/8733#issuecomment-3468441165 offer some 
good ideas:
   
   > For how I got the average length at which to use the mask, here are some 
statistics: you can check out 
https://github.com/hhhizzz/arrow-rs/tree/rowselectionempty-charts and run 
`python3 dev/row_selection_analysis.py` on your local machine. These are the 
results on my x86 PC:
   > # One column `int32`, different distribution type:
   > <img width="768" height="576" alt="scenario-dense80-dense80" src="https://github.com/user-attachments/assets/30582512-ba85-43d7-9c20-02632020a08b" />
   > <img width="768" height="576" alt="scenario-sparse20-sparse20" src="https://github.com/user-attachments/assets/dcbcf7ff-2a4a-4039-bcf4-3d821f8d39a1" />
   > <img width="768" height="576" alt="scenario-spread50-spread50" src="https://github.com/user-attachments/assets/ff585f4e-959a-40cf-a580-911720c5fef8" />
   > <img width="768" height="576" alt="scenario-uniform50-uniform50" src="https://github.com/user-attachments/assets/a581877a-15e4-445c-9b04-f4b9d25a4749" />
   > 
   > # Different column type:
   > <img width="768" height="576" alt="dtype-int32-uniform50" src="https://github.com/user-attachments/assets/732befae-e6d2-4d01-9c88-f7d1e0a31269" />
   > <img width="768" height="576" alt="dtype-utf8view-uniform50" src="https://github.com/user-attachments/assets/d097c6e2-3099-4001-8ec7-3e6b4ec9be79" />
   > <img width="768" height="576" alt="dtype-float64-uniform50" src="https://github.com/user-attachments/assets/ca738a33-750a-472e-ae65-627a4e8387ef" />
   >
   > # Different column counts:
   > <img width="768" height="576" alt="columns-C02-uniform50" src="https://github.com/user-attachments/assets/d0230ed3-a80e-470c-b9f8-72d012486430" />
   > <img width="768" height="576" alt="columns-C04-uniform50" src="https://github.com/user-attachments/assets/7f5a252d-5ce3-44b8-a7af-c98bc9426dd5" />
   > <img width="768" height="576" alt="columns-C08-uniform50" src="https://github.com/user-attachments/assets/ff658cec-6f6f-41ea-87fd-599096ab6d41" />
   > <img width="768" height="576" alt="columns-C16-uniform50" src="https://github.com/user-attachments/assets/d903cf45-0336-4fdd-a1f0-c3b6bd331bdd" />
   > <img width="768" height="576" alt="columns-C32-uniform50" src="https://github.com/user-attachments/assets/4bde5045-4391-446e-a678-675109b5e193" />
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
