[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #3769: Add benchmarks for testing row filtering

GitBox Sat, 08 Oct 2022 21:43:04 -0700


Ted-Jiang commented on PR #3769:
URL: https://github.com/apache/arrow-datafusion/pull/3769#issuecomment-1272454236


   @thinkharderdev thanks for the great benchmark.
   I ran parquet-tools locally on the data file (1.0 GB) and got:
   ```
   (venv) yangjiang@LM-SHC-15009782 data % parquet-tools column-index ./logs.parquet
   row group 0:
   column index for column service:
   Boudary order: UNORDERED
                         null count  min                                       max
   page-0                         0  backend                                   frontend
   
   offset index for column service:
                             offset   compressed size       first row index
   page-0                        62               117                     0
   
   column index for column host:
   Boudary order: UNORDERED
                         null count  min                                       max
   page-0                         0  i-1ec3ca3151468928.ec2.internal           i-1ec408f54dbd3750.ec2.internal
   
   offset index for column host:
                             offset   compressed size       first row index
   page-0                       566               125                     0
   
   column index for column pod:
   Boudary order: UNORDERED
                         null count  min                                       max
   page-0                         0  aejowuublavflbbsvlfozigwpmrxldvhaollk     zxxlzhdrucrhpicpdgxtfpyuknvviimggtq
   
   offset index for column pod:
                             offset   compressed size       first row index
   page-0                      6689               602                     0
   
   column index for column container:
   Boudary order: UNORDERED
                         null count  min                                       max
   page-0                         0  backend_container_0                       frontend_container_1
   
   offset index for column container:
                             offset   compressed size       first row index
   page-0                      7602               593                     0
   ```
   Each column has at most two pages. I think that if we adjust the data to get
   more pages per column (for example, by reducing the page size), enabling
   `enable_page_index` will give a bigger performance gain, because there will be
   more opportunities to skip whole pages without decoding! 🤔
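
To illustrate why more pages per column can help, here is a minimal sketch of min/max page pruning. It is a simulation only, not DataFusion's or parquet's actual implementation; `PageStats`, `build_pages`, `pages_to_decode`, and the sample column are hypothetical names and data made up for this example.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Min/max statistics for one data page, like an entry in a column index."""
    min_value: str
    max_value: str


def build_pages(values, rows_per_page):
    """Split a column into fixed-size pages and record each page's min/max."""
    pages = []
    for start in range(0, len(values), rows_per_page):
        chunk = values[start:start + rows_per_page]
        pages.append(PageStats(min(chunk), max(chunk)))
    return pages


def pages_to_decode(pages, target):
    """Return indices of pages whose [min, max] range could contain `target`."""
    return [i for i, p in enumerate(pages) if p.min_value <= target <= p.max_value]


# Hypothetical column: 8 rows, clustered by value (an assumption of this sketch).
column = ["backend"] * 4 + ["frontend"] * 4

one_page = build_pages(column, rows_per_page=8)
four_pages = build_pages(column, rows_per_page=2)

# With a single page nothing can be skipped; with four pages half of them can.
print(len(pages_to_decode(one_page, "frontend")), "of", len(one_page))
print(len(pages_to_decode(four_pages, "frontend")), "of", len(four_pages))
```

Note that the gain in this sketch depends on the values being clustered. If `backend` and `frontend` rows were interleaved, every page's min/max range would still cover the target and no page could be skipped, however small the pages are.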
   
   FYI, I see that Impala uses a fixed number of rows per page in its benchmarks
   to get good performance.


