[GitHub] [spark] wangyum opened a new pull request #31393: [SPARK-34289][SQL] Parquet vectorized reader support column index

GitBox Fri, 29 Jan 2021 00:38:36 -0800


wangyum opened a new pull request #31393:
URL: https://github.com/apache/spark/pull/31393



   ### What changes were proposed in this pull request?
   
   This PR makes the Parquet vectorized reader support the [column index](https://issues.apache.org/jira/browse/PARQUET-1201).
   
   ### Why are the changes needed?
   
   Improves filter performance. For example, with the filter `id = 1`, we only need to read `page-0` in `block 1`, because it is the only page whose min/max range contains 1:
   
   ```
   block 1:
             null count  min  max
   page-0             0    0   99
   page-1             0  100  199
   page-2             0  200  299
   page-3             0  300  399
   page-4             0  400  449

   block 2:
             null count  min  max
   page-0             0  450  549
   page-1             0  550  649
   page-2             0  650  749
   page-3             0  750  849
   page-4             0  850  899
   ```
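
The pruning idea behind the statistics above can be sketched in a few lines. This is only an illustration of min/max page skipping for an equality filter, not the code in this PR or in parquet-mr; `PageStats`, `PAGES`, and `pages_for_equality` are hypothetical names, and the values are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStats:
    """Per-page statistics, mirroring the columns of the table above."""
    block: str
    page: str
    null_count: int
    min: int
    max: int


# Statistics copied from the example in the PR description.
PAGES = [
    PageStats("block 1", "page-0", 0, 0, 99),
    PageStats("block 1", "page-1", 0, 100, 199),
    PageStats("block 1", "page-2", 0, 200, 299),
    PageStats("block 1", "page-3", 0, 300, 399),
    PageStats("block 1", "page-4", 0, 400, 449),
    PageStats("block 2", "page-0", 0, 450, 549),
    PageStats("block 2", "page-1", 0, 550, 649),
    PageStats("block 2", "page-2", 0, 650, 749),
    PageStats("block 2", "page-3", 0, 750, 849),
    PageStats("block 2", "page-4", 0, 850, 899),
]


def pages_for_equality(pages, value):
    """Return (block, page) pairs whose [min, max] range may contain value.

    Pages outside the range cannot match `column = value` and can be skipped.
    """
    return [(p.block, p.page) for p in pages if p.min <= value <= p.max]


if __name__ == "__main__":
    # For the filter `id = 1`, only page-0 of block 1 remains.
    print(pages_for_equality(PAGES, 1))
```

In this sketch a value that falls outside every page range, such as 1000, yields an empty list, so no page would need to be read at all.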
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



