[GitHub] [spark] sunchao commented on pull request #31998: [SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

2021-05-25 Thread GitBox


sunchao commented on pull request #31998:
URL: https://github.com/apache/spark/pull/31998#issuecomment-848390519


   @lxian does it mean that, without this PR, the vectorized Parquet reader may 
return incorrect results?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on pull request #31998: [SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

2021-05-27 Thread GitBox


sunchao commented on pull request #31998:
URL: https://github.com/apache/spark/pull/31998#issuecomment-850040241


   @lxian In the current approach we'd have to copy values from one vector to 
another. I think a better and more efficient approach may be to feed the row 
indexes to `VectorizedRleValuesReader#readXXX` and skip rows if they are not in 
the range, so basically we increment both `rowId` and row indexes in parallel. 
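
   A minimal sketch of that idea, under stated assumptions: the class 
`RowRangeSkipSketch`, the method `readSelected`, and the `long[][]` range 
layout are hypothetical and are not part of Spark or of this PR. A plain 
`int[]` stands in for the encoded page, and the returned list stands in for 
the output vector. The point is only that one pass can advance the page's row 
index and the current range together, without copying between vectors.

```java
import java.util.ArrayList;
import java.util.List;

final class RowRangeSkipSketch {

    /**
     * Hypothetical helper: keeps only the page values whose row index falls
     * inside one of the given ranges.
     *
     * @param pageValues    decoded values of one page (stand-in for the encoded stream)
     * @param firstRowIndex row index of pageValues[0] within the row group
     * @param ranges        sorted, non-overlapping {start, end} pairs, both inclusive
     */
    public static List<Integer> readSelected(int[] pageValues, long firstRowIndex, long[][] ranges) {
        List<Integer> out = new ArrayList<>();
        int rangeIdx = 0;
        long rowIndex = firstRowIndex;
        for (int i = 0; i < pageValues.length; i++, rowIndex++) {
            // Move to the next range once the current one is fully behind us.
            while (rangeIdx < ranges.length && rowIndex > ranges[rangeIdx][1]) {
                rangeIdx++;
            }
            if (rangeIdx == ranges.length) {
                break; // no ranges left, the rest of the page can be skipped
            }
            if (rowIndex >= ranges[rangeIdx][0]) {
                out.add(pageValues[i]); // in range: this is where rowId would advance
            }
            // Otherwise the row lies before the current range and is skipped.
        }
        return out;
    }

    public static void main(String[] args) {
        int[] page = {10, 11, 12, 13, 14, 15, 16, 17}; // rows 100..107
        long[][] ranges = {{102, 103}, {106, 106}};
        System.out.println(readSelected(page, 100, ranges)); // prints [12, 13, 16]
    }
}
```

   Because the ranges are sorted, `rangeIdx` only ever moves forward, so the 
extra work per row is a couple of comparisons.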





[GitHub] [spark] sunchao commented on pull request #31998: [SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

2021-05-28 Thread GitBox


sunchao commented on pull request #31998:
URL: https://github.com/apache/spark/pull/31998#issuecomment-850570200


   @lxian I'm thinking that the extra cost is just incrementing two indexes at 
the same time, so it should be fairly cheap. You can also refer to how 
[SynchronizingColumnReader](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/SynchronizingColumnReader.java#L89)
 does that. 
   
   Porting that logic to Spark is a bit tricky though, especially when it comes 
to handling the RLE-encoded definition levels. Let me try experimenting with 
this idea on my side too.
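
   To illustrate why RLE runs complicate this, here is a rough sketch that 
intersects whole runs with the row ranges instead of checking one row at a 
time. Everything in it is an assumption made for illustration: the class 
`RleRunRangeSketch`, the method `selectedPerRun`, and the modelling of a 
definition-level stream as a bare array of run lengths. It ignores bit-packed 
runs and the values themselves, and it is not the code in 
`VectorizedRleValuesReader`. The subtle part it shows is that a single range 
can span several runs, so the range cursor must not advance until the range 
actually ends.

```java
import java.util.Arrays;

final class RleRunRangeSketch {

    /**
     * Hypothetical helper: for each RLE run, counts how many of its rows fall
     * inside the given ranges.
     *
     * @param runLengths    number of rows covered by each consecutive run
     * @param firstRowIndex row index of the first row of the first run
     * @param ranges        sorted, non-overlapping {start, end} pairs, both inclusive
     */
    public static long[] selectedPerRun(long[] runLengths, long firstRowIndex, long[][] ranges) {
        long[] selected = new long[runLengths.length];
        int rangeIdx = 0;
        long runStart = firstRowIndex;
        for (int r = 0; r < runLengths.length; r++) {
            long runEnd = runStart + runLengths[r] - 1; // inclusive
            // Drop ranges that end before this run begins.
            while (rangeIdx < ranges.length && ranges[rangeIdx][1] < runStart) {
                rangeIdx++;
            }
            // Visit every range that overlaps this run.
            while (rangeIdx < ranges.length && ranges[rangeIdx][0] <= runEnd) {
                long lo = Math.max(runStart, ranges[rangeIdx][0]);
                long hi = Math.min(runEnd, ranges[rangeIdx][1]);
                selected[r] += hi - lo + 1;
                if (ranges[rangeIdx][1] > runEnd) {
                    break; // the range continues into the next run, keep the cursor on it
                }
                rangeIdx++;
            }
            runStart = runEnd + 1;
        }
        return selected;
    }

    public static void main(String[] args) {
        long[] runs = {4, 3, 5};              // rows 0..3, 4..6, 7..11
        long[][] ranges = {{1, 5}, {9, 9}};
        System.out.println(Arrays.toString(selectedPerRun(runs, 0, ranges))); // prints [3, 2, 1]
    }
}
```

   With the per-run counts in hand, a reader could skip or emit the rows of a 
run in bulk rather than one by one, which is where the RLE case differs from 
the row-at-a-time loop.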
   
   
   





[GitHub] [spark] sunchao commented on pull request #31998: [SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

2021-06-02 Thread GitBox


sunchao commented on pull request #31998:
URL: https://github.com/apache/spark/pull/31998#issuecomment-853252128


   @lxian Yes. I opened #32753 to demonstrate the idea. It's about 1K LOC, but 
mostly because the same code has to be duplicated in several places. This is an 
existing issue in the Parquet vectorized code path, but I think it's possible to 
eliminate the duplication.

