[GitHub] [spark] sunchao commented on pull request #31998: [SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

2021-06-02 Thread GitBox
sunchao commented on pull request #31998: URL: https://github.com/apache/spark/pull/31998#issuecomment-853252128 @lxian Yes. I opened #32753 to demonstrate the idea. It's about 1K LOC but mostly because the same code has to be duplicated in several places. This is an existing issue in the

[GitHub] [spark] sunchao commented on pull request #31998: [SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

2021-05-28 Thread GitBox
sunchao commented on pull request #31998: URL: https://github.com/apache/spark/pull/31998#issuecomment-850570200 @lxian I'm thinking that the extra cost is just incrementing two indexes at the same time, so it should be fairly cheap. You can also refer to how
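
To illustrate the "incrementing two indexes at the same time" remark, here is a minimal sketch. It is not the code from #31998 or #32753, and it is written in Python for brevity rather than in the Java of Spark's reader. The function name `read_selected`, its parameters, and the representation of row ranges as sorted, non-overlapping inclusive `(start, end)` pairs are all assumptions made for this example. One index walks the rows of a page and the other walks the row ranges, so selected values go straight to the output without an intermediate copy.

```python
from typing import List, Sequence, Tuple


def read_selected(
    values: Sequence[int],
    first_row: int,
    row_ranges: Sequence[Tuple[int, int]],
) -> List[int]:
    """Return the values whose row index lies in one of the row ranges.

    Hypothetical helper: values[i] is assumed to belong to row first_row + i,
    and row_ranges is assumed sorted, non-overlapping, with inclusive bounds.
    """
    out: List[int] = []
    range_idx = 0      # index into row_ranges
    row = first_row    # index of the current row
    for value in values:
        # Skip ranges that end before the current row.
        while range_idx < len(row_ranges) and row_ranges[range_idx][1] < row:
            range_idx += 1
        if range_idx == len(row_ranges):
            break  # no ranges left, so the remaining rows are all skipped
        start, _end = row_ranges[range_idx]
        if start <= row:
            out.append(value)  # row is inside the current range
        row += 1
    return out


selected = read_selected(
    values=[10, 11, 12, 13, 14, 15, 16, 17],
    first_row=100,
    row_ranges=[(101, 102), (105, 106)],
)
```

Each row and each range is visited at most once, so the extra work over a plain sequential read is linear and small. That matches the comment's expectation that the cost should be fairly cheap.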

[GitHub] [spark] sunchao commented on pull request #31998: [SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

2021-05-27 Thread GitBox
sunchao commented on pull request #31998: URL: https://github.com/apache/spark/pull/31998#issuecomment-850040241 @lxian In the current approach we'd have to copy values from one vector to another. I think a better and more efficient approach may be to feed the row indexes to

[GitHub] [spark] sunchao commented on pull request #31998: [SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

2021-05-25 Thread GitBox
sunchao commented on pull request #31998: URL: https://github.com/apache/spark/pull/31998#issuecomment-848390519 @lxian does it mean that, without this PR, vectorized Parquet reader may return incorrect results?