[GitHub] [spark] lxian commented on pull request #31998: [SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

2021-05-31 Thread GitBox
lxian commented on pull request #31998: URL: https://github.com/apache/spark/pull/31998#issuecomment-851435041 @sunchao you are right. It's really tricky and would require a lot of changes as well. VectorizedValuesReader currently reads and puts all values into the column vector for a single data

[GitHub] [spark] lxian commented on pull request #31998: [SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

2021-05-28 Thread GitBox
lxian commented on pull request #31998: URL: https://github.com/apache/spark/pull/31998#issuecomment-850280282 > @lxian In the current approach we'd have to copy values from one vector to another. I think a better and more efficient approach may be to feed the row indexes to `VectorizedRle

[GitHub] [spark] lxian commented on pull request #31998: [SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

2021-05-25 Thread GitBox
lxian commented on pull request #31998: URL: https://github.com/apache/spark/pull/31998#issuecomment-848416541 > @lxian does it mean that, without this PR, vectorized Parquet reader may return incorrect results? Yes, the result may be incorrect in cases where data page among columns a

[GitHub] [spark] lxian commented on pull request #31998: [SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

2021-04-18 Thread GitBox
lxian commented on pull request #31998: URL: https://github.com/apache/spark/pull/31998#issuecomment-822159976 https://gist.github.com/lxian/bba60a0460a74d3427994ce0d60d4c79 I've run a benchmark on tpcds with scale 10 and the impact of column index looks subtle.