lxian commented on pull request #31998:
URL: https://github.com/apache/spark/pull/31998#issuecomment-851435041
@sunchao you are right. It's really tricky and would require a lot of changes
as well. VectorizedValuesReader currently reads and puts all values into the
column vector for a single data
lxian commented on pull request #31998:
URL: https://github.com/apache/spark/pull/31998#issuecomment-850280282
> @lxian In the current approach we'd have to copy values from one vector to
another. I think a better and more efficient approach may be to feed the row
indexes to `VectorizedRle
lxian commented on pull request #31998:
URL: https://github.com/apache/spark/pull/31998#issuecomment-848416541
> @lxian does it mean that, without this PR, vectorized Parquet reader may
return incorrect results?
Yes, the result may be incorrect in cases where data pages among columns a
lxian commented on pull request #31998:
URL: https://github.com/apache/spark/pull/31998#issuecomment-822159976
https://gist.github.com/lxian/bba60a0460a74d3427994ce0d60d4c79 I've run a
benchmark on TPC-DS at scale 10, and the impact of the column index looks subtle.