Li Xian created SPARK-34859:
-------------------------------

             Summary: Vectorized parquet reader needs synchronization among pages for column index
                 Key: SPARK-34859
                 URL: https://issues.apache.org/jira/browse/SPARK-34859
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Li Xian

The current implementation has a problem: the pages returned by `readNextFilteredRowGroup` may not be aligned, so some columns may contain more rows than others. Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` with `rowIndexes` to make sure that rows are aligned. `VectorizedParquetRecordReader` currently has no such synchronization among pages from different columns, so using `readNextFilteredRowGroup` may produce incorrect results.
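
To illustrate the misalignment, here is a minimal sketch in Java. It is not the Parquet or Spark implementation: `PageSyncSketch`, `Page`, `readSynchronized` and `readUnsynchronized` are hypothetical names, and the sketch assumes a flat (non-nested) integer column whose pages each know the row-group index of their first row. Two columns with different page sizes are filtered down to rows 4 and 5. Reading every value of every returned page gives the columns different row counts, while skipping rows that are not in the selected row indexes keeps them aligned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; these types do not exist in Parquet or Spark. */
final class PageSyncSketch {

    /** A data page that survived column-index filtering (assumed layout). */
    static final class Page {
        final long firstRowIndex; // row-group index of the first row in this page
        final int[] values;       // one value per row; flat column assumed

        Page(long firstRowIndex, int... values) {
            this.firstRowIndex = firstRowIndex;
            this.values = values;
        }
    }

    /** Naive read: concatenates every value of every returned page. */
    static List<Integer> readUnsynchronized(List<Page> pages) {
        List<Integer> out = new ArrayList<>();
        for (Page page : pages) {
            for (int v : page.values) {
                out.add(v);
            }
        }
        return out;
    }

    /**
     * Synchronized read: keeps only the values whose row index appears in
     * selectedRows (sorted ascending) and skips the extra rows in a page.
     */
    static List<Integer> readSynchronized(List<Page> pages, long[] selectedRows) {
        List<Integer> out = new ArrayList<>();
        int next = 0; // position of the next wanted row in selectedRows
        for (Page page : pages) {
            for (int i = 0; i < page.values.length && next < selectedRows.length; i++) {
                long row = page.firstRowIndex + i;
                if (row == selectedRows[next]) {
                    out.add(page.values[i]);
                    next++;
                }
                // Otherwise the row is extra for this column and is skipped.
            }
        }
        if (next != selectedRows.length) {
            throw new IllegalStateException("pages do not cover all selected rows");
        }
        return out;
    }

    public static void main(String[] args) {
        long[] selected = {4, 5};
        // Column A has 2 rows per page, so only the page holding rows 4-5 is returned.
        List<Page> columnA = Arrays.asList(new Page(4, 40, 50));
        // Column B has 4 rows per page, so the returned page holds rows 4-7.
        List<Page> columnB = Arrays.asList(new Page(4, 104, 105, 106, 107));

        System.out.println(readUnsynchronized(columnA).size());  // 2
        System.out.println(readUnsynchronized(columnB).size());  // 4, misaligned with A
        System.out.println(readSynchronized(columnA, selected)); // [40, 50]
        System.out.println(readSynchronized(columnB, selected)); // [104, 105]
    }
}
```

In this sketch the selected row indexes play the role that `rowIndexes` plays for `SynchronizingColumnReader`: they are the shared reference every column is read against, regardless of where that column's page boundaries fall.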