[ https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308872#comment-17308872 ]
Dongjoon Hyun commented on SPARK-34859:
---------------------------------------

Thank you for creating a new JIRA, [~lxian2]!

> Vectorized parquet reader needs synchronization among pages for column index
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-34859
>                 URL: https://issues.apache.org/jira/browse/SPARK-34859
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Li Xian
>            Priority: Major
>
> The current implementation has a problem. The pages returned by
> `readNextFilteredRowGroup` may not be aligned; some columns may have more
> rows than others.
> Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader`
> with `rowIndexes` to make sure that rows are aligned.
> Currently `VectorizedParquetRecordReader` has no such synchronization
> among pages from different columns, so using `readNextFilteredRowGroup` may
> produce incorrect results.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
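For readers following the issue, here is a toy Python model of the misalignment described above. It is only a sketch and is not Spark or Parquet code: `Page`, `split_into_pages`, `filter_pages`, `read_unsynchronized` and `read_synchronized` are hypothetical names invented for illustration, and it assumes flat (non-nested) columns with one value per row. The idea it shows is that page-level filtering keeps whole pages, page boundaries differ per column, and so each column must drop values whose row index is not in the selected `rowIndexes`.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set


@dataclass(frozen=True)
class Page:
    """Hypothetical stand-in for one data page of a single column."""
    first_row_index: int   # index of the page's first row within the row group
    values: Sequence[str]  # one value per row (flat column assumed)


def split_into_pages(values: Sequence[str], page_size: int) -> List[Page]:
    """Cut a column into pages; each column may use a different page size."""
    return [Page(i, values[i:i + page_size])
            for i in range(0, len(values), page_size)]


def filter_pages(pages: List[Page], row_indexes: Set[int]) -> List[Page]:
    """Keep every page that contains at least one selected row.

    Whole pages are kept, so rows outside `row_indexes` can come along.
    """
    return [p for p in pages
            if any(p.first_row_index + k in row_indexes
                   for k in range(len(p.values)))]


def read_unsynchronized(pages: List[Page]) -> List[str]:
    """Read every value from the surviving pages (the problematic behaviour)."""
    return [v for p in pages for v in p.values]


def read_synchronized(pages: List[Page], row_indexes: Set[int]) -> List[str]:
    """Read only the values whose row index was actually selected."""
    return [v
            for p in pages
            for k, v in enumerate(p.values)
            if p.first_row_index + k in row_indexes]


# Ten rows, two columns with different page sizes, filter selects rows 4 and 5.
selected = {4, 5}
col_a = filter_pages(split_into_pages([f"a{i}" for i in range(10)], 3), selected)
col_b = filter_pages(split_into_pages([f"b{i}" for i in range(10)], 4), selected)

unsync_a, unsync_b = read_unsynchronized(col_a), read_unsynchronized(col_b)
sync_a, sync_b = read_synchronized(col_a, selected), read_synchronized(col_b, selected)
```

In this model the unsynchronized read yields three values for column `a` (rows 3 to 5) but four for column `b` (rows 4 to 7), so pairing them positionally would combine values from different rows. The synchronized read yields exactly rows 4 and 5 for both columns.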