[ https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309330#comment-17309330 ]
Li Xian commented on SPARK-34859:
---------------------------------

[~yumwang] Yes, I'm interested in this issue and would like to work on it.

> Vectorized parquet reader needs synchronization among pages for column index
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-34859
>                 URL: https://issues.apache.org/jira/browse/SPARK-34859
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Li Xian
>            Priority: Major
>
> The current implementation has a problem: the pages returned by
> `readNextFilteredRowGroup` may not be aligned, so some columns may have more
> rows than others.
> Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader`
> with `rowIndexes` to make sure that rows are aligned.
> Currently `VectorizedParquetRecordReader` has no such synchronization
> among pages from different columns, so using `readNextFilteredRowGroup` may
> produce incorrect results.
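The misalignment described in the issue can be illustrated with a small, self-contained sketch. It is a toy model in Python, not Spark or Parquet code: `Page`, `read_naive`, and `read_synchronized` are hypothetical names, and the page layout is invented for illustration. It assumes only that each page knows the index of its first row within the row group and that the filter yields a set of selected row indexes.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Page:
    """Hypothetical stand-in for a data page of one column."""
    first_row_index: int  # index of the page's first row within the row group
    values: List[str]


def read_naive(pages: List[Page]) -> List[str]:
    """Concatenate every value of every returned page (no synchronization)."""
    return [value for page in pages for value in page.values]


def read_synchronized(pages: List[Page], row_indexes: Set[int]) -> List[str]:
    """Keep only the values whose row index was selected by the filter."""
    out = []
    for page in pages:
        for offset, value in enumerate(page.values):
            if page.first_row_index + offset in row_indexes:
                out.append(value)
    return out


# Assumed scenario: a row group of 8 rows where the filter selects rows 2..5.
row_indexes = set(range(2, 6))

# Column "a" has 2-row pages, so the pages overlapping rows 2..5 cover
# exactly those rows.
col_a = [Page(2, ["a2", "a3"]), Page(4, ["a4", "a5"])]

# Column "b" has 4-row pages, so the overlapping pages also carry rows
# 0, 1, 6 and 7, which the filter did not select.
col_b = [Page(0, ["b0", "b1", "b2", "b3"]), Page(4, ["b4", "b5", "b6", "b7"])]

naive_a = read_naive(col_a)
naive_b = read_naive(col_b)
sync_a = read_synchronized(col_a, row_indexes)
sync_b = read_synchronized(col_b, row_indexes)

print(len(naive_a), len(naive_b))  # column lengths differ without synchronization
print(sync_a, sync_b)              # both columns cover the same rows
```

In this toy model the naive read returns columns of different lengths because page boundaries differ per column, while filtering each page by row index keeps the columns aligned. That is the role the issue attributes to `SynchronizingColumnReader` and `rowIndexes`.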