[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307590#comment-17307590 ]
Li Xian commented on SPARK-26345:
---------------------------------

[~yumwang] I think the current implementation has a problem. The pages returned by `readNextFilteredRowGroup` may not be aligned: some columns may have more rows than others. Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` with `rowIndexes` to make sure that rows are aligned. Currently, `VectorizedParquetRecordReader` has no such synchronization among pages from different columns, so using `readNextFilteredRowGroup` may produce incorrect results.

> Parquet support Column indexes
> ------------------------------
>
>                 Key: SPARK-26345
>                 URL: https://issues.apache.org/jira/browse/SPARK-26345
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 3.2.0
>
>
> Parquet 1.11 supports column indexing. Spark can support this feature for
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201
>
> Benchmark result:
> [https://github.com/apache/spark/pull/31393#issuecomment-769767724]
> This feature is enabled by default, and users can disable it by setting
> {{parquet.filter.columnindex.enabled}} to false.
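
To make the alignment concern in the comment concrete, here is a minimal sketch in Python of a toy model, not Parquet's or Spark's actual code. It assumes each column stores its values in pages with their own row boundaries, and that a filtered read returns every page overlapping the selected row ranges. The names `Page`, `overlapping_pages`, `read_unsynchronized`, and `read_synchronized` are hypothetical and exist only for illustration; `read_synchronized` mimics the idea of using row indexes to skip rows outside the selected ranges.

```python
from dataclasses import dataclass
from typing import List, Tuple

RowRange = Tuple[int, int]  # half-open [start, end) range of row indexes


@dataclass
class Page:
    first_row: int     # row index (within the row group) of the page's first value
    values: List[str]  # one value per row covered by this page


def overlapping_pages(pages: List[Page], ranges: List[RowRange]) -> List[Page]:
    """Return the pages that overlap at least one selected row range."""
    selected = []
    for page in pages:
        page_end = page.first_row + len(page.values)
        if any(page.first_row < end and start < page_end for start, end in ranges):
            selected.append(page)
    return selected


def read_unsynchronized(pages: List[Page], ranges: List[RowRange]) -> List[str]:
    """Emit every value of every overlapping page (no row-index check)."""
    return [v for page in overlapping_pages(pages, ranges) for v in page.values]


def read_synchronized(pages: List[Page], ranges: List[RowRange]) -> List[str]:
    """Emit only values whose row index lies inside a selected range."""
    out = []
    for page in overlapping_pages(pages, ranges):
        for offset, value in enumerate(page.values):
            row = page.first_row + offset
            if any(start <= row < end for start, end in ranges):
                out.append(value)
    return out


# Two columns over the same 10 rows, but with different page boundaries.
col_a = [Page(0, [f"a{i}" for i in range(0, 5)]),
         Page(5, [f"a{i}" for i in range(5, 10)])]
col_b = [Page(0, [f"b{i}" for i in range(0, 3)]),
         Page(3, [f"b{i}" for i in range(3, 7)]),
         Page(7, [f"b{i}" for i in range(7, 10)])]

# Suppose the filter selected rows 0..4 only.
ranges = [(0, 5)]

unsync_a = read_unsynchronized(col_a, ranges)  # 5 values (rows 0..4)
unsync_b = read_unsynchronized(col_b, ranges)  # 7 values (rows 0..6): misaligned with col_a
sync_a = read_synchronized(col_a, ranges)      # 5 values
sync_b = read_synchronized(col_b, ranges)      # 5 values: aligned with col_a
```

In this toy model, the unsynchronized read yields 5 values for one column and 7 for the other, so zipping the columns into rows would pair values from different rows. Checking each value's row index against the selected ranges restores alignment.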