[ https://issues.apache.org/jira/browse/SPARK-34859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun resolved SPARK-34859. ----------------------------------- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32753 [https://github.com/apache/spark/pull/32753] > Vectorized parquet reader needs synchronization among pages for column index > ---------------------------------------------------------------------------- > > Key: SPARK-34859 > URL: https://issues.apache.org/jira/browse/SPARK-34859 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.2.0 > Reporter: Li Xian > Priority: Blocker > Labels: correctness > Fix For: 3.2.0 > > Attachments: > part-00000-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet > > > the current implementation has a problem. the pages returned by > `readNextFilteredRowGroup` may not be aligned, some columns may have more > rows than others. > Parquet is using `org.apache.parquet.column.impl.SynchronizingColumnReader` > with `rowIndexes` to make sure that rows are aligned. > Currently `VectorizedParquetRecordReader` doesn't have such synchronizing > among pages from different columns. Using `readNextFilteredRowGroup` may > result in incorrect result. > > I have attache an example parquet file. This file is generated with > `spark.range(0, 2000).map(i => (i.toLong, i.toInt))` and the layout of this > file is listed below. > row group 0 > -------------------------------------------------------------------------------- > _1: INT64 SNAPPY DO:0 FPO:4 SZ:8161/16104/1.97 VC:2000 ENC:PLAIN,BIT_PACKED > [more]... > _2: INT32 SNAPPY DO:0 FPO:8165 SZ:8061/8052/1.00 VC:2000 > ENC:PLAIN,BIT_PACKED [more]... > _1 TV=2000 RL=0 DL=0 > > ---------------------------------------------------------------------------- > page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:500 > page 1: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:500 > page 2: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:500 > page 3: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:500 > _2 TV=2000 RL=0 DL=0 > > ---------------------------------------------------------------------------- > page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:1000 > page 1: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for > [more]... VC:1000 > > As you can see in the row group 0, column1 has 4 data pages each with 500 > values and column 2 has 2 data pages with 1000 values each. > If we want to filter the rows by values with _1 = 510 using columnindex, > parquet will return the page 1 of column _1 and page 0 of column _2. Page 1 > of column _1 starts with row 500, and page 0 of column _2 starts with row 0, > and it will be incorrect if we simply read the two values as one row. > > As an example, If you try filter with _1 = 510 with column index on in > current version, it will give you the wrong result > +---+---+ > |_1 |_2 | > +---+---+ > |510|10 | > +---+---+ > And if turn columnindex off, you can get the correct result > +---+---+ > |_1 |_2 | > +---+---+ > |510|510| > +---+---+ > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org