[ https://issues.apache.org/jira/browse/SPARK-24230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469656#comment-16469656 ]

Ryan Blue commented on SPARK-24230:
-----------------------------------

Looks like I have a fix for this that I missed when submitting the patch for 
1.10.0. Here it is:

{code:java}
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
@@ -148,7 +148,8 @@ public abstract class SpecificParquetRecordReaderBase<T> extends RecordReader<Vo
     this.sparkSchema = StructType$.MODULE$.fromString(sparkRequestedSchemaString);
     this.reader = new ParquetFileReader(
         configuration, footer.getFileMetaData(), file, blocks, requestedSchema.getColumns());
-    for (BlockMetaData block : blocks) {
+    // use the blocks from the reader in case some do not match filters and will not be read
+    for (BlockMetaData block : reader.getRowGroups()) {
       this.totalRowCount += block.getRowCount();
     }
 
@@ -224,7 +225,8 @@ public abstract class SpecificParquetRecordReaderBase<T> extends RecordReader<Vo
     this.sparkSchema = new ParquetToSparkSchemaConverter(config).convert(requestedSchema);
     this.reader = new ParquetFileReader(
         config, footer.getFileMetaData(), file, blocks, requestedSchema.getColumns());
-    for (BlockMetaData block : blocks) {
+    // use the blocks from the reader in case some do not match filters and will not be read
+    for (BlockMetaData block : reader.getRowGroups()) {
       this.totalRowCount += block.getRowCount();
     }
   }
{code}
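
In case it helps, a quick way to see the mismatch is to sum the row counts both ways. This is just an illustrative sketch, not Spark code; the RowCountCheck helper and its parameters are hypothetical:

{code:java}
import java.util.List;

import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;

// Illustrative helper only: "blocks" is the pre-filtered list Spark passes to
// the ParquetFileReader constructor; "reader" is the reader built from that
// list, which may drop additional row groups on its own.
final class RowCountCheck {
  // what the old code counted: the blocks handed to the constructor
  static long fromBlocks(List<BlockMetaData> blocks) {
    long total = 0L;
    for (BlockMetaData block : blocks) {
      total += block.getRowCount();
    }
    return total;
  }

  // what the reader will actually return: its post-filter row groups
  static long fromReader(ParquetFileReader reader) {
    long total = 0L;
    for (BlockMetaData block : reader.getRowGroups()) {
      total += block.getRowCount();
    }
    return total;
  }
}
{code}

When the reader drops extra row groups, fromBlocks(blocks) is larger than fromReader(reader), and the vectorized reader ends up waiting for rows that never arrive.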

I'll submit a PR to fix this. Thanks for reporting it!

> The Parquet 1.10 upgrade causes errors in the vectorized reader
> ---------------------------------------------------------------
>
>                 Key: SPARK-24230
>                 URL: https://issues.apache.org/jira/browse/SPARK-24230
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Ian O Connell
>            Priority: Major
>
> When reading some Parquet files, you can get an error like:
> java.io.IOException: expecting more rows but reached last block. Read 0 out of 1194236
> This happens when looking for a needle that's pretty rare in a large haystack.
>  
> The issue here, I believe, is that the total row count is calculated at
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L229]
> from the blocks we pass in. But those blocks have already been filtered via
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups before being
> passed to the ParquetFileReader constructor.
>  
> However, the ParquetFileReader constructor filters that list of blocks again
> using
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L737]
> If a block is filtered out by the latter method but not by the former, the
> vectorized reader will believe it should see more rows than it actually will.
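>  
> To make the two filtering stages concrete, here is a rough sketch of the call
> sequence (the variables predicate, conf, file, columns, and fileSchema are
> placeholders, not the actual Spark code):
> {code:java}
> // Stage 1: Spark filters row groups up front with the pushed-down predicate.
> List<BlockMetaData> filtered = RowGroupFilter.filterRowGroups(
>     FilterCompat.get(predicate), footer.getBlocks(), fileSchema);
> // Stage 2: the ParquetFileReader constructor may filter again, so
> // reader.getRowGroups() can be a strict subset of "filtered".
> ParquetFileReader reader = new ParquetFileReader(
>     conf, footer.getFileMetaData(), file, filtered, columns);
> {code}
>  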
> The fix I used locally is pretty straightforward:
> {code:java}
> for (BlockMetaData block : blocks) {
>   this.totalRowCount += block.getRowCount();
> }
> {code}
> becomes
> {code:java}
> this.totalRowCount = this.reader.getRecordCount();
> {code}
> [~rdblue], do you know if this sounds right? The second filter method in
> ParquetFileReader might filter more blocks, leading to the count being off?


