Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21320#discussion_r199643803
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala ---
    @@ -71,9 +80,22 @@ private[parquet] class ParquetReadSupport(val convertTz: Option[TimeZone])
           StructType.fromString(schemaString)
         }
     
    -    val parquetRequestedSchema =
    +    val clippedParquetSchema =
           ParquetReadSupport.clipParquetSchema(context.getFileSchema, catalystRequestedSchema)
     
    +    val parquetRequestedSchema = if (parquetMrCompatibility) {
    +      // Parquet-mr will throw an exception if we try to read a superset of the file's schema.
    +      // Therefore, we intersect our clipped schema with the underlying file's schema
    +      ParquetReadSupport.intersectParquetGroups(clippedParquetSchema, context.getFileSchema)
    +        .map(intersectionGroup =>
    +          new MessageType(intersectionGroup.getName, intersectionGroup.getFields))
    +        .getOrElse(ParquetSchemaConverter.EMPTY_MESSAGE)
    +    } else {
    +      // Spark's built-in Parquet reader will throw an exception in some cases if the requested
    +      // schema is not the same as the clipped schema
    --- End diff --
    
    I believe the failure occurs because the requested schema and the file schema have columns with identical names and types, but in a different order. The one test that fails in the `ParquetFilterSuite`, "Filter applied on merged Parquet schema with new column should work", appears to be the only one in which the column order differs between the two schemas. These are the file schema and requested schema for that test:
    
    ```
    Parquet file schema:
    message spark_schema {
      required int32 c;
      optional binary b (UTF8);
    }
    
    Parquet requested schema:
    message spark_schema {
      optional binary b (UTF8);
      required int32 c;
    }
    ```
    
    I would say the Spark reader expects identical column order, whereas the parquet-mr reader accepts a different column order as long as the column names are identical (or compatible). That's my supposition, at least.
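    
    For what it's worth, here is a rough sketch (not part of this PR; the helper name is made up) of what aligning the requested schema with the file schema's column order could look like using parquet-mr's schema API, if reordering turned out to be an acceptable fix for the built-in reader:
    
    ```
    import scala.collection.JavaConverters._
    
    import org.apache.parquet.schema.{MessageType, Type}
    
    // Hypothetical helper, not in this PR: rebuild the requested schema so that
    // its fields follow the file schema's field order.
    def alignWithFileOrder(requested: MessageType, file: MessageType): MessageType = {
      // Index the requested fields by name for lookup.
      val requestedByName: Map[String, Type] =
        requested.getFields.asScala.map(f => f.getName -> f).toMap
    
      // Fields present in both schemas, taken in the file's order...
      val shared = file.getFields.asScala.flatMap(f => requestedByName.get(f.getName))
    
      // ...followed by requested-only fields (e.g. columns introduced by schema
      // merging that this particular file does not contain).
      val fileNames = file.getFields.asScala.map(_.getName).toSet
      val requestedOnly = requested.getFields.asScala.filterNot(f => fileNames.contains(f.getName))
    
      new MessageType(requested.getName, (shared ++ requestedOnly).asJava)
    }
    ```
    
    For the schemas above, that would turn the requested schema into `c` followed by `b`, matching the file's order.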

