Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21320#discussion_r199643803
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala ---
    @@ -71,9 +80,22 @@ private[parquet] class ParquetReadSupport(val convertTz: Option[TimeZone])
           StructType.fromString(schemaString)
         }
     
    -    val parquetRequestedSchema =
    +    val clippedParquetSchema =
           ParquetReadSupport.clipParquetSchema(context.getFileSchema, catalystRequestedSchema)
     
    +    val parquetRequestedSchema = if (parquetMrCompatibility) {
    +      // Parquet-mr will throw an exception if we try to read a superset of the file's schema.
    +      // Therefore, we intersect our clipped schema with the underlying file's schema
    +      ParquetReadSupport.intersectParquetGroups(clippedParquetSchema, context.getFileSchema)
    +        .map(intersectionGroup =>
    +          new MessageType(intersectionGroup.getName, intersectionGroup.getFields))
    +        .getOrElse(ParquetSchemaConverter.EMPTY_MESSAGE)
    +    } else {
    +      // Spark's built-in Parquet reader will throw an exception in some cases if the requested
    +      // schema is not the same as the clipped schema
    --- End diff --
    
    I believe the failure occurs because the requested schema and the file schema have columns with identical names and types, but in a different order. The one test that fails in the `ParquetFilterSuite`, "Filter applied on merged Parquet schema with new column should work", appears to be the only one in which the column order differs between the two schemas. These are the file schema and requested schema for that test:
    
    ```
    Parquet file schema:
    message spark_schema {
      required int32 c;
      optional binary b (UTF8);
    }
    
    Parquet requested schema:
    message spark_schema {
      optional binary b (UTF8);
      required int32 c;
    }
    ```
    
    I would say the Spark reader expects identical column order, whereas the parquet-mr reader accepts a different column order as long as the column names are identical (or compatible). That's my supposition, at least.
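    
    For what it's worth, here is a rough sketch (not part of this PR; the helper name is made up) of what aligning the requested schema with the file schema's column order could look like using parquet-mr's schema API, if reordering turned out to be an acceptable fix for the built-in reader:
    
    ```
    import scala.collection.JavaConverters._
    
    import org.apache.parquet.schema.{MessageType, Type}
    
    // Hypothetical helper, not in this PR: rebuild the requested schema so that
    // its fields follow the file schema's field order.
    def alignWithFileOrder(requested: MessageType, file: MessageType): MessageType = {
      // Index the requested fields by name for lookup.
      val requestedByName: Map[String, Type] =
        requested.getFields.asScala.map(f => f.getName -> f).toMap
    
      // Fields present in both schemas, taken in the file's order...
      val shared = file.getFields.asScala.flatMap(f => requestedByName.get(f.getName))
    
      // ...followed by requested-only fields (e.g. columns introduced by schema
      // merging that this particular file does not contain).
      val fileNames = file.getFields.asScala.map(_.getName).toSet
      val requestedOnly = requested.getFields.asScala.filterNot(f => fileNames.contains(f.getName))
    
      new MessageType(requested.getName, (shared ++ requestedOnly).asJava)
    }
    ```
    
    For the schemas above, that would turn the requested schema into `c` followed by `b`, matching the file's order.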

