cloud-fan commented on a change in pull request #29045:
URL: https://github.com/apache/spark/pull/29045#discussion_r454475026



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala
##########
@@ -160,12 +160,12 @@ class OrcFileFormat
     }
 
     val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)
+    val actualSchema = StructType(dataSchema.fields ++ partitionSchema.fields)
     val sqlConf = sparkSession.sessionState.conf
     val enableVectorizedReader = supportBatch(sparkSession, resultSchema)
     val capacity = sqlConf.orcVectorizedReaderBatchSize
 
-    val resultSchemaString = OrcUtils.orcTypeDescriptionString(resultSchema)
-    OrcConf.MAPRED_INPUT_SCHEMA.setString(hadoopConf, resultSchemaString)

Review comment:
       After reading the code more, I think this is the real problem. When the physical file schema doesn't match the table schema, this `resultSchemaString` is wrong.
   
   I think we should set this config on the executor side, where we know the physical file schema. `OrcUtils.requestedColumnIds` should not only report the column indices, but also return the actual `requiredSchema`.
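   
   A rough sketch of the shape I have in mind (hypothetical, not the final patch; `fileMatchedSchema` stands for the extra schema that `requestedColumnIds` would return once it is extended):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.orc.OrcConf
import org.apache.spark.sql.execution.datasources.orc.OrcUtils
import org.apache.spark.sql.types.StructType

// Hypothetical helper, run per file on the executor side, after the ORC reader
// has been opened and requestedColumnIds has matched the required columns
// against the physical file schema.
def setOrcInputSchemaPerFile(
    conf: Configuration,
    fileMatchedSchema: StructType,  // hypothetical extra result of requestedColumnIds
    partitionSchema: StructType): Unit = {
  // Same shape as the driver-side resultSchema above, but built from the schema
  // that matches this particular file rather than from the table schema.
  val perFileResultSchema = StructType(fileMatchedSchema.fields ++ partitionSchema.fields)
  OrcConf.MAPRED_INPUT_SCHEMA.setString(
    conf, OrcUtils.orcTypeDescriptionString(perFileResultSchema))
}
```

   Then the reader path in `buildReaderWithPartitionValues` would call this right before creating the record reader, so the ORC reader always sees a schema consistent with the file it is actually reading.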



