[ https://issues.apache.org/jira/browse/SPARK-35010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318347#comment-17318347 ]
Yuming Wang commented on SPARK-35010:
-------------------------------------

Yes. It is an issue: https://github.com/apache/spark/pull/31993

> nestedSchemaPruning causes issue when reading hive generated Orc files
> ----------------------------------------------------------------------
>
>                 Key: SPARK-35010
>                 URL: https://issues.apache.org/jira/browse/SPARK-35010
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Baohe Zhang
>            Priority: Critical
>
> In Spark 3, spark.sql.orc.impl=native and
> spark.sql.optimizer.nestedSchemaPruning.enabled=true are the default settings.
> Together they cause failures when querying struct fields of Hive-generated ORC
> files.
> For example, we got an error when running this query in Spark 3:
> {code:java}
> spark.table("testtable").filter(col("utc_date") === "20210122").select(col("open_count.d35")).show(false)
> {code}
> The error is
> {code:java}
> Caused by: java.lang.AssertionError: assertion failed: The given data schema struct<open_count:struct<d35:map<string,double>>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read.
>     at scala.Predef$.assert(Predef.scala:223)
>     at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:153)
>     at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
>     at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
>     at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
>     at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:127)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
>
> I think the reason is that we apply nestedSchemaPruning to the dataSchema:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaPruning.scala#L75]
> nestedSchemaPruning not only prunes the unused fields of a struct, it also
> prunes the unused top-level columns. In my test, the dataSchema originally has
> 48 columns, but after nested schema pruning it is reduced to 1 column. This
> pruning results in an assertion error in
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L159]
> because column pruning is not supported for Hive-generated ORC files.
> This issue also seems related to the Hive version: we use Hive 1.2, and the
> ORC files it writes do not contain field names in the physical schema.
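
For illustration only, the failure mode described in the report can be modelled with a small, self-contained Scala sketch of positional column matching. This is a simplified model and not Spark's actual OrcUtils code: the object name, the method signature, the use of plain strings for field names, and the Either return type are all assumptions made for this example.

```scala
// Simplified model of positional column matching for Hive-written ORC files.
// NOT Spark's real implementation: every name and signature here is hypothetical.
object PositionalOrcMatchSketch {

  /** True when every physical field name is a Hive-style placeholder (_col0, _col1, ...). */
  def hasPlaceholderNames(orcFieldNames: Seq[String]): Boolean =
    orcFieldNames.forall(_.startsWith("_col"))

  /**
   * Maps each required column to its index in the physical schema.
   * Returns Left(reason) when the mapping cannot be determined, and uses -1
   * for a required column that is absent from the file.
   */
  def requestedColumnIds(
      orcFieldNames: Seq[String],
      dataSchema: Seq[String],
      requiredSchema: Seq[String]): Either[String, Seq[Int]] = {
    if (hasPlaceholderNames(orcFieldNames)) {
      // With no real names in the file, columns can only be matched by position,
      // so the data schema must cover every physical column.
      if (orcFieldNames.length > dataSchema.length) {
        Left(s"data schema has ${dataSchema.length} field(s) but the ORC file has " +
          s"${orcFieldNames.length}; cannot tell which columns were dropped")
      } else {
        Right(requiredSchema.map { name =>
          val index = dataSchema.indexOf(name)
          if (index < orcFieldNames.length) index else -1
        })
      }
    } else {
      // Files that carry real field names can be matched by name,
      // so a pruned data schema is not a problem here.
      Right(requiredSchema.map(name => orcFieldNames.indexOf(name)))
    }
  }
}
```

Under this model, a file with 48 placeholder-named columns read with a data schema pruned to a single column yields a Left, which corresponds to the assertion in the stack trace. The same file read with the unpruned 48-column data schema resolves the required column by position.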