[ https://issues.apache.org/jira/browse/SPARK-38285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
L. C. Hsieh resolved SPARK-38285.
---------------------------------
    Fix Version/s: 3.3.0
                   3.2.2
       Resolution: Fixed

Issue resolved by pull request 35749
[https://github.com/apache/spark/pull/35749]

> ClassCastException: GenericArrayData cannot be cast to InternalRow
> ------------------------------------------------------------------
>
>                 Key: SPARK-38285
>                 URL: https://issues.apache.org/jira/browse/SPARK-38285
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: Alessandro Bacchini
>            Assignee: L. C. Hsieh
>            Priority: Major
>             Fix For: 3.3.0, 3.2.2
>
> The following code raises an exception with Spark 3.2.1:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
>
> t = StructType([
>     StructField('o',
>         ArrayType(
>             StructType([
>                 StructField('s', StringType(), False),
>                 StructField('b', ArrayType(
>                     StructType([
>                         StructField('e', StringType(), False)
>                     ]),
>                     True),
>                     False)
>             ]),
>             True),
>         False)])
>
> value = {
>     "o": [
>         {
>             "s": "string1",
>             "b": [
>                 {"e": "string2"},
>                 {"e": "string3"}
>             ]
>         },
>         {
>             "s": "string4",
>             "b": [
>                 {"e": "string5"},
>                 {"e": "string6"},
>                 {"e": "string7"}
>             ]
>         }
>     ]
> }
>
> df = (
>     spark.createDataFrame([value], schema=t)
>     .select(F.explode("o").alias("eo"))
>     .select("eo.b.e")
> )
> df.show(truncate=False)
> {code}
> The exception message is:
> {code}
> java.lang.ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>     at org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>     at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>     at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
>     at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
>     at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>     at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
>     at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
>     at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
>     at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
>     at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>     at org.apache.spark.scheduler.Task.run(Task.scala:93)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
>     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>     at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
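>
> Because the failure happens inside generated code, comparing the optimized plans with and without nested-column pruning shows where the extraction of eo.b.e goes wrong. A minimal inspection sketch, assuming the spark session and the t, value, and F names from the code above (the two flags are the workaround described below):
> {code:python}
> def build_query():
>     # Rebuild the DataFrame each time: a DataFrame caches its planned
>     # query execution, so the optimizer must re-run under the new confs.
>     return (
>         spark.createDataFrame([value], schema=t)
>         .select(F.explode("o").alias("eo"))
>         .select("eo.b.e")
>     )
>
> # Default configuration: this plan's generated code throws the ClassCastException.
> build_query().explain(mode="extended")
>
> # With nested pruning disabled (the workaround below), the plan evaluates correctly.
> spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)
> build_query().explain(mode="extended")
> {code}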
> I am using Spark 3.2.1, and I don't know whether Spark 3.3.0 is affected as well.
> Please note that the issue seems related to SPARK-37577: I am using the same DataFrame schema, but this time I have populated it with non-empty values.
> I think this is a bug, because the query works as expected with the following configuration:
> {code:python}
> spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)
> {code}
> Update: the provided code works without problems on Spark 3.1.2, so the error appears to be caused by expression pruning.
> The expected result is:
> {code}
> +---------------------------+
> |e                          |
> +---------------------------+
> |[string2, string3]         |
> |[string5, string6, string7]|
> +---------------------------+
> {code}
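>
> For regression-testing the fix, the expected output can also be checked programmatically. A minimal sketch, assuming the df built in the code above:
> {code:python}
> # Collect the two exploded rows and compare the extracted arrays of `e`
> # values against the expected result shown above.
> rows = df.collect()
> assert [list(r["e"]) for r in rows] == [
>     ["string2", "string3"],
>     ["string5", "string6", "string7"],
> ]
> {code}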