[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530850#comment-17530850 ]
L. C. Hsieh commented on SPARK-39070:
-------------------------------------

SPARK-38285 has already been backported to branch-3.2 and will be in the next 3.2 release.

> Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39070
>                 URL: https://issues.apache.org/jira/browse/SPARK-39070
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.2.0, 3.2.1
>         Environment: Tested on:
>                      macOS 12.3.1, Python 3.8.10 and 3.9.12
>                      Ubuntu 20.04, Python 3.8.10
>                      Tested Spark versions 3.1.2, 3.1.3, 3.2.0 and 3.2.1, installed through conda and pip
>            Reporter: Felix Altenberg
>            Priority: Major
>         Attachments: spark_pivot_error.txt
>
> The following code raises an exception starting in Spark 3.2.0:
> {code:python}
> import pyspark.sql.functions as sf
> from pyspark.sql import Row
>
> data = [
>     Row(
>         other_value=10,
>         ships=[
>             Row(
>                 fields=[
>                     Row(name="field1", value=1),
>                     Row(name="field2", value=2),
>                 ]
>             ),
>             Row(
>                 fields=[
>                     Row(name="field1", value=3),
>                     Row(name="field3", value=4),
>                 ]
>             ),
>         ],
>     )
> ]
>
> df = spark.createDataFrame(data)
> df = df.withColumn("ships", sf.explode("ships")).select(
>     sf.col("other_value"), sf.col("ships.*")
> )
> df = df.withColumn("fields", sf.explode("fields")).select(
>     "other_value", "fields.*"
> )
> df = df.groupBy("other_value").pivot("name")  # ERROR THROWN HERE
> {code}
> The call to {{pivot("name")}} is what triggers the error.
> With even one fewer explode (for example, by manually transforming the input dataframe into what it would look like after the first explode), the error does not appear.
> Adding a .cache() before the groupBy and pivot also makes the error disappear.
> Strangely, this error does not appear when running the above code in Databricks using Databricks Runtime 10.4, which also uses Spark 3.2.1.
> I have attached the full stacktrace of the error.
> I will try to investigate more after work today and will add anything else I find.
> The exception raised is
> {noformat}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 61, Column 78: A method named "numElements" is not declared in any enclosing class nor any supertype, nor through a static import
> {noformat}
> but it appears to be caused by this exception:
> {noformat}
> Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
> {noformat}
> The following Stack Overflow question seems to describe a very similar error:
> [https://stackoverflow.com/questions/70356375/pyspark-pivot-unsafearraydata-cannot-be-cast-to-internalrow]

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org