[ https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
L. C. Hsieh resolved SPARK-35290.
---------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 33040
[https://github.com/apache/spark/pull/33040]

> unionByName with null filling fails for some nested structs
> -----------------------------------------------------------
>
>                 Key: SPARK-35290
>                 URL: https://issues.apache.org/jira/browse/SPARK-35290
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Adam Binford
>            Assignee: Adam Binford
>            Priority: Major
>             Fix For: 3.2.0
>
>
> We've encountered a few weird edge cases that seem to fail the new null
> filling unionByName (which has been a great addition!). It seems to stem from
> the fields being sorted by name and corrupted along the way. The simple
> reproduction is:
> {code:java}
> # assumes an active SparkSession named `spark`
> from pyspark.sql import functions as F
>
> df = spark.createDataFrame([[]])
> # df1: top.b has a single field 'ba'
> df1 = (df
>     .withColumn('top', F.struct(
>         F.struct(
>             F.lit('ba').alias('ba')
>         ).alias('b')
>     ))
> )
> # df2: top has both 'a' and 'b'; top.b has a single field 'bb'
> df2 = (df
>     .withColumn('top', F.struct(
>         F.struct(
>             F.lit('aa').alias('aa')
>         ).alias('a'),
>         F.struct(
>             F.lit('bb').alias('bb')
>         ).alias('b'),
>     ))
> )
> df1.unionByName(df2, True).printSchema()
> {code}
> This results in the exception:
> {code:java}
> pyspark.sql.utils.AnalysisException: Union can only be performed on tables
> with the compatible column types.
> struct<a:struct<aa:string>,b:struct<ba:string,bb:string>> <>
> struct<a:struct<aa:string>,b:struct<aa:string,bb:string>> at the first column
> of the second table;
> {code}
> You can see in the second schema that it has
> {code:java}
> b:struct<aa:string,bb:string>
> {code}
> when it should be
> {code:java}
> b:struct<ba:string,bb:string>
> {code}
> It seems to happen somewhere during
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala#L73],
> as everything seems correct up to that point from my testing.
> Either modifying one expression during the transformUp corrupts other
> expressions that are modified afterwards, or the ExtractValue before the
> addFieldsInto is remembering an ordinal position in the struct that then
> changes and causes issues.
>
> I found that simply using sortStructFields instead of
> sortStructFieldsInWithFields gets things working correctly, but it definitely
> has a performance impact. The deep expr unionByName test takes ~1-2 seconds
> normally but ~12-15 seconds with this change. I assume this is because the
> original method tries to rewrite existing expressions, whereas
> sortStructFields just adds expressions on top of existing ones to project the
> new order.
> I'm not sure if it makes sense to take the slower method that works in the
> edge cases (assuming it doesn't break other cases; all existing tests pass),
> or if there's a way to fix the existing method for cases like this.
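
The second hypothesis in the report (an extraction that remembers an ordinal position while the struct's fields are later re-sorted) can be illustrated outside Spark. The following is a toy model in plain Python, not Catalyst code: bind_by_ordinal and bind_by_name are hypothetical helpers invented for this illustration, and the actual cause inside ResolveUnion may differ.

```python
# Toy model only: a list of (name, value) pairs stands in for a struct.
# None of these helpers exist in Spark; they are made up for illustration.

def bind_by_ordinal(fields, name):
    """Resolve `name` once and remember only its position."""
    ordinal = [n for n, _ in fields].index(name)
    return lambda current: current[ordinal][1]

def bind_by_name(name):
    """Look the field up by name every time it is read."""
    return lambda current: dict(current)[name]

# A struct with fields 'b' and 'a', deliberately not in sorted order.
top = [("b", {"bb": "bb"}), ("a", {"aa": "aa"})]

get_b_ordinal = bind_by_ordinal(top, "b")  # remembers position 0
get_b_name = bind_by_name("b")

# Sorting the fields by name moves 'a' into position 0.
top_sorted = sorted(top, key=lambda f: f[0])

stale = get_b_ordinal(top_sorted)  # reads position 0, which is now 'a'
fresh = get_b_name(top_sorted)     # still reads 'b'

print(stale)  # {'aa': 'aa'}
print(fresh)  # {'bb': 'bb'}
```

The stale read returns the contents of 'a' where those of 'b' were expected, which has the same shape as the aa field appearing under b in the exception quoted above. The resemblance is only suggestive; it does not establish which of the two hypotheses is correct.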