[ 
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Binford updated SPARK-35290:
---------------------------------
    Description: 
We've encountered a few weird edge cases that cause the new null-filling 
unionByName to fail (the feature itself has been a great addition!). The 
problem seems to stem from the struct fields being sorted by name and getting 
corrupted along the way. A simple reproduction:
{code:java}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([[]])
df1 = (df
    .withColumn('top', F.struct(
        F.struct(
            F.lit('ba').alias('ba')
        ).alias('b')
    ))
)
df2 = (df
    .withColumn('top', F.struct(
        F.struct(
            F.lit('aa').alias('aa')
        ).alias('a'),
        F.struct(
            F.lit('bb').alias('bb')
        ).alias('b'),
    ))
)
df1.unionByName(df2, True).printSchema()
{code}
This results in the exception:
{code:java}
pyspark.sql.utils.AnalysisException: Union can only be performed on tables with 
the compatible column types. 
struct<a:struct<aa:string>,b:struct<ba:string,bb:string>> <> 
struct<a:struct<aa:string>,b:struct<aa:string,bb:string>> at the first column 
of the second table;
{code}
You can see in the second schema that it has 
{code:java}
b:struct<aa:string,bb:string>
{code}
when it should be
{code:java}
b:struct<ba:string,bb:string>
{code}
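The expected behavior, as I understand it, is that the two nested schemas are merged, with field names unioned and recursively sorted by name, and any missing fields null-filled. A minimal sketch of that merge on plain dicts (a hypothetical helper for illustration, not Spark's actual implementation) shows what the second schema's b field should end up containing:

```python
# Hypothetical sketch of the recursive merge-and-sort that unionByName
# with allowMissingColumns should perform; not Spark's actual code.
def merge_schemas(left, right):
    """Merge two nested schemas (dicts of field -> type or nested dict),
    union the field names, and sort them recursively by name."""
    merged = {}
    for name in sorted(set(left) | set(right)):
        l, r = left.get(name), right.get(name)
        if isinstance(l, dict) or isinstance(r, dict):
            # Nested struct: recurse, treating a missing side as empty
            merged[name] = merge_schemas(l or {}, r or {})
        else:
            merged[name] = l or r  # the missing side would be null-filled

    return merged

schema1 = {'b': {'ba': 'string'}}
schema2 = {'a': {'aa': 'string'}, 'b': {'bb': 'string'}}
print(merge_schemas(schema1, schema2))
# {'a': {'aa': 'string'}, 'b': {'ba': 'string', 'bb': 'string'}}
```

Note that b correctly keeps both ba and bb, which is what the reported exception shows going wrong.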
It seems to happen somewhere during 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala#L73]
 as everything seems correct up to that point from my testing. Either 
modifying one expression during the transformUp corrupts other expressions 
that are modified later, or the ExtractValue created before the addFieldsInto 
call remembers the ordinal position of the field in the struct, which then 
changes and causes issues.
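The second hypothesis can be illustrated with a purely hypothetical Python analogy (not Spark code): if an extractor captures a field's ordinal position and the fields are later re-sorted by name, the cached ordinal silently resolves to the wrong field:

```python
# Illustrative analogy for the stale-ordinal hypothesis; not Spark code.
fields = ['b', 'a']            # struct fields in their original order
ordinal = fields.index('b')    # ExtractValue-style: remember position 0

fields.sort()                  # fields later get sorted by name: ['a', 'b']

# The cached ordinal now resolves to a different field entirely:
print(fields[ordinal])         # prints 'a', not the intended 'b'
```

If something like this is happening, it would explain why b ends up with another struct's fields rather than its own.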

 

I found that simply using sortStructFields instead of 
sortStructFieldsInWithFields gets things working correctly, but it has a 
definite performance impact: the deep expr unionByName test takes ~1-2 
seconds normally but ~12-15 seconds with this change. I assume this is 
because the original method rewrites existing expressions in place, whereas 
sortStructFields adds new expressions on top of the existing ones to project 
the new order.

I'm not sure whether it makes sense to take the slower method that works in 
these edge cases (assuming it doesn't break other cases; all existing tests 
pass), or whether there's a way to fix the existing method to handle cases 
like this.

 

> unionByName with null filling fails for some nested structs
> -----------------------------------------------------------
>
>                 Key: SPARK-35290
>                 URL: https://issues.apache.org/jira/browse/SPARK-35290
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Adam Binford
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
