[jira] [Commented] (SPARK-35290) unionByName with null filling fails for some nested structs

2021-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368070#comment-17368070
 ] 

Apache Spark commented on SPARK-35290:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/33040

> unionByName with null filling fails for some nested structs
> ---
>
> Key: SPARK-35290
> URL: https://issues.apache.org/jira/browse/SPARK-35290
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> We've encountered a few weird edge cases that cause the new null-filling
> unionByName (which has been a great addition!) to fail. It seems to stem from
> the fields being sorted by name and getting corrupted along the way. The simple
> reproduction is:
> {code:python}
> from pyspark.sql import functions as F
> 
> df = spark.createDataFrame([[]])
> df1 = (df
>     .withColumn('top', F.struct(
>         F.struct(
>             F.lit('ba').alias('ba')
>         ).alias('b')
>     ))
> )
> df2 = (df
>     .withColumn('top', F.struct(
>         F.struct(
>             F.lit('aa').alias('aa')
>         ).alias('a'),
>         F.struct(
>             F.lit('bb').alias('bb')
>         ).alias('b'),
>     ))
> )
> df1.unionByName(df2, True).printSchema()
> {code}
> This results in the exception:
> {code:java}
> pyspark.sql.utils.AnalysisException: Union can only be performed on tables 
> with the compatible column types. 
> struct,b:struct> <> 
> struct,b:struct> at the first column 
> of the second table;
> {code}
> You can see in the second schema that it has 
> {code:java}
> b:struct
> {code}
> when it should be
> {code:java}
> b:struct
> {code}
> It seems to happen somewhere during
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala#L73],
> as everything seems correct up to that point from my testing. Either modifying
> one expression during the transformUp corrupts other expressions that are
> modified later, or the ExtractValue created before the addFieldsInto call
> remembers the ordinal position in the struct, which then changes and causes
> issues.
>  
> I found that simply using sortStructFields instead of
> sortStructFieldsInWithFields gets things working correctly, but it definitely
> has a performance impact. The deep expr unionByName test takes ~1-2 seconds
> normally but ~12-15 seconds with this change. I assume that's because the
> original method tries to rewrite existing expressions, whereas sortStructFields
> just adds expressions on top of the existing ones to project the new order (a
> rough sketch of that projection idea follows this quoted description).
> I'm not sure if it makes sense to take the slower method that works in these
> edge cases (assuming it doesn't break other cases; all existing tests pass), or
> if there's a way to fix the existing method for cases like this.
>  
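For illustration, here is a rough, self-contained PySpark sketch of the projection idea mentioned in the description above: reordering a struct's fields by building a new struct on top of the existing column, rather than rewriting the expressions that produced it. This is only a DataFrame-level analogy, not Spark's internal sortStructFields code, and the names used here (sorted_names, reordered) are made up for the example.

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A struct column whose fields are deliberately not in name order.
df = spark.range(1).withColumn(
    'top', F.struct(F.lit('y').alias('b'), F.lit('x').alias('a')))

# "Sort by projecting on top": select the existing struct's fields in sorted
# name order and wrap them in a new struct, leaving the original expressions
# untouched underneath.
sorted_names = sorted(df.schema['top'].dataType.fieldNames())
reordered = df.withColumn(
    'top', F.struct(*[F.col('top').getField(n).alias(n) for n in sorted_names]))

reordered.printSchema()  # top is now struct<a:string,b:string>
{code}

The extra projection is what makes this approach simpler but slower for deeply nested schemas: every level of nesting adds another layer of expressions on top of the existing ones instead of rewriting them in place.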





[jira] [Commented] (SPARK-35290) unionByName with null filling fails for some nested structs

2021-05-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339962#comment-17339962
 ] 

Apache Spark commented on SPARK-35290:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/32448



[jira] [Commented] (SPARK-35290) unionByName with null filling fails for some nested structs

2021-05-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339961#comment-17339961
 ] 

Apache Spark commented on SPARK-35290:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/32448



[jira] [Commented] (SPARK-35290) unionByName with null filling fails for some nested structs

2021-05-04 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338921#comment-17338921
 ] 

Adam Binford commented on SPARK-35290:
--

Running into some issues trying to get case insensitivity working correctly. I
was hoping to be able to leave everything on both sides of the union in its
existing casing and just get things in the right order, but I'm running into
this issue:
{code:java}
>>> from pyspark.sql.functions import *
>>> df1 = spark.range(1).withColumn('top', struct(lit('A').alias('A')))
>>> df2 = spark.range(1).withColumn('top', struct(lit('a').alias('a')))
>>> spark.conf.set('spark.sql.caseSensitive', 'true')
...
pyspark.sql.utils.AnalysisException: Union can only be performed on tables with 
the compatible column types. struct <> struct at the second 
column of the second table;

>>> spark.conf.set('spark.sql.caseSensitive', 'false')
>>> df1.union(df2)
DataFrame[id: bigint, top: struct]
>>> df1.unionByName(df2)
DataFrame[id: bigint, top: struct]
>>> df1.unionByName(df2, True)
DataFrame[id: bigint, top: struct]
{code}
With case sensitivity enabled, it errors out as expected because the two 
structs are different types. However, when case sensitivity is disabled, the 
union is happy because it sees them as the same type, but when the schemas are 
merged, it treats them as two separate fields. I assume it's related to the 
StructType.merge method, but I don't exactly know where that gets called in the 
context of a Union. I don't see anything in that merge function that handles 
case insensitivity. Is that a bug in itself or a feature?
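To spell out that observation, here is a hypothetical helper (not a Spark API; the name fields_match_ignoring_case is made up) that compares the two struct types the way a case-insensitive resolver presumably would:

{code:python}
from pyspark.sql.types import StructType

def fields_match_ignoring_case(a: StructType, b: StructType) -> bool:
    """Compare struct field names ignoring case, the way case-insensitive
    resolution presumably treats the two sides of the union."""
    return sorted(f.lower() for f in a.fieldNames()) == \
           sorted(f.lower() for f in b.fieldNames())
{code}

With the DataFrames above, df1's top is struct<A:string> and df2's is struct<a:string>: exact name comparison says they differ, case-insensitive comparison says they match, which lines up with the union being accepted even though the merged schema then treats them as two separate fields.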



[jira] [Commented] (SPARK-35290) unionByName with null filling fails for some nested structs

2021-05-03 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338565#comment-17338565
 ] 

L. C. Hsieh commented on SPARK-35290:
-

Thanks [~Kimahriman]. 



[jira] [Commented] (SPARK-35290) unionByName with null filling fails for some nested structs

2021-05-03 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338526#comment-17338526
 ] 

Adam Binford commented on SPARK-35290:
--

I've also been playing around with rewriting some of the logic to just directly,
recursively create a named struct, removing the need for the
UpdateField/WithField logic. I've gotten all existing tests to pass (including a
new one for this case) without the 12-15 second overhead mentioned in the
description for the deeply nested case, but I think there's still some case
insensitivity I might need to take care of. I can put up a PR soon with what
that looks like.
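As a rough PySpark-level sketch of that idea (not the actual Scala change in the PR; the helper name align_struct and its signature are invented here, and it ignores case sensitivity): given a target struct type, rebuild each side's struct column directly, recursing into nested structs and null-filling fields that only exist on the other side.

{code:python}
from pyspark.sql import Column, functions as F
from pyspark.sql.types import StructType

def align_struct(col: Column, source: StructType, target: StructType) -> Column:
    """Rebuild a struct column of type `source` so it matches `target`,
    recursing into nested structs and null-filling missing fields."""
    children = []
    for field in target.fields:
        if field.name in source.fieldNames():
            child = col.getField(field.name)
            child_source = source[field.name].dataType
            if isinstance(field.dataType, StructType) and isinstance(child_source, StructType):
                child = align_struct(child, child_source, field.dataType)
        else:
            # The field only exists on the other side of the union.
            child = F.lit(None).cast(field.dataType)
        children.append(child.alias(field.name))
    return F.struct(*children)
{code}

Applying something like this to both sides with the merged, sorted schema as the target, and then unioning by position, would in principle give the null-filling unionByName result in a single projection per side, without the WithField rewrites.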



[jira] [Commented] (SPARK-35290) unionByName with null filling fails for some nested structs

2021-05-03 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338477#comment-17338477
 ] 

L. C. Hsieh commented on SPARK-35290:
-

I will take a look.



[jira] [Commented] (SPARK-35290) unionByName with null filling fails for some nested structs

2021-05-03 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338476#comment-17338476
 ] 

L. C. Hsieh commented on SPARK-35290:
-

Thanks [~hyukjin.kwon] for pinging me.



[jira] [Commented] (SPARK-35290) unionByName with null filling fails for some nested structs

2021-05-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338287#comment-17338287
 ] 

Hyukjin Kwon commented on SPARK-35290:
--

cc [~viirya] FYI
