[ 
https://issues.apache.org/jira/browse/SPARK-37829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740726#comment-17740726
 ] 

koert kuipers edited comment on SPARK-37829 at 7/8/23 4:19 PM:
---------------------------------------------------------------

since this (admittedly somewhat weird) behavior of returning a Row with null
values has been present since spark 3.0.x (a major breaking release, 3 years
ago), i would argue it is now the default behavior and that this jira
introduces a breaking change.

basically i am saying: if one argues this was a breaking change in going from
spark 2.x to 3.x, then i agree, but a major version can make a breaking change.
introducing a fix in 3.4.1 that reverts that breaking change is itself a
breaking change going from 3.4.0 to 3.4.1, which is worse in my opinion.

also ExpressionEncoders are used for purposes other than Dataset joins, and now
we find nulls popping up in places where they should not. this is how i ran
into this issue.


> An outer-join using joinWith on DataFrames returns Rows with null fields 
> instead of null values
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37829
>                 URL: https://issues.apache.org/jira/browse/SPARK-37829
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0
>            Reporter: Clément de Groc
>            Assignee: Jason Xu
>            Priority: Major
>             Fix For: 3.3.3, 3.4.1, 3.5.0
>
>
> Doing an outer-join using {{joinWith}} on {{DataFrame}}s used to return
> missing values as {{null}} in Spark 2.4.8, but returns them as {{Rows}} with
> {{null}} values in Spark 3+.
> The issue can be reproduced with [the following 
> test|https://github.com/cdegroc/spark/commit/79f4d6a1ec6c69b10b72dbc8f92ab6490d5ef5e5]
>  that succeeds on Spark 2.4.8 but fails starting from Spark 3.0.0.
> The problem only arises when working with DataFrames: Datasets of case 
> classes work as expected as demonstrated by [this other 
> test|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala#L1200-L1223].
> I couldn't find an explanation for this change in the Migration guide so I'm 
> assuming this is a bug.
> A {{git bisect}} pointed me to [that 
> commit|https://github.com/apache/spark/commit/cd92f25be5a221e0d4618925f7bc9dfd3bb8cb59].
> Reverting the commit solves the problem.
> A similar solution, but without reverting, is shown
> [here|https://github.com/cdegroc/spark/commit/684c675bf070876a475a9b225f6c2f92edce4c8a].
> Happy to help if you think of another approach / can provide some guidance.
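
A minimal sketch of the scenario described in the issue, assuming a local
SparkSession and two made-up DataFrames. The names `left` and `right`, the
column names, and the sample values are hypothetical and not taken from the
linked test. It runs a left outer `joinWith` and prints whether the unmatched
right side comes back as `null` or as a `Row`. Per the report, Spark 2.4.8
yields `null`, while the affected 3.x versions yield a `Row` whose fields are
all null. No expected output is shown because it depends on the Spark version
in use.

```scala
import org.apache.spark.sql.SparkSession

object JoinWithOuterSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local, single-threaded session is enough to observe the behavior.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("joinWith-outer-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: id 2 has no match on the right side.
    val left = Seq((1, "a"), (2, "b")).toDF("id", "l")
    val right = Seq((1, "x")).toDF("id", "r")

    // joinWith on DataFrames produces a Dataset[(Row, Row)].
    val joined = left.joinWith(right, left("id") === right("id"), "left_outer")

    joined.collect().foreach { case (l, r) =>
      // For the unmatched row, r is either null or a Row with all-null fields,
      // depending on the Spark version.
      println(s"left=$l right=$r rightIsNull=${r == null}")
    }

    spark.stop()
  }
}
```

Code that checks `r == null` and code that checks the individual fields of `r`
behave differently across these versions, which is why a change in either
direction can break callers.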


