[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas resolved SPARK-25150.
--------------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

It looks like Spark 3.1.2 exhibits a different sort of broken behavior:
{code:java}
pyspark.sql.utils.AnalysisException: Column State#38 are ambiguous. It's 
probably because you joined several Datasets together, and some of these 
Datasets are the same. This column points to one of the Datasets but Spark is 
unable to figure out which one. Please alias the Datasets with different names 
via `Dataset.as` before joining them, and specify the column using qualified 
name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set 
spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check. {code}
I don't think the join in {{zombie-analysis.py}} is ambiguous, and since this 
now works fine in Spark 3.2.0, that's what I'm going to mark as the "Fix 
Version" for this issue.

The fix must have made it in somewhere between Spark 3.1.2 and 3.2.0.
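
For anyone hitting the 3.1.2 error above, here is a minimal PySpark sketch of the aliasing workaround the message suggests (the DataFrame and column names below are made up for illustration, not taken from zombie-analysis.py):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# A hypothetical DataFrame standing in for one joined against itself
# (or against another DataFrame derived from the same source).
df = spark.createDataFrame([(1, "CA"), (2, "NY")], ["id", "state"])

# Alias each side of the join so every column reference is qualified,
# which sidesteps the "Column ... are ambiguous" analysis error.
joined = (
    df.alias("a")
      .join(df.alias("b"), col("a.state") == col("b.state"))
      .select(col("a.id").alias("left_id"), col("b.id").alias("right_id"))
)
{code}
As the error notes, the check can also be disabled outright with spark.sql.analyzer.failAmbiguousSelfJoin=false, but qualifying the columns via aliases is the safer fix.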

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-25150
>                 URL: https://issues.apache.org/jira/browse/SPARK-25150
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1, 2.4.3
>            Reporter: Nicholas Chammas
>            Priority: Major
>              Labels: correctness
>             Fix For: 3.2.0
>
>         Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
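> The join has roughly this shape (a simplified sketch with made-up rows; the 
> actual script is attached as zombie-analysis.py):
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.getOrCreate()
>
> # Hypothetical stand-ins for the attached states.csv and persons.csv.
> states = spark.createDataFrame(
>     [("CA", "California"), ("NY", "New York")], ["code", "name"])
> persons = spark.createDataFrame(
>     [("Alice", "CA"), ("Bob", "CA"), ("Carol", "NY")], ["person", "state"])
>
> # B1 and B2: two aggregates derived from the same source DataFrame.
> b1 = persons.groupBy("state").agg(F.count("*").alias("num_persons"))
> b2 = persons.groupBy("state").agg(F.min("person").alias("first_person"))
>
> # Joining A (states) to both derivatives is what produces the error above.
> result = (states
>           .join(b1, states["code"] == b1["state"])
>           .join(b2, states["code"] == b2["state"]))
> {code}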
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be a left outer join instead of an inner join (since some of the aggregates 
> are not available for all states), but that doesn't explain Spark's behavior.


