[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

Nicholas Chammas (JIRA) Fri, 28 Sep 2018 13:13:36 -0700


    [ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632512#comment-16632512 ]


Nicholas Chammas commented on SPARK-25150:
------------------------------------------

Correct, this isn't a cross join. It's just a plain inner join.

In theory, whether cross joins are enabled should have no bearing on the 
result of an inner join. What we're seeing instead is that with them disabled 
we get a misleading error, and with them enabled we get incorrect results.

If we were actually doing a cross join (i.e. no {{on=(...)}} condition 
specified), I think those results (with the 4 output rows) would still be 
incorrect: you'd expect NH's population to be combined with RI's stats in 
one of the output rows, but that's not the case. You'd also expect MA to show 
up in the output.
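
To make that expectation concrete, here is a minimal pure-Python sketch (not Spark) of what an inner join and a true cross join should each produce. The table names and all of the numbers are made up for illustration; they are not taken from the attached persons.csv or states.csv.

```python
from itertools import product

# Hypothetical stand-ins for two per-state tables. The values are invented
# for illustration and do NOT come from the attachments on this issue.
populations = [("NH", 100), ("RI", 200), ("MA", 300)]  # (state, population)
stats = [("NH", 0.1), ("RI", 0.2)]                     # (state, rate); MA has none

# Inner join on state: only rows whose state appears on both sides survive.
inner = [
    (s1, pop, rate)
    for (s1, pop), (s2, rate) in product(populations, stats)
    if s1 == s2
]

# True cross join: every left row is paired with every right row.
cross = [
    (s1, pop, s2, rate)
    for (s1, pop), (s2, rate) in product(populations, stats)
]

print(inner)
print(len(cross))
print(("NH", 100, "RI", 0.2) in cross)       # NH's population paired with RI's stats
print(any(row[0] == "MA" for row in cross))  # MA appears despite having no stats
```

With this toy data, a real cross join contains the NH/RI pairing and includes MA, while the inner join drops MA entirely. The output attached to this issue matches neither pattern.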

> The second join joins on a column in {{states}}, but that is not a DataFrame 
> used in that join. Is that the problem?

Not sure what you mean here. Both joins join on {{states}}, which is the first 
DataFrame in the definition of {{analysis}}.

 

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-25150
>                 URL: https://issues.apache.org/jira/browse/SPARK-25150
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Nicholas Chammas
>            Priority: Major
>         Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be a left outer join instead of an inner join (since some of the aggregates 
> are not available for all states), but that doesn't explain Spark's behavior.



