Nicholas Chammas created SPARK-25150:
----------------------------------------

             Summary: Joining DataFrames derived from the same source yields 
confusing/incorrect results
                 Key: SPARK-25150
                 URL: https://issues.apache.org/jira/browse/SPARK-25150
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.1
            Reporter: Nicholas Chammas
         Attachments: output-with-implicit-cross-join.txt, 
output-without-implicit-cross-join.txt, persons.csv, states.csv, 
zombie-analysis.py

I have two DataFrames, A and B. From B, I have derived two additional 
DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
confusing error:
{code:java}
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
{code}
Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark 
appears to give me incorrect answers.

I am not sure if I am missing something obvious, or if there is some kind of 
bug here. The "join condition is missing" error is confusing and doesn't make 
sense to me, and the seemingly incorrect output is concerning.

I've attached a reproduction, along with the output I'm seeing with and without 
the implicit cross join enabled.

I realize the join I've written is not correct, in the sense that it should be 
a left outer join instead of an inner join (since some of the aggregates are 
not available for all states), but that doesn't explain Spark's behavior.
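
For readers without the attachments, the inner versus left outer distinction mentioned above can be illustrated without Spark. The following is a minimal plain-Python sketch, not the attached zombie-analysis.py. The state codes, counts, and helper names (`inner_join`, `left_outer_join`) are hypothetical and chosen only for illustration. It shows why an inner join drops states that lack one of the aggregates, and it does not attempt to reproduce the Spark error itself.

```python
# Hypothetical data: per-state aggregates derived from one source table.
states = ["CA", "NY", "TX"]
zombies_by_state = {"CA": 2, "NY": 1}  # no zombie aggregate for TX
humans_by_state = {"CA": 5, "TX": 3}   # no human aggregate for NY


def inner_join(states, left, right):
    """Keep only the states that appear in both aggregates."""
    return [(s, left[s], right[s]) for s in states if s in left and s in right]


def left_outer_join(states, left, right):
    """Keep every state; a missing aggregate becomes None."""
    return [(s, left.get(s), right.get(s)) for s in states]


inner_rows = inner_join(states, zombies_by_state, humans_by_state)
outer_rows = left_outer_join(states, zombies_by_state, humans_by_state)

print(inner_rows)
print(outer_rows)
```

With this made-up data, the inner join keeps only the state that has both aggregates, while the left outer join keeps all three states and fills the gaps with `None`.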



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
