[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632485#comment-16632485 ]
Sean Owen commented on SPARK-25150: ----------------------------------- Hm, I am not sure I understand the example yet – help me clarify here. We have three dataframes, really; states, humans, zombies: {code:java} State,Total Population,Total Area RI,1200000,300000 MA,8000000,17000000 NH,3300000,9100000 +-----+-----+ |State|count| +-----+-----+ | RI| 2| | NH| 1| +-----+-----+ +-----+-----+ |State|count| +-----+-----+ | RI| 1| | MA| 1| +-----+-----+{code} You join all three on state: {code:java} analysis = ( states .join( total_humans, on=(states['State'] == total_humans['State']) ) .join( total_zombies, on=(states['State'] == total_zombies['State']) ) .orderBy(states['State'], ascending=True) .select( states['State'], states['Total Population'], total_humans['count'].alias('Total Humans'), total_zombies['count'].alias('Total Zombies'), ) ) {code} and you get {code:java} +-----+----------------+------------+-------------+ |State|Total Population|Total Humans|Total Zombies| +-----+----------------+------------+-------------+ | NH| 3300000| 1| 1| | NH| 3300000| 1| 1| | RI| 1200000| 2| 1| | RI| 1200000| 2| 1| +-----+----------------+------------+-------------+{code} But say you expect {code:java} +-----+----------------+------------+-------------+ |State|Total Population|Total Humans|Total Zombies| +-----+----------------+------------+-------------+ | RI| 1200000| 2| 1| +-----+----------------+------------+-------------+{code} First, this isn't a cross join right? the message says it thinks there is no join condition and wonders if you're really trying to do a cross join, but you're not, so enabling it isn't helping. If these were cross-joins, the output would be correct I think? The second join joins on a column in {{states}}, but that is not a DataFrame used in that join. Is that the problem? > Joining DataFrames derived from the same source yields confusing/incorrect > results > ---------------------------------------------------------------------------------- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.1 > Reporter: Nicholas Chammas > Priority: Major > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org