[ https://issues.apache.org/jira/browse/SPARK-24780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

holdenk updated SPARK-24780:
----------------------------
    Summary: DataFrame.column_name should resolve to a distinct ref  (was: DataFrame.column_name should take into account DataFrame alias for future joins)

> DataFrame.column_name should resolve to a distinct ref
> ------------------------------------------------------
>
>                 Key: SPARK-24780
>                 URL: https://issues.apache.org/jira/browse/SPARK-24780
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.4.0
>            Reporter: holdenk
>            Priority: Minor
>
> If we join a DataFrame with another DataFrame that has the same column names
> as those used in the join condition (e.g. because the two share lineage on
> one of the columns), then even though the join condition is written with the
> full DataFrame.column_name form, the columns returned don't carry the
> DataFrame alias, and as such the join becomes a cross join.
> For example, this currently works even if both posts_by_sampled_authors and
> mailing_list_posts_in_reply_to contain both in_reply_to and message_id fields:
>
> {code:python}
> posts_with_replies = posts_by_sampled_authors.join(
>     mailing_list_posts_in_reply_to,
>     [F.col("mailing_list_posts_in_reply_to.in_reply_to") ==
>      F.col("posts_by_sampled_authors.message_id")],
>     "inner"){code}
>
> But a similarly written expression:
> {code:python}
> posts_with_replies = posts_by_sampled_authors.join(
>     mailing_list_posts_in_reply_to,
>     [mailing_list_posts_in_reply_to.in_reply_to ==
>      posts_by_sampled_authors.message_id],
>     "inner"){code}
> will fail.
>
> I'm not sure what's going on inside the resolution that causes it to get
> confused.