[jira] [Commented] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824877#comment-17824877 ] Asif commented on SPARK-47320:
--
Opened the following PR: [https://github.com/apache/spark/pull/45446]

> Datasets involving self joins behave in an inconsistent and unintuitive
> manner
>
> Key: SPARK-47320
> URL: https://issues.apache.org/jira/browse/SPARK-47320
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.1
> Reporter: Asif
> Priority: Major
>
> Datasets involving self joins behave inconsistently and unintuitively with
> respect to when an AnalysisException is thrown due to ambiguity and when the
> query succeeds.
> There are situations where swapping the join order causes a query to throw an
> ambiguity-related exception that the original order does not. Some Datasets
> that are unambiguous from the user's perspective still result in an
> AnalysisException being thrown.
> After testing and fixing a bug, I think the issue lies in an inconsistency in
> determining what constitutes an ambiguous reference.
> There are two ways to look at resolution regarding ambiguity:
> 1) The ExprId of attributes: this is an unintuitive approach, as Spark users
> do not deal with ExprIds.
> 2) The Dataset from which the column was extracted via the df(col) API: this
> is the user-visible, understandable point of view, so ambiguity should be
> determined on this basis. What is logically unambiguous from the user's
> perspective (assuming the query is logically correct) should also be what
> Spark treats as unambiguous.
> For example:
> {quote}
> val df1 = Seq((1, 2)).toDF("a", "b")
> val df2 = Seq((1, 2)).toDF("aa", "bb")
> val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
> df2("aa"), df1("b"))
> val df3 = df1Joindf2.join(df1, df1Joindf2("aa") ===
> df1("a")).select(df1("a"))
> {quote}
> From perspective #1, the above code should throw an ambiguity exception: in
> the join condition and projection of df3, df1("a") has an ExprId that matches
> attributes of both df1Joindf2 and df1.
> But from the perspective of the Dataset used to extract the column, which
> reflects the user's intent, the expectation is that df1("a") resolves to the
> Dataset df1 being joined, not to df1Joindf2. Had the user intended "a" from
> df1Joindf2, they would have written df1Joindf2("a").
> In this case, current Spark throws the exception, so it is resolving based
> on #1.
> But by the same logic, the DataFrame below should also throw an ambiguity
> exception, yet it passes:
> {quote}
> val df1 = Seq((1, 2)).toDF("a", "b")
> val df2 = Seq((1, 2)).toDF("aa", "bb")
> val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
> df2("aa"), df1("b"))
> df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))
> {quote}
> The only difference between the two cases is that the first has a select and
> the second does not.
> This implies that in the first case the df1("a") in the projection causes the
> ambiguity issue, while the same reference in the second case, used only in
> the join condition, is considered unambiguous.
> IMHO, the ambiguity criteria should be based entirely, and consistently, on
> #2.
> In DataFrameJoinSuite and DataFrameSelfJoinSuite, if we go by #2, some of the
> tests currently considered ambiguous (under criterion #1) become unambiguous
> (under criterion #2).
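The two resolution perspectives can be sketched with a small stand-alone model (plain Python, not Spark's actual analyzer; the `Column`/`Dataset` classes and both ambiguity checks below are hypothetical illustrations). The key mechanic it captures is that a projection or filter produces a new Dataset while reusing its parent's ExprIds, so an ExprId-based check flags df1("a") as ambiguous while a Dataset-identity-based check resolves it:

```python
from dataclasses import dataclass
from itertools import count

_expr_ids = count()
_dataset_ids = count()

@dataclass(frozen=True)
class Column:
    name: str
    expr_id: int     # analyzer-level identity, akin to Spark's ExprId
    dataset_id: int  # identity of the Dataset the user called df(name) on

class Dataset:
    """Toy stand-in for a DataFrame: just a bag of named columns."""
    def __init__(self, names):
        self.id = next(_dataset_ids)
        self.columns = {n: Column(n, next(_expr_ids), self.id) for n in names}

    @classmethod
    def projection(cls, cols):
        """A select/filter over existing columns: new Dataset, same ExprIds."""
        ds = cls.__new__(cls)
        ds.id = next(_dataset_ids)
        ds.columns = {c.name: Column(c.name, c.expr_id, ds.id) for c in cols}
        return ds

    def __call__(self, name):
        return self.columns[name]

def ambiguous_by_expr_id(col, left, right):
    # Perspective #1: ambiguous if the reference's ExprId can come from
    # both sides of the join.
    ids = lambda ds: {c.expr_id for c in ds.columns.values()}
    return col.expr_id in ids(left) and col.expr_id in ids(right)

def ambiguous_by_dataset(col, left, right):
    # Perspective #2: df(name) already names its Dataset, so the reference
    # is ambiguous only when that very Dataset sits on both sides.
    return left.id == right.id == col.dataset_id

# The df3 example: df1Joindf2 reuses df1's ExprIds via the projection.
df1 = Dataset(["a", "b"])
df2 = Dataset(["aa", "bb"])
df1Joindf2 = Dataset.projection([df1("a"), df2("aa"), df1("b")])

print(ambiguous_by_expr_id(df1("a"), df1Joindf2, df1))   # True: throws under #1
print(ambiguous_by_dataset(df1("a"), df1Joindf2, df1))   # False: resolvable under #2

# The SPARK-28344 test shape: df2 is a filter over df1, again reusing ExprIds.
df1b = Dataset(["id"])
df2b = Dataset.projection([df1b("id")])
print(ambiguous_by_expr_id(df1b("id"), df1b, df2b))      # True
print(ambiguous_by_dataset(df1b("id"), df1b, df2b))      # False

# A genuine self join of df1 with itself is ambiguous under both criteria.
print(ambiguous_by_dataset(df1("a"), df1, df1))          # True
```

Only `Column.dataset_id` distinguishes the two checks; the point is that a select or filter yields a new Dataset identity even though the ExprIds are shared, which is why perspective #2 can resolve references that perspective #1 reports as ambiguous.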
> For example:
> {quote}
> test("SPARK-28344: fail ambiguous self join - column ref in join condition") {
>   val df1 = spark.range(3)
>   val df2 = df1.filter($"id" > 0)
>   withSQLConf(
>     SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true",
>     SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
>     assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id")))
>   }
> }
> {quote}
> The above test should not throw an ambiguity exception, as df1("id") and
> df2("id") are unambiguous from the perspective of the Datasets they were
> extracted from.
--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824589#comment-17824589 ] Asif commented on SPARK-47320:
--
Will be linking the bug to an open PR.