[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...
Github user mgaido91 closed the pull request at: https://github.com/apache/spark/pull/21184 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21184#discussion_r207859370 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -284,6 +288,80 @@ class Analyzer( } } + /** + * Replaces [[Alias]] with the same exprId but different references with [[Alias]] having + * different exprIds. This is a rare situation which can cause incorrect results. + */ + object DeduplicateAliases extends Rule[LogicalPlan] { --- End diff -- kindly ping @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21184#discussion_r202334300 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -284,6 +288,80 @@ class Analyzer( } } + /** + * Replaces [[Alias]] with the same exprId but different references with [[Alias]] having + * different exprIds. This is a rare situation which can cause incorrect results. + */ + object DeduplicateAliases extends Rule[LogicalPlan] { --- End diff -- Yes, that is also true. But in many places in the codebase we just compare attributes using `semanticEquals` or in some other cases, even `equals`. Well, if we admit that different attributes can have the same `exprId`, all these places should be checked in order to be sure that the same problem cannot happen there too. Moreover (this is more a nit), the `semanticEquals` or `sameRef` method itself would be wrong according to its semantic, as it may return `true` even when two attributes don't have the same reference. This is the reason why I opted for this solution, which seems to me cleaner as it solves the root cause of the problem. What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21184#discussion_r202077386 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -284,6 +288,80 @@ class Analyzer( } } + /** + * Replaces [[Alias]] with the same exprId but different references with [[Alias]] having + * different exprIds. This is a rare situation which can cause incorrect results. + */ + object DeduplicateAliases extends Rule[LogicalPlan] { --- End diff -- I feel the root cause is in `FoldablePropagation`. We should only replace attribute with literal from children, not siblings. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21184#discussion_r201961786 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -284,6 +288,80 @@ class Analyzer( } } + /** + * Replaces [[Alias]] with the same exprId but different references with [[Alias]] having + * different exprIds. This is a rare situation which can cause incorrect results. + */ + object DeduplicateAliases extends Rule[LogicalPlan] { --- End diff -- The main problem which causes the added UT to fail is that `FoldablePropagation` replaces all foldable aliases which are considered to be the same. If the same alias with same exprId is located in different part of the plan (referencing actually different things, despite they have the same id...) this can cause wrong replacement to happen. So in the added UT, the plan is: ``` == Analyzed Logical Plan == a: int, b: int, n: bigint Union :- Project [a#5, b#17, n#19L] : +- Project [a#5, b#17, n#19L, n#19L] : +- Window [count(1) windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS n#19L] :+- Project [a#5, b#6 AS b#17] : +- Project [_1#2 AS a#5, _2#3 AS b#6] : +- LocalRelation [_1#2, _2#3] +- Project [a#12, b#17, n#19L] +- Project [a#12, b#17, n#19L, n#19L] +- Window [count(1) windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS n#19L] +- Project [a#12, b#14 AS b#17] +- Project [a#12, 0 AS b#14] +- Project [value#10 AS a#12] +- LocalRelation [value#10] ``` Please note that in both the branches of the union we have the same `b#17` attribute, but they are referencing different things. As the lower one is a foldable value which evaluates to 0, all the `b#17` are replace with a literal 0, causing a wrong result. Despite we might fix this specific problem in the related Optimizer rule, I think that in general we assume that items with the same id are the same. So I proposed this solution in order to fix all the possible issues which may arise due to this situation which is not expected. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21184#discussion_r201957701 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala --- @@ -2265,4 +2266,15 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { val df = spark.range(1).select($"id", new Column(Uuid())) checkAnswer(df, df.collect()) } + + test("SPARK-24051: using the same alias can produce incorrect result") { --- End diff -- yes, without the change the result is: ``` +---+---+---+ | a| b| n| +---+---+---+ | 1| 0| 2| | 2| 0| 2| | 3| 0| 1| +---+---+---+ ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21184#discussion_r201886545 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -284,6 +288,80 @@ class Analyzer( } } + /** + * Replaces [[Alias]] with the same exprId but different references with [[Alias]] having + * different exprIds. This is a rare situation which can cause incorrect results. + */ + object DeduplicateAliases extends Rule[LogicalPlan] { --- End diff -- what problem does it try to resolve? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21184#discussion_r201769371 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala --- @@ -2265,4 +2266,15 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { val df = spark.range(1).select($"id", new Column(Uuid())) checkAnswer(df, df.collect()) } + + test("SPARK-24051: using the same alias can produce incorrect result") { --- End diff -- This test case failed without your changes? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/21184 [WIP][SPARK-24051][SQL] Replace Aliases with the same exprId ## What changes were proposed in this pull request? If a user reuses the same column in different selects, it can happen that we have `Alias` with the same `exprId`. These aliases can of course reference different columns/expressions (as in the use case presented in the JIRA). If any of them is a foldable, all of them are replaced with the foldable value in the Optimizer. The PR proposes to replace the conflicting aliases generating new ids for the conflicting ones. ## How was this patch tested? added UT You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-24051 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21184.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21184 commit d676b6277a682894d409e314e64ece7857a97841 Author: Marco Gaido Date: 2018-04-25T16:14:55Z [SPARK-24051][SQL] Replace Aliases with the same exprId --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org