[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

2018-09-24 Thread mgaido91
Github user mgaido91 closed the pull request at:

https://github.com/apache/spark/pull/21184


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

2018-08-06 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21184#discussion_r207859370
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -284,6 +288,80 @@ class Analyzer(
 }
   }
 
+  /**
+   * Replaces [[Alias]] with the same exprId but different references with 
[[Alias]] having
+   * different exprIds. This is a rare situation which can cause incorrect 
results.
+   */
+  object DeduplicateAliases extends Rule[LogicalPlan] {
--- End diff --

kindly ping @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

2018-07-13 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21184#discussion_r202334300
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -284,6 +288,80 @@ class Analyzer(
 }
   }
 
+  /**
+   * Replaces [[Alias]] with the same exprId but different references with 
[[Alias]] having
+   * different exprIds. This is a rare situation which can cause incorrect 
results.
+   */
+  object DeduplicateAliases extends Rule[LogicalPlan] {
--- End diff --

Yes, that is also true. But in many places in the codebase we just compare 
attributes using `semanticEquals` or in some other cases, even `equals`. Well, 
if we admit that different attributes can have the same `exprId`, all these 
places should be checked in order to be sure that the same problem cannot 
happen there too. Moreover (this is more a nit), the `semanticEquals` or 
`sameRef` method itself would be wrong according to its semantic, as it may 
return `true` even when two attributes don't have the same reference. This is 
the reason why I opted for this solution, which seems to me cleaner as it 
solves the root cause of the problem. What do you think?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

2018-07-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21184#discussion_r202077386
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -284,6 +288,80 @@ class Analyzer(
 }
   }
 
+  /**
+   * Replaces [[Alias]] with the same exprId but different references with 
[[Alias]] having
+   * different exprIds. This is a rare situation which can cause incorrect 
results.
+   */
+  object DeduplicateAliases extends Rule[LogicalPlan] {
--- End diff --

I feel the root cause is in `FoldablePropagation`. We should only replace 
attribute with literal from children, not siblings. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

2018-07-12 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21184#discussion_r201961786
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -284,6 +288,80 @@ class Analyzer(
 }
   }
 
+  /**
+   * Replaces [[Alias]] with the same exprId but different references with 
[[Alias]] having
+   * different exprIds. This is a rare situation which can cause incorrect 
results.
+   */
+  object DeduplicateAliases extends Rule[LogicalPlan] {
--- End diff --

The main problem which causes the added UT to fail is that 
`FoldablePropagation` replaces all foldable aliases which are considered to be 
the same. If the same alias with same exprId is located in different part of 
the plan (referencing actually different things, despite they have the same 
id...) this can cause wrong replacement to happen. So in the added UT, the plan 
is:
```
== Analyzed Logical Plan ==
a: int, b: int, n: bigint
Union
:- Project [a#5, b#17, n#19L]
:  +- Project [a#5, b#17, n#19L, n#19L]
: +- Window [count(1) 
windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), 
unboundedfollowing$())) AS n#19L]
:+- Project [a#5, b#6 AS b#17]
:   +- Project [_1#2 AS a#5, _2#3 AS b#6]
:  +- LocalRelation [_1#2, _2#3]
+- Project [a#12, b#17, n#19L]
   +- Project [a#12, b#17, n#19L, n#19L]
  +- Window [count(1) 
windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), 
unboundedfollowing$())) AS n#19L]
 +- Project [a#12, b#14 AS b#17]
+- Project [a#12, 0 AS b#14]
   +- Project [value#10 AS a#12]
  +- LocalRelation [value#10]
```
Please note that in both the branches of the union we have the same `b#17` 
attribute, but they are referencing different things. As the lower one is a 
foldable value which evaluates to 0, all the `b#17` are replace with a literal 
0, causing a wrong result.

Despite we might fix this specific problem in the related Optimizer rule, I 
think that in general we assume that items with the same id are the same. So I 
proposed this solution in order to fix all the possible issues which may arise 
due to this situation which is not expected.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

2018-07-12 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21184#discussion_r201957701
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -2265,4 +2266,15 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
 val df = spark.range(1).select($"id", new Column(Uuid()))
 checkAnswer(df, df.collect())
   }
+
+  test("SPARK-24051: using the same alias can produce incorrect result") {
--- End diff --

yes, without the change the result is:
```
+---+---+---+
|  a|  b|  n|
+---+---+---+
|  1|  0|  2|
|  2|  0|  2|
|  3|  0|  1|
+---+---+---+
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

2018-07-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21184#discussion_r201886545
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -284,6 +288,80 @@ class Analyzer(
 }
   }
 
+  /**
+   * Replaces [[Alias]] with the same exprId but different references with 
[[Alias]] having
+   * different exprIds. This is a rare situation which can cause incorrect 
results.
+   */
+  object DeduplicateAliases extends Rule[LogicalPlan] {
--- End diff --

what problem does it try to resolve?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

2018-07-11 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21184#discussion_r201769371
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -2265,4 +2266,15 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
 val df = spark.range(1).select($"id", new Column(Uuid()))
 checkAnswer(df, df.collect())
   }
+
+  test("SPARK-24051: using the same alias can produce incorrect result") {
--- End diff --

This test case failed without your changes? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

2018-04-27 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/21184

[WIP][SPARK-24051][SQL] Replace Aliases with the same exprId

## What changes were proposed in this pull request?

If a user reuses the same column in different selects, it can happen that 
we have `Alias` with the same `exprId`. These aliases can of course reference 
different columns/expressions (as in the use case presented in the JIRA). If 
any of them is a foldable, all of them are replaced with the foldable value in 
the Optimizer.

The PR proposes to replace the conflicting aliases generating new ids for 
the conflicting ones.

## How was this patch tested?

added UT


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-24051

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21184


commit d676b6277a682894d409e314e64ece7857a97841
Author: Marco Gaido 
Date:   2018-04-25T16:14:55Z

[SPARK-24051][SQL] Replace Aliases with the same exprId




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org