[GitHub] spark pull request #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicate...

2018-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23211#discussion_r240097479
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -984,6 +1002,28 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {

       project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))

+    // Similar to the above Filter over Project
+    // LeftSemi/LeftAnti over Project
+    case join @ Join(p @ Project(pList, grandChild), rightOp, LeftSemiOrAnti(joinType), joinCond)
--- End diff --

Shall we create a new rule `PushdownLeftSemiOrAntiJoin`?
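For illustration, the shape of the transformation such a rule would perform can be sketched on a toy plan algebra. This is not Catalyst code; every name here (`Plan`, `Relation`, the rule object itself) is invented for the example, and the real rule would also have to rewrite the join condition through the project's alias map, which is omitted:

```scala
// Toy model of pushing a LeftSemi/LeftAnti join below a Project.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Project(cols: Seq[String], child: Plan) extends Plan
case class Join(left: Plan, right: Plan, joinType: String, cond: String) extends Plan

object PushdownLeftSemiOrAntiJoin {
  def apply(plan: Plan): Plan = plan match {
    // LeftSemi/LeftAnti joins output only the left side's rows, so the
    // join can move below the project and the project be re-applied on top.
    case Join(Project(cols, grandChild), right, jt @ ("LeftSemi" | "LeftAnti"), cond) =>
      Project(cols, Join(grandChild, right, jt, cond))
    case other => other
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val before = Join(Project(Seq("a"), Relation("t1")), Relation("t2"), "LeftSemi", "a = c")
    println(PushdownLeftSemiOrAntiJoin(before))
  }
}
```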


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicate...

2018-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23211#discussion_r240097255
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -649,13 +664,16 @@ object CollapseProject extends Rule[LogicalPlan] {

   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case p1 @ Project(_, p2: Project) =>
-      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) {
+      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
+        ScalarSubquery.hasScalarSubquery(p1.projectList) ||
+        ScalarSubquery.hasScalarSubquery(p2.projectList)) {
--- End diff --

why do we allow it before?
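One reason collapsing is unsafe, sketched on a toy expression type (again purely illustrative, not Catalyst): if the upper project references the aliased scalar subquery more than once, inlining the alias duplicates the subquery, so it would be planned and executed twice after the projects are merged. This mirrors the existing non-deterministic-output check.

```scala
// Why CollapseProject should skip project lists with scalar subqueries:
// inlining a twice-referenced subquery alias duplicates the subquery.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class Subquery(sql: String) extends Expr

object CollapseDemo {
  // Substitute aliases into an expression, as collapsing two projects would.
  def inline(e: Expr, aliases: Map[String, Expr]): Expr = e match {
    case Attr(n)     => aliases.getOrElse(n, Attr(n))
    case Add(l, r)   => Add(inline(l, aliases), inline(r, aliases))
    case s: Subquery => s
  }

  def countSubqueries(e: Expr): Int = e match {
    case _: Subquery => 1
    case Add(l, r)   => countSubqueries(l) + countSubqueries(r)
    case _           => 0
  }

  def main(args: Array[String]): Unit = {
    // lower project: (SELECT max(x) FROM t) AS s;  upper project: s + s
    val collapsed = inline(Add(Attr("s"), Attr("s")),
      Map("s" -> Subquery("SELECT max(x) FROM t")))
    println(countSubqueries(collapsed)) // the subquery now appears twice
  }
}
```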


---




[GitHub] spark pull request #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicate...

2018-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23211#discussion_r240092936
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -267,6 +267,17 @@ object ScalarSubquery {
       case _ => false
     }.isDefined
   }
+
+  def hasScalarSubquery(e: Expression): Boolean = {
+    e.find {
+      case s: ScalarSubquery => true
+      case _ => false
+    }.isDefined
+  }
+
+  def hasScalarSubquery(e: Seq[Expression]): Boolean = {
+    e.find(hasScalarSubquery(_)).isDefined
--- End diff --

`e.exists(hasScalarSubquery)`
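The two forms are equivalent for any predicate, and `exists` short-circuits just as `find` does, so the suggested rewrite is shorter with no behavior change. A quick check with plain collections:

```scala
// find(p).isDefined and exists(p) agree for matching and non-matching cases.
object ExistsDemo {
  def main(args: Array[String]): Unit = {
    val xs = Seq(1, 2, 3)
    assert(xs.find(_ > 2).isDefined == xs.exists(_ > 2))   // both true
    assert(xs.find(_ > 9).isDefined == xs.exists(_ > 9))   // both false
    println("equivalent")
  }
}
```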


---




[GitHub] spark pull request #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicate...

2018-12-03 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/23211

[SPARK-19712][SQL] Move PullupCorrelatedPredicates and 
RewritePredicateSubquery after OptimizeSubqueries

Currently, predicate subqueries (IN/EXISTS) are converted to joins at the end of the optimizer, in RewritePredicateSubquery. This change moves the rewrite close to the beginning of the optimizer. The original idea was to keep the subquery expressions in Filter form so that we can push them down as deep as possible. One disadvantage is that, after the subqueries are rewritten into join form, they are not subjected to further optimizations. In this change, we convert the subqueries to join form early in the rewrite phase and then add logic to push the resulting left-semi and left-anti joins down, as we do for normal filter operations. I can think of the following advantages:

1. We will produce consistent optimized plans for subqueries written using the SQL dialect and the DataFrame APIs.
2. It will hopefully make the next phase of de-correlation easier when we open up more cases of de-correlation. In that case, it would be beneficial to expose the rewritten queries to all the other optimization rules.
3. We can now hopefully get rid of the PullupCorrelatedPredicates rule and combine it with RewritePredicateSubquery. I haven't tried it; I will take it up in a follow-up.

(P.S. Thanks to Natt for his original work [here](https://github.com/apache/spark/pull/17520). I have based this PR on his work.)
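The early rewrite described above can be sketched on a toy plan algebra (all names invented for illustration; this is not the actual Catalyst implementation): the EXISTS filter becomes a left-semi join, which later rules, the same ones that push plain filters and joins, can then keep pushing down.

```scala
// Toy sketch: rewrite Filter-with-EXISTS into a LeftSemi join early on.
sealed trait Plan
case class Relation(name: String) extends Plan
case class FilterExists(subquery: Plan, cond: String, child: Plan) extends Plan
case class Join(left: Plan, right: Plan, joinType: String, cond: String) extends Plan

object RewriteDemo {
  def rewrite(p: Plan): Plan = p match {
    // EXISTS keeps left rows that have at least one match: a LeftSemi join.
    case FilterExists(sub, cond, child) => Join(child, sub, "LeftSemi", cond)
    case other => other
  }

  def main(args: Array[String]): Unit = {
    println(rewrite(FilterExists(Relation("s"), "a = b", Relation("t"))))
  }
}
```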

## How was this patch tested?



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-19712-NEW

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23211.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23211


commit f4bb126472eb5a808a3ae94bcfb59e0674e01217
Author: Dilip Biswal 
Date:   2018-12-03T22:06:24Z

[SPARK-19712] Move PullupCorrelatedPredicates and RewritePredicateSubquery 
after OptimizeSubqueries




---
