[GitHub] spark pull request #20717: [SPARK-23564][SQL] Add isNotNull check for left a...

2018-03-02 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/20717

[SPARK-23564][SQL] Add isNotNull check for left anti and outer joins

## What changes were proposed in this pull request?

In order to optimize queries, some conditions can be added to the join 
condition for LEFT ANTI and OUTER joins. Unfortunately, so far this was not 
done since we are using only constraints which can be enforced on the output of 
the operator (in this case of the JOIN).

We can enforce some `isNotNull` conditions on one side, which are not valid 
conditions on the output of the Join, though. The PR adds these conditions in 
the `Optimizer` phase, in order to improve performance in some cases.

## How was this patch tested?

Added UTs

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-23564

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20717.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20717


commit 45fbb851e76eeaa45c9926571059274efca2441a
Author: Marco Gaido 
Date:   2018-03-02T16:27:18Z

[SPARK-23564][SQL] Add isNotNull check for left anti and outer joins




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20717: [SPARK-23564][SQL] Add isNotNull check for left a...

2018-03-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20717#discussion_r173269537
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -675,6 +676,22 @@ object InferFiltersFromConstraints extends 
Rule[LogicalPlan] with PredicateHelpe
   }
   if (newConditionOpt.isDefined) Join(left, right, joinType, 
newConditionOpt) else join
   }
+
+  /**
+   * Returns additional constraints which are not enforced on the result 
of join operations, but
+   * which can be enforced either on the left or the right side
--- End diff --

why not put this in `Join.validConstraints`? `LogicalPlan.constraints` 
should only contain constraints for the plab output, but 
`LogicalPlan.allConstraints` can contain more.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20717: [SPARK-23564][SQL] Add isNotNull check for left a...

2018-03-09 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20717#discussion_r173404206
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -675,6 +676,22 @@ object InferFiltersFromConstraints extends 
Rule[LogicalPlan] with PredicateHelpe
   }
   if (newConditionOpt.isDefined) Join(left, right, joinType, 
newConditionOpt) else join
   }
+
+  /**
+   * Returns additional constraints which are not enforced on the result 
of join operations, but
+   * which can be enforced either on the left or the right side
--- End diff --

I haven't put it there, because `constraints` is created from 
`allConstraints`, so adding them to `validConstraints` could have caused them 
to be part of `constraints` too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20717: [SPARK-23564][SQL] Add isNotNull check for left a...

2018-03-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20717#discussion_r173587545
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -675,6 +676,22 @@ object InferFiltersFromConstraints extends 
Rule[LogicalPlan] with PredicateHelpe
   }
   if (newConditionOpt.isDefined) Join(left, right, joinType, 
newConditionOpt) else join
   }
+
+  /**
+   * Returns additional constraints which are not enforced on the result 
of join operations, but
+   * which can be enforced either on the left or the right side
--- End diff --

ah i see the problem. For left-anti join, although `Join.output` reuse the 
attributes from left child output, they are actually different attributes, e.g. 
Join may output null values, so we can't generate these constraints in 
`Join.validConstraints`.

I think we can override both `allConstraints` and `constraints`, to make 
sure these extra constraints appear in `allConstraints`, but not `constraints`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20717: [SPARK-23564][SQL] Add isNotNull check for left a...

2018-03-21 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20717#discussion_r176318729
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala
 ---
@@ -91,7 +97,7 @@ trait QueryPlanConstraints { self: LogicalPlan =>
* Recursively explores the expressions which are null intolerant and 
returns all attributes
* in these expressions.
*/
-  private def scanNullIntolerantAttribute(expr: Expression): 
Seq[Attribute] = expr match {
+  protected def scanNullIntolerantAttribute(expr: Expression): 
Seq[Attribute] = expr match {
--- End diff --

Let keep this `private` because this is used only here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20717: [SPARK-23564][SQL] Add isNotNull check for left a...

2018-03-21 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20717#discussion_r176319670
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -341,6 +341,26 @@ case class Join(
 case UsingJoin(_, _) => false
 case _ => resolvedExceptNatural
   }
+
+  override protected def constructAllConstraints: Set[Expression] = {
+// additional constraints which are not enforced on the result of join 
operations, but can be
+// enforced either on the left or the right side
+val additionalConstraints = joinType match {
+  case LeftAnti | LeftOuter if condition.isDefined =>
+
splitConjunctivePredicates(condition.get).flatMap(inferIsNotNullConstraints).filter(
+  _.references.subsetOf(right.outputSet))
+  case RightOuter if condition.isDefined =>
+
splitConjunctivePredicates(condition.get).flatMap(inferIsNotNullConstraints).filter(
+  _.references.subsetOf(left.outputSet))
+  case _ => Seq.empty[Expression]
+}
+super.constructAllConstraints ++ additionalConstraints
+  }
+
+  override lazy val constraints: ExpressionSet = ExpressionSet(
+super.constructAllConstraints.filter { c =>
+  c.references.nonEmpty && c.references.subsetOf(outputSet) && 
c.deterministic
+})
--- End diff --

Could you add more test case for this code path?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20717: [SPARK-23564][SQL] Add isNotNull check for left a...

2018-03-22 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20717#discussion_r176323419
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/InferFiltersFromConstraintsSuite.scala
 ---
@@ -204,4 +204,40 @@ class InferFiltersFromConstraintsSuite extends 
PlanTest {
 val optimized = Optimize.execute(originalQuery)
 comparePlans(optimized, correctAnswer)
   }
+
+  test("SPARK-23564: left anti join should filter out null join keys on 
right side") {
+val x = testRelation.subquery('x)
+val y = testRelation.subquery('y)
+val condition = Some("x.a".attr === "y.a".attr)
+val originalQuery = x.join(y, LeftAnti, condition).analyze
+val left = x
+val right = y.where(IsNotNull('a))
+val correctAnswer = left.join(right, LeftAnti, condition).analyze
+val optimized = Optimize.execute(originalQuery)
+comparePlans(optimized, correctAnswer)
+  }
+
+  test("SPARK-23564: left outer join should filter out null join keys on 
right side") {
+val x = testRelation.subquery('x)
+val y = testRelation.subquery('y)
+val condition = Some("x.a".attr === "y.a".attr)
+val originalQuery = x.join(y, LeftOuter, condition).analyze
+val left = x
+val right = y.where(IsNotNull('a))
+val correctAnswer = left.join(right, LeftOuter, condition).analyze
+val optimized = Optimize.execute(originalQuery)
+comparePlans(optimized, correctAnswer)
+  }
+
+  test("SPARK-23564: right outer join should filter out null join keys on 
left side") {
+val x = testRelation.subquery('x)
+val y = testRelation.subquery('y)
+val condition = Some("x.a".attr === "y.a".attr)
+val originalQuery = x.join(y, RightOuter, condition).analyze
+val left = x.where(IsNotNull('a))
+val right = y
+val correctAnswer = left.join(right, RightOuter, condition).analyze
+val optimized = Optimize.execute(originalQuery)
+comparePlans(optimized, correctAnswer)
+  }
--- End diff --

Since this is a simple repetition of the previous `test("SPARK-23405: 
left-semi equal-join should filter out null join keys on both sides"`, what 
about making helper test function and simplify these together at this time?

```scala
  private def testConstraints(
  x: LogicalPlan, y: LogicalPlan, left: LogicalPlan, right: 
LogicalPlan, joinType: JoinType) = {
val condition = Some("x.a".attr === "y.a".attr)
val originalQuery = x.join(y, joinType, condition).analyze
val correctAnswer = left.join(right, joinType, condition).analyze
val optimized = Optimize.execute(originalQuery)
comparePlans(optimized, correctAnswer)
  }

  test("SPARK-23405: left-semi equal-join should filter out null join keys 
on both sides") {
val x = testRelation.subquery('x)
val y = testRelation.subquery('y)
testConstraints(x, y, x.where(IsNotNull('a)), y.where(IsNotNull('a)), 
LeftSemi)
  }

  test("SPARK-23564: left anti join should filter out null join keys on 
right side") {
val x = testRelation.subquery('x)
val y = testRelation.subquery('y)
testConstraints(x, y, x, y.where(IsNotNull('a)), LeftAnti)
  }

  test("SPARK-23564: left outer join should filter out null join keys on 
right side") {
val x = testRelation.subquery('x)
val y = testRelation.subquery('y)
testConstraints(x, y, x, y.where(IsNotNull('a)), LeftOuter)
  }

  test("SPARK-23564: right outer join should filter out null join keys on 
left side") {
val x = testRelation.subquery('x)
val y = testRelation.subquery('y)
testConstraints(x, y, x.where(IsNotNull('a)), y, RightOuter)
  }
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20717: [SPARK-23564][SQL] Add isNotNull check for left a...

2018-03-23 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20717#discussion_r176750470
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -341,6 +341,26 @@ case class Join(
 case UsingJoin(_, _) => false
 case _ => resolvedExceptNatural
   }
+
+  override protected def constructAllConstraints: Set[Expression] = {
+// additional constraints which are not enforced on the result of join 
operations, but can be
+// enforced either on the left or the right side
+val additionalConstraints = joinType match {
+  case LeftAnti | LeftOuter if condition.isDefined =>
+
splitConjunctivePredicates(condition.get).flatMap(inferIsNotNullConstraints).filter(
+  _.references.subsetOf(right.outputSet))
+  case RightOuter if condition.isDefined =>
+
splitConjunctivePredicates(condition.get).flatMap(inferIsNotNullConstraints).filter(
+  _.references.subsetOf(left.outputSet))
+  case _ => Seq.empty[Expression]
+}
+super.constructAllConstraints ++ additionalConstraints
+  }
+
+  override lazy val constraints: ExpressionSet = ExpressionSet(
+super.constructAllConstraints.filter { c =>
+  c.references.nonEmpty && c.references.subsetOf(outputSet) && 
c.deterministic
+})
--- End diff --

thanks, I added some statements to the `ConstraintPropagationSuite`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20717: [SPARK-23564][SQL] Add isNotNull check for left a...

2018-04-23 Thread mgaido91
Github user mgaido91 closed the pull request at:

https://github.com/apache/spark/pull/20717


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org