[GitHub] spark pull request #21852: [SPARK-24893] [SQL] Remove the entire CaseWhen if...
GitHub user dbtsai opened a pull request:

    https://github.com/apache/spark/pull/21852

[SPARK-24893] [SQL] Remove the entire CaseWhen if all the outputs are semantically equivalent

## What changes were proposed in this pull request?

Similar to SPARK-24890, if all the outputs of `CaseWhen` are semantically equivalent, the `CaseWhen` can be removed entirely.

## How was this patch tested?

Tests added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark short-circuit-when

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21852.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21852

commit dc8de5fec6efc723ce01bcbaf0496c57b62e0ba2
Author: DB Tsai
Date: 2018-07-23T19:15:58Z

    remove casewhen if possible

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
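The proposed rule can be illustrated with a toy model. This is a sketch only, not the actual Catalyst code: `Branch`, `CaseWhen`, and `simplify` below are hypothetical stand-ins, and `semanticEquals` is approximated by plain value equality.

```scala
// Toy model (NOT the Catalyst classes) of the proposed optimization:
// when every branch value is semantically equal to the else value, the
// whole CaseWhen collapses to that value.
case class Branch(cond: String, value: String, deterministic: Boolean = true)
case class CaseWhen(branches: Seq[Branch], elseValue: String)

// Right(value) when the whole CaseWhen can be removed; otherwise Left with
// the trailing deterministic branches pruned. Non-deterministic conds, and
// every cond gating whether they run, must stay in place and in order.
def simplify(e: CaseWhen): Either[CaseWhen, String] =
  if (e.branches.forall(_.value == e.elseValue)) {
    // keep every branch up to (and including) the last non-deterministic cond
    val keep = e.branches.lastIndexWhere(!_.deterministic) + 1
    if (keep == 0) Right(e.elseValue)
    else Left(CaseWhen(e.branches.take(keep), e.elseValue))
  } else Left(e)
```

With all-deterministic conditions the entire expression reduces to the else value; a non-deterministic condition keeps itself and every branch before it, which is the behavior the review converges on below.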
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r205303069

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,22 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if {
    +      // With previous rules, it's guaranteed that there must be one branch.

--- End diff --

Is this comment correct?
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r205303174

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala ---

    @@ -122,4 +126,25 @@ class SimplifyConditionalSuite extends PlanTest with PredicateHelper {
             None),
           CaseWhen(normalBranch :: trueBranch :: Nil, None))
       }
    +
    +  test("remove entire CaseWhen if all the outputs are semantic equivalence") {

--- End diff --

We may need a test case that includes a non-deterministic cond.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r205305691

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala ---

    @@ -122,4 +126,25 @@ class SimplifyConditionalSuite extends PlanTest with PredicateHelper {
             None),
           CaseWhen(normalBranch :: trueBranch :: Nil, None))
       }
    +
    +  test("remove entire CaseWhen if all the outputs are semantic equivalence") {

--- End diff --

Yes, I plan to add a couple more tests tonight.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r205306098

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,22 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if {
    +      // With previous rules, it's guaranteed that there must be one branch.

--- End diff --

You're right. I removed the comment. Thanks.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r205309619

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,21 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if {
    +      val list = branches.map(_._2) :+ elseValue
    +      list.tail.forall(list.head.semanticEquals)
    +    } =>
    +      // For non-deterministic conditions with side effect, we can not remove it.
    +      // Since the output of all the branches are semantic equivalence, `elseValue`
    +      // is picked for all the branches.
    +      val newBranches = branches.map(_._1).filter(!_.deterministic).map(cond => (cond, elseValue))

--- End diff --

All the conds must be deterministic; otherwise, a non-deterministic cond that was not run before could be run after this rule is applied.
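The concern above can be sketched with a hypothetical side-effecting condition: how many times a condition is evaluated is observable behavior, so a cond that used to run cannot be silently dropped or reordered. All names here are illustrative, not Catalyst APIs.

```scala
// Sketch (hypothetical names) of why non-deterministic / side-effecting
// conds cannot be pruned: the number of evaluations is observable.
var evals = 0
def noisyCond(): Boolean = { evals += 1; false }

// Naive CASE WHEN evaluation: conds run in order until one is true.
def evalCase(branches: Seq[(() => Boolean, Int)], elseValue: Int): Int =
  branches.find(_._1()).map(_._2).getOrElse(elseValue)

// Both branches and the else value produce 42, yet noisyCond must still run.
val result = evalCase(Seq((() => noisyCond(), 42), (() => true, 42)), 42)
```

Dropping the `noisyCond` branch would leave `evals` at 0, changing the program's observable effects even though the returned value is unchanged.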
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r205599224

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,29 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if {
    +      val list = branches.map(_._2) :+ elseValue
    +      list.tail.forall(list.head.semanticEquals)
    +    } =>
    +      // For non-deterministic conditions with side effect, we can not remove it, or change
    +      // the ordering. As a result, we try to remove the deterministic conditions from the tail.
    +      val newBranches = branches.map(_._1)

--- End diff --

@viirya I think this can address your concern. Thanks.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r205924448

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,23 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if {

--- End diff --

Can we apply this optimization when there is no elseValue? And add one more case to remove `CaseWhen` if it only has one branch and its value is the same as the elseValue.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r205946975

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,23 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if {

--- End diff --

We cannot. When there is no `elseValue`, all the conditions are required to be evaluated before hitting the default `elseValue`, which is `null`.
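This point can be made concrete with a small sketch: with no ELSE, SQL's implicit default is `NULL` (modeled here as `None`), so even when every branch value is identical the result still depends on whether any condition matches. `evalNoElse` is a hypothetical name, not a Catalyst API.

```scala
// With no ELSE, the implicit default is NULL (modeled as None), so even when
// every branch value is identical, the result still depends on the conditions.
def evalNoElse(branches: Seq[(Boolean, Int)]): Option[Int] =
  branches.find(_._1).map(_._2)
```

When no condition is true the expression is `NULL`, not the shared branch value, so the rewrite would be incorrect in that case.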
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r206000642

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,23 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if {

--- End diff --

Ah, I see. Another optimization is: we can remove branches that have the same condition. We can do it in the next PR.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r206000777

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,23 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if {
    +      val values = branches.map(_._2) :+ elseValue
    +      values.tail.forall(values.head.semanticEquals)

--- End diff --

I think the case is: remove branches that have the same value as `else`:

```
branches.exists(_._2.semanticEquals(elseValue))
```
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r206000924

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,23 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if {

--- End diff --

@cloud-fan It sounds like what #21904 proposes?
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r206266243

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,23 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if {
    +      val values = branches.map(_._2) :+ elseValue
    +      values.tail.forall(values.head.semanticEquals)

--- End diff --

I think what you suggested is the same as the implemented code in the PR, and `elseValue.deterministic` is not required since it is checked in `semanticEquals`. That being said, the whole condition can be replaced by `branches.exists(_._2.semanticEquals(elseValue))`.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r206271589

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,23 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if {
    +      val values = branches.map(_._2) :+ elseValue
    +      values.tail.forall(values.head.semanticEquals)

--- End diff --

I replaced the cond with `branches.forall(_._2.semanticEquals(elseValue))`, which is simpler.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r206499326

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,21 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if branches
    +      .forall(_._2.semanticEquals(elseValue)) =>

--- End diff --

code style nit:

```
case e @ CaseWhen(branches, Some(elseValue))
    if branches.forall(_._2.semanticEquals(elseValue))
```
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r206500527

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,21 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if branches
    +      .forall(_._2.semanticEquals(elseValue)) =>
    +      // For non-deterministic conditions with side effect, we can not remove it, or change
    +      // the ordering. As a result, we try to remove the deterministic conditions from the tail.

--- End diff --

I think it's more readable to write Java-style code here:

```
var hitNonDetermin = false
var i = branches.length
while (i > 0 && !hitNonDetermin) {
  hitNonDetermin = !branches(i - 1).deterministic
  i -= 1
}
if (i == 0) {
  elseValue
} else {
  e.copy(branches = branches.take(i))
}
```
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r206695251

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,21 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue)) if branches
    +      .forall(_._2.semanticEquals(elseValue)) =>
    +      // For non-deterministic conditions with side effect, we can not remove it, or change
    +      // the ordering. As a result, we try to remove the deterministic conditions from the tail.

--- End diff --

Should be:

```scala
hitNonDetermin = !branches(i - 1).deterministic
if (!hitNonDetermin) {
  i -= 1
}
```

Personally, I like the functional style more, but it's more efficient to use the Java style here. I updated as you suggested.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21852#discussion_r206739216

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---

    @@ -416,6 +416,24 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper {
           // these branches can be pruned away
           val (h, t) = branches.span(_._1 != TrueLiteral)
           CaseWhen( h :+ t.head, None)
    +
    +    case e @ CaseWhen(branches, Some(elseValue))
    +      if branches.forall(_._2.semanticEquals(elseValue)) =>
    +      // For non-deterministic conditions with side effect, we can not remove it, or change
    +      // the ordering. As a result, we try to remove the deterministic conditions from the tail.
    +      var hitNonDeterministicCond = false
    +      var i = branches.length
    +      while (i > 0 && !hitNonDeterministicCond) {
    +        hitNonDeterministicCond = !branches(i - 1)._1.deterministic
    +        if (!hitNonDeterministicCond) {

--- End diff --

nit: we can avoid this per-iteration `if` check by updating the final step:

```
if (i == 0 && !hitNonDeterministicCond) {
  elseValue
} else {
  e.copy(branches = branches.take(i + 1).map(branch => (branch._1, elseValue)))
}
```

feel free to change it in your next PR.
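For comparison, the backwards scan discussed above can also be expressed functionally. This is a sketch over a plain `Seq[Boolean]` of per-branch determinism flags, not the Catalyst types; `branchesToKeep` is a hypothetical name.

```scala
// Functional sketch of the backwards scan: count how many leading branches
// must survive, i.e. everything up to and including the last branch whose
// condition is non-deterministic. A result of 0 means no branch needs to be
// kept and the whole CaseWhen can be replaced by the else value.
def branchesToKeep(condIsDeterministic: Seq[Boolean]): Int =
  condIsDeterministic.lastIndexWhere(det => !det) + 1
```

`lastIndexWhere` returns -1 when every condition is deterministic, so the `+ 1` maps "nothing to keep" to 0, mirroring the `i == 0` case in the imperative version.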
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21852