[GitHub] [spark] viirya commented on a change in pull request #32699: [SPARK-35560][SQL] Remove redundant subexpression evaluation in nested subexpressions

2021-06-01 Thread GitBox


viirya commented on a change in pull request #32699:
URL: https://github.com/apache/spark/pull/32699#discussion_r643254798



##
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
##
@@ -2882,6 +2882,31 @@ class DataFrameSuite extends QueryTest
     df2.collect()
     assert(accum.value == 15)
   }
+
+  test("SPARK-35560: Remove redundant subexpression evaluation in nested subexpressions") {
+    Seq(1, Int.MaxValue).foreach { splitThreshold =>
+      withSQLConf(SQLConf.CODEGEN_METHOD_SPLIT_THRESHOLD.key -> splitThreshold.toString) {
+        val accum = sparkContext.longAccumulator("call")
+        val simpleUDF = udf((s: String) => {
+          accum.add(1)
+          s
+        })
+
+        // Common exprs:
+        //  1. simpleUDF($"id")
+        //  2. functions.length(simpleUDF($"id"))

Review comment:
   Yes, this is actually the case this logic deals with. The gen-ed code of previous common expressions is put into the map. When generating code for later common expressions, the code generator looks up the map and replaces any semantically equal expression with the already gen-ed code value.
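   Roughly, a minimal Scala sketch of that reuse (not the real `CodegenContext` API; `GeneratedCode`, `genedSubExprs` and `generate` here are made-up stand-ins):
   
   ```scala
   import scala.collection.mutable
   
   // Stand-in for the gen-ed isNull/value pair of a subexpression.
   case class GeneratedCode(isNull: String, value: String)
   
   // Map from (canonicalized) expression to the code already generated for it.
   val genedSubExprs = mutable.HashMap.empty[String, GeneratedCode]
   
   // Generate code for `expr`, reusing the gen-ed code if a semantically
   // equal expression was generated before.
   def codeFor(expr: String)(generate: String => GeneratedCode): GeneratedCode =
     genedSubExprs.getOrElseUpdate(expr, generate(expr))
   ```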
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #32699: [SPARK-35560][SQL] Remove redundant subexpression evaluation in nested subexpressions

2021-05-31 Thread GitBox


viirya commented on a change in pull request #32699:
URL: https://github.com/apache/spark/pull/32699#discussion_r642617321



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
##
@@ -1039,21 +1039,25 @@ class CodegenContext extends Logging {
   def subexpressionEliminationForWholeStageCodegen(expressions: Seq[Expression]): SubExprCodes = {
     // Create a clear EquivalentExpressions and SubExprEliminationState mapping
     val equivalentExpressions: EquivalentExpressions = new EquivalentExpressions
-    val localSubExprEliminationExprs = mutable.HashMap.empty[Expression, SubExprEliminationState]
+    val localSubExprEliminationExprsForNonSplit =
+      mutable.HashMap.empty[Expression, SubExprEliminationState]
 
     // Add each expression tree and compute the common subexpressions.
     expressions.foreach(equivalentExpressions.addExprTree(_))
 
     // Get all the expressions that appear at least twice and set up the state for subexpression
     // elimination.
     val commonExprs = equivalentExpressions.getAllEquivalentExprs(1)
-    lazy val commonExprVals = commonExprs.map(_.head.genCode(this))
 
-    lazy val nonSplitExprCode = {
-      commonExprs.zip(commonExprVals).map { case (exprs, eval) =>
-        // Generate the code for this expression tree.
-        val state = SubExprEliminationState(eval.isNull, eval.value)
-        exprs.foreach(localSubExprEliminationExprs.put(_, state))
+    val nonSplitExprCode = {
+      commonExprs.map { exprs =>
+        val eval = withSubExprEliminationExprs(localSubExprEliminationExprsForNonSplit.toMap) {

Review comment:
   For each set of common expressions, `withSubExprEliminationExprs` is only called once, so I think it is not actually a recursive call?
   
   `withSubExprEliminationExprs` takes the given map used for subexpression elimination and replaces common expressions with their gen-ed code during expression codegen inside the closure. It returns the evaluated expression code (value/isNull/code).
   
   Taking the two subexpressions as an example:
   
   1. `simpleUDF($"id")`
   2. `functions.length(simpleUDF($"id"))`
   
   1st round of `withSubExprEliminationExprs`:
   
   The map is empty.
   Generate code for `simpleUDF($"id")`.
   Put it into the map => (`simpleUDF($"id")` -> gen-ed code)
   
   2nd round of `withSubExprEliminationExprs`:
   
   Generate code for `functions.length(simpleUDF($"id"))`.
   Look up the map and replace the common expression `simpleUDF($"id")` with its gen-ed code.
   Put it into the map => (`simpleUDF($"id")` -> gen-ed code, `functions.length(simpleUDF($"id"))` -> gen-ed code)
   
   The map will be used later for subexpression elimination.
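   A rough Scala sketch of these two rounds (strings stand in for expression trees, and `genCode` here is a made-up textual rewrite rather than the real `CodegenContext` API):
   
   ```scala
   import scala.collection.mutable
   
   // Stand-in for SubExprEliminationState: just the value holding the result.
   case class SubExprState(value: String)
   
   val subExprStates = mutable.LinkedHashMap.empty[String, SubExprState]
   
   // Hypothetical codegen step: replace any known common subexpression with its
   // already gen-ed value, then "evaluate" the rewritten expression.
   def genCode(expr: String, known: Map[String, SubExprState]): SubExprState = {
     val rewritten = known.foldLeft(expr) { case (e, (sub, st)) => e.replace(sub, st.value) }
     SubExprState(s"eval($rewritten)")
   }
   
   val commonExprs = Seq("""simpleUDF($"id")""", """functions.length(simpleUDF($"id"))""")
   
   commonExprs.foreach { expr =>
     // Round 1 sees an empty map; round 2 sees the entry added by round 1.
     val state = genCode(expr, subExprStates.toMap)
     subExprStates.put(expr, state)
   }
   // After round 1: simpleUDF($"id") -> eval(simpleUDF($"id"))
   // After round 2: functions.length(...) -> eval(functions.length(eval(simpleUDF($"id"))))
   ```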
   
   
   
   
   
   
   







[GitHub] [spark] viirya commented on a change in pull request #32699: [SPARK-35560][SQL] Remove redundant subexpression evaluation in nested subexpressions

2021-05-31 Thread GitBox


viirya commented on a change in pull request #32699:
URL: https://github.com/apache/spark/pull/32699#discussion_r642314197



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
##
@@ -1068,11 +1072,19 @@ class CodegenContext extends Logging {
     }.unzip
 
     val splitThreshold = SQLConf.get.methodSplitThreshold
-    val codes = if (commonExprVals.map(_.code.length).sum > splitThreshold) {
+
+    val (codes, subExprsMap, exprCodes) = if (nonSplitExprCode.map(_.length).sum > splitThreshold) {
       if (inputVarsForAllFuncs.map(calculateParamLengthFromExprValues).forall(isValidParamLength)) {
-        commonExprs.zipWithIndex.map { case (exprs, i) =>
+        val localSubExprEliminationExprs =
+          mutable.HashMap.empty[Expression, SubExprEliminationState]
+
+        val splitCodes = commonExprs.zipWithIndex.map { case (exprs, i) =>
           val expr = exprs.head
-          val eval = commonExprVals(i)
+          val eval = withSubExprEliminationExprs(localSubExprEliminationExprs.toMap) {
+            Seq(expr.genCode(this))
+          }.head
+
+          val value = addMutableState(javaType(expr.dataType), "subExprValue")

Review comment:
   Let me use the example in the description to explain. For the two subexpressions:
   
   1. `simpleUDF($"id")`
   2. `functions.length(simpleUDF($"id"))`
   
   Previously we evaluated them independently, i.e.,
   
   ```
   String subExpr1 = simpleUDF($"id");
   Int subExpr2 = functions.length(simpleUDF($"id"));
   ```
   
   Now we remove redundant evaluation of nested subexpressions:
   ```
   String subExpr1 = simpleUDF($"id");
   Int subExpr2 = functions.length(subExpr1);
   ```
   
   If we need to split the functions, then when we evaluate `functions.length`, it needs access to `subExpr1`. We have two choices: one is to add `subExpr1` to the parameter list of the split function for `functions.length`; the other is to use mutable state.
   
   Adding it to the parameter list would complicate the way we compute parameter length; that is, we would need to track the relations between nested subexpressions to derive the correct parameters. That does not seem worth doing to me.
   
   Currently I choose the simpler approach, which is to use mutable state, as sketched below.
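   A hedged sketch in plain Scala (rather than the actual generated Java; class and method names here are illustrative only) of how mutable state lets the second split function read `subExpr1` without it appearing in the parameter list:
   
   ```scala
   class GeneratedLike(simpleUDF: String => String) {
     // Mutable state: written by the first split method, read by the second,
     // so subExpr1 never needs to be threaded through subExpr2's parameters.
     private var subExprValue1: String = _
   
     private def subExpr1(id: String): Unit = {
       subExprValue1 = simpleUDF(id)
     }
   
     private def subExpr2(): Int = {
       // Reuses the stored result instead of re-evaluating simpleUDF(id).
       subExprValue1.length
     }
   
     def eval(id: String): (String, Int) = {
       subExpr1(id)
       (subExprValue1, subExpr2())
     }
   }
   ```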
   
   
   



