viirya commented on a change in pull request #32699:
URL: https://github.com/apache/spark/pull/32699#discussion_r642314197



##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
##########
@@ -1068,11 +1072,19 @@ class CodegenContext extends Logging {
     }.unzip
 
     val splitThreshold = SQLConf.get.methodSplitThreshold
-    val codes = if (commonExprVals.map(_.code.length).sum > splitThreshold) {
+
+    val (codes, subExprsMap, exprCodes) = if (nonSplitExprCode.map(_.length).sum > splitThreshold) {
       if (inputVarsForAllFuncs.map(calculateParamLengthFromExprValues).forall(isValidParamLength)) {
-        commonExprs.zipWithIndex.map { case (exprs, i) =>
+        val localSubExprEliminationExprs =
+          mutable.HashMap.empty[Expression, SubExprEliminationState]
+
+        val splitCodes = commonExprs.zipWithIndex.map { case (exprs, i) =>
           val expr = exprs.head
-          val eval = commonExprVals(i)
+          val eval = withSubExprEliminationExprs(localSubExprEliminationExprs.toMap) {
+            Seq(expr.genCode(this))
+          }.head
+
+          val value = addMutableState(javaType(expr.dataType), "subExprValue")

Review comment:
   Let me use the example in the PR description to explain. Consider the two
   subexpressions:
   
   1. `simpleUDF($"id")`
   2. `functions.length(simpleUDF($"id"))`
   
   
   Previously we evaluated them independently, i.e.,
   
   ```
   String subExpr1 = simpleUDF($"id");
   Int subExpr2 = functions.length(simpleUDF($"id"));
   ```
   
   Now we remove redundant evaluation of nested subexpressions:
   ```
   String subExpr1 = simpleUDF($"id");
   Int subExpr2 = functions.length(subExpr1);
   ```
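
To make the difference concrete, here is a small self-contained Java sketch. It is not the generated code: `simpleUDF` is a hypothetical stand-in that just counts its own invocations, and `countUdfCalls` is a helper invented for this illustration.

```java
class NestedSubExprDemo {
    // Number of times the stand-in UDF body has run.
    static int udfCalls = 0;

    // Hypothetical stand-in for simpleUDF($"id"); the real UDF is user-defined.
    static String simpleUDF(long id) {
        udfCalls++;
        return "id-" + id;
    }

    // Evaluates both subexpressions with one of the two strategies
    // and returns how many UDF calls that strategy needed.
    static int countUdfCalls(boolean reuseNested) {
        udfCalls = 0;
        long id = 42L;
        String subExpr1 = simpleUDF(id);
        int subExpr2 = reuseNested
            ? subExpr1.length()        // now: reuse the already computed subExpr1
            : simpleUDF(id).length();  // previously: evaluate the UDF a second time
        // Both strategies must produce the same value for subExpr2.
        if (subExpr2 != subExpr1.length()) {
            throw new IllegalStateException("strategies disagree");
        }
        return udfCalls;
    }

    public static void main(String[] args) {
        System.out.println("independent evaluation: " + countUdfCalls(false) + " UDF calls");
        System.out.println("nested reuse:           " + countUdfCalls(true) + " UDF call");
    }
}
```

Independent evaluation runs the UDF twice per row, while reusing `subExpr1` runs it once.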
   
   If we need to split the functions, then evaluating `functions.length` in its
   own split function requires access to `subExpr1`. We have two choices. One is
   to add `subExpr1` to the parameter list of the split function for
   `functions.length`. The other is to use mutable state.
   
   Adding it to the parameter list would complicate how we compute the parameter
   length. That is, we would need to track the relations between nested
   subexpressions in order to derive the correct parameters. It seems to me that
   this is not worth doing.
   
   For now I chose the simpler approach, which is to use mutable state.
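
The two choices can be sketched in plain Java as follows. This is only an illustration under assumptions, not what the code generator actually emits: the class names, the method names, and the `"id-" + id` stand-in for the UDF are all hypothetical.

```java
class SplitFunctionDemo {

    // Choice 1: pass subExpr1 to the split function as a parameter.
    // Every nested dependency becomes an extra parameter that has to be
    // counted when checking the parameter length of the split function.
    static class WithParams {
        String subExpr1(long id) {
            return "id-" + id; // stand-in for simpleUDF(id)
        }

        int subExpr2(String subExpr1) {
            return subExpr1.length();
        }

        int eval(long id) {
            String v1 = subExpr1(id);
            return subExpr2(v1);
        }
    }

    // Choice 2: keep the results in mutable state (instance fields).
    // The split functions take no extra parameters for nested results.
    static class WithMutableState {
        private String subExprValue1; // plays the role of a field added via addMutableState
        private int subExprValue2;

        private void subExpr1(long id) {
            subExprValue1 = "id-" + id; // stand-in for simpleUDF(id)
        }

        private void subExpr2() {
            subExprValue2 = subExprValue1.length(); // reads the field, not a parameter
        }

        int eval(long id) {
            subExpr1(id); // must run before subExpr2, which depends on its result
            subExpr2();
            return subExprValue2;
        }
    }

    public static void main(String[] args) {
        System.out.println(new WithParams().eval(42L));
        System.out.println(new WithMutableState().eval(42L));
    }
}
```

Both variants compute the same value. The mutable-state variant trades an extra field per subexpression for split functions whose signatures do not depend on how the subexpressions are nested.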
   
   
   






