[ https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wan Kun updated SPARK-42551: ---------------------------- Description: h1. *Design Sketch* h2. How to support more subexpressions elimination cases * Get all common expressions from input expressions of the current physical operator to current CodeGenContext. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit* to indicate whether it has already been evaluated. * Add a new code block in CodeGenSupport trait, and reset those *subExprInit* variables to *false* before the physical operators begin to evaluate the input row. * Add a new wrapper subExpr function for each common subexpression. |private void subExpr_n(${argList}) { if (!subExprInit) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } }| h1. ** * When performing gen code of the input expression, if the input expression is in the common expressions of the current CodeGenContext, the corresponding subExpr function will be called. After the first function call, *subExprInit* will be set to true, and the subsequent function calls will be skipped. h2. Why should we support whole-stage subexpression elimination Right now each spark physical operator shares nothing but the input row, so the same expressions may be evaluated multiple times across different operators. For example, the expression udf(c1, c2) in plan Project [udf(c1, c2)] - Filter [udf(c1, c2) > 0] - Relation will be evaluated both in Project and Filter operators. We can reuse the expression results across different operators such as Project and Filter. h2. How to support whole-stage subexpression elimination * Add two properties in CodegenSupport trait, the reusable expressions and the the output attributes, we can reuse the expression results only if the output attributes are the same. * Visit all operators from top to bottom, bound the candidate expressions with the output attributes and add to the current candidate reusable expressions. * Visit all operators from bottom to top, collect all the common expressions to the current operator, and add the initialize code to the current operator if the common expressions have not been initialized. * Replace the common expressions code when generating codes for the physical operators. h1. *New support subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* {code:java} SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v) {code} We can reuse the result of expression *v + 1* {code:java} SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c * h2. *Support subexpression elimination in FilterExec* {code:java} SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10 {code} We can reuse the result of expression *v* * *v* *+* *1* * h2. *Support subexpression elimination in JoinExec* {code:java} SELECT * FROM values(1, 1) as t1(a, b) join values(1, 2) as t2(x, y) ON b * y between 2 and 3{code} We can reuse the result of expression *b* * *y* * h2. *Support subexpression elimination in ExpandExec* {code:java} SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c was: h1. *Design Sketch* * Get all common expressions from input expressions. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression. * For each common expression: * Add a new boolean variable *subExprInit_n* to indicate whether we have already evaluated the common expression, and reset it to *false* at the start of operator.consume() * Add a new wrapper subExpr function for common subexpression. {code:java} private void subExpr_n(${argList.mkString(", ")}) { if (!subExprInit_n) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } } {code} * Replace all the common subexpression with the wrapper function *subExpr_n(argList)*. h1. *New support subexpression elimination patterns* * h2. *Support subexpression elimination with conditional expressions* {code:java} SELECT case when v + 2 > 1 then 1 when v + 1 > 2 then 2 when v + 1 > 3 then 3 END vv FROM values(1) as t2(v) {code} We can reuse the result of expression *v + 1* {code:java} SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c * h2. *Support subexpression elimination in FilterExec* {code:java} SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10 {code} We can reuse the result of expression *v* * *v* *+* *1* * h2. *Support subexpression elimination in JoinExec* {code:java} SELECT * FROM values(1, 1) as t1(a, b) join values(1, 2) as t2(x, y) ON b * y between 2 and 3{code} We can reuse the result of expression *b* * *y* * h2. *Support subexpression elimination in ExpandExec* {code:java} SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a {code} We can reuse the result of expression b + c > Support more subexpression elimination cases > -------------------------------------------- > > Key: SPARK-42551 > URL: https://issues.apache.org/jira/browse/SPARK-42551 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.3.2 > Reporter: Wan Kun > Priority: Major > > h1. *Design Sketch* > h2. How to support more subexpressions elimination cases > * Get all common expressions from input expressions of the current physical > operator to current CodeGenContext. Recursively visits all subexpressions > regardless of whether the current expression is a conditional expression. > * For each common expression: > * Add a new boolean variable *subExprInit* to indicate whether it has > already been evaluated. > * Add a new code block in CodeGenSupport trait, and reset those > *subExprInit* variables to *false* before the physical operators begin to > evaluate the input row. > * Add a new wrapper subExpr function for each common subexpression. > |private void subExpr_n(${argList}) { > if (!subExprInit) { > ${eval.code} > subExprInit_n = true; > subExprIsNull_n = ${eval.isNull}; > subExprValue_n = ${eval.value}; > } > }| > h1. > ** > * When performing gen code of the input expression, if the input expression > is in the common expressions of the current CodeGenContext, the corresponding > subExpr function will be called. After the first function call, *subExprInit* > will be set to true, and the subsequent function calls will be skipped. > h2. Why should we support whole-stage subexpression elimination > Right now each spark physical operator shares nothing but the input row, so > the same expressions may be evaluated multiple times across different > operators. For example, the expression udf(c1, c2) in plan Project [udf(c1, > c2)] - Filter [udf(c1, c2) > 0] - Relation will be evaluated both in Project > and Filter operators. We can reuse the expression results across different > operators such as Project and Filter. > h2. How to support whole-stage subexpression elimination > * Add two properties in CodegenSupport trait, the reusable expressions and > the the output attributes, we can reuse the expression results only if the > output attributes are the same. > * Visit all operators from top to bottom, bound the candidate expressions > with the output attributes and add to the current candidate reusable > expressions. > * Visit all operators from bottom to top, collect all the common expressions > to the current operator, and add the initialize code to the current operator > if the common expressions have not been initialized. > * Replace the common expressions code when generating codes for the > physical operators. > h1. *New support subexpression elimination patterns* > * > h2. *Support subexpression elimination with conditional expressions* > {code:java} > SELECT case when v + 2 > 1 then 1 > when v + 1 > 2 then 2 > when v + 1 > 3 then 3 END vv > FROM values(1) as t2(v) > {code} > We can reuse the result of expression *v + 1* > {code:java} > SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) > min_bc > FROM values(1, 1, 1) as t(a, b, c) > GROUP BY a > {code} > We can reuse the result of expression b + c > * > h2. *Support subexpression elimination in FilterExec* > > {code:java} > SELECT * FROM ( > SELECT v * v + 1 v1 from values(1) as t2(v) > ) t > where v1 > 5 and v1 < 10 > {code} > We can reuse the result of expression *v* * *v* *+* *1* > * > h2. *Support subexpression elimination in JoinExec* > > {code:java} > SELECT * > FROM values(1, 1) as t1(a, b) > join values(1, 2) as t2(x, y) > ON b * y between 2 and 3{code} > > We can reuse the result of expression *b* * *y* > * > h2. *Support subexpression elimination in ExpandExec* > {code:java} > SELECT a, count(b), > count(distinct case when b > 1 then b + c else null end) as count_bc_1, > count(distinct case when b < 0 then b + c else null end) as count_bc_2 > FROM values(1, 1, 1) as t(a, b, c) > GROUP BY a > {code} > We can reuse the result of expression b + c -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org