[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases

Wan Kun (Jira) Mon, 08 May 2023 03:23:07 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-42551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wan Kun updated SPARK-42551:
----------------------------
    Description: 
h1. *Design Sketch*
h2. How to support more subexpressions elimination cases
 * Get all common expressions from input expressions of the current physical 
operator to current CodeGenContext. Recursively visits all subexpressions 
regardless of whether the current expression is a conditional expression.
 * For each common expression:
 ** Add a new boolean variable *subExprInit* to indicate whether it has  
already been evaluated. 
 ** Add a new code block in CodeGenSupport trait, and reset those *subExprInit* 
variables to *false* before the physical operators begin to evaluate the input 
row.
 ** Add a new wrapper subExpr function for each common subexpression.

|private void subExpr_n(${argList}) {
 if (!subExprInit) {
   ${eval.code}
   subExprInit_n = true;
   subExprIsNull_n = ${eval.isNull};
   subExprValue_n = ${eval.value};
 }
}|

 
 * When generating the input expression code,  if the input expression is a 
common expression, the expression code will be replaced with the corresponding 
subExpr function. When the subExpr function is called for the first time, 
*subExprInit* will be set to true, and the subsequent function calls will do 
nothing.

h2. Why should we support whole-stage subexpression elimination

Right now each spark physical operator shares nothing but the input row, so the 
same expressions may be evaluated multiple times across different operators. 
For example, the expression udf(c1, c2) in plan Project [udf(c1, c2)] - Filter 
[udf(c1, c2) > 0] - Relation will be evaluated both in Project and Filter 
operators.  We can reuse the expression results across different operators such 
as Project and Filter.
h2. How to support whole-stage subexpression elimination
 * Add two properties in CodegenSupport trait, the reusable expressions and the 
the output attributes, we can reuse the expression results only if the output 
attributes are the same.
 * Visit all operators from top to bottom, bound the candidate expressions with 
the output attributes and add to the current candidate reusable expressions.
 * Visit all operators from bottom to top, collect all the common expressions 
to the current operator, and add the initialize code to the current operator if 
the common expressions have not been initialized.
 * Replace the common expressions code when generating codes for  the physical 
operators.

h1. *New support subexpression elimination patterns*
 * 
h2. *Support subexpression elimination with conditional expressions*

{code:java}
SELECT case when v + 2 > 1 then 1
            when v + 1 > 2 then 2
            when v + 1 > 3 then 3 END vv
FROM values(1) as t2(v)
{code}
We can reuse the result of expression  *v + 1*
{code:java}
SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a
{code}
We can reuse the result of expression  b + c
 * 
h2. *Support subexpression elimination in FilterExec*

 
{code:java}
SELECT * FROM (
  SELECT v * v + 1 v1 from values(1) as t2(v)
) t
where v1 > 5 and v1 < 10
{code}
We can reuse the result of expression  *v* * *v* *+* *1*
 * 
h2. *Support subexpression elimination in JoinExec*

 
{code:java}
SELECT * 
FROM values(1, 1) as t1(a, b) 
join values(1, 2) as t2(x, y)

ON b * y between 2 and 3{code}
 
We can reuse the result of expression  *b* * *y*
 * 
h2. *Support subexpression elimination in ExpandExec*

{code:java}
SELECT a, count(b),
        count(distinct case when b > 1 then b + c else null end) as count_bc_1,
        count(distinct case when b < 0 then b + c else null end) as count_bc_2
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a
{code}
We can reuse the result of expression  b + c

  was:
h1. *Design Sketch*
h2. How to support more subexpressions elimination cases
 * Get all common expressions from input expressions of the current physical 
operator to current CodeGenContext. Recursively visits all subexpressions 
regardless of whether the current expression is a conditional expression.
 * For each common expression:
 * Add a new boolean variable *subExprInit* to indicate whether it has  already 
been evaluated. 
 * Add a new code block in CodeGenSupport trait, and reset those *subExprInit* 
variables to *false* before the physical operators begin to evaluate the input 
row.
 * Add a new wrapper subExpr function for each common subexpression.

|private void subExpr_n(${argList}) {
 if (!subExprInit) {
   ${eval.code}
   subExprInit_n = true;
   subExprIsNull_n = ${eval.isNull};
   subExprValue_n = ${eval.value};
 }
}|

 
 * When performing gen code of the input expression,  if the input expression 
is in the common expressions of the current CodeGenContext, the corresponding 
subExpr function will be called. After the first function call, *subExprInit* 
will be set to true, and the subsequent function calls will be skipped.

h2. Why should we support whole-stage subexpression elimination

Right now each spark physical operator shares nothing but the input row, so the 
same expressions may be evaluated multiple times across different operators. 
For example, the expression udf(c1, c2) in plan Project [udf(c1, c2)] - Filter 
[udf(c1, c2) > 0] - Relation will be evaluated both in Project and Filter 
operators.  We can reuse the expression results across different operators such 
as Project and Filter.
h2. How to support whole-stage subexpression elimination
 * Add two properties in CodegenSupport trait, the reusable expressions and the 
the output attributes, we can reuse the expression results only if the output 
attributes are the same.
 * Visit all operators from top to bottom, bound the candidate expressions with 
the output attributes and add to the current candidate reusable expressions.
 * Visit all operators from bottom to top, collect all the common expressions 
to the current operator, and add the initialize code to the current operator if 
the common expressions have not been initialized.
 * Replace the common expressions code when generating codes for  the physical 
operators.

h1. *New support subexpression elimination patterns*
 * 
h2. *Support subexpression elimination with conditional expressions*

{code:java}
SELECT case when v + 2 > 1 then 1
            when v + 1 > 2 then 2
            when v + 1 > 3 then 3 END vv
FROM values(1) as t2(v)
{code}
We can reuse the result of expression  *v + 1*
{code:java}
SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a
{code}
We can reuse the result of expression  b + c
 * 
h2. *Support subexpression elimination in FilterExec*

 
{code:java}
SELECT * FROM (
  SELECT v * v + 1 v1 from values(1) as t2(v)
) t
where v1 > 5 and v1 < 10
{code}
We can reuse the result of expression  *v* * *v* *+* *1*
 * 
h2. *Support subexpression elimination in JoinExec*

 
{code:java}
SELECT * 
FROM values(1, 1) as t1(a, b) 
join values(1, 2) as t2(x, y)

ON b * y between 2 and 3{code}
 
We can reuse the result of expression  *b* * *y*
 * 
h2. *Support subexpression elimination in ExpandExec*

{code:java}
SELECT a, count(b),
        count(distinct case when b > 1 then b + c else null end) as count_bc_1,
        count(distinct case when b < 0 then b + c else null end) as count_bc_2
FROM values(1, 1, 1) as t(a, b, c)
GROUP BY a
{code}
We can reuse the result of expression  b + c


> Support more subexpression elimination cases
> --------------------------------------------
>
>                 Key: SPARK-42551
>                 URL: https://issues.apache.org/jira/browse/SPARK-42551
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.2
>            Reporter: Wan Kun
>            Priority: Major
>
> h1. *Design Sketch*
> h2. How to support more subexpressions elimination cases
>  * Get all common expressions from input expressions of the current physical 
> operator to current CodeGenContext. Recursively visits all subexpressions 
> regardless of whether the current expression is a conditional expression.
>  * For each common expression:
>  ** Add a new boolean variable *subExprInit* to indicate whether it has  
> already been evaluated. 
>  ** Add a new code block in CodeGenSupport trait, and reset those 
> *subExprInit* variables to *false* before the physical operators begin to 
> evaluate the input row.
>  ** Add a new wrapper subExpr function for each common subexpression.
> |private void subExpr_n(${argList}) {
>  if (!subExprInit) {
>    ${eval.code}
>    subExprInit_n = true;
>    subExprIsNull_n = ${eval.isNull};
>    subExprValue_n = ${eval.value};
>  }
> }|
>  
>  * When generating the input expression code,  if the input expression is a 
> common expression, the expression code will be replaced with the 
> corresponding subExpr function. When the subExpr function is called for the 
> first time, *subExprInit* will be set to true, and the subsequent function 
> calls will do nothing.
> h2. Why should we support whole-stage subexpression elimination
> Right now each spark physical operator shares nothing but the input row, so 
> the same expressions may be evaluated multiple times across different 
> operators. For example, the expression udf(c1, c2) in plan Project [udf(c1, 
> c2)] - Filter [udf(c1, c2) > 0] - Relation will be evaluated both in Project 
> and Filter operators.  We can reuse the expression results across different 
> operators such as Project and Filter.
> h2. How to support whole-stage subexpression elimination
>  * Add two properties in CodegenSupport trait, the reusable expressions and 
> the the output attributes, we can reuse the expression results only if the 
> output attributes are the same.
>  * Visit all operators from top to bottom, bound the candidate expressions 
> with the output attributes and add to the current candidate reusable 
> expressions.
>  * Visit all operators from bottom to top, collect all the common expressions 
> to the current operator, and add the initialize code to the current operator 
> if the common expressions have not been initialized.
>  * Replace the common expressions code when generating codes for  the 
> physical operators.
> h1. *New support subexpression elimination patterns*
>  * 
> h2. *Support subexpression elimination with conditional expressions*
> {code:java}
> SELECT case when v + 2 > 1 then 1
>             when v + 1 > 2 then 2
>             when v + 1 > 3 then 3 END vv
> FROM values(1) as t2(v)
> {code}
> We can reuse the result of expression  *v + 1*
> {code:java}
> SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) 
> min_bc
> FROM values(1, 1, 1) as t(a, b, c)
> GROUP BY a
> {code}
> We can reuse the result of expression  b + c
>  * 
> h2. *Support subexpression elimination in FilterExec*
>  
> {code:java}
> SELECT * FROM (
>   SELECT v * v + 1 v1 from values(1) as t2(v)
> ) t
> where v1 > 5 and v1 < 10
> {code}
> We can reuse the result of expression  *v* * *v* *+* *1*
>  * 
> h2. *Support subexpression elimination in JoinExec*
>  
> {code:java}
> SELECT * 
> FROM values(1, 1) as t1(a, b) 
> join values(1, 2) as t2(x, y)
> ON b * y between 2 and 3{code}
>  
> We can reuse the result of expression  *b* * *y*
>  * 
> h2. *Support subexpression elimination in ExpandExec*
> {code:java}
> SELECT a, count(b),
>       count(distinct case when b > 1 then b + c else null end) as count_bc_1,
>       count(distinct case when b < 0 then b + c else null end) as count_bc_2
> FROM values(1, 1, 1) as t(a, b, c)
> GROUP BY a
> {code}
> We can reuse the result of expression  b + c



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-42551) Support more subexpression elimination cases

Reply via email to