This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new adc9c82  [SPARK-37290][SQL] - Exponential planning time in case of non-deterministic function

adc9c82 is described below

commit adc9c82de84176be084c941d0c29aefdf8f600e4
Author: Franck Thang <stel...@outlook.com>
AuthorDate: Tue Feb 22 13:49:32 2022 +0800

[SPARK-37290][SQL] - Exponential planning time in case of non-deterministic function

### What changes were proposed in this pull request?

When using a non-deterministic function, the method `getAllValidConstraints` can throw an OOM:

```scala
protected def getAllValidConstraints(projectList: Seq[NamedExpression]): ExpressionSet = {
  var allConstraints = child.constraints
  projectList.foreach {
    case a @ Alias(l: Literal, _) =>
      allConstraints += EqualNullSafe(a.toAttribute, l)
    case a @ Alias(e, _) =>
      // For every alias in `projectList`, replace the reference in constraints by its attribute.
      allConstraints ++= allConstraints.map(_ transform {
        case expr: Expression if expr.semanticEquals(e) => a.toAttribute
      })
      allConstraints += EqualNullSafe(e, a.toAttribute)
    case _ => // Don't change.
  }
  allConstraints
}
```

In particular, the line `allConstraints ++= allConstraints.map(...)` can generate an exponential number of expressions. This is because non-deterministic expressions are considered unique in an `ExpressionSet`, so the number of non-deterministic expressions doubles every time we go through this line.

We can filter and keep only deterministic expressions because:

1. `semanticEquals` automatically discards non-deterministic expressions, and
2. this method is only used in one code path, which itself keeps only deterministic expressions:

```scala
lazy val constraints: ExpressionSet = {
  if (conf.constraintPropagationEnabled) {
    validConstraints
      .union(inferAdditionalConstraints(validConstraints))
      .union(constructIsNotNullConstraints(validConstraints, output))
      .filter { c =>
        c.references.nonEmpty && c.references.subsetOf(outputSet) && c.deterministic
      }
  } else {
    ExpressionSet()
  }
}
```

### Why are the changes needed?

Without this change, constraint propagation can produce an exponential number of expressions and/or an OOM.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Local test

Closes #35233 from Stelyus/SPARK-37290.
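To make the doubling concrete, here is a small standalone Scala sketch. It is not Spark's actual `ExpressionSet`; `Expr`, `semanticEquals`, and `addAll` are simplified stand-ins that only model the one property that matters here: non-deterministic expressions never compare semantically equal, so every pass of `allConstraints ++= allConstraints.map(...)` treats the mapped copies as new elements.

```scala
// Toy model of the bug (NOT Spark's real ExpressionSet): because two
// non-deterministic expressions never compare semantically equal, each
// pass of `allConstraints ++= allConstraints.map(...)` re-adds every
// element, doubling the set.
object ConstraintDoubling {
  final case class Expr(id: Int, deterministic: Boolean)

  // Simplified semantic equality: deterministic expressions compare
  // by id; non-deterministic ones (like rand()) equal nothing.
  def semanticEquals(a: Expr, b: Expr): Boolean =
    a.deterministic && b.deterministic && a.id == b.id

  // Add elements with semantic-equality dedup, like ExpressionSet.
  def addAll(set: Vector[Expr], exprs: Seq[Expr]): Vector[Expr] =
    exprs.foldLeft(set) { (s, e) =>
      if (s.exists(semanticEquals(_, e))) s else s :+ e
    }

  // One pass per alias in projectList: map the whole set back into
  // itself (the transform leaves untouched constraints unchanged).
  def grow(start: Vector[Expr], passes: Int): Int = {
    var all = start
    for (_ <- 1 to passes) all = addAll(all, all)
    all.size
  }

  def main(args: Array[String]): Unit = {
    val nonDet = Vector(Expr(0, deterministic = false))
    val det = Vector(Expr(0, deterministic = true))
    // Non-deterministic: the set doubles on every pass.
    println(grow(nonDet, 10)) // 1024 after 10 aliases
    // Deterministic: mapped copies dedupe, size stays 1.
    println(grow(det, 10)) // 1
  }
}
```

With ten aliased non-deterministic expressions the set already holds 2^10 entries, which is why real queries with many such aliases blow up planning time or memory, and why the fix simply skips non-deterministic aliases in the rewrite loop.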
Authored-by: Franck Thang <stel...@outlook.com>
Signed-off-by: Wenchen Fan <wenc...@databricks.com>
(cherry picked from commit 881f562f7b6a2bed76b01f956bc02c4b87ad6b80)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>
---
 .../scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
index 3ea79b3..f66ef43 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
@@ -186,7 +186,7 @@ abstract class UnaryNode extends LogicalPlan {
     projectList.foreach {
       case a @ Alias(l: Literal, _) =>
         allConstraints += EqualNullSafe(a.toAttribute, l)
-      case a @ Alias(e, _) =>
+      case a @ Alias(e, _) if e.deterministic =>
         // For every alias in `projectList`, replace the reference in constraints by its attribute.
         allConstraints ++= allConstraints.map(_ transform {
           case expr: Expression if expr.semanticEquals(e) =>