[ 
https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24983:
------------------------------------

    Assignee: Apache Spark

> Collapsing multiple project statements with dependent When-Otherwise 
> statements on the same column can OOM the driver
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24983
>                 URL: https://issues.apache.org/jira/browse/SPARK-24983
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 2.3.1
>            Reporter: David Vogelbacher
>            Assignee: Apache Spark
>            Priority: Major
>
> I noticed that a Spark job that includes many sequential {{when-otherwise}} 
> statements on the same column can easily OOM the driver while generating the 
> optimized plan, because the collapsed project node grows exponentially in size.
> Example:
> {noformat}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> val df = Seq("a", "b", "c", "1").toDF("text")
> df: org.apache.spark.sql.DataFrame = [text: string]
> scala> var dfCaseWhen = df.filter($"text" =!= lit("0"))
> dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: string]
> scala> for( a <- 1 to 5) {
>      |     dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === lit(a.toString), lit("r" + a.toString)).otherwise($"text"))
>      | }
> scala> dfCaseWhen.queryExecution.analyzed
> res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14]
> +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12]
>    +- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10]
>       +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8]
>          +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6]
>             +- Filter NOT (text#3 = 0)
>                +- Project [value#1 AS text#3]
>                   +- LocalRelation [value#1]
> scala> dfCaseWhen.queryExecution.optimizedPlan
> res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) 
> THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 
> ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) 
> THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 
> ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN 
> (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 
> 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN 
> (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 
> 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE 
> WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va...
> {noformat}
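> For illustration (this snippet is not part of the original report; the helper 
> name {{optimizedPlanSize}} is made up), the blow-up can be quantified by 
> measuring the textual size of the optimized plan as the number of chained 
> {{when-otherwise}} calls grows. Paste into the same spark-shell session as the 
> repro above:
> {noformat}
> // Illustrative only: how large does the optimized plan get for n chained statements?
> // Keep n small (e.g. <= 15); larger values can OOM the driver, which is the bug.
> def optimizedPlanSize(n: Int): Int = {
>   var d = Seq("a", "b", "c", "1").toDF("text").filter($"text" =!= lit("0"))
>   for (a <- 1 to n) {
>     d = d.withColumn("text", when($"text" === lit(a.toString), lit("r" + a.toString)).otherwise($"text"))
>   }
>   d.queryExecution.optimizedPlan.toString.length
> }
>
> (1 to 10).foreach(n => println(s"$n -> ${optimizedPlanSize(n)}"))
> {noformat}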
> As the optimized plan above shows, the plan grows exponentially in the number 
> of {{when-otherwise}} statements: each collapse inlines the child's CASE 
> expression into both the WHEN predicate and the ELSE branch of the parent, 
> roughly doubling the expression size per statement.
> This comes from the {{CollapseProject}} optimizer rule.
> Maybe we should put a limit on the size of the project node that would result 
> from collapsing, and only collapse if we stay under that limit (see the sketch 
> below).
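> A minimal sketch of what such a guard could look like (hypothetical; 
> {{maxCollapsedSize}} and {{exprSize}} are made-up names, and the real check 
> would live inside {{CollapseProject}} rather than in user code): count the 
> expression nodes the merged project list would contain and skip the collapse 
> when that count exceeds a configurable threshold.
> {noformat}
> import org.apache.spark.sql.catalyst.expressions.Expression
>
> // Hypothetical threshold; a real fix would presumably read this from a SQLConf entry.
> val maxCollapsedSize = 10000L
>
> // Number of nodes in an expression tree.
> def exprSize(e: Expression): Long = 1L + e.children.map(exprSize).sum
>
> // For the repro above, this is the size CollapseProject actually produced:
> val collapsedSize = dfCaseWhen.queryExecution.optimizedPlan.expressions.map(exprSize).sum
>
> // Sketch of the guard inside the rule: only merge two adjacent Projects when
> // the substituted project list stays small enough, otherwise keep them separate.
> //   if (mergedProjectList.map(exprSize).sum <= maxCollapsedSize) collapse else keep
> {noformat}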



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
