[ https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-24983:
------------------------------------

    Assignee: Apache Spark

> Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver
> ---------------------------------------------------------------------
>
>                 Key: SPARK-24983
>                 URL: https://issues.apache.org/jira/browse/SPARK-24983
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 2.3.1
>            Reporter: David Vogelbacher
>            Assignee: Apache Spark
>            Priority: Major
>
> I noticed that writing a Spark job that includes many sequential {{when-otherwise}} statements on the same column can easily OOM the driver while generating the optimized plan, because the project node grows exponentially in size.
> Example:
> {noformat}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
>
> scala> val df = Seq("a", "b", "c", "1").toDF("text")
> df: org.apache.spark.sql.DataFrame = [text: string]
>
> scala> var dfCaseWhen = df.filter($"text" =!= lit("0"))
> dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: string]
>
> scala> for( a <- 1 to 5) {
>      |   dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === lit(a.toString), lit("r" + a.toString)).otherwise($"text"))
>      | }
>
> scala> dfCaseWhen.queryExecution.analyzed
> res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14]
> +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12]
>    +- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10]
>       +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8]
>          +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6]
>             +- Filter NOT (text#3 = 0)
>                +- Project [value#1 AS text#3]
>                   +- LocalRelation [value#1]
>
> scala> dfCaseWhen.queryExecution.optimizedPlan
> res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE
> CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE
> value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN
> (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE
> value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE
> CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE WHEN (CASE WHEN
> (CASE WHEN (value#1 = 1) THEN r1 ELSE va...
> {noformat}
> As one can see, the optimized plan grows exponentially in the number of {{when-otherwise}} statements.
> This comes from the {{CollapseProject}} optimizer rule. Maybe we should put a limit on the resulting size of the project node after collapsing, and only collapse if we stay under that limit.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
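The blow-up described in the report can be modeled without Spark at all: collapsing two adjacent Projects inlines the child's expression at every place the parent references its column, and each `CASE WHEN (text = i) THEN ri ELSE text` references that column twice (once in the predicate, once in the ELSE branch), so the collapsed expression roughly doubles per layer. The sketch below uses toy expression classes, not Catalyst's actual ones; all names (`CollapseDemo`, `Expr`, `sizeOf`, `substitute`) are illustrative.

```scala
// Toy model of the CollapseProject blow-up. These are NOT Spark's Catalyst
// classes -- just a minimal expression tree with the same duplication pattern.
object CollapseDemo {
  sealed trait Expr
  case class ColRef(name: String)                                extends Expr
  case class Lit(value: String)                                  extends Expr
  case class Equals(left: Expr, right: Expr)                     extends Expr
  case class CaseWhen(cond: Expr, ifTrue: Expr, otherwise: Expr) extends Expr

  // Number of nodes in the expression tree.
  def sizeOf(e: Expr): Int = e match {
    case ColRef(_) | Lit(_) => 1
    case Equals(l, r)       => 1 + sizeOf(l) + sizeOf(r)
    case CaseWhen(c, t, o)  => 1 + sizeOf(c) + sizeOf(t) + sizeOf(o)
  }

  // Inline `replacement` wherever column `col` is referenced -- in effect,
  // what collapsing two Project nodes does to the parent's expression.
  def substitute(e: Expr, col: String, replacement: Expr): Expr = e match {
    case ColRef(n) if n == col => replacement
    case Equals(l, r)          => Equals(substitute(l, col, replacement),
                                         substitute(r, col, replacement))
    case CaseWhen(c, t, o)     => CaseWhen(substitute(c, col, replacement),
                                           substitute(t, col, replacement),
                                           substitute(o, col, replacement))
    case other                 => other
  }

  // The i-th layer from the report: CASE WHEN (text = i) THEN r<i> ELSE text.
  // The column "text" appears twice, which is what drives the doubling.
  def layer(i: Int): Expr =
    CaseWhen(Equals(ColRef("text"), Lit(i.toString)), Lit("r" + i), ColRef("text"))

  // Collapse n layers into one expression, as the optimizer does.
  // Size follows S(n) = 2*S(n-1) + 4, i.e. 10 * 2^(n-1) - 4: exponential in n.
  def collapsedAfter(n: Int): Expr =
    (2 to n).foldLeft(layer(1): Expr)((acc, i) => substitute(layer(i), "text", acc))

  def main(args: Array[String]): Unit =
    for (n <- 1 to 5)
      println(s"layers: $n  collapsed expression size: ${sizeOf(collapsedAfter(n))}")
}
```

A size-limit guard along the lines the reporter proposes would then be a single check before rewriting, e.g. only collapse when `sizeOf` of the candidate result stays under some threshold; whether Spark should count tree nodes or some other cost metric is left open by the report.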