[ https://issues.apache.org/jira/browse/SPARK-37392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452656#comment-17452656 ]

Josh Rosen commented on SPARK-37392:
------------------------------------

When I ran this in {{spark-shell}} it triggered an OOM in 
{{LogicalPlan.constraints()}}. In the heap dump I spotted an 
{{ExpressionSet}} with over 100,000 expressions.

Based on the constraints that I saw, I think they were introduced by the 
{{InferFiltersFromGenerate}} rule, and that some sort of unexpected rule 
interaction is causing a huge blowup of derived constraints in 
{{PruneFilters}}. The {{InferFiltersFromGenerate}} rule was introduced in 
SPARK-32295 / Spark 3.1.0, which could explain why this issue isn't 
reproducible in Spark 2.4.1.
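
For reference, a quick way to see the filter that {{InferFiltersFromGenerate}} 
injects is to run a tiny explode-over-array query in {{spark-shell}} and look at 
the optimized plan. Rough sketch below; the column names are just placeholders 
and the expected filter shape is my guess at what the rule produces:

{code:java}
// spark-shell sketch: the optimized plan for a small explode(array(...)) query
// should contain a filter along the lines of size(array(a, b, c)) > 0,
// inferred by InferFiltersFromGenerate for the Generate node.
import spark.implicits._

val df = Seq((1, 2, 3)).toDF("a", "b", "c")
df.selectExpr("explode(array(a, b, c)) AS result")
  .where("result != 0")
  .explain(true)
{code}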

Looking at the constraints in the huge {{ExpressionSet}}, it looks like the 
vast majority (> 99%) of the constraints are {{GreaterThan}} or {{LessThan}} 
constraints of the form:
 * {{GreaterThan(Size(CreateArray(...)), Literal(0))}}
 * {{LessThan(Literal(0), Size(CreateArray(...)))}}

I think the {{GreaterThan}} comes from {{InferFiltersFromGenerate}} and suspect 
that the {{LessThan}} equivalents are introduced via expression 
canonicalization.
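
If canonicalization were collapsing those two forms, an {{ExpressionSet}} should 
keep only one of them. A rough {{spark-shell}} sanity check along these lines 
(building the two constraint shapes by hand; {{a}} and {{b}} are just placeholder 
column refs) might tell us whether that dedup is actually happening:

{code:java}
import org.apache.spark.sql.catalyst.expressions.ExpressionSet
import org.apache.spark.sql.functions._

// The two shapes seen in the heap dump, built by hand:
//   GreaterThan(Size(CreateArray(...)), Literal(0))
//   LessThan(Literal(0), Size(CreateArray(...)))
val gt = (size(array(col("a"), col("b"))) > lit(0)).expr
val lt = (lit(0) < size(array(col("a"), col("b")))).expr

// If these canonicalize to the same expression, the set should end up with one
// entry; if not, that would help explain duplicate constraints piling up.
println(gt.canonicalized == lt.canonicalized)
println(ExpressionSet(Seq(gt, lt)).size)
{code}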

We'll need to dig a bit deeper to figure out what's leading to this buildup of 
duplicate constraints. Perhaps it's some sort of interaction between this 
particular shape of constraint, the constraint propagation system, and 
canonicalization? I'm not sure yet.
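
In the meantime, two quick diagnostics that might help narrow this down (both are 
existing SQL confs and can be set from {{spark-shell}}):

{code:java}
// 1. Log what each optimizer rule does to the plan, to see under which rule(s)
//    the constraints start to blow up.
spark.conf.set("spark.sql.planChangeLog.level", "WARN")

// 2. If constraint propagation really is the culprit, disabling it should make
//    the reporter's query come back quickly. Not a fix, just a way to test the theory.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
{code}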

> Catalyst optimizer very time-consuming and memory-intensive with some 
> "explode(array)" 
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-37392
>                 URL: https://issues.apache.org/jira/browse/SPARK-37392
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 3.1.2, 3.2.0
>            Reporter: Francois MARTIN
>            Priority: Major
>
> The problem occurs with the simple code below:
> {code:java}
> import session.implicits._
> Seq(
>   (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
> ).toDF()
>   .checkpoint() // or save and reload to truncate lineage
>   .createOrReplaceTempView("sub")
> session.sql("""
>   SELECT
>     *
>   FROM
>   (
>     SELECT
>       EXPLODE( ARRAY( * ) ) result
>     FROM
>     (
>       SELECT
>         _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
>       FROM
>         sub
>     )
>   )
>   WHERE
>     result != ''
>   """).show() {code}
> It takes several minutes and uses a very large amount of Java heap, when it 
> should be immediate.
> The problem does not occur when the single integer value (1) is replaced with 
> a string value ("x").
> All the time is spent in the _PruneFilters_ optimization rule.
> Not reproduced in Spark 2.4.1.


