[jira] [Commented] (SPARK-50992) OOMs and performance issues with AQE in large plans

Raju Ansari (Jira) Wed, 10 Sep 2025 09:05:05 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-50992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18019385#comment-18019385
 ]


Raju Ansari commented on SPARK-50992:
-------------------------------------

I hit a similar issue while executing a count query, when queryExecution tries 
to build the plan explainString at `queryExecution.explainString(planDescriptionMode)`:
 
{code:java}
java.lang.OutOfMemoryError
	at java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161)
	at java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:155)
	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:125)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
	at java.lang.StringBuilder.append(StringBuilder.java:141)
	at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:203)
	at scala.collection.immutable.Stream.addString(Stream.scala:691)
	at scala.collection.TraversableOnce.mkString(TraversableOnce.scala:377)
	at scala.collection.TraversableOnce.mkString$(TraversableOnce.scala:376)
	at scala.collection.immutable.Stream.mkString(Stream.scala:760)
	at org.apache.spark.sql.catalyst.util.package$.truncatedString(package.scala:179)
	at org.apache.spark.sql.catalyst.expressions.Expression.toString(Expression.scala:307)
	at java.lang.String.valueOf(String.java:2994)
	at java.lang.StringBuilder.append(StringBuilder.java:136)
	at org.apache.spark.sql.catalyst.expressions.If.toString(conditionalExpressions.scala:105)
	at java.lang.String.valueOf(String.java:2994)
	at java.lang.StringBuilder.append(StringBuilder.java:136)
	at org.apache.spark.sql.catalyst.expressions.Alias.toString(namedExpressions.scala:213)
	at org.apache.spark.sql.catalyst.trees.TreeNode.formatArg(TreeNode.scala:918)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$formatArg$1(TreeNode.scala:911)
	at scala.collection.immutable.List.map(List.scala:297)
	at org.apache.spark.sql.catalyst.trees.TreeNode.formatArg(TreeNode.scala:911)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$argString$1(TreeNode.scala:931)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.TraversableOnce.addString(TraversableOnce.scala:424)
	at scala.collection.TraversableOnce.addString$(TraversableOnce.scala:407)
	at scala.collection.AbstractIterator.addString(Iterator.scala:1431)
	at scala.collection.TraversableOnce.mkString(TraversableOnce.scala:377)
	at scala.collection.TraversableOnce.mkString$(TraversableOnce.scala:376)
	at scala.collection.AbstractIterator.mkString(Iterator.scala:1431)
	at scala.collection.TraversableOnce.mkString(TraversableOnce.scala:379)
	at scala.collection.TraversableOnce.mkString$(TraversableOnce.scala:379)
	at scala.collection.AbstractIterator.mkString(Iterator.scala:1431)
	at org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:949)
{code}
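
The trace ends in `AbstractStringBuilder.hugeCapacity`, which on JDK 8 throws `OutOfMemoryError` when the requested capacity overflows an `int`. That suggests, though the trace alone does not prove it, that a single rendered expression string grew towards the 2^31-character limit rather than the heap simply filling up. The following is a toy sketch of how that can happen with nested `If` expressions. The classes `Leaf`, `If` and `PlanStringGrowth` are hypothetical stand-ins written for illustration, not Spark's Catalyst classes, and the nesting shape is an assumption about the failing query.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Toy model (NOT Spark code): rendered length of nested conditionals. */
final class PlanStringGrowth {
    interface Expr {}

    static final class Leaf implements Expr {
        final String name;
        Leaf(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    static final class If implements Expr {
        final Expr predicate, whenTrue, whenFalse;
        If(Expr predicate, Expr whenTrue, Expr whenFalse) {
            this.predicate = predicate;
            this.whenTrue = whenTrue;
            this.whenFalse = whenFalse;
        }
        // Each child is rendered in full, so a shared child is printed twice.
        @Override public String toString() {
            return "if (" + predicate + ") " + whenTrue + " else " + whenFalse;
        }
    }

    /** Builds `depth` levels where both branches reuse the previous level. */
    static Expr nested(int depth) {
        Expr e = new Leaf("x");
        for (int i = 0; i < depth; i++) {
            e = new If(new Leaf("c" + i), e, e);
        }
        return e;
    }

    /** Length toString() would produce, computed without building the string. */
    static long renderedLength(Expr e) {
        return renderedLength(e, new IdentityHashMap<>());
    }

    private static long renderedLength(Expr e, Map<Expr, Long> memo) {
        Long cached = memo.get(e);
        if (cached != null) {
            return cached;
        }
        long length;
        if (e instanceof If) {
            If node = (If) e;
            length = "if (".length() + renderedLength(node.predicate, memo)
                   + ") ".length() + renderedLength(node.whenTrue, memo)
                   + " else ".length() + renderedLength(node.whenFalse, memo);
        } else {
            length = e.toString().length();
        }
        memo.put(e, length);
        return length;
    }

    /** Smallest nesting depth whose rendered length exceeds `limit`. */
    static int firstDepthExceeding(long limit) {
        int depth = 0;
        while (renderedLength(nested(depth)) <= limit) {
            depth++;
        }
        return depth;
    }

    public static void main(String[] args) {
        for (int depth = 1; depth <= 5; depth++) {
            System.out.println("depth " + depth + ": "
                + nested(depth).toString().length() + " chars");
        }
        System.out.println("first depth above Integer.MAX_VALUE chars: "
            + firstDepthExceeding(Integer.MAX_VALUE));
    }
}
```

In this toy model the object graph stays small (one node per level) while the rendered text more than doubles per level, so a few dozen levels are enough to exceed what a single `StringBuilder` can hold.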
 

> OOMs and performance issues with AQE in large plans
> ---------------------------------------------------
>
>                 Key: SPARK-50992
>                 URL: https://issues.apache.org/jira/browse/SPARK-50992
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.3, 3.5.4, 4.0.0
>            Reporter: Ángel Álvarez Pascua
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Main.scala
>
>
> When AQE is enabled, Spark triggers update events to the internal listener 
> bus whenever a plan changes. These events include a plain-text description of 
> the plan, which is computationally expensive to generate for large plans.
> *Key Issues:*
> *1. High Cost of Plan String Calculation:*
>  * Generating the string description for large physical plans is a costly 
> operation.
>  * This impacts performance, particularly in complex workflows with frequent 
> plan updates (e.g. persisting DataFrames).
> *2. Out-of-Memory (OOM) Errors:*
>  * Events are stored in the listener bus as {{SQLExecutionUIData}} objects 
> and retained until a threshold is reached.
>  * This retention behavior can lead to memory exhaustion when processing 
> large plans, causing OOM errors.
>  
> *Current Workarounds Are Ineffective:*
>  * *Reducing Retained Executions* ({{spark.sql.ui.retainedExecutions}}): 
> Even when set to {{1}} or {{0}}, events are still created, requiring plan 
> string calculations.
>  * *Limiting Plan String Length* ({{spark.sql.maxPlanStringLength}}): 
> Reducing the maximum string length (e.g., to {{1,000,000}}) may mitigate 
> OOMs but does not eliminate the overhead of string generation.
>  * *Available Explain Modes:* All existing explain modes are verbose and 
> computationally expensive, failing to resolve these issues.
>  
> *Proposed Solution:*
> Introduce a new explain mode, *{{off}}*, which suppresses the generation 
> of plan string descriptions.
>  * When this mode is enabled, Spark skips the calculation of plan 
> descriptions altogether.
>  * This resolves OOM errors and restores performance parity with non-AQE 
> execution.
>  
> *Impact of Proposed Solution:*
>  * Eliminates OOMs in large plans with AQE enabled.
>  * Reduces the performance overhead associated with plan string generation.
>  * Ensures Spark scales better in environments with large, complex plans.
>  
> *Reproducibility:*
> A test reproducing the issue has been attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
