Ziqi Liu created SPARK-46995: -------------------------------- Summary: Allow AQE coalesce final stage in SQL cached plan Key: SPARK-46995 URL: https://issues.apache.org/jira/browse/SPARK-46995 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Ziqi Liu
[https://github.com/apache/spark/pull/43435] and [https://github.com/apache/spark/pull/43760] are fixing a correctness issue which will be triggered when AQE applied on cached query plan, specifically, when AQE coalescing the final result stage of the cached plan. The current semantic of {{spark.sql.optimizer.canChangeCachedPlanOutputPartitioning}} ([source code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L403-L411]): * when true, we enable AQE, but disable coalescing final stage ({*}default{*}) * when false, we disable AQE But let’s revisit the semantic of this config: actually for caller the only thing that matters is whether we change the output partitioning of the cached plan. And we should only try to apply AQE if possible. Thus we want to modify the semantic of {{spark.sql.optimizer.canChangeCachedPlanOutputPartitioning}} * when true, we enable AQE and allow coalescing final: this might lead to perf regression, because it introduce extra shuffle * when false, we enable AQE, but disable coalescing final stage. *(this is actually the `true` semantic of old behavior)* Also, to keep the default behavior unchanged, we might want to flip the default value of {{spark.sql.optimizer.canChangeCachedPlanOutputPartitioning}} to `false` -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org