[
https://issues.apache.org/jira/browse/SPARK-56355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-56355:
-----------------------------------
Labels: pull-request-available (was: )
> Improve join stats estimation when equi-join keys lack column statistics
> ------------------------------------------------------------------------
>
> Key: SPARK-56355
> URL: https://issues.apache.org/jira/browse/SPARK-56355
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.1.1
> Reporter: Wechar Yu
> Priority: Major
> Labels: pull-request-available
>
> {{JoinEstimation}} use cartesian product for equal-join if there's no column
> stats. This often leads to significant overestimation and can cause the
> optimizer to choose suboptimal join strategies.
> This issue is particularly evident under AQE, where statistics from
> {{ExchangeQueryStageExec}} only include {{sizeInBytes}} and {{{}rowCount{}}},
> without column-level statistics. As a result, AQE re-optimization may rely on
> inflated estimates and fail to select the optimal physical plan.
> Here is an example with 4 joins:
> {code:bash}
> Join Inner, (shop_id#2217L = shop_id#1968L)
> :- LogicalQueryStage Project [...], ShuffleQueryStage 42 Stats(249GB,
> 935242042)
> +- Aggregate [...], [shop_id#2217L]
> +- Project [...]
> +- Project [...]
> +- Join LeftOuter, (shop_id#2217L = shop_id#43L)
> :- Join LeftOuter, (shop_id#2217L = shop_id#2953L)
> : :- Project [...]
> : : +- Join LeftOuter, (shop_id#2217L = shop_id#2903L)
> : : :- LogicalQueryStage Join LeftOuter ShuffleQueryStage 92
> Stats(6.6MB, 398114)
> : : +- LogicalQueryStage Project [], ShuffleQueryStage 48
> Stats(1124.4MB, 92067282)
> : +- LogicalQueryStage LogicalQueryStage Aggregate
> BroadcastQueryStage 94 Stats(2.1MB, 60062)
> +- LogicalQueryStage LogicalQueryStage Aggregate
> BroadcastQueryStage 96 Stats(4.2KB, 533)
> {code}
> The left size of top join is 249GB, but since there are no join keys' column
> stats, the right size of top join is estimated by cartesian product *
> avg_row_size, around 398114 * 92067282 * 60062 * 533 * avg_row_size = 1.17e21
> * av_row_size >> 249GB (2.67e11)
> The actual right side's output is Stats(118.9MB, 99761), which is much
> smaller than the estimated.
> In Gluten it choose optimal build side of ShuffleHashJoin by the Spark's
> estimated stats, and will choose the wrong side that cause a decrease in
> performance.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]