[ 
https://issues.apache.org/jira/browse/SPARK-56355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-56355:
-----------------------------------
    Labels: pull-request-available  (was: )

> Improve join stats estimation when equi-join keys lack column statistics
> ------------------------------------------------------------------------
>
>                 Key: SPARK-56355
>                 URL: https://issues.apache.org/jira/browse/SPARK-56355
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.1.1
>            Reporter: Wechar Yu
>            Priority: Major
>              Labels: pull-request-available
>
> {{JoinEstimation}} use cartesian product for equal-join if there's no column 
> stats. This often leads to significant overestimation and can cause the 
> optimizer to choose suboptimal join strategies.
> This issue is particularly evident under AQE, where statistics from 
> {{ExchangeQueryStageExec}} only include {{sizeInBytes}} and {{{}rowCount{}}}, 
> without column-level statistics. As a result, AQE re-optimization may rely on 
> inflated estimates and fail to select the optimal physical plan.
>  Here is an example with 4 joins:
> {code:bash}
> Join Inner, (shop_id#2217L = shop_id#1968L)
> :- LogicalQueryStage Project [...], ShuffleQueryStage 42 Stats(249GB, 
> 935242042)
> +- Aggregate [...], [shop_id#2217L]
>    +- Project [...]
>       +- Project [...]
>          +- Join LeftOuter, (shop_id#2217L = shop_id#43L)
>             :- Join LeftOuter, (shop_id#2217L = shop_id#2953L)
>             :  :- Project [...]
>             :  :  +- Join LeftOuter, (shop_id#2217L = shop_id#2903L)
>             :  :     :- LogicalQueryStage Join LeftOuter ShuffleQueryStage 92 
> Stats(6.6MB, 398114)
>             :  :     +- LogicalQueryStage Project [], ShuffleQueryStage 48 
> Stats(1124.4MB, 92067282)
>             :  +- LogicalQueryStage LogicalQueryStage Aggregate 
> BroadcastQueryStage 94 Stats(2.1MB, 60062)
>             +- LogicalQueryStage LogicalQueryStage Aggregate 
> BroadcastQueryStage 96 Stats(4.2KB, 533)
> {code}
> The left size of top join is 249GB, but since there are no join keys' column 
> stats, the right size of top join is estimated by cartesian product * 
> avg_row_size, around 398114 * 92067282 * 60062 * 533 * avg_row_size = 1.17e21 
> * av_row_size >> 249GB (2.67e11)
> The actual right side's output is Stats(118.9MB, 99761), which is much 
> smaller than the estimated.
> In Gluten it choose optimal build side of ShuffleHashJoin by the Spark's 
> estimated stats, and will choose the wrong side that cause a decrease in 
> performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to