Wechar Yu created SPARK-56355:
---------------------------------

             Summary: Improve join stats estimation when equi-join keys lack 
column statistics
                 Key: SPARK-56355
                 URL: https://issues.apache.org/jira/browse/SPARK-56355
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.1.1
            Reporter: Wechar Yu


{{JoinEstimation}} use cartesian product for equal-join if there's no column 
stats. This often leads to significant overestimation and can cause the 
optimizer to choose suboptimal join strategies.

This issue is particularly evident under AQE, where statistics from 
{{ExchangeQueryStageExec}} only include {{sizeInBytes}} and {{{}rowCount{}}}, 
without column-level statistics. As a result, AQE re-optimization may rely on 
inflated estimates and fail to select the optimal physical plan.

 Here is an example with 4 joins:
{code:bash}
Join Inner, (shop_id#2217L = shop_id#1968L)
:- LogicalQueryStage Project [...], ShuffleQueryStage 42 Stats(249GB, 935242042)
+- Aggregate [...], [shop_id#2217L]
   +- Project [...]
      +- Project [...]
         +- Join LeftOuter, (shop_id#2217L = shop_id#43L)
            :- Join LeftOuter, (shop_id#2217L = shop_id#2953L)
            :  :- Project [...]
            :  :  +- Join LeftOuter, (shop_id#2217L = shop_id#2903L)
            :  :     :- LogicalQueryStage Join LeftOuter ShuffleQueryStage 92 
Stats(6.6MB, 398114)
            :  :     +- LogicalQueryStage Project [], ShuffleQueryStage 48 
Stats(1124.4MB, 92067282)
            :  +- LogicalQueryStage LogicalQueryStage Aggregate 
BroadcastQueryStage 94 Stats(2.1MB, 60062)
            +- LogicalQueryStage LogicalQueryStage Aggregate 
BroadcastQueryStage 96 Stats(4.2KB, 533)
{code}
The left size of top join is 249GB, but since there are no join keys' column 
stats, the right size of top join is estimated by cartesian product * 
avg_row_size, around 398114 * 92067282 * 60062 * 533 * avg_row_size = 1.17e21 * 
av_row_size >> 249GB (2.67e11)

The actual right side's output is Stats(118.9MB, 99761), which is much smaller 
than the estimated.

In Gluten it choose optimal build side of ShuffleHashJoin by the Spark's 
estimated stats, and will choose the wrong side that cause a decrease in 
performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to