Wechar Yu created SPARK-56355:
---------------------------------
Summary: Improve join stats estimation when equi-join keys lack
column statistics
Key: SPARK-56355
URL: https://issues.apache.org/jira/browse/SPARK-56355
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.1.1
Reporter: Wechar Yu
{{JoinEstimation}} use cartesian product for equal-join if there's no column
stats. This often leads to significant overestimation and can cause the
optimizer to choose suboptimal join strategies.
This issue is particularly evident under AQE, where statistics from
{{ExchangeQueryStageExec}} only include {{sizeInBytes}} and {{{}rowCount{}}},
without column-level statistics. As a result, AQE re-optimization may rely on
inflated estimates and fail to select the optimal physical plan.
Here is an example with 4 joins:
{code:bash}
Join Inner, (shop_id#2217L = shop_id#1968L)
:- LogicalQueryStage Project [...], ShuffleQueryStage 42 Stats(249GB, 935242042)
+- Aggregate [...], [shop_id#2217L]
+- Project [...]
+- Project [...]
+- Join LeftOuter, (shop_id#2217L = shop_id#43L)
:- Join LeftOuter, (shop_id#2217L = shop_id#2953L)
: :- Project [...]
: : +- Join LeftOuter, (shop_id#2217L = shop_id#2903L)
: : :- LogicalQueryStage Join LeftOuter ShuffleQueryStage 92
Stats(6.6MB, 398114)
: : +- LogicalQueryStage Project [], ShuffleQueryStage 48
Stats(1124.4MB, 92067282)
: +- LogicalQueryStage LogicalQueryStage Aggregate
BroadcastQueryStage 94 Stats(2.1MB, 60062)
+- LogicalQueryStage LogicalQueryStage Aggregate
BroadcastQueryStage 96 Stats(4.2KB, 533)
{code}
The left size of top join is 249GB, but since there are no join keys' column
stats, the right size of top join is estimated by cartesian product *
avg_row_size, around 398114 * 92067282 * 60062 * 533 * avg_row_size = 1.17e21 *
av_row_size >> 249GB (2.67e11)
The actual right side's output is Stats(118.9MB, 99761), which is much smaller
than the estimated.
In Gluten it choose optimal build side of ShuffleHashJoin by the Spark's
estimated stats, and will choose the wrong side that cause a decrease in
performance.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]