Wenzhe Zhou created IMPALA-14263:
------------------------------------
Summary: Broadcast cost in planner is skewed by the number of
nodes comparing to partition cost
Key: IMPALA-14263
URL: https://issues.apache.org/jira/browse/IMPALA-14263
Project: IMPALA
Issue Type: Improvement
Components: Frontend
Reporter: Wenzhe Zhou
broadCast Cost = dataPayload + hashTblBuildCost = 2 x (rhsDataSize *
leftChildNodes)
partition Cost = Math.round(lhsNetworkCost + rhsNetworkCost + rhsDataSize)
The number of nodes skews broadcast cost on bigger clusters, which makes
broadcast cost much bigger than partitioned join cost, e.g. planner favor
partition strategy for big cluster.
We probably need to introduce new heuristics to join strategy decision, like
including number of nodes in partitioned join cost model. We also need a way to
check for the degree of skew on the join key during the planning phase. If the
skew is on the higher side, we would want to bias the cost model towards
broadcast.
Adding join hints in the query is the recommended workaround to force broadcast
join in the cases where join keys are skewed, especially for larger clusters.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)