Wenzhe Zhou created IMPALA-14263:
------------------------------------

             Summary: Broadcast cost in planner is skewed by the number of 
nodes comparing to partition cost
                 Key: IMPALA-14263
                 URL: https://issues.apache.org/jira/browse/IMPALA-14263
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
            Reporter: Wenzhe Zhou


Broadcast cost = dataPayload + hashTblBuildCost
               = 2 * (rhsDataSize * leftChildNodes)
Partition cost = Math.round(lhsNetworkCost + rhsNetworkCost + rhsDataSize)

The broadcast cost scales with the number of nodes, while the partitioned join 
cost does not. On bigger clusters this makes the broadcast cost much larger 
than the partitioned join cost, i.e. the planner favors the partitioned 
strategy on big clusters.
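
To make the scaling concrete, here is a minimal Python sketch of the two 
formulas above. It is not the planner's actual code. It assumes that 
dataPayload and hashTblBuildCost are each rhsDataSize * leftChildNodes (which 
is what the factor of 2 suggests), and the sample sizes and the choice of 
lhsNetworkCost = LHS size and rhsNetworkCost = RHS size are made up for 
illustration.

```python
def broadcast_cost(rhs_data_size: int, left_child_nodes: int) -> int:
    # Assumption: both terms equal rhsDataSize * leftChildNodes, matching
    # the "2 * (rhsDataSize * leftChildNodes)" form quoted in the issue.
    data_payload = rhs_data_size * left_child_nodes
    hash_tbl_build_cost = rhs_data_size * left_child_nodes
    return data_payload + hash_tbl_build_cost


def partition_cost(lhs_network_cost: int, rhs_network_cost: int,
                   rhs_data_size: int) -> int:
    # Mirrors Math.round(lhsNetworkCost + rhsNetworkCost + rhsDataSize).
    # Note that no term here depends on the number of nodes.
    return round(lhs_network_cost + rhs_network_cost + rhs_data_size)


if __name__ == "__main__":
    # Hypothetical sizes in bytes: a 10 GB left side and a 100 MB right side.
    lhs_size = 10_000_000_000
    rhs_size = 100_000_000
    # Assumption: both sides must be fully shuffled for the partitioned join.
    p_cost = partition_cost(lhs_size, rhs_size, rhs_size)
    for nodes in (10, 50, 200):
        b_cost = broadcast_cost(rhs_size, nodes)
        choice = "broadcast" if b_cost < p_cost else "partitioned"
        print(f"nodes={nodes} broadcast={b_cost} partition={p_cost} -> {choice}")
```

With these inputs the partitioned cost stays fixed while the broadcast cost 
grows linearly with the node count, so the same join flips from broadcast to 
partitioned purely because the cluster got bigger.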

We probably need to introduce new heuristics into the join strategy decision, 
such as including the number of nodes in the partitioned join cost model. We 
also need a way to check the degree of skew on the join key during the 
planning phase. If the skew is high, we would want to bias the cost model 
towards broadcast.
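
As a rough illustration of what such a skew-aware bias could look like, the 
Python sketch below is purely hypothetical and not a proposed implementation. 
It assumes per-key row counts are available at planning time (for example 
from a sample), and the names skew_ratio, prefer_broadcast, skew_threshold 
and skew_discount, as well as their default values, are invented for this 
example.

```python
from typing import Sequence


def skew_ratio(key_row_counts: Sequence[int]) -> float:
    # Hypothetical skew measure: largest per-key row count divided by the
    # mean per-key row count. 1.0 means perfectly uniform keys.
    if not key_row_counts:
        return 1.0
    mean = sum(key_row_counts) / len(key_row_counts)
    return max(key_row_counts) / mean if mean > 0 else 1.0


def prefer_broadcast(broadcast_cost: float, partition_cost: float,
                     skew: float, skew_threshold: float = 10.0,
                     skew_discount: float = 0.5) -> bool:
    # Hypothetical bias: when skew is at or above the threshold, discount
    # the broadcast cost before comparing it with the partitioned cost.
    adjusted = broadcast_cost
    if skew >= skew_threshold:
        adjusted = broadcast_cost * skew_discount
    return adjusted < partition_cost
```

The threshold and discount would need to be tuned against real workloads; the 
point is only that a skew estimate can enter the comparison as a multiplier 
on one side of the cost model.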

Until then, adding join hints to the query is the recommended workaround to 
force a broadcast join in cases where the join keys are skewed, especially on 
larger clusters.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
