Araika Singh created HIVE-29613:
-----------------------------------
Summary: Cross-product join falls back to single-reducer shuffle merge when small-side row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows
Key: HIVE-29613
URL: https://issues.apache.org/jira/browse/HIVE-29613
Project: Hive
Issue Type: Bug
Reporter: Araika Singh
Assignee: Araika Singh
*Background:*
{color:#4c9aff}ConvertJoinMapJoin{color} decides whether to convert each join
into a broadcast map join or leave it as a shuffle merge join. For joins the
optimizer classifies as cross products ({color:#4c9aff}isCrossProduct(joinOp)
== true{color}), an additional row-count gate is applied: the build side's
estimated row count must not exceed
{color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color} (default 1). When the
estimate exceeds this threshold, the cross-product map-join conversion is
rejected outright and the join falls back through
{color:#4c9aff}fallbackToReduceSideJoin / fallbackToMergeJoin{color} into a
shuffle merge join.
In practice, the build side of a tiny lookup-style table can end up estimated
at 2 or 3 rows after CBO pushes down predicates and applies NDV-based
selectivity, even when its actual byte footprint is a few hundred bytes —
easily within the existing broadcast byte budget
{color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color}. NDV-driven
filter estimates routinely overshoot the row threshold by a small margin on
tiny tables. When this happens the row-only gate rejects a broadcast that would
have been entirely safe by the same byte budget that map-join conversion
already uses elsewhere.
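To illustrate how a row estimate on a tiny table can land just above the threshold, here is a minimal sketch. It assumes the textbook equality-filter estimate (rows divided by NDV, rounded); it is not Hive's actual statistics code, and the class name, method name, and the 10-row / NDV-4 table are hypothetical.

```java
// Illustrative sketch only: not Hive's statistics implementation.
final class NdvEstimateSketch {

    /** Textbook equality-filter estimate: rows / NDV, rounded, never below 1. */
    static long estimateRowsAfterEqualityFilter(long numRows, long ndv) {
        if (ndv <= 0) {
            return numRows; // no NDV information available: assume no reduction
        }
        return Math.max(1L, Math.round((double) numRows / ndv));
    }

    public static void main(String[] args) {
        long rowThreshold = 1L; // threshold value as described in this issue
        // Hypothetical lookup table: 10 rows, 4 distinct values in the filtered column.
        long estimate = estimateRowsAfterEqualityFilter(10L, 4L);
        System.out.println("estimated rows = " + estimate);
        System.out.println("exceeds row threshold = " + (estimate > rowThreshold));
    }
}
```

Under these assumptions the estimate exceeds the threshold even though the table itself would occupy only a few hundred bytes.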
The resulting plan is a {color:#4c9aff}MERGEJOIN{color} with
{color:#4c9aff}XPROD_EDGE{color} inputs. Because a keyless cross-product offers
nothing real to shuffle on, the shuffle collapses to a single reduce key, and
the entire output of the big side is materialised on a single reducer task. The
downstream symptom on real-world data is a job pinned indefinitely at ~98–99%
with a single reducer still running after the rest have finished. We have
observed this on a join of a ~29-million-row, ~9 GB big-side table against a
build side filtered down to 2 estimated rows / a few hundred bytes
{color:#4c9aff}onlineDataSize{color}. The small side fits the byte budget by
~7 orders of magnitude, yet the row-only gate rejects the broadcast because
2 > 1:
{code:java}
// ConvertJoinMapJoin#getMapJoinConversion
if (parentStats.getNumRows() >
    HiveConf.getIntVar(context.conf,
        HiveConf.ConfVars.XPRODSMALLTABLEROWSTHRESHOLD)) {
  // if any of smaller side is estimated to generate more than
  // threshold rows we would disable mapjoin
  return null;
}
{code}
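The single-reducer collapse described above can be illustrated with generic hash partitioning. This is a sketch under stated assumptions, not the Tez or Hive shuffle implementation: the class and method names are hypothetical, and the empty string stands in for the constant key that a keyless cross product effectively carries.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: generic hash partitioning, not Tez's edge code.
final class SingleKeyShuffleSketch {

    /** Generic hash partitioner: non-negative hash of the key, modulo reducer count. */
    static int partitionFor(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Counts the rows each reducer receives when every row carries the same key. */
    static Map<Integer, Long> rowsPerReducer(int rows, Object constantKey, int numReducers) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (int i = 0; i < rows; i++) {
            counts.merge(partitionFor(constantKey, numReducers), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 1,000 big-side rows, 8 reducers, one constant key.
        Map<Integer, Long> counts = rowsPerReducer(1_000, "", 8);
        System.out.println("reducers that received rows: " + counts.size());
        System.out.println("rows per reducer: " + counts);
    }
}
```

With a constant key, adding reducers does not spread the work: every row maps to the same partition, which matches the symptom of one long-running reducer.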
At compile time the optimizer emits the two ConvertJoinMapJoin log lines below
three times in a row, and the cross-product warning appears later:
{code:java}
INFO optimizer.ConvertJoinMapJoin - Could not get a valid join position. Defaulting to position 0
INFO optimizer.ConvertJoinMapJoin - Fallback to common merge join operator
....
INFO SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 2' is a cross product
{code}
The root cause is that the cross-product gate consults only the row count of
the build side, not its byte footprint, and NDV-driven row estimates on tiny
tables are exactly the case where the byte footprint is small but the row count
overshoots the threshold by 1.
*Proposed Fix:*
Relax the cross-product gate: when the row estimate exceeds
{color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color}, fall back to a byte
check against {color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color}
(the same budget map-join conversion already uses). If the small side's
{color:#4c9aff}computeOnlineDataSize{color} result fits the budget, allow the
conversion; otherwise, reject as before.
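The decision described above can be sketched as a standalone predicate. This is a minimal illustration, not the actual patch: the class, method, and parameter names are hypothetical, treating "fits the budget" as less-than-or-equal is an assumption, and the 10,000,000-byte budget in the example is an assumed illustrative value.

```java
// Illustrative sketch only: not the ConvertJoinMapJoin change itself.
final class XprodGateSketch {

    /**
     * Returns true when a cross-product map-join conversion would be allowed.
     *
     * @param estimatedRows       estimated row count of the build (small) side
     * @param onlineDataSizeBytes estimated in-memory size of the build side
     * @param rowThreshold        value of hive.xprod.mapjoin.small.table.rows
     * @param byteBudget          value of hive.auto.convert.join.noconditionaltask.size
     */
    static boolean allowCrossProductMapJoin(long estimatedRows,
                                            long onlineDataSizeBytes,
                                            long rowThreshold,
                                            long byteBudget) {
        if (estimatedRows <= rowThreshold) {
            return true; // existing behaviour: row gate passes
        }
        // Proposed fallback: row gate failed, so consult the byte budget.
        // Assumption: "fits" means less than or equal to the budget.
        return onlineDataSizeBytes <= byteBudget;
    }

    public static void main(String[] args) {
        long rowThreshold = 1L;          // threshold value as described in this issue
        long byteBudget = 10_000_000L;   // assumed illustrative budget
        // Case from this issue: 2 estimated rows, a few hundred bytes.
        System.out.println(allowCrossProductMapJoin(2L, 300L, rowThreshold, byteBudget));
        // Build side that is genuinely large in bytes: still rejected.
        System.out.println(allowCrossProductMapJoin(2L, 9_000_000_000L, rowThreshold, byteBudget));
    }
}
```

Build sides that already pass the row gate take the same path as today; only the previously rejected case reaches the byte comparison.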
A log line is emitted at compile time when the byte-fallback branch admits the
conversion, so the new path is observable in HS2 logs:
{code:java}
INFO SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 2' is a cross product
{code}
*Risk and scope:*
- The change is confined to the cross-product branch of
{color:#4c9aff}ConvertJoinMapJoin.getMapJoinConversion{color}.
Non-cross-product joins are unaffected.
- For cross products whose build side is genuinely large in bytes, the new
check rejects with the same outcome as today.
- The change is gated by
{color:#4c9aff}hive.tez.cartesian-product.enabled{color} (the existing flag) —
clusters that have cartesian-product edges disabled don't enter this branch at
all.
- No public API or configuration surface is added.