[
https://issues.apache.org/jira/browse/HIVE-29613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HIVE-29613:
----------------------------------
Labels: pull-request-available (was: )
> Cross-product join falls back to single-reducer shuffle merge when small-side
> row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-29613
> URL: https://issues.apache.org/jira/browse/HIVE-29613
> Project: Hive
> Issue Type: Bug
> Reporter: Araika Singh
> Assignee: Araika Singh
> Priority: Major
> Labels: pull-request-available
>
> *Background:*
> {color:#4c9aff}ConvertJoinMapJoin{color} decides whether to convert each join
> into a broadcast map join or leave it as a shuffle merge join. For joins the
> optimizer classifies as cross products ({color:#4c9aff}isCrossProduct(joinOp)
> == true{color}), an additional row-count gate is applied: the build side's
> estimated row count must not exceed
> {color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color} (default 1). When
> the estimate exceeds this threshold, the cross-product map-join conversion is
> rejected outright and the join falls back through
> {color:#4c9aff}fallbackToReduceSideJoin / fallbackToMergeJoin{color} into a
> shuffle merge join.
> In practice, the build side of a tiny lookup-style table can end up estimated
> at 2 or 3 rows after CBO pushes down predicates and applies NDV-based
> selectivity, even when its actual byte footprint is a few hundred bytes —
> easily within the existing broadcast byte budget
> {color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color}.
> NDV-driven filter estimates routinely overshoot the row threshold by a small
> margin on tiny tables. When this happens the row-only gate rejects a
> broadcast that would have been entirely safe by the same byte budget that
> map-join conversion already uses elsewhere.
> The resulting plan is a {color:#4c9aff}MERGEJOIN{color} with
> {color:#4c9aff}XPROD_EDGE{color} inputs. Because a keyless cross-product
> offers nothing real to shuffle on, the shuffle collapses to a single reduce
> key, and the entire output of the big side is materialised on a single
> reducer task. The downstream symptom on real-world data is a job pinned
> indefinitely at ~98–99% with one reducer running while the rest finish. We
> have observed this on a join of a ~29-million-row, ~9 GB big-side table
> against a build side filtered down to 2 estimated rows / a few hundred bytes
> {color:#4c9aff}onlineDataSize{color} — the small side fits the byte budget by
> ~7 orders of magnitude, yet the row-only gate rejects the broadcast because
> 2 > 1:
> {code:java}
> // ConvertJoinMapJoin#getMapJoinConversion
> if (parentStats.getNumRows() >
>     HiveConf.getIntVar(context.conf,
>         HiveConf.ConfVars.XPRODSMALLTABLEROWSTHRESHOLD)) {
>   // if any of smaller side is estimated to generate more than
>   // threshold rows we would disable mapjoin
>   return null;
> }
> {code}
> At compile time the optimizer emits the following two log lines three times
> in a row, followed later by the cross-product warning:
> {code:java}
> INFO optimizer.ConvertJoinMapJoin - Could not get a valid join position.
> Defaulting to position 0
> INFO optimizer.ConvertJoinMapJoin - Fallback to common merge join operator
> ....
> INFO SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer
> 2' is a cross product
> {code}
> The root cause is that the cross-product gate consults only the row count of
> the build side, not its byte footprint, and NDV-driven row estimates on tiny
> tables are exactly the case where the byte footprint is small but the row
> count overshoots the threshold by a row or two.
> *Proposed Fix:*
> Relax the cross-product gate — when the row estimate exceeds
> {color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color}, consult a byte
> check against
> {color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color} (the same
> budget map-join conversion already uses). If the small side's
> computeOnlineDataSize fits the budget, allow the conversion. If it doesn't,
> reject as before.
> A log line is emitted at compile time when the byte-fallback branch admits
> the conversion, so the new path is observable in HS2 logs:
> {code:java}
> INFO SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer
> 2' is a cross product{code}
> *Risk and scope:*
> - The change is confined to the cross-product branch of
> {color:#4c9aff}ConvertJoinMapJoin.getMapJoinConversion{color}.
> Non-cross-product joins are unaffected.
> - For cross products whose build side is genuinely large in bytes, the new
> check rejects with the same outcome as today.
> - The change is gated by
> {color:#4c9aff}hive.tez.cartesian-product.enabled{color} (the existing flag)
> — clusters that have cartesian-product edges disabled don't enter this branch
> at all.
> - No public API or configuration surface is added.
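
The relaxed gate described under Proposed Fix can be sketched as a small standalone model. This is an illustration only, not the actual patch: the class and method names are hypothetical, the 10 MB byte budget is an assumed value (the report does not state the configured one), and treating "fits the budget" as less-than-or-equal is likewise an assumption. The sketch models only the cross-product gate, not the other checks ConvertJoinMapJoin applies.

```java
// Illustrative model of the cross-product gate; NOT the Hive implementation.
// XprodGateSketch, rowOnlyGateAdmits and relaxedGateAdmits are hypothetical names.
final class XprodGateSketch {

  /** Current behaviour as described in the report: rows must not exceed the threshold. */
  static boolean rowOnlyGateAdmits(long estimatedRows, long rowThreshold) {
    return estimatedRows <= rowThreshold;
  }

  /**
   * Proposed behaviour: when the row estimate exceeds the threshold, fall back
   * to a byte check against the broadcast budget instead of rejecting outright.
   * The "<=" comparison is an assumption about what "fits the budget" means.
   */
  static boolean relaxedGateAdmits(long estimatedRows, long onlineDataSizeBytes,
                                   long rowThreshold, long byteBudget) {
    if (estimatedRows <= rowThreshold) {
      return true; // row gate passes, same outcome as today
    }
    return onlineDataSizeBytes <= byteBudget; // byte fallback
  }

  public static void main(String[] args) {
    long rowThreshold = 1L;                // hive.xprod.mapjoin.small.table.rows (default 1)
    long byteBudget = 10L * 1024 * 1024;   // assumed value for the noconditionaltask.size budget

    // Build side estimated at 2 rows and a few hundred bytes, as in the report.
    System.out.println("row-only gate admits: " + rowOnlyGateAdmits(2L, rowThreshold));
    System.out.println("relaxed gate admits:  "
        + relaxedGateAdmits(2L, 300L, rowThreshold, byteBudget));

    // Build side that is genuinely large in bytes is still rejected.
    System.out.println("relaxed gate admits large side: "
        + relaxedGateAdmits(2L, 9_000_000_000L, rowThreshold, byteBudget));
  }
}
```

With the figures from the report (2 estimated rows, a few hundred bytes), the row-only gate rejects and the relaxed gate admits, while a build side that is large in bytes is rejected by both.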