[ 
https://issues.apache.org/jira/browse/HIVE-29613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-29613:
----------------------------------
    Labels: pull-request-available  (was: )

> Cross-product join falls back to single-reducer shuffle merge when small-side 
> row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29613
>                 URL: https://issues.apache.org/jira/browse/HIVE-29613
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Araika Singh
>            Assignee: Araika Singh
>            Priority: Major
>              Labels: pull-request-available
>
> *Background:*
> {color:#4c9aff}ConvertJoinMapJoin{color} decides whether to convert each join 
> into a broadcast map join or leave it as a shuffle merge join. For joins the 
> optimizer classifies as cross products ({color:#4c9aff}isCrossProduct(joinOp) 
> == true{color}), an additional row-count gate is applied: the build side's 
> estimated row count must not exceed 
> {color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color} (default 1). When 
> the estimate exceeds this threshold, the cross-product map-join conversion is 
> rejected outright and the join falls back through 
> {color:#4c9aff}fallbackToReduceSideJoin / fallbackToMergeJoin{color} into a 
> shuffle merge join.
> In practice, the build side of a tiny lookup-style table can end up estimated 
> at 2 or 3 rows after CBO pushes down predicates and applies NDV-based 
> selectivity, even when its actual byte footprint is a few hundred bytes — 
> easily within the existing broadcast byte budget 
> {color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color}. 
> NDV-driven filter estimates routinely overshoot the row threshold by a small 
> margin on tiny tables. When this happens the row-only gate rejects a 
> broadcast that would have been entirely safe by the same byte budget that 
> map-join conversion already uses elsewhere.
> The resulting plan is a {color:#4c9aff}MERGEJOIN{color} with 
> {color:#4c9aff}XPROD_EDGE{color} inputs. Because a keyless cross-product 
> offers nothing real to shuffle on, the shuffle collapses to a single reduce 
> key, and the entire output of the big side is materialised on a single 
> reducer task. The downstream symptom on real-world data is a job pinned 
> indefinitely at ~98–99% with one reducer running while the rest finish. We 
> have observed this on a join of a ~29 Million-row, ~9 GB big-side table 
> against a build side filtered down to 2 estimated rows / a few hundred bytes 
> {color:#4c9aff}onlineDataSize{color} — the small side fits the byte budget by 
> ~7 orders of magnitude, yet the row-only gate rejects the broadcast because 2 
> > 1:
> {code:java}
> ConvertJoinMapJoin#getMapJoinConversion
> if (parentStats.getNumRows() >
>        HiveConf.getIntVar(context.conf, 
> HiveConf.ConfVars.XPRODSMALLTABLEROWSTHRESHOLD)) {
>      // if any of smaller side is estimated to generate more than
>      // threshold rows we would disable mapjoin
>      return null;
>    } {code}
> At compile time the optimizer emits, three times in a row, the two log lines:
> {code:java}
> INFO  optimizer.ConvertJoinMapJoin - Could not get a valid join position. 
> Defaulting to position 0
> INFO  optimizer.ConvertJoinMapJoin - Fallback to common merge join operator
> ....
> INFO  SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 
> 2' is a cross product
> {code}
> The root cause is that the cross-product gate consults only the row count of 
> the build side, not its byte footprint, and NDV-driven row estimates on tiny 
> tables are exactly the case where the byte footprint is small but the row 
> count overshoots the threshold by 1.
> *Proposed Fix:*
> Relax the cross-product gate — when the row estimate exceeds 
> {color:#4c9aff}hive.xprod.mapjoin.small.table.rows{color}, consult a byte 
> check against 
> {color:#4c9aff}hive.auto.convert.join.noconditionaltask.size{color} (the same 
> budget map-join conversion already uses). If the small side's 
> computeOnlineDataSize fits the budget, allow the conversion. If it doesn't, 
> reject as before.
> The log line is emitted at compile time when the byte-fallback branch admits, 
> so the new path is observable in HS2 logs:
> {code:java}
> INFO  SessionState - Warning: Shuffle Join MERGEJOIN[141] in Stage 'Reducer 
> 2' is a cross product{code}
> *Risk and scope:*
>  - The change is confined to the cross-product branch of 
> {color:#4c9aff}ConvertJoinMapJoin.getMapJoinConversion{color}. 
> Non-cross-product joins are unaffected.
>  - For cross products whose build side is genuinely large in bytes, the new 
> check rejects with the same outcome as today.
>  - The change is gated by 
> {color:#4c9aff}hive.tez.cartesian-product.enabled{color} (the existing flag) 
> — clusters that have cartesian-product edges disabled don't enter this branch 
> at all.
>  - No public API or configuration surface is added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to