Re: [PR] HIVE-29613: [Optimizer] Cross-product join falls back to single-reducer shuffle merge when small-side row estimate marginally exceeds hive.xprod.mapjoin.small.table.rows [hive]

via GitHub Tue, 16 Jun 2026 14:08:42 -0700


konstantinb commented on PR #6484:
URL: https://github.com/apache/hive/pull/6484#issuecomment-4723665150


   @armitage420 this optimization seems to put too much trust in the accuracy 
of statistics _estimates_. Those aren't always accurate — e.g. for 
variable-length / UDF output like `repeat()`, the width is capped at 
`hive.stats.max.variable.length` (100B/row), so `computeOnlineDataSize` can 
fall far below the actual build size and the gate then approves a broadcast 
well over `hive.auto.convert.join.noconditionaltask.size`.
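
To put rough numbers on that gap, here is a minimal back-of-the-envelope sketch. It is not Hive's actual estimator: the function names are hypothetical, the 7-byte value length is an assumed upper bound for `src.value` strings such as `val_238`, the 10,000,000-byte threshold is an assumed setting for `hive.auto.convert.join.noconditionaltask.size`, and real estimates also include per-row overheads that are ignored here.

```python
# Rough comparison of the capped statistics estimate against the real
# build-side payload. Every constant below is an illustrative assumption,
# not a value read from Hive or from the test run.

ROWS = 2000             # rows in build_small_side (LIMIT in the test file)
MAX_VAR_LEN = 100       # hive.stats.max.variable.length cap, bytes per row
REPEAT_COUNT = 31500    # second argument to repeat() in the test query
VALUE_LEN = 7           # assumed upper bound on len(src.value), e.g. "val_238"
THRESHOLD = 10_000_000  # assumed hive.auto.convert.join.noconditionaltask.size


def estimated_build_bytes(rows: int, capped_width: int) -> int:
    """Size the optimizer would see if each row's width is capped."""
    return rows * capped_width


def actual_build_bytes(rows: int, value_len: int, repeat_count: int) -> int:
    """Payload actually produced by repeat(value, repeat_count) per row."""
    return rows * value_len * repeat_count


estimated = estimated_build_bytes(ROWS, MAX_VAR_LEN)
actual = actual_build_bytes(ROWS, VALUE_LEN, REPEAT_COUNT)

print(f"estimated: {estimated:,} bytes")
print(f"actual   : {actual:,} bytes")
print(f"ratio    : {actual // estimated}x")
print(f"gate passes on estimate: {estimated <= THRESHOLD}")
print(f"actual fits threshold  : {actual <= THRESHOLD}")
```

Under these assumptions the estimate stays comfortably below the threshold while the real payload is in the hundreds of megabytes, which is the same order of magnitude as the build size reported further down.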
   
   The following test file succeeds on current master but OOMs on this branch 
under `TestMiniTezCliDriver`:
   
   ```sql
   --! qt:dataset:src
   set hive.auto.convert.join=true;
   set hive.vectorized.execution.enabled=false;
   set hive.tez.container.size=512;
   set tez.cartesian-product.max-parallelism=32;
   set tez.cartesian-product.min-ops-per-worker=10000;
   
   create table build_small_side stored as orc as select s1.value as v from src s1 cross join src s2 limit 2000;
   create table probe_big stored as orc as select s1.value as pk from src s1 cross join src s2 limit 2001;
   
   select length(max(b.w)) as ml
   from probe_big p
   cross join (select repeat(z.v, 31500) as w from build_small_side z) b;
   ```
   
   On master the cross product runs as a distributed shuffle (`XPROD_EDGE`) and 
completes. On this branch the byte-fallback converts it to a broadcast 
map-join, and the build hashtable OOMs during load — ~429MB of build 
(`SHUFFLE_BYTES`) broadcast into a ~416MB task heap (`COMMITTED_HEAP_BYTES`):
   
   ```
   java.lang.RuntimeException: Map operator initialization failed
   ...
   Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.hive.serde2.WriteBuffers.nextBufferToWrite(WriteBuffers.java:261)
        at org.apache.hadoop.hive.serde2.WriteBuffers.write(WriteBuffers.java:237)
        at org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer$LazyBinaryKvWriter.writeValue(MapJoinBytesTableContainer.java:333)
        at org.apache.hadoop.hive.ql.exec.persistence.BytesBytesMultiHashMap.writeValueAndLength(BytesBytesMultiHashMap.java:923)
        at org.apache.hadoop.hive.ql.exec.persistence.BytesBytesMultiHashMap.put(BytesBytesMultiHashMap.java:448)
        at org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer.putRow(MapJoinBytesTableContainer.java:460)
        at org.apache.hadoop.hive.ql.exec.tez.HashTableLoader.load(HashTableLoader.java:261)
        at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTableInternal(MapJoinOperator.java:381)
        at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:448)
   ```
   
   (`TestMiniTezCliDriver` is used deliberately: its per-container memory 
isolation is what lets master's distributed cross product survive while the 
broadcast concentrates the whole build into one container. Under the default 
`MiniLlapLocal` driver everything shares one JVM heap, so the contrast doesn't 
surface.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

