andygrove commented on PR #3564:
URL: https://github.com/apache/datafusion-comet/pull/3564#issuecomment-3942022715
I have to give Claude Code a shoutout for tracking down this bug. It took
many iterations of debugging.
```
Root cause found and fixed: DataFusion's DataSourceExec always wraps
output with BatchSplitStream, which slices batches larger than batch_size
(default 8192 rows). In GHJ Phase 3:
1. A build batch with 696,344 rows / 22 MB is passed through DataSourceExec
2. BatchSplitStream slices it into 696,344 / 8192 = 85 slices
3. Each slice shares the original Arrow buffers via zero-copy batch.slice()
4. get_record_batch_memory_size() reports the full buffer size (~22 MB)
for each slice
5. collect_left_input calls try_grow(22 MB) 85 times → 1.87 GB phantom
reservation
6. The actual memory is only ~22 MB → 85x over-counting → spurious OOM
Fix: Created context_without_batch_splitting() that produces a TaskContext
with batch_size = usize::MAX, preventing BatchSplitStream from slicing. Applied
to all 3 Phase 3 code paths:
- Fast path (join_partition_recursive via fast path)
- Recursive path (join_partition_recursive)
- Spilled probe path (join_with_spilled_probe)
You can verify by running TPC-DS q72 again. The build batch will now pass
through as a single batch, and collect_left_input will correctly account
for ~22 MB (1 try_grow) instead of 1.87 GB (85 try_grows).
```
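To make the over-counting arithmetic concrete, here is a minimal std-only Rust sketch. `Reservation` is a hypothetical stand-in for DataFusion's `MemoryReservation` / `MemoryPool::try_grow` (it tracks bytes callers *claim*, not bytes actually allocated); the 85-slice and 22 MB figures are taken from the analysis above, and the 1 GB limit is an assumed per-operator budget for illustration.

```rust
// Toy stand-in for a memory reservation against a fixed pool limit.
// It only tracks claimed bytes, like DataFusion's accounting does.
struct Reservation {
    reserved: usize,
    limit: usize,
}

impl Reservation {
    fn new(limit: usize) -> Self {
        Self { reserved: 0, limit }
    }

    // Mirrors try_grow semantics: fail once the claimed total would
    // exceed the limit, otherwise record the growth.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.reserved + bytes > self.limit {
            return Err(format!(
                "spurious OOM: {} + {} exceeds limit {}",
                self.reserved, bytes, self.limit
            ));
        }
        self.reserved += bytes;
        Ok(())
    }
}

fn main() {
    const MB: usize = 1024 * 1024;
    let full_buffer_size = 22 * MB; // size of the shared Arrow buffers
    let num_slices = 85;            // slices reported in the analysis above
    let limit = 1024 * MB;          // assumed 1 GB per-operator budget

    // Bug path: every zero-copy slice reports the FULL buffer size, so
    // Phase 3 tries to reserve 85 x 22 MB even though only ~22 MB of
    // Arrow buffers are actually resident.
    let mut phantom = Reservation::new(limit);
    let mut failed_at = None;
    for i in 0..num_slices {
        if phantom.try_grow(full_buffer_size).is_err() {
            failed_at = Some(i);
            break;
        }
    }
    println!("phantom reservation fails at slice {:?}", failed_at);

    // Fixed path: with batch splitting disabled the batch arrives whole,
    // so there is exactly one try_grow of ~22 MB.
    let mut fixed = Reservation::new(limit);
    assert!(fixed.try_grow(full_buffer_size).is_ok());
    println!("single-batch reservation: {} MB", fixed.reserved / MB);
}
```

Run against a 1 GB budget, the phantom path blows through the limit partway through the 85 slices, while the fixed path reserves a single 22 MB, which is why bumping the effective batch_size to usize::MAX makes the spurious OOM disappear.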
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]