Re: [PR] Avoid concatenating record batches in joins [datafusion]

via GitHub Wed, 01 Jul 2026 06:37:47 -0700


mbutrovich commented on PR #23032:
URL: https://github.com/apache/datafusion/pull/23032#issuecomment-4855509587


   Thanks for putting this together. I'm trying to reason about how it behaves 
across a wider range of build-side shapes than the 60M-row case in the 
description, and a few things aren't obvious to me yet. Mostly questions rather 
than objections.
   
   **Sliced / over-allocated input batches.** `RecordBatch::slice` is a 
zero-copy view: it keeps the parent buffer alive and just adjusts offset and 
length. `concat_batches` copies out the live rows, so the oversized parents can 
drop. If the build side arrives as slices (LIMIT, some repartition and scan 
paths) or as batches whose buffers have capacity well above their length, 
keeping them un-concatenated seems like it would pin the full backing buffers 
for the whole probe phase, where concat would have compacted to live rows. In 
that case the retained footprint could be larger than the merged batch we're 
trying to avoid. Have you measured peak memory with sliced or partially-full 
build batches? Do the build sides you see in practice tend to be compact and 
fully packed?
   
   **Cost now applies to every multi-batch build, not just huge ones.** 
`collect_left_input` pushes each child batch into the vec with no coalescing, 
so any build side over one batch takes the `interleave` path. Each gather runs 
every candidate index through `flat_index_to_batch_row` (a `partition_point` 
binary search) and allocates a `Vec<(usize, usize)>`, twice per probe batch 
(`equal_rows_arr_multi` and `build_batch_from_indices_multi`). That scales with 
batch count, so many small batches is the worst shape. The single-batch fast 
path avoids it, but that only covers builds that fit in one batch. Do you have 
a benchmark for a build side split across many small batches (say a few 
thousand), versus the same data as one batch?
   
   **Outer joins lose `take`'s null propagation.** `take` maps null indices to 
null output for free. `interleave` needs a valid placeholder, so 
`take_build_array` adds an `is_null` scan plus a `nullif` pass over the output 
whenever build indices are null. For Left/Full joins with a low match rate, 
most output rows have null build indices, so that's two extra passes per output 
batch that the concat path didn't pay. Did any of the outer-join cases show up 
in benchmarking?
   
   **Dictionary and nested columns.** `interleave` merges dictionary values and 
uses `MutableArrayData` for nested types. `concat` does that merge once at 
build time; `interleave` repeats it per probe batch. A dictionary-encoded or 
struct/list-heavy build side seems like it could regress more than a constant 
factor here. Was that exercised?
   
   **PWMJ still concatenates the key column.** `build_buffered_data` keeps the 
payload columns separate but still does `concat(&key_arrays)` because the merge 
needs a contiguous sorted key. So a build side whose join key is itself a large 
string column would still overflow the offsets in piecewise merge join. That 
seems fine as a scope decision, just want to confirm the offset-overflow fix is 
intended to cover payload columns and not the key in that operator.
   
   On the memory numbers themselves, I'd find it easier to reason about this 
framed in bytes and by partition mode rather than row count. The doubling and 
the i32 offset limit are both byte-driven, and they mainly bite in 
`CollectLeft` where one operator holds the entire build side; `Partitioned` 
builds see roughly 1/N. The offset-overflow argument in particular stands on 
its own as a correctness ceiling that making `concat_batches` incremental 
wouldn't address, and that feels like the strongest motivation here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Avoid concatenating record batches in joins [datafusion]

Reply via email to