mbutrovich commented on PR #23032: URL: https://github.com/apache/datafusion/pull/23032#issuecomment-4855509587
Thanks for putting this together. I'm trying to reason about how it behaves across a wider range of build-side shapes than the 60M-row case in the description, and a few things aren't obvious to me yet. Mostly questions rather than objections. **Sliced / over-allocated input batches.** `RecordBatch::slice` is a zero-copy view: it keeps the parent buffer alive and just adjusts offset and length. `concat_batches` copies out the live rows, so the oversized parents can drop. If the build side arrives as slices (LIMIT, some repartition and scan paths) or as batches whose buffers have capacity well above their length, keeping them un-concatenated seems like it would pin the full backing buffers for the whole probe phase, where concat would have compacted to live rows. In that case the retained footprint could be larger than the merged batch we're trying to avoid. Have you measured peak memory with sliced or partially-full build batches? Do the build sides you see in practice tend to be compact and fully packed? **Cost now applies to every multi-batch build, not just huge ones.** `collect_left_input` pushes each child batch into the vec with no coalescing, so any build side over one batch takes the `interleave` path. Each gather runs every candidate index through `flat_index_to_batch_row` (a `partition_point` binary search) and allocates a `Vec<(usize, usize)>`, twice per probe batch (`equal_rows_arr_multi` and `build_batch_from_indices_multi`). That scales with batch count, so many small batches is the worst shape. The single-batch fast path avoids it, but that only covers builds that fit in one batch. Do you have a benchmark for a build side split across many small batches (say a few thousand), versus the same data as one batch? **Outer joins lose `take`'s null propagation.** `take` maps null indices to null output for free. `interleave` needs a valid placeholder, so `take_build_array` adds an `is_null` scan plus a `nullif` pass over the output whenever build indices are null. For Left/Full joins with a low match rate, most output rows have null build indices, so that's two extra passes per output batch that the concat path didn't pay. Did any of the outer-join cases show up in benchmarking? **Dictionary and nested columns.** `interleave` merges dictionary values and uses `MutableArrayData` for nested types. `concat` does that merge once at build time; `interleave` repeats it per probe batch. A dictionary-encoded or struct/list-heavy build side seems like it could regress more than a constant factor here. Was that exercised? **PWMJ still concatenates the key column.** `build_buffered_data` keeps the payload columns separate but still does `concat(&key_arrays)` because the merge needs a contiguous sorted key. So a build side whose join key is itself a large string column would still overflow the offsets in piecewise merge join. That seems fine as a scope decision, just want to confirm the offset-overflow fix is intended to cover payload columns and not the key in that operator. On the memory numbers themselves, I'd find it easier to reason about this framed in bytes and by partition mode rather than row count. The doubling and the i32 offset limit are both byte-driven, and they mainly bite in `CollectLeft` where one operator holds the entire build side; `Partitioned` builds see roughly 1/N. The offset-overflow argument in particular stands on its own as a correctness ceiling that making `concat_batches` incremental wouldn't address, and that feels like the strongest motivation here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
