SubhamSinghal opened a new pull request, #22103: URL: https://github.com/apache/datafusion/pull/22103
## Which issue does this PR close?

Follow-up to [#21962](https://github.com/apache/datafusion/pull/21962).

## Rationale for this change

After #21962, the memory pool accurately tracks residual `join_arrays` memory that remains after a `BufferedBatch` is spilled to disk. However, when spilled batches are **read back** from disk during output materialization in `materialize_right_columns`, the deserialized data temporarily exists in memory without any pool reservation.

- **Single-source path**: one full batch loaded without reservation
- **Multi-source interleave path**: ALL referenced spilled batches loaded simultaneously, leaving N × batch_size untracked

The pool thinks these batches cost 0 bytes during read-back. Under memory pressure (the reason they were spilled), other operators see stale headroom and may over-allocate, risking OOM.

## What changes are included in this PR?

Changed `materialize_right_columns` from `&self` to `&mut self` and added `grow`/`shrink` at the exact points where spilled data is read from disk:

**Path A (single source spilled):**
- `grow(size_estimation)` immediately before `fetch_right_columns_by_idxs`
- `shrink(size_estimation)` immediately after

**Path B (multi-source interleave):**
- Sum `size_estimation` for all spilled sources
- `grow(total)` before `source_data` loading
- `shrink(total)` after interleave completes

Uses unconditional `grow()` because the data must be read to produce output; there is no fallback. Same rationale as #21962: if memory physically exists, the pool must reflect it.

## Are these changes tested?

Yes, two new tests:
- `spill_read_back_memory_accounting`: multiple buffered batches for same key (multi-source Path B); verifies `peak_mem_used >= size_estimation` and `pool.reserved() == 0` at end
- `spill_read_back_single_source`: distinct keys with one batch per group (single-source Path A); same assertions

## Are there any user-facing changes?

No.
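
For readers skimming the description, the `grow`/`shrink` bracketing of Path A and Path B can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: `ToyReservation`, `SpilledBatch`, `read_back_single`, and `read_back_interleave` are hypothetical stand-ins written with the standard library only, and the actual disk read and interleave are replaced by placeholders.

```rust
/// Toy stand-in for a pool reservation (hypothetical, not DataFusion's type).
#[derive(Debug, Default)]
struct ToyReservation {
    reserved: usize, // bytes currently accounted for
    peak: usize,     // highest value `reserved` has reached
}

impl ToyReservation {
    /// Unconditionally account for `bytes`; this toy model never fails.
    fn grow(&mut self, bytes: usize) {
        self.reserved += bytes;
        self.peak = self.peak.max(self.reserved);
    }

    /// Release `bytes` from the accounting.
    fn shrink(&mut self, bytes: usize) {
        self.reserved = self.reserved.saturating_sub(bytes);
    }
}

/// Hypothetical spilled batch; only its size estimate matters here.
struct SpilledBatch {
    size_estimation: usize,
}

/// Path A: a single spilled source is read back.
fn read_back_single(res: &mut ToyReservation, batch: &SpilledBatch) -> usize {
    res.grow(batch.size_estimation);
    // Placeholder for reading the batch from disk and selecting rows.
    let bytes_in_memory = batch.size_estimation;
    res.shrink(batch.size_estimation);
    bytes_in_memory
}

/// Path B: every spilled source is resident at once while interleaving.
fn read_back_interleave(res: &mut ToyReservation, batches: &[SpilledBatch]) -> usize {
    let total: usize = batches.iter().map(|b| b.size_estimation).sum();
    res.grow(total);
    // Placeholder for loading all sources and interleaving them.
    let bytes_in_memory = total;
    res.shrink(total);
    bytes_in_memory
}

fn main() {
    let mut res = ToyReservation::default();

    read_back_single(&mut res, &SpilledBatch { size_estimation: 1024 });
    assert_eq!(res.reserved, 0);
    assert_eq!(res.peak, 1024);

    let batches = vec![
        SpilledBatch { size_estimation: 1024 },
        SpilledBatch { size_estimation: 2048 },
    ];
    read_back_interleave(&mut res, &batches);
    assert_eq!(res.reserved, 0);
    assert_eq!(res.peak, 3072);

    println!("peak = {} bytes, reserved at end = {}", res.peak, res.reserved);
}
```

The two assertions per path mirror the shape of the checks listed for the new tests: the peak reflects the read-back size, and nothing stays reserved once output is produced.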
