Re: [PR] Validate the memory consumption in SPM created by multi level merge [datafusion]

via GitHub Mon, 04 Aug 2025 00:20:48 -0700


ding-young commented on PR #17029:
URL: https://github.com/apache/datafusion/pull/17029#issuecomment-3149424789


   ### List of failing tests 
   <details>
   
   ```
   // fuzz_cases 
   
fuzz_cases::aggregate_fuzz::test_single_mode_aggregate_single_mode_aggregate_with_spill
   
fuzz_cases::spilling_fuzz_in_memory_constrained_env::test_aggregate_with_high_cardinality_with_limited_memory_and_different_sizes_of_record_batch
   
fuzz_cases::spilling_fuzz_in_memory_constrained_env::test_aggregate_with_high_cardinality_with_limited_memory_and_different_sizes_of_record_batch_and_changing_memory_reservation
   
fuzz_cases::spilling_fuzz_in_memory_constrained_env::test_aggregate_with_high_cardinality_with_limited_memory_and_different_sizes_of_record_batch_and_take_all_memory
   
fuzz_cases::spilling_fuzz_in_memory_constrained_env::test_aggregate_with_high_cardinality_with_limited_memory_and_large_record_batch
   
fuzz_cases::spilling_fuzz_in_memory_constrained_env::test_sort_with_limited_memory_and_large_record_batch
   ```
   
   ```
   // memory_limit
   memory_limit::test_stringview_external_sort 
   memory_limit::test_stringview_external_sort
   ```
   </details>
   
   ### Possible cause? 
   There are two problem discovered while working on this PR. 
   
   First, we don't reserve memory for in-memory streams (since we don't know 
the batch size without polling it) when we build `SortPreservingMergeStream` 
with both in-memory streams and sorted spill files. The `merge_reservation` 
only accounts for worst case memory usage for sorted spill files, so this means 
that we'll bypass the global memory pool but does not check the memory consumed 
by in-memory streams. 
   
   Second, it seems like the actual memory usage reported by local memory pool 
exceeds the size of precomputed `memory_reservation` even when there are only 
sorted spill files to merge. Failing tests indicate this, and when I printed 
out the size of `UnboundedMemoryPool`, it seems like we underestimated the 
worst case memory consumption for merge.
   
   <details>
   
   There are 6 spill files to merge, and we reserved `3676800` bytes based on 
max record batch size for each spill file, but the peak `used` bytes of 
`UnboundedMemoryPool` are `28280376`. Note that this test 
`test_stringview_external_sort` uses `FairSpillPool` as global memory pool, so 
I guess the usage log for `UnboundedMemoryPool` accounts for SPM in multi level 
merge. 
   
   ```
   // modify `with_bypass_mempool` in this PR to use UnbountedMemoryPool
   // RUST_LOG=debug cargo test -p datafusion 
memory_limit::test_stringview_external_sort -- --exact --nocapture
   [merge_sorted_runs_within_mem_limit] spill6, inmem0
   [merge_sorted_runs_within_mem_limit] memory_reservation:3676800
   create new merge sort with bypass mempool
   [mem pool] 1394496 used
   [mem pool] 2788992 used
   [mem pool] 3639880 used
   [mem pool] 4570768 used
   [mem pool] 5965448 used
   [mem pool] 7360128 used
   [mem pool] 8754744 used
   [mem pool] 10149360 used
   [mem pool] 11543832 used
   [mem pool] 12938304 used
   [mem pool] 14332872 used
   [mem pool] 15727440 used
   [mem pool] 17122440 used
   [mem pool] 18517440 used
   [mem pool] 19912376 used
   [mem pool] 21307312 used
   [mem pool] 22701912 used
   [mem pool] 24096512 used
   [mem pool] 25491152 used
   [mem pool] 26885792 used
   [mem pool] 28280376 used
   ```
   </details> 
   
   @2010YOUY01 @rluvaton  
   Could you help me check if the issue is really with estimation or memory 
management, rather than me logging things incorrectly? I'm not sure where this 
discrepancy is coming from, since it's not only the string array-related tests 
that are failing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Validate the memory consumption in SPM created by multi level merge [datafusion]

Reply via email to