ding-young commented on PR #17029: URL: https://github.com/apache/datafusion/pull/17029#issuecomment-3149424789
### List of failing tests <details> ``` // fuzz_cases fuzz_cases::aggregate_fuzz::test_single_mode_aggregate_single_mode_aggregate_with_spill fuzz_cases::spilling_fuzz_in_memory_constrained_env::test_aggregate_with_high_cardinality_with_limited_memory_and_different_sizes_of_record_batch fuzz_cases::spilling_fuzz_in_memory_constrained_env::test_aggregate_with_high_cardinality_with_limited_memory_and_different_sizes_of_record_batch_and_changing_memory_reservation fuzz_cases::spilling_fuzz_in_memory_constrained_env::test_aggregate_with_high_cardinality_with_limited_memory_and_different_sizes_of_record_batch_and_take_all_memory fuzz_cases::spilling_fuzz_in_memory_constrained_env::test_aggregate_with_high_cardinality_with_limited_memory_and_large_record_batch fuzz_cases::spilling_fuzz_in_memory_constrained_env::test_sort_with_limited_memory_and_large_record_batch ``` ``` // memory_limit memory_limit::test_stringview_external_sort memory_limit::test_stringview_external_sort ``` </details> ### Possible cause? There are two problem discovered while working on this PR. First, we don't reserve memory for in-memory streams (since we don't know the batch size without polling it) when we build `SortPreservingMergeStream` with both in-memory streams and sorted spill files. The `merge_reservation` only accounts for worst case memory usage for sorted spill files, so this means that we'll bypass the global memory pool but does not check the memory consumed by in-memory streams. Second, it seems like the actual memory usage reported by local memory pool exceeds the size of precomputed `memory_reservation` even when there are only sorted spill files to merge. Failing tests indicate this, and when I printed out the size of `UnboundedMemoryPool`, it seems like we underestimated the worst case memory consumption for merge. <details> There are 6 spill files to merge, and we reserved `3676800` bytes based on max record batch size for each spill file, but the peak `used` bytes of `UnboundedMemoryPool` are `28280376`. Note that this test `test_stringview_external_sort` uses `FairSpillPool` as global memory pool, so I guess the usage log for `UnboundedMemoryPool` accounts for SPM in multi level merge. ``` // modify `with_bypass_mempool` in this PR to use UnbountedMemoryPool // RUST_LOG=debug cargo test -p datafusion memory_limit::test_stringview_external_sort -- --exact --nocapture [merge_sorted_runs_within_mem_limit] spill6, inmem0 [merge_sorted_runs_within_mem_limit] memory_reservation:3676800 create new merge sort with bypass mempool [mem pool] 1394496 used [mem pool] 2788992 used [mem pool] 3639880 used [mem pool] 4570768 used [mem pool] 5965448 used [mem pool] 7360128 used [mem pool] 8754744 used [mem pool] 10149360 used [mem pool] 11543832 used [mem pool] 12938304 used [mem pool] 14332872 used [mem pool] 15727440 used [mem pool] 17122440 used [mem pool] 18517440 used [mem pool] 19912376 used [mem pool] 21307312 used [mem pool] 22701912 used [mem pool] 24096512 used [mem pool] 25491152 used [mem pool] 26885792 used [mem pool] 28280376 used ``` </details> @2010YOUY01 @rluvaton Could you help me check if the issue is really with estimation or memory management, rather than me logging things incorrectly? I'm not sure where this discrepancy is coming from, since it's not only the string array-related tests that are failing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org