ding-young commented on PR #17029: URL: https://github.com/apache/datafusion/pull/17029#issuecomment-3166544088
Update: I took an alternative approach, similar to what @2010YOUY01 suggested:

> I have an alternative idea to make this validation more fine-grained: Let's say there are 3 spills to merge, each has estimated max batch size 10M, 15M, 12M. Then we can only check during merging that each stream's batch size is always less than [10M, 15M, 12M].

I switched back to using `UnboundedMemoryPool`, but instead added a check to `SpillReadStream` so that whenever a spill stream is polled, the memory size of the batch being read does not exceed `max_record_batch_memory`. This lets us detect cases where we made an incorrect (underestimated) memory reservation, for example when a batch consumes more memory after the write-read cycle than originally expected.

There is a slight discrepancy due to minor vector allocations, so I added a margin to the check. Fortunately, the validation passes in most cases. However, it currently fails for external sorting with string views, so further investigation is needed.
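To illustrate the idea, here is a minimal, self-contained sketch of the per-stream check described above. The names (`SpillStreamCheck`, `margin`, the error message) and the margin value are illustrative assumptions, not the actual DataFusion implementation; in the real code the check would live in `SpillReadStream::poll_next` and the batch size would come from the Arrow batch's reported memory size.

```rust
/// Hypothetical per-stream validator: each spill stream carries the batch
/// memory limit estimated when its reservation was made, plus a small
/// slack to absorb minor vector-allocation differences that appear after
/// the write-read cycle. (Names and values here are illustrative.)
struct SpillStreamCheck {
    /// Estimated max batch size recorded at spill time.
    max_record_batch_memory: usize,
    /// Slack for minor allocation discrepancies (illustrative value).
    margin: usize,
}

impl SpillStreamCheck {
    /// Returns Ok if a freshly read batch fits within the reservation
    /// (plus margin), or an error describing the underestimation.
    fn validate(&self, actual_batch_memory: usize) -> Result<(), String> {
        if actual_batch_memory <= self.max_record_batch_memory + self.margin {
            Ok(())
        } else {
            Err(format!(
                "batch uses {} bytes, exceeding reserved {} bytes (+{} margin)",
                actual_batch_memory, self.max_record_batch_memory, self.margin
            ))
        }
    }
}

fn main() {
    // Three spills with different estimated max batch sizes, as in the
    // fine-grained scheme: [10M, 15M, 12M].
    let checks = [
        SpillStreamCheck { max_record_batch_memory: 10_000_000, margin: 4096 },
        SpillStreamCheck { max_record_batch_memory: 15_000_000, margin: 4096 },
        SpillStreamCheck { max_record_batch_memory: 12_000_000, margin: 4096 },
    ];

    // A batch slightly over its stream's estimate but within the margin passes.
    assert!(checks[0].validate(10_000_000 + 1024).is_ok());
    // A batch well over the estimate signals an underestimated reservation.
    assert!(checks[1].validate(20_000_000).is_err());
    println!("validation sketch ok");
}
```

The point of keeping one limit per stream (rather than one global cap) is that each spill's reservation was sized from its own batches, so a violation pinpoints exactly which stream's estimate was wrong.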
