ding-young commented on PR #17029: URL: https://github.com/apache/datafusion/pull/17029#issuecomment-3166544088
Update: I took an alternative approach, similar to what @2010YOUY01 suggested:

> I have an alternative idea to make this validation more fine-grained: Let's say there are 3 spills to merge, each has estimated max batch size 10M, 15M, 12M. Then we can only check during merging that each stream's batch size is always less than [10M, 15M, 12M].

I switched back to using `UnboundedMemoryPool`, but instead added a check to `SpillReadStream` so that whenever a spill stream is polled, the memory size of the batch being read does not exceed `max_record_batch_memory`. This lets us detect cases where we made an incorrect (underestimated) memory reservation, for example when a batch consumes more memory after the write-read cycle than originally expected.

There is a slight discrepancy due to minor vector allocations, so I added a margin to the check. Fortunately, the validation passes in most cases. However, it currently fails for external sorting with string views, so further investigation is needed.
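To illustrate the idea, here is a minimal, self-contained sketch of the per-stream check described above. The names (`SpillStreamCheck`, `margin`, the error message) and the margin value are illustrative assumptions, not the actual DataFusion implementation; in the real code the check would live in `SpillReadStream::poll_next` and the batch size would come from the Arrow batch's reported memory size.

```rust
/// Hypothetical per-stream validator: each spill stream carries the batch
/// memory limit estimated when its reservation was made, plus a small
/// slack to absorb minor vector-allocation differences that appear after
/// the write-read cycle. (Names and values here are illustrative.)
struct SpillStreamCheck {
    /// Estimated max batch size recorded at spill time.
    max_record_batch_memory: usize,
    /// Slack for minor allocation discrepancies (illustrative value).
    margin: usize,
}

impl SpillStreamCheck {
    /// Returns Ok if a freshly read batch fits within the reservation
    /// (plus margin), or an error describing the underestimation.
    fn validate(&self, actual_batch_memory: usize) -> Result<(), String> {
        if actual_batch_memory <= self.max_record_batch_memory + self.margin {
            Ok(())
        } else {
            Err(format!(
                "batch uses {} bytes, exceeding reserved {} bytes (+{} margin)",
                actual_batch_memory, self.max_record_batch_memory, self.margin
            ))
        }
    }
}

fn main() {
    // Three spills with different estimated max batch sizes, as in the
    // fine-grained scheme: [10M, 15M, 12M].
    let checks = [
        SpillStreamCheck { max_record_batch_memory: 10_000_000, margin: 4096 },
        SpillStreamCheck { max_record_batch_memory: 15_000_000, margin: 4096 },
        SpillStreamCheck { max_record_batch_memory: 12_000_000, margin: 4096 },
    ];

    // A batch slightly over its stream's estimate but within the margin passes.
    assert!(checks[0].validate(10_000_000 + 1024).is_ok());
    // A batch well over the estimate signals an underestimated reservation.
    assert!(checks[1].validate(20_000_000).is_err());
    println!("validation sketch ok");
}
```

The point of keeping one limit per stream (rather than one global cap) is that each spill's reservation was sized from its own batches, so a violation pinpoints exactly which stream's estimate was wrong.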
