cetra3 commented on issue #19216: URL: https://github.com/apache/datafusion/issues/19216#issuecomment-3920442449
I've played around with some prototype changes and managed to get this working. I've saved the investigation in this branch: https://github.com/apache/datafusion/compare/main...pydantic:datafusion:memory_observations

There are a few changes I needed to make to get this to run to completion:

## Memory Coordinator

The main change is a **Memory Coordinator**, which is like a memory pool but adds a few ways to `.await` until memory becomes available. In other words, if there are consumers waiting for memory, other spillable consumers should spill.

For example, in one execution path we wait for memory:

```rust
self.memory_consumer.allocate(size, AllocationType::Required).await
```

This registers a list of wakers waiting for memory, served in FIFO order.

Then in another execution path, we periodically check whether we should spill:

```rust
if self.memory_consumer.should_spill() {
    self.sort_and_spill_in_mem_batches().await?;
}
```

This *does* have the downside of potentially deadlocking on memory, so some protections should be in place, perhaps a timeout while waiting. Regardless, it's better than the current strategy of completely bailing out if memory isn't available at request time. (A rough sketch of what such a coordinator could look like is at the end of this comment.)

## MultiLevelMergeBuilder

This is the main reason I think things are failing. Essentially, when you have more than one partition, one will greedily take all the available memory, starving out the other consumers/partitions, leaving no headroom for other allocations such as in-memory sorts.

I've changed this to be a bit nicer: first it will `try_grow` and await the minimum required memory, then it will try to grow the allocation if there is headroom, up to a limit. The limit is derived from the number of registered consumers, which normally equals the number of partitions, but if you have multiple queries running this scales accordingly. (See the fair-share sketch at the end of this comment.)

## Allocation Sizes

There is a *lot* of double accounting and other weirdness around how we know how much memory is used. I turned off a few areas of double accounting that I saw, and I ran this with `dhat` to actually check it; it seems to fall under the 4 GB limit. But we need a better solution here holistically.

So there are a lot of changes needed for this one. I will work on breaking the changes out into separate PRs if people are amenable, starting with `MemoryCoordinator`.
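For illustration, here is a minimal sketch of the kind of coordinator described above. This is **not** the code in the linked branch; the names (`MemoryCoordinator`, `allocate`, `should_spill`, `release`) and the waker-queue approach are assumptions about one way the FIFO waiting could be implemented.

```rust
// Sketch only: a coordinator that lets consumers `.await` memory instead of
// erroring out when the reservation fails. All names are illustrative.
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::Mutex;
use std::task::{Context, Poll, Waker};

pub struct MemoryCoordinator {
    state: Mutex<State>,
    limit: usize,
}

struct State {
    used: usize,
    // FIFO queue of tasks parked waiting for memory to be released.
    waiters: VecDeque<Waker>,
}

impl MemoryCoordinator {
    pub fn new(limit: usize) -> Self {
        Self {
            state: Mutex::new(State { used: 0, waiters: VecDeque::new() }),
            limit,
        }
    }

    /// Wait (FIFO) until `size` bytes can be reserved.
    pub fn allocate(&self, size: usize) -> Allocate<'_> {
        Allocate { coordinator: self, size }
    }

    /// Spillable consumers poll this periodically: if anyone is queued
    /// waiting for memory, they should spill to free some up.
    pub fn should_spill(&self) -> bool {
        !self.state.lock().unwrap().waiters.is_empty()
    }

    /// Return memory and wake the first waiter in the queue.
    pub fn release(&self, size: usize) {
        let waker = {
            let mut state = self.state.lock().unwrap();
            state.used = state.used.saturating_sub(size);
            state.waiters.pop_front()
        };
        if let Some(waker) = waker {
            waker.wake();
        }
    }
}

pub struct Allocate<'a> {
    coordinator: &'a MemoryCoordinator,
    size: usize,
}

impl Future for Allocate<'_> {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut state = self.coordinator.state.lock().unwrap();
        if state.used + self.size <= self.coordinator.limit {
            state.used += self.size;
            Poll::Ready(())
        } else {
            // Park this task; `release` wakes waiters in FIFO order. A real
            // implementation would also need to handle re-queuing fairness,
            // duplicate wakers, cancellation, and deadlock protection
            // (e.g. the timeout mentioned above).
            state.waiters.push_back(cx.waker().clone());
            Poll::Pending
        }
    }
}
```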
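And a similarly hedged sketch of the "grow up to a limit" idea for `MultiLevelMergeBuilder`: the helper names and the even-share policy here are my own illustration of how a limit derived from the number of registered consumers could work, not the actual branch code.

```rust
/// Cap opportunistic growth at an even share of the pool.
/// `registered_consumers` normally matches the partition count, but grows
/// when several queries run concurrently.
fn fair_share_limit(pool_size: usize, registered_consumers: usize) -> usize {
    pool_size / registered_consumers.max(1)
}

/// Reserve at least the minimum, then grow toward `desired` only while
/// staying within the fair share.
fn target_allocation(
    minimum_required: usize,
    desired: usize,
    pool_size: usize,
    registered_consumers: usize,
) -> usize {
    let limit = fair_share_limit(pool_size, registered_consumers);
    desired.min(limit).max(minimum_required)
}

fn main() {
    const MB: usize = 1024 * 1024;
    // e.g. a 4 GB pool shared by 8 registered consumers: each merge may grow
    // opportunistically up to 512 MB, but is always granted its minimum.
    assert_eq!(fair_share_limit(4096 * MB, 8), 512 * MB);
    assert_eq!(target_allocation(64 * MB, 2048 * MB, 4096 * MB, 8), 512 * MB);
}
```

The point of the even share is that no single merge can starve the other partitions, while a single-query, single-partition plan still gets to use the whole pool.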
