cetra3 commented on issue #19216:
URL: https://github.com/apache/datafusion/issues/19216#issuecomment-3920442449

   I've played around with some prototype changes and managed to get this 
working.
   
   I've saved the investigation in this branch:
   
   
https://github.com/apache/datafusion/compare/main...pydantic:datafusion:memory_observations
   
   There are a few changes that I needed to make to get this to run to 
completion:
   
   ## Memory Coordinator
   
   The main change is designing a **Memory Coordinator**, which is like a 
memory pool, but adds a few ways to `.await` until memory becomes 
available. For example, if there are consumers waiting for memory, then 
other spillable consumers should spill.
   
   So in one execution path, we wait for memory:
   
   ```rust
   self.memory_consumer.allocate(size, AllocationType::Required).await
   ```
   
   This registers the consumer in a FIFO queue of wakers waiting for 
memory.
   
   Then in another execution path, we periodically check if we should spill:
   
   ```rust
   if self.memory_consumer.should_spill() {
     self.sort_and_spill_in_mem_batches().await?;
   }
   ```
   
   This *does* have the downside of potentially deadlocking on memory, so 
some protections should be in place, maybe a timeout while waiting. 
Regardless, it's better than the current strategy of bailing out entirely 
if memory isn't available at request time.
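   To make the waiting/spilling interplay concrete, here is a minimal synchronous sketch of the coordinator idea. All names (`Coordinator`, `allocate`, `should_spill`, `free`) are hypothetical stand-ins, and the FIFO queue of request sizes stands in for the registered wakers; the real async version would park the task and wake it when memory is freed.
   
   ```rust
   use std::collections::VecDeque;
   
   /// Hypothetical coordinator state: a byte limit, current usage, and a
   /// FIFO queue of pending requests (stand-ins for registered wakers).
   struct Coordinator {
       limit: usize,
       used: usize,
       waiters: VecDeque<usize>,
   }
   
   impl Coordinator {
       fn new(limit: usize) -> Self {
           Coordinator { limit, used: 0, waiters: VecDeque::new() }
       }
   
       /// Try to satisfy a required allocation; if it doesn't fit, queue
       /// the request (the real version would register a waker and `.await`).
       fn allocate(&mut self, size: usize) -> bool {
           if self.used + size <= self.limit {
               self.used += size;
               true
           } else {
               self.waiters.push_back(size);
               false
           }
       }
   
       /// A spillable consumer should spill as soon as anyone is queued.
       fn should_spill(&self) -> bool {
           !self.waiters.is_empty()
       }
   
       /// Freeing memory grants queued requests in FIFO order (the real
       /// version would wake the corresponding tasks here).
       fn free(&mut self, size: usize) {
           self.used = self.used.saturating_sub(size);
           while let Some(next) = self.waiters.front().copied() {
               if self.used + next <= self.limit {
                   self.used += next;
                   self.waiters.pop_front();
               } else {
                   break;
               }
           }
       }
   }
   
   fn main() {
       let mut c = Coordinator::new(100);
       assert!(c.allocate(60));   // fits: 60/100 used
       assert!(!c.allocate(60));  // doesn't fit: queued FIFO
       assert!(c.should_spill()); // a waiter exists, so spillers spill
       c.free(60);                // spilling frees memory; waiter is granted
       assert!(!c.should_spill());
       assert_eq!(c.used, 60);
   }
   ```
   
   The point of the FIFO order is fairness: the consumer that has waited longest is granted memory first, rather than whichever one happens to retry at the right moment.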
   
   ## MultiLevelMergeBuilder
   
   This is the main reason I think things are failing. Essentially, when 
you have more than one partition, one will greedily take all the available 
memory, starving out the other consumers/partitions.
   
   That leaves no headroom for other allocations, like in-memory sorts. 
I've changed this to be a bit nicer: first it will `.try_grow` and await 
the minimum required memory, then it will try to grow the allocation 
further if there is headroom, up to a limit. The limit is based on the 
number of registered consumers, which normally equals the number of 
partitions, but if you have multiple queries running this scales 
accordingly.
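   A rough sketch of that two-phase policy, under my own naming (`plan_allocation` and `fair_share` are hypothetical, not functions from the branch): secure the minimum first, then grow opportunistically toward a per-consumer fair share of the pool.
   
   ```rust
   /// Hypothetical fair-share cap: the pool divided by the number of
   /// registered consumers (normally one per partition).
   fn fair_share(pool_limit: usize, registered_consumers: usize) -> usize {
       pool_limit / registered_consumers.max(1)
   }
   
   /// Two-phase policy: if even the minimum doesn't fit, return `None`
   /// (the caller would `.await` for memory); otherwise grow past the
   /// minimum only up to the headroom and the fair-share cap.
   fn plan_allocation(
       pool_limit: usize,
       pool_used: usize,
       registered_consumers: usize,
       min_required: usize,
       desired: usize,
   ) -> Option<usize> {
       let headroom = pool_limit.saturating_sub(pool_used);
       if headroom < min_required {
           return None;
       }
       let cap = fair_share(pool_limit, registered_consumers);
       Some(desired.min(headroom).min(cap).max(min_required))
   }
   
   fn main() {
       // Four consumers sharing 1000 bytes: a greedy 900-byte request is
       // capped at the 250-byte fair share, leaving room for the others.
       assert_eq!(plan_allocation(1000, 0, 4, 100, 900), Some(250));
       // Not even the minimum fits: wait for memory instead of failing.
       assert_eq!(plan_allocation(1000, 950, 4, 100, 200), None);
   }
   ```
   
   The cap is what prevents the first partition from starving the rest, while still letting a lone consumer use more when fewer consumers are registered.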
   
   ## Allocation Sizes
   
   There is a *lot* of double accounting and other weirdness around how we 
track how much memory is used. I turned off a few areas of double 
accounting that I saw, and I have run this with `dhat` to actually check 
it; it seems to fall under the 4 GB limit. But we need a more holistic 
solution here.
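   As a toy illustration of the double-accounting problem (hypothetical, not DataFusion's actual accounting code): if the operator holding a batch and a child reservation both register the same bytes, the pool overstates usage and rejects allocations that would actually fit.
   
   ```rust
   /// Minimal stand-in for a memory pool that just sums reservations.
   struct Pool {
       used: usize,
   }
   
   impl Pool {
       fn grow(&mut self, bytes: usize) {
           self.used += bytes;
       }
   }
   
   fn main() {
       let batch_bytes = 1024;
       let mut pool = Pool { used: 0 };
       pool.grow(batch_bytes); // the operator accounts for the batch
       pool.grow(batch_bytes); // a child reservation accounts for it again
       // The pool now reports double the memory actually held, so a
       // real allocator would hit its limit at half the true capacity.
       assert_eq!(pool.used, 2 * batch_bytes);
   }
   ```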
   
   
   
   So yeah, a lot of changes are needed for this one. I will work on 
breaking the changes out into separate PRs if people are amenable, 
starting with `MemoryCoordinator`.

