[I] [VL] Distinct aggregation OOM when getOutput [incubator-gluten]

via GitHub Fri, 22 Nov 2024 00:17:24 -0800


ccat3z opened a new issue, #8025:
URL: https://github.com/apache/incubator-gluten/issues/8025


   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   Distinct aggregation merges all sorted spill files in `getOutput()` 
(`SpillPartition::createOrderedReader`). If there are too many spill files, 
reading the first batch of each file into memory consumes a significant 
amount of memory. In one of our internal cases, a single task generated 300 
spill files, which required close to 3 GB of memory.
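
To make the arithmetic concrete, here is a rough Python sketch (not Velox code). It assumes each open spill file keeps one batch of about 10 MiB resident, which is only what the 300-file / 3 GB figures above imply. The function names, the batch size, and the fan-in limit are hypothetical, and in-memory lists stand in for spill files. The second function illustrates one common way to bound the number of simultaneously open runs (a multi-pass merge); it is not a description of how Velox or Gluten handles this.

```python
import heapq


def merge_memory_bytes(num_files: int, batch_bytes: int) -> int:
    """Buffer memory needed when one batch per spill file is resident at once."""
    return num_files * batch_bytes


def bounded_merge(runs, max_fan_in):
    """Merge sorted runs while reading at most `max_fan_in` runs at a time.

    Intermediate results are kept as lists here; a real implementation
    would write them back to disk as new spill files.
    """
    if max_fan_in < 2:
        raise ValueError("max_fan_in must be at least 2")
    runs = [list(r) for r in runs]
    # Each pass merges groups of `max_fan_in` runs into one longer run.
    while len(runs) > max_fan_in:
        runs = [
            list(heapq.merge(*runs[i:i + max_fan_in]))
            for i in range(0, len(runs), max_fan_in)
        ]
    return list(heapq.merge(*runs))


ASSUMED_BATCH_BYTES = 10 * 1024 * 1024  # assumption, not a measured value

# All 300 files opened together: about 2.9 GiB of batch buffers.
all_at_once = merge_memory_bytes(300, ASSUMED_BATCH_BYTES)

# Hypothetical fan-in limit of 16: about 160 MiB of batch buffers per pass.
bounded = merge_memory_bytes(16, ASSUMED_BATCH_BYTES)
```

The trade-off in the bounded variant is extra I/O: every additional pass rereads and rewrites the spilled data in exchange for a fixed cap on buffer memory.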
   
   
![image](https://github.com/user-attachments/assets/23dd540e-a4b7-448e-84e0-caae00aa5147)
   
   
   ### Spark version
   
   None
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


