helenweng-stripe opened a new pull request, #3203:
URL: https://github.com/apache/celeborn/pull/3203

   
   ### What changes were proposed in this pull request?
   Set the max memory threshold to the actual memory allocated to the task. The
value is reverse-calculated from how Spark determines it, since Spark's
`TaskMemoryManager` does not expose how much memory is available to a task.
   
   The calculation depends on whether the memory mode is on-heap or off-heap:
   * ((`spark.memory.offHeap.size` or `spark.executor.memory`) - reservedMemory (hardcoded to )) * `spark.memory.fraction` * `celeborn.client.spark.push.sort.memory.maxMemoryFactor` / `spark.executor.cores`
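
The formula above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `TaskMemoryEstimate`, the method name, and the sample values in `main` are hypothetical, and the reserved-memory amount is taken as a parameter because its value is not given here.

```java
// Hypothetical helper illustrating the per-task formula from the description.
final class TaskMemoryEstimate {
    private TaskMemoryEstimate() {}

    /**
     * @param poolBytes       spark.memory.offHeap.size (off-heap mode) or
     *                        spark.executor.memory (on-heap mode), in bytes
     * @param reservedBytes   reserved memory subtracted from the pool (assumed input)
     * @param memoryFraction  spark.memory.fraction
     * @param maxMemoryFactor celeborn.client.spark.push.sort.memory.maxMemoryFactor
     * @param executorCores   spark.executor.cores
     */
    static long maxMemoryBytes(long poolBytes, long reservedBytes,
                               double memoryFraction, double maxMemoryFactor,
                               int executorCores) {
        if (executorCores <= 0) {
            throw new IllegalArgumentException("executorCores must be positive");
        }
        // Clamp at zero in case the reserved amount exceeds the pool.
        long usable = Math.max(0L, poolBytes - reservedBytes);
        return (long) (usable * memoryFraction * maxMemoryFactor / executorCores);
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        // Illustrative values only; none of these come from the PR.
        long estimate = maxMemoryBytes(8192L * mib, 512L * mib, 0.6, 0.4, 4);
        System.out.println("estimated per-task max memory (bytes): " + estimate);
    }
}
```

Dividing by `spark.executor.cores` is what turns the executor-wide figure into a per-task share, under the assumption that one task runs per core.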
   
   Based on calculations here: 
https://github.com/apache/spark/blob/branch-3.3/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L213-L235
   
   The max memory can also be set statically with the new setting
`celeborn.client.spark.push.sort.memory.maxMemoryBytes`. It is calculated
dynamically only when
`celeborn.client.spark.push.sort.memory.calculateMaxMemoryBytes` is set to true
(default: false).
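
One way the static setting, the dynamic calculation, and the existing default could fit together is sketched below. The class and method names are hypothetical, and the precedence shown (dynamic calculation first when enabled, then the static value, then the existing `Runtime`-based default) is an assumption; the description does not state how the two new settings interact when both are set.

```java
import java.util.OptionalLong;

// Hypothetical resolution logic; the precedence order is an assumption.
final class PushSortMaxMemory {
    private PushSortMaxMemory() {}

    /**
     * @param staticMaxMemoryBytes    value of ...sort.memory.maxMemoryBytes, if set
     * @param calculateMaxMemoryBytes value of ...sort.memory.calculateMaxMemoryBytes
     * @param calculatedBytes         per-task estimate derived from Spark settings
     * @param jvmMaxBytes             Runtime.getRuntime().maxMemory()
     * @param maxMemoryFactor         ...sort.memory.maxMemoryFactor
     */
    static long resolve(OptionalLong staticMaxMemoryBytes,
                        boolean calculateMaxMemoryBytes,
                        long calculatedBytes,
                        long jvmMaxBytes,
                        double maxMemoryFactor) {
        if (calculateMaxMemoryBytes) {
            // Opt-in: use the estimate reverse-calculated from Spark settings.
            return calculatedBytes;
        }
        if (staticMaxMemoryBytes.isPresent()) {
            // Explicit static threshold supplied by the user.
            return staticMaxMemoryBytes.getAsLong();
        }
        // Existing behaviour: a fraction of the whole JVM's max memory.
        return (long) (jvmMaxBytes * maxMemoryFactor);
    }
}
```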
   
   A better solution would probably be to expose `getMaxMemory` on the Spark
side, but it is currently not available.
   
   ### Why are the changes needed?
   Currently, maxMemory is set to
`Runtime.getRuntime().maxMemory() * maxMemoryFactor`, where
`Runtime.getRuntime().maxMemory()` is the memory available to the entire
application JVM rather than to a single task. As a result, for some large
tasks, SortBasedPusher with
`celeborn.client.spark.push.sort.memory.useAdaptiveThreshold` enabled will
always OOM.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Unit tests
   We are also running this change in production.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
