[ 
https://issues.apache.org/jira/browse/SPARK-57242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-57242:
-----------------------------------
    Labels: correctness pull-request-available  (was: correctness)

> Avoid unbounded TaskMemoryManager page allocation retries after allocator OOM
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-57242
>                 URL: https://issues.apache.org/jira/browse/SPARK-57242
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 4.3.0
>            Reporter: Chao Sun
>            Priority: Major
>              Labels: correctness, pull-request-available
>
> TaskMemoryManager.allocatePage() first acquires execution memory from 
> MemoryManager and then asks the Tungsten allocator to create the physical 
> page. If the physical allocator throws OutOfMemoryError, Spark currently 
> retains the grant and recursively calls allocatePage() again.
> Because each recursive call requests another execution-memory grant, repeated 
> physical allocation failures can accumulate acquired-but-unused grants and 
> keep retrying indefinitely. An affected task may stall for hours, overflow 
> the stack through recursion, or never finish, when it should either recover 
> or fail promptly.
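
The retry shape described above can be illustrated with a toy model. This is a minimal sketch, not Spark's code: ToyMemoryManager, allocate_page_recursive, always_oom, and PAGE are hypothetical names, and the bookkeeping pool is deliberately tiny so that the run terminates.

```python
PAGE = 1024  # hypothetical page size in bytes


class ToyMemoryManager:
    """Hypothetical stand-in for the execution-memory bookkeeping pool."""

    def __init__(self, pool_bytes):
        self.pool_bytes = pool_bytes
        self.granted = 0

    def acquire(self, size):
        # All-or-nothing grant, to keep the model small.
        if self.granted + size > self.pool_bytes:
            return 0
        self.granted += size
        return size


def always_oom(size):
    """Physical allocator that rejects every request."""
    raise MemoryError("physical allocation failed")


def allocate_page_recursive(mm, physical_alloc, size, failed_attempts=0):
    """Keep the grant on physical failure, then recurse and acquire again."""
    if mm.acquire(size) == 0:
        return None, failed_attempts  # bookkeeping pool is used up
    try:
        return physical_alloc(size), failed_attempts
    except MemoryError:
        # The grant acquired above is retained, and the recursive call
        # requests another one.
        return allocate_page_recursive(mm, physical_alloc, size, failed_attempts + 1)


mm = ToyMemoryManager(pool_bytes=10 * PAGE)
page, failed_attempts = allocate_page_recursive(mm, always_oom, PAGE)
print(page, failed_attempts, mm.granted)
```

In this model every failed physical attempt leaves one more page-sized grant held with no page behind it, and the recursion stops only because the toy pool runs out. The real MemoryManager can spill or wait at that point, which is where the long stalls come from.
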
> The problem is still present on current master. SPARK-54354 documented that the 
> recursive retry can cause multi-hour hangs while constructing large broadcast 
> hashed relations, but its fix only bounded temporary hashed-relation memory 
> managers. SPARK-54818 improved diagnostics for the allocator failure without 
> changing the retry behavior.
> The desired behavior is:
> * Keep the existing execution-memory grant after a physical allocator failure.
> * Ask task-managed memory consumers to spill without requesting another 
> fair-share grant.
> * Retry the physical allocation only while spilling makes measurable progress.
> * If no consumer can release memory, return allocation failure promptly so 
> the caller raises SparkOutOfMemoryError.
> * Prevent nested allocations from spill callbacks from recursively entering 
> allocator recovery.
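
The desired behavior listed above can be sketched as a bounded loop. This is a toy model under stated assumptions, not the proposed patch: ToyPhysicalAllocator, ToyConsumer, ToyTaskMemory, and the in_recovery flag are hypothetical, and "measurable progress" is modeled as "spilling freed more than zero bytes".

```python
class ToyPhysicalAllocator:
    """Hypothetical physical allocator with a fixed number of free bytes."""

    def __init__(self, free_bytes):
        self.free_bytes = free_bytes

    def allocate(self, size):
        if size > self.free_bytes:
            raise MemoryError("physical allocation failed")
        self.free_bytes -= size
        return bytearray(size)

    def release(self, size):
        self.free_bytes += size


class ToyConsumer:
    """Hypothetical memory consumer that can spill a fixed amount."""

    def __init__(self, spillable_bytes):
        self.spillable_bytes = spillable_bytes

    def spill(self, wanted):
        freed = min(wanted, self.spillable_bytes)
        self.spillable_bytes -= freed
        return freed


class ToyTaskMemory:
    def __init__(self, allocator, consumers):
        self.allocator = allocator
        self.consumers = consumers
        self.grants = 0           # number of bookkeeping grants requested
        self.in_recovery = False  # re-entrancy guard for spill callbacks

    def allocate_page(self, size):
        self.grants += 1  # one grant per call, kept across physical retries
        while True:
            try:
                return self.allocator.allocate(size)
            except MemoryError:
                # A nested allocation made from a spill callback fails fast,
                # and so does a retry when spilling freed nothing.
                if self.in_recovery or self._spill_consumers(size) == 0:
                    return None

    def _spill_consumers(self, size):
        self.in_recovery = True
        try:
            freed = 0
            for consumer in self.consumers:
                got = consumer.spill(size - freed)
                self.allocator.release(got)
                freed += got
                if freed >= size:
                    break
            return freed
        finally:
            self.in_recovery = False


tm = ToyTaskMemory(ToyPhysicalAllocator(0), [ToyConsumer(32), ToyConsumer(64)])
first = tm.allocate_page(64)   # succeeds after both consumers spill
second = tm.allocate_page(64)  # only 32 bytes can still be freed
print(first is not None, second, tm.grants)
```

The loop terminates because each pass either succeeds, frees a positive amount from a finite spillable total, or returns None. A caller that receives None would then raise the out-of-memory error, as the description asks.
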
> ShuffleExternalSorter also needs to restore its pointer array lazily and 
> recheck pointer-array capacity after data-page allocation, because allocator 
> recovery may spill and reset the pointer array while record insertion is in 
> progress. UnsafeExternalSorter already follows this recheck pattern.
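
The recheck pattern mentioned above can be sketched with a toy sorter. This is an illustration only, with hypothetical names (ToySorter, _ensure_pointer_capacity, the allocate_page callback); it is not the code of ShuffleExternalSorter or UnsafeExternalSorter.

```python
class ToySorter:
    """Toy sorter whose pointer array may be reset while a record is inserted."""

    def __init__(self, allocate_page, initial_capacity=4):
        self.allocate_page = allocate_page  # may call self.spill() as a side effect
        self.initial_capacity = initial_capacity
        self.pointers = [None] * initial_capacity
        self.count = 0
        self.spilled_runs = []  # record counts of the runs written out

    def spill(self):
        self.spilled_runs.append(self.count)
        self.pointers = None  # reset now, restored lazily on the next insert
        self.count = 0

    def _ensure_pointer_capacity(self):
        if self.pointers is None:
            # Lazy restore after a spill.
            self.pointers = [None] * self.initial_capacity
        elif self.count == len(self.pointers):
            self.pointers.extend([None] * len(self.pointers))

    def insert(self, record):
        self._ensure_pointer_capacity()
        page = self.allocate_page(len(record))  # recovery may spill and reset here
        self._ensure_pointer_capacity()         # recheck after the page allocation
        page[:] = record
        self.pointers[self.count] = page
        self.count += 1


calls = {"n": 0}


def allocate_page(size):
    # Hypothetical allocator whose third call triggers a spill, standing in
    # for allocator recovery that spills the sorter itself.
    calls["n"] += 1
    if calls["n"] == 3:
        sorter.spill()
    return bytearray(size)


sorter = ToySorter(allocate_page)
for record in [b"a", b"b", b"c", b"d"]:
    sorter.insert(record)
print(sorter.spilled_runs, sorter.count)
```

Without the second capacity check, the third insert in this model would index into a pointer array that the spill had just reset.
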
> This differs from SPARK-31720, which concerns fair-share execution-memory 
> grants becoming unavailable when new tasks arrive. Here, MemoryManager 
> already granted execution memory, but the physical heap or off-heap allocator 
> rejected the page.
> Related issues: SPARK-54354, SPARK-54818, SPARK-32901, SPARK-25081.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
