Chao Sun created SPARK-57242:
--------------------------------

             Summary: Avoid unbounded TaskMemoryManager page allocation retries 
after allocator OOM
                 Key: SPARK-57242
                 URL: https://issues.apache.org/jira/browse/SPARK-57242
             Project: Spark
          Issue Type: Bug
          Components: Shuffle, Spark Core
    Affects Versions: 4.3.0
            Reporter: Chao Sun


TaskMemoryManager.allocatePage() first acquires execution memory from 
MemoryManager and then asks the Tungsten allocator to create the physical page. 
If the physical allocator throws OutOfMemoryError, Spark currently retains the 
grant and recursively calls allocatePage() again.

Because each recursive call requests another execution-memory grant, repeated 
physical allocation failures accumulate acquired-but-unused grants, and nothing 
bounds the number of retries. A task can then hang for a long time or 
indefinitely, or recurse until the stack is exhausted, instead of either 
recovering or failing promptly.
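
As a rough illustration of the retry shape described above, the following toy 
model counts how much execution memory ends up granted when the physical 
allocator keeps failing. It is not Spark code: the class, its methods and the 
maxDepth parameter are hypothetical, and maxDepth exists only so that the 
demonstration terminates.

```java
import java.util.function.LongPredicate;

// Toy model only. None of these names are Spark APIs.
final class UnboundedRetrySketch {
    /** Bytes of execution memory granted so far, whether or not a page was created. */
    long granted = 0;

    /** Stand-in for the execution-memory grant; this model always grants the request. */
    long acquireExecutionMemory(long bytes) {
        granted += bytes;
        return bytes;
    }

    /**
     * Acquire a grant, try the physical allocation, and on failure keep the
     * grant and recurse. physicalAllocator returns false to model an
     * allocator OutOfMemoryError. maxDepth is an artificial bound added for
     * this sketch; the behavior described in the issue has no such bound.
     */
    boolean allocatePage(long bytes, LongPredicate physicalAllocator, int maxDepth) {
        acquireExecutionMemory(bytes);
        if (physicalAllocator.test(bytes)) {
            return true;
        }
        if (maxDepth == 0) {
            return false;
        }
        // The grant from this call is never released before retrying.
        return allocatePage(bytes, physicalAllocator, maxDepth - 1);
    }
}
```

In this model every failed attempt adds one more page-sized grant that backs 
no page, so the granted total grows linearly with the number of retries.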

The problem is still present on current master. SPARK-54354 documented that the 
recursive retry can cause multi-hour hangs while constructing large broadcast 
hashed relations, but its fix only bounded temporary hashed-relation memory 
managers. SPARK-54818 improved diagnostics for the allocator failure without 
changing the retry behavior.

The desired behavior is:

* Keep the existing execution-memory grant after a physical allocator failure.
* Ask task-managed memory consumers to spill without requesting another 
fair-share grant.
* Retry the physical allocation only while spilling makes measurable progress.
* If no consumer can release memory, return allocation failure promptly so the 
caller raises SparkOutOfMemoryError.
* Prevent nested allocations from spill callbacks from recursively entering 
allocator recovery.
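
The points above can be sketched as a bounded recovery loop. This is a minimal 
sketch under stated assumptions, not the proposed patch: the Spillable 
interface, the recovering flag and the allocatePhysical method are hypothetical 
names, and the allocator is modeled as a predicate that returns false where the 
real one would throw OutOfMemoryError. The method assumes the execution-memory 
grant is already held and never requests another one.

```java
import java.util.List;
import java.util.function.LongPredicate;

// Illustrative sketch only. None of these types are Spark's.
final class BoundedRecoverySketch {
    /** Hypothetical stand-in for a task-managed memory consumer. */
    interface Spillable {
        /** Releases up to the requested bytes and returns the bytes actually freed. */
        long spill(long bytes);
    }

    private final List<Spillable> consumers;
    /** True while recovery is running, so spill callbacks cannot re-enter it. */
    private boolean recovering = false;

    BoundedRecoverySketch(List<Spillable> consumers) {
        this.consumers = consumers;
    }

    /**
     * Returns true if the physical page was allocated, false if the caller
     * should report an out-of-memory failure.
     */
    boolean allocatePhysical(long bytes, LongPredicate allocator) {
        if (allocator.test(bytes)) {
            return true;
        }
        if (recovering) {
            return false; // nested allocation from a spill callback: fail fast
        }
        recovering = true;
        try {
            while (true) {
                long freed = 0;
                for (Spillable c : consumers) {
                    freed += c.spill(bytes - freed);
                    if (freed >= bytes) {
                        break;
                    }
                }
                if (freed <= 0) {
                    return false; // no consumer released memory: stop retrying
                }
                if (allocator.test(bytes)) {
                    return true;
                }
            }
        } finally {
            recovering = false;
        }
    }
}
```

The loop ends as soon as one full pass over the consumers frees nothing, so 
the number of retries is bounded by how much memory the consumers can release.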

ShuffleExternalSorter also needs to restore its pointer array lazily and 
recheck pointer-array capacity after data-page allocation, because allocator 
recovery may spill and reset the pointer array while record insertion is in 
progress. UnsafeExternalSorter already follows this recheck pattern.
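
The recheck pattern mentioned above might look roughly like the following toy 
sorter. It is an assumption-laden sketch rather than ShuffleExternalSorter or 
UnsafeExternalSorter code: the class, its fields and the allocateDataPage 
callback are hypothetical, and the callback stands in for a data-page 
allocation that may trigger a spill.

```java
import java.util.Arrays;

// Toy model only. It does not mirror the real sorter classes.
final class RecheckSketch {
    private static final int INITIAL_CAPACITY = 4;

    private long[] pointers = new long[INITIAL_CAPACITY];
    private int used = 0;
    int spills = 0;

    /** Models a spill caused by allocator recovery: the pointer array is released. */
    void spill() {
        pointers = null;
        used = 0;
        spills++;
    }

    /** Restores the pointer array lazily, or grows it when it is full. */
    private void ensurePointerCapacity() {
        if (pointers == null) {
            pointers = new long[INITIAL_CAPACITY];
        } else if (used == pointers.length) {
            pointers = Arrays.copyOf(pointers, used * 2);
        }
    }

    /** Inserts one record pointer; allocateDataPage may call spill(). */
    void insert(long recordAddress, Runnable allocateDataPage) {
        ensurePointerCapacity();
        allocateDataPage.run();
        // Recheck: the allocation above may have spilled and reset the array.
        ensurePointerCapacity();
        pointers[used++] = recordAddress;
    }

    int size() {
        return used;
    }
}
```

Without the second capacity check, an insert whose page allocation spills 
would write into a pointer array that has just been released.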

This differs from SPARK-31720, which concerns fair-share execution-memory 
grants becoming unavailable when new tasks arrive. Here, MemoryManager already 
granted execution memory, but the physical heap or off-heap allocator rejected 
the page.

Related issues: SPARK-54354, SPARK-54818, SPARK-32901, SPARK-25081.



