Chao Sun created SPARK-57242:
--------------------------------
Summary: Avoid unbounded TaskMemoryManager page allocation retries
after allocator OOM
Key: SPARK-57242
URL: https://issues.apache.org/jira/browse/SPARK-57242
Project: Spark
Issue Type: Bug
Components: Shuffle, Spark Core
Affects Versions: 4.3.0
Reporter: Chao Sun
TaskMemoryManager.allocatePage() first acquires execution memory from
MemoryManager and then asks the Tungsten allocator to create the physical page.
If the physical allocator throws OutOfMemoryError, Spark currently keeps the
grant and calls allocatePage() recursively.
Because each recursive call requests another execution-memory grant, repeated
physical allocation failures accumulate acquired-but-unused grants, and nothing
bounds the number of retries. A task may hang for a long time, recurse until
the stack is exhausted, or block indefinitely, instead of either recovering or
failing promptly.
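The accumulation described above can be illustrated with a toy model. This is
only a sketch and not Spark's code: the class UnboundedRetrySketch, its fields,
and the retriesLeft parameter are hypothetical, and retriesLeft exists solely so
that the demonstration terminates.

```java
import java.util.function.LongPredicate;

// Toy model only: names are hypothetical and this is not Spark's code.
final class UnboundedRetrySketch {
  private final LongPredicate physicalAllocator;  // returns false to model an OutOfMemoryError
  long grantedBytes = 0;  // grants acquired, whether or not a page was ever created
  int calls = 0;          // number of allocatePage() invocations

  UnboundedRetrySketch(LongPredicate physicalAllocator) {
    this.physicalAllocator = physicalAllocator;
  }

  /** retriesLeft is an assumption added so the demo stops; the described flow has no bound. */
  boolean allocatePage(long size, int retriesLeft) {
    calls++;
    grantedBytes += size;  // every call requests another grant
    if (physicalAllocator.test(size)) {
      return true;
    }
    if (retriesLeft == 0) {
      return false;
    }
    return allocatePage(size, retriesLeft - 1);  // the earlier grant is kept, then recurse
  }
}
```

With an allocator that always fails, ten calls for a 1024-byte page leave the
model holding 10240 bytes of grants and no page.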
The problem is still present on current master. SPARK-54354 documented that the
recursive retry can cause multi-hour hangs while constructing large broadcast
hashed relations, but its fix only bounded temporary hashed-relation memory
managers. SPARK-54818 improved diagnostics for the allocator failure without
changing the retry behavior.
The desired behavior is:
* Keep the existing execution-memory grant after a physical allocator failure.
* Ask task-managed memory consumers to spill without requesting another
fair-share grant.
* Retry the physical allocation only while spilling makes measurable progress.
* If no consumer can release memory, return allocation failure promptly so the
caller raises SparkOutOfMemoryError.
* Prevent nested allocations made inside spill callbacks from re-entering
allocator recovery.
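The desired behavior listed above could be sketched roughly as follows. This is
a toy model under stated assumptions, not the proposed patch: the class
BoundedPageRecoverySketch, the Spillable interface, and the simulated
physicalFree counter are all hypothetical, and a false return from the
simulated allocator stands in for OutOfMemoryError. The issue does not say what
happens to the grant after a final failure, so the sketch simply keeps holding
it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every name here is hypothetical and none of it is Spark's code.
final class BoundedPageRecoverySketch {
  /** Stand-in for a task-managed memory consumer; returns the bytes it released. */
  interface Spillable {
    long spill(long bytesNeeded);
  }

  private final List<Spillable> consumers = new ArrayList<>();
  private long physicalFree;           // simulated physical memory still available
  private boolean inRecovery = false;  // true while spill callbacks are running
  long grantedBytes = 0;               // execution-memory grants currently held
  int physicalAttempts = 0;            // how often the physical allocator was asked

  BoundedPageRecoverySketch(long physicalFree) {
    this.physicalFree = physicalFree;
  }

  void register(Spillable consumer) {
    consumers.add(consumer);
  }

  /** Consumers call this from spill() to hand simulated physical memory back. */
  void freePhysical(long bytes) {
    physicalFree += bytes;
  }

  /** Models the physical allocator; false stands for an OutOfMemoryError. */
  private boolean tryPhysicalAllocate(long size) {
    physicalAttempts++;
    if (size > physicalFree) {
      return false;
    }
    physicalFree -= size;
    return true;
  }

  /** Returns true if a page was allocated, false if the caller should raise an OOM error. */
  boolean allocatePage(long size) {
    grantedBytes += size;  // acquired once per request and kept across retries
    while (true) {
      if (tryPhysicalAllocate(size)) {
        return true;
      }
      if (inRecovery) {
        return false;  // nested allocation from a spill callback: fail fast
      }
      long released = 0;
      inRecovery = true;
      try {
        for (Spillable consumer : consumers) {
          released += consumer.spill(size - released);  // no new grant is requested
          if (released >= size) {
            break;
          }
        }
      } finally {
        inRecovery = false;
      }
      if (released == 0) {
        return false;  // spilling made no progress, so stop retrying
      }
    }
  }
}
```

In this model, with no free physical memory and one consumer able to release
64 bytes, allocatePage(32) succeeds on the second physical attempt while
holding a single 32-byte grant. With no consumers registered it returns false
after one attempt.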
ShuffleExternalSorter also needs to restore its pointer array lazily and
recheck pointer-array capacity after data-page allocation, because allocator
recovery may spill and reset the pointer array while record insertion is in
progress. UnsafeExternalSorter already follows this recheck pattern.
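The recheck pattern mentioned above might look like the following toy sorter.
It is a sketch under assumptions rather than ShuffleExternalSorter's or
UnsafeExternalSorter's actual code: the class, its fields, the initial capacity
of 4, and the spillOnNextPageAllocation hook are hypothetical and exist only to
simulate allocator recovery spilling in the middle of an insert.

```java
import java.util.Arrays;

// Toy model only: names and sizes are hypothetical, not Spark's sorter code.
final class RecheckAfterPageAllocationSketch {
  private static final int INITIAL_CAPACITY = 4;
  private long[] pointers = new long[INITIAL_CAPACITY];
  private int used = 0;
  int spills = 0;
  boolean spillOnNextPageAllocation = false;  // hook that simulates allocator recovery

  /** Spilling writes records out (not modeled) and drops the pointer array. */
  private void spill() {
    spills++;
    pointers = null;
    used = 0;
  }

  /** Stand-in for data-page allocation, which may trigger a spill of this sorter. */
  private void allocateDataPage() {
    if (spillOnNextPageAllocation) {
      spillOnNextPageAllocation = false;
      spill();
    }
  }

  /** Restores the pointer array lazily after a spill, or grows it when it is full. */
  private void ensurePointerCapacity() {
    if (pointers == null) {
      pointers = new long[INITIAL_CAPACITY];
    } else if (used == pointers.length) {
      pointers = Arrays.copyOf(pointers, used * 2);
    }
  }

  void insertRecord(long recordAddress) {
    ensurePointerCapacity();  // first check, before the data page is requested
    allocateDataPage();       // may spill and reset the pointer array
    ensurePointerCapacity();  // recheck, because the array may have been reset
    pointers[used++] = recordAddress;
  }

  int size() {
    return used;
  }
}
```

Without the second ensurePointerCapacity() call, an insert that triggers a
spill in this model would write through a null pointer array.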
This differs from SPARK-31720, which concerns fair-share execution-memory
grants becoming unavailable when new tasks arrive. Here, MemoryManager already
granted execution memory, but the physical heap or off-heap allocator rejected
the page.
Related issues: SPARK-54354, SPARK-54818, SPARK-32901, SPARK-25081.