[ https://issues.apache.org/jira/browse/SPARK-57242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-57242:
-----------------------------------
Labels: correctness pull-request-available (was: correctness)
> Avoid unbounded TaskMemoryManager page allocation retries after allocator OOM
> -----------------------------------------------------------------------------
>
> Key: SPARK-57242
> URL: https://issues.apache.org/jira/browse/SPARK-57242
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, Spark Core
> Affects Versions: 4.3.0
> Reporter: Chao Sun
> Priority: Major
> Labels: correctness, pull-request-available
>
> TaskMemoryManager.allocatePage() first acquires execution memory from
> MemoryManager and then asks the Tungsten allocator to create the physical
> page. If the physical allocator throws OutOfMemoryError, Spark currently
> keeps the grant and calls allocatePage() recursively.
>
> Because each recursive call requests another execution-memory grant, repeated
> physical allocation failures can accumulate acquired-but-unused grants and
> keep retrying indefinitely. An affected task may hang for a long time,
> possibly indefinitely, or recurse until the stack is exhausted, instead of
> either recovering or failing promptly.
>
> The problem is still present on current master. SPARK-54354 documented that the
> recursive retry can cause multi-hour hangs while constructing large broadcast
> hashed relations, but its fix only bounded temporary hashed-relation memory
> managers. SPARK-54818 improved diagnostics for the allocator failure without
> changing the retry behavior.
>
> The desired behavior is:
> * Keep the existing execution-memory grant after a physical allocator failure.
> * Ask task-managed memory consumers to spill without requesting another
> fair-share grant.
> * Retry the physical allocation only while spilling makes measurable progress.
> * If no consumer can release memory, report the allocation failure promptly
> so that the caller raises SparkOutOfMemoryError.
> * Prevent nested allocations made inside spill callbacks from re-entering
> allocator recovery recursively.
>
> ShuffleExternalSorter also needs to restore its pointer array lazily and
> recheck pointer-array capacity after data-page allocation, because allocator
> recovery may spill and reset the pointer array while record insertion is in
> progress. UnsafeExternalSorter already follows this recheck pattern.
>
> This differs from SPARK-31720, which concerns fair-share execution-memory
> grants becoming unavailable when new tasks arrive. Here, MemoryManager
> already granted execution memory, but the physical heap or off-heap allocator
> rejected the page.
>
> Related issues: SPARK-54354, SPARK-54818, SPARK-32901, SPARK-25081.
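
The desired behavior listed in the description can be illustrated with a small
toy model. This is only a sketch under stated assumptions, not Spark's code and
not the proposed patch: ToyAllocator, ToyConsumer, ToyTaskMemoryManager and all
of their methods are hypothetical names, memory is modelled as plain integers,
and releasing the grant on final failure is an assumption of the sketch. It
shows one grant per allocate_page() call, a retry loop that continues only
while spilling releases something, and a flag that stops nested allocations
from re-entering recovery.

```python
class ToyAllocator:
    """Hypothetical stand-in for the physical (heap or off-heap) allocator."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def allocate(self, size):
        if self.used + size > self.capacity:
            raise MemoryError("physical allocation failed")
        self.used += size
        return size  # the "page" is represented by its size only

    def free(self, size):
        self.used -= size


class ToyConsumer:
    """Hypothetical memory consumer that can spill to release physical memory."""

    def __init__(self, allocator, held):
        self.allocator = allocator
        self.held = allocator.allocate(held)

    def spill(self, needed):
        released = min(self.held, needed)
        self.held -= released
        self.allocator.free(released)
        return released


class ToyTaskMemoryManager:
    def __init__(self, allocator, consumers):
        self.allocator = allocator
        self.consumers = consumers
        self.granted = 0
        self.grant_requests = 0
        self.in_recovery = False  # guards against re-entering recovery

    def acquire_execution_memory(self, size):
        # Toy grant: always succeeds; only counts how often it is requested.
        self.grant_requests += 1
        self.granted += size
        return size

    def _spill_for_recovery(self, needed):
        # Ask consumers to spill WITHOUT requesting another grant.
        self.in_recovery = True
        try:
            released = 0
            for consumer in self.consumers:
                released += consumer.spill(needed - released)
                if released >= needed:
                    break
            return released > 0  # True only if spilling made progress
        finally:
            self.in_recovery = False

    def allocate_page(self, size):
        got = self.acquire_execution_memory(size)  # exactly one grant per call
        while True:
            try:
                return self.allocator.allocate(got)
            except MemoryError:
                # Nested call from a spill callback, or no progress: fail fast.
                if self.in_recovery or not self._spill_for_recovery(got):
                    self.granted -= got  # assumption: release the unused grant
                    return None  # caller would raise its own OOM error


allocator = ToyAllocator(capacity=100)
sorter = ToyConsumer(allocator, held=80)
tmm = ToyTaskMemoryManager(allocator, [sorter])
page = tmm.allocate_page(50)
print(page, tmm.grant_requests, sorter.held)  # prints: 50 1 30
```

In this model the loop terminates because every retry requires a strictly
positive amount of spilled memory and consumers hold a finite amount. Spark's
real accounting, where spilling also returns execution memory to
MemoryManager, is more involved and is not modelled here.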
--
This message was sent by Atlassian Jira
(v8.20.10#820010)