Julien Tournay created FLINK-35073: -------------------------------------- Summary: Deadlock in LocalBufferPool when NetworkBufferPool.internalRecycleMemorySegments is called concurrently Key: FLINK-35073 URL: https://issues.apache.org/jira/browse/FLINK-35073 Project: Flink Issue Type: Bug Reporter: Julien Tournay Attachments: deadlock_threaddump_extract.json
The reported issue is easy to reproduce in batch mode using hybrid shuffle and a somewhat large total number of slots in the cluster. Parallelism does not seem to matter much. Note: Joined a partial threaddump to illustrate the issue. When `NetworkBufferPool.internalRecycleMemorySegments` is called concurrently. The following chain of call may happen: {code:java} NetworkBufferPool.internalRecycleMemorySegments -> LocalBufferPool.onGlobalPoolAvailable -> LocalBufferPool.checkAndUpdateAvailability -> LocalBufferPool.requestMemorySegmentFromGlobalWhenAvailable{code} `requestMemorySegmentFromGlobalWhenAvailable can cause `onGlobalPoolAvailable` to be invoked on another `LocalBufferPool` instance which triggers the same chain of actions. The issue arises when 2 threads go through this specific code path at the same time. Each thread will `requestMemorySegmentFromGlobalWhenAvailable` and in the process try to acquire a new locks on a series of LocalBuffer. As an example, assume there are 6 `LocalBufferPool` instance A, B, C, D, E and F: Thread 1 locks A, B, C and tries to lock D Thread 2 locks D, E, F and tried to lock A ==> Both threads 1 and 2 are blocked. The example threadump captured this issue: First thread locked java.util.ArrayDeque@41d6a3bb and is blocked on java.util.ArrayDeque@e2b5e34 Second thread locked java.util.ArrayDeque@e2b5e34 and is blocked on java.util.ArrayDeque@41d6a3bb Note that I'm not familiar enough with Flink internals to know what the fix should be but I'm happy to submit a PR if someone tells me what the correct behaviour should be. -- This message was sent by Atlassian Jira (v8.20.10#820010)