Julien Tournay created FLINK-35073:
--------------------------------------

             Summary: Deadlock in LocalBufferPool when 
NetworkBufferPool.internalRecycleMemorySegments is called concurrently
                 Key: FLINK-35073
                 URL: https://issues.apache.org/jira/browse/FLINK-35073
             Project: Flink
          Issue Type: Bug
            Reporter: Julien Tournay
         Attachments: deadlock_threaddump_extract.json

The reported issue is easy to reproduce in batch mode using hybrid shuffle and 
a somewhat large total number of slots in the cluster. Parallelism does not 
seem to matter much.

Note: I attached a partial thread dump to illustrate the issue.

When `NetworkBufferPool.internalRecycleMemorySegments` is called concurrently, 
the following chain of calls may happen:
{code:java}
NetworkBufferPool.internalRecycleMemorySegments -> 
LocalBufferPool.onGlobalPoolAvailable ->
LocalBufferPool.checkAndUpdateAvailability -> 
LocalBufferPool.requestMemorySegmentFromGlobalWhenAvailable{code}
`requestMemorySegmentFromGlobalWhenAvailable` can cause `onGlobalPoolAvailable` 
to be invoked on another `LocalBufferPool` instance, which triggers the same 
chain of actions.

The issue arises when 2 threads go through this specific code path at the same 
time.

Each thread will call `requestMemorySegmentFromGlobalWhenAvailable` and, in the 
process, try to acquire locks on a series of `LocalBufferPool` instances.

As an example, assume there are 6 `LocalBufferPool` instances A, B, C, D, E and 
F:
Thread 1 locks A, B, C and tries to lock D
Thread 2 locks D, E, F and tries to lock A
==> Both threads 1 and 2 are blocked.
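
To make the lock cycle above concrete, here is a minimal standalone sketch. It is not Flink code: `LockCycleSketch`, `poolA` and `poolD` are hypothetical names, and the `ReentrantLock` objects are only stand-ins for the per-pool monitors seen in the thread dump. It uses `tryLock()` instead of `synchronized` so that the program reports the cycle and terminates rather than actually hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class LockCycleSketch {

    /** Runs two threads that take the same two locks in opposite order
     *  and returns how many of them failed to get their second lock. */
    public static int runOpposingOrder() {
        ReentrantLock poolA = new ReentrantLock(); // stand-in for pool A's lock
        ReentrantLock poolD = new ReentrantLock(); // stand-in for pool D's lock
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        AtomicInteger blocked = new AtomicInteger();

        // Thread 1 goes A -> D, thread 2 goes D -> A.
        Thread t1 = new Thread(() -> lockInOrder(poolA, poolD, bothHoldFirst, bothTried, blocked));
        Thread t2 = new Thread(() -> lockInOrder(poolD, poolA, bothHoldFirst, bothTried, blocked));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return blocked.get();
    }

    private static void lockInOrder(ReentrantLock first, ReentrantLock second,
                                    CountDownLatch bothHoldFirst, CountDownLatch bothTried,
                                    AtomicInteger blocked) {
        first.lock();
        try {
            // Wait until the other thread also holds its first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // With synchronized blocks this is where the thread would block forever.
            if (second.tryLock()) {
                second.unlock();
            } else {
                blocked.incrementAndGet();
            }
            // Keep holding the first lock until the other thread has made its attempt.
            bothTried.countDown();
            bothTried.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads unable to take their second lock: " + runOpposingOrder());
    }
}
```

Both threads fail to take their second lock, which is the A/D cycle from the example. If both threads acquired the locks in one consistent global order, the cycle could not form; whether that is a suitable fix for `LocalBufferPool` is a question for someone who knows the internals.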

The example thread dump captured this issue:
First thread locked java.util.ArrayDeque@41d6a3bb and is blocked on 
java.util.ArrayDeque@e2b5e34
Second thread locked java.util.ArrayDeque@e2b5e34 and is blocked on 
java.util.ArrayDeque@41d6a3bb

 

Note that I'm not familiar enough with Flink internals to know what the fix 
should be but I'm happy to submit a PR if someone tells me what the correct 
behaviour should be.

 



