Weijie Guo created FLINK-29298:
----------------------------------

             Summary: LocalBufferPool request buffer from NetworkBufferPool 
hanging
                 Key: FLINK-29298
                 URL: https://issues.apache.org/jira/browse/FLINK-29298
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.16.0
            Reporter: Weijie Guo
         Attachments: image-2022-09-14-10-52-15-259.png, 
image-2022-09-14-10-58-45-987.png, image-2022-09-14-11-00-47-309.png

When buffer contention is fierce, a task can occasionally hang. The thread 
dump shows that the task thread is blocked forever in 
requestMemorySegmentBlocking. After inspecting the heap dump, I found that the 
NetworkBufferPool actually has many buffers, but the LocalBufferPool is still 
unavailable and has not obtained any buffer.

Looking at the code, I am sure that this is a thread race bug: when the task 
thread polls the last buffer from the LocalBufferPool and triggers the 
onGlobalPoolAvailable callback itself, the callback skips the notification 
(because the LocalBufferPool is still available at that moment). As a result, 
the LocalBufferPool eventually becomes unavailable without ever registering a 
callback with the NetworkBufferPool.
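
The race described above can be illustrated with a small, single-threaded toy 
model. This is only a sketch under stated assumptions, not Flink's actual code: 
ToyGlobalPool, ToyLocalPool, the interleaving hook, and the exact ordering of 
steps are all hypothetical, and the concurrent recycle is emulated by a hook 
that runs inside the assumed race window. It relies on the fact that 
CompletableFuture.thenRun executes the action on the calling thread when the 
future is already complete.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.CompletableFuture;

/** Toy, single-threaded model of the skipped notification; not Flink's code. */
class LostNotificationSketch {

    /** Hypothetical stand-in for the global pool. */
    static class ToyGlobalPool {
        int free = 0;
        final CompletableFuture<Void> availableFuture = new CompletableFuture<>();

        /** Hands out one buffer if any is free. */
        boolean tryRequest() {
            if (free == 0) {
                return false;
            }
            free--;
            return true;
        }

        /** Returns buffers and signals availability to registered callbacks. */
        void recycle(int n) {
            free += n;
            availableFuture.complete(null);
        }
    }

    /** Hypothetical stand-in for the local pool, starting with one buffer. */
    static class ToyLocalPool {
        final ToyGlobalPool global;
        final Deque<Integer> segments = new ArrayDeque<>();
        boolean available = true;
        boolean callbackPending = false;
        Runnable interleaving = () -> {}; // emulates another thread's action

        ToyLocalPool(ToyGlobalPool global) {
            this.global = global;
            segments.add(1);
        }

        void requestSegment() {
            segments.poll(); // the task thread takes the last local buffer
            if (segments.isEmpty()) {
                if (global.tryRequest()) {
                    segments.add(1);
                } else {
                    interleaving.run(); // assumed race window
                    callbackPending = true;
                    // Runs inline on this thread if the future is already complete.
                    global.availableFuture.thenRun(this::onGlobalPoolAvailable);
                }
            }
            if (segments.isEmpty()) {
                available = false; // flag is updated only after the callback ran
            }
        }

        void onGlobalPoolAvailable() {
            callbackPending = false;
            if (available) {
                return; // notification skipped: the pool still looks available
            }
            if (global.tryRequest()) {
                segments.add(1);
                available = true;
            }
        }
    }

    /** Runs one request, recycling either inside or after the race window. */
    static ToyLocalPool simulate(boolean recycleInsideWindow) {
        ToyGlobalPool global = new ToyGlobalPool();
        ToyLocalPool local = new ToyLocalPool(global);
        if (recycleInsideWindow) {
            local.interleaving = () -> global.recycle(2);
        }
        local.requestSegment();
        if (!recycleInsideWindow) {
            global.recycle(2);
        }
        return local;
    }

    public static void main(String[] args) {
        ToyLocalPool ok = simulate(false);
        ToyLocalPool stuck = simulate(true);
        System.out.println("normal: available=" + ok.available
                + " globalFree=" + ok.global.free);
        System.out.println("raced:  available=" + stuck.available
                + " globalFree=" + stuck.global.free
                + " callbackPending=" + stuck.callbackPending);
    }
}
```

In the normal run the callback fires after the local pool has been marked 
unavailable, so it fetches a buffer. In the raced run the callback fires inline 
while the pool still looks available, so it is skipped: the toy local pool ends 
up unavailable, the toy global pool has free buffers, and no callback is 
pending, which matches the hang described in this report.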

The conditions for triggering the problem are relatively strict, but I have 
found a stable way to reproduce it. I will try to fix and verify this problem.

!image-2022-09-14-10-52-15-259.png|width=1021,height=219!

!image-2022-09-14-10-58-45-987.png|width=997,height=315!

!image-2022-09-14-11-00-47-309.png|width=453,height=121!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)