Vincent Woo created FLINK-34636:
-----------------------------------

             Summary: Requesting exclusive buffers timeout causes repeated 
restarts and cannot be automatically recovered
                 Key: FLINK-34636
                 URL: https://issues.apache.org/jira/browse/FLINK-34636
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
            Reporter: Vincent Woo
         Attachments: image-20240308100308649.png, image-20240308101008765.png, 
image-20240308101407396.png, image-20240308101934756.png

Logs and metrics show that a subtask deployed on the same TM consistently 
failed with a timeout while requesting exclusive buffers. During the restart 
process, the *Network* metric on that TM remained unchanged (heap memory usage 
did change). I suspect that the network buffer memory was not properly 
released during the restart, so the newly deployed task could not obtain 
network buffers. The problem persisted across repeated restarts, and the 
application failed to recover automatically.

(I'm not sure if there are other reasons for this issue)

Attached below are screenshots of the exception stack and relevant metrics:
{code:java}
2024-03-08 09:58:18,738 WARN  org.apache.flink.runtime.taskmanager.Task         
           [] - GroupWindowAggregate switched from DEPLOYING to FAILED with 
failure cause: java.io.IOException: Timeout triggered when requesting exclusive 
buffers: The total number of network buffers is currently set to 32768 of 32768 
bytes each. You can increase this number by setting the configuration keys 
'taskmanager.memory.network.fraction', 'taskmanager.memory.network.min', and 
'taskmanager.memory.network.max',  or you may increase the timeout which is 
30000ms by setting the key 
'taskmanager.network.memory.exclusive-buffers-request-timeout-ms'.
    at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.java:246)
    at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestPooledMemorySegmentsBlocking(NetworkBufferPool.java:169)
    at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:427)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:257)
    at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:84)
    at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:952)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:655)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
    at java.lang.Thread.run(Thread.java:748)
{code}
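
For scale, the pool size quoted in the exception message (32768 buffers of 32768 bytes each) can be worked out as below. This is only an arithmetic sketch; the class and method names are made up for illustration and are not part of Flink.

```java
// Illustrative helper (not a Flink class): computes the total size of the
// network buffer pool from the two numbers quoted in the exception message.
class NetworkPoolSize {

    // Total pool size in bytes = number of buffers * bytes per buffer.
    static long totalBytes(long bufferCount, long bytesPerBuffer) {
        return bufferCount * bytesPerBuffer;
    }

    // Converts a byte count to whole mebibytes (1 MiB = 2^20 bytes).
    static long toMiB(long bytes) {
        return bytes >> 20;
    }

    public static void main(String[] args) {
        long bufferCount = 32768L;    // "total number of network buffers" in the log
        long bytesPerBuffer = 32768L; // "32768 bytes each" in the log
        long total = totalBytes(bufferCount, bytesPerBuffer);
        System.out.println(total + " bytes = " + toMiB(total) + " MiB");
    }
}
```

So the TM's network pool in this report is 1 GiB in total, which appears to be what the Network metric shows as fully used.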
!image-20240308101407396.png!

Network metric: only this TM stays at 100%, without any variation.

!image-20240308100308649.png|width=2540,height=989!

Tasks deployed to this TM never reach the RUNNING state, and their status 
changes slowly.

!image-20240308101008765.png!

Although the root exception reported by the application is 
PartitionNotFoundException, the logs show that the actual underlying cause is 
IOException: Timeout triggered when requesting exclusive buffers.

!image-20240308101934756.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
