[
https://issues.apache.org/jira/browse/FLINK-34636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891463#comment-17891463
]
Vincent Woo edited comment on FLINK-34636 at 10/21/24 9:32 AM:
---------------------------------------------------------------
This issue occurs in version 1.13.2 and may be related to this network buffer
leak: [Network buffer leak when ResultPartition is
released (failover)|https://issues.apache.org/jira/browse/FLINK-23724]. I'll
first verify whether the fix for that issue also resolves this one.
> Requesting exclusive buffers timeout causes repeated restarts and cannot be
> automatically recovered
> ---------------------------------------------------------------------------------------------------
>
> Key: FLINK-34636
> URL: https://issues.apache.org/jira/browse/FLINK-34636
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.13.2
> Reporter: Vincent Woo
> Priority: Major
> Attachments: image-20240308100308649.png,
> image-20240308101008765.png, image-20240308101407396.png,
> image-20240308101934756.png
>
>
> Logs and metrics show that subtasks deployed on one particular TM
> consistently fail with a timeout while requesting exclusive buffers. During
> the restart process, the *Network* metric on that TM remained unchanged
> (heap memory usage did change). I suspect that the network buffer memory was
> not properly released during the restart, so the newly deployed task could
> not obtain network buffers. The problem persisted across repeated restarts,
> and the application failed to recover automatically.
> (I'm not sure whether there are other causes for this issue.)
> Attached below are screenshots of the exception stack and relevant metrics:
> {code:java}
> 2024-03-08 09:58:18,738 WARN org.apache.flink.runtime.taskmanager.Task
> [] - GroupWindowAggregate switched from DEPLOYING to FAILED with
> failure cause: java.io.IOException: Timeout triggered when requesting
> exclusive buffers: The total number of network buffers is currently set to
> 32768 of 32768 bytes each. You can increase this number by setting the
> configuration keys 'taskmanager.memory.network.fraction',
> 'taskmanager.memory.network.min', and 'taskmanager.memory.network.max', or
> you may increase the timeout which is 30000ms by setting the key
> 'taskmanager.network.memory.exclusive-buffers-request-timeout-ms'.
>     at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.java:246)
>     at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestPooledMemorySegmentsBlocking(NetworkBufferPool.java:169)
>     at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:427)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:257)
>     at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:84)
>     at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:952)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:655)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> !image-20240308101407396.png|width=866,height=171!
> Network metric: only this TM stays at 100%, with no variation.
> !image-20240308100308649.png|width=868,height=338!
> Tasks deployed to this TM never reach RUNNING, and their status changes
> slowly.
> !image-20240308101008765.png|width=869,height=118!
> Although the root exception reported by the application is
> PartitionNotFoundException, the logs show that the underlying cause is
> IOException: Timeout triggered when requesting exclusive buffers.
> !image-20240308101934756.png|width=869,height=394!
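
The failure mode suspected in the description (buffers that are never returned after a failover, so every later request times out while the usage metric stays at 100%) can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not Flink's NetworkBufferPool, the class ToyBufferPool and its methods are hypothetical names invented for illustration, and the pool size and timeout are arbitrary small values.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy fixed-size buffer pool (hypothetical; NOT Flink's NetworkBufferPool). */
public class ToyBufferPool {
    private final int total;
    private final Semaphore available;

    public ToyBufferPool(int totalSegments) {
        this.total = totalSegments;
        this.available = new Semaphore(totalSegments);
    }

    /** Tries to reserve n segments within timeoutMs; returns false on timeout. */
    public boolean request(int n, long timeoutMs) {
        try {
            return available.tryAcquire(n, timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    /** Returns n segments to the pool. */
    public void release(int n) {
        available.release(n);
    }

    /** Share of segments currently handed out, as a percentage. */
    public int usedPercent() {
        return 100 * (total - available.availablePermits()) / total;
    }

    public static void main(String[] args) {
        ToyBufferPool pool = new ToyBufferPool(8);
        pool.request(8, 10); // the first deployment takes every segment

        // Simulated failover: the old task goes away WITHOUT calling release(),
        // which models the suspected leak. Each redeployment then times out.
        for (int attempt = 1; attempt <= 3; attempt++) {
            boolean acquired = pool.request(2, 10);
            System.out.println("restart " + attempt + ": acquired=" + acquired
                    + ", used=" + pool.usedPercent() + "%");
        }
    }
}
```

In this model, restarting cannot help because nothing ever calls release() for the leaked segments. Raising the request timeout would not help either, which is consistent with the observation that the job never recovered on its own.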
--
This message was sent by Atlassian Jira
(v8.20.10#820010)