[jira] [Commented] (FLINK-24377) TM resource may not be properly released after heartbeat timeout
    [ https://issues.apache.org/jira/browse/FLINK-24377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421250#comment-17421250 ]

Xintong Song commented on FLINK-24377:
--------------------------------------

Fixed via:
- master (1.15): c81a530392b149141b1124bf83918f717d022111
- release-1.14: fcb19e6bb65128f24d80d68c29ce432d9bdcea22

I'm leaving this ticket open for now, for porting the fix to the 1.13 branch.

> TM resource may not be properly released after heartbeat timeout
>
>                 Key: FLINK-24377
>                 URL: https://issues.apache.org/jira/browse/FLINK-24377
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes, Deployment / YARN, Runtime / Coordination
>    Affects Versions: 1.14.0, 1.13.2
>            Reporter: Xintong Song
>            Assignee: Xintong Song
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.14.0, 1.13.3, 1.15.0
>
>
> In native k8s and yarn deploy modes, RM disconnects a TM when its heartbeat
> times out. However, it does not actively release the pod / container of that
> TM. The release of the pod / container relies on the TM terminating itself
> after failing to re-register with the RM.
> In some rare conditions, the TM process may not terminate and may hang for a
> long time. In such cases, k8s / yarn sees the process running and thus will
> not release the pod / container. Neither will Flink's resource manager.
> Consequently, the resource is leaked until the entire application is
> terminated.
> To fix this, we should make {{ActiveResourceManager}} actively release the
> resource to K8s / Yarn after a TM heartbeat timeout.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-24377) TM resource may not be properly released after heartbeat timeout
    [ https://issues.apache.org/jira/browse/FLINK-24377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420282#comment-17420282 ]

zlzhang0122 commented on FLINK-24377:
-------------------------------------

We have encountered this situation too.

> TM resource may not be properly released after heartbeat timeout
>
>                 Key: FLINK-24377
>                 URL: https://issues.apache.org/jira/browse/FLINK-24377
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes, Deployment / YARN, Runtime / Coordination
>    Affects Versions: 1.14.0, 1.13.2
>            Reporter: Xintong Song
>            Assignee: Xintong Song
>            Priority: Major
>             Fix For: 1.14.0, 1.13.3, 1.15.0
>
>
> In native k8s and yarn deploy modes, RM disconnects a TM when its heartbeat
> times out. However, it does not actively release the pod / container of that
> TM. The release of the pod / container relies on the TM terminating itself
> after failing to re-register with the RM.
> In some rare conditions, the TM process may not terminate and may hang for a
> long time. In such cases, k8s / yarn sees the process running and thus will
> not release the pod / container. Neither will Flink's resource manager.
> Consequently, the resource is leaked until the entire application is
> terminated.
> To fix this, we should make {{ActiveResourceManager}} actively release the
> resource to K8s / Yarn after a TM heartbeat timeout.
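
The behaviour proposed in the issue description (on a TM heartbeat timeout, disconnect the TM and also actively hand its pod / container back to K8s / Yarn) could be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the referenced commits: `ResourceDriver`, `SketchResourceManager`, and all of their methods are hypothetical names invented here, and timestamps are passed in explicitly instead of using Flink's heartbeat services.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the K8s / Yarn client; not a Flink interface. */
interface ResourceDriver {
    void releaseResource(String workerId);
}

/** Hypothetical, heavily simplified resource manager; not Flink's ActiveResourceManager. */
class SketchResourceManager {
    private final ResourceDriver driver;
    private final long timeoutMillis;
    // Last heartbeat timestamp per registered worker (TM).
    private final Map<String, Long> lastHeartbeatMillis = new HashMap<>();

    SketchResourceManager(ResourceDriver driver, long timeoutMillis) {
        this.driver = driver;
        this.timeoutMillis = timeoutMillis;
    }

    /** Records a registration or heartbeat from a worker. */
    void heartbeat(String workerId, long nowMillis) {
        lastHeartbeatMillis.put(workerId, nowMillis);
    }

    boolean isTracked(String workerId) {
        return lastHeartbeatMillis.containsKey(workerId);
    }

    /**
     * Disconnects every worker whose heartbeat has timed out and, instead of
     * waiting for the worker process to exit on its own, actively asks the
     * driver to release its pod / container. Returns the released worker ids.
     */
    List<String> checkHeartbeats(long nowMillis) {
        List<String> timedOut = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeatMillis.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                timedOut.add(entry.getKey());
            }
        }
        for (String workerId : timedOut) {
            lastHeartbeatMillis.remove(workerId); // disconnect the worker
            driver.releaseResource(workerId);     // active release to K8s / Yarn
        }
        return timedOut;
    }
}
```

The point of the sketch is the second loop: the release no longer depends on the TM process terminating itself, so a hung TM cannot keep its pod / container allocated until the application ends.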