[jira] [Commented] (FLINK-24377) TM resource may not be properly released after heartbeat timeout

2021-09-28 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421250#comment-17421250
 ] 

Xintong Song commented on FLINK-24377:
--

Fixed via:
- master (1.15): c81a530392b149141b1124bf83918f717d022111
- release-1.14: fcb19e6bb65128f24d80d68c29ce432d9bdcea22

I'm leaving this ticket open for now, for porting the fix to the 1.13 branch.

> TM resource may not be properly released after heartbeat timeout
> 
>
>                 Key: FLINK-24377
>                 URL: https://issues.apache.org/jira/browse/FLINK-24377
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes, Deployment / YARN, Runtime / Coordination
>    Affects Versions: 1.14.0, 1.13.2
>            Reporter: Xintong Song
>            Assignee: Xintong Song
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.14.0, 1.13.3, 1.15.0
>
>
> In native K8s and YARN deployment modes, the RM disconnects a TM when its
> heartbeat times out. However, it does not actively release the pod / container
> of that TM. Releasing the pod / container relies on the TM terminating itself
> after it fails to re-register with the RM.
> In some rare conditions, the TM process may not terminate and may hang around
> for a long time. In such cases, K8s / YARN sees the process still running and
> thus will not release the pod / container. Neither will Flink's resource
> manager. Consequently, the resource is leaked until the entire application is
> terminated.
> To fix this, we should make {{ActiveResourceManager}} actively release the
> resource to K8s / YARN after a TM heartbeat timeout.
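
For illustration only, the behaviour proposed in the description can be sketched as below. This is a simplified model, not the actual Flink code or the committed fix: ClusterDriver, RecordingDriver, SketchResourceManager and all of their methods are hypothetical names invented for this sketch. It only shows the idea that the heartbeat-timeout handler itself asks the external cluster to release the worker, instead of waiting for the TM process to exit.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the K8s / YARN client; not a Flink interface. */
interface ClusterDriver {
    void releaseResource(String workerId);
}

/** Test double that records which workers were released. */
class RecordingDriver implements ClusterDriver {
    final List<String> released = new ArrayList<>();

    @Override
    public void releaseResource(String workerId) {
        released.add(workerId);
    }
}

/** Simplified model of an active resource manager (illustrative only). */
class SketchResourceManager {
    private final ClusterDriver driver;
    private final Set<String> registered = new HashSet<>();

    SketchResourceManager(ClusterDriver driver) {
        this.driver = driver;
    }

    void registerTaskManager(String workerId) {
        registered.add(workerId);
    }

    boolean isRegistered(String workerId) {
        return registered.contains(workerId);
    }

    /** Called when the heartbeat of the given TM times out. */
    void onHeartbeatTimeout(String workerId) {
        // Disconnect the TM; ignore workers that are unknown or already handled.
        if (!registered.remove(workerId)) {
            return;
        }
        // Actively release the pod / container rather than relying on the
        // TM process to terminate itself.
        driver.releaseResource(workerId);
    }
}
```

In this sketch a second timeout for the same worker is a no-op, so the release request is sent at most once per registered TM. Whether the real resource manager needs that guard is an assumption of the sketch.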



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-24377) TM resource may not be properly released after heartbeat timeout

2021-09-26 Thread zlzhang0122 (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420282#comment-17420282
 ] 

zlzhang0122 commented on FLINK-24377:
-

We have also encountered this situation.



