Xintong Song created FLINK-16299:
------------------------------------

             Summary: Release containers recovered from previous attempt in 
which TaskExecutor is not started.
                 Key: FLINK-16299
                 URL: https://issues.apache.org/jira/browse/FLINK-16299
             Project: Flink
          Issue Type: Improvement
          Components: Deployment / YARN
            Reporter: Xintong Song


As discussed in FLINK-16215, on Yarn deployment, {{YarnResourceManager}} starts 
a new {{TaskExecutor}} in two steps:
 # Request a new container from Yarn
 # Starts a {{TaskExecutor}} process in the allocated container

If JM failover happens between the two steps, in the new attempt 
{{YarnResourceManager}} will not start {{TaskExecutor}} processes in recovered 
containers. That means such containers are neither used nor released.

A potential fix to this problem, is to query form the container status by 
calling {{NMClientAsync#getContainerStatusAsync}}, and release the containers 
whose state is {{NEW}}, keeps only those whose state is {{RUNNING}} and waiting 
for them to register.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to