Hi, I am trying to keep the containers from previous application attempts alive, so that the processing containers stay intact when the AM restarts.
I am achieving this using the keep-containers-across-application-attempt flag. For my use case, I do need to stop the processing container from the previous attempt when certain metadata changes (e.g. work assignments). When the new AM tries to stop the existing container, NMClientAsync throws the following exception:

org.apache.hadoop.yarn.exceptions.YarnException: Container container_1606797336059_0004_01_000002 is neither started nor scheduled to start
    at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45) ~[hadoop-yarn-common-2.7.1.jar:?]
    at org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl.stopContainerAsync(NMClientAsyncImpl.java:235) ~[hadoop-yarn-client-2.7.1.jar:?]

I am guessing the NMClient is unaware of this container since it didn't start it in the first place. Fetching the container status through the NMClient succeeds and reports the container as running. So my guess is that the list of containers NMClient tracks does not include containers that belonged to previous attempts, and hence there is no way to stop them through it.

Any help is appreciated.

Thanks,
Bharath
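
P.S. To make my guess concrete, here is a small self-contained model of what I think is happening. This is only a sketch of my assumption, not the actual Hadoop code: the class TrackingClient and its methods are made up for illustration, and only the error message text is taken from the exception above.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model only: NOT the real NMClientAsyncImpl.
// All class and method names here are hypothetical.
class TrackingClient {
    // Containers that this client instance started itself.
    private final Set<String> started = new HashSet<>();

    void startContainer(String containerId) {
        started.add(containerId);
    }

    // Accepts the stop request only for containers this instance knows about.
    boolean stopContainer(String containerId) {
        if (!started.contains(containerId)) {
            throw new IllegalStateException("Container " + containerId
                    + " is neither started nor scheduled to start");
        }
        started.remove(containerId);
        return true;
    }

    public static void main(String[] args) {
        String id = "container_1606797336059_0004_01_000002";

        // Attempt 1: the first AM's client starts the container.
        TrackingClient firstAttempt = new TrackingClient();
        firstAttempt.startContainer(id);

        // Attempt 2: the restarted AM creates a fresh client whose
        // tracking state is empty, although the container still runs.
        TrackingClient secondAttempt = new TrackingClient();
        try {
            secondAttempt.stopContainer(id);
        } catch (IllegalStateException e) {
            System.out.println("Stop rejected: " + e.getMessage());
        }
    }
}
```

If this model is roughly right, the client in the new attempt rejects the stop request purely because of its own empty bookkeeping, even though the container is still running on the node.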