Hi,

I am trying to understand why about 9 minutes pass between the last attempt to start a container and the moment the YarnApplicationMasterRunner finally receives the SIGTERM that kills it.
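For reference, here is a minimal sketch that measures the gap from the timestamps in the application master log quoted below. The helper name `gap_seconds` and the choice of which three events to compare are my own assumptions for illustration; only the timestamps themselves come from the log.

```python
from datetime import datetime

# Timestamp layout used by the application master log, e.g. "17/08/28 12:39:51".
FMT = "%y/%m/%d %H:%M:%S"


def gap_seconds(start: str, end: str) -> int:
    """Return the number of whole seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Events copied from the log; the variable names are hypothetical labels.
shutdown_decided = "17/08/28 12:39:51"      # "Stopping YARN session because ..."
last_tm_unregistered = "17/08/28 12:41:13"  # "Number of registered task managers 0."
sigterm_received = "17/08/28 12:50:42"      # "RECEIVED SIGNAL 15: SIGTERM."

print(gap_seconds(shutdown_decided, sigterm_received))      # seconds since the shutdown decision
print(gap_seconds(last_tm_unregistered, sigterm_received))  # seconds since the last TaskManager left
```

Measured from the last TaskManager being unregistered, the wait is about 9.5 minutes; measured from the shutdown decision itself, it is closer to 11 minutes.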
Client:

Calc Engine: 2017-08-28 12:39:23,596 INFO org.apache.flink.yarn.YarnClusterClient - Waiting until all TaskManagers have connected
Calc Engine: Waiting until all TaskManagers have connected
Calc Engine: 2017-08-28 12:39:23,600 INFO org.apache.flink.yarn.YarnClusterClient - Starting client actor system.
Calc Engine: 2017-08-28 12:39:24,077 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
Calc Engine: 2017-08-28 12:39:24,366 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://fl...@dlp-qa-176378-023.dc.gs.com:39353]
Calc Engine: 2017-08-28 12:39:24,609 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (0/4)
Calc Engine: TaskManager status (0/4)
Calc Engine: 2017-08-28 12:39:29,864 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (1/4)
Calc Engine: TaskManager status (1/4)
Calc Engine: 2017-08-28 12:39:30,389 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (2/4)
Calc Engine: TaskManager status (2/4)
Calc Engine: 2017-08-28 12:41:04,920 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (1/4)
Calc Engine: TaskManager status (1/4)
Calc Engine: 2017-08-28 12:41:13,775 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (0/4)
Calc Engine: TaskManager status (0/4)
Calc Engine: 2017-08-28 12:50:43,133 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@d191303-019.dc.gs.com:58084] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

Logs:

Container id: container_e71_1503688027943_30786_01_000013
Exit code: 134
Stack trace: ExitCodeException exitCode=134:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:293)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Shell output: main : command provided 1
main : user is delp
main : requested yarn user is delp
Container exited with a non-zero exit code 134
17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Total number of failed containers so far: 5
17/08/28 12:39:51 ERROR yarn.YarnFlinkResourceManager: Stopping YARN session because the number of failed containers (5) exceeded the maximum failed containers (4). This number is controlled by the 'yarn.maximum-failed-containers' configuration setting. By default its the number of requested containers.
17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Shutting down cluster with status FAILED : Stopping YARN session because the number of failed containers (5) exceeded the maximum failed containers (4). This number is controlled by the 'yarn.maximum-failed-containers' configuration setting. By default its the number of requested containers.
17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Unregistering application from the YARN Resource Manager
17/08/28 12:39:51 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-010.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.AMRMClientAsyncImpl: Interrupted while waiting for queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:274)
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-010.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-019.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-010.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-016.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-013.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-019.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-019.dc.gs.com:45454
17/08/28 12:39:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
17/08/28 12:39:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
17/08/28 12:40:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:11 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:11 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:21 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:21 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:31 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:31 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:41 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:41 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:51 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:51 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:41:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:41:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:41:04 WARN remote.RemoteWatcher: Detected unreachable: [akka.tcp://fl...@d191303-010.dc.gs.com:48786]
17/08/28 12:41:04 INFO yarn.YarnJobManager: Task manager akka.tcp://fl...@d191303-010.dc.gs.com:48786/user/taskmanager terminated.
17/08/28 12:41:04 INFO instance.InstanceManager: Unregistered task manager d191303-010.dc.gs.com/10.79.252.104. Number of registered task managers 1. Number of available slots 2.
17/08/28 12:41:11 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:41:13 WARN remote.RemoteWatcher: Detected unreachable: [akka.tcp://fl...@d191303-016.dc.gs.com:58367]
17/08/28 12:41:13 INFO yarn.YarnJobManager: Task manager akka.tcp://fl...@d191303-016.dc.gs.com:58367/user/taskmanager terminated.
17/08/28 12:41:13 INFO instance.InstanceManager: Unregistered task manager d191303-016.dc.gs.com/10.79.162.181. Number of registered task managers 0. Number of available slots 0.
17/08/28 12:50:42 INFO yarn.YarnApplicationMasterRunner: RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
17/08/28 12:50:42 INFO webmonitor.WebRuntimeMonitor: Removing web dashboard root cache directory /tmp/flink-web-d1eebf19-098f-419e-859e-101cfd6c0749
17/08/28 12:50:42 INFO webmonitor.WebRuntimeMonitor: Removing web dashboard jar upload directory /tmp/flink-web-4d9bcf76-ddcb-4dbe-b91d-4a8d8da3d716
17/08/28 12:50:42 INFO blob.BlobServer: Stopped BLOB server at 0.0.0.0:35815

Regina Chan
Goldman Sachs - Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 * (212) 902-5697