[ https://issues.apache.org/jira/browse/SAMZA-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jagadish updated SAMZA-1359:
----------------------------
Description:

At LinkedIn, we have observed phantom notifications for containers that are not actively managed by Samza. We should improve the AppMaster's logic to be resilient to these phantom notifications. Currently, we take actions based on these notifications: for example, we increment the number of failed containers, mark the job as unhealthy, etc. These actions cause false alerts to be fired, and automated remediation kicks in based on those alerts. At LinkedIn, this has affected our ability to do upgrades on the Samza cluster, because several jobs have to be restarted owing to these phantom notifications.

Here is the investigation from an upgrade and a rolling bounce of the Samza ResourceManagers.

1. Samza starts container 13 on a certain host for this job.
{code}
2017-06-07 19:46:20.226 [Container Allocator Thread] YarnContainerRunner [INFO] Claimed container ID 13 for container container_1490916233553_1471_01_000016 on node ltx1-app19956.prod.linkedin.com (http://ltx1-app19956.prod.linkedin.com:8042/node/containerlogs/container_1490916233553_1471_01_000016).
{code}
2. Yarn reports that container 13 is lost due to a failure of its NodeManager on the host where the container was running.
{code}
2017-06-09 06:14:40.398 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, ]
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Completed container had ID: 13
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Removing container ID 13 from completed containers
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Matching container ID found 13 org.apache.samza.clustermanager.SamzaResource@e4e2d63a
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Got an exit code of -100. This means that container container_1490916233553_1471_01_000016 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc. or due to preemption by the RM
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Released container container_1490916233553_1471_01_000016 was assigned task group ID 13. Requesting a refactor container for the task group.
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] SamzaResourceRequest [INFO] Resource Request created for 13 on ANY_HOST at 1496988880401
{code}
3. Samza makes a request to Yarn to restart container 13 on a different host.
{code}
2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Requesting resources on ANY_HOST for container 13
2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Making a request for ANY_HOST ANY_HOST
2017-06-09 06:14:42.400 [AMRM Heartbeater thread] AMRMClientImpl [INFO] Received new token for : ltx1-app19939.prod.linkedin.com:1158
2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container allocated from RM on ltx1-app19939.prod.linkedin.com
2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container allocated from RM on ltx1-app19939.prod.linkedin.com
2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ResourceRequestState [INFO] Host affinity not enabled. Saving the samzaResource container_1490916233553_1471_01_000027 in the buffer for ANY_HOST
2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager [INFO] Cancelling request SamzaResourceRequest{numCores=1, memoryMB=4000, preferredHost='ANY_HOST', requestID='f74a8fcf-be17-42eb-8824-1c32b4504933', containerID=13, requestTimestampMs=1496988880401}
2017-06-09 06:14:44.310 [Container Allocator Thread] AbstractContainerAllocator [INFO] Found available resources on ANY_HOST. Assigning request for container_id 13 with timestamp 1496988880401 to resource container_1490916233553_1471_01_000027
2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager [INFO] Received launch request for 13 on hostname ltx1-app19939.prod.linkedin.com
2017-06-09 06:14:44.310 [Container Allocator Thread] YarnContainerRunner [INFO] Got available container ID (13) for container: Container: [ContainerId: container_1490916233553_1471_01_000027, NodeId: ltx1-app19939.prod.linkedin.com:1158, NodeHttpAddress: ltx1-app19939.prod.linkedin.com:8042, Resource: <memory:4096, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.149.196.44:1158 }, ]
{code}
4. Container 13 is restarted and runs on a different host.
{code}
2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] In runContainer in util: fwkPath= ;cmdPath=./__package/;jobLib=
2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] Container ID 13 using command ./__package//bin/run-container.sh
2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] Container ID 13 using environment variables: ltx1-app1348.prod.linkedin.com ltx1-app1355.prod.linkedin.com ltx1-app20440.prod.linkedin.com
{code}
5. A cluster-wide upgrade happens and the machines in the cluster have to be bounced. We typically do a rolling bounce of the RMs to give the passive RM an opportunity to take ownership. An RM failover happens and state is re-synced from the passive to the active RM. Samza's AppMaster (which had a previous connection to the active RM) resyncs its state from the other RM.
{code}
2017-06-14 23:18:20.281 [AMRM Heartbeater thread] RetryInvocationHandler [INFO] Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over yarn-prod-ltx1-samza-rm-2. Trying to fail over immediately.
java.net.ConnectException: Call From ltx1-app19958.prod.linkedin.com/10.149.196.108 to ltx1-app1348.prod.linkedin.com:8030 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2017-06-14 23:19:39.203 [AMRM Heartbeater thread] AMRMClientImpl [WARN] ApplicationMaster is out of sync with ResourceManager, hence resyncing.
{code}
6. A phantom (duplicate) notification is received for the same container (note the same container ID: container_1490916233553_1471_01_000016). This notification declares that the container exited with code 154.
{code}
Get a Container killed by RM error with non zero exit code 154
2017-06-14 23:19:39.211 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: COMPLETE, Diagnostics: Container Killed by ResourceManager Container exited with a non-zero exit code 154 , ExitStatus: 154, ]
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Completed container had ID: -1
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] No matching container id found for SamzaResourceStatus{resourceID='container_1490916233553_1471_01_000016', diagnostics='Container Killed by ResourceManager Container exited with a non-zero exit code 154 ', exitCode=154}
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container failed for some reason. Let's start it again
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container container_1490916233553_1471_01_000016 failed with exit code . 154 - Container Killed by ResourceManager Container exited with a non-zero exit code 154 containerID is -1
{code}
7. We take action based on this phantom notification (and reset job health and other metrics).
This causes automated false alerts, and remediation kicks in. All jobs in the cluster have to be restarted now.

*Node Manager scenarios that triggered the phantom notification:*

We have observed that if a NodeManager is restarted after a failure, NM recovery kicks in and tries to take ownership of the containers it was managing in its previous run. In some cases, the container could have been killed or exited before the NodeManager comes back up, resulting in a timeout in the NM recovery process. Here are the relevant logs:

a. The NM is killed due to some errors and auto-restarted. It was down from 06:01 to 06:17.
{code}
06:17:29,470 [INFO] [main] util.LogAdapter.info(LogAdapter.java:45) - STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NodeManager
STARTUP_MSG: host = ltx1-app19956.prod.linkedin.com/10.149.196.106
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.7.3
{code}
b. NM recovery kicks in, trying to own the previous containers.
{code}
2017-06-09 06:17:32,899 [INFO] [main] containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:307) - Recovering container_1490916233553_1471_01_000016 in state LAUNCHED with exit code -1000
2017-06-09 06:17:33,115 [INFO] [main] containermanager.ContainerManagerImpl.waitForRecoveredContainers
2017-06-09 06:17:33,274 [INFO] [AsyncDispatcher event handler] application.ApplicationImpl$InitContainerTransition.transition(ApplicationImpl.java:304) - Adding container_1490916233553_1471_01_000016 to application application_1490916233553_1471
2017-06-09 06:17:33,387 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from LOCALIZING to LOCALIZED
2017-06-09 06:17:33,427 [INFO] [ContainersLauncher #39] nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:181) - Reacquiring container_1490916233553_1471_01_000016 with pid 23848
2017-06-09 06:17:33,449 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from LOCALIZED to RUNNING
2017-06-09 06:17:35,094 Sending diagnostics [container_1490916233553_1471_01_000016, CreateTime: 1496864777558, State: RUNNING, Capability: <memory:4096, vCores:1>, Diagnostics: , ExitStatus: -1000, Priority: 0]
{code}
c. Recovery times out because the container already exited (due to an unrelated error).
{code}
2017-06-09 06:17:35,592 [WARN] [ContainersLauncher #39] util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:324) - Failed to read cgroup tasks file.
java.io.FileNotFoundException: /cgroup/cpu/restrain.slice/container_1490916233553_1471_01_000016/tasks (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at java.io.FileInputStream.<init>(FileInputStream.java:93)
	at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:308)
	at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.deleteCgroup(CgroupsLCEResourcesHandler.java:339)
	at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.clearLimits(CgroupsLCEResourcesHandler.java:388)
	at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.postExecute(CgroupsLCEResourcesHandler.java:402)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:354)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2017-06-09 06:17:36,608 [ERROR] [ContainersLauncher #39] launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:88) - Unable to recover container container_1490916233553_1471_01_000016
java.io.IOException: Timeout while waiting for exit code from container_1490916233553_1471_01_000016
	at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:202)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:352)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2017-06-09 06:17:37,223 [INFO] [AsyncDispatcher event handler] launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:370) - Cleaning up container container_1490916233553_1471_01_000016
2017-06-09 06:17:37,465 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from KILLING to EXITED_WITH_FAILURE
2017-06-09 06:17:37,493 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1477351569246_0429_02_000129 transitioned from EXITED_WITH_FAILURE to DONE
2017-06-09 06:17:37,493 [WARN]
[AsyncDispatcher event handler] nodemanager.NMAuditLogger.logFailure(NMAuditLogger.java:150) - USER=samza-trust-repartition OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1490916233553_1471 CONTAINERID=container_1490916233553_1471_01_000016 {code} d. When an RM fail over happens, and the state is re-synced from the passive to the active RM triggering a phantom notification. was: At LinkedIn, we have observed that there are phantom notifications corresponding to containers not being actively managed by Samza. We should improve our AppMaster's logic to be resilient to these phantom notifications. Currently, we take actions based on these notifications - for example, we increment the number of failed containers, mark the job health to be false etc. These actions cause false alerts to be fired and automated remediations based on these alerts. At LinkedIn, this has affected our ability to do upgrades on the Samza cluster because several jobs have to be restarted owing to these phantom notifications. Here's the investigation during an upgrade and a rolling bounce of the Samza ResourceManagers. 1. Samza starts container 13 on a certain host for this job. {code} 2017-06-07 19:46:20.226 [Container Allocator Thread] YarnContainerRunner [INFO] Claimed container ID 13 for container container_1490916233553_1471_01_000016 on node ltx1-app19956.prod.linkedin.com (http://ltx1-app19956.prod.linkedin.com:8042/node/containerlogs/container_1490916233553_1471_01_000016). {code} 2. Yarn reports that container 13 is lost due to a failure of its NodeManager on the host at which the container was running. 
{code} 2017-06-09 06:14:40.398 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, ] 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Completed container had ID: 13 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Removing container ID 13 from completed containers 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Matching container ID found 13 org.apache.samza.clustermanager.SamzaResource@e4e2d63a 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Got an exit code of -100. This means that container container_1490916233553_1471_01_000016 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc. or due to preemption by the RM 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Released container container_1490916233553_1471_01_000016 was assigned task group ID 13. Requesting a refactor container for the task group. 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] SamzaResourceRequest [INFO] Resource Request created for 13 on ANY_HOST at 1496988880401 {code} 3. Samza makes a request to Yarn for restarting container 13 on a different host. 
{code} 2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Requesting resources on ANY_HOST for container 13 2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Making a request for ANY_HOST ANY_HOST 2017-06-09 06:14:42.400 [AMRM Heartbeater thread] AMRMClientImpl [INFO] Received new token for : ltx1-app19939.prod.linkedin.com:1158 2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container allocated from RM on ltx1-app19939.prod.linkedin.com 2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container allocated from RM on ltx1-app19939.prod.linkedin.com 2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ResourceRequestState [INFO] Host affinity not enabled. Saving the samzaResource container_1490916233553_1471_01_000027 in the buffer for ANY_HOST 2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager [INFO] Cancelling request SamzaResourceRequest{numCores=1, memoryMB=4000, preferredHost='ANY_HOST', requestID='f74a8fcf-be17-42eb-8824-1c32b4504933', containerID=13, requestTimestampMs=1496988880401} 2017-06-09 06:14:44.310 [Container Allocator Thread] AbstractContainerAllocator [INFO] Found available resources on ANY_HOST. 
Assigning request for container_id 13 with timestamp 1496988880401 to resource container_1490916233553_1471_01_000027 2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager [INFO] Received launch request for 13 on hostname ltx1-app19939.prod.linkedin.com 2017-06-09 06:14:44.310 [Container Allocator Thread] YarnContainerRunner [INFO] Got available container ID (13) for container: Container: [ContainerId: container_1490916233553_1471_01_000027, NodeId: ltx1-app19939.prod.linkedin.com:1158, NodeHttpAddress: ltx1-app19939.prod.linkedin.com:8042, Resource: <memory:4096, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.149.196.44:1158 }, ] {code} 4. Container 13 is re-started and runs on a different host {code} 2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] In runContainer in util: fwkPath= ;cmdPath=./__package/;jobLib= 2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] Container ID 13 using command ./__package//bin/run-container.sh 2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] Container ID 13 using environment variables: ltx1-app1348.prod.linkedin.com ltx1-app1355.prod.linkedin.com ltx1-app20440.prod.linkedin.com {code} 5. A cluster wide upgrade happens and the machines in the cluster have to be bounced. We typically do a rolling bounce on the RM to give an opportunity for the passive RM to take ownership. RM failover happens and state is re-synced from the passive to the active RM. Samza's AppMaster (which had a previous connection to the active RM) resyncs it's state from the other RM. {code} 2017-06-14 23:18:20.281 [AMRM Heartbeater thread] RetryInvocationHandler [INFO] Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over yarn-prod-ltx1-samza-rm-2. Trying to fail over immediately. 
java.net.ConnectException: Call From ltx1-app19958.prod.linkedin.com/10.149.196.108 to ltx1-app1348.prod.linkedin.com:8030 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused 2017-06-14 23:19:39.203 [AMRM Heartbeater thread] AMRMClientImpl [WARN] ApplicationMaster is out of sync with ResourceManager, hence resyncing. {code} 6. A phantom notification (duplicate) is received for the same container (Note the same container ID: container_1490916233553_1471_01_000016). This notification declares that the container exited with code 154. {code} Get a Container killed by RM error with non zero exit code 154 2017-06-14 23:19:39.211 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: COMPLETE, Diagnostics: Container Killed by ResourceManager Container exited with a non-zero exit code 154 , ExitStatus: 154, ] 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Completed container had ID: -1 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] No matching container id found for SamzaResourceStatus{resourceID='container_1490916233553_1471_01_000016', diagnostics='Container Killed by ResourceManager Container exited with a non-zero exit code 154 ', exitCode=154} 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container failed for some reason. Let's start it again 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container container_1490916233553_1471_01_000016 failed with exit code . 154 - Container Killed by ResourceManager Container exited with a non-zero exit code 154 containerID is -1 {code} 7. We take an action based on this phantom notification (and reset job health, other metrics). 
This causes automated false-alerts, and remediation to kick in. All jobs in the cluster have to be restarted now. *Node Manager scenarios that triggered the phantom notification:* We have observed that if a node manager is re-started after a failure, NM recovery kicks in and tries to take ownership of containers it was managing in its previous run. In some cases, the container could have been killed / exited before the node manager comes back up. This would result in a timeout in the NM recovery process. Here are the relevant logs : a. NM is killed due to some errors and auto re-started. It was down from 06:01 to 06:17. {code} 06:17:29,470 [INFO] [main] util.LogAdapter.info(LogAdapter.java:45) - STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting NodeManager STARTUP_MSG: host = ltx1-app19956.prod.linkedin.com/10.149.196.106 STARTUP_MSG: args = [] STARTUP_MSG: version = 2.7.3 {code} b. NM recovery kicks in trying to own previous containers {code} 2017-06-09 06:17:32,899 [INFO] [main] containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:307) - Recovering container_1490916233553_1471_01_000016 in state LAUNCHED with exit code -1000 2017-06-09 06:17:33,115 [INFO] [main] containermanager.ContainerManagerImpl.waitForRecoveredContainers 2017-06-09 06:17:33,274 [INFO] [AsyncDispatcher event handler] application.ApplicationImpl$InitContainerTransition.transition(ApplicationImpl.java:304) - Adding container_1490916233553_1471_01_000016 to application application_1490916233553_1471 2017-06-09 06:17:33,387 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from LOCALIZING to LOCALIZED 2017-06-09 06:17:33,427 [INFO] [ContainersLauncher #39] nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:181) - Reacquiring container_1490916233553_1471_01_000016 with pid 23848 reqcquireing pid 
2017-06-09 06:17:33,449 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from LOCALIZED to RUNNING 2017-06-09 06:17:35,094 Sending diagnostics [container_1490916233553_1471_01_000016, CreateTime: 1496864777558, State: RUNNING, Capability: <memory:4096, vCores:1>, Diagnostics: , ExitStatus: -1000, Priority: 0] {code} c. Recovery times out because the container already exited (due to an unrelated error) {code} 2017-06-09 06:17:35,592 [WARN] [ContainersLauncher #39] util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:324) - Failed to read cgroup tasks file. java.io.FileNotFoundException: /cgroup/cpu/restrain.slice/container_1490916233553_1471_01_000016/tasks (No such file or directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at java.io.FileInputStream.<init>(FileInputStream.java:93) at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:308) at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.deleteCgroup(CgroupsLCEResourcesHandler.java:339) at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.clearLimits(CgroupsLCEResourcesHandler.java:388) at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.postExecute(CgroupsLCEResourcesHandler.java:402) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:354) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2017-06-09 06:17:36,608 [ERROR] [ContainersLauncher #39] launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:88) - Unable to recover container container_1490916233553_1471_01_000016 java.io.IOException: Timeout while waiting for exit code from container_1490916233553_1471_01_000016 at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:202) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:352) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2017-06-09 06:17:37,223 [INFO] [AsyncDispatcher event handler] launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:370) - Cleaning up container container_1490916233553_1471_01_000016 2017-06-09 06:17:37,465 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from KILLING to EXITED_WITH_FAILURE 2017-06-09 06:17:37,493 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1477351569246_0429_02_000129 transitioned from EXITED_WITH_FAILURE to DONE 2017-06-09 06:17:37,493 [WARN] 
[AsyncDispatcher event handler] nodemanager.NMAuditLogger.logFailure(NMAuditLogger.java:150) - USER=samza-trust-repartition OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1490916233553_1471 CONTAINERID=container_1490916233553_1471_01_000016 {code} d. When an RM fail over happens, and the state is re-synced from the passive to the active RM triggering a phantom notification. > Handle phantom container notifications cleanly during RM fail-over > ------------------------------------------------------------------ > > Key: SAMZA-1359 > URL: https://issues.apache.org/jira/browse/SAMZA-1359 > Project: Samza > Issue Type: Bug > Reporter: Jagadish > > At LinkedIn, we have observed that there are phantom notifications > corresponding to containers not being actively managed by Samza. We should > improve our AppMaster's logic to be resilient to these phantom notifications. > Currently, we take actions based on these notifications - for example, we > increment the number of failed containers, mark the job health to be false > etc. These actions cause false alerts to be fired and automated remediations > based on these alerts. At LinkedIn, this has affected our ability to do > upgrades on the Samza cluster because several jobs have to be restarted owing > to these phantom notifications. > Here's the investigation during an upgrade and a rolling bounce of the Samza > ResourceManagers. > 1. Samza starts container 13 on a certain host for this job. > {code} > 2017-06-07 19:46:20.226 [Container Allocator Thread] YarnContainerRunner > [INFO] Claimed container ID 13 for container > container_1490916233553_1471_01_000016 on node > ltx1-app19956.prod.linkedin.com > (http://ltx1-app19956.prod.linkedin.com:8042/node/containerlogs/container_1490916233553_1471_01_000016). > {code} > 2. 
Yarn reports that container 13 is lost due to a failure of its NodeManager > on the host at which the container was running. > {code} > 2017-06-09 06:14:40.398 [AMRM Callback Handler Thread] > YarnClusterResourceManager [INFO] Container completed from RM > ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: > COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, > ] > 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] > YarnClusterResourceManager [INFO] Completed container had ID: 13 > 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] > YarnClusterResourceManager [INFO] Removing container ID 13 from completed > containers > 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] > ContainerProcessManager [INFO] Matching container ID found 13 > org.apache.samza.clustermanager.SamzaResource@e4e2d63a > 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] > ContainerProcessManager [INFO] Got an exit code of -100. This means that > container container_1490916233553_1471_01_000016 was killed by YARN, either > due to being released by the application master or being 'lost' due to node > failures etc. or due to preemption by the RM > 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] > ContainerProcessManager [INFO] Released container > container_1490916233553_1471_01_000016 was assigned task group ID 13. > Requesting a refactor container for the task group. > 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] SamzaResourceRequest > [INFO] Resource Request created for 13 on ANY_HOST at 1496988880401 > {code} > 3. Samza makes a request to Yarn for restarting container 13 on a different > host. 
{code}
2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Requesting resources on ANY_HOST for container 13
2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Making a request for ANY_HOST ANY_HOST
2017-06-09 06:14:42.400 [AMRM Heartbeater thread] AMRMClientImpl [INFO] Received new token for : ltx1-app19939.prod.linkedin.com:1158
2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container allocated from RM on ltx1-app19939.prod.linkedin.com
2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container allocated from RM on ltx1-app19939.prod.linkedin.com
2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ResourceRequestState [INFO] Host affinity not enabled. Saving the samzaResource container_1490916233553_1471_01_000027 in the buffer for ANY_HOST
2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager [INFO] Cancelling request SamzaResourceRequest{numCores=1, memoryMB=4000, preferredHost='ANY_HOST', requestID='f74a8fcf-be17-42eb-8824-1c32b4504933', containerID=13, requestTimestampMs=1496988880401}
2017-06-09 06:14:44.310 [Container Allocator Thread] AbstractContainerAllocator [INFO] Found available resources on ANY_HOST. Assigning request for container_id 13 with timestamp 1496988880401 to resource container_1490916233553_1471_01_000027
2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager [INFO] Received launch request for 13 on hostname ltx1-app19939.prod.linkedin.com
2017-06-09 06:14:44.310 [Container Allocator Thread] YarnContainerRunner [INFO] Got available container ID (13) for container: Container: [ContainerId: container_1490916233553_1471_01_000027, NodeId: ltx1-app19939.prod.linkedin.com:1158, NodeHttpAddress: ltx1-app19939.prod.linkedin.com:8042, Resource: <memory:4096, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.149.196.44:1158 }, ]
{code}
4. Container 13 is restarted and runs on a different host.
{code}
2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] In runContainer in util: fwkPath= ;cmdPath=./__package/;jobLib=
2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] Container ID 13 using command ./__package//bin/run-container.sh
2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] Container ID 13 using environment variables:
ltx1-app1348.prod.linkedin.com
ltx1-app1355.prod.linkedin.com
ltx1-app20440.prod.linkedin.com
{code}
5. A cluster-wide upgrade happens and the machines in the cluster have to be bounced. We typically do a rolling bounce of the RMs to give the passive RM an opportunity to take ownership. An RM failover happens, and state is re-synced from the passive to the active RM. Samza's AppMaster (which had a previous connection to the active RM) resyncs its state from the other RM.
{code}
2017-06-14 23:18:20.281 [AMRM Heartbeater thread] RetryInvocationHandler [INFO] Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over yarn-prod-ltx1-samza-rm-2. Trying to fail over immediately.
java.net.ConnectException: Call From ltx1-app19958.prod.linkedin.com/10.149.196.108 to ltx1-app1348.prod.linkedin.com:8030 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2017-06-14 23:19:39.203 [AMRM Heartbeater thread] AMRMClientImpl [WARN] ApplicationMaster is out of sync with ResourceManager, hence resyncing.
{code}
6. A phantom (duplicate) notification is received for the same container (note the same container ID: container_1490916233553_1471_01_000016). This notification declares that the container exited with code 154.
{code}
Get a Container killed by RM error with non zero exit code 154
2017-06-14 23:19:39.211 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: COMPLETE, Diagnostics: Container Killed by ResourceManager
Container exited with a non-zero exit code 154
, ExitStatus: 154, ]
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Completed container had ID: -1
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] No matching container id found for SamzaResourceStatus{resourceID='container_1490916233553_1471_01_000016', diagnostics='Container Killed by ResourceManager
Container exited with a non-zero exit code 154
', exitCode=154}
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container failed for some reason. Let's start it again
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container container_1490916233553_1471_01_000016 failed with exit code . 154 - Container Killed by ResourceManager
Container exited with a non-zero exit code 154
containerID is -1
{code}
7. We take action based on this phantom notification (and reset job health and other metrics). This causes automated false alerts to fire and remediation to kick in. All jobs in the cluster then have to be restarted.

*Node Manager scenarios that triggered the phantom notification:*

We have observed that if a NodeManager is restarted after a failure, NM recovery kicks in and tries to take ownership of the containers it was managing in its previous run. In some cases, the container could have been killed or have exited before the NodeManager comes back up. This results in a timeout in the NM recovery process.

Here are the relevant logs:

a. The NM is killed due to some errors and automatically restarted. It was down from 06:01 to 06:17.
{code}
06:17:29,470 [INFO] [main] util.LogAdapter.info(LogAdapter.java:45) - STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NodeManager
STARTUP_MSG:   host = ltx1-app19956.prod.linkedin.com/10.149.196.106
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.7.3
{code}
b. NM recovery kicks in, trying to take ownership of its previous containers.
{code}
2017-06-09 06:17:32,899 [INFO] [main] containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:307) - Recovering container_1490916233553_1471_01_000016 in state LAUNCHED with exit code -1000
2017-06-09 06:17:33,115 [INFO] [main] containermanager.ContainerManagerImpl.waitForRecoveredContainers
2017-06-09 06:17:33,274 [INFO] [AsyncDispatcher event handler] application.ApplicationImpl$InitContainerTransition.transition(ApplicationImpl.java:304) - Adding container_1490916233553_1471_01_000016 to application application_1490916233553_1471
2017-06-09 06:17:33,387 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from LOCALIZING to LOCALIZED
2017-06-09 06:17:33,427 [INFO] [ContainersLauncher #39] nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:181) - Reacquiring container_1490916233553_1471_01_000016 with pid 23848
2017-06-09 06:17:33,449 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from LOCALIZED to RUNNING
2017-06-09 06:17:35,094 Sending diagnostics [container_1490916233553_1471_01_000016, CreateTime: 1496864777558, State: RUNNING, Capability: <memory:4096, vCores:1>, Diagnostics: , ExitStatus: -1000, Priority: 0]
{code}
c. Recovery times out because the container already exited (due to an unrelated error).
{code}
2017-06-09 06:17:35,592 [WARN] [ContainersLauncher #39] util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:324) - Failed to read cgroup tasks file.
java.io.FileNotFoundException: /cgroup/cpu/restrain.slice/container_1490916233553_1471_01_000016/tasks (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:308)
        at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.deleteCgroup(CgroupsLCEResourcesHandler.java:339)
        at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.clearLimits(CgroupsLCEResourcesHandler.java:388)
        at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.postExecute(CgroupsLCEResourcesHandler.java:402)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:354)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2017-06-09 06:17:36,608 [ERROR] [ContainersLauncher #39] launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:88) - Unable to recover container container_1490916233553_1471_01_000016
java.io.IOException: Timeout while waiting for exit code from container_1490916233553_1471_01_000016
        at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:202)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:352)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2017-06-09 06:17:37,223 [INFO] [AsyncDispatcher event handler] launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:370) - Cleaning up container container_1490916233553_1471_01_000016
2017-06-09 06:17:37,465 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from KILLING to EXITED_WITH_FAILURE
2017-06-09 06:17:37,493 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1477351569246_0429_02_000129 transitioned from EXITED_WITH_FAILURE to DONE
2017-06-09 06:17:37,493 [WARN] [AsyncDispatcher event handler] nodemanager.NMAuditLogger.logFailure(NMAuditLogger.java:150) - USER=samza-trust-repartition  OPERATION=Container Finished - Failed  TARGET=ContainerImpl  RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  APPID=application_1490916233553_1471  CONTAINERID=container_1490916233553_1471_01_000016
{code}
d. An RM failover happens, the state is re-synced from the passive to the active RM, and the re-sync triggers the phantom notification.
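One way to make the AppMaster resilient, as proposed above, is to track the set of containers Samza is actively managing and drop any completion notification for a container that is not in that set (the logs in step 6 already show "No matching container id found", yet metrics are still updated). The following is a minimal sketch of that guard; the class and method names are hypothetical and are not Samza's actual API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the AppMaster records which YARN container IDs it is
// actively managing, and ignores COMPLETE notifications for unknown IDs
// (e.g. duplicates replayed after an RM failover/re-sync).
public class PhantomNotificationFilter {

    // YARN container ID -> logical Samza container ID (e.g. 13)
    private final Map<String, Integer> activeContainers = new ConcurrentHashMap<>();

    public void onContainerStarted(String yarnContainerId, int samzaContainerId) {
        activeContainers.put(yarnContainerId, samzaContainerId);
    }

    /**
     * Returns true if the completion was for a container we manage and was
     * acted upon; false if it was dropped as a phantom notification, in which
     * case failure counts and job-health metrics are left untouched.
     */
    public boolean onContainerCompleted(String yarnContainerId, int exitCode) {
        Integer samzaId = activeContainers.remove(yarnContainerId);
        if (samzaId == null) {
            // Phantom: we never managed this container, or we already
            // processed its completion. Do not fire alerts or reset health.
            return false;
        }
        // Real completion: request a replacement container, update metrics, etc.
        return true;
    }
}
```

With this guard, the second COMPLETE status for container_1490916233553_1471_01_000016 in step 6 would be dropped, since the first notification (step 2) already removed it from the active set.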
-- This message was sent by Atlassian JIRA (v6.4.14#64029)