[ https://issues.apache.org/jira/browse/SAMZA-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jagadish resolved SAMZA-1359.
-----------------------------
    Resolution: Fixed

> Handle phantom container notifications cleanly during RM fail-over
> ------------------------------------------------------------------
>
> Key: SAMZA-1359
> URL: https://issues.apache.org/jira/browse/SAMZA-1359
> Project: Samza
> Issue Type: Bug
> Reporter: Jagadish
> Assignee: Jagadish
>
> At LinkedIn, we have observed phantom notifications corresponding to containers that are not actively managed by Samza. We should improve our AppMaster's logic to be resilient to these phantom notifications (a minimal sketch of such a guard is included at the end of this description). Currently, we take actions based on these notifications - for example, we increment the number of failed containers, mark the job health as false, etc. These actions cause false alerts to be fired and automated remediation to be triggered based on those alerts. At LinkedIn, this has affected our ability to do upgrades on the Samza cluster because several jobs have to be restarted owing to these phantom notifications.
>
> Here is the investigation from an upgrade and a rolling bounce of the Samza ResourceManagers.
>
> 1. Samza starts container 13 on a certain host for this job.
> {code}
> 2017-06-07 19:46:20.226 [Container Allocator Thread] YarnContainerRunner [INFO] Claimed container ID 13 for container container_1490916233553_1471_01_000016 on node hostname_masked19956masked_host (http://hostname_masked19956masked_host:8042/node/containerlogs/container_1490916233553_1471_01_000016).
> {code}
> 2. Yarn reports that container 13 is lost due to a failure of its NodeManager on the host at which the container was running.
> {code}
> 2017-06-09 06:14:40.398 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, ]
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Completed container had ID: 13
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Removing container ID 13 from completed containers
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Matching container ID found 13 org.apache.samza.clustermanager.SamzaResource@e4e2d63a
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Got an exit code of -100. This means that container container_1490916233553_1471_01_000016 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc. or due to preemption by the RM
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Released container container_1490916233553_1471_01_000016 was assigned task group ID 13. Requesting a refactor container for the task group.
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] SamzaResourceRequest [INFO] Resource Request created for 13 on ANY_HOST at 1496988880401
> {code}
> 3. Samza makes a request to Yarn to restart container 13 on a different host.
> {code}
> 2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Requesting resources on ANY_HOST for container 13
> 2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Making a request for ANY_HOST ANY_HOST
> 2017-06-09 06:14:42.400 [AMRM Heartbeater thread] AMRMClientImpl [INFO] Received new token for : hostname_masked19939masked_host
> 2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container allocated from RM on hostname_masked19939masked_host
> 2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container allocated from RM on hostname_masked19939masked_host
> 2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ResourceRequestState [INFO] Host affinity not enabled. Saving the samzaResource container_1490916233553_1471_01_000027 in the buffer for ANY_HOST
> 2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager [INFO] Cancelling request SamzaResourceRequest{numCores=1, memoryMB=4000, preferredHost='ANY_HOST', requestID='f74a8fcf-be17-42eb-8824-1c32b4504933', containerID=13, requestTimestampMs=1496988880401}
> 2017-06-09 06:14:44.310 [Container Allocator Thread] AbstractContainerAllocator [INFO] Found available resources on ANY_HOST. Assigning request for container_id 13 with timestamp 1496988880401 to resource container_1490916233553_1471_01_000027
> 2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager [INFO] Received launch request for 13 on hostname hostname_masked19939masked_host
> 2017-06-09 06:14:44.310 [Container Allocator Thread] YarnContainerRunner [INFO] Got available container ID (13) for container: Container: [ContainerId: container_1490916233553_1471_01_000027, NodeId: hostname_masked19939masked_host:1158, NodeHttpAddress: hostname_masked19939masked_host:8042, Resource: <memory:4096, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: }, ]
> {code}
> 4. Container 13 is re-started and runs on a different host.
> {code}
> 2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] In runContainer in util: fwkPath= ;cmdPath=./__package/;jobLib=
> 2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] Container ID 13 using command ./__package//bin/run-container.sh
> 2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] Container ID 13 using environment variables:
> hostname_masked1348masked_host
> hostname_masked1355masked_host
> hostname_masked20440masked_host
> {code}
> 5. A cluster-wide upgrade happens and the machines in the cluster have to be bounced. We typically do a rolling bounce of the RMs to give the passive RM an opportunity to take ownership. An RM failover happens and state is re-synced from the passive to the active RM. Samza's AppMaster (which had a previous connection to the active RM) re-syncs its state from the other RM.
> {code}
> 2017-06-14 23:18:20.281 [AMRM Heartbeater thread] RetryInvocationHandler [INFO] Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over yarn-prod-ltx1-samza-rm-2. Trying to fail over immediately.
> java.net.ConnectException: Call From hostname_masked19958masked_host/ to hostname_masked1348masked_host:8030 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> 2017-06-14 23:19:39.203 [AMRM Heartbeater thread] AMRMClientImpl [WARN] ApplicationMaster is out of sync with ResourceManager, hence resyncing.
> {code}
> 6. A phantom (duplicate) notification is received for the same container (note the same container ID: container_1490916233553_1471_01_000016). It declares that the container exited with code 154.
> {code}
> Get a Container killed by RM error with non zero exit code 154
> 2017-06-14 23:19:39.211 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: COMPLETE, Diagnostics: Container Killed by ResourceManager
> Container exited with a non-zero exit code 154
> , ExitStatus: 154, ]
> 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Completed container had ID: -1
> 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] No matching container id found for SamzaResourceStatus{resourceID='container_1490916233553_1471_01_000016', diagnostics='Container Killed by ResourceManager
> Container exited with a non-zero exit code 154
> ', exitCode=154}
> 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container failed for some reason. Let's start it again
> 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container container_1490916233553_1471_01_000016 failed with exit code . 154 - Container Killed by ResourceManager
> Container exited with a non-zero exit code 154
> containerID is -1
> {code}
> 7. We take action based on this phantom notification (and reset job health and other metrics). This causes automated false alerts and remediation to kick in. All jobs in the cluster then have to be restarted.
>
> *What scenarios on the Node Manager side trigger the phantom notification?*
>
> We have observed that if a node manager is re-started after a failure, NM recovery kicks in and tries to take ownership of the containers it was managing in its previous run. In some cases, the container could have been killed or could have exited before the node manager comes back up. This results in a timeout in the NM recovery process.
>
> Here are the relevant logs:
>
> a. The NM is killed due to some errors and auto re-started. It was down from 06:01 to 06:17.
> {code}
> 06:17:29,470 [INFO] [main] util.LogAdapter.info(LogAdapter.java:45) - STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting NodeManager
> STARTUP_MSG: host = hostname_masked19956masked_host
> STARTUP_MSG: args = []
> STARTUP_MSG: version = 2.7.3
> {code}
> b. NM recovery kicks in and tries to take ownership of its previous containers.
> {code}
> 2017-06-09 06:17:32,899 [INFO] [main] containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:307) - Recovering container_1490916233553_1471_01_000016 in state LAUNCHED with exit code -1000
> 2017-06-09 06:17:33,115 [INFO] [main] containermanager.ContainerManagerImpl.waitForRecoveredContainers
> 2017-06-09 06:17:33,274 [INFO] [AsyncDispatcher event handler] application.ApplicationImpl$InitContainerTransition.transition(ApplicationImpl.java:304) - Adding container_1490916233553_1471_01_000016 to application application_1490916233553_1471
> 2017-06-09 06:17:33,387 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from LOCALIZING to LOCALIZED
> 2017-06-09 06:17:33,427 [INFO] [ContainersLauncher #39] nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:181) - Reacquiring container_1490916233553_1471_01_000016 with pid 23848 reqcquireing pid
> 2017-06-09 06:17:33,449 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from LOCALIZED to RUNNING
> 2017-06-09 06:17:35,094 Sending diagnostics [container_1490916233553_1471_01_000016, CreateTime: 1496864777558, State: RUNNING, Capability: <memory:4096, vCores:1>, Diagnostics: , ExitStatus: -1000, Priority: 0]
> {code}
> c. Recovery times out because the container already exited (due to an unrelated error).
> {code}
> 2017-06-09 06:17:35,592 [WARN] [ContainersLauncher #39] util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:324) - Failed to read cgroup tasks file.
> java.io.FileNotFoundException: /cgroup/cpu/restrain.slice/container_1490916233553_1471_01_000016/tasks (No such file or directory)
>     at java.io.FileInputStream.open0(Native Method)
>     at java.io.FileInputStream.open(FileInputStream.java:195)
>     at java.io.FileInputStream.<init>(FileInputStream.java:138)
>     at java.io.FileInputStream.<init>(FileInputStream.java:93)
>     at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:308)
>     at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.deleteCgroup(CgroupsLCEResourcesHandler.java:339)
>     at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.clearLimits(CgroupsLCEResourcesHandler.java:388)
>     at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.postExecute(CgroupsLCEResourcesHandler.java:402)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:354)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 2017-06-09 06:17:36,608 [ERROR] [ContainersLauncher #39] launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:88) - Unable to recover container container_1490916233553_1471_01_000016
> java.io.IOException: Timeout while waiting for exit code from container_1490916233553_1471_01_000016
>     at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:202)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:352)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 2017-06-09 06:17:37,223 [INFO] [AsyncDispatcher event handler] launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:370) - Cleaning up container container_1490916233553_1471_01_000016
> 2017-06-09 06:17:37,465 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from KILLING to EXITED_WITH_FAILURE
> 2017-06-09 06:17:37,493 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1477351569246_0429_02_000129 transitioned from EXITED_WITH_FAILURE to DONE
> 2017-06-09 06:17:37,493 [WARN] [AsyncDispatcher event handler] nodemanager.NMAuditLogger.logFailure(NMAuditLogger.java:150) - USER=samza-trust-repartition OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1490916233553_1471 CONTAINERID=container_1490916233553_1471_01_000016
> {code}
>
> d. An RM failover then happens, and the state is re-synced from the passive to the active RM, triggering a phantom notification.
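>
> For reference, here is a minimal sketch of the kind of guard described above: drop completion notifications whose container ID the AppMaster is not currently tracking (the "Completed container had ID: -1" / "No matching container id found" case in step 6), instead of counting them as failures and resetting job health. All class, field, and method names below are illustrative placeholders, not Samza's actual ContainerProcessManager/YarnClusterResourceManager API.
> {code}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Hypothetical sketch only -- names do not correspond to the real Samza classes.
> public class PhantomNotificationGuard {
>
>   // YARN container IDs the AppMaster is currently managing, mapped to Samza container IDs.
>   private final Map<String, Integer> runningContainers = new ConcurrentHashMap<>();
>
>   public void onContainerStarted(String yarnContainerId, int samzaContainerId) {
>     runningContainers.put(yarnContainerId, samzaContainerId);
>   }
>
>   /**
>    * Handles a container-completed callback from the RM. Returns true if the
>    * notification was acted on, false if it was dropped as a phantom.
>    */
>   public boolean onContainerCompleted(String yarnContainerId, int exitCode) {
>     Integer samzaContainerId = runningContainers.remove(yarnContainerId);
>     if (samzaContainerId == null) {
>       // Phantom notification, e.g. a stale completion replayed after an RM fail-over
>       // for a container that was already handled: log it and return without touching
>       // failure counts or job-health metrics. (System.out stands in for a real logger.)
>       System.out.printf("Ignoring phantom notification for %s (exit code %d)%n",
>           yarnContainerId, exitCode);
>       return false;
>     }
>     // Normal path: update metrics and request a replacement for samzaContainerId.
>     System.out.printf("Container %d (%s) completed with exit code %d%n",
>         samzaContainerId, yarnContainerId, exitCode);
>     return true;
>   }
> }
> {code}
> The essential point is only the early return on an untracked container ID; how (or whether) such dropped notifications are surfaced in metrics is a separate decision.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)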