[ https://issues.apache.org/jira/browse/SAMZA-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jagadish resolved SAMZA-1359.
-----------------------------
    Resolution: Fixed

> Handle phantom container notifications cleanly during RM fail-over
> ------------------------------------------------------------------
>
>                 Key: SAMZA-1359
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1359
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jagadish
>            Assignee: Jagadish
>
> At LinkedIn, we have observed phantom notifications corresponding to 
> containers that are not actively managed by Samza. We should improve our 
> AppMaster's logic to be resilient to these phantom notifications. Currently, 
> we take actions based on them - for example, we increment the number of 
> failed containers, mark the job health as false, etc. These actions cause 
> false alerts to fire and automated remediation to kick in based on those 
> alerts. At LinkedIn, this has affected our ability to do upgrades on the 
> Samza cluster because several jobs have to be restarted owing to these 
> phantom notifications. 
> Here's the investigation from an upgrade and a rolling bounce of the Samza 
> ResourceManagers.
> 1. Samza starts container 13 on a certain host for this job.
> {code}
> 2017-06-07 19:46:20.226 [Container Allocator Thread] YarnContainerRunner 
> [INFO] Claimed container ID 13 for container 
> container_1490916233553_1471_01_000016 on node 
> hostname_masked19956masked_host 
> (http://hostname_masked19956masked_host:8042/node/containerlogs/container_1490916233553_1471_01_000016).
> {code}
> 2. Yarn reports that container 13 is lost due to a failure of its NodeManager 
> on the host at which the container was running.
> {code}
> 2017-06-09 06:14:40.398 [AMRM Callback Handler Thread] 
> YarnClusterResourceManager [INFO] Container completed from RM 
> ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: 
> COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, 
> ]
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] 
> YarnClusterResourceManager [INFO] Completed container had ID: 13
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] 
> YarnClusterResourceManager [INFO] Removing container ID 13 from completed 
> containers
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] 
> ContainerProcessManager [INFO] Matching container ID found 13 
> org.apache.samza.clustermanager.SamzaResource@e4e2d63a
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] 
> ContainerProcessManager [INFO] Got an exit code of -100. This means that 
> container container_1490916233553_1471_01_000016 was killed by YARN, either 
> due to being released by the application master or being 'lost' due to node 
> failures etc. or due to preemption by the RM
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] 
> ContainerProcessManager [INFO] Released container 
> container_1490916233553_1471_01_000016 was assigned task group ID 13. 
> Requesting a refactor container for the task group.
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] SamzaResourceRequest 
> [INFO] Resource Request created for 13 on ANY_HOST at 1496988880401
> {code}
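> For reference, exit status -100 corresponds to ContainerExitStatus.ABORTED in 
> the YARN API. Below is a minimal, illustrative sketch (not the Samza code) of 
> how an AM can distinguish YARN-initiated kills from application failures 
> using the standard YARN constants:
> {code}
> import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
> 
> // Illustrative helper only: classifies a completed-container exit status.
> public final class ExitStatusClassifier {
>   private ExitStatusClassifier() {}
> 
>   /**
>    * Returns true if the exit status indicates that YARN killed or lost the
>    * container (as opposed to the application itself failing).
>    */
>   public static boolean isYarnInitiated(int exitStatus) {
>     return exitStatus == ContainerExitStatus.ABORTED        // -100: released / lost node
>         || exitStatus == ContainerExitStatus.DISKS_FAILED   // -101: bad local disks
>         || exitStatus == ContainerExitStatus.PREEMPTED;     // -102: preempted by the RM
>   }
> }
> {code}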
> 3. Samza makes a request to Yarn for restarting container 13 on a different 
> host.
> {code}
> 2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] 
> YarnClusterResourceManager [INFO] Requesting resources on  ANY_HOST for 
> container 13
> 2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] 
> YarnClusterResourceManager [INFO] Making a request for ANY_HOST ANY_HOST
> 2017-06-09 06:14:42.400 [AMRM Heartbeater thread] AMRMClientImpl [INFO] 
> Received new token for : hostname_masked19939masked_host
> 2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] 
> YarnClusterResourceManager [INFO] Container allocated from RM on 
> hostname_masked19939masked_host
> 2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] 
> ContainerProcessManager [INFO] Container allocated from RM on 
> hostname_masked19939masked_host
> 2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ResourceRequestState 
> [INFO] Host affinity not enabled. Saving the samzaResource 
> container_1490916233553_1471_01_000027 in the buffer for ANY_HOST
> 2017-06-09 06:14:44.310 [Container Allocator Thread] 
> YarnClusterResourceManager [INFO] Cancelling request 
> SamzaResourceRequest{numCores=1, memoryMB=4000, preferredHost='ANY_HOST', 
> requestID='f74a8fcf-be17-42eb-8824-1c32b4504933', containerID=13, 
> requestTimestampMs=1496988880401} 
> 2017-06-09 06:14:44.310 [Container Allocator Thread] 
> AbstractContainerAllocator [INFO] Found available resources on ANY_HOST. 
> Assigning request for container_id 13 with timestamp 1496988880401 to 
> resource container_1490916233553_1471_01_000027
> 2017-06-09 06:14:44.310 [Container Allocator Thread] 
> YarnClusterResourceManager [INFO] Received launch request for 13 on hostname 
> hostname_masked19939masked_host
> 2017-06-09 06:14:44.310 [Container Allocator Thread] YarnContainerRunner 
> [INFO] Got available container ID (13) for container: Container: 
> [ContainerId: container_1490916233553_1471_01_000027, NodeId: 
> hostname_masked19939masked_host:1158, NodeHttpAddress: 
> hostname_masked19939masked_host:8042, Resource: <memory:4096, vCores:1>, 
> Priority: 0, Token: Token { kind: ContainerToken, service:  }, ]
> {code}
> 4. Container 13 is re-started and runs on a different host.
> {code}
> 2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner 
> [INFO] In runContainer in util: fwkPath= ;cmdPath=./__package/;jobLib=
> 2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner 
> [INFO] Container ID 13 using command ./__package//bin/run-container.sh
> 2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner 
> [INFO] Container ID 13 using environment variables: 
> hostname_masked1348masked_host
> hostname_masked1355masked_host
> hostname_masked20440masked_host
> {code}
> 5. A cluster-wide upgrade happens and the machines in the cluster have to be 
> bounced. We typically do a rolling bounce of the RMs to give the passive RM 
> an opportunity to take ownership. RM fail-over happens and state is re-synced 
> from the passive to the active RM. Samza's AppMaster (which had a connection 
> to the previously active RM) re-syncs its state from the other RM.
> {code}
> 2017-06-14 23:18:20.281 [AMRM Heartbeater thread] RetryInvocationHandler 
> [INFO] Exception while invoking allocate of class 
> ApplicationMasterProtocolPBClientImpl over yarn-prod-ltx1-samza-rm-2. Trying 
> to fail over immediately.
> java.net.ConnectException: Call From hostname_masked19958masked_host/ to 
> hostname_masked1348masked_host:8030 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
> 2017-06-14 23:19:39.203 [AMRM Heartbeater thread] AMRMClientImpl [WARN] 
> ApplicationMaster is out of sync with ResourceManager, hence resyncing.
> {code}
> 6. A phantom notification (duplicate) is received for the same container 
> (Note the same container ID: container_1490916233553_1471_01_000016). This 
> notification declares that the container exited with code 154.
> {code}
> Get a Container killed by RM error with non zero exit code 154
> 2017-06-14 23:19:39.211 [AMRM Callback Handler Thread] 
> YarnClusterResourceManager [INFO] Container completed from RM 
> ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: 
> COMPLETE, Diagnostics: Container Killed by ResourceManager
> Container exited with a non-zero exit code 154
> , ExitStatus: 154, ]
> 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] 
> YarnClusterResourceManager [INFO] Completed container had ID: -1
> 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] 
> ContainerProcessManager [INFO] No matching container id found for 
> SamzaResourceStatus{resourceID='container_1490916233553_1471_01_000016', 
> diagnostics='Container Killed by ResourceManager
> Container exited with a non-zero exit code 154
> ', exitCode=154}
> 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] 
> ContainerProcessManager [INFO] Container failed for some reason. Let's start 
> it again
> 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] 
> ContainerProcessManager [INFO] Container 
> container_1490916233553_1471_01_000016 failed with exit code . 154 - 
> Container Killed by ResourceManager
> Container exited with a non-zero exit code 154
>  containerID is -1
> {code}
> 7. We take action based on this phantom notification (resetting job health 
> and other metrics). This causes automated false alerts and remediation to 
> kick in. All jobs in the cluster then have to be restarted.
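> A minimal sketch (not the actual Samza implementation) of the kind of guard 
> the ContainerProcessManager could apply: only act on completion callbacks for 
> containers it is actually tracking, and drop everything else as a phantom or 
> duplicate notification. The class and field names below are hypothetical.
> {code}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> 
> // Illustrative sketch only -- not the actual Samza ContainerProcessManager.
> public class PhantomAwareCompletionHandler {
>   // Hypothetical map: YARN container ID -> Samza logical container ID,
>   // populated when a container is launched, removed when it completes.
>   private final Map<String, String> runningYarnContainers = new ConcurrentHashMap<>();
> 
>   public void onContainerLaunched(String yarnContainerId, String samzaContainerId) {
>     runningYarnContainers.put(yarnContainerId, samzaContainerId);
>   }
> 
>   public void onContainerCompleted(String yarnContainerId, int exitStatus) {
>     String samzaContainerId = runningYarnContainers.remove(yarnContainerId);
>     if (samzaContainerId == null) {
>       // Phantom or duplicate notification (e.g. replayed after an RM fail-over):
>       // do not bump failed-container counts, flip job health, or request a
>       // replacement for a container we are not managing.
>       System.out.println("Ignoring completion for untracked container " + yarnContainerId);
>       return;
>     }
>     // Normal path: record the failure and request a replacement container here.
>     System.out.println("Container " + samzaContainerId + " exited with status " + exitStatus);
>   }
> }
> {code}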
> *What scenarios on the Node Manager side trigger the phantom notification?*
> We have observed that if a NodeManager is re-started after a failure, NM 
> recovery kicks in and tries to take ownership of the containers it was 
> managing in its previous run. In some cases, a container may have been killed 
> or have exited before the NodeManager comes back up, which results in a 
> timeout in the NM recovery process.
> Here are the relevant logs:
> a. The NM is killed due to errors and automatically restarted. It was down 
> from 06:01 to 06:17.
> {code}
> 06:17:29,470 [INFO] [main] util.LogAdapter.info(LogAdapter.java:45) - 
> STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting NodeManager
> STARTUP_MSG:   host = hostname_masked19956masked_host
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 2.7.3
> {code}
> b. NM recovery kicks in and tries to re-acquire its previous containers.
> {code}
> 2017-06-09 06:17:32,899 [INFO] [main] 
> containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:307)
>  - Recovering container_1490916233553_1471_01_000016 in state LAUNCHED with 
> exit code -1000
> 2017-06-09 06:17:33,115 [INFO] [main] 
> containermanager.ContainerManagerImpl.waitForRecoveredContainers
> 2017-06-09 06:17:33,274 [INFO] [AsyncDispatcher event handler] 
> application.ApplicationImpl$InitContainerTransition.transition(ApplicationImpl.java:304)
>  - Adding container_1490916233553_1471_01_000016 to application 
> application_1490916233553_1471
> 2017-06-09 06:17:33,387 [INFO] [AsyncDispatcher event handler] 
> container.ContainerImpl.handle(ContainerImpl.java:1135) - Container 
> container_1490916233553_1471_01_000016 transitioned from LOCALIZING to 
> LOCALIZED
> 2017-06-09 06:17:33,427 [INFO] [ContainersLauncher #39] 
> nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:181) 
> - Reacquiring container_1490916233553_1471_01_000016 with pid 23848
> reqcquireing pid
> 2017-06-09 06:17:33,449 [INFO] [AsyncDispatcher event handler] 
> container.ContainerImpl.handle(ContainerImpl.java:1135) - Container 
> container_1490916233553_1471_01_000016 transitioned from LOCALIZED to RUNNING
> 2017-06-09 06:17:35,094 Sending diagnostics
> [container_1490916233553_1471_01_000016, CreateTime: 1496864777558, State: 
> RUNNING, Capability: <memory:4096, vCores:1>, Diagnostics: , ExitStatus: 
> -1000, Priority: 0]
> {code}
> c. Recovery times out because the container had already exited (due to an 
> unrelated error).
> {code}
> 2017-06-09 06:17:35,592 [WARN] [ContainersLauncher #39] 
> util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:324)
>  - Failed to read cgroup tasks file.
> java.io.FileNotFoundException: 
> /cgroup/cpu/restrain.slice/container_1490916233553_1471_01_000016/tasks (No 
> such file or directory)
>         at java.io.FileInputStream.open0(Native Method)
>         at java.io.FileInputStream.open(FileInputStream.java:195)
>         at java.io.FileInputStream.<init>(FileInputStream.java:138)
>         at java.io.FileInputStream.<init>(FileInputStream.java:93)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:308)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.deleteCgroup(CgroupsLCEResourcesHandler.java:339)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.clearLimits(CgroupsLCEResourcesHandler.java:388)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.postExecute(CgroupsLCEResourcesHandler.java:402)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:354)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-06-09 06:17:36,608 [ERROR] [ContainersLauncher #39] 
> launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:88) - 
> Unable to recover container container_1490916233553_1471_01_000016
> java.io.IOException: Timeout while waiting for exit code from 
> container_1490916233553_1471_01_000016
>         at 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:202)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:352)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-06-09 06:17:37,223 [INFO] [AsyncDispatcher event handler] 
> launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:370) - 
> Cleaning up container container_1490916233553_1471_01_000016
> 2017-06-09 06:17:37,465 [INFO] [AsyncDispatcher event handler] 
> container.ContainerImpl.handle(ContainerImpl.java:1135) - Container 
> container_1490916233553_1471_01_000016 transitioned from KILLING to 
> EXITED_WITH_FAILURE
> 2017-06-09 06:17:37,493 [INFO] [AsyncDispatcher event handler] 
> container.ContainerImpl.handle(ContainerImpl.java:1135) - Container 
> container_1477351569246_0429_02_000129 transitioned from EXITED_WITH_FAILURE 
> to DONE
> 2017-06-09 06:17:37,493 [WARN] [AsyncDispatcher event handler] 
> nodemanager.NMAuditLogger.logFailure(NMAuditLogger.java:150) - 
> USER=samza-trust-repartition      OPERATION=Container Finished - Failed   
> TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with 
> state: EXITED_WITH_FAILURE    APPID=application_1490916233553_1471    
> CONTAINERID=container_1490916233553_1471_01_000016
> {code}
> d. When an RM fail-over happens, state is re-synced from the passive to the 
> active RM, triggering the phantom notification for this container.
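> A complementary guard (again only a sketch, with hypothetical names) is to 
> remember the YARN container IDs that have already been processed as complete, 
> so that an exact duplicate replayed during an AM/RM resync is dropped even if 
> it arrives days later:
> {code}
> import java.util.Collections;
> import java.util.LinkedHashMap;
> import java.util.Map;
> import java.util.Set;
> 
> // Illustrative sketch only: a bounded set of recently-completed YARN container
> // IDs, used to drop duplicate completion notifications after an RM fail-over.
> public class CompletedContainerDeduper {
>   private static final int MAX_REMEMBERED = 1000; // illustrative bound
> 
>   private final Set<String> recentlyCompleted = Collections.newSetFromMap(
>       new LinkedHashMap<String, Boolean>() {
>         @Override
>         protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
>           return size() > MAX_REMEMBERED; // evict the oldest entries
>         }
>       });
> 
>   /**
>    * Returns true the first time a container ID is reported complete, and
>    * false for any later (duplicate / phantom) report of the same ID.
>    */
>   public synchronized boolean markCompleted(String yarnContainerId) {
>     return recentlyCompleted.add(yarnContainerId);
>   }
> }
> {code}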



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
