[ 
https://issues.apache.org/jira/browse/SAMZA-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagadish updated SAMZA-1359:
----------------------------
    Description: 
At LinkedIn, we have observed phantom notifications for containers that are not 
actively managed by Samza. We should improve our AppMaster's logic to be 
resilient to these phantom notifications. Currently, we act on them - for 
example, we increment the number of failed containers, mark the job health as 
false, etc. These actions fire false alerts and trigger automated remediation 
based on those alerts. This has affected our ability to upgrade the Samza 
cluster, because several jobs have to be restarted owing to these phantom 
notifications. 
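
To make this concrete, here is a minimal sketch of the kind of guard the 
AppMaster could apply before acting on a completion callback. This is an 
illustrative assumption, not Samza's actual ContainerProcessManager code; the 
class, field, and method names below are hypothetical.
{code}
// Hypothetical sketch: ignore completion notifications for containers that
// this AppMaster is not actively tracking, instead of updating failure
// counts and job-health state for them.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PhantomAwareCompletionHandler {
  // YARN container ID -> Samza container ID, for resources we launched and
  // still consider active (illustrative bookkeeping, not Samza's own).
  private final Map<String, String> activeResources = new ConcurrentHashMap<>();

  void onResourceCompleted(String yarnContainerId, int exitCode) {
    String samzaContainerId = activeResources.remove(yarnContainerId);
    if (samzaContainerId == null) {
      // Phantom notification: no matching Samza container. Log and return
      // without incrementing failed-container metrics or resetting job health.
      System.out.println("Ignoring completion for untracked container "
          + yarnContainerId + " (exit code " + exitCode + ")");
      return;
    }
    // ... existing handling: update metrics, request a replacement, etc.
  }
}
{code}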

Here is the investigation from an upgrade and a rolling bounce of the Samza 
ResourceManagers.
1. Samza starts container 13 on a certain host for this job.
{code}
2017-06-07 19:46:20.226 [Container Allocator Thread] YarnContainerRunner [INFO] 
Claimed container ID 13 for container container_1490916233553_1471_01_000016 on 
node hostname_masked19956masked_host 
(http://hostname_masked19956masked_host:8042/node/containerlogs/container_1490916233553_1471_01_000016).
{code}

2. Yarn reports that container 13 is lost due to a failure of its NodeManager 
on the host at which the container was running.
{code}
2017-06-09 06:14:40.398 [AMRM Callback Handler Thread] 
YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: 
[ContainerId: container_1490916233553_1471_01_000016, State: COMPLETE, 
Diagnostics: Container released on a *lost* node, ExitStatus: -100, ]
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] 
YarnClusterResourceManager [INFO] Completed container had ID: 13
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] 
YarnClusterResourceManager [INFO] Removing container ID 13 from completed 
containers
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager 
[INFO] Matching container ID found 13 
org.apache.samza.clustermanager.SamzaResource@e4e2d63a
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager 
[INFO] Got an exit code of -100. This means that container 
container_1490916233553_1471_01_000016 was killed by YARN, either due to being 
released by the application master or being 'lost' due to node failures etc. or 
due to preemption by the RM
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager 
[INFO] Released container container_1490916233553_1471_01_000016 was assigned 
task group ID 13. Requesting a refactor container for the task group.
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] SamzaResourceRequest 
[INFO] Resource Request created for 13 on ANY_HOST at 1496988880401
{code}

3. Samza makes a request to Yarn for restarting container 13 on a different 
host.
{code}
2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] 
YarnClusterResourceManager [INFO] Requesting resources on  ANY_HOST for 
container 13
2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] 
YarnClusterResourceManager [INFO] Making a request for ANY_HOST ANY_HOST
2017-06-09 06:14:42.400 [AMRM Heartbeater thread] AMRMClientImpl [INFO] 
Received new token for : hostname_masked19939masked_host:1158
2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] 
YarnClusterResourceManager [INFO] Container allocated from RM on 
hostname_masked19939masked_host
2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ContainerProcessManager 
[INFO] Container allocated from RM on hostname_masked19939masked_host
2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ResourceRequestState 
[INFO] Host affinity not enabled. Saving the samzaResource 
container_1490916233553_1471_01_000027 in the buffer for ANY_HOST
2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager 
[INFO] Cancelling request SamzaResourceRequest{numCores=1, memoryMB=4000, 
preferredHost='ANY_HOST', requestID='f74a8fcf-be17-42eb-8824-1c32b4504933', 
containerID=13, requestTimestampMs=1496988880401} 
2017-06-09 06:14:44.310 [Container Allocator Thread] AbstractContainerAllocator 
[INFO] Found available resources on ANY_HOST. Assigning request for 
container_id 13 with timestamp 1496988880401 to resource 
container_1490916233553_1471_01_000027
2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager 
[INFO] Received launch request for 13 on hostname 
hostname_masked19939masked_host
2017-06-09 06:14:44.310 [Container Allocator Thread] YarnContainerRunner [INFO] 
Got available container ID (13) for container: Container: [ContainerId: 
container_1490916233553_1471_01_000027, NodeId: 
hostname_masked19939masked_host:1158, NodeHttpAddress: 
hostname_masked19939masked_host:8042, Resource: <memory:4096, vCores:1>, 
Priority: 0, Token: Token { kind: ContainerToken, service: 10.149.196.44:1158 
}, ]
{code}


4. Container 13 is restarted and runs on a different host.
{code}
2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] 
In runContainer in util: fwkPath= ;cmdPath=./__package/;jobLib=
2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] 
Container ID 13 using command ./__package//bin/run-container.sh
2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] 
Container ID 13 using environment variables: 

hostname_masked1348masked_host
hostname_masked1355masked_host
hostname_masked20440masked_host
{code}


5. A cluster-wide upgrade happens and the machines in the cluster have to be 
bounced. We typically do a rolling bounce of the RMs to give the passive RM an 
opportunity to take ownership. An RM failover happens and state is re-synced 
from the passive to the active RM. Samza's AppMaster (which had a previous 
connection to the old active RM) re-syncs its state from the other RM.
{code}
2017-06-14 23:18:20.281 [AMRM Heartbeater thread] RetryInvocationHandler [INFO] 
Exception while invoking allocate of class 
ApplicationMasterProtocolPBClientImpl over yarn-prod-ltx1-samza-rm-2. Trying to 
fail over immediately.
java.net.ConnectException: Call From 
hostname_masked19958masked_host/10.149.196.108 to 
hostname_masked1348masked_host:8030 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused
2017-06-14 23:19:39.203 [AMRM Heartbeater thread] AMRMClientImpl [WARN] 
ApplicationMaster is out of sync with ResourceManager, hence resyncing.
{code}

6. A phantom (duplicate) notification is received for the same container (note 
the same container ID: container_1490916233553_1471_01_000016). This 
notification declares that the container exited with code 154.
{code}
Get a Container killed by RM error with non zero exit code 154

2017-06-14 23:19:39.211 [AMRM Callback Handler Thread] 
YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: 
[ContainerId: container_1490916233553_1471_01_000016, State: COMPLETE, 
Diagnostics: Container Killed by ResourceManager
Container exited with a non-zero exit code 154
, ExitStatus: 154, ]
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] 
YarnClusterResourceManager [INFO] Completed container had ID: -1
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager 
[INFO] No matching container id found for 
SamzaResourceStatus{resourceID='container_1490916233553_1471_01_000016', 
diagnostics='Container Killed by ResourceManager
Container exited with a non-zero exit code 154
', exitCode=154}
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager 
[INFO] Container failed for some reason. Let's start it again
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager 
[INFO] Container container_1490916233553_1471_01_000016 failed with exit code . 
154 - Container Killed by ResourceManager
Container exited with a non-zero exit code 154
 containerID is -1
{code}

7. We act on this phantom notification (resetting job health and other 
metrics). This fires automated false alerts and causes remediation to kick in; 
all jobs in the cluster then have to be restarted.
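
One possible way to make the callback resilient to such replayed statuses is to 
remember which YARN container IDs have already been handled and drop 
duplicates. The sketch below is an assumption about how that could look, not 
Samza's implementation; the class and method names are hypothetical.
{code}
// Hypothetical sketch: deduplicate completed-container notifications so that
// a status replayed after an RM fail-over/resync is not acted on twice.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class CompletedContainerDeduper {
  private final Set<String> handledCompletions = ConcurrentHashMap.newKeySet();

  /** Returns true only the first time a given container's completion is seen. */
  boolean shouldHandle(String yarnContainerId) {
    // Set.add() returns false when the ID was already present, i.e. a duplicate.
    return handledCompletions.add(yarnContainerId);
  }
}
{code}
With such bookkeeping, the second COMPLETE status for 
container_1490916233553_1471_01_000016 in step 6 would be recognized as already 
handled in step 2 and skipped, instead of resetting job health.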

*What scenarios on the Node Manager side trigger the phantom notification?*

We have observed that if a NodeManager is restarted after a failure, NM 
recovery kicks in and tries to take ownership of the containers it was managing 
in its previous run. In some cases, the container may have been killed or may 
have exited before the NodeManager comes back up, which results in a timeout in 
the NM recovery process.

Here are the relevant logs:
a. The NM is killed due to some errors and automatically restarted. It was down 
from 06:01 to 06:17.

{code}
06:17:29,470 [INFO] [main] util.LogAdapter.info(LogAdapter.java:45) - 
STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NodeManager
STARTUP_MSG:   host = hostname_masked19956masked_host/10.149.196.106
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.7.3
{code}


b. NM recovery kicks in and tries to reacquire the containers from its previous 
run
{code}

2017-06-09 06:17:32,899 [INFO] [main] 
containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:307)
 - Recovering container_1490916233553_1471_01_000016 in state LAUNCHED with 
exit code -1000
2017-06-09 06:17:33,115 [INFO] [main] 
containermanager.ContainerManagerImpl.waitForRecoveredContainers
2017-06-09 06:17:33,274 [INFO] [AsyncDispatcher event handler] 
application.ApplicationImpl$InitContainerTransition.transition(ApplicationImpl.java:304)
 - Adding container_1490916233553_1471_01_000016 to application 
application_1490916233553_1471
2017-06-09 06:17:33,387 [INFO] [AsyncDispatcher event handler] 
container.ContainerImpl.handle(ContainerImpl.java:1135) - Container 
container_1490916233553_1471_01_000016 transitioned from LOCALIZING to LOCALIZED
2017-06-09 06:17:33,427 [INFO] [ContainersLauncher #39] 
nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:181) - 
Reacquiring container_1490916233553_1471_01_000016 with pid 23848
reqcquireing pid
2017-06-09 06:17:33,449 [INFO] [AsyncDispatcher event handler] 
container.ContainerImpl.handle(ContainerImpl.java:1135) - Container 
container_1490916233553_1471_01_000016 transitioned from LOCALIZED to RUNNING
2017-06-09 06:17:35,094 Sending diagnostics
[container_1490916233553_1471_01_000016, CreateTime: 1496864777558, State: 
RUNNING, Capability: <memory:4096, vCores:1>, Diagnostics: , ExitStatus: -1000, 
Priority: 0]
{code}

c. Recovery times out because the container already exited (due to an unrelated 
error)

{code}
2017-06-09 06:17:35,592 [WARN] [ContainersLauncher #39] 
util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:324)
 - Failed to read cgroup tasks file.
java.io.FileNotFoundException: 
/cgroup/cpu/restrain.slice/container_1490916233553_1471_01_000016/tasks (No 
such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at 
org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:308)
        at 
org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.deleteCgroup(CgroupsLCEResourcesHandler.java:339)
        at 
org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.clearLimits(CgroupsLCEResourcesHandler.java:388)
        at 
org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.postExecute(CgroupsLCEResourcesHandler.java:402)
        at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:354)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2017-06-09 06:17:36,608 [ERROR] [ContainersLauncher #39] 
launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:88) - 
Unable to recover container container_1490916233553_1471_01_000016
java.io.IOException: Timeout while waiting for exit code from 
container_1490916233553_1471_01_000016
        at 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:202)
        at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:352)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2017-06-09 06:17:37,223 [INFO] [AsyncDispatcher event handler] 
launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:370) - Cleaning 
up container container_1490916233553_1471_01_000016
2017-06-09 06:17:37,465 [INFO] [AsyncDispatcher event handler] 
container.ContainerImpl.handle(ContainerImpl.java:1135) - Container 
container_1490916233553_1471_01_000016 transitioned from KILLING to 
EXITED_WITH_FAILURE

2017-06-09 06:17:37,493 [INFO] [AsyncDispatcher event handler] 
container.ContainerImpl.handle(ContainerImpl.java:1135) - Container 
container_1477351569246_0429_02_000129 transitioned from EXITED_WITH_FAILURE to 
DONE
2017-06-09 06:17:37,493 [WARN] [AsyncDispatcher event handler] 
nodemanager.NMAuditLogger.logFailure(NMAuditLogger.java:150) - 
USER=samza-trust-repartition      OPERATION=Container Finished - Failed   
TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with 
state: EXITED_WITH_FAILURE    APPID=application_1490916233553_1471    
CONTAINERID=container_1490916233553_1471_01_000016
{code}

d. An RM failover then happens, state is re-synced from the passive to the 
active RM, and the replayed container status triggers the phantom notification 
described above.



> Handle phantom container notifications cleanly during RM fail-over
> ------------------------------------------------------------------
>
>                 Key: SAMZA-1359
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1359
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jagadish
>            Assignee: Jagadish
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
