Aleksandr Balitsky created YARN-6019:
----------------------------------------

             Summary: MR application fails with "No NMToken sent" exception after MRAppMaster recovery
                 Key: YARN-6019
                 URL: https://issues.apache.org/jira/browse/YARN-6019
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
    Affects Versions: 2.7.0
         Environment: Centos 7
            Reporter: Aleksandr Balitsky
            Priority: Critical


*Steps to reproduce:*
1) Submit an MR application (for example, the PI app with 50 containers)
2) Find the process ID of the application's MRAppMaster
3) Kill the MRAppMaster process with kill -9

*Expected:* The ResourceManager launches a new MRAppMaster container and 
MRAppAttempt, and the application finishes correctly.

*Actual:* After the new MRAppMaster and MRAppAttempt are launched, the 
application fails with the following exception:

{noformat}
2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1482408247195_0002_02_000011 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for node1:43037
        at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
        at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:244)
        at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
        at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
        at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
        at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

{noformat}

*Problem*:
When RMCommunicator sends the "registerApplicationMaster" request to the RM, 
the RM generates NMTokens for the new RMAppAttempt. These NMTokens are returned 
to RMCommunicator in RegisterApplicationMasterResponse (via the 
getNMTokensFromPreviousAttempts method), but the RMCommunicator.register method 
does not handle them. The RM does not send these tokens again for subsequent 
allocate requests, so they never reach NMTokenCache. As a result, container 
launch fails with the "No NMToken sent for node" exception.
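
The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: NodeToken, TokenCache, and the handleTokens flag are hypothetical stand-ins invented for illustration, not the Hadoop NMToken, NMTokenCache, or RMCommunicator classes, and the sketch does not claim to be the actual fix. It only shows that tokens delivered once at registration must be copied into the cache, otherwise a later lookup for that node fails.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in model; none of these types are the real Hadoop classes.
final class NMTokenRecoverySketch {

    /** Hypothetical stand-in for an NMToken: a node address plus an opaque secret. */
    static final class NodeToken {
        final String nodeAddr;
        final String secret;

        NodeToken(String nodeAddr, String secret) {
            this.nodeAddr = nodeAddr;
            this.secret = secret;
        }
    }

    /** Hypothetical stand-in for NMTokenCache: maps node address to token. */
    static final class TokenCache {
        private final Map<String, String> tokens = new HashMap<>();

        void put(String nodeAddr, String secret) {
            tokens.put(nodeAddr, secret);
        }

        /** Returns the token for the node, or fails the way the report describes. */
        String require(String nodeAddr) {
            String secret = tokens.get(nodeAddr);
            if (secret == null) {
                throw new IllegalStateException("No NMToken sent for " + nodeAddr);
            }
            return secret;
        }
    }

    /**
     * Models the register step. When handleTokens is false, the tokens carried by
     * the registration response are dropped, which is the behaviour reported here.
     */
    static void register(List<NodeToken> tokensFromPreviousAttempts,
                         TokenCache cache, boolean handleTokens) {
        if (!handleTokens) {
            return; // tokens are never stored, and they are not sent again
        }
        for (NodeToken token : tokensFromPreviousAttempts) {
            cache.put(token.nodeAddr, token.secret);
        }
    }

    public static void main(String[] args) {
        List<NodeToken> fromResponse =
            Collections.singletonList(new NodeToken("node1:43037", "opaque-token"));

        TokenCache handled = new TokenCache();
        register(fromResponse, handled, true);
        System.out.println("handled: " + handled.require("node1:43037"));

        TokenCache ignored = new TokenCache();
        register(fromResponse, ignored, false);
        try {
            ignored.require("node1:43037");
        } catch (IllegalStateException e) {
            System.out.println("ignored: " + e.getMessage());
        }
    }
}
```

In this model the only difference between the two runs is whether the register step stores the tokens it was given, which mirrors the gap in RMCommunicator.register described above.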

I found that this issue appeared after the changes in commit 
https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed

I ran the same scenario without that commit, and the application completed 
successfully after MRAppMaster recovery.



