[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15894205#comment-15894205
 ] 

Aleksandr Balitsky edited comment on MAPREDUCE-6834 at 3/3/17 11:59 AM:
------------------------------------------------------------------------

After deeper investigation I have found the root cause:
https://github.com/apache/hadoop/blob/branch-2.7.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java#L299-L327

The code linked above should be executed only when "work-preserving AM 
restart" is enabled. The Resource Manager saves the new NMTokens in 
NMTokenSecretManagerInRM and sends these tokens only once, in the AM 
registration response. MR's AM doesn't handle those tokens, because it doesn't 
support work-preserving AM restart, and the RM does not send them again in 
response to later allocation requests.

When I run an MR job (or any other framework that doesn't support 
work-preserving AM restart), the RM shouldn't retrieve the previous attempt's 
containers and the corresponding NM tokens. 
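
To make the sequence above concrete, here is a minimal, self-contained model 
of it. This is not Hadoop code: TokenIssuer, AmTokenCache and 
launchAfterRestart are hypothetical stand-ins for the RM-side token 
bookkeeping and the AM-side token cache, and tokens are plain strings. It only 
assumes what is described above: the RM hands out a token for a node once per 
attempt, and the AM needs a cached token to launch a container on that node.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NmTokenRestartModel {

    /** Hypothetical stand-in for the RM-side record of tokens already sent. */
    static class TokenIssuer {
        private final Set<String> alreadySent = new HashSet<>();

        /** Returns a token the first time a node is seen, null afterwards. */
        String issueIfNew(String node) {
            return alreadySent.add(node) ? "token-for-" + node : null;
        }
    }

    /** Hypothetical stand-in for the AM-side token cache. */
    static class AmTokenCache {
        private final Map<String, String> tokens = new HashMap<>();

        void put(String node, String token) {
            if (token != null) {
                tokens.put(node, token);
            }
        }

        String require(String node) {
            String token = tokens.get(node);
            if (token == null) {
                throw new IllegalStateException("No NMToken sent for " + node);
            }
            return token;
        }
    }

    /**
     * Simulates register, allocate and launch for a restarted AM attempt.
     * Returns true if the launch step finds a token for the node.
     */
    static boolean launchAfterRestart(boolean rmTransfersPreviousTokens,
                                      boolean amStoresRegisterTokens) {
        String node = "node1:43037";
        TokenIssuer rm = new TokenIssuer();
        AmTokenCache cache = new AmTokenCache();

        // Register: the RM may issue a token for a previous attempt's node.
        String registerToken = rmTransfersPreviousTokens ? rm.issueIfNew(node) : null;
        if (amStoresRegisterTokens) {
            cache.put(node, registerToken);
        }

        // Allocate: a new container lands on the same node. The RM only
        // includes a token if it has not already sent one for this node.
        cache.put(node, rm.issueIfNew(node));

        // Launch: fails if no token for the node ever reached the cache.
        try {
            cache.require(node);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("RM transfers, AM ignores: " + launchAfterRestart(true, false));
        System.out.println("RM transfers, AM stores:  " + launchAfterRestart(true, true));
        System.out.println("RM does not transfer:     " + launchAfterRestart(false, false));
    }
}
```

In this model only the first case fails. The launch succeeds either when the 
AM stores the tokens from the registration response or when the RM does not 
transfer the previous attempt's tokens at all.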

As far as I can see, this problem has already been fixed as part of YARN-3136, 
so 2.8.0 and later versions don't have this issue. 

[~haibochen], [~jlowe], thank you guys for the help and review. 


was (Author: abalitsky1):
After deeper investigation I have found the root cause:
https://github.com/apache/hadoop/blob/branch-2.7.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java#L299-L327

The code linked above should be executed only when "work-preserving AM 
restart" is enabled. When I run an MR job (or any other framework that doesn't 
support work-preserving AM restart), the RM shouldn't retrieve the previous 
attempt's containers and the corresponding NM tokens. 

The Resource Manager saves the new NMTokens in NMTokenSecretManagerInRM and 
sends these tokens only once, in the AM registration response. MR's AM doesn't 
handle those tokens, because it doesn't support work-preserving AM restart, and 
the RM does not send them again in response to later allocation requests.

As far as I can see, this problem has already been fixed as part of YARN-3136, 
so 2.8.0 and later versions don't have this issue. 

[~haibochen], [~jlowe], thank you guys for the help and review. 

> MR application fails with "No NMToken sent" exception after MRAppMaster 
> recovery
> --------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6834
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: resourcemanager, yarn
>    Affects Versions: 2.7.0
>         Environment: Centos 7
>            Reporter: Aleksandr Balitsky
>            Assignee: Aleksandr Balitsky
>            Priority: Critical
>         Attachments: YARN-6019.001.patch
>
>
> *Steps to reproduce:*
> 1) Submit an MR application (for example, the PI app with 50 containers)
> 2) Find the MRAppMaster process ID for the application 
> 3) Kill the MRAppMaster with kill -9
> *Expected:* The ResourceManager launches a new MRAppMaster container and 
> MRAppAttempt, and the application finishes correctly
> *Actual:* After the new MRAppMaster and MRAppAttempt are launched, the 
> application fails with the following exception:
> {noformat}
> 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1482408247195_0002_02_000011 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for node1:43037
>       at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
>       at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:244)
>       at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
>       at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
>       at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
>       at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> {noformat}
> *Problem*:
> When RMCommunicator sends a "registerApplicationMaster" request to the RM, the 
> RM generates NMTokens for the new RMAppAttempt. Those new NMTokens are 
> transmitted to RMCommunicator in RegisterApplicationMasterResponse 
> (getNMTokensFromPreviousAttempts method), but we don't handle these tokens in 
> the RMCommunicator.register method. The RM doesn't transmit these tokens again 
> for later allocation requests, so they never reach NMTokenCache, and we get 
> the "No NMToken sent for node" exception.
> I have found that this issue appears after the changes from 
> https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed
>  
> I tried the same scenario without that commit, and the application completed 
> successfully after MRAppMaster recovery.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

