[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891327#comment-15891327
 ] 

Haibo Chen edited comment on MAPREDUCE-6834 at 3/1/17 11:52 PM:
----------------------------------------------------------------

Thanks for the clarification, [~jlowe]. We have not made changes to preserve 
containers in MR. Chasing the code in more detail, I came to a similar 
conclusion as 
https://issues.apache.org/jira/browse/YARN-3112?focusedCommentId=14299003&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14299003
   MR relies on the YARN RM to get the NMTokens needed to launch containers with 
NMs. Given the code today, it is possible that a null NMToken is sent to MR, 
which contradicts the javadoc in SchedulerApplicationAttempt.java here
{code:java}
  // Create container token and NMToken altogether, if either of them fails for
  // some reason like DNS unavailable, do not return this container and keep it
  // in the newlyAllocatedContainers waiting to be refetched.
  public synchronized ContainersAndNMTokensAllocation {...}
{code}
I believe this is a duplicate of YARN-3112, so I am going to close this jira as 
a duplicate. Feel free to reopen it if you disagree.
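
The contract stated in the quoted comment can be sketched as follows. This is a minimal illustration, not the code in SchedulerApplicationAttempt.java: the class AllocationContractSketch, its methods, and the use of plain strings for containers and tokens are all hypothetical, and a token factory returning null is an assumed way of modelling "fails for some reason".

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for illustration only; NOT the Hadoop scheduler class.
final class AllocationContractSketch {

  /** A container that is returned only together with both of its tokens. */
  static final class Allocation {
    final String containerId;
    final String containerToken;
    final String nmToken;

    Allocation(String containerId, String containerToken, String nmToken) {
      this.containerId = containerId;
      this.containerToken = containerToken;
      this.nmToken = nmToken;
    }
  }

  // Containers allocated by the scheduler but not yet handed to the AM.
  private final List<String> newlyAllocatedContainers = new ArrayList<>();

  synchronized void add(String containerId) {
    newlyAllocatedContainers.add(containerId);
  }

  synchronized int pending() {
    return newlyAllocatedContainers.size();
  }

  /**
   * Returns only containers for which both tokens could be created.
   * Assumption: a factory signals failure by returning null.
   */
  synchronized List<Allocation> pull(Function<String, String> containerTokenFactory,
                                     Function<String, String> nmTokenFactory) {
    List<Allocation> ready = new ArrayList<>();
    Iterator<String> it = newlyAllocatedContainers.iterator();
    while (it.hasNext()) {
      String id = it.next();
      String containerToken = containerTokenFactory.apply(id);
      String nmToken = nmTokenFactory.apply(id);
      if (containerToken == null || nmToken == null) {
        continue; // keep the container queued so it can be refetched later
      }
      it.remove();
      ready.add(new Allocation(id, containerToken, nmToken));
    }
    return ready;
  }

  public static void main(String[] args) {
    AllocationContractSketch attempt = new AllocationContractSketch();
    attempt.add("c1");
    attempt.add("c2");
    // NMToken creation fails for c2 on the first pull, so c2 stays queued.
    List<Allocation> first =
        attempt.pull(id -> "ct-" + id, id -> id.equals("c2") ? null : "nm-" + id);
    System.out.println("returned=" + first.size() + " pending=" + attempt.pending());
  }
}
```

Under this reading, a container never reaches the AM with a null NMToken, which is the behaviour the comment describes and the reported failure appears to violate.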



was (Author: haibochen):
Thanks for the clarification, [~jlowe]. We have not made changes to preserve 
containers in MR. Chasing the code in more detail, I came to a similar 
conclusion as 
https://issues.apache.org/jira/browse/YARN-3112?focusedCommentId=14299003&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14299003
   MR relies on the YARN RM to get the NMTokens needed to launch containers with 
NMs. Given the code today, it is possible that a null NMToken is sent to MR, 
which contradicts the javadoc here
bq.
  // Create container token and NMToken altogether, if either of them fails for
  // some reason like DNS unavailable, do not return this container and keep it
  // in the newlyAllocatedContainers waiting to be refetched.
  public synchronized ContainersAndNMTokensAllocation {...}

I believe this is a duplicate of YARN-3112, so I am going to close this jira as 
a duplicate. Feel free to reopen it if you disagree.


> MR application fails with "No NMToken sent" exception after MRAppMaster 
> recovery
> --------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6834
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: resourcemanager, yarn
>    Affects Versions: 2.7.0
>         Environment: Centos 7
>            Reporter: Aleksandr Balitsky
>            Assignee: Aleksandr Balitsky
>            Priority: Critical
>         Attachments: YARN-6019.001.patch
>
>
> *Steps to reproduce:*
> 1) Submit an MR application (for example, the PI app with 50 containers)
> 2) Find the MRAppMaster process id for the application
> 3) Kill the MRAppMaster with kill -9
> *Expected:* ResourceManager launches a new MRAppMaster container and MRAppAttempt, 
> and the application finishes correctly
> *Actual:* After the new MRAppMaster and MRAppAttempt are launched, the application 
> fails with the following exception:
> {noformat}
> 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container 
> launch failed for container_1482408247195_0002_02_000011 : 
> org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent 
> for node1:43037
>       at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
>       at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:244)
>       at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
>       at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
>       at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
>       at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> {noformat}
> *Problem*:
> When RMCommunicator sends a "registerApplicationMaster" request to the RM, the RM 
> generates NMTokens for the new RMAppAttempt. Those new NMTokens are transmitted 
> to RMCommunicator in RegisterApplicationMasterResponse 
> (getNMTokensFromPreviousAttempts method), but we don't handle these tokens in 
> the RMCommunicator.register method. The RM doesn't transmit these tokens again 
> in response to later allocate requests, so they never reach NMTokenCache. 
> As a result, we get the "No NMToken sent for node" exception.
> I have found that this issue appears after the changes from 
> https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed
>  
> I ran the same scenario without that commit, and the application completed 
> successfully after MRAppMaster recovery.
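
The missing step described in the problem statement above can be sketched as follows. This is a minimal, self-contained illustration, not the attached patch: NodeToken, TokenCache, RegisterResponse and both helper methods are hypothetical stand-ins for NMToken, NMTokenCache and RegisterApplicationMasterResponse, and the token value is made up.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for illustration only; these are NOT the Hadoop classes.
final class NMTokenRecoverySketch {

  /** Stand-in for an NMToken: a node address plus an opaque secret. */
  static final class NodeToken {
    final String nodeAddress;
    final String secret;

    NodeToken(String nodeAddress, String secret) {
      this.nodeAddress = nodeAddress;
      this.secret = secret;
    }
  }

  /** Stand-in for the token cache, keyed by node address. */
  static final class TokenCache {
    private final Map<String, String> tokens = new ConcurrentHashMap<>();

    void put(String nodeAddress, String secret) {
      tokens.put(nodeAddress, secret);
    }

    String get(String nodeAddress) {
      return tokens.get(nodeAddress);
    }
  }

  /** Stand-in for the register response carrying tokens from previous attempts. */
  static final class RegisterResponse {
    final List<NodeToken> tokensFromPreviousAttempts;

    RegisterResponse(List<NodeToken> tokensFromPreviousAttempts) {
      this.tokensFromPreviousAttempts = tokensFromPreviousAttempts;
    }
  }

  /** The step reported as missing: copy the returned tokens into the cache. */
  static void cachePreviousAttemptTokens(RegisterResponse response, TokenCache cache) {
    for (NodeToken token : response.tokensFromPreviousAttempts) {
      cache.put(token.nodeAddress, token.secret);
    }
  }

  /** Models the lookup at container launch, failing when no token is cached. */
  static String tokenForLaunch(TokenCache cache, String nodeAddress) {
    String secret = cache.get(nodeAddress);
    if (secret == null) {
      throw new IllegalStateException("No NMToken sent for " + nodeAddress);
    }
    return secret;
  }

  public static void main(String[] args) {
    TokenCache cache = new TokenCache();
    RegisterResponse response = new RegisterResponse(
        Collections.singletonList(new NodeToken("node1:43037", "example-secret")));
    // Without this call, tokenForLaunch would throw for node1:43037.
    cachePreviousAttemptTokens(response, cache);
    System.out.println("token available: "
        + (tokenForLaunch(cache, "node1:43037") != null));
  }
}
```

If the tokens are cached right after registration, later launches on the same node find a token even though the RM does not send it again.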



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
