[ https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892400#comment-15892400 ]
Aleksandr Balitsky edited comment on MAPREDUCE-6834 at 3/2/17 3:27 PM:
-----------------------------------------------------------------------

Hi [~haibochen], [~jlowe]

Sorry for the late reply.

{quote}
Is this a scenario where somehow the MRAppMaster is asking to preserve containers across app attempts? I ask because ApplicationMasterService normally does not call setNMTokensFromPreviousAttempts on RegisterApplicationMasterResponse unless getKeepContainersAcrossApplicationAttempts on the application submission context is true. Last I checked the MapReduce client (YARNRunner) wasn't specifying that when the application is submitted to YARN.
{quote}

You are right. I did not consider that MR doesn't support AM work-preserving restart, and I now see that my first patch isn't a good solution for this problem. Thanks for the review!

{quote}
Aleksandr Balitsky, which scheduler were you running?
{quote}

I'm running the Fair Scheduler. I don't think this issue depends on the scheduler, but I will check it with the other schedulers.

{quote}
We have not made changes to preserve containers in MR. Chasing the code in more detail, I came to a similar conclusion as https://issues.apache.org/jira/browse/YARN-3112?focusedCommentId=14299003&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14299003 MR relies on the YARN RM to get the NMTokens needed to launch containers with NMs. Given the code today, it is possible that a null NMToken is sent to MR, which contradicts the javadoc in SchedulerApplicationAttempt.java here
{quote}

I totally agree with you that we have not made changes to preserve containers in MR. But the solution that you mentioned contradicts the YARN design:

{quote}
As for network optimization, NMTokens are not sent to the ApplicationMasters for each and every allocated container, but only for the first time or if NMTokens have to be invalidated due to the rollover of the underlying master key
{quote}

It is indeed possible that a null NMToken is sent to MR: NMTokens are sent only after they are first created, which is by design. The AM then saves them in its NMTokenCache, so there is no need to pass NM tokens in every allocation interaction. Consequently, clearing the NMTokenSecretManager cache on every allocation is not the best approach: it disables the caching, and new NM tokens will be generated (instead of reusing the cached instance) in every allocation response. IMHO we shouldn't do this, because it does not fix the root cause; it is a workaround.
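To make the caching behavior concrete, here is a minimal sketch of the AM-side lookup that produces the "No NMToken sent" error when the cache was never populated. It is a simplification of what ContainerManagementProtocolProxy does, not the actual code; the NMTokenLookupSketch class and its containerManagerBindAddr parameter (the NM's host:port) are assumed for illustration.

{code:java}
import org.apache.hadoop.security.token.SecretManager.InvalidToken;
import org.apache.hadoop.yarn.api.records.Token;
import org.apache.hadoop.yarn.client.api.NMTokenCache;

public class NMTokenLookupSketch {
  // Simplified view of the AM-side check: container launches consult the
  // NMTokenCache and fail fast when no token is cached for the target node.
  static Token lookupNMToken(String containerManagerBindAddr) throws InvalidToken {
    Token nmToken = NMTokenCache.getSingleton().getToken(containerManagerBindAddr);
    if (nmToken == null) {
      // This is the exception in the stack trace below: after AM recovery the
      // cache was never repopulated, so every lookup for that node fails.
      throw new InvalidToken("No NMToken sent for " + containerManagerBindAddr);
    }
    return nmToken;
  }
}
{code}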
> MR application fails with "No NMToken sent" exception after MRAppMaster recovery
> --------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6834
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: resourcemanager, yarn
>    Affects Versions: 2.7.0
>         Environment: Centos 7
>            Reporter: Aleksandr Balitsky
>            Assignee: Aleksandr Balitsky
>            Priority: Critical
>         Attachments: YARN-6019.001.patch
>
> *Steps to reproduce:*
> 1) Submit an MR application (for example, the PI app with 50 containers)
> 2) Find the MRAppMaster process id for the application
> 3) Kill the MRAppMaster with the kill -9 command
> *Expected:* The ResourceManager launches a new MRAppMaster container and MRAppAttempt, and the application finishes correctly
> *Actual:* After launching the new MRAppMaster and MRAppAttempt, the application fails with the following exception:
> {noformat}
> 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1482408247195_0002_02_000011 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for node1:43037
>         at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
>         at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:244)
>         at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
>         at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
>         at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
>         at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> *Problem:*
> When the RMCommunicator sends a "registerApplicationMaster" request to the RM, the RM generates NMTokens for the new RMAppAttempt. Those new NMTokens are transmitted to the RMCommunicator in the RegisterApplicationMasterResponse (via the getNMTokensFromPreviousAttempts method), but we don't handle these tokens in the RMCommunicator.register method. The RM doesn't transmit these tokens again for other allocation requests, so we never have them in the NMTokenCache and accordingly get the "No NMToken sent for node" exception.
> I have found that this issue appears after the changes from https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed
> I tried the same scenario without that commit, and the application completed successfully after MRAppMaster recovery
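To illustrate the missing handling described above, here is a minimal sketch (an assumption about the shape of a fix, not a reviewed patch) of how RMCommunicator.register could copy the recovered tokens into the AM-side cache; the RegisterRecoverySketch class and its storeRecoveredNMTokens helper are hypothetical names.

{code:java}
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.NMToken;
import org.apache.hadoop.yarn.client.api.NMTokenCache;

public class RegisterRecoverySketch {
  // Hypothetical helper: copy the NMTokens that the RM returns for containers
  // from previous attempts into the AM-side cache, so that subsequent
  // container launches on those nodes find a token instead of failing.
  static void storeRecoveredNMTokens(RegisterApplicationMasterResponse response) {
    for (NMToken nmToken : response.getNMTokensFromPreviousAttempts()) {
      NMTokenCache.getSingleton().setToken(
          nmToken.getNodeId().toString(), nmToken.getToken());
    }
  }
}
{code}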