[jira] [Commented] (MAPREDUCE-3186) User jobs are getting hanged if the Resource manager process goes down and comes up while job is getting executed.

Eric Payne (Commented) (JIRA) Wed, 26 Oct 2011 17:37:57 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136641#comment-13136641
 ]


Eric Payne commented on MAPREDUCE-3186:
---------------------------------------

Problems being solved and their solutions:

# +When an application is running and the RM goes down, the MRAppMaster loops 
forever.+
Changes were made to {{RMContainerAllocator::getResources()}} so that it 
attempts to contact the RM only a limited number of times. The number of 
retries is set by {{MRJobConfig.MR_AM_TO_RM_RETRIES}}, whose property name is 
{{yarn.app.mapreduce.am.scheduler.connection.retries}}.
??This is a new yarn config property??.
If contact with the RM fails the specified number of times, 
{{RMContainerAllocator::getResources()}} will generate an INTERNAL_ERROR event 
and will throw a YarnException, which will be caught by 
{{RMCommunicator::AllocatorThread}} and cause that thread to exit.
# +When the RM is stopped and restarted, the MRAppMaster does not honor the 
"shouldreboot" flag sent from the RM and keeps attempting to connect with the 
new RM.+
Changes were made to {{RMContainerAllocator::getResources()}} to check the 
reboot flag in the response from the call to {{makeRemoteRequest()}}. If the 
reboot flag is set, {{RMContainerAllocator::getResources()}} will generate an 
INTERNAL_ERROR event and will throw a YarnException, which will be caught by 
{{RMCommunicator::AllocatorThread}} and cause that thread to exit.
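
The two changes described above can be illustrated with a minimal, self-contained sketch. It is not the patch itself: {{Response}}, {{RemoteCall}}, {{AllocatorFailure}} and the class name are hypothetical stand-ins for the real allocate response, {{makeRemoteRequest()}} and YarnException, and resetting the failure counter after a successful call is an assumption. The INTERNAL_ERROR event is only marked by a comment.

```java
import java.io.IOException;

// Sketch only: every type here is a hypothetical stand-in, not Hadoop code.
class AllocatorRetrySketch {

    /** Stand-in for the RM's response; carries only the reboot flag. */
    static final class Response {
        final boolean reboot;
        Response(boolean reboot) { this.reboot = reboot; }
    }

    /** Stand-in for makeRemoteRequest(). */
    interface RemoteCall {
        Response call() throws IOException;
    }

    /** Stand-in for the YarnException that makes the allocator thread exit. */
    static final class AllocatorFailure extends RuntimeException {
        AllocatorFailure(String message) { super(message); }
    }

    // Would be read from yarn.app.mapreduce.am.scheduler.connection.retries.
    private final int maxRetries;
    private int failedAttempts = 0;

    AllocatorRetrySketch(int maxRetries) { this.maxRetries = maxRetries; }

    /**
     * One heartbeat. Returns the response, or null if the RM was unreachable
     * but the retry budget is not yet used up. Throws AllocatorFailure when
     * the budget is exhausted or the RM asks the AM to reboot.
     */
    Response getResources(RemoteCall rm) {
        Response response;
        try {
            response = rm.call();
            failedAttempts = 0; // assumption: a successful call resets the count
        } catch (IOException e) {
            failedAttempts++;
            if (failedAttempts >= maxRetries) {
                // The real change would also raise an INTERNAL_ERROR event here.
                throw new AllocatorFailure(
                    "Could not contact RM after " + failedAttempts + " attempts");
            }
            return null; // the next heartbeat tries again
        }
        if (response.reboot) {
            // The real change would also raise an INTERNAL_ERROR event here.
            throw new AllocatorFailure("RM requested reboot");
        }
        return response;
    }
}
```

In this sketch the caller plays the role of {{RMCommunicator::AllocatorThread}}: it invokes {{getResources()}} on every heartbeat and leaves its loop as soon as {{AllocatorFailure}} is thrown, so the AM no longer spins forever against a missing or restarted RM.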

                
> User jobs are getting hanged if the Resource manager process goes down and 
> comes up while job is getting executed.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3186
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3186
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>         Environment: linux
>            Reporter: Ramgopal N
>            Assignee: Eric Payne
>            Priority: Blocker
>              Labels: test
>
> If the resource manager is restarted while a job is executing, the job 
> hangs.
> The UI shows the job as running.
> The RM log shows the error "ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> AppAttemptId doesnt exist in cache appattempt_1318579738195_0004_000001"
> On the console, the MRAppMaster and RunJar processes are not killed.
