[
https://issues.apache.org/jira/browse/MAPREDUCE-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136641#comment-13136641
]
Eric Payne commented on MAPREDUCE-3186:
---------------------------------------
Problems being solved and their solutions:
# +When an application is running and the RM goes down, the MRAppMaster loops
forever.+
Changes were made to {{RMContainerAllocator::getResources()}} so that it
attempts to contact the RM only a limited number of times. The retry limit is
set by {{MRJobConfig.MR_AM_TO_RM_RETRIES}}, whose property name is
{{yarn.app.mapreduce.am.scheduler.connection.retries}}.
??This is a new yarn config property??.
If contact with the RM fails the specified number of times,
{{RMContainerAllocator::getResources()}} generates an INTERNAL_ERROR event
and throws a YarnException, which is caught by
{{RMCommunicator::AllocatorThread}} and causes that thread to exit.
# +When the RM is stopped and restarted, the MRAppMaster does not honor the
"shouldreboot" flag sent from the RM and keeps attempting to connect to the
new RM.+
Changes were made to {{RMContainerAllocator::getResources()}} to check the
reboot flag in the response from the call to {{makeRemoteRequest()}}. If the
reboot flag is set, {{RMContainerAllocator::getResources()}} generates an
INTERNAL_ERROR event and throws a YarnException, which is caught by
{{RMCommunicator::AllocatorThread}} and causes that thread to exit.
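
The two behaviors described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch: {{Response}}, {{RmClient}}, {{ErrorSink}} and {{AllocatorRetrySketch}} are hypothetical stand-ins, {{IllegalStateException}} stands in for YarnException, and resetting the failure counter after a successful call is an assumption that the comment above does not confirm.

```java
import java.io.IOException;

/**
 * Illustrative sketch only: every type here is a hypothetical stand-in,
 * not the real Hadoop RMContainerAllocator.
 */
class AllocatorRetrySketch {

    /** Stand-in for the RM's allocate response; only the reboot flag is modeled. */
    static final class Response {
        final boolean reboot;
        Response(boolean reboot) { this.reboot = reboot; }
    }

    /** Stand-in for makeRemoteRequest(). */
    interface RmClient {
        Response makeRemoteRequest() throws IOException;
    }

    /** Stand-in for dispatching the INTERNAL_ERROR event. */
    interface ErrorSink {
        void internalError(String reason);
    }

    private final RmClient client;
    private final ErrorSink errors;
    // Would be read from yarn.app.mapreduce.am.scheduler.connection.retries.
    private final int maxRetries;
    private int failures = 0;

    AllocatorRetrySketch(RmClient client, ErrorSink errors, int maxRetries) {
        this.client = client;
        this.errors = errors;
        this.maxRetries = maxRetries;
    }

    /**
     * One heartbeat. Returns null after a tolerated failure; throws once the
     * retry limit is reached or the RM asks the AM to reboot.
     */
    Response getResources() {
        Response response;
        try {
            response = client.makeRemoteRequest();
            failures = 0; // assumption: only consecutive failures count
        } catch (IOException e) {
            failures++;
            if (failures >= maxRetries) {
                errors.internalError("Could not contact RM after " + failures + " attempts");
                throw new IllegalStateException("Could not contact RM", e);
            }
            return null; // try again on the next heartbeat
        }
        if (response.reboot) {
            errors.internalError("RM requested reboot");
            throw new IllegalStateException("Resource Manager requested reboot");
        }
        return response;
    }

    public static void main(String[] args) {
        // Simulate an RM that never comes back, with a retry limit of 3.
        AllocatorRetrySketch allocator = new AllocatorRetrySketch(
            () -> { throw new IOException("RM unreachable"); },
            reason -> System.out.println("INTERNAL_ERROR: " + reason),
            3);
        try {
            while (true) {
                allocator.getResources(); // stands in for the allocator thread's loop
            }
        } catch (IllegalStateException e) {
            System.out.println("Allocator thread exiting: " + e.getMessage());
        }
    }
}
```

In this sketch the caller plays the role of {{RMCommunicator::AllocatorThread}}: it keeps calling {{getResources()}} and leaves its loop when the exception is thrown, instead of looping forever.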
> User jobs are getting hanged if the Resource manager process goes down and
> comes up while job is getting executed.
> ------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3186
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3186
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.0
> Environment: linux
> Reporter: Ramgopal N
> Assignee: Eric Payne
> Priority: Blocker
> Labels: test
>
> If the resource manager is restarted while the job execution is in progress,
> the job is getting hanged.
> UI shows the job as running.
> In the RM log, it is throwing an error "ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
> AppAttemptId doesnt exist in cache appattempt_1318579738195_0004_000001"
> In the console MRAppMaster and Runjar processes are not getting killed