[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583093#comment-14583093
 ] 

Naganarasimha G R commented on YARN-3644:
-----------------------------------------

Hi [~raju.bairishetti],
IIUC, the intention of this jira is only to make the NM wait for the RM 
indefinitely, and hence we don't want to set 
{{yarn.resourcemanager.connect.max-wait.ms}} to the FOREVER retry policy, 
which might affect other clients connecting to the RM, right?
If so, I feel the overall approach is fine, except for the cosmetic comments 
below:
# {{NM_SHUTSDWON_ON_RM_CONNECTION_FAILURES}} has a typo: SHUTSDWON => SHUTDOWN
# If you agree with the above, then 
{{DEFAULT_NM_SHUTSDOWN_ON_RM_CONNECTION_FAILURES}} => 
{{DEFAULT_NM_SHUTDOWN_ON_RM_CONNECTION_FAILURES}} 
# The configuration {{yarn.nodemanager.shutdown.on.connection.failures}} could 
be renamed to {{yarn.nodemanager.shutdown.on.RM.connection.failures}}. Please 
correct the name and the description in yarn-default.xml as well.
# The test case introduces a new {{MyNodeStatusUpdater6}} whose only change is 
to return the new ResourceTracker for the test. This is becoming more and more 
duplicated code, as most of the other NodeStatusUpdater subclasses in the test 
do the same. Can we bring in a common NodeStatusUpdater class which accepts 
the ResourceTracker as a constructor parameter? (Refactoring the other classes 
could be taken up in another jira if required.)
# {{MyResourceTracker8}} could extend {{MyResourceTracker5}} and just override 
the required methods. I would also appreciate some documentation above these 
classes, so that they are easier to reuse in the future.
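
To illustrate the last two points, here is a rough, self-contained sketch of 
the shape I have in mind. It does not use the real Hadoop classes: 
{{ResourceTracker}} and {{BaseStatusUpdater}} below are simplified stand-ins 
(the latter for the NodeStatusUpdater implementation), and all class and 
method names other than {{ConnectException}} are hypothetical.

```java
import java.net.ConnectException;

// Simplified stand-in for the real ResourceTracker protocol (assumption:
// the real interface has richer request/response types).
interface ResourceTracker {
    String nodeHeartbeat(String nodeId) throws ConnectException;
}

// Simplified stand-in for the production status updater, which normally
// creates its own RM client.
class BaseStatusUpdater {
    protected ResourceTracker getRMClient() {
        throw new UnsupportedOperationException("would create a real RM client");
    }

    String heartbeat(String nodeId) throws ConnectException {
        return getRMClient().nodeHeartbeat(nodeId);
    }
}

// One common test updater: the tracker is passed to the constructor, so a
// new updater subclass per test case is no longer needed.
class TrackerInjectingStatusUpdater extends BaseStatusUpdater {
    private final ResourceTracker tracker;

    TrackerInjectingStatusUpdater(ResourceTracker tracker) {
        this.tracker = tracker;
    }

    @Override
    protected ResourceTracker getRMClient() {
        return tracker;
    }
}

// Base test tracker: every heartbeat succeeds.
class AlwaysOkTracker implements ResourceTracker {
    @Override
    public String nodeHeartbeat(String nodeId) throws ConnectException {
        return "NORMAL:" + nodeId;
    }
}

// Specialised test tracker: extends the base one and overrides only the
// method whose behaviour differs (simulates an unreachable RM).
class ConnectionFailingTracker extends AlwaysOkTracker {
    @Override
    public String nodeHeartbeat(String nodeId) throws ConnectException {
        throw new ConnectException("RM unreachable");
    }
}
```

With this shape, a test would construct 
{{new TrackerInjectingStatusUpdater(new ConnectionFailingTracker())}} instead 
of adding another numbered updater class.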

> Node manager shuts down if unable to connect with RM
> ----------------------------------------------------
>
>                 Key: YARN-3644
>                 URL: https://issues.apache.org/jira/browse/YARN-3644
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Srikanth Sundarrajan
>            Assignee: Raju Bairishetti
>         Attachments: YARN-3644.001.patch, YARN-3644.patch
>
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>           } catch (ConnectException e) {
>             //catch and throw the exception if tried MAX wait time to connect RM
>             dispatcher.getEventHandler().handle(
>                 new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>             throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, 
> all the NMs shut themselves down, requiring additional work to bring the NMs 
> back up.
> Setting yarn.resourcemanager.connect.max-wait.ms to -1 has other side 
> effects: non-connection failures are then retried infinitely by all 
> YarnClients (via RMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
