[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311922#comment-14311922 ]
Naganarasimha G R commented on YARN-3152:
-----------------------------------------

Thanks [~xgong] for explaining, and [~rohithsharma] for providing your view. I would like to add one more scenario here.

Assume a two-RM HA cluster in which the active RM goes down (manually or for any other reason) and the standby RM does not have the configured exclude file. The elector service now tries to make the standby RM active. With the existing code, the refresh fails in NodeListManager and AdminService throws ServiceFailedException. Because {{rm.transitionToActive()}} is called before the refresh, the standby RM has already become active. But {{ActiveStandbyElector.becomeActive()}} returns false, so the elector keeps trying to make the already active RM active again.

If the intention is to fail the transition, then I think we need to transition the RM back to standby when the refresh throws, before rethrowing the exception (I would also suggest that in the non-HA case the RM should fail to start). If the intention is only to log, then we can log the same in NodeListManager anyway.

Please share your views on the same.

> Missing hadoop exclude file fails RMs in HA
> -------------------------------------------
>
> Key: YARN-3152
> URL: https://issues.apache.org/jira/browse/YARN-3152
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.6.0
> Environment: Debian 7
> Reporter: Neill Lima
> Assignee: Naganarasimha G R
>
> I have two NNs in HA, they do not fail when the exclude file is not present
> (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in
> HA. I didn't create the exclude file at this point as well.
> I applied the HA RM settings properly and when I started both RMs I started
> getting this exception:
>
> 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActive TARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed
> 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304)
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
>     ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
>     ... 5 more
> 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
> 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed
> 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c
> 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session
>
> The issue is descriptive enough to resolve the problem - and it has been
> fixed by creating the exclude file.
> I just think as of a improvement:
> - Should RMs ignore the missing file as the NNs did?
> - Should single RM fail even when the file is not present?
> Just suggesting this improvement to keep the behavior consistent when working
> with in HA (both NNs and RMs).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
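
The rollback proposed in the comment above (transition back to standby when the refresh fails, then rethrow) can be sketched roughly as follows. This is only a minimal, self-contained model of the idea and not the actual Hadoop code: {{SketchRm}}, {{Refresher}}, {{TransitionFailedException}} and {{HaTransitionSketch}} are hypothetical stand-ins for the ResourceManager, the refresh step in AdminService, and ServiceFailedException.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class HaTransitionSketch {
    enum HaState { STANDBY, ACTIVE }

    /** Hypothetical stand-in for the refresh step (e.g. reading the exclude file). */
    interface Refresher {
        void refreshAll() throws IOException;
    }

    /** Simplified stand-in for the RM; only tracks its HA state. */
    static class SketchRm {
        private HaState state = HaState.STANDBY;
        HaState getState() { return state; }
        void transitionToActive() { state = HaState.ACTIVE; }
        void transitionToStandby() { state = HaState.STANDBY; }
    }

    /** Hypothetical stand-in for ServiceFailedException. */
    static class TransitionFailedException extends Exception {
        TransitionFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Makes the RM active first (as described in the comment), then refreshes.
     * If the refresh fails, the RM is moved back to standby before the
     * failure is reported, so it is not left active after a failed transition.
     */
    static void transitionToActive(SketchRm rm, Refresher refresher)
            throws TransitionFailedException {
        rm.transitionToActive();
        try {
            refresher.refreshAll();
        } catch (IOException e) {
            rm.transitionToStandby(); // roll back before rethrowing
            throw new TransitionFailedException(
                "Error when transitioning to Active mode", e);
        }
    }

    public static void main(String[] args) {
        SketchRm rm = new SketchRm();
        try {
            // Simulate the missing exclude file.
            transitionToActive(rm, () -> {
                throw new FileNotFoundException("exclude (No such file or directory)");
            });
        } catch (TransitionFailedException e) {
            System.out.println("transition failed: " + e.getCause().getMessage());
        }
        System.out.println("state after failed refresh: " + rm.getState());
    }
}
```

In this model the RM ends up in STANDBY after the failed refresh, so a retry by the elector would start from a consistent state instead of trying to activate an RM that is already active.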