[ https://issues.apache.org/jira/browse/YARN-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524290#comment-16524290 ]
Chandni Singh edited comment on YARN-8409 at 6/27/18 12:51 AM:
---------------------------------------------------------------

This happens when the RM is started immediately after the ZooKeeper leader is killed. The {{zkClient}} reference in {{ActiveStandbyElector}} is null, which causes the NPE. Below is the chain of calls:
# In the {{ActiveStandbyElector}} constructor, at line 274, {{reEstablishSession()}} is invoked.
# {{reEstablishSession}} tries to create the ZooKeeper connection at line 825.
# {{createConnection}} calls {{connectToZookeeper}} at line 850 to initialize {{zkClient}}.
# However, {{connectToZookeeper}} throws an IOException because of a session timeout.
# {{zkClient}} never gets initialized and remains {{null}}.

{{ActiveStandbyElectorBasedElectorService}} currently doesn't check whether the elector is connected to ZooKeeper and executes {{elector.ensureParentZNode()}}, which then throws the NPE.

was (Author: csingh):
This happens when RM is started immediately after killing zookeeper leader. The {{zkClient}} is null.

> ActiveStandbyElectorBasedElectorService is failing with NPE
> -----------------------------------------------------------
>
>                 Key: YARN-8409
>                 URL: https://issues.apache.org/jira/browse/YARN-8409
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: Yesha Vora
>            Assignee: Chandni Singh
>            Priority: Major
>
> In an RM-HA environment, kill the ZK leader and then perform an RM failover.
> Sometimes the active RM gets an NPE and fails to come up successfully.
> {code:java}
> 2018-06-08 10:31:03,007 INFO client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:run(289)) - Client will use GSSAPI as SASL mechanism.
> 2018-06-08 10:31:03,008 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server xxx/xxx:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
> 2018-06-08 10:31:03,009 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1146)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
> 2018-06-08 10:31:03,344 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService failed in state INITED
> java.lang.NullPointerException
>     at org.apache.hadoop.ha.ActiveStandbyElector$3.run(ActiveStandbyElector.java:1033)
>     at org.apache.hadoop.ha.ActiveStandbyElector$3.run(ActiveStandbyElector.java:1030)
>     at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1095)
>     at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1087)
>     at org.apache.hadoop.ha.ActiveStandbyElector.createWithRetries(ActiveStandbyElector.java:1030)
>     at org.apache.hadoop.ha.ActiveStandbyElector.ensureParentZNode(ActiveStandbyElector.java:347)
>     at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.serviceInit(ActiveStandbyElectorBasedElectorService.java:110)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:336)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1479)
> 2018-06-08 10:31:03,345 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(409)) - Yielding from election
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
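
For illustration only, the call chain described in the comment can be modelled with a small, self-contained Java sketch. This is not the Hadoop source: {{ElectorSketch}}, {{FakeZkClient}}, {{isConnected()}} and {{ensureParentZNodeGuarded()}} are hypothetical names, and the sketch assumes the failed connection attempt is swallowed during construction rather than propagated, which is what the comment implies. It shows how the field can stay {{null}} and how a connectivity check before use would turn the NPE into an explicit error.

```java
import java.io.IOException;

/** Hypothetical stand-in for the ZooKeeper client handle; not the real ZooKeeper API. */
class FakeZkClient {
    void create(String path) {
        // No-op: the sketch only cares whether the handle exists.
    }
}

/** Simplified model of the elector; method names mirror the comment, bodies are illustrative. */
class ElectorSketch {
    private FakeZkClient zkClient;      // remains null if the connection attempt fails
    private final boolean zkReachable;  // assumption: simulates whether ZooKeeper answers

    ElectorSketch(boolean zkReachable) {
        this.zkReachable = zkReachable;
        reEstablishSession();
    }

    private void reEstablishSession() {
        try {
            createConnection();
        } catch (IOException e) {
            // Assumed behaviour: the failure is not propagated,
            // so construction completes with zkClient == null.
        }
    }

    private void createConnection() throws IOException {
        zkClient = connectToZooKeeper();
    }

    private FakeZkClient connectToZooKeeper() throws IOException {
        if (!zkReachable) {
            throw new IOException("session timeout");
        }
        return new FakeZkClient();
    }

    boolean isConnected() {
        return zkClient != null;
    }

    /** Unguarded variant: dereferences zkClient and throws NPE when it is null. */
    void ensureParentZNode() {
        zkClient.create("/parent");
    }

    /** Guarded variant: reports the missing connection as a checked exception instead. */
    void ensureParentZNodeGuarded() throws IOException {
        if (zkClient == null) {
            throw new IOException("not connected to ZooKeeper");
        }
        zkClient.create("/parent");
    }
}

class ElectorNpeDemo {
    public static void main(String[] args) {
        ElectorSketch elector = new ElectorSketch(false); // ZooKeeper unreachable
        try {
            elector.ensureParentZNode();
        } catch (NullPointerException e) {
            System.out.println("unguarded call failed with NPE");
        }
        try {
            elector.ensureParentZNodeGuarded();
        } catch (IOException e) {
            System.out.println("guarded call failed with: " + e.getMessage());
        }
    }
}
```

Whether the actual fix should fail fast in the constructor, retry the connection, or check connectivity in the calling service is a design decision for the patch; the sketch only demonstrates the failure mode.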