[ https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258881#comment-14258881 ]
Rohith commented on YARN-2992:
------------------------------

It makes sense to me. +1 for the issue. I would also like to bring up the scenario where ZK is not available during RM startup. I have observed that the RM exits while starting if ZK is not available. Why can't the RM transition to standby instead?

> ZKRMStateStore crashes due to session expiry
> --------------------------------------------
>
>                 Key: YARN-2992
>                 URL: https://issues.apache.org/jira/browse/YARN-2992
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: yarn-2992-1.patch
>
>
> We recently saw the RM crash with the following stacktrace. On session
> expiry, we should gracefully transition to standby.
> {noformat}
> 2014-12-18 06:28:42,689 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>     at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
>     at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687)
> {noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)