[ https://issues.apache.org/jira/browse/YARN-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhihai xu resolved YARN-3023.
-----------------------------
    Resolution: Duplicate

> Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-3023
>                 URL: https://issues.apache.org/jira/browse/YARN-3023
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>
> A race condition in ZKRMStateStore#createWithRetries from ZooKeeper causes an RM
> crash.
> The sequence for the race condition is the following:
> 1. RM stores the attempt state to ZK by calling createWithRetries.
> {code}
> 2015-01-06 12:37:35,343 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> Storing attempt: AppId: application_1418914202950_42363 AttemptId:
> appattempt_1418914202950_42363_000001 MasterContainer: Container:
> [ContainerId: container_1418914202950_42363_01_000001,
> {code}
> 2. Unluckily, a ConnectionLoss for the ZK session happened at the same time as
> RM stored the attempt state to ZK.
> The ZooKeeper server created the node and stored the data successfully, but
> due to the ConnectionLoss, RM didn't know that the operation (createWithRetries)
> had succeeded.
> {code}
> 2015-01-06 12:37:36,102 INFO
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode
> = ConnectionLoss
> {code}
> 3. RM retried storing the attempt state to ZK after one second.
> {code}
> 2015-01-06 12:37:36,104 INFO
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
> Retrying operation on ZK. Retry no. 1
> {code}
> 4. During the one-second interval, the ZK session was reconnected.
> {code}
> 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established initiating session
> 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated
> timeout = 10000
> {code}
> 5. Because the node was created successfully in ZooKeeper on the first
> try (runWithCheck), the second try fails with a NodeExists KeeperException.
> {code}
> 2015-01-06 12:37:37,116 INFO
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode =
> NodeExists
> 2015-01-06 12:37:37,118 INFO
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed
> out ZK retries. Giving up!
> {code}
> 6. This NodeExists KeeperException causes storing the AppAttempt to fail in
> RMStateStore.
> {code}
> 2015-01-06 12:37:37,118 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error
> storing appAttempt: appattempt_1418914202950_42363_000001
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode =
> NodeExists
> {code}
> 7. RMStateStore sends an RMFatalEventType.STATE_STORE_OP_FAILED event to
> ResourceManager.
> {code}
>   protected void notifyStoreOperationFailed(Exception failureCause) {
>     RMFatalEventType type;
>     if (failureCause instanceof StoreFencedException) {
>       type = RMFatalEventType.STATE_STORE_FENCED;
>     } else {
>       type = RMFatalEventType.STATE_STORE_OP_FAILED;
>     }
>     rmDispatcher.getEventHandler().handle(new RMFatalEvent(type,
>         failureCause));
>   }
> {code}
> 8. ResourceManager kills itself after receiving the STATE_STORE_OP_FAILED
> RMFatalEvent.
> {code}
> 2015-01-06 12:37:37,128 FATAL
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type
> STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode =
> NodeExists
> 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with
> status 1
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
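
For readers following the sequence in steps 2 to 5, here is a minimal, self-contained sketch of the race and of one possible way to tolerate it. It does not use the ZooKeeper client and is not the change made in Hadoop for this issue. FakeStore, StoreException, Code, and this createWithRetries are hypothetical stand-ins, and the rule "treat NodeExists on a retry as success" is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentCreateSketch {
    /** Hypothetical stand-ins for the two ZooKeeper error codes seen in the logs. */
    enum Code { CONNECTION_LOSS, NODE_EXISTS }

    static class StoreException extends Exception {
        final Code code;
        StoreException(Code code) { super(code.name()); this.code = code; }
    }

    /** In-memory fake server: applies the first create, then loses the reply. */
    static class FakeStore {
        final Map<String, String> nodes = new HashMap<>();
        boolean dropNextAck = true;

        void create(String path, String data) throws StoreException {
            if (nodes.containsKey(path)) {
                throw new StoreException(Code.NODE_EXISTS);
            }
            nodes.put(path, data);            // the write is applied on the server...
            if (dropNextAck) {
                dropNextAck = false;          // ...but the client never sees the reply
                throw new StoreException(Code.CONNECTION_LOSS);
            }
        }
    }

    /** Returns the number of retries that were needed. */
    static int createWithRetries(FakeStore store, String path, String data,
                                 int maxRetries) throws StoreException {
        int attempt = 0;
        while (true) {
            try {
                store.create(path, data);
                return attempt;
            } catch (StoreException e) {
                // Assumption: NodeExists on a retry means an earlier attempt
                // was applied before the connection dropped.
                if (e.code == Code.NODE_EXISTS && attempt > 0) {
                    return attempt;
                }
                if (e.code == Code.CONNECTION_LOSS && attempt < maxRetries) {
                    attempt++;
                    continue;
                }
                throw e;                      // anything else is still a failure
            }
        }
    }

    public static void main(String[] args) throws Exception {
        FakeStore store = new FakeStore();
        int retries = createWithRetries(store, "/attempt_000001", "state", 1);
        System.out.println("retries=" + retries
            + " stored=" + store.nodes.get("/attempt_000001"));
    }
}
```

With the NodeExists check removed, the same run ends in a NodeExists failure on the first retry, which matches the log in step 5. The check is only safe if no other writer can create the same node. A stricter variant would read the node back and compare its data before treating the retry as successful.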