zhihai xu created YARN-3023:
-------------------------------

             Summary: Race condition in ZKRMStateStore#createWithRetries from 
ZooKeeper causes RM crash
                 Key: YARN-3023
                 URL: https://issues.apache.org/jira/browse/YARN-3023
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.6.0
            Reporter: zhihai xu
            Assignee: zhihai xu


A race condition in ZKRMStateStore#createWithRetries with ZooKeeper causes the 
RM to crash.

The sequence of events for the race condition is the following:
1. The RM stores the attempt state to ZK by calling createWithRetries:
{code}
2015-01-06 12:37:35,343 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
Storing attempt: AppId: application_1418914202950_42363 AttemptId: 
appattempt_1418914202950_42363_000001 MasterContainer: Container: [ContainerId: 
container_1418914202950_42363_01_000001,
{code}
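For context, createWithRetries wraps the ZooKeeper create in a retry loop: on a 
ConnectionLoss it waits for the configured retry interval and then blindly 
re-issues the same create. The sketch below only illustrates that pattern; it 
is a simplified approximation, not the actual ZKRMStateStore code, and 
zkClient, acl, numRetries, and retryInterval are placeholder names.
{code}
// Simplified sketch of the retry pattern (not the actual ZKRMStateStore code);
// zkClient, acl, numRetries, and retryInterval are placeholders.
String createWithRetries(String path, byte[] data) throws Exception {
  for (int retry = 0; ; retry++) {
    try {
      return zkClient.create(path, data, acl, CreateMode.PERSISTENT);
    } catch (KeeperException.ConnectionLossException e) {
      // The create may or may not have been applied on the server;
      // the client only knows that the connection was lost.
      if (retry >= numRetries) {
        throw e;
      }
      Thread.sleep(retryInterval);  // then blindly re-issue the same create
    }
  }
}
{code}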

2. Unluckily, a ConnectionLoss on the ZK session happened at the same time the 
RM stored the attempt state to ZK.
The ZooKeeper server created the node and stored the data successfully, but due 
to the ConnectionLoss, the RM did not know that the operation 
(createWithRetries) had succeeded.
{code}
2015-01-06 12:37:36,102 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss
{code}

3. The RM retried storing the attempt state to ZK after one second:
{code}
2015-01-06 12:37:36,104 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying 
operation on ZK. Retry no. 1
{code}

4. During the one-second interval, the ZK session was reconnected:
{code}
2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established initiating session
2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated 
timeout = 10000
{code}

5. Because the node was already created successfully at ZooKeeper in the first 
try (runWithCheck), the second try failed with a NodeExists KeeperException:
{code}
2015-01-06 12:37:37,116 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists
2015-01-06 12:37:37,118 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
out ZK retries. Giving up!
{code}
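The same outcome can be reproduced in isolation with a plain ZooKeeper client: 
once the first create has been applied on the server, a blind retry of the 
identical create fails with NodeExists. A minimal sketch, assuming a local ZK 
server and a hypothetical path (waiting for session establishment is omitted 
for brevity):
{code}
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;

// Standalone illustration of step 5; the connect string and path are
// hypothetical, and waiting for the session to connect is omitted.
public class BlindRetryNodeExists {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> { });
    String path = "/yarn-3023-demo";
    // First attempt: applied on the server (step 2).
    zk.create(path, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    try {
      // Blind retry of the same create (steps 3 and 5): the node already exists.
      zk.create(path, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      System.out.println("Retry failed with NodeExists, as in the log above");
    }
    zk.close();
  }
}
{code}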

6. This NodeExists KeeperException then caused the appAttempt store operation 
in RMStateStore to fail:
{code}
2015-01-06 12:37:37,118 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
storing appAttempt: appattempt_1418914202950_42363_000001
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists
{code}

7. RMStateStore then sends an RMFatalEventType.STATE_STORE_OP_FAILED event to 
the ResourceManager:
{code}
  protected void notifyStoreOperationFailed(Exception failureCause) {
    RMFatalEventType type;
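    // A NodeExistsException is not a StoreFencedException, so the failure
    // from step 6 maps to STATE_STORE_OP_FAILED, which is fatal to the RM.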
    if (failureCause instanceof StoreFencedException) {
      type = RMFatalEventType.STATE_STORE_FENCED;
    } else {
      type = RMFatalEventType.STATE_STORE_OP_FAILED;
    }
    rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause));
  }
{code}

8. The ResourceManager kills itself after receiving the STATE_STORE_OP_FAILED 
RMFatalEvent:
{code}
2015-01-06 12:37:37,128 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists
2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1
{code}
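Roughly, the fatal-event handler that produces the log above has the shape 
sketched below. This is an approximation for illustration only, not the actual 
ResourceManager source; the exact logging and termination calls are assumptions.
{code}
// Approximate shape of the RM-side fatal-event handling (not copied from the
// ResourceManager source): the event is logged at FATAL and the process is
// terminated via org.apache.hadoop.util.ExitUtil, giving "Exiting with status 1".
@Override
public void handle(RMFatalEvent event) {
  LOG.fatal("Received a " + event.getClass().getName()
      + " of type " + event.getType() + ". Cause:\n" + event);
  ExitUtil.terminate(1, String.valueOf(event));
}
{code}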



