[ 
https://issues.apache.org/jira/browse/YARN-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-514:
-----------------------------

    Attachment: YARN-514.6.patch

Thanks, @Bikas, for your investigation. I've modified the code. The newest 
patch contains the following major updates:

1. The FAILED => FAILED and KILLED => KILLED transitions on 
RMAppEventType.APP_SAVED are now defined. This fixes the problem pointed out 
by @Bikas.

2. In addition, I found a problem in the RMApp state transitions in the RM 
restart scenario. With the previous patch, the stored MRApp is recovered, an 
RMApp instance is created, and it then transitions to NEW_SAVING and is stored 
again. To fix the problem, "isRecovered" is defined in RMAppImpl and is set to 
true when RMAppImpl#recover is called. Then, when RMAppEventType.START is 
received, the transition is NEW => NEW_SAVING if the RMApp instance is not 
recovered, and NEW => SUBMITTED otherwise.

3. Additional test cases are added to TestRMAppTransitions to test the 
aforementioned transition rules.

4. TestRMRestart should have caught the problem of a recovered RMApp instance 
being saved again. However, the test case didn't fail with the previous patch 
because MemoryRMStateStore didn't throw an exception when storing a duplicate 
application/attempt. Therefore, in the newest patch, MemoryRMStateStore throws 
an IOException when the application/attempt has already been stored, which is 
consistent with the behavior of FileSystemRMStateStore. With this change, the 
current TestRMRestart test case can catch the problem of saving the RMApp 
instance twice.
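
To illustrate items 2 and 4 above, here is a minimal, simplified sketch. It is 
not the code in the patch: SketchApp and SketchMemoryStore are hypothetical 
stand-ins for RMAppImpl and MemoryRMStateStore, the store call is made 
synchronously for brevity, and only the NEW, NEW_SAVING and SUBMITTED states 
are modeled.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for an in-memory state store (not the real
// MemoryRMStateStore). It rejects duplicates, as described in item 4.
class SketchMemoryStore {
    private final Set<String> storedAppIds = new HashSet<>();

    void storeApplication(String appId) throws IOException {
        // Set#add returns false if the id was already present.
        if (!storedAppIds.add(appId)) {
            throw new IOException("App " + appId + " is already stored");
        }
    }
}

// Hypothetical stand-in for RMAppImpl, reduced to the START handling
// described in item 2.
class SketchApp {
    enum State { NEW, NEW_SAVING, SUBMITTED }

    private final String appId;
    private final SketchMemoryStore store;
    private State state = State.NEW;
    private boolean isRecovered = false;

    SketchApp(String appId, SketchMemoryStore store) {
        this.appId = appId;
        this.store = store;
    }

    // Marks this instance as rebuilt from the store after a restart.
    void recover() {
        isRecovered = true;
    }

    // Handles the START event: a recovered app skips the save step.
    void handleStart() throws IOException {
        if (state != State.NEW) {
            throw new IllegalStateException("START is only valid in NEW");
        }
        if (isRecovered) {
            state = State.SUBMITTED;        // NEW => SUBMITTED, no second store
        } else {
            store.storeApplication(appId);  // simplified: synchronous here
            state = State.NEW_SAVING;       // NEW => NEW_SAVING
        }
    }

    State getState() {
        return state;
    }
}
```

In this sketch, a recovered instance never reaches the store, so the 
duplicate check only fires if the recovery path regresses and tries to save 
the same application a second time.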
                
> Delayed store operations should not result in RM unavailability for app 
> submission
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-514
>                 URL: https://issues.apache.org/jira/browse/YARN-514
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Zhijie Shen
>         Attachments: YARN-514.1.patch, YARN-514.2.patch, YARN-514.3.patch, 
> YARN-514.4.patch, YARN-514.5.patch, YARN-514.6.patch
>
>
> Currently, app submission is the only store operation performed synchronously 
> because the app must be stored before the request returns with success. This 
> makes the RM susceptible to blocking all client threads on slow store 
> operations, resulting in RM being perceived as unavailable by clients.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
