[ https://issues.apache.org/jira/browse/YARN-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhijie Shen updated YARN-514:
-----------------------------
    Attachment: YARN-514.6.patch

Thanks @Bikas for your investigation. I've modified the code. The newest patch contains the following major updates:

1. A FAILED => FAILED transition on RMAppEventType.APP_SAVED and a KILLED => KILLED transition on RMAppEventType.APP_SAVED are now defined. This fixes the problem pointed out by @Bikas.

2. In addition, I found a problem in the RMApp state transitions in the RM restart scenario. With the previous patch, the stored MRApp is recovered, an RMApp instance is created, and that instance then transitions to NEW_SAVING and is stored again. To fix the problem, "isRecovered" is defined in RMAppImpl and is set to true when RMAppImpl#recover is called. Then, when RMAppEventType.START is received, the transition is NEW => NEW_SAVING if the RMApp instance is not recovered, and NEW => SUBMITTED otherwise.

3. Additional test cases are added in TestRMAppTransitions to test the aforementioned transition rules.

4. TestRMRestart should have caught the problem of a recovered RMApp instance being saved again. However, the test case did not fail with the previous patch because MemoryRMStateStore did not throw an exception when storing a duplicate application/attempt. Therefore, in the newest patch, MemoryRMStateStore throws an IOException when the application/attempt has already been stored, which is consistent with the behavior of FileSystemRMStateStore. With that change, the current TestRMRestart test case can catch the problem of saving the RMApp instance twice.
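For readers following along, here is a minimal, self-contained sketch of the rules described in items 2 and 4. It is not the code from the patch: the class names RecoveredAppSketch, InMemoryStore and App, and the methods handleStart and storeApplication, are hypothetical stand-ins for RMAppImpl and MemoryRMStateStore, and the store call is made synchronously here only to keep the example short.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class RecoveredAppSketch {

    // Only the states relevant to the START event are modeled.
    enum State { NEW, NEW_SAVING, SUBMITTED }

    // Hypothetical in-memory store that rejects duplicates (item 4).
    static class InMemoryStore {
        private final Set<String> storedAppIds = new HashSet<>();

        void storeApplication(String appId) throws IOException {
            // Set.add returns false when the id is already present.
            if (!storedAppIds.add(appId)) {
                throw new IOException("Application " + appId + " is already stored");
            }
        }
    }

    // Hypothetical stand-in for the app state machine (item 2).
    static class App {
        private final String appId;
        private final InMemoryStore store;
        private State state = State.NEW;
        private boolean isRecovered = false;

        App(String appId, InMemoryStore store) {
            this.appId = appId;
            this.store = store;
        }

        // Called when the app is rebuilt from the store after a restart.
        void recover() {
            isRecovered = true;
        }

        // Handles the START event: a recovered app skips the save step.
        void handleStart() throws IOException {
            if (state != State.NEW) {
                throw new IllegalStateException("START is only valid in NEW, not " + state);
            }
            if (isRecovered) {
                state = State.SUBMITTED;
            } else {
                store.storeApplication(appId);
                state = State.NEW_SAVING;
            }
        }

        State getState() {
            return state;
        }
    }

    public static void main(String[] args) throws IOException {
        InMemoryStore store = new InMemoryStore();

        // First submission: the app is stored and waits in NEW_SAVING.
        App fresh = new App("app_1", store);
        fresh.handleStart();
        System.out.println("fresh app: " + fresh.getState());

        // After a restart: the same app is recovered and must not be stored again.
        App recovered = new App("app_1", store);
        recover(recovered);
        recovered.handleStart();
        System.out.println("recovered app: " + recovered.getState());
    }

    private static void recover(App app) {
        app.recover();
    }
}
```

In this sketch, dropping the isRecovered check would make the second handleStart call hit the duplicate check in the store and fail with an IOException, which is the kind of failure item 4 relies on to expose the double save in a test.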

> Delayed store operations should not result in RM unavailability for app submission
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-514
>                 URL: https://issues.apache.org/jira/browse/YARN-514
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Zhijie Shen
>         Attachments: YARN-514.1.patch, YARN-514.2.patch, YARN-514.3.patch,
>                      YARN-514.4.patch, YARN-514.5.patch, YARN-514.6.patch
>
>
> Currently, app submission is the only store operation performed synchronously
> because the app must be stored before the request returns with success. This
> makes the RM susceptible to blocking all client threads on slow store
> operations, resulting in RM being perceived as unavailable by clients.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira