[jira] [Updated] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

Jian He (JIRA) Fri, 13 Sep 2013 12:37:14 -0700

     [ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jian He updated YARN-540:
-------------------------

    Attachment: YARN-540.8.patch

bq. why didn't this code in the previous patch cause an exception to be thrown 
for a normal job? 
Because I added a check in RMAppRemovingTransition instead of FinalTransition.

bq. Can the app crash while it's waiting to be unregistered? Will that generate 
an ATTEMPT_FAILED? Can the node crash and cause an ATTEMPT_FAILED? 
The AppAttempt is already in the FINISHING state whenever the App is in the 
REMOVING state. If the app crashes, the attempt receives a CONTAINER_FINISHED 
event and then goes to the FINISHED state.
If the node crashes, the attempt should receive an EXPIRE event and should go 
to the FINISHED state as well. 
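
To make the two cases above concrete, here is a minimal sketch of the attempt transitions as described. The type and method names (AttemptState, AttemptEvent, AttemptTransitions.next) are hypothetical and are not taken from the patch or from the real RMAppAttemptImpl. The RUNNING branch is only an assumption added for contrast.

```java
// Hypothetical, simplified model; not the actual RMAppAttemptImpl state machine.
enum AttemptState { RUNNING, FINISHING, FINISHED, FAILED }

enum AttemptEvent { UNREGISTERED, CONTAINER_FINISHED, EXPIRE }

final class AttemptTransitions {
    private AttemptTransitions() {}

    // Returns the state the attempt moves to when 'event' arrives in 'state'.
    static AttemptState next(AttemptState state, AttemptEvent event) {
        switch (state) {
            case RUNNING:
                // Assumption for contrast: before unregistering, a crash or
                // expiry is treated as a failure.
                return event == AttemptEvent.UNREGISTERED
                        ? AttemptState.FINISHING
                        : AttemptState.FAILED;
            case FINISHING:
                // As described above: once the AM has unregistered, both an
                // app crash (CONTAINER_FINISHED) and a node crash (EXPIRE)
                // end in FINISHED rather than FAILED.
                if (event == AttemptEvent.CONTAINER_FINISHED
                        || event == AttemptEvent.EXPIRE) {
                    return AttemptState.FINISHED;
                }
                return AttemptState.FINISHING;
            default:
                // FINISHED and FAILED are terminal in this sketch.
                return state;
        }
    }
}
```

In this sketch, an attempt that is already FINISHING never reaches FAILED.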

bq. We probably need to save the previous state and return that while the app 
is in REMOVING state.
Yes, I added a function that returns the previous state while the App is in the REMOVING state.
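
A rough sketch of that idea, for illustration only: the names below (AppState, AppStateTracker, reportedState) are hypothetical and do not come from the patch. It assumes the state held just before entering REMOVING is recorded and is the one reported to callers until the app leaves REMOVING.

```java
// Hypothetical illustration; not the RMAppImpl code from YARN-540.8.patch.
enum AppState { NEW, SUBMITTED, ACCEPTED, RUNNING, REMOVING, FINISHED }

final class AppStateTracker {
    private AppState current = AppState.NEW;
    // State held immediately before entering REMOVING; null until then.
    private AppState beforeRemoving = null;

    void transitionTo(AppState next) {
        // Remember where we came from the first time we enter REMOVING.
        if (next == AppState.REMOVING && current != AppState.REMOVING) {
            beforeRemoving = current;
        }
        current = next;
    }

    // The real internal state, including REMOVING.
    AppState internalState() {
        return current;
    }

    // The state exposed to callers: REMOVING is hidden behind the
    // previously saved state.
    AppState reportedState() {
        if (current == AppState.REMOVING && beforeRemoving != null) {
            return beforeRemoving;
        }
        return current;
    }
}
```

With this arrangement, a caller polling the app's state sees no new intermediate state while the removal is still being processed.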
                
> Race condition causing RM to potentially relaunch already unregistered AMs on 
> RM restart
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, 
> YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, 
> YARN-540.7.patch, YARN-540.8.patch, YARN-540.patch, YARN-540.patch
>
>
> When a job succeeds and successfully calls finishApplicationMaster, and the RM 
> is then shut down and restarted, the dispatcher may be stopped before it can 
> process the REMOVE_APP event. The next time the RM comes back, it reloads the 
> existing state files even though the job has succeeded.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
