[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-15 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767769#comment-13767769
 ] 

Hudson commented on YARN-540:
-

SUCCESS: Integrated in Hadoop-Yarn-trunk #333 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/333/])
YARN-540. Race condition causing RM to potentially relaunch already 
unregistered AMs on RM restart (Jian He via bikas) (bikas: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1523376)
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/FinishApplicationMasterResponse.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/AMRMClientImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/FinishApplicationMasterResponsePBImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/MemoryRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppEventType.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRemovedEvent.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppState.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/MockAsm.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/MockRMApp.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java


 Race condition causing RM to potentially relaunch already unregistered AMs on 
 RM restart
 

 Key: YARN-540
 URL: https://issues.apache.org/jira/browse/YARN-540
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Fix For:

[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-15 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767786#comment-13767786
 ] 

Hudson commented on YARN-540:
-

FAILURE: Integrated in Hadoop-Hdfs-trunk #1523 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1523/])
YARN-540. Race condition causing RM to potentially relaunch already 
unregistered AMs on RM restart (Jian He via bikas) (bikas: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1523376)



[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-15 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767802#comment-13767802
 ] 

Hudson commented on YARN-540:
-

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1549 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1549/])
YARN-540. Race condition causing RM to potentially relaunch already 
unregistered AMs on RM restart (Jian He via bikas) (bikas: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1523376)



[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-13 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13766274#comment-13766274
]

Hadoop QA commented on YARN-540:

{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12602971/YARN-540.7.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author
tags.

{color:green}+1 tests included{color}. The patch appears to include 3 new
or modified test files.

{color:red}-1 javac{color}. The patch appears to cause the build to
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1912//console

This message is automatically generated.

Race condition causing RM to potentially relaunch already unregistered AMs on
RM restart

Key: YARN-540
URL: https://issues.apache.org/jira/browse/YARN-540
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch,
YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch,
YARN-540.7.patch, YARN-540.patch, YARN-540.patch

When a job succeeds and successfully calls finishApplicationMaster, the RM may
shut down and restart, stopping the dispatcher before it can process the
REMOVE_APP event. The next time the RM comes back, it reloads the existing
state files even though the job has succeeded.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-13 Thread Bikas Saha (JIRA)

[
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13766290#comment-13766290
]

Bikas Saha commented on YARN-540:
-

bq. Delete throws exception in case of not-existing
If that is the case, then why didn't this code in the previous patch cause an
exception to be thrown for a normal job? It removes an app that should
already have been removed after unregister.
{code}
+ // application completely done and remove from state store.
+ // App state may be already removed during
+ // RMAppFinishingOrRemovingTransition.
+ RMStateStore store = app.rmContext.getStateStore();
+ store.removeApplication(app);
{code}

bq. it should not be possible to generate RMAppEventType.ATTEMPT_FAILED event
at that state
Can the app crash while it's waiting to be unregistered? Will that generate an
ATTEMPT_FAILED? Can the node crash and cause an ATTEMPT_FAILED? If yes, then
these would apply to the FINISHING state also.

bq. In case of REMOVING, return YARNApplicationState as RUNNING, makes sense?
In general, an app can also be removed while it's in the ACCEPTED state (kill
app after submission). These should also go through the REMOVING state. So it's
not necessary that the app state will always be RUNNING. We probably need to save
the previous state and return that while the app is in REMOVING state.
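
One way to picture the "save the previous state" idea is the sketch below. It is only an illustration under stated assumptions: `ToyAppStateTracker`, its `InternalState` enum, and both method names are hypothetical and are not taken from the patch or from RMAppImpl.

```java
// Hypothetical tracker: remembers the state held just before REMOVING so
// that callers never observe the internal REMOVING state directly.
final class ToyAppStateTracker {
  enum InternalState { NEW, ACCEPTED, RUNNING, REMOVING, FINISHED }

  private InternalState current = InternalState.NEW;
  private InternalState stateBeforeRemoving = InternalState.NEW;

  // Records the prior state only on the first entry into REMOVING.
  void moveTo(InternalState next) {
    if (next == InternalState.REMOVING && current != InternalState.REMOVING) {
      stateBeforeRemoving = current;
    }
    current = next;
  }

  // While REMOVING, report the saved state (ACCEPTED, RUNNING, ...) instead.
  InternalState externallyVisibleState() {
    return current == InternalState.REMOVING ? stateBeforeRemoving : current;
  }
}
```

With this shape, an app killed right after submission would keep reporting ACCEPTED while it is being removed, rather than a hardcoded RUNNING.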


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-13 Thread Bikas Saha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13766926#comment-13766926
 ] 

Bikas Saha commented on YARN-540:
-

bq. Because I added a check in RMAppRemovingTransition instead of 
FinalTransition
The check in RMAppRemovingTransition will pass in the normal case because the
app has unregistered and this is the first call to remove the app. Then, at the
end, when the app container exits, FinalTransition is called and there is no
check at that time. So removeApplication will be called a second time and the
delete will throw an exception. Is that not the flow?

 Race condition causing RM to potentially relaunch already unregistered AMs on 
 RM restart
 

 Key: YARN-540
 URL: https://issues.apache.org/jira/browse/YARN-540
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, 
 YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, 
 YARN-540.7.patch, YARN-540.8.patch, YARN-540.patch, YARN-540.patch



[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-13 Thread Jian He (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13766943#comment-13766943
 ] 

Jian He commented on YARN-540:
--

bq. Is that not the flow?
Yeah, I think I missed that in the previous patch. The previous patch should
throw an exception for a normal job.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767124#comment-13767124
 ] 

Hadoop QA commented on YARN-540:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12603143/YARN-540.9.patch
  against trunk revision .

{color:red}-1 patch{color}.  Trunk compilation may be broken.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1927//console

This message is automatically generated.

 Race condition causing RM to potentially relaunch already unregistered AMs on 
 RM restart
 

 Key: YARN-540
 URL: https://issues.apache.org/jira/browse/YARN-540
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, 
 YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, 
 YARN-540.7.patch, YARN-540.8.patch, YARN-540.9.patch, YARN-540.patch, 
 YARN-540.patch



[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767200#comment-13767200
 ] 

Hadoop QA commented on YARN-540:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12603153/YARN-540.9.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1928//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1928//console

This message is automatically generated.

 Race condition causing RM to potentially relaunch already unregistered AMs on 
 RM restart
 

 Key: YARN-540
 URL: https://issues.apache.org/jira/browse/YARN-540
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-540.10.patch, YARN-540.10.patch, YARN-540.1.patch, 
 YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, 
 YARN-540.6.patch, YARN-540.7.patch, YARN-540.7.patch, YARN-540.8.patch, 
 YARN-540.9.patch, YARN-540.9.patch, YARN-540.patch, YARN-540.patch



[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767256#comment-13767256
 ] 

Hadoop QA commented on YARN-540:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12603163/YARN-540.10.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1932//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1932//console

This message is automatically generated.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767269#comment-13767269
 ] 

Hadoop QA commented on YARN-540:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12603163/YARN-540.10.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1933//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1933//console

This message is automatically generated.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-12 Thread Bikas Saha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765187#comment-13765187
 ] 

Bikas Saha commented on YARN-540:
-


I think there is a member variable in AMRMClient that is used to get the ping 
interval from config. We could use that instead of the hardcoded 100. Sorry for 
not mentioning it earlier.
{code}
+LOG.info("Waiting for application to be successfully unregistered.");
+Thread.sleep(100);
{code}

Can we rename isAppRemovedFromStateStore() to isAppSafeToUnregister()? Then we 
can move the check for unmanagedAM within that method. This way we won't leak 
unmanagedAM outside RMAppImpl.

This transition is invalid and should not be ignored. It's a bug if it happens.
{code}
+// ignorable transitions
+.addTransition(RMAppState.REMOVING, RMAppState.REMOVING,
+RMAppEventType.ATTEMPT_UNREGISTERED)
{code}
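
To illustrate why an unexpected event is better surfaced than silently ignored, here is a minimal Java sketch of a strict transition table. The names StrictStateMachine, DemoState and DemoEvent are hypothetical, and the two registered transitions are a simplified assumption rather than the actual RMAppImpl state machine.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified states and events; not the real RMAppState/RMAppEventType.
enum DemoState { RUNNING, REMOVING, FINISHING }
enum DemoEvent { ATTEMPT_UNREGISTERED, APP_REMOVED }

final class StrictStateMachine {
    private final Map<DemoState, Map<DemoEvent, DemoState>> table =
        new EnumMap<DemoState, Map<DemoEvent, DemoState>>(DemoState.class);
    private DemoState current;

    StrictStateMachine(DemoState initial) {
        this.current = initial;
    }

    // Registers a legal transition; anything not registered is treated as a bug.
    StrictStateMachine addTransition(DemoState from, DemoState to, DemoEvent on) {
        table.computeIfAbsent(from, s -> new EnumMap<DemoEvent, DemoState>(DemoEvent.class))
             .put(on, to);
        return this;
    }

    // Applies the event, or fails loudly if no transition is registered for it.
    DemoState handle(DemoEvent event) {
        Map<DemoEvent, DemoState> row = table.get(current);
        DemoState next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }
}
```

With this shape, a second ATTEMPT_UNREGISTERED arriving in REMOVING raises an error instead of being absorbed by a self-transition.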

Shouldn't the app.isAppRemovalRequestSent flag be checked here, since this will 
typically happen after unregister has already removed the app? How is this 
working on a single-node cluster? Is delete not throwing an exception for a 
non-existent location?
{code}
+  // application completely done and remove from state store.
+  // App state may be already removed during RMAppFinishingOrRemovingTransition.
+  RMStateStore store = app.rmContext.getStateStore();
+  store.removeApplication(app);
{code}

What is the YARNApplicationState enum corresponding to AppState.REMOVING?

Is MockRMApp never expected to get removed from the store? I would have 
expected this to return true.
{code}
+  @Override
+  public boolean isAppRemovedFromStateStore() {
+return false;
{code}

Can RMAppEventType.ATTEMPT_FAILED be received when in REMOVING state (and also 
when in FINISHING state)?




[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-12 Thread Jian He (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766032#comment-13766032
 ] 

Jian He commented on YARN-540:
--

Thanks for the detailed comments. I have uploaded a new patch.

bq. there is a member variable in AMRMClient that is used to get the ping 
interval from config
It turns out we have one in AMRMClientAsync but not in AMRMClientImpl.
bq. Is delete not throwing an exception for non-existent location?
Delete throws an exception if the location does not exist.
bq. Can RMAppEventType.ATTEMPT_FAILED be received when in REMOVING state (and 
also when in FINISHING state)?
Once the app has moved to the REMOVING/FINISHING state, the attempt has already 
gone to the FINISHING state, so it should not be possible to generate an 
RMAppEventType.ATTEMPT_FAILED event in that state.
bq. What is the YARNApplicationState enum corresponding to AppState.REMOVING?
In the case of REMOVING, we return the YARNApplicationState as RUNNING. Does 
that make sense?

I have addressed the other comments as well.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-12 Thread Hadoop QA (JIRA)

[
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766040#comment-13766040
]

Hadoop QA commented on YARN-540:

{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12602910/YARN-540.7.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author
tags.

{color:green}+1 tests included{color}. The patch appears to include 3 new
or modified test files.

{color:red}-1 javac{color}. The patch appears to cause the build to
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1907//console

This message is automatically generated.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-11 Thread Jian He (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764033#comment-13764033
 ] 

Jian He commented on YARN-540:
--

There seems to be one more problem. If the RM crashes after the first 
unregister call happens but before the application state is cleaned from the 
store, the AM keeps retrying its connection to the RM, while JobClient has 
perhaps already returned SUCCEEDED. If the AM process is killed at this time, 
then when the RM comes back it will relaunch the job even though JobClient saw 
SUCCEEDED.



[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-11 Thread Bikas Saha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764045#comment-13764045
 ] 

Bikas Saha commented on YARN-540:
-

The solution is to not report success to user until services have stopped. I 
commented on MAPREDUCE-5497


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-11 Thread Jason Lowe (JIRA)

[
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764287#comment-13764287
]

Jason Lowe commented on YARN-540:
-

bq. The solution is to not report success to user until services have stopped.

Note that delaying reporting success to downstream consumers isn't always
possible, as success can be reported via other means than JobClient directly.
For example, the _SUCCESS file written as part of FileOutputCommitter's commit
processing indicates to others that the job succeeded. IIRC Oozie can poll for
this as part of determining whether a job succeeded. I suspect other
committers have their own methods of notifying downstream consumers that the
job succeeded. And we shouldn't be unregistering from the RM before committing.

As such I think there will always be races where the YARN and MR app states can
end up inconsistent because a job could notify others of success and then fail
before it can notify YARN. We may still want to delay reporting success to
JobClient, but I don't think it completely solves the issue.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-11 Thread Bikas Saha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764510#comment-13764510
 ] 

Bikas Saha commented on YARN-540:
-

What is the official API to determine whether the MR job is complete or not? If 
it's the job client, then we should probably make this change and make the 
official API better. 


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-11 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764689#comment-13764689
 ] 

Jason Lowe commented on YARN-540:
-

JobClient is the standard API.  I don't mean to imply we shouldn't try to 
improve that situation, rather that there are many out-of-band notifications in 
use and therefore fixing JobClient doesn't solve the problem in the general 
sense.

Job end notification (see mapreduce.job.end-notification.url) is another 
mechanism used to notify clients of job completion.  Currently this is done 
before unregistering, but we could move it to after unregistering.  The failure 
mode then changes such that an AM that crashes after unregistering but before 
notifying could end up never notifying a client because the RM would not retry. 
 However, job end notification is currently best-effort and not guaranteed, and 
most frameworks I'm familiar with that are using it have a polling fallback 
(via something like JobClient) in case the notification fails to arrive.
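
The notification-plus-polling-fallback pattern mentioned above could look roughly like the following Java sketch. JobCompletionWatcher, the future carrying the notification, and the pollStatus callback are all hypothetical stand-ins; nothing here is a real Hadoop or Oozie API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

final class JobCompletionWatcher {
    /**
     * Waits up to timeoutMs for a best-effort end notification. If none
     * arrives (or it fails), falls back to polling the job status once.
     */
    static boolean jobSucceeded(CompletableFuture<Boolean> notification,
                                BooleanSupplier pollStatus,
                                long timeoutMs) {
        try {
            return notification.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            // Notification is not guaranteed, so ask the authoritative source.
            return pollStatus.getAsBoolean();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return pollStatus.getAsBoolean();
        }
    }
}
```

A client written this way still gets an answer if the AM crashes between unregistering and notifying, at the cost of one extra status query.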


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-09 Thread Bikas Saha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762268#comment-13762268
 ] 

Bikas Saha commented on YARN-540:
-

getIsUnregistered() instead of getUnregistered() ?

The wording here could be clearer and more grammatical, e.g. "is expected to 
retry until this flag becomes true".
{code}
Note: This flag only needed for RM recovery purpose. If RM recovery is
+ * enabled, user is expected to retry this flag until it becomes true.
{code}

Why is the default false? 
{code}
+  optional bool unregistered = 1 [default = false];
{code}

The sleep value is very large; how about 100ms? The log statement should come 
before the sleep.
{code}
+  while (true) {
+    FinishApplicationMasterResponse response =
+        rmClient.finishApplicationMaster(request);
+    if (response.getUnregistered()) {
+      break;
+    }
+    Thread.sleep(1000);
+    LOG.info("Waiting for application to be successfully unregistered.");
{code}
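
A minimal Java sketch of the loop shape suggested here, with the log emitted before the sleep and the interval passed in rather than hardcoded. UnregisterRetrySketch and its BooleanSupplier parameter are hypothetical stand-ins for the real client call, which is assumed to report true once the RM considers the application unregistered.

```java
import java.util.function.BooleanSupplier;

final class UnregisterRetrySketch {
    /**
     * Calls finishApplicationMaster until it reports success.
     * Returns the number of attempts that were needed.
     */
    static int unregisterWithRetry(BooleanSupplier finishApplicationMaster,
                                   long retryIntervalMs) {
        int attempts = 0;
        while (true) {
            attempts++;
            if (finishApplicationMaster.getAsBoolean()) {
                return attempts;
            }
            // Log first, so the message appears as soon as the wait begins.
            System.out.println("Waiting for application to be successfully unregistered.");
            try {
                Thread.sleep(retryIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting to unregister", e);
            }
        }
    }
}
```

Taking the interval as a parameter leaves the caller free to supply a configured value.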

Instead of checking for an exact state, I think we should check for all 
terminal states of an RMApp. This will make the code more resilient to future 
changes in the state machines. So we check for FINISHING, FINISHED, FAILED, 
KILLED. This will also allow us to not special-case the unmanaged AM in the 
latter half of the same function. Also, this is open to race conditions, e.g. 
someone kills the app before the app is removed from the store. We should 
probably make this an RMApp method like RMApp.isAppRemovedFromStore(). In this 
method we can either check the state or some boolean that we can set when the 
App_Removed event comes.
{code}
+  // Application state has been removed from RMStateStore, if it's in
+  // FINISHING state
+  if (rmContext.getRMApps().get(applicationAttemptId.getApplicationId())
+.getState().equals(RMAppState.FINISHING)) {
+return FinishApplicationMasterResponse.newInstance(true);
+  }
{code}
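
The set-based check suggested above might be sketched as follows. AppState is a hypothetical, trimmed-down stand-in for RMAppState, and AppStateCheck is an illustrative holder class, not part of the patch.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical subset of states; the real RMAppState enum has more.
enum AppState { NEW, RUNNING, FINISHING, FINISHED, FAILED, KILLED }

final class AppStateCheck {
    // States listed in the review comment as safe to treat as "done".
    static final Set<AppState> DONE_STATES =
        EnumSet.of(AppState.FINISHING, AppState.FINISHED, AppState.FAILED, AppState.KILLED);

    // True if the app is in any of the listed states, not just one exact state.
    static boolean isAppRemovedFromStore(AppState state) {
        return DONE_STATES.contains(state);
    }
}
```

Keeping the set in one place means a newly added terminal state needs a one-line change instead of another equals() comparison at each call site.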

Is there a version of delete that will not fail if the file does not exist? Or 
we can have a boolean in RMApp to show that the removal request has already 
been sent, and not send it multiple times. Let's try to avoid 2 remote HDFS 
calls in the common case.
{code}
+if(!fs.exists(deletePath))
+  return;
 if(!fs.delete(deletePath, true)) {
{code}
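
One way to read the "boolean in RMApp" suggestion is a send-once guard, sketched below in Java. RemovalGuard and its Runnable hook are hypothetical; the hook stands in for whatever issues the remote delete.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class RemovalGuard {
    private final AtomicBoolean removalRequestSent = new AtomicBoolean(false);
    private final Runnable removeFromStore; // hypothetical hook for the remote delete

    RemovalGuard(Runnable removeFromStore) {
        this.removeFromStore = removeFromStore;
    }

    /** Issues the delete at most once; returns true only for the call that issued it. */
    boolean requestRemoval() {
        if (removalRequestSent.compareAndSet(false, true)) {
            removeFromStore.run();
            return true;
        }
        return false;
    }
}
```

This avoids the extra exists() round trip in the common case. Note that the sketch does not reset the flag if the delete itself fails, so a real implementation would need to decide how to handle that.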






[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762581#comment-13762581
 ] 

Hadoop QA commented on YARN-540:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12602243/YARN-540.6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1886//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1886//console

This message is automatically generated.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-09 Thread Bikas Saha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13762599#comment-13762599
 ] 

Bikas Saha commented on YARN-540:
-

bq. Exception because unManagedAM attempt will be immediately removed from the 
responseMap
I haven't looked at the patch yet, but this sounds like a race condition 
waiting to happen in other cases. Let's say the first unregister returns false. 
Now someone kills the app and the app goes through the transition that removes 
it from the responseMap. Now if the AM comes back with the second unregister, 
should it fail or succeed?

The key question here is whether an AM is done after it calls unregister. If 
the unregister fails, is the AM expected to consider itself failed, or to 
continue as if it has succeeded?


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-07 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13760939#comment-13760939
 ] 

Hadoop QA commented on YARN-540:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12601966/YARN-540.5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1871//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1871//console

This message is automatically generated.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-06 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13760896#comment-13760896
 ] 

Hadoop QA commented on YARN-540:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12601956/YARN-540.4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.yarn.client.api.impl.TestNMClient

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1870//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1870//console

This message is automatically generated.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-05 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759081#comment-13759081
 ] 

Jason Lowe commented on YARN-540:
-

bq. Once work-preserving restart is implemented, this jira should not be a 
problem as there's no notion of relaunching a new AM in work-preserving 
restart, the old AM will just spin and resync with RM after RM restarts.

I'm still a bit confused as to why work-preserving restart matters here.  Most 
AMs are simply going to clean up and leave after unregistering with the RM, 
since that's normally a terminal call for the AM-RM protocol.  Only if AMs are 
now required to poll as described in (2) does work-preserving restart seem to 
help here, but that relies on a behavior change in the AM.  Is that behavior 
change being implemented in the YARN API layer?


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-05 Thread Jian He (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759268#comment-13759268
 ] 

Jian He commented on YARN-540:
--

bq.  Is that behavior change being implemented in the YARN API layer?
IMHO, for work-preserving restart, after the RM comes back, it should be able 
to accept the old AM as normal instead of asking the AM to reboot or making the 
NM kill the AM container (which currently happens). Then on the RM side, AM 
unregistration happens just like a normal unregistration, even though the RM 
had restarted. 


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-05 Thread Jian He (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759272#comment-13759272
 ] 

Jian He commented on YARN-540:
--

For that, we also don't need a behavior change on the AM side, since everything 
proceeds normally after the RM restarts.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-05 Thread Vinod Kumar Vavilapalli (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759400#comment-13759400
 ] 

Vinod Kumar Vavilapalli commented on YARN-540:
--

Yes, we missed that. We either
 - do a blocking persistence to the state-store *during* the 
finishApplicationMaster call
 - or do the right thing and persist asynchronously, and treat 
finishApplicationMaster as complete only when the RM returns a success state, 
i.e. make the behaviour and API change now. Offline, I was trying to avoid this 
change, but it doesn't look like we can skip it.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-05 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759395#comment-13759395
 ] 

Jason Lowe commented on YARN-540:
-

Ah, if the NM can notify the RM after the RM restarts that the AM container 
exited, then that would pretty much fix it.  We'd only have an issue if the NM 
went down at the same time the RM did.  I'm still a bit unclear on the 
specifics of how the RM recovers the container states in work-preserving 
restart, but assuming that upon RM recovery/re-registration the NMs report not 
only active containers but also those that have exited since the last 
successful heartbeat, we should be OK.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-05 Thread Vinod Kumar Vavilapalli (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759401#comment-13759401
 ] 

Vinod Kumar Vavilapalli commented on YARN-540:
--

Hehe, race conditions :)

bq. Ah, after the RM restarts, the NM can notify the RM that the AM container 
exited then that would pretty much fix it.
Interesting. Then the RM will also need to *somehow* remember that the 
unregister came in but the state-store app removal isn't done, which is not 
possible without more state-store writes?


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-05 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759384#comment-13759384
 ] 

Jason Lowe commented on YARN-540:
-

Unless I'm missing something, it does require a behavior change on the AM to 
recover.  Here's the scenario:

# AM unregisters with RM, RM asynchronously schedules removal of the app from 
the store but returns from the call before this completes
# RM crashes before the app is removed from the persistent state store
# AM proceeds to clean up, remove the staging directory, and exit (i.e.: no 
behavior change from what AMs do today after unregistering)
# RM restarts with the persistent state store showing the app as running (i.e.: 
it missed the fact that it unregistered)
# Without work-preserving restart, the RM will try to launch a new app attempt, 
but the attempt (and therefore the app) will be reported as failed because 
there's no staging directory.  With work-preserving restart, it will wait up to 
the AM expiry interval for the original attempt to report in, then launch a new 
attempt to try to recover, which fails and ultimately fails the app.

I don't see how the old AM is going to report back into the RM after 
unregistering without a behavior change on the AM side.  Normally AMs clean up 
and exit shortly after unregistering without trying to report back to the RM.  
This change narrows the race condition window, but the window can be larger 
than expected if the state store dispatcher is running behind because of a slow 
store backend.
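
The scenario above can be sketched as a small, deterministic toy model. This is 
not YARN code: ToyRM, ToyStore and the "app_1" id are hypothetical names, and 
the in-memory queue merely stands in for the state store dispatcher. It only 
illustrates why returning from unregister before the store removal completes 
leaves the app visible to a restarted RM.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Toy model of the race; none of these names are real YARN classes. */
final class UnregisterRaceSketch {
    /** Stands in for the persistent state store; survives a simulated RM crash. */
    static final class ToyStore {
        final Set<String> apps = new HashSet<>();
    }

    /** Stands in for the RM; the pending queue is in-memory and lost on a crash. */
    static final class ToyRM {
        private final ToyStore store;
        private final Queue<String> pendingRemovals = new ArrayDeque<>();

        ToyRM(ToyStore store) {
            this.store = store;
        }

        /** Step 1: queue the removal and return before it is applied. */
        void unregister(String appId) {
            pendingRemovals.add(appId);
        }

        /** What the store dispatcher would do if it got the chance to run. */
        void drainDispatcher() {
            String appId;
            while ((appId = pendingRemovals.poll()) != null) {
                store.apps.remove(appId);
            }
        }

        /** After a restart, anything still in the store is treated as running. */
        Set<String> appsToRecover() {
            return new HashSet<>(store.apps);
        }
    }

    /** Runs the scenario; returns the apps a restarted RM would try to recover. */
    static Set<String> run(boolean crashBeforeDispatch) {
        ToyStore store = new ToyStore();
        store.apps.add("app_1");
        ToyRM rm = new ToyRM(store);
        rm.unregister("app_1");             // the AM believes it is done
        if (!crashBeforeDispatch) {
            rm.drainDispatcher();           // the removal reaches the store
        }
        ToyRM restarted = new ToyRM(store); // crash, then restart on the same store
        return restarted.appsToRecover();
    }
}
```

With crashBeforeDispatch set to true the restarted toy RM still sees app_1 in 
the store; with false the store is empty and nothing is relaunched.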


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-05 Thread Jason Lowe (JIRA)

[
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759412#comment-13759412
]

Jason Lowe commented on YARN-540:
-

bq. Then RM will also need to somehow remember that unregister came in but the
state-store app removal isn't done. Which is not possible without more
state-store writes?

Argh, right I forgot. It will simply see the container exit but not understand
the context of that exit and misinterpret it as a crash and recover scenario.
Darn, I thought we had it. :-)

I think the existing unregister call should be blocking from the AM's
perspective, as that's the simplest and most-compatible way to fix it. We
could always add an asynchronous form of that API later. If most AMs are
expected to communicate through a wrapper layer where we can hide this behavior
then that's probably fine too -- RM and low-level API could be async but most
AMs still see it as a blocking call.

Part of the issue with making it async is that at some point we need to have
some flow control. If apps are churning faster than we can persist them then
there are going to be issues (backup of store dispatcher queue, etc.). At some
point we have to block something.
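
The flow-control point can be illustrated with a bounded queue. This is only a
sketch under assumptions: BoundedStoreQueueSketch and its methods are
hypothetical names, and the real store dispatcher queue may be structured quite
differently. It shows that once the backlog hits its bound, a submitter must
either be told "not yet" or be made to wait.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Toy bounded store queue; names are illustrative, not YARN APIs. */
final class BoundedStoreQueueSketch {
    private final BlockingQueue<String> pending;

    BoundedStoreQueueSketch(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    /** Non-blocking submit: returns false when the store is running behind. */
    boolean trySubmit(String appId) {
        return pending.offer(appId);
    }

    /** Blocking submit: the caller waits until the store catches up. */
    void submitBlocking(String appId) throws InterruptedException {
        pending.put(appId);
    }

    /** Simulates the store finishing one write; returns null if idle. */
    String completeOne() {
        return pending.poll();
    }

    /** Number of writes still waiting for the store. */
    int backlog() {
        return pending.size();
    }
}
```

With a capacity of 2, a third trySubmit is refused until completeOne frees a
slot, which is the point at which something has to block or retry.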


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-05 Thread Bikas Saha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759453#comment-13759453
 ] 

Bikas Saha commented on YARN-540:
-

Most apps will not see the async API since they will use AMRMClient. The MR AM 
will see the async API since it uses the low-level API. Hence, we only need to 
change the MR AM now. I would avoid making the call blocking now and 
non-blocking later on, since that's incompatible. In any case, it is orthogonal 
to solving the problem, which is the RM crashing before being able to remove 
the app from the store.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-04 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13757839#comment-13757839
 ] 

Jason Lowe commented on YARN-540:
-

Sorry for arriving late, but why wouldn't we want to implement choice (1) 
above?  (i.e.: block until store confirms app state is removed).  From an AM's 
perspective, that's the simplest solution.  Returning control to the AM early 
from the unregister is inviting the AM to do bad things wrt. a potential 
restart (e.g.: MR AM will remove its staging directory, effectively preventing 
the restart from succeeding and leading the RM to believe the app failed).  The 
unregister call is a terminal call in the AM-RM protocol, so I think it's 
appropriate for that to not return until the app truly is unregistered.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-04 Thread Bikas Saha (JIRA)

[
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13758037#comment-13758037
]

Bikas Saha commented on YARN-540:
-

1) and 2) are basically the same thing. 1) will block the unregister call until
it succeeds. 2) requires the AM to keep looping on unregister until it
succeeds. 2) just enables the RM to perform the store operation asynchronously
and prevents RPC threads from getting blocked.
The core issue is that the RM can crash before removing the app from the store.
Thus when it restarts it thinks that the app is still running and tries to
re-launch it. This is the core issue in this jira and should be a rare event.
The MR app master sleeps for 5s before unregistering with the RM and reports
success to the client in the meantime. This exacerbates the above rare issue
and makes it possible to repro it more often.
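
As a rough illustration of why 1) and 2) end in the same condition, here is a
minimal sketch. UnregisterOptionsSketch and its method names are hypothetical
and are not the real ApplicationMasterService; a CompletableFuture stands in
for "the store has confirmed removal".

```java
import java.util.concurrent.CompletableFuture;

/** Toy RM contrasting the two options; not the real ApplicationMasterService. */
final class UnregisterOptionsSketch {
    // Completed by the (simulated) store once the app has been removed.
    private final CompletableFuture<Void> removal = new CompletableFuture<>();

    /** Option 1: the call itself waits for the store, tying up the calling thread. */
    boolean unregisterBlocking() {
        removal.join();
        return true;
    }

    /** Option 2: the call returns at once and reports whether removal is done. */
    boolean unregisterPolling() {
        return removal.isDone();
    }

    /** Simulates the state store confirming the removal. */
    void storeConfirmsRemoval() {
        removal.complete(null);
    }
}
```

In both variants the AM only learns "true" after the store has confirmed; the
difference is whether the waiting happens inside the RPC or in the caller's
retry loop.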


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-04 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13758096#comment-13758096
 ] 

Jason Lowe commented on YARN-540:
-

Yes, I realize that 1) and 2) are at a high level accomplishing the same thing. 
 However 2) requires cooperation from the AM which is user code and therefore 
harder to control while 1) does not.  There is the issue of RPC threads getting 
blocked which may necessitate 2), but otherwise 1) would be preferable since it 
requires less cooperation/coordination with the AMs.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-04 Thread Jian He (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13758611#comment-13758611
 ] 

Jian He commented on YARN-540:
--

We finally came to the conclusion to call removeApplicationState immediately 
after the attempt unregisters. This, combined with MAPREDUCE-5497, can 
significantly reduce the race here. Once work-preserving restart is 
implemented, this jira should not be a problem, as there's no notion of 
relaunching a new AM in work-preserving restart; the old AM will just spin and 
resync with the RM after the RM restarts.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-04 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13758651#comment-13758651
 ] 

Hadoop QA commented on YARN-540:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12601527/YARN-540.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1838//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1838//console

This message is automatically generated.


[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-08-25 Thread Jian He (JIRA)

[
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749734#comment-13749734
]

Jian He commented on YARN-540:
--

bq. What will happen if the RM failed after deleting the app from the store but
before the app pulled that information from the RM?
The app will not fail, because the RM unregister ignores any exceptions coming
from finishApp(). JobClient can also get the final status of the app regardless
of whether finishApp() fails or not.
bq. The state transitions are asynchronous. We cannot expect to always find the
app in the FINISHING state.
Given the currently implemented state transitions, FINISHING is the only state
after the unregister call in which we can reliably say the app has been removed
from the state store. Tell me if I missed something.
bq. Can the application finish on the RM (in between 2 finishApp() requests)
such that it never gets a true response?
The application will not go to the FINISHED state unless the AM process exits
or the AM expires. So I think it can reliably get the true response as long as
the RM is available.
bq. Is this possible to avoid 2 round trips to store?
Are you asking whether the following code can handle duplicate APP_REMOVE
events?
bq. There is no need for multiple code paths/transitions.
I did in fact notice this while writing the patch; the intention was to avoid
the unnecessary round trip to the RMStateStore. Thoughts?

Agree with the other comments; will post a new patch soon.

Race condition causing RM to potentially relaunch already unregistered AMs on
RM restart

Key: YARN-540
URL: https://issues.apache.org/jira/browse/YARN-540
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
Priority: Blocker
Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.patch,
YARN-540.patch

[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-08-24 Thread Bikas Saha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749516#comment-13749516
 ] 

Bikas Saha commented on YARN-540:
-

What will happen if the RM failed after deleting the app from the store but 
before the app pulled that information from the RM? I think the RM is not going 
to recognize that app after restarting and will return an exception in response 
to the finishApplicationMaster call from the running app. The app will fail. 
And then it will not be re-started by the RM.

Comments about the patch itself.

Why are we sleeping before checking the value?
{code}
+  do {
+response = scheduler.finishApplicationMaster(request);
+Thread.sleep(rmPollInterval);
+  } while (response.getIsRemovedFromRMStateSore());
{code}
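
One way to address the sleep-before-check question is to test the response 
first and sleep only when a retry is needed. The following is a sketch under 
assumptions, not the patch's code: UnregisterPollSketch and 
pollUntilUnregistered are hypothetical names, and the BooleanSupplier stands in 
for a single unregister call that reports whether the RM considers the app 
unregistered.

```java
import java.util.function.BooleanSupplier;

/** Sketch of a poll loop that checks the response before sleeping. */
final class UnregisterPollSketch {
    /**
     * Calls unregisterOnce until it reports true, sleeping only between
     * unsuccessful attempts. Returns the number of calls made.
     */
    static int pollUntilUnregistered(BooleanSupplier unregisterOnce, long pollIntervalMs) {
        int calls = 0;
        while (true) {
            calls++;
            if (unregisterOnce.getAsBoolean()) {
                return calls;             // done: no trailing sleep
            }
            try {
                Thread.sleep(pollIntervalMs); // back off only when a retry is needed
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting to unregister", e);
            }
        }
    }
}
```

If the first call already succeeds, the loop returns immediately without 
paying the poll interval.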

The state transitions are asynchronous. We cannot expect to always find the app 
in the FINISHING state.
{code}
+  if (RMAppState.FINISHING.equals(rmContext.getRMApps()
+.get(applicationAttemptId.getApplicationId()).getState())) {
+return FinishApplicationMasterResponse.newInstance(true);
+  }
{code}
Can the application finish on the RM (in between 2 finishApp() requests) such 
that it never gets a true response?

RMAppEventType.ATTEMPT_FINISHING should be renamed to ATTEMPT_UNREGISTERED in a 
different jira.

store.removeApplication() should be in the RMAppImpl transitions (AppRemoving 
and FinalTransition) instead of ApplicationMasterService and RMAppManager.

Can we pick a name that does not expose class names and impl details? eg. 
isUnregistered()
{code}
getIsRemovedFromRMStateSore()
{code}

Is it possible to avoid 2 round trips to the store?
{code}
+if(!fs.exists(deletePath))
+  return;
 if(!fs.delete(deletePath, true)) {
   throw new Exception("Failed to delete " + deletePath);
{code}

There is no need for multiple code paths/transitions. It should always go from 
RUNNING->APP_REMOVING. Please look at NEW->NEW_SAVING. When recovery is not 
enabled we use the NullRMStateStore to ensure that the main code path in the RM 
remains the same.
{code}
+  private static final class RMAppFinishingOrRemovingTransition  implements
+  MultipleArcTransition<RMAppImpl, RMAppEvent, RMAppState> {
+@Override
+public RMAppState transition(RMAppImpl app, RMAppEvent event) {
+  boolean isRecoveryEnabled =
+  app.conf.getBoolean(YarnConfiguration.RECOVERY_ENABLED,
+YarnConfiguration.DEFAULT_RM_RECOVERY_ENABLED);
+  if (isRecoveryEnabled) {
+LOG.info("Removing application with id " + app.applicationId);
+app.rmContext.getStateStore().removeApplication(app);
+return RMAppState.APP_REMOVING;
+  } else
+new RMAppFinishingTransition().transition(app, event);
+return RMAppState.FINISHING;
{code}

Why add the APP_?
{code}
   RUNNING,
+  APP_REMOVING,
{code}


