[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631205#comment-14631205 ] Arun Suresh commented on YARN-3535:
---
+1, committing this shortly. Thanks to everyone involved.

> ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-3535
>                 URL: https://issues.apache.org/jira/browse/YARN-3535
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, fairscheduler, resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>            Assignee: Peng Zhang
>            Priority: Critical
>             Fix For: 2.8.0
>
>         Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
> During a rolling update of the NM, the AM's start of a container on the NM failed, and then the job hung there. AM logs are attached.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630861#comment-14630861 ] Hadoop QA commented on YARN-3535:
---
| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 18s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. |
| {color:green}+1{color} | javac | 7m 48s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 39s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 0m 47s | The applied patch generated 5 new checkstyle issues (total was 337, now 342). |
| {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 24s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 51m 21s | Tests passed in hadoop-yarn-server-resourcemanager. |
| | | 89m 43s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745756/0006-YARN-3535.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / ee36f4f |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8568/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8568/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8568/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8568/console |

This message was automatically generated.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629452#comment-14629452 ] zhihai xu commented on YARN-3535:
---
Also, because {{containerCompleted}} and {{pullNewlyAllocatedContainersAndNMTokens}} are synchronized, it is guaranteed that if the AM has already pulled the container, the {{ContainerRescheduledEvent}} ({{recoverResourceRequestForContainer}}) won't be called.
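The mutual-exclusion argument above can be sketched as follows. This is an illustrative toy, not the real SchedulerApplicationAttempt code; the class and method names only mirror the ones discussed. Because both methods hold the same intrinsic lock, a container is either pulled by the AM or detected as completed-before-pull, never both:

```java
import java.util.ArrayList;
import java.util.List;

public class SyncSketch {
    static class AppAttempt {
        private final List<String> newlyAllocated = new ArrayList<>();

        AppAttempt() {
            newlyAllocated.add("container_1"); // one ALLOCATED container
        }

        // Models pullNewlyAllocatedContainersAndNMTokens: the AM drains
        // everything that is still waiting to be pulled.
        synchronized List<String> pullNewlyAllocated() {
            List<String> pulled = new ArrayList<>(newlyAllocated);
            newlyAllocated.clear();
            return pulled;
        }

        // Models containerCompleted: returns true only if the AM had NOT
        // pulled the container yet, i.e. the ResourceRequest must be recovered.
        synchronized boolean containerCompleted(String id) {
            return newlyAllocated.remove(id);
        }
    }

    public static void main(String[] args) {
        AppAttempt app = new AppAttempt();
        // Container is killed before the AM pulls it: recovery is needed,
        // and the subsequent pull sees nothing.
        boolean needsRecovery = app.containerCompleted("container_1");
        List<String> pulled = app.pullNewlyAllocated();
        System.out.println(needsRecovery + " " + pulled.size()); // true 0
    }
}
```

Since both methods synchronize on the same object, the two outcomes cannot interleave, which is the guarantee the comment relies on.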
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629411#comment-14629411 ] Sunil G commented on YARN-3535:
---
Thank you [~peng.zhang] and [~asuresh] for correcting me.
bq. that notification will happen only AFTER the recoverResourceRequest has completed.. since it will be handled by the same dispatcher
Yes, I missed this. The ordering will be correct here.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629394#comment-14629394 ] Arun Suresh commented on YARN-3535:
---
bq. I think recoverResourceRequest will not be affected by whether the container-finished event is processed faster, because recoverResourceRequest only processes the ResourceRequest in the container and does not care about its state.
I agree with [~peng.zhang] here. IIUC, {{recoverResourceRequest}} only affects the state of the Scheduler and the SchedulerApp. In any case, the fact that the container is killed (the outcome of the {{RMAppAttemptContainerFinishedEvent}} fired by {{FinishedTransition#transition}}) will be notified to the Scheduler, and that notification will happen only AFTER {{recoverResourceRequest}} has completed, since it is handled by the same dispatcher.
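The "same dispatcher" ordering argument can be illustrated with a minimal sketch. This is not the real YARN AsyncDispatcher, just a stand-in showing the property being relied on: a single consumer thread draining one queue handles events strictly in the order they were posted, so the recover event is always processed before the finished notification:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DispatcherSketch {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> handled = new ArrayList<>();

        // Single dispatcher thread: strict FIFO handling of posted events.
        Thread dispatcher = new Thread(() -> {
            try {
                handled.add(queue.take());
                handled.add(queue.take());
            } catch (InterruptedException ignored) {
            }
        });
        dispatcher.start();

        queue.put("CONTAINER_RESCHEDULED"); // triggers recoverResourceRequest
        queue.put("CONTAINER_FINISHED");    // notifies scheduler of the kill
        dispatcher.join();

        System.out.println(String.join(",", handled));
    }
}
```

Because there is only one consumer thread, no reordering is possible, which is why the kill notification cannot overtake the recovery.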
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629369#comment-14629369 ] Peng Zhang commented on YARN-3535:
---
bq. there are chances that recoverResourceRequest may not be correct.
Sorry, I didn't catch this; maybe I missed something. I think {{recoverResourceRequest}} will not be affected by whether the container-finished event is processed faster, because {{recoverResourceRequest}} only processes the ResourceRequest in the container and does not care about its state.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629348#comment-14629348 ] Sunil G commented on YARN-3535:
---
Hi [~rohithsharma] and [~peng.zhang]
After seeing this patch, I feel there may be a synchronization problem. Please correct me if I am wrong. In the ContainerRescheduledTransition code, it is used like
{code}
+      container.eventHandler.handle(new ContainerRescheduledEvent(container));
+      new FinishedTransition().transition(container, event);
{code}
Here the ContainerRescheduledEvent is fired to the scheduler dispatcher, which will process {{recoverResourceRequestForContainer}} in a separate thread. Meanwhile, in RMContainerImpl, {{FinishedTransition().transition}} will be invoked and the closure of this container will be processed. If the scheduler dispatcher is slower in processing due to its pending event queue length, there is a chance that recoverResourceRequest may not be correct. I feel we can introduce a new event in {{RMContainerImpl}} to move from ALLOCATED to WAIT_FOR_REQUEST_RECOVERY, and the scheduler can fire an event back to {{RMContainerImpl}} indicating that recovery of the resource request has completed. That event can then move the state forward to KILLED in {{RMContainerImpl}}. Please share your thoughts.
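The handshake proposed above could look roughly like the following sketch. The state and method names are hypothetical (this is not actual RMContainerImpl code, whose transitions take RMContainerImpl/RMContainerEvent arguments): the container parks in a waiting state until the scheduler acknowledges that the ResourceRequest was recovered, and only then moves to KILLED:

```java
public class RecoveryHandshakeSketch {
    enum State { ALLOCATED, WAIT_FOR_REQUEST_RECOVERY, KILLED }

    static class Container {
        State state = State.ALLOCATED;

        // RMContainerImpl asks the scheduler to recover the request and
        // waits instead of finishing immediately.
        void onKillWhileAllocated() {
            state = State.WAIT_FOR_REQUEST_RECOVERY;
        }

        // The scheduler fires an ack event back once recovery is done;
        // only then does the container reach its terminal state.
        void onRequestRecovered() {
            if (state == State.WAIT_FOR_REQUEST_RECOVERY) {
                state = State.KILLED;
            }
        }
    }

    public static void main(String[] args) {
        Container c = new Container();
        c.onKillWhileAllocated();
        System.out.println(c.state); // WAIT_FOR_REQUEST_RECOVERY
        c.onRequestRecovered();
        System.out.println(c.state); // KILLED
    }
}
```

The extra round trip removes the race: KILLED is unreachable until the recovery has actually happened.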
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629300#comment-14629300 ] Hadoop QA commented on YARN-3535:
---
| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 14s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. |
| {color:green}+1{color} | javac | 7m 44s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 41s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 0m 46s | The applied patch generated 5 new checkstyle issues (total was 338, now 343). |
| {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 22s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 51m 30s | Tests passed in hadoop-yarn-server-resourcemanager. |
| | | 89m 45s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745572/0005-YARN-3535.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 3ec0a04 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8554/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8554/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8554/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8554/console |

This message was automatically generated.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629296#comment-14629296 ] zhihai xu commented on YARN-3535:
---
Sorry for coming late into this issue. The latest patch looks good to me except one nit: can we make {{ContainerRescheduledTransition}} a child class of {{FinishedTransition}}, similar to {{KillTransition}}? Then we can call {{super.transition(container, event);}} instead of {{new FinishedTransition().transition(container, event);}}. I think this will make the code more readable and match the other transition class implementations.
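A minimal sketch of that suggestion, with deliberately simplified signatures (the real transitions operate on RMContainerImpl and RMContainerEvent): subclassing lets the rescheduled case fire its extra event and then reuse the parent's finish logic via {{super.transition()}}, instead of instantiating a throwaway {{FinishedTransition}}:

```java
public class TransitionSketch {
    static class FinishedTransition {
        // Common closure logic shared by all finishing transitions.
        public void transition(StringBuilder container, String event) {
            container.append("finished");
        }
    }

    // Mirrors the KillTransition pattern: extend and call super.
    static class ContainerRescheduledTransition extends FinishedTransition {
        @Override
        public void transition(StringBuilder container, String event) {
            container.append("rescheduled+"); // fire ContainerRescheduledEvent
            super.transition(container, event); // then reuse finish handling
        }
    }

    public static void main(String[] args) {
        StringBuilder container = new StringBuilder();
        new ContainerRescheduledTransition().transition(container, "KILL");
        System.out.println(container); // rescheduled+finished
    }
}
```

Beyond readability, the subclass form guarantees the finish logic runs exactly once and stays in sync if {{FinishedTransition}} later changes.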
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629253#comment-14629253 ] Arun Suresh commented on YARN-3535:
---
The patch looks good! Thanks for working on this, [~peng.zhang] and [~rohithsharma].
+1, pending a successful Jenkins run with the latest patch.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629249#comment-14629249 ] Arun Suresh commented on YARN-3535:
---
I meant for the FairScheduler... but it looks like your new patch has it... thanks.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629208#comment-14629208 ] Peng Zhang commented on YARN-3535:
---
Thanks [~rohithsharma] for updating the patch. The patch LGTM.
bq. One point to be clear: the assumption made here is that the ResourceRequest is recovered only if the RMContainer is in ALLOCATED.
While running in our large-scale cluster with the FairScheduler and preemption enabled, MapReduce apps work well under this assumption. Basically, I think this assumption makes sense for other types of apps as well.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629168#comment-14629168 ] Rohith Sharma K S commented on YARN-3535:
---
One point to be clear: the assumption made here is that the ResourceRequest is recovered only if the RMContainer is in ALLOCATED. If the RMContainer is RUNNING, the completed container will go to the AM in the allocate response and the AM will ask for a new ResourceRequest.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629157#comment-14629157 ] Rohith Sharma K S commented on YARN-3535:
---
ahh, right.. it can be removed.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629144#comment-14629144 ] Rohith Sharma K S commented on YARN-3535:
---
Yes, {{TestCapacityScheduler#testRecoverRequestAfterPreemption}} simulates this.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628722#comment-14628722 ] Arun Suresh commented on YARN-3535:
---
Also... is it possible to simulate the two cases in the test case?
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628688#comment-14628688 ] Arun Suresh commented on YARN-3535:
---
bq. the resource request was recovered twice: 1. by this jira's fix, and 2. by the Kill Container event in CS. So removing recoverResourceRequestForContainer(cont); makes sense to me.
Any reason why we don't remove {{recoverResourceRequestForContainer}} from the {{warnOrKillContainer}} method in the FairScheduler? Won't the above situation happen in the FS as well?
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627916#comment-14627916 ] Hadoop QA commented on YARN-3535:
---
| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 8s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. |
| {color:green}+1{color} | javac | 7m 39s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 10m 3s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 0m 46s | The applied patch generated 3 new checkstyle issues (total was 337, now 340). |
| {color:red}-1{color} | whitespace | 0m 2s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 20s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 51m 31s | Tests passed in hadoop-yarn-server-resourcemanager. |
| | | 89m 55s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745422/0004-YARN-3535.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / edcaae4 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8545/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8545/artifact/patchprocess/whitespace.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8545/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8545/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8545/console |

This message was automatically generated.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627602#comment-14627602 ] Arun Suresh commented on YARN-3535:
---
makes sense... thanks for clarifying..
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627594#comment-14627594 ] Rohith Sharma K S commented on YARN-3535:
---
{code}
for (ApplicationId appId : reconnectEvent.getRunningApplications()) {
  handleRunningAppOnNode(rmNode, rmNode.context, appId, rmNode.nodeId);
}
{code}
IIUC, this code updates the RMApp with the node details so that the RMApp knows some of its containers have run on this node. This part of the code does not kill the existing running containers. Running containers are killed when the NodeRemoved event is triggered to the schedulers, and that event is triggered by the RMNodeImpl reconnected transition if no apps are running.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627581#comment-14627581 ] Arun Suresh commented on YARN-3535:
---
bq. .. then on NM restart, running containers should be killed, which is currently achieved by the if-clause.
I am probably missing something... but it looks like this is in fact being done in the else clause (the code snippet I pasted in my comment [above|https://issues.apache.org/jira/browse/YARN-3535?focusedCommentId=14627370&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14627370], lines 658-660 of RMNodeImpl in trunk).
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627577#comment-14627577 ] Rohith Sharma K S commented on YARN-3535: - I had not handled the case of the RR being recovered twice; I will make a change and update the patch soon.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627576#comment-14627576 ] Rohith Sharma K S commented on YARN-3535: - bq. For preemption, a killed container has two cases: the container was already pulled by the AM or not. For the first case, the AM should know the container is killed and will re-ask for a container for the task. For the case where the container was not pulled by the AM, preemption killing causes the same situation as this issue. So I think it should not be recovered twice. Ahh, you are right. Basically, if an RMContainer has not been pulled by the AM, its state is ALLOCATED. On preempting such an RMContainer, the resource request would be recovered twice, i.e. 1. by this JIRA's fix and 2. by the kill-container event in CS. So removing *recoverResourceRequestForContainer(cont);* makes sense to me.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627571#comment-14627571 ] Rohith Sharma K S commented on YARN-3535: - bq. wouldn't removing the if clause just solve this ? Yes, just removing the if clause should solve this issue. But the problem is with the legacy behaviour, i.e. if the RM/NM work-preserving restart feature is NOT enabled, then on NM restart the running containers should be killed, which is currently achieved by the if clause. So to retain the existing behaviour, this fix is required. And YARN-3286 is the tracking JIRA for the Reconnected event clean-up change you mentioned.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627370#comment-14627370 ] Arun Suresh commented on YARN-3535: --- Apologies for the late suggestion. [~djp], correct me if I am wrong here.. I was just looking at YARN-2561. It looks like the basic point of it was to ensure that, on a reconnecting node, running containers were properly killed. This is achieved by the node-removed and node-added events, which happen in the {{if (noRunningApps) ..}} clause of the YARN-2561 patch. But I also see that a later patch handled the issue by introducing the following code inside the {{else ..}} clause of the above-mentioned if: {noformat}
for (ApplicationId appId : reconnectEvent.getRunningApplications()) {
  handleRunningAppOnNode(rmNode, rmNode.context, appId, rmNode.nodeId);
}
{noformat} This correctly kills only the running containers and does not do anything to the allocated containers (which I guess should be the case). Given the above, do we still need whatever is contained in the if clause? Wouldn't removing the if clause just solve this? Thoughts?
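To make the branch under discussion concrete, here is a minimal, self-contained sketch of the decision; all names here are invented stand-ins, the real logic lives in RMNodeImpl's ReconnectedNodeTransition:

```java
import java.util.List;

// Hypothetical simplification of the reconnect handling discussed above.
// When the reconnecting NM reports no running applications, the YARN-2561
// if-clause removes and re-adds the node, which kills everything scheduled
// on it, including ALLOCATED containers the AM has not yet pulled.
// Otherwise, only the reported running applications are handled per-app.
public class ReconnectSketch {
    public enum Action { REMOVE_AND_READD_NODE, HANDLE_RUNNING_APPS }

    public static Action onReconnect(List<String> runningApplications) {
        if (runningApplications.isEmpty()) {
            // legacy path: NodeRemoved + NodeAdded events to the scheduler
            return Action.REMOVE_AND_READD_NODE;
        }
        // stand-in for calling handleRunningAppOnNode(...) per application
        return Action.HANDLE_RUNNING_APPS;
    }
}
```

The thread's question is whether the first branch is still needed at all, since the second branch already handles the running applications without disturbing ALLOCATED containers.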
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626244#comment-14626244 ] Peng Zhang commented on YARN-3535: -- bq. Remove call of recoverResourceRequestForContainer from preemption to avoid duplication of recover RR. I remembered the reason. For preemption, a killed container has two cases: the container was already pulled by the AM or not. For the first case, the AM should know the container is killed and will re-ask for a container for the task. For the case where the container was not pulled by the AM, preemption killing causes the same situation as this issue. So I think it should not be recovered twice.
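The double-recovery concern can be illustrated with a minimal model of the scheduler's pending-ask bookkeeping; this is hypothetical code, not the actual AppSchedulingInfo implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of per-priority pending container counts. In YARN the real
// bookkeeping lives in AppSchedulingInfo; this class only illustrates why
// the resource request must be recovered exactly once per killed ALLOCATED
// container.
public class AskBook {
    private final Map<Integer, Integer> pending = new HashMap<>();

    public void ask(int priority, int n)  { pending.merge(priority, n, Integer::sum); }
    public void allocate(int priority)    { pending.merge(priority, -1, Integer::sum); }
    public void recover(int priority)     { pending.merge(priority, 1, Integer::sum); }
    public int outstanding(int priority)  { return pending.getOrDefault(priority, 0); }
}
```

Asking for one container, allocating it, and then recovering the request twice (once from this JIRA's ALLOCATED-kill path and once from the preemption kill path) would leave two outstanding asks for a single original request, so one extra container would eventually be scheduled.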
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626207#comment-14626207 ] Peng Zhang commented on YARN-3535: -- [~rohithsharma] Thanks for the rebase and for adding tests. As for removing {{recoverResourceRequestForContainer}}: in my notes, it caused the test {{CapacityScheduler#testRecoverRequestAfterPreemption}} to fail. But I cannot remember my old reasoning: bq. Remove call of recoverResourceRequestForContainer from preemption to avoid duplication of recover RR. I applied my patch {{YARN-3535-002.patch}} on our production cluster, and preemption works well with FairScheduler. As for the failure of {{TestAMRestart.testAMRestartWithExistingContainers}}, I met it before, and I think it's because: bq. Changing TestAMRestart.java is needed because the testAMRestartWithExistingContainers case triggers this logic. After this patch, one more container may be scheduled, so attempt.getJustFinishedContainers().size() may be bigger than expectedNum and the loop never ends. So I simply changed the setup.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625238#comment-14625238 ] Arun Suresh commented on YARN-3535: --- Thanks for working on this, [~peng.zhang]. We seem to be hitting this on our scale clusters as well, so it would be good to get this in soon. In our case the NM re-registration was caused by YARN-3842. The patch looks good to me. Any idea why the tests failed?
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624269#comment-14624269 ] Hadoop QA commented on YARN-3535: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 2s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 42s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 38s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 48s | The applied patch generated 4 new checkstyle issues (total was 338, now 342). | | {color:red}-1{color} | whitespace | 0m 2s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 20s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 51m 51s | Tests failed in hadoop-yarn-server-resourcemanager. 
| | | | 89m 47s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart | | | hadoop.yarn.server.resourcemanager.TestApplicationCleanup | | | hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions | | | hadoop.yarn.server.resourcemanager.TestResourceTrackerService | | | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12744980/0003-YARN-3535.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 5ed1fea | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8518/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8518/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8518/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8518/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8518/console | This message was automatically generated. 
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624202#comment-14624202 ] Rohith Sharma K S commented on YARN-3535: - [~peng.zhang] I rebased the patch to trunk and added an FT test. The test simulates the reported scenario and fails with a timeout if this fix is not present; after the fix, the test passes. About your previous patch, I have one doubt: why is the call below removed in both FS and CS? Any specific reason? {code}
-recoverResourceRequestForContainer(cont);
{code}
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624187#comment-14624187 ] Hadoop QA commented on YARN-3535: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12729146/YARN-3535-002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / d667560 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8517/console | This message was automatically generated.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576970#comment-14576970 ] Peng Zhang commented on YARN-3535: -- Sorry for the late reply. Thanks for your comments. bq. 1. I think the method recoverResourceRequestForContainer should be synchronized, any thought? I notice it is not synchronized originally. I checked the method and found that only {{applications}} needs to be protected (it is accessed via {{getCurrentAttemptForContainer()}}). {{applications}} is instantiated as a ConcurrentHashMap in the derived schedulers, so I think there is no need to add synchronized. The other three comments are all related to tests. # Changing TestAMRestart.java is needed because the testAMRestartWithExistingContainers case triggers this logic. After this patch, one more container may be scheduled, so attempt.getJustFinishedContainers().size() may be bigger than expectedNum and the loop never ends. So I simply changed the setup. # I agree that this issue exists in all schedulers and should be tested generally, but I didn't find a good way to reproduce it. I'll take a try with ParameterizedSchedulerTestBase. # I changed RMContextImpl.java to expose schedulerDispatcher and start it in TestFairScheduler; otherwise the event handler cannot be triggered. I'll check whether this can also be solved based on ParameterizedSchedulerTestBase.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576774#comment-14576774 ] Rohith commented on YARN-3535: -- We recently faced the same issue in testing. [~peng.zhang], would you mind updating the patch?
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560774#comment-14560774 ] Rohith commented on YARN-3535: -- Thanks [~peng.zhang] for working on this issue. Some comments: # I think the method {{recoverResourceRequestForContainer}} should be synchronized; any thoughts? # Why do we require the {{RMContextImpl.java}} changes? I think we can avoid them; they are not necessarily required. Tests: # Any specific reason for changing {{TestAMRestart.java}}? # IIUC, this issue can occur in any scheduler, given that the AM-RM heartbeat interval is shorter than the NM-RM heartbeat interval. So can it include an FT test case applicable to both CS and FS? Maybe you can add the test in a class extending {{ParameterizedSchedulerTestBase}}, i.e. TestAbstractYarnScheduler.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519368#comment-14519368 ] Peng Zhang commented on YARN-3535: -- I think the TestAMRestart failure is not related to this patch. I found that YARN-2483 is meant to resolve it.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519361#comment-14519361 ] Hadoop QA commented on YARN-3535: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 55s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | javac | 7m 42s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 58s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 5m 22s | The applied patch generated 7 additional checkstyle issues. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 18s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 53m 45s | Tests failed in hadoop-yarn-server-resourcemanager. 
| | | | 95m 32s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12729146/YARN-3535-002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8f82970 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7538/artifact/patchprocess/checkstyle-result-diff.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7538/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7538/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7538/console | This message was automatically generated.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517272#comment-14517272 ] Peng Zhang commented on YARN-3535: -- Sorry, I only ran the tests in the FairScheduler package; I'll fix the others tomorrow. Also, how can I find the specific checkstyle errors? I am using the Cloudera code formatter in IntelliJ.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517186#comment-14517186 ] Hadoop QA commented on YARN-3535: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 35s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | javac | 7m 31s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 5m 23s | The applied patch generated 8 additional checkstyle issues. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 14s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 59m 40s | Tests failed in hadoop-yarn-server-resourcemanager. 
| | | | 100m 37s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart | | | hadoop.yarn.server.resourcemanager.TestRM | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12728784/YARN-3535-001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 99fe03e | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7523/artifact/patchprocess/checkstyle-result-diff.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7523/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7523/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7523/console | This message was automatically generated.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517022#comment-14517022 ] Peng Zhang commented on YARN-3535: -- Attached a patch to restore the ResourceRequest on the ALLOCATED to KILLED transition. Added a test case for FairScheduler, and I added a getter for SchedulerDispatcher in RMContextImpl so the test can start it. I've tested a rolling-update operation on a small cluster: the problematic transition is triggered, and the MR job works well.
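The idea of the patch can be sketched in a few lines; all names here are invented for illustration, while in the real patch the recovery is done via the scheduler's recoverResourceRequestForContainer on the ALLOCATED to KILLED transition:

```java
// Hypothetical sketch of the ALLOCATED -> KILLED handling this patch adds.
// A container killed while still ALLOCATED was never pulled by the AM, so
// the AM still believes its ask is outstanding; the scheduler must restore
// the request or the job hangs waiting for a container that never comes.
public class ContainerSketch {
    public enum State { ALLOCATED, ACQUIRED, KILLED }

    private State state = State.ALLOCATED;
    private boolean requestRecovered = false;

    // the AM pulls (acquires) the allocated container on its next heartbeat
    public void acquire() { state = State.ACQUIRED; }

    public void kill() {
        if (state == State.ALLOCATED) {
            // stand-in for recoverResourceRequestForContainer(container)
            requestRecovered = true;
        }
        state = State.KILLED;
    }

    public boolean isRequestRecovered() { return requestRecovered; }
    public State getState() { return state; }
}
```

Once the container has been acquired, no recovery is needed: the AM sees the completion status and re-asks on its own, which matches the two preemption cases discussed earlier in the thread.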
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514899#comment-14514899 ] Jason Lowe commented on YARN-3535: -- Yes, the resource request needs to be added back; that's by far the simplest fix. The AM has no idea the request was fulfilled before the container was killed, so from the AM's perspective the request is still outstanding. I'm +1 for adding a new flag indicating whether the NM reconnect is container-preserving or not, as long as we work through the upgrade scenarios to verify we don't introduce regressions.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514499#comment-14514499 ] Rohith commented on YARN-3535: -- Adding the RR back to the scheduler makes more sense to me. Since the RM decides whether NM restart is enabled based on the running applications reported during the registration call, it is difficult to distinguish an NM with restart enabled that happens to report 0 applications from an NM with restart disabled, which always reports 0 applications after a restart. Why can't the NM register with an additional flag telling the RM that NM restart is enabled? Any thoughts? YARN-3286 was created to refactor the code in RMNodeImpl#ReconnectedNodeTransition, but it did not progress since it changed the behavior of killing running containers on NM restart.
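The flag idea can be sketched as follows (the field and class names are hypothetical, not the real RegisterNodeManagerRequest protobuf fields): the NM includes a boolean in its registration saying its restart is work-preserving, so the RM no longer has to infer that from the app count alone.

```java
// Hypothetical registration payload; the real protobuf fields differ.
class NodeRegistration {
    final String nodeId;
    final int runningApps;
    final boolean restartEnabled; // proposed new flag

    NodeRegistration(String nodeId, int runningApps, boolean restartEnabled) {
        this.nodeId = nodeId;
        this.runningApps = runningApps;
        this.restartEnabled = restartEnabled;
    }
}

public class ReconnectDecision {
    // With the flag, the RM no longer guesses from the app count:
    // preserve containers whenever the NM says its restart is work-preserving.
    static boolean preserveContainers(NodeRegistration reg) {
        return reg.restartEnabled;
    }

    public static void main(String[] args) {
        // Restart-enabled NM that happens to report zero apps: containers kept.
        System.out.println(preserveContainers(new NodeRegistration("nm1:8041", 0, true)));
        // Restart-disabled NM also reporting zero apps: safe to treat as fresh.
        System.out.println(preserveContainers(new NodeRegistration("nm2:8041", 0, false)));
    }
}
```

The point of the flag is precisely the ambiguous zero-apps case: both registrations above look identical to the RM today.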
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514190#comment-14514190 ] Junping Du commented on YARN-3535: -- [~jlowe], [~peng.zhang] and [~rohithsharma], following my comments in YARN-3212 (https://issues.apache.org/jira/browse/YARN-3212?focusedCommentId=14514182&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14514182), maybe we should still support the ALLOCATED to KILLED case/transition but make sure the AM and RM stay in sync somehow, e.g. by adding back the resource request in this case?
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514107#comment-14514107 ] Junping Du commented on YARN-3535: -- Sorry for coming late on this. The discussion above sounds good to me. bq. Junping Du, as this also has potential ramifications for graceful decommission. If we try to gracefully decommission a node that isn't currently reporting applications we may also need to verify the scheduler hasn't allocated or handed out a container for that node that hasn't reached the node yet. That's a good point, [~jlowe]! I will put a note on YARN-3212 to apply the right check.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509276#comment-14509276 ] Jason Lowe commented on YARN-3535: -- The first item is to avoid containers failing due to an NM restart. As it is now, a container handed out by the RM to an idle NM can fail if the NM restarts before the AM launches the container.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509256#comment-14509256 ] Peng Zhang commented on YARN-3535: -- Per [~jlowe]'s thoughts, I understand there are two separate things: # During NM reconnection, the RM and NM should sync at the container level. In this issue's scenario, container 04 should not be killed and rescheduled, so the AM can acquire and launch it on the NM after the NM re-registers. # We still need the fix in RMContainerImpl: restore the request during the ALLOCATED to KILLED transition, because a real NM loss can still trigger that transition with small probability (the AM may heartbeat and acquire the container only after the NM's heartbeats time out). I think the first item is an improvement that preserves scheduling work already done. Or did I misunderstand anything?
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509153#comment-14509153 ] Jason Lowe commented on YARN-3535: -- I think we need to fix the RMContainerImpl ALLOCATED to KILLED transition, but I think there's another bug here. I believe the container was killed in the first place because the RMNodeImpl reconnect transition makes an assumption that is racy. When the node reconnects, the RM checks whether the node reports any running applications. If it reports none, the RM sends a removed-node event followed by an added-node event to the scheduler, which causes the scheduler to kill all containers allocated on that node. However, the node will only know about a container if the AM acquires the container and tries to launch it on the node. That can take minutes to transpire, so it's dangerous to assume that a node reporting no applications has nothing pending. I think we'll have to revisit the solution to YARN-2561 to either eliminate this race or make it safe if it does occur. Ideally we shouldn't be sending remove/add events to the scheduler when a node is reconnecting, but we do need to make sure we cancel containers on the node that are no longer running. Since the node reports what containers it has when it reconnects, we can convey that information to the scheduler to correct anything that doesn't match up. Any container in the RUNNING state that no longer appears in the registration's container list can be killed by the scheduler, as it does when a node is removed. I believe that would fix YARN-2561 and also avoid this race. cc: [~djp], as this also has potential ramifications for graceful decommission.
If we try to gracefully decommission a node that isn't currently reporting applications, we may also need to verify the scheduler hasn't allocated or handed out a container for that node that hasn't reached the node yet.
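The reconciliation described here could look roughly like this toy model (the real scheduler event plumbing is more involved; names are illustrative): at reconnect the NM reports its live container IDs, and the scheduler kills only the RUNNING containers missing from that report, leaving ALLOCATED containers alone rather than tearing the whole node down via remove/add events.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ReconnectReconcile {
    enum State { ALLOCATED, RUNNING, KILLED }

    // containerId -> state, as the RM currently tracks the node.
    static Map<Integer, State> rmView = new HashMap<>();

    // Kill only RUNNING containers the reconnecting NM no longer reports;
    // ALLOCATED containers the NM cannot know about yet are left untouched.
    static List<Integer> reconcile(Set<Integer> reportedByNm) {
        List<Integer> killed = new ArrayList<>();
        for (Map.Entry<Integer, State> e : rmView.entrySet()) {
            if (e.getValue() == State.RUNNING && !reportedByNm.contains(e.getKey())) {
                e.setValue(State.KILLED);
                killed.add(e.getKey());
            }
        }
        return killed;
    }

    public static void main(String[] args) {
        rmView.put(3, State.RUNNING);   // was running, lost across the restart
        rmView.put(4, State.ALLOCATED); // handed out, AM has not launched it yet
        rmView.put(5, State.RUNNING);   // survived the work-preserving restart
        List<Integer> killed = reconcile(Set.of(5));
        System.out.println(killed);
        System.out.println(rmView.get(4)); // container 04's analogue survives
    }
}
```

In the remove/add approach, container 4 would have been killed along with everything else; reconciling per container is what lets the AM still acquire and launch it after the NM re-registers.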
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508901#comment-14508901 ] Peng Zhang commented on YARN-3535: -- Thanks [~rohithsharma] for help.
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508856#comment-14508856 ] Rohith commented on YARN-3535: -- Moved to YARN and updated the description as per the real issue.