[jira] [Created] (YARN-11302) hadoop-yarn-applications-mawo-core module publishes tar file during maven deploy
Steven Rand created YARN-11302:
----------------------------------

             Summary: hadoop-yarn-applications-mawo-core module publishes tar file during maven deploy
                 Key: YARN-11302
                 URL: https://issues.apache.org/jira/browse/YARN-11302
             Project: Hadoop YARN
          Issue Type: Bug
          Components: applications, yarn
    Affects Versions: 3.3.4
            Reporter: Steven Rand

The {{hadoop-yarn-applications-mawo-core}} module currently publishes a file with the {{bin.tar.gz}} extension during the maven deploy step: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-mawo/hadoop-yarn-applications-mawo-core/src/assembly/bin.xml#L16.

I don't know whether the community considers this a bug, but I'm filing a ticket because the deploy step typically produces only JAR and POM files. Some maven repositories that are intended to host only JARs enforce allowlists of file extensions; these block tarballs from being published and cause the maven deploy operation to fail with this error:

{code}
Caused by: org.apache.maven.wagon.TransferFailedException: Failed to transfer file: https:///artifactory//org/apache/hadoop/applications/mawo/hadoop-yarn-applications-mawo-core//hadoop-yarn-applications-mawo-core--bin.tar.gz. Return code is: 409, ReasonPhrase: .
{code}

Feel free to close if the community doesn't consider this a problem, but note that it is a regression from versions predating mawo, when only JAR and POM files were published in the deploy step.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
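If the community does decide to stop deploying the tarball, one possible approach is to keep building the assembly but stop attaching it as a project artifact, so it is no longer uploaded during install/deploy. The fragment below is a sketch only, not the module's actual configuration; the plugin coordinates are standard Maven conventions:

```xml
<!-- Hypothetical pom.xml fragment: still build the bin.tar.gz locally,
     but do not attach it to the project, so "mvn deploy" skips it. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptors>
      <descriptor>src/assembly/bin.xml</descriptor>
    </descriptors>
    <!-- attach defaults to true; false keeps the tarball out of the repository -->
    <attach>false</attach>
  </configuration>
</plugin>
```

Whether the tarball is still wanted as a non-deployed build artifact would be for the community to decide.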
[jira] [Commented] (YARN-11184) fenced active RM not failing over correctly in HA setup
[ https://issues.apache.org/jira/browse/YARN-11184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554318#comment-17554318 ]

Steven Rand commented on YARN-11184:
------------------------------------

Possibly [ZOOKEEPER-2251|https://issues.apache.org/jira/browse/ZOOKEEPER-2251] is related? The thread dump is different, but it appears to be a similar problem of the {{StandByTransitionThread}} waiting indefinitely for a response. The ZK version used client-side by Hadoop does not include the fix for that issue.

> fenced active RM not failing over correctly in HA setup
> -------------------------------------------------------
>
>                 Key: YARN-11184
>                 URL: https://issues.apache.org/jira/browse/YARN-11184
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.2.3
>            Reporter: Steven Rand
>            Priority: Major
>         Attachments: image-2022-06-14-16-38-00-336.png, image-2022-06-14-16-39-50-278.png, image-2022-06-14-16-41-39-742.png, image-2022-06-14-16-44-45-101.png
>
> We've observed an issue recently on a production cluster running 3.2.3 in which a fenced ResourceManager remains active, but does not communicate with the ZK state store, and therefore cannot function correctly. This did not occur while running 3.2.2 on the same cluster.
> In more detail, what seems to happen is:
> 1. The active RM gets a {{NodeExists}} error from ZK while storing an app in the state store. I suspect that this is caused by some transient connection issue that causes the first node creation request to succeed, but the response not to reach the RM, triggering a duplicate request which fails with this error. !image-2022-06-14-16-38-00-336.png!
> 2. Because of this error, the active RM is fenced. !image-2022-06-14-16-39-50-278.png!
> 3. Because it is fenced, the active RM starts to transition to standby. !image-2022-06-14-16-41-39-742.png!
> 4. However, the RM never fully transitions to standby. It never logs {{Transitioning RM to Standby mode}} from the run method of {{StandByTransitionRunnable}}: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java#L1195. Related to this, a jstack of the RM shows that thread being {{RUNNABLE}}, but evidently not making progress: !image-2022-06-14-16-44-45-101.png!
> So the RM doesn't work because it is fenced, but remains active, which causes an outage until a failover is manually initiated.
[jira] [Created] (YARN-11184) fenced active RM not failing over correctly in HA setup
Steven Rand created YARN-11184:
----------------------------------

             Summary: fenced active RM not failing over correctly in HA setup
                 Key: YARN-11184
                 URL: https://issues.apache.org/jira/browse/YARN-11184
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 3.2.3
            Reporter: Steven Rand
         Attachments: image-2022-06-14-16-38-00-336.png, image-2022-06-14-16-39-50-278.png, image-2022-06-14-16-41-39-742.png, image-2022-06-14-16-44-45-101.png

We've observed an issue recently on a production cluster running 3.2.3 in which a fenced ResourceManager remains active, but does not communicate with the ZK state store, and therefore cannot function correctly. This did not occur while running 3.2.2 on the same cluster.

In more detail, what seems to happen is:

1. The active RM gets a {{NodeExists}} error from ZK while storing an app in the state store. I suspect that this is caused by some transient connection issue that causes the first node creation request to succeed, but the response not to reach the RM, triggering a duplicate request which fails with this error. !image-2022-06-14-16-38-00-336.png!
2. Because of this error, the active RM is fenced. !image-2022-06-14-16-39-50-278.png!
3. Because it is fenced, the active RM starts to transition to standby. !image-2022-06-14-16-41-39-742.png!
4. However, the RM never fully transitions to standby. It never logs {{Transitioning RM to Standby mode}} from the run method of {{StandByTransitionRunnable}}: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java#L1195. Related to this, a jstack of the RM shows that thread being {{RUNNABLE}}, but evidently not making progress: !image-2022-06-14-16-44-45-101.png!

So the RM doesn't work because it is fenced, but remains active, which causes an outage until a failover is manually initiated.
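The retried-create failure suspected in step 1 can be replayed in miniature. The sketch below is entirely hypothetical — a HashSet stands in for the ZK state store, and none of the names come from Hadoop — but it shows why a blind retry of a create that actually succeeded surfaces as {{NodeExists}}, and how an idempotent client would interpret that differently:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical replay of the duplicate-create problem: the first create
// lands server-side but its ack is "lost", so the client blindly retries.
// A HashSet stands in for the ZK state store; nothing here is Hadoop code.
public class DuplicateCreateSketch {
    static final Set<String> store = new HashSet<>();

    // A naive client surfaces the retry's "node exists" as a fatal error
    // (which is what fences the RM); an idempotent client treats NodeExists
    // on a retry as confirmation that the earlier write took effect.
    static boolean createWithRetry(String path, boolean treatExistsAsSuccess) {
        store.add(path);                   // attempt 1: succeeds, ack "lost"
        boolean created = store.add(path); // attempt 2: false, i.e. NodeExists
        return created || treatExistsAsSuccess;
    }

    public static void main(String[] args) {
        System.out.println(createWithRetry("/rmstore/app_1", false)); // false
        store.clear();
        System.out.println(createWithRetry("/rmstore/app_1", true));  // true
    }
}
```

Under this reading, the fix in ZOOKEEPER-2251 (or retry-aware handling in the state store client) matters because the failure is an artifact of the retry, not a real conflict.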
[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214933#comment-17214933 ]

Steven Rand commented on YARN-10244:
------------------------------------

Thanks [~aajisaka]!

> backport YARN-9848 to branch-3.2
> --------------------------------
>
>                 Key: YARN-10244
>                 URL: https://issues.apache.org/jira/browse/YARN-10244
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation, resourcemanager
>            Reporter: Steven Rand
>            Assignee: Steven Rand
>            Priority: Major
>         Attachments: YARN-10244-branch-3.2.001.patch, YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch
>
> Backporting YARN-9848 to branch-3.2.
[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196839#comment-17196839 ]

Steven Rand commented on YARN-10244:
------------------------------------

Just to be clear, I consider this finished from my side unless someone tells me that I've misunderstood something. The test failures are caused by YARN-10249, not by the patch, so I don't think any further action is needed from me.

> backport YARN-9848 to branch-3.2
> --------------------------------
[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194719#comment-17194719 ]

Steven Rand commented on YARN-10244:
------------------------------------

Thanks all for helping with the backport of this to branch-3.2. My guess is that the tests will keep failing until we resolve YARN-10249.

> backport YARN-9848 to branch-3.2
> --------------------------------
[jira] [Commented] (YARN-10249) Various ResourceManager tests are failing on branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17098400#comment-17098400 ]

Steven Rand commented on YARN-10249:
------------------------------------

This is also happening in YARN-10244.

> Various ResourceManager tests are failing on branch-3.2
> --------------------------------------------------------
>
>                 Key: YARN-10249
>                 URL: https://issues.apache.org/jira/browse/YARN-10249
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.2.0
>            Reporter: Benjamin Teke
>            Assignee: Benjamin Teke
>            Priority: Major
>         Attachments: YARN-10249.branch-3.2.POC001.patch
>
> Various tests are failing on branch-3.2. Some examples can be found in YARN-10003, YARN-10002, and YARN-10237. The seemingly common thread is that all of the failing tests are RM/Capacity Scheduler related, and the failures are flaky.
[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17098397#comment-17098397 ]

Steven Rand commented on YARN-10244:
------------------------------------

The test failures for all three patches are caused by YARN-10249, not by the patches themselves.

> backport YARN-9848 to branch-3.2
> --------------------------------
[jira] [Updated] (YARN-10244) backport YARN-9848 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rand updated YARN-10244:
-------------------------------
    Attachment: YARN-10244-branch-3.2.003.patch

> backport YARN-9848 to branch-3.2
> --------------------------------
[jira] [Updated] (YARN-10244) backport YARN-9848 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rand updated YARN-10244:
-------------------------------
    Attachment: YARN-10244-branch-3.2.002.patch

> backport YARN-9848 to branch-3.2
> --------------------------------
[jira] [Comment Edited] (YARN-9848) revert YARN-4946
[ https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17092756#comment-17092756 ]

Steven Rand edited comment on YARN-9848 at 4/26/20, 3:21 PM:
------------------------------------------------------------

I created YARN-10244 for branch-3.2. As for resolving this issue: I'm not a committer, so I think someone else will have to merge the patch to trunk and branch-3.3.0.

> revert YARN-4946
> ----------------
>
>                 Key: YARN-9848
>                 URL: https://issues.apache.org/jira/browse/YARN-9848
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation, resourcemanager
>            Reporter: Steven Rand
>            Assignee: Steven Rand
>            Priority: Blocker
>         Attachments: YARN-9848-01.patch, YARN-9848.002.patch, YARN-9848.003.patch
>
> In YARN-4946, we've been discussing a revert due to the potential for keeping more applications in the state store than desired, and the potential to greatly increase RM recovery times.
> I'm in favor of reverting the patch, but other ideas along the lines of YARN-9571 would work as well.
[jira] [Commented] (YARN-9848) revert YARN-4946
[ https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17092756#comment-17092756 ]

Steven Rand commented on YARN-9848:
-----------------------------------

I created YARN-10244 for branch-3.2. As for resolving this issue: I'm not a committer, so I think someone else will have to merge the patch to trunk and branch-3.3.0.

> revert YARN-4946
> ----------------
[jira] [Updated] (YARN-10244) backport YARN-9848 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rand updated YARN-10244:
-------------------------------
    Attachment: YARN-10244-branch-3.2.001.patch

> backport YARN-9848 to branch-3.2
> --------------------------------
[jira] [Created] (YARN-10244) backport YARN-9848 to branch-3.2
Steven Rand created YARN-10244:
----------------------------------

             Summary: backport YARN-9848 to branch-3.2
                 Key: YARN-10244
                 URL: https://issues.apache.org/jira/browse/YARN-10244
             Project: Hadoop YARN
          Issue Type: Bug
          Components: log-aggregation, resourcemanager
            Reporter: Steven Rand
            Assignee: Steven Rand

Backporting YARN-9848 to branch-3.2.
[jira] [Commented] (YARN-9848) revert YARN-4946
[ https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087990#comment-17087990 ]

Steven Rand commented on YARN-9848:
-----------------------------------

Thanks all. I also have a patch for branch-3.2 so that we can include it in a 3.2 maintenance release, as [~vinodkv] suggested. Do I upload the patch to this JIRA, or is it better to create a new one?

> revert YARN-4946
> ----------------
[jira] [Comment Edited] (YARN-9848) revert YARN-4946
[ https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083449#comment-17083449 ]

Steven Rand edited comment on YARN-9848 at 4/14/20, 5:30 PM:
------------------------------------------------------------

From trying to apply the patch locally, it seems that trunk has changed since I wrote it, and it no longer applies cleanly. I'll upload a new one soon.

EDIT: The {{YARN-9848.003.patch}} file accounts for the changes from YARN-9886 and applies to trunk.

> revert YARN-4946
> ----------------
[jira] [Updated] (YARN-9848) revert YARN-4946
[ https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rand updated YARN-9848:
------------------------------
    Attachment: YARN-9848.003.patch

> revert YARN-4946
> ----------------
[jira] [Commented] (YARN-9848) revert YARN-4946
[ https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083449#comment-17083449 ]

Steven Rand commented on YARN-9848:
-----------------------------------

From trying to apply the patch locally, it seems that trunk has changed since I wrote it, and it no longer applies cleanly. I'll upload a new one soon.

> revert YARN-4946
> ----------------
[jira] [Updated] (YARN-9848) revert YARN-4946
[ https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rand updated YARN-9848:
------------------------------
    Attachment: YARN-9848.002.patch

> revert YARN-4946
> ----------------
[jira] [Commented] (YARN-8990) Fix fair scheduler race condition in app submit and queue cleanup
[ https://issues.apache.org/jira/browse/YARN-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024799#comment-17024799 ]

Steven Rand commented on YARN-8990:
-----------------------------------

How would people feel about cherry-picking this and YARN-8992 to {{branch-3.2}}? It seems like we should do that before {{branch-3.2.2}} gets cut for an eventual 3.2.2 release.

> Fix fair scheduler race condition in app submit and queue cleanup
> -----------------------------------------------------------------
>
>                 Key: YARN-8990
>                 URL: https://issues.apache.org/jira/browse/YARN-8990
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 3.2.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Blocker
>             Fix For: 3.2.0, 3.3.0
>         Attachments: YARN-8990.001.patch, YARN-8990.002.patch
>
> With the introduction of dynamic queue deletion in YARN-8191, a race condition was introduced that can cause a queue to be removed while an application submit is in progress.
> The issue occurs in {{FairScheduler.addApplication()}} when an application is submitted to a dynamic queue which is empty or does not exist yet. If, during the processing of the application submit, the {{AllocationFileLoaderService}} kicks off an update, the queue cleanup will be run first. The application submit first creates the queue and gets a reference back to it. Other checks are performed, and as the last action before getting ready to generate an AppAttempt, the queue is updated to record the submitted application ID.
> The time between the queue creation and the queue update recording the submit is long enough for the queue to be removed. The application, however, is lost and will never get any resources assigned.
[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state
[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011446#comment-17011446 ]

Steven Rand commented on YARN-4946:
-----------------------------------

Any update on what we want to do here? It seems like we're starting to plan new releases, and I think it'd be good to either revert or make some adjustment before those come out.

> RM should not consider an application as COMPLETED when log aggregation is not in a terminal state
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4946
>                 URL: https://issues.apache.org/jira/browse/YARN-4946
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: log-aggregation
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Szilard Nemeth
>            Priority: Major
>             Fix For: 3.2.0
>         Attachments: YARN-4946.001.patch, YARN-4946.002.patch, YARN-4946.003.patch, YARN-4946.004.patch
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each Yarn app into a HAR file. When run, it seeds the list by looking at the aggregated logs directory, and then filters out ineligible apps. One of the criteria involves checking with the RM that an application's log aggregation status is not still running and has not failed. When the RM "forgets" about an older completed application (e.g. RM failover, enough time has passed, etc.), the tool won't find the application in the RM and will just assume that its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem as follows: the RM should not consider an app to be fully completed (and thus removed from its history) until the aggregation status has reached a terminal state (e.g. SUCCEEDED, FAILED, TIME_OUT).
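The proposed rule — only drop an app once aggregation is in a terminal state — amounts to a small predicate. The sketch below is illustrative only: the enum values mirror YARN's {{LogAggregationStatus}}, but the class and method names are invented, not ResourceManager code.

```java
// Hypothetical sketch of the terminal-state check proposed above.
// The enum values mirror YARN's LogAggregationStatus; the class and
// method names are invented for illustration.
public class LogAggregationCheck {
    enum LogAggregationStatus {
        DISABLED, NOT_START, RUNNING, RUNNING_WITH_FAILURE,
        SUCCEEDED, FAILED, TIME_OUT
    }

    // The RM would only remove an app from its history once aggregation
    // can no longer change state (SUCCEEDED, FAILED, or TIME_OUT).
    static boolean isTerminal(LogAggregationStatus status) {
        switch (status) {
            case SUCCEEDED:
            case FAILED:
            case TIME_OUT:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isTerminal(LogAggregationStatus.SUCCEEDED)); // true
        System.out.println(isTerminal(LogAggregationStatus.RUNNING));   // false
    }
}
```

With such a guard, the HAR tool's "app not found, so aggregation must have succeeded" assumption would become safe, since the RM would never forget an app mid-aggregation.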
[jira] [Comment Edited] (YARN-8990) Fix fair scheduler race condition in app submit and queue cleanup
[ https://issues.apache.org/jira/browse/YARN-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967099#comment-16967099 ]

Steven Rand edited comment on YARN-8990 at 11/4/19 11:43 PM:
-------------------------------------------------------------

Hi all,

Unfortunately, this patch never made its way into the 3.2.1 release, which is affected by this race condition. I think what happened is that it was committed to trunk and backported to branch-3.2.0, but not to branch-3.2 (or branch-3.2.1). And unless I'm misinterpreting the git history, the 3.2.1 release is also missing YARN-8992, despite the fix version of that ticket.

We should at minimum make sure that the fixes for these race conditions are in 3.2.2. Since this was a blocker and the impact is pretty serious, there may be more things we want to do, e.g., messaging and/or expediting the 3.2.2 release, but I'll leave that up to you to decide.

> Fix fair scheduler race condition in app submit and queue cleanup
> -----------------------------------------------------------------
[jira] [Commented] (YARN-8990) Fix fair scheduler race condition in app submit and queue cleanup
[ https://issues.apache.org/jira/browse/YARN-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967099#comment-16967099 ] Steven Rand commented on YARN-8990: --- Hi all, Unfortunately, this patch never made its way into the 3.2.1 release, which is affected by this race condition. I think what happened is that it was committed to trunk and backported to branch-3.2.0, but not to branch-3.2 (or branch-3.2.1). And unless I'm misinterpreting the git history, the 3.2.1 release is also missing YARN-8992, despite the fix version of that ticket. We should at minimum make sure that the fixes for these race conditions are in 3.2.2. Since this was a blocker and the impact is pretty serious, there may be more things we want to do, e.g., messaging or expediting the 3.2.2 release, but I'll leave that up you to decide. > Fix fair scheduler race condition in app submit and queue cleanup > - > > Key: YARN-8990 > URL: https://issues.apache.org/jira/browse/YARN-8990 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 3.2.0 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Blocker > Fix For: 3.2.0, 3.3.0 > > Attachments: YARN-8990.001.patch, YARN-8990.002.patch > > > With the introduction of the dynamic queue deletion in YARN-8191 a race > condition was introduced that can cause a queue to be removed while an > application submit is in progress. > The issue occurs in {{FairScheduler.addApplication()}} when an application is > submitted to a dynamic queue which is empty or the queue does not exist yet. > If during the processing of the application submit the > {{AllocationFileLoaderService}} kicks of for an update the queue clean up > will be run first. The application submit first creates the queue and get a > reference back to the queue. 
> Other checks are performed and as the last action before getting ready to > generate an AppAttempt the queue is updated to show the submitted application > ID. > The time between the queue creation and the queue update to show the submit > is long enough for the queue to be removed. The application however is lost > and will never get any resources assigned.
[jira] [Commented] (YARN-8470) Fair scheduler exception with SLS
[ https://issues.apache.org/jira/browse/YARN-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947163#comment-16947163 ] Steven Rand commented on YARN-8470: --- Hi [~snemeth], [~szegedim], Friendly ping on this ticket. We've hit this issue in a production cluster running 3.2.1. > Fair scheduler exception with SLS > - > > Key: YARN-8470 > URL: https://issues.apache.org/jira/browse/YARN-8470 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Miklos Szegedi >Assignee: Szilard Nemeth >Priority: Major > > I ran into the following exception with sls: > 2018-06-26 13:34:04,358 ERROR resourcemanager.ResourceManager: Received > RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, > FSPreemptionThread, that exited unexpectedly: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptOnNode(FSPreemptionThread.java:207) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptForOneContainer(FSPreemptionThread.java:161) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreempt(FSPreemptionThread.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.run(FSPreemptionThread.java:81)
[jira] [Created] (YARN-9850) document or revert change in which DefaultContainerExecutor no longer propagates NM env to containers
Steven Rand created YARN-9850: - Summary: document or revert change in which DefaultContainerExecutor no longer propagates NM env to containers Key: YARN-9850 URL: https://issues.apache.org/jira/browse/YARN-9850 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Steven Rand After [https://github.com/apache/hadoop/commit/9d4d30243b0fc9630da51a2c17b543ef671d035c], containers launched by the {{DefaultContainerExecutor}} no longer inherit the environment of the NodeManager. I don't object to the commit (I actually prefer the new behavior), but I do think that it's a notable breaking change, as people may be relying on variables in the NM environment for their containers to behave correctly. As far as I can tell, we don't currently include this behavior change in the release notes for Hadoop 3, and it's a particularly tricky one to track down, since there's no JIRA ticket for it. I think that we should at least include this change in the release notes for the 3.0.0 release. Arguably it's worth having the DefaultContainerExecutor set {{inheritParentEnv}} to true when it creates its {{ShellCommandExecutor}} since that preserves the old behavior and is less surprising to users, but I don't feel strongly either way.
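The behavior change comes down to whether the child's environment map starts as a copy of the parent's. A hedged sketch using the standard {{ProcessBuilder}} API; the {{inheritParentEnv}} flag name just mirrors the ticket's terminology, and the helper is illustrative, not the {{DefaultContainerExecutor}} code:

```java
import java.util.Map;

public class EnvInheritSketch {
    // The environment a child launched via ProcessBuilder would see,
    // depending on a flag analogous to the ticket's inheritParentEnv.
    static Map<String, String> childEnv(boolean inheritParentEnv) {
        // ProcessBuilder.environment() starts as a copy of the current
        // process's environment (System.getenv()).
        ProcessBuilder pb = new ProcessBuilder("true");
        if (!inheritParentEnv) {
            pb.environment().clear(); // launch with an empty environment instead
        }
        return pb.environment();
    }

    public static void main(String[] args) {
        // Old behavior: the container env starts as a copy of the NM's env.
        System.out.println(childEnv(true).equals(System.getenv()));
        // New behavior: nothing is inherited unless explicitly passed through.
        System.out.println(childEnv(false).isEmpty());
    }
}
```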
[jira] [Commented] (YARN-9552) FairScheduler: NODE_UPDATE can cause NoSuchElementException
[ https://issues.apache.org/jira/browse/YARN-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934105#comment-16934105 ] Steven Rand commented on YARN-9552: --- Thanks! > FairScheduler: NODE_UPDATE can cause NoSuchElementException > --- > > Key: YARN-9552 > URL: https://issues.apache.org/jira/browse/YARN-9552 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9552-001.patch, YARN-9552-002.patch, > YARN-9552-003.patch, YARN-9552-004.patch > > > We observed a race condition inside YARN with the following stack trace: > {noformat} > 18/11/07 06:45:09.559 SchedulerEventDispatcher:Event Processor ERROR > EventDispatcher: Error in handling event type NODE_UPDATE to the Event > Dispatcher > java.util.NoSuchElementException > at > java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036) > at > java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1373) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:353) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1094) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:961) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1183) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.lang.Thread.run(Thread.java:748) > {noformat} > This is basically the same as the one described in YARN-7382, but the root > cause is different. > When we create an application attempt, we create an {{FSAppAttempt}} object. > This contains an {{AppSchedulingInfo}} which contains a set of > {{SchedulerRequestKey}}. Initially, this set is empty and only initialized a > bit later on a separate thread during a state transition: > {noformat} > 2019-05-07 15:58:02,659 INFO [RM StateStore dispatcher] > recovery.RMStateStore (RMStateStore.java:transition(239)) - Storing info for > app: application_1557237478804_0001 > 2019-05-07 15:58:02,684 INFO [RM Event dispatcher] rmapp.RMAppImpl > (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change > from NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED > 2019-05-07 15:58:02,690 INFO [SchedulerEventDispatcher:Event Processor] > fair.FairScheduler (FairScheduler.java:addApplication(490)) - Accepted > application application_1557237478804_0001 from user: bacskop, in queue: > root.bacskop, currently num of applications: 1 > 2019-05-07 15:58:02,698 INFO [RM Event dispatcher] rmapp.RMAppImpl > (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change > from SUBMITTED to ACCEPTED on event = APP_ACCEPTED > 2019-05-07 15:58:02,731 INFO [RM Event dispatcher] > resourcemanager.ApplicationMasterService > (ApplicationMasterService.java:registerAppAttempt(434)) - Registering app > attempt : appattempt_1557237478804_0001_01 > 2019-05-07 15:58:02,732 INFO [RM Event dispatcher] attempt.RMAppAttemptImpl > (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_01 > State change 
from NEW to SUBMITTED on event = START > 2019-05-07 15:58:02,746 INFO [SchedulerEventDispatcher:Event Processor] > scheduler.SchedulerApplicationAttempt > (SchedulerApplicationAttempt.java:(207)) - *** In the constructor of > SchedulerApplicationAttempt > 2019-05-07 15:58:02,747 INFO [SchedulerEventDispatcher:Event Processor] > scheduler.SchedulerApplicationAttempt > (SchedulerApplicationAttempt.java:(230)) - *** Contents of > appSchedulingInfo: [] > 2019-05-07 15:58:02,752 INFO [SchedulerEventDispatcher:Event Processor] > fair.FairScheduler (FairScheduler.java:addApplicationAttempt(546)) - Added > Application Attempt appattempt_1557237478804_0001_01 to scheduler from > user:
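The window described above is easy to demonstrate in isolation: {{ConcurrentSkipListSet.first()}} throws {{NoSuchElementException}} when the set is empty, which is exactly what {{AppSchedulingInfo.getNextPendingAsk}} hits at the top of the stack trace. A minimal standalone reproduction, not the YARN code path itself:

```java
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentSkipListSet;

public class EmptySetFirstSketch {
    static boolean firstThrowsWhenEmpty() {
        ConcurrentSkipListSet<Integer> schedulerKeys = new ConcurrentSkipListSet<>();
        try {
            // Per the SortedSet contract, first() on an empty set throws
            // NoSuchElementException rather than returning null.
            schedulerKeys.first();
            return false;
        } catch (NoSuchElementException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The set of SchedulerRequestKey is empty until a later state transition
        // populates it; a NODE_UPDATE arriving in that window fails like this.
        System.out.println(firstThrowsWhenEmpty()); // true
    }
}
```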
[jira] [Commented] (YARN-9848) revert YARN-4946
[ https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934095#comment-16934095 ] Steven Rand commented on YARN-9848: --- Attached a patch which reverts YARN-4946 on trunk. The revert applied cleanly to the logic in {{RMAppManager}}, but had several conflicts in {{TestAppManager}}. Tagging [~ccondit], [~wangda], [~rkanter], [~snemeth] > revert YARN-4946 > > > Key: YARN-9848 > URL: https://issues.apache.org/jira/browse/YARN-9848 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, resourcemanager >Reporter: Steven Rand >Priority: Major > Attachments: YARN-9848-01.patch > > > In YARN-4946, we've been discussing a revert due to the potential for keeping > more applications in the state store than desired, and the potential to > greatly increase RM recovery times. > > I'm in favor of reverting the patch, but other ideas along the lines of > YARN-9571 would work as well.
[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state
[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934091#comment-16934091 ] Steven Rand commented on YARN-4946: --- I created YARN-9848 for reverting. > RM should not consider an application as COMPLETED when log aggregation is > not in a terminal state > -- > > Key: YARN-4946 > URL: https://issues.apache.org/jira/browse/YARN-4946 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-4946.001.patch, YARN-4946.002.patch, > YARN-4946.003.patch, YARN-4946.004.patch > > > MAPREDUCE-6415 added a tool that combines the aggregated log files for each > Yarn App into a HAR file. When run, it seeds the list by looking at the > aggregated logs directory, and then filters out ineligible apps. One of the > criteria involves checking with the RM that an Application's log aggregation > status is not still running and has not failed. When the RM "forgets" about > an older completed Application (e.g. RM failover, enough time has passed, > etc), the tool won't find the Application in the RM and will just assume that > its log aggregation succeeded, even if it actually failed or is still running. > We can solve this problem by doing the following: > The RM should not consider an app to be fully completed (and thus removed > from its history) until the aggregation status has reached a terminal state > (e.g. SUCCEEDED, FAILED, TIME_OUT).
[jira] [Updated] (YARN-9848) revert YARN-4946
[ https://issues.apache.org/jira/browse/YARN-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-9848: -- Attachment: YARN-9848-01.patch > revert YARN-4946 > > > Key: YARN-9848 > URL: https://issues.apache.org/jira/browse/YARN-9848 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, resourcemanager >Reporter: Steven Rand >Priority: Major > Attachments: YARN-9848-01.patch > > > In YARN-4946, we've been discussing a revert due to the potential for keeping > more applications in the state store than desired, and the potential to > greatly increase RM recovery times. > > I'm in favor of reverting the patch, but other ideas along the lines of > YARN-9571 would work as well.
[jira] [Created] (YARN-9848) revert YARN-4946
Steven Rand created YARN-9848: - Summary: revert YARN-4946 Key: YARN-9848 URL: https://issues.apache.org/jira/browse/YARN-9848 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation, resourcemanager Reporter: Steven Rand In YARN-4946, we've been discussing a revert due to the potential for keeping more applications in the state store than desired, and the potential to greatly increase RM recovery times. I'm in favor of reverting the patch, but other ideas along the lines of YARN-9571 would work as well.
[jira] [Commented] (YARN-9552) FairScheduler: NODE_UPDATE can cause NoSuchElementException
[ https://issues.apache.org/jira/browse/YARN-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934050#comment-16934050 ] Steven Rand commented on YARN-9552: --- This seems like an important fix since it prevents the RM from crashing – any chance we can backport it to the 3.2 and 3.1 maintenance releases? > FairScheduler: NODE_UPDATE can cause NoSuchElementException > --- > > Key: YARN-9552 > URL: https://issues.apache.org/jira/browse/YARN-9552 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9552-001.patch, YARN-9552-002.patch, > YARN-9552-003.patch, YARN-9552-004.patch > > > We observed a race condition inside YARN with the following stack trace: > {noformat} > 18/11/07 06:45:09.559 SchedulerEventDispatcher:Event Processor ERROR > EventDispatcher: Error in handling event type NODE_UPDATE to the Event > Dispatcher > java.util.NoSuchElementException > at > java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036) > at > java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1373) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:353) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1094) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:961) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1183) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.lang.Thread.run(Thread.java:748) > {noformat} > This is basically the same as the one described in YARN-7382, but the root > cause is different. > When we create an application attempt, we create an {{FSAppAttempt}} object. > This contains an {{AppSchedulingInfo}} which contains a set of > {{SchedulerRequestKey}}. Initially, this set is empty and only initialized a > bit later on a separate thread during a state transition: > {noformat} > 2019-05-07 15:58:02,659 INFO [RM StateStore dispatcher] > recovery.RMStateStore (RMStateStore.java:transition(239)) - Storing info for > app: application_1557237478804_0001 > 2019-05-07 15:58:02,684 INFO [RM Event dispatcher] rmapp.RMAppImpl > (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change > from NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED > 2019-05-07 15:58:02,690 INFO [SchedulerEventDispatcher:Event Processor] > fair.FairScheduler (FairScheduler.java:addApplication(490)) - Accepted > application application_1557237478804_0001 from user: bacskop, in queue: > root.bacskop, currently num of applications: 1 > 2019-05-07 15:58:02,698 INFO [RM Event dispatcher] rmapp.RMAppImpl > (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change > from SUBMITTED to ACCEPTED on event = APP_ACCEPTED > 2019-05-07 15:58:02,731 INFO [RM Event dispatcher] > resourcemanager.ApplicationMasterService > (ApplicationMasterService.java:registerAppAttempt(434)) - Registering app > attempt : appattempt_1557237478804_0001_01 > 2019-05-07 15:58:02,732 INFO [RM Event 
dispatcher] attempt.RMAppAttemptImpl > (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_01 > State change from NEW to SUBMITTED on event = START > 2019-05-07 15:58:02,746 INFO [SchedulerEventDispatcher:Event Processor] > scheduler.SchedulerApplicationAttempt > (SchedulerApplicationAttempt.java:(207)) - *** In the constructor of > SchedulerApplicationAttempt > 2019-05-07 15:58:02,747 INFO [SchedulerEventDispatcher:Event Processor] > scheduler.SchedulerApplicationAttempt > (SchedulerApplicationAttempt.java:(230)) - *** Contents of > appSchedulingInfo: [] > 2019-05-07 15:58:02,752 INFO [SchedulerEventDispatcher:Event Processor] > fair.FairScheduler (Fair
[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state
[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900669#comment-16900669 ] Steven Rand commented on YARN-4946: --- I reverted this patch in our fork, and now RM recovery time is back to normal, and the number of apps being stored in ZK respects the configured maximum again. Friendly ping for [~wangda] and/or [~ccondit] on the question of either reverting this or pursuing a followup along the lines of YARN-9571. > RM should not consider an application as COMPLETED when log aggregation is > not in a terminal state > -- > > Key: YARN-4946 > URL: https://issues.apache.org/jira/browse/YARN-4946 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-4946.001.patch, YARN-4946.002.patch, > YARN-4946.003.patch, YARN-4946.004.patch > > > MAPREDUCE-6415 added a tool that combines the aggregated log files for each > Yarn App into a HAR file. When run, it seeds the list by looking at the > aggregated logs directory, and then filters out ineligible apps. One of the > criteria involves checking with the RM that an Application's log aggregation > status is not still running and has not failed. When the RM "forgets" about > an older completed Application (e.g. RM failover, enough time has passed, > etc), the tool won't find the Application in the RM and will just assume that > its log aggregation succeeded, even if it actually failed or is still running. > We can solve this problem by doing the following: > The RM should not consider an app to be fully completed (and thus removed > from its history) until the aggregation status has reached a terminal state > (e.g. SUCCEEDED, FAILED, TIME_OUT). 
[jira] [Comment Edited] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state
[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898599#comment-16898599 ] Steven Rand edited comment on YARN-4946 at 8/2/19 6:17 AM: --- I noticed after upgrading a cluster to 3.2.0 that RM recovery now takes about 20 minutes, whereas before it took less than one minute. I checked the RM's logs, and noticed that it hits the code path added in this patch more than 18 million times {code:java} # The log rotation settings allow for only 20 log files, so actually this number is lower than the real count. $ grep 'but not removing' hadoop--resourcemanager-.log* | wc -l 18092893 {code} I checked in ZK, and according to {{./zkCli.sh ls /rmstore/ZKRMStateRoot/RMAppRoot}}, I have 9,755 apps in the RM state store, even though the configured max is 1,000. I think that what happens when RM recovery starts is: * Some number of apps in the state store cause us to handle an {{APP_COMPLETED}} event during recovery. I'm not sure exactly how many – presumably just those that are finished? * Each time we handle one of these events, we call {{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}}, and in both cases we realize that there are more apps both in ZK and in memory than is allowed (limit for both is 1,000). * So for each of these events, we go through the for loops in both {{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}} that try to remove apps from ZK and from memory. * For whatever reason – probably a separate issue on this cluster – log aggregation isn't complete for any of these apps. So the for loops never manage to delete apps. And since the for loops are deterministic, they try to delete the same apps every time, but never make progress. 
And I think the repetition of these for loops for each {{APP_COMPLETED}} event explains the 18 million number – if we can have at most 9,755 finished apps in the state store, and for each of those apps we trigger 2 for loops that can have at most 8,755 iterations, we very quickly wind up with a lot of iterations. Because this change can lead to much longer RM recovery times in circumstances like this one, I think I prefer option {{a}} from the two listed above. Or, I think it's also reasonable to modify the patch from YARN-9571 to have a hardcoded TTL. was (Author: steven rand): I noticed after upgrading a cluster to 3.2.0 that RM recovery now takes about 20 minutes, whereas before it took less than one minute. I checked the RM's logs, and noticed that it hits the code path added in this patch more than 18 million times {code:java} # The log rotation settings allow for only 20 log files, so actually this number is lower than the real count. $ grep 'but not removing' hadoop-palantir-resourcemanager-.log* | wc -l 18092893 {code} I checked in ZK, and according to {{./zkCli.sh ls /rmstore/ZKRMStateRoot/RMAppRoot}}, I have 9,755 apps in the RM state store, even though the configured max is 1,000. I think that what happens when RM recovery starts is: * Some number of apps in the state store cause us to handle an {{APP_COMPLETED}} event during recovery. I'm not sure exactly how many – presumably just those that are finished? * Each time we handle one of these events, we call {{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}}, and in both cases we realize that there are more apps both in ZK and in memory than is allowed (limit for both is 1,000). * So for each of these events, we go through the for loops in both {{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}} that try to remove apps from ZK and from memory. 
* For whatever reason – probably a separate issue on this cluster – log aggregation isn't complete for any of these apps. So the for loops never manage to delete apps. And since the for loops are deterministic, they try to delete the same apps every time, but never make progress. And I think the repetition of these for loops for each {{APP_COMPLETED}} event explains the 18 million number – if we can have at most 9,755 finished apps in the state store, and for each of those apps we trigger 2 for loops that can have at most 8,755 iterations, we very quickly wind up with a lot of iterations. Because this change can lead to much longer RM recovery times in circumstances like this one, I think I prefer option {{a}} from the two listed above. Or, I think it's also reasonable to modify the patch from YARN-9571 to have a hardcoded TTL. > RM should not consider an application as COMPLETED when log aggregation is > not in a terminal state > -- > > Key: YARN-4946 > URL: https://issues.apache.org/jira/browse/YARN-4946 > Project: Hadoop YARN
[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state
[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898599#comment-16898599 ] Steven Rand commented on YARN-4946: --- I noticed after upgrading a cluster to 3.2.0 that RM recovery now takes about 20 minutes, whereas before it took less than one minute. I checked the RM's logs, and noticed that it hits the code path added in this patch more than 18 million times {code:java} # The log rotation settings allow for only 20 log files, so actually this number is lower than the real count. $ grep 'but not removing' hadoop-palantir-resourcemanager-.log* | wc -l 18092893 {code} I checked in ZK, and according to {{./zkCli.sh ls /rmstore/ZKRMStateRoot/RMAppRoot}}, I have 9,755 apps in the RM state store, even though the configured max is 1,000. I think that what happens when RM recovery starts is: * Some number of apps in the state store cause us to handle an {{APP_COMPLETED}} event during recovery. I'm not sure exactly how many – presumably just those that are finished? * Each time we handle one of these events, we call {{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}}, and in both cases we realize that there are more apps both in ZK and in memory than is allowed (limit for both is 1,000). * So for each of these events, we go through the for loops in both {{removeCompletedAppsFromStateStore}} and {{removeCompletedAppsFromMemory}} that try to remove apps from ZK and from memory. * For whatever reason – probably a separate issue on this cluster – log aggregation isn't complete for any of these apps. So the for loops never manage to delete apps. And since the for loops are deterministic, they try to delete the same apps every time, but never make progress. 
And I think the repetition of these for loops for each {{APP_COMPLETED}} event explains the 18 million number – if we can have at most 9,755 finished apps in the state store, and for each of those apps we trigger 2 for loops that can have at most 8,755 iterations, we very quickly wind up with a lot of iterations. Because this change can lead to much longer RM recovery times in circumstances like this one, I think I prefer option {{a}} from the two listed above. Or, I think it's also reasonable to modify the patch from YARN-9571 to have a hardcoded TTL. > RM should not consider an application as COMPLETED when log aggregation is > not in a terminal state > -- > > Key: YARN-4946 > URL: https://issues.apache.org/jira/browse/YARN-4946 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-4946.001.patch, YARN-4946.002.patch, > YARN-4946.003.patch, YARN-4946.004.patch > > > MAPREDUCE-6415 added a tool that combines the aggregated log files for each > Yarn App into a HAR file. When run, it seeds the list by looking at the > aggregated logs directory, and then filters out ineligible apps. One of the > criteria involves checking with the RM that an Application's log aggregation > status is not still running and has not failed. When the RM "forgets" about > an older completed Application (e.g. RM failover, enough time has passed, > etc), the tool won't find the Application in the RM and will just assume that > its log aggregation succeeded, even if it actually failed or is still running. > We can solve this problem by doing the following: > The RM should not consider an app to be fully completed (and thus removed > from its history) until the aggregation status has reached a terminal state > (e.g. SUCCEEDED, FAILED, TIME_OUT). 
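The blowup described in the comment above can be modeled directly. A rough upper-bound sketch; the helper and its name are illustrative, and the real iteration count depends on how many APP_COMPLETED events actually fire during recovery:

```java
public class RecoveryLoopSketch {
    // Upper bound on cleanup-loop iterations during recovery: each completed
    // app's APP_COMPLETED event re-runs two for loops (state store + memory)
    // over the same undeletable over-limit apps, so no progress is ever made.
    static long wastedIterations(int storedApps, int maxApps) {
        int overLimit = storedApps - maxApps; // apps the loops try, and fail, to delete
        return (long) storedApps * 2 * overLimit;
    }

    public static void main(String[] args) {
        // 9,755 stored apps against a configured max of 1,000 gives roughly a
        // 170M-iteration upper bound; the 18M "but not removing" log lines are
        // an undercount because of log rotation.
        System.out.println(wastedIterations(9755, 1000)); // 170810050
    }
}
```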
[jira] [Commented] (YARN-9277) Add more restrictions In FairScheduler Preemption
[ https://issues.apache.org/jira/browse/YARN-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766783#comment-16766783 ] Steven Rand commented on YARN-9277: --- {code} +// We should not preempt container which has been running for a long time. +if ((System.currentTimeMillis() - container.getCreationTime()) >= +getQueue().getFSContext().getPreemptionConfig() +.getToBePreemptedContainerRuntimeThreshold()) { + logPreemptContainerPreCheckInfo( + "this container already run a long time!"); + return false; +} + {code} I disagree with this because it allows for situations in which starved applications can't preempt applications that are over their fair shares. If application A is starved and application B is over its fair share, but happens to have all its containers running for more than the threshold, then application A is unable to preempt and will remain starved. It might be reasonable to sort preemptable containers by runtime and preempt those that have started most recently. However, I worry that this unfairly biases the scheduler against applications with shorter-lived tasks. If code can't be optimized, and really does require very long-running tasks, then these jobs can be run in a queue from which preemption isn't allowed via the {{allowPreemptionFrom}} property. > Add more restrictions In FairScheduler Preemption > -- > > Key: YARN-9277 > URL: https://issues.apache.org/jira/browse/YARN-9277 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > Attachments: YARN-9277.001.patch, YARN-9277.002.patch > > > > I think we should add more restrictions in fair scheduler preemption. > * We should not preempt self > * We should not preempt high priority job > * We should not preempt container which has been running for a long time. > * ... 
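The alternative floated in the comment above, preempting the most recently started containers first instead of exempting long-running ones outright, can be sketched as a simple sort. The {{Container}} type and helper here are hypothetical, not the {{FSPreemptionThread}} API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PreemptOrderSketch {
    // Hypothetical container handle with only the fields this sketch needs.
    static class Container {
        final String id;
        final long creationTime;
        Container(String id, long creationTime) {
            this.id = id;
            this.creationTime = creationTime;
        }
    }

    // Order candidates newest-first, so a starved app can always preempt
    // something, rather than a hard runtime cutoff leaving it with no
    // eligible victims at all.
    static List<Container> preemptionOrder(List<Container> candidates) {
        List<Container> out = new ArrayList<>(candidates);
        out.sort(Comparator.comparingLong((Container c) -> c.creationTime).reversed());
        return out;
    }

    static String firstVictim() {
        List<Container> order = preemptionOrder(List.of(
            new Container("c1", 100L),
            new Container("c2", 300L),
            new Container("c3", 200L)));
        return order.get(0).id;
    }

    public static void main(String[] args) {
        System.out.println(firstVictim()); // c2: the newest container goes first
    }
}
```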
[jira] [Commented] (YARN-9041) Optimize FSPreemptionThread#identifyContainersToPreempt method
[ https://issues.apache.org/jira/browse/YARN-9041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702538#comment-16702538 ] Steven Rand commented on YARN-9041: --- bq. If we not allowed relax locality, it will executes three statements before used this patch. Otherwise it executes only one statement after used this patch. So I think reorder the conditions can improve the performance. Yes, but it could also be true that {{bestContainers}} is {{null}}, which would short-circuit the other three checks, or that {{ResourceRequest.isAnyLocation(rr.getResourceName())}} is true, which would also short-circuit the other three. It's not immediately clear to me which condition is most likely to not be met / which one makes the most sense to put first in the hope of short-circuiting the others. Anyway though, all four checks should be very cheap since all just involve looking at some object that's already in memory, and none have to make RPC calls or do any computation. So I'm okay with any order. > Optimize FSPreemptionThread#identifyContainersToPreempt method > -- > > Key: YARN-9041 > URL: https://issues.apache.org/jira/browse/YARN-9041 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler preemption >Reporter: Wanqiang Ji >Assignee: Wanqiang Ji >Priority: Major > Attachments: YARN-9041.001.patch, YARN-9041.002.patch, > YARN-9041.003.patch, YARN-9041.004.patch, YARN-9041.005.patch > > > In FSPreemptionThread#identifyContainersToPreempt method, I suggest if AM > preemption, and locality relaxation is allowed, then the search space is > expanded to all nodes changed to the remaining nodes. The remaining nodes are > equal to all nodes minus the potential nodes. > Judging condition changed to: > # rr.getRelaxLocality() > # !ResourceRequest.isAnyLocation(rr.getResourceName()) > # bestContainers != null > # bestContainers.numAMContainers > 0 > If I understand the deviation, please criticize me. 
thx~ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
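The cost-vs-order point above can be illustrated with a standalone sketch (toy code, not Hadoop's; method names are illustrative stand-ins for the four checks): with short-circuit `&&`, putting the condition most likely to be false first means the other checks never run, but when all checks are cheap in-memory reads the ordering is a wash.

```java
// Toy demonstration of short-circuit && evaluation order. Counters record
// how many of the four (stand-in) checks actually execute.
public class ShortCircuitDemo {
    static int checksEvaluated = 0;
    static int orderAChecks;
    static int orderBChecks;

    // Stand-ins for the four conditions; only relaxLocality() is false here.
    static boolean relaxLocality()   { checksEvaluated++; return false; }
    static boolean notAnyLocation()  { checksEvaluated++; return true; }
    static boolean bestNonNull()     { checksEvaluated++; return true; }
    static boolean hasAmContainers() { checksEvaluated++; return true; }

    public static void main(String[] args) {
        // Order A: the false check first -- evaluation stops after one check.
        checksEvaluated = 0;
        boolean expand = relaxLocality() && notAnyLocation()
                && bestNonNull() && hasAmContainers();
        orderAChecks = checksEvaluated;
        System.out.println("order A: expand=" + expand + ", checks=" + orderAChecks);

        // Order B: the same false check last -- all four checks run.
        checksEvaluated = 0;
        expand = bestNonNull() && hasAmContainers()
                && notAnyLocation() && relaxLocality();
        orderBChecks = checksEvaluated;
        System.out.println("order B: expand=" + expand + ", checks=" + orderBChecks);
    }
}
```

Either order produces the same result; the only difference is how many cheap checks execute, which is why any ordering is acceptable here.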
[jira] [Commented] (YARN-9066) Deprecate Fair Scheduler min share
[ https://issues.apache.org/jira/browse/YARN-9066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701348#comment-16701348 ] Steven Rand commented on YARN-9066: --- +1 -- I agree with the attached doc that since a schedulable's fair share is already its guaranteed minimum allocation, it's redundant/confusing to have a min share as well. > Deprecate Fair Scheduler min share > -- > > Key: YARN-9066 > URL: https://issues.apache.org/jira/browse/YARN-9066 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 3.2.0 >Reporter: Haibo Chen >Priority: Major > Attachments: Proposal_Deprecate_FS_Min_Share.pdf > > > See the attached docs for details -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9041) Optimize FSPreemptionThread#identifyContainersToPreempt method
[ https://issues.apache.org/jira/browse/YARN-9041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699843#comment-16699843 ] Steven Rand commented on YARN-9041: --- Yes, the v2 patch resolves my concern -- thanks [~jiwq] for fixing that. I'm curious, what's the motivation for reordering the conditions in the {{if}} block?

[jira] [Commented] (YARN-9041) Optimize FSPreemptionThread#identifyContainersToPreempt method
[ https://issues.apache.org/jira/browse/YARN-9041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694329#comment-16694329 ] Steven Rand commented on YARN-9041: --- I'm not sure that this is correct. I think that it can lead to a failure to preempt in cases where we should be preempting. This will happen if the initial {{potentialNodes}} contain preemptible containers, but the remaining nodes don't. An example to illustrate what I'm thinking:
* We have nodes A, B, and C.
* At first {{potentialNodes}} includes only node A, because we're preempting for a node-local request for that node.
* We find that we can preempt a container on node A, but it's an ApplicationMaster.
* With this patch, we change the search space to be only nodes B and C (without the patch, the search space becomes A, B, and C).
* There are no preemptible containers on nodes B and C.
The outcome in this example is that we don't preempt at all. However, what we want to do is preempt the AM container on node A. Hopefully that makes sense, but let me know if I'm misunderstanding.
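The A/B/C scenario above can be written as a toy model (illustrative code, not Hadoop's internals): the patched behavior searches only the remaining nodes and finds nothing, while the original behavior searches all nodes and still reaches the AM container on node A.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the search-space question. Node A holds only an AM container;
// B and C hold nothing preemptible.
public class SearchSpaceDemo {
    static Set<String> patchedHits;
    static Set<String> originalHits;

    // Return the subset of searchSpace that has any preemptible container.
    static Set<String> nodesWithPreemptible(Collection<String> searchSpace,
                                            Map<String, Boolean> hasPreemptible) {
        Set<String> hits = new HashSet<>();
        for (String n : searchSpace) {
            if (hasPreemptible.getOrDefault(n, false)) {
                hits.add(n);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> allNodes = List.of("A", "B", "C");
        List<String> potentialNodes = List.of("A");              // node-local request for A
        Map<String, Boolean> hasPreemptible = Map.of("A", true); // only the AM on A

        // Patched: remaining nodes = all minus potential = {B, C} -> nothing found.
        List<String> remaining = new ArrayList<>(allNodes);
        remaining.removeAll(potentialNodes);
        patchedHits = nodesWithPreemptible(remaining, hasPreemptible);
        System.out.println("patched search finds: " + patchedHits);

        // Original: expand to all nodes -> still finds the AM container on A.
        originalHits = nodesWithPreemptible(allNodes, hasPreemptible);
        System.out.println("original search finds: " + originalHits);
    }
}
```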
[jira] [Created] (YARN-8903) when NM becomes unhealthy due to local disk usage, have option to kill application using most space instead of releasing all containers on node
Steven Rand created YARN-8903: - Summary: when NM becomes unhealthy due to local disk usage, have option to kill application using most space instead of releasing all containers on node Key: YARN-8903 URL: https://issues.apache.org/jira/browse/YARN-8903 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 3.1.1 Reporter: Steven Rand

We sometimes experience an issue in which a single application, usually a Spark job, causes at least one node in a YARN cluster to become unhealthy by filling up the local dir(s) on that node past the threshold at which the node is considered unhealthy. When this happens, the impact is potentially large depending on what else is running on that node, as all containers on that node are lost. Sometimes not much else is running on the node and it's fine, but other times we lose AM containers from other apps and/or non-AM containers with long-running tasks.

I thought that it would be helpful to add an option (default false) whereby if a node is going to become unhealthy due to full local disk(s), it instead identifies the application that's using the most local disk space on that node, and kills that application. (Roughly analogous to how the OOM killer in Linux picks one process to kill rather than letting the machine crash.)

The benefit is that only one application is impacted, and no other application loses any containers. This prevents one user's poorly written code that shuffles/spills huge amounts of data from negatively impacting other users. The downside is that we're killing the entire application, not just the task(s) responsible for the local disk usage. I believe it's necessary to kill the whole application instead of identifying the container running the relevant task(s), because doing so would require more knowledge of the internal state of aux services responsible for shuffling than what YARN has, according to my understanding.

If this seems reasonable, I can work on the implementation.
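The victim-selection step of the proposal could look roughly like the sketch below (hypothetical names throughout -- `selectAppToKill` and the per-app usage map are not a real YARN API, just an illustration of the OOM-killer-style policy under the assumption that the NM can attribute local-dir bytes to applications):

```java
import java.util.Map;

// Hypothetical sketch of the proposed policy: when local-disk usage crosses
// the unhealthy threshold, pick the single application consuming the most
// local space instead of failing every container on the node.
public class DiskPressurePolicy {
    // Returns the appId with the largest local-disk footprint, or null if
    // the map is empty.
    static String selectAppToKill(Map<String, Long> bytesUsedByApp) {
        String victim = null;
        long max = -1;
        for (Map.Entry<String, Long> e : bytesUsedByApp.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                victim = e.getKey();
            }
        }
        return victim;
    }

    public static void main(String[] args) {
        Map<String, Long> usage = Map.of(
            "application_1_0001", 5L * 1024 * 1024 * 1024,  // 5 GB of shuffle spill
            "application_1_0002", 200L * 1024 * 1024);      // 200 MB
        // Only the heaviest app is killed; the other keeps its containers.
        System.out.println("would kill: " + selectAppToKill(usage));
    }
}
```

The design choice mirrors the Linux OOM killer mentioned above: sacrifice the single worst offender so the node, and every other application on it, survives.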
[jira] [Commented] (YARN-7903) Method getStarvedResourceRequests() only consider the first encountered resource
[ https://issues.apache.org/jira/browse/YARN-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357923#comment-16357923 ] Steven Rand commented on YARN-7903: --- Agreed that having a concept of delay scheduling for preemption is a good idea and would help with both JIRAs. We might be able to use {{FSAppAttempt.getAllowedLocalityLevel}} or {{FSAppAttempt.getAllowedLocalityLevelByTime}}, since those already have logic for checking whether the app has waited longer than the threshold for requests with some {{SchedulerKey}} (which seems to really just mean priority?). I'll defer to others though on whether it makes sense for delay logic in preemption to match delay logic in allocation -- possibly there are differences between the two that call for separate logic. I'm also quite confused as to how we should be thinking about different RRs from the same app at the same priority. I spent some time digging through the code today, but don't really understand it yet. There are a couple pieces of code I found that deal with deduping/deconflicting RRs, but I wasn't sure how to interpret them: * {{VisitedResourceRequestTracker}} seems to consider RRs with the same priority and capability to be logically the same * {{AppSchedulingInfo#internalAddResourceRequests}} seems to consider RRs with the same {{SchedulerRequestKey}} and resourceName to be logically the same > Method getStarvedResourceRequests() only consider the first encountered > resource > > > Key: YARN-7903 > URL: https://issues.apache.org/jira/browse/YARN-7903 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 3.1.0 >Reporter: Yufei Gu >Priority: Major > > We need to specify rack and ANY while submitting a node local resource > request, as YARN-7561 discussed. 
For example:
> {code}
> ResourceRequest nodeRequest =
>     createResourceRequest(GB, node1.getHostName(), 1, 1, false);
> ResourceRequest rackRequest =
>     createResourceRequest(GB, node1.getRackName(), 1, 1, false);
> ResourceRequest anyRequest =
>     createResourceRequest(GB, ResourceRequest.ANY, 1, 1, false);
> List<ResourceRequest> resourceRequests =
>     Arrays.asList(nodeRequest, rackRequest, anyRequest);
> {code}
> However, method getStarvedResourceRequests() only consider the first
> encountered resource, which most likely is ResourceRequest.ANY. That's a
> mismatch for locality request.
[jira] [Commented] (YARN-7655) Avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357913#comment-16357913 ] Steven Rand commented on YARN-7655: --- Thanks [~yufeigu]. I filed YARN-7910 for the {{TODO}} in the unit test. Unfortunately I also realized that I made a mistake in how I interpreted the value of {{ResourceRequest.getRelaxLocality}} -- filed YARN-7911 for that. > Avoid AM preemption caused by RRs for specific nodes or racks > - > > Key: YARN-7655 > URL: https://issues.apache.org/jira/browse/YARN-7655 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 3.0.0 >Reporter: Steven Rand >Assignee: Steven Rand >Priority: Major > Fix For: 3.1.0 > > Attachments: YARN-7655-001.patch, YARN-7655-002.patch, > YARN-7655-003.patch, YARN-7655-004.patch > > > We frequently see AM preemptions when > {{starvedApp.getStarvedResourceRequests()}} in > {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs > that request containers on a specific node. Since this causes us to only > consider one node to preempt containers on, the really good work that was > done in YARN-5830 doesn't save us from AM preemption. Even though there might > be multiple nodes on which we could preempt enough non-AM containers to > satisfy the app's starvation, we often wind up preempting one or more AM > containers on the single node that we're considering. > A proposed solution is that if we're going to preempt one or more AM > containers for an RR that specifies a node or rack, then we should instead > expand the search space to consider all nodes. That way we take advantage of > YARN-5830, and only preempt AMs if there's no alternative. I've attached a > patch with an initial implementation of this. We've been running it on a few > clusters, and have seen AM preemptions drop from double-digit occurrences on > many days to zero. 
> Of course, the tradeoff is some loss of locality, since the starved app is > less likely to be allocated resources at the most specific locality level > that it asked for. My opinion is that this tradeoff is worth it, but > interested to hear what others think as well.
[jira] [Created] (YARN-7911) Method identifyContainersToPreempt uses ResourceRequest#getRelaxLocality incorrectly
Steven Rand created YARN-7911: - Summary: Method identifyContainersToPreempt uses ResourceRequest#getRelaxLocality incorrectly Key: YARN-7911 URL: https://issues.apache.org/jira/browse/YARN-7911 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 3.1.0 Reporter: Steven Rand Assignee: Steven Rand

After YARN-7655, in {{identifyContainersToPreempt}} we expand the search space to all nodes if we had previously only considered a subset to satisfy a {{NODE_LOCAL}} or {{RACK_LOCAL}} RR, and were going to preempt AM containers as a result, and the RR allowed locality to be relaxed:
{code}
// Don't preempt AM containers just to satisfy local requests if relax
// locality is enabled.
if (bestContainers != null &&
    bestContainers.numAMContainers > 0 &&
    !ResourceRequest.isAnyLocation(rr.getResourceName()) &&
    rr.getRelaxLocality()) {
  bestContainers = identifyContainersToPreemptForOneContainer(
      scheduler.getNodeTracker().getAllNodes(), rr);
}
{code}
This turns out to be based on a misunderstanding of what {{rr.getRelaxLocality}} means. I had believed that it means that locality can be relaxed _from_ that level. However, it actually means that locality can be relaxed _to_ that level: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ResourceRequest.java#L450. For example, suppose we have {{relaxLocality}} set to {{true}} at the node level, but {{false}} at the rack and {{ANY}} levels. This is saying that we cannot relax locality to the rack level. However, the current behavior after YARN-7655 is to interpret relaxLocality being true at the node level as saying that it's okay to satisfy the request elsewhere. What we should do instead is check whether relaxLocality is enabled for the corresponding RR at the next level.
So if we're considering a node-level RR, we should find the corresponding rack-level RR and check whether relaxLocality is enabled for it. And similarly, if we're considering a rack-level RR, we should check the corresponding any-level RR. It may also be better to use {{FSAppAttempt#getAllowedLocalityLevel}} instead of explicitly checking {{relaxLocality}}, but I'm not sure which is correct.
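The proposed "check the next level" rule can be sketched as follows (simplified types, not Hadoop's; the enum and lookup map are illustrative): whether a request at some level may be satisfied elsewhere depends on the relaxLocality flag of the *next broader* request in the same group, because relaxLocality describes relaxing *to* a level, not from it.

```java
import java.util.Map;

// Illustrative sketch of the proposed fix. A resource-request group has one
// RR per locality level; each carries its own relaxLocality flag.
public class RelaxLocalityCheck {
    enum Level { NODE, RACK, ANY }

    // May a request at `level` be relaxed to a broader level? Consult the
    // relaxLocality flag of the next level up, not the request's own flag.
    static boolean mayRelaxFrom(Level level, Map<Level, Boolean> relaxByLevel) {
        switch (level) {
            case NODE: return relaxByLevel.getOrDefault(Level.RACK, false);
            case RACK: return relaxByLevel.getOrDefault(Level.ANY, false);
            default:   return false; // nothing broader than ANY
        }
    }

    public static void main(String[] args) {
        // relaxLocality true at NODE but false at RACK and ANY. The old
        // reading ("can relax from NODE") said yes; the corrected reading
        // says no, because relaxing to the rack level is forbidden.
        Map<Level, Boolean> flags =
            Map.of(Level.NODE, true, Level.RACK, false, Level.ANY, false);
        System.out.println("may relax node-level request: "
            + mayRelaxFrom(Level.NODE, flags));
    }
}
```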
[jira] [Created] (YARN-7910) Fix TODO in TestFairSchedulerPreemption#testRelaxLocalityToNotPreemptAM
Steven Rand created YARN-7910: - Summary: Fix TODO in TestFairSchedulerPreemption#testRelaxLocalityToNotPreemptAM Key: YARN-7910 URL: https://issues.apache.org/jira/browse/YARN-7910 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler, test Affects Versions: 3.1.0 Reporter: Steven Rand Assignee: Steven Rand

In YARN-7655, we left a {{TODO}} in the newly added test:
{code}
// TODO (YARN-7655) The starved app should be allocated 4 containers.
// It should be possible to modify the RRs such that this is true
// after YARN-7903.
verifyPreemption(0, 4);
{code}
This JIRA is to track resolving that after YARN-7903 is resolved.
[jira] [Commented] (YARN-7903) Method getStarvedResourceRequests() only consider the first encountered resource
[ https://issues.apache.org/jira/browse/YARN-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356327#comment-16356327 ] Steven Rand commented on YARN-7903: --- Agreed that it seems weird/wrong to ignore locality when considering which of an app's RRs to preempt for. I think it's worth noting though that if we change the code to choose the most local request, then we increase the frequency of the failure mode described in YARN-6956, where we fail to preempt because {{getStarvedResourceRequests}} returns only {{NODE_LOCAL}} RRs, and there aren't any preemptable containers on those nodes (even though there are preemptable containers on other nodes). I think that we should try to make progress on that JIRA as well as this one.
[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356308#comment-16356308 ] Steven Rand commented on YARN-7655: --- Sounds good, I revised the patch to mention YARN-7903 in a comment in the test. Is this patch blocked on YARN-7903, or is it enough to leave the {{TODO}} for now and revise the RRs after that JIRA is resolved? Relatedly, I thought it was odd that {{FSAppAttempt#hasContainerForNode}} only considers the size of the off-switch ask, even when there also exists a RACK_LOCAL request and/or a NODE_LOCAL request. I don't understand that code super well though, so it might be correct.
[jira] [Updated] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-7655: -- Attachment: YARN-7655-004.patch
[jira] [Updated] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-7655: -- Attachment: YARN-7655-003.patch
[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16353279#comment-16353279 ] Steven Rand commented on YARN-7655: --- The concern I have with all three RRs being the same size is that we don't necessarily consider the {{NODE_LOCAL}} RR for preemption. My understanding is that we might wind up preempting for one of the other RRs, in which case we're no longer testing the change to the production code. Let me know if I'm misunderstanding though.
[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351704#comment-16351704 ] Steven Rand commented on YARN-7655: --- Thanks [~yufeigu], new patch is attached. Unfortunately I'm still struggling to have the starved app be allocated the right number of containers in the test (though the preemption part happens correctly). The details of that are in my first comment above. It seems like the options are:
* What the current patch does, which is just leave a TODO above where we check for allocation.
* Only test that the preemption went as expected, and don't test allocation, i.e., don't call {{verifyPreemption}}.
* Find a way to have the allocation work out while still guaranteeing that the RR we consider for preemption is the {{NODE_LOCAL}} one.
I thought I'd be able to figure this out, but have to admit I've been unsuccessful.
[jira] [Updated] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-7655: -- Attachment: YARN-7655-002.patch
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344335#comment-16344335 ] Steven Rand commented on YARN-7655: --- Sounds good, thanks!
[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341964#comment-16341964 ] Steven Rand commented on YARN-7655: --- I'm not sure whether many AMs wind up on a limited number of NMs. It's quite possible -- my guess based on application patterns is that these clusters are running more AMs per node than most other clusters are. Thanks for the two links. It does look like both of those things would let us spread out the AMs better, which should lead to fewer total AM preemptions, though not necessarily prevent node-local requests from causing them. Do you think the patch is worth pursuing? I'll buy that the clusters I have in mind were likely seeing so many AM preemptions due to a combination of custom config and access patterns involving many YARN applications, and therefore many AMs. On the other hand, the patch is a small change, and should be beneficial if you value not having to retry your app after an AM preemption more than you value the associated loss of locality, which I suspect most people do.
[jira] [Comment Edited] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325928#comment-16325928 ] Steven Rand edited comment on YARN-7655 at 1/15/18 6:49 AM: Thanks [~yufeigu] for taking a look. The cluster sizes and nodes should be pretty reasonable – for the three clusters I have in mind, the nodes are AWS ec2 instances with around 120 GB of RAM and around 20 vcores. The clusters range in size from double-digits to low triple-digits. That said, there is some configuration in place at these clusters which could explain high rates of AM preemption. Specifically: * The default max AM share is set to -1. Unfortunately the max AM share feature, while totally reasonable as far as I can tell, was causing a good deal of confusion when apps would fail to start for no apparent reason upon hitting the limit, and we disabled it in the hope that having one less variable would make the scheduler's behavior easier to understand. * The default fair share preemption threshold is set to 1.0. This was also an attempt to reduce confusion, as failure to preempt while below fair share (but above fair share * the threshold) was commonly misinterpreted as a bug. * The preemption timeouts for fair share and min share are also non-default – they're set to one second each. Possibly the configuration overrides, along with access patterns that include apps frequently starting up or increasing their demand via Spark's dynamic allocation feature, are the issue here, in which case we don't need to pursue this JIRA further. Data on whether or not other YARN deployments experience this issue would be useful, though not easy to come by, as I had to add custom logging to identify NODE_LOCAL requests as the cause of most AM preemptions at these clusters. was (Author: steven rand): Thanks [~yufeigu] for taking a look. 
[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16325928#comment-16325928 ] Steven Rand commented on YARN-7655: --- Thanks [~yufeigu] for taking a look. The cluster sizes and nodes should be pretty reasonable -- for the three clusters I have in mind, the nodes are AWS ec2 instances with around 120 GB of RAM and around 20 vcores. The clusters range in size from double-digits to low triple-digits. That said, there is some configuration in place at these clusters which could explain high rates of AM preemption. Specifically: * The default max AM share is set to -1. Unfortunately the max AM share feature, while totally reasonable as far as I can tell, was causing a good deal of confusion when apps would fail to start for no apparent reason upon hitting the limit, and we disabled it in the hope that having one less variable would make the scheduler's behavior easier to understand. * The default fair share preemption threshold is set to 1.0. This was also an attempt to reduce confusion, as failure to preempt while below fair share (but above fair share * the threshold) was commonly misinterpreted as a bug. * The preemption timeouts for fair share and min share are also non-default -- they're set to one second each. Possibly the configuration overrides, along with access patterns that include apps frequently starting up or increasing their demand via Spark's dynamic allocation feature, are the issue here, in which case we don't need to pursue this JIRA further. Data on whether or not other YARN deployments experience this issue would be useful, though not easy to come by, as I had to add custom logging to identify NODE_LOCAL requests as the cause of most AM preemptions at these clusters.
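For concreteness, the non-default overrides described in the comment above would look roughly like this in the FairScheduler allocation file ({{fair-scheduler.xml}}). This is a hedged reconstruction from the comment's prose, not the actual config used on those clusters:

```xml
<allocations>
  <!-- Disable the max AM share check entirely (-1.0 means no limit). -->
  <queueMaxAMShareDefault>-1.0</queueMaxAMShareDefault>
  <!-- Preempt whenever an app is below full fair share (default threshold is 0.5). -->
  <defaultFairSharePreemptionThreshold>1.0</defaultFairSharePreemptionThreshold>
  <!-- Aggressive one-second preemption timeouts (in seconds). -->
  <defaultFairSharePreemptionTimeout>1</defaultFairSharePreemptionTimeout>
  <defaultMinSharePreemptionTimeout>1</defaultMinSharePreemptionTimeout>
</allocations>
```

With a 1.0 threshold and one-second timeouts, any starved app triggers preemption almost immediately, which plausibly amplifies the AM-preemption rate described above.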
[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16310553#comment-16310553 ] Steven Rand commented on YARN-7655: --- Tagging [~yufeigu] and [~templedf] for thoughts. I can work through the above weirdness with the test case, but interested to hear what people think of the proposed change.
[jira] [Commented] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290073#comment-16290073 ] Steven Rand commented on YARN-7655: --- One issue I'm having with the test in the patch is that preemption works as expected, but the starved app doesn't have any containers allocated to it. I think the series of events that causes this is: * For purposes of the test, I'm only interested in requesting resources on a particular node. But as discussed in YARN-7561, this requires me to also make a rack-local request and a request for any node at the same priority. * To make sure that the RR that we consider for preemption is the node-local one, I made the other two RRs too big to be satisfied, so that way {{getStarvedResourceRequests}} skips them. * However, when we go to allocate the preempted resources to the starving app, it turns out that {{FSAppAttempt#hasContainerForNode}} only looks at the capacity of the off-switch ask: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1071. This causes it to decide that the starving app can't be allocated resources on the node, since I intentionally made the off-switch RR too big to fit on any of the test nodes. The fact that the node-local request (for the other node) is small enough to fit on this node gets ignored. I'm having trouble figuring out what to do about this. I had assumed that if relaxLocality was true for an RR, then it would be able to be satisfied on node B even though it asked for node A. Is this not correct? Or should FSAppAttempt#hasContainerForNode be modified to check the sizes of the asks at rack and node-level (if those exist)? 
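To make the question above concrete, here is a tiny standalone sketch (plain Java, not Hadoop's actual {{FSAppAttempt}} code; all names and signatures are illustrative) of what checking every locality level in {{hasContainerForNode}} might look like, instead of sizing the check against only the off-switch ask:

```java
// Hypothetical sketch: decide whether any ask at a priority could fit on a node,
// considering the node- and rack-level asks as well as the off-switch (ANY) ask.
// Trunk's behavior, per the comment above, checks only the ANY ask's capability.
public class LocalityCheck {

  /**
   * Returns true if at least one ask at this priority fits in availableMb.
   * rackAskMb / nodeAskMb are null when no ask exists at that locality level.
   */
  static boolean hasContainerForNode(int anyAskMb, Integer rackAskMb,
                                     Integer nodeAskMb, int availableMb) {
    if (anyAskMb <= availableMb) return true;                        // current check
    if (rackAskMb != null && rackAskMb <= availableMb) return true;  // proposed
    if (nodeAskMb != null && nodeAskMb <= availableMb) return true;  // proposed
    return false;
  }
}
```

Under this sketch, an intentionally oversized ANY ask no longer masks a small node-local ask, which matches the failure mode described in the test scenario above.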
[jira] [Updated] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
[ https://issues.apache.org/jira/browse/YARN-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-7655: -- Attachment: YARN-7655-001.patch
[jira] [Created] (YARN-7655) avoid AM preemption caused by RRs for specific nodes or racks
Steven Rand created YARN-7655: - Summary: avoid AM preemption caused by RRs for specific nodes or racks Key: YARN-7655 URL: https://issues.apache.org/jira/browse/YARN-7655 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 3.0.0 Reporter: Steven Rand Assignee: Steven Rand We frequently see AM preemptions when {{starvedApp.getStarvedResourceRequests()}} in {{FSPreemptionThread#identifyContainersToPreempt}} includes one or more RRs that request containers on a specific node. Since this causes us to only consider one node to preempt containers on, the really good work that was done in YARN-5830 doesn't save us from AM preemption. Even though there might be multiple nodes on which we could preempt enough non-AM containers to satisfy the app's starvation, we often wind up preempting one or more AM containers on the single node that we're considering. A proposed solution is that if we're going to preempt one or more AM containers for an RR that specifies a node or rack, then we should instead expand the search space to consider all nodes. That way we take advantage of YARN-5830, and only preempt AMs if there's no alternative. I've attached a patch with an initial implementation of this. We've been running it on a few clusters, and have seen AM preemptions drop from double-digit occurrences on many days to zero. Of course, the tradeoff is some loss of locality, since the starved app is less likely to be allocated resources at the most specific locality level that it asked for. My opinion is that this tradeoff is worth it, but interested to hear what others think as well.
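The proposed behavior can be sketched as a toy model (plain Java; all class and method names here are hypothetical, not the actual {{FSPreemptionThread}} implementation): first try to satisfy the node-specific RR without touching AM containers, and only when that fails widen the search to every node before ever considering an AM.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the proposed fallback: prefer non-AM containers on the requested
// node, then widen the search space cluster-wide (YARN-5830 style) before an AM
// container would ever be preempted.
public class PreemptionSketch {
  static class Container {
    final boolean isAm;
    final int memoryMb;
    Container(boolean isAm, int memoryMb) { this.isAm = isAm; this.memoryMb = memoryMb; }
  }

  static class Node {
    final String name;
    final List<Container> containers = new ArrayList<>();
    Node(String name) { this.name = name; }
  }

  /** Pick non-AM containers worth demandMb from one node; null if impossible. */
  static List<Container> nonAmContainersFor(Node node, int demandMb) {
    List<Container> picked = new ArrayList<>();
    int freed = 0;
    for (Container c : node.containers) {
      if (c.isAm) continue;             // never pick an AM container here
      picked.add(c);
      freed += c.memoryMb;
      if (freed >= demandMb) return picked;
    }
    return null;                        // demand not satisfiable without an AM
  }

  /** Proposed flow: requested node first, then widen to all nodes. */
  static List<Container> identifyContainersToPreempt(
      List<Node> allNodes, Node requestedNode, int demandMb) {
    List<Container> local = nonAmContainersFor(requestedNode, demandMb);
    if (local != null) return local;    // locality preserved, no AM touched
    for (Node n : allNodes) {           // widened search space
      List<Container> candidates = nonAmContainersFor(n, demandMb);
      if (candidates != null) return candidates;
    }
    return null;                        // only now would AM preemption be considered
  }

  public static void main(String[] args) {
    Node n1 = new Node("n1");
    n1.containers.add(new Container(true, 1024));   // AM is the only thing on n1
    Node n2 = new Node("n2");
    n2.containers.add(new Container(false, 2048));  // plain task container
    List<Container> picked = identifyContainersToPreempt(List.of(n1, n2), n1, 1024);
    System.out.println("AM spared by widened search: " + (picked != null && !picked.get(0).isAm));
  }
}
```

In this toy run, the node-local search on n1 finds only an AM, so the widened search preempts the task container on n2 instead, trading locality for AM survival exactly as the description proposes.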
[jira] [Updated] (YARN-7290) canContainerBePreempted can return true when it shouldn't
[ https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-7290: -- Attachment: YARN-7290.005.patch Thanks, [~yufeigu]. Attaching a new patch which removes the list of containers, and changes {{resourcesToPreemptByApp}} to a {{Map}} from appId to the resources being considered for preemption. Re: the checkstyle issues, one of them no longer applies to the new patch. The other one is that the new {{containersByApp}} variable should be made private and an accessor method should be created for it. I'm happy to do that, but it also would be inconsistent with the other variables in {{PreemptableContainers}}, which aren't private and don't have getters. I don't have a strong opinion, so happy to handle this however people prefer. > canContainerBePreempted can return true when it shouldn't > - > > Key: YARN-7290 > URL: https://issues.apache.org/jira/browse/YARN-7290 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 3.0.0-beta1 >Reporter: Steven Rand >Assignee: Steven Rand > Attachments: YARN-7290-failing-test.patch, YARN-7290.001.patch, > YARN-7290.002.patch, YARN-7290.003.patch, YARN-7290.004.patch, > YARN-7290.005.patch > > > In FSAppAttempt#canContainerBePreempted, we make sure that preempting the > given container would not put the app below its fair share: > {code} > // Check if the app's allocation will be over its fairshare even > // after preempting this container > Resource usageAfterPreemption = Resources.clone(getResourceUsage()); > // Subtract resources of containers already queued for preemption > synchronized (preemptionVariablesLock) { > Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted); > } > // Subtract this container's allocation to compute usage after preemption > Resources.subtractFrom( > usageAfterPreemption, container.getAllocatedResource()); > return !isUsageBelowShare(usageAfterPreemption, getFairShare()); > {code} > However, this only considers one container in 
isolation, and fails to > consider containers for the same app that we already added to > {{preemptableContainers}} in > FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a > case where we preempt multiple containers from the same app, none of which by > itself puts the app below fair share, but which cumulatively do so. > I've attached a patch with a test to show this behavior. The flow is: > 1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated > all the resources (8g and 8vcores) > 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 > containers, each of which is 3g and 3vcores in size. At this point both > greedyApp and starvingApp have a fair share of 4g (with DRF not in use). > 3. For the first container requested by starvedApp, we (correctly) preempt 3 > containers from greedyApp, each of which is 1g and 1vcore. > 4. For the second container requested by starvedApp, we again (this time > incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below > its fair share, but happens anyway because all six times that we call > {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the > value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using > debugger). > So in addition to accounting for {{resourcesToBePreempted}}, we also need to > account for containers that we're already planning on preempting in > FSPreemptionThread#identifyContainersToPreemptOnNode.
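The fix direction described above (accounting per app for containers already selected in the current pass) can be illustrated with a self-contained sketch; the class and field names are hypothetical, not the actual {{FSAppAttempt}} / {{FSPreemptionThread}} code, and memory stands in for the full Resource:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of cumulative fair-share accounting: resources already earmarked for
// preemption in this pass are tracked per app, so a sequence of containers that
// are individually safe cannot collectively push the app below its fair share.
public class FairShareGuard {
  final Map<String, Integer> usageByApp = new HashMap<>();     // current allocation (MB)
  final Map<String, Integer> fairShareByApp = new HashMap<>(); // fair share (MB)
  // Resources already selected for preemption in this pass, keyed by app.
  final Map<String, Integer> earmarkedByApp = new HashMap<>();

  /** True iff preempting sizeMb more from app keeps it at or above fair share. */
  boolean canPreempt(String app, int sizeMb) {
    int alreadyEarmarked = earmarkedByApp.getOrDefault(app, 0);
    int usageAfter = usageByApp.get(app) - alreadyEarmarked - sizeMb;
    if (usageAfter < fairShareByApp.get(app)) {
      return false;                                   // cumulative limit reached
    }
    earmarkedByApp.merge(app, sizeMb, Integer::sum);  // record this selection
    return true;
  }
}
```

Mirroring the JIRA's numbers (greedyApp at 8g usage, 4g fair share, 1g containers), this guard approves exactly four preemptions and rejects the fifth, whereas the buggy single-container check saw 7g remaining on every one of the six calls.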
[jira] [Updated] (YARN-7290) canContainerBePreempted can return true when it shouldn't
[ https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-7290: -- Attachment: YARN-7290.004.patch Thanks for reviewing, [~yufeigu]! I've attached a new patch which addresses the comments: * Agreed that it makes sense to move the logic of computing app resource usage after preemption into its own method. I added a new method called {{getUsageAfterPreemptingContainer}} below {{canContainerBePreempted}}. * The unit test does cover the second issue. Actually I hadn't noticed the second issue from looking at the code, and only noticed it when the test still failed after addressing the first issue. * Agreed that {{identifyContainersToPreemptOnNode}} has quite a lot of logic in it now. I moved the map from appId to resources we're considering for preemption into {{PreemptableContainers}}, which seems like the right place for it, and simplifies that method. * Unused import is now removed -- nice catch.
[jira] [Updated] (YARN-7290) canContainerBePreempted can return true when it shouldn't
[ https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-7290: -- Attachment: YARN-7290.003.patch Uploaded a new patch to try to make the test a bit nicer. [~templedf], would it be possible for you or someone else to take a look? This bug seems to still exist on trunk, and I think it'd be good to fix it. > canContainerBePreempted can return true when it shouldn't > - > > Key: YARN-7290 > URL: https://issues.apache.org/jira/browse/YARN-7290 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 3.0.0-beta1 >Reporter: Steven Rand >Assignee: Steven Rand > Attachments: YARN-7290-failing-test.patch, YARN-7290.001.patch, > YARN-7290.002.patch, YARN-7290.003.patch > > > In FSAppAttempt#canContainerBePreempted, we make sure that preempting the > given container would not put the app below its fair share: > {code} > // Check if the app's allocation will be over its fairshare even > // after preempting this container > Resource usageAfterPreemption = Resources.clone(getResourceUsage()); > // Subtract resources of containers already queued for preemption > synchronized (preemptionVariablesLock) { > Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted); > } > // Subtract this container's allocation to compute usage after preemption > Resources.subtractFrom( > usageAfterPreemption, container.getAllocatedResource()); > return !isUsageBelowShare(usageAfterPreemption, getFairShare()); > {code} > However, this only considers one container in isolation, and fails to > consider containers for the same app that we already added to > {{preemptableContainers}} in > FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a > case where we preempt multiple containers from the same app, none of which by > itself puts the app below fair share, but which cumulatively do so. > I've attached a patch with a test to show this behavior. 
The flow is: > 1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated > all the resources (8g and 8vcores) > 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 > containers, each of which is 3g and 3vcores in size. At this point both > greedyApp and starvingApp have a fair share of 4g (with DRF not in use). > 3. For the first container requested by starvedApp, we (correctly) preempt 3 > containers from greedyApp, each of which is 1g and 1vcore. > 4. For the second container requested by starvedApp, we again (this time > incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below > its fair share, but happens anyway because all six times that we call > {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the > value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using > debugger). > So in addition to accounting for {{resourcesToBePreempted}}, we also need to > account for containers that we're already planning on preempting in > FSPreemptionThread#identifyContainersToPreemptOnNode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7391) Consider square root instead of natural log for size-based weight
[ https://issues.apache.org/jira/browse/YARN-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-7391: -- Attachment: YARN-7391-001.patch I know this is still under discussion, but attached a patch just to make the intent/scope of the proposed change totally clear. > Consider square root instead of natural log for size-based weight > - > > Key: YARN-7391 > URL: https://issues.apache.org/jira/browse/YARN-7391 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 3.0.0-beta1 >Reporter: Steven Rand > Attachments: YARN-7391-001.patch > > > Currently for size-based weight, we compute the weight of an app using this > code from > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L377: > {code} > if (sizeBasedWeight) { > // Set weight based on current memory demand > weight = Math.log1p(app.getDemand().getMemorySize()) / Math.log(2); > } > {code} > Because the natural log function grows slowly, the weights of two apps with > hugely different memory demands can be quite similar. For example, {{weight}} > evaluates to 14.3 for an app with a demand of 20 GB, and evaluates to 19.9 > for an app with a demand of 1000 GB. The app with the much larger demand will > still have a higher weight, but not by a large amount relative to the sum of > those weights. > I think it's worth considering a switch to a square root function, which will > grow more quickly. In the above example, the app with a demand of 20 GB now > has a weight of 143, while the app with a demand of 1000 GB now has a weight > of 1012. These weights seem more reasonable relative to each other given the > difference in demand between the two apps. 
> The above example is admittedly a bit extreme, but I believe that a square > root function would also produce reasonable results in general. > The code I have in mind would look something like: > {code} > if (sizeBasedWeight) { > // Set weight based on current memory demand > weight = Math.sqrt(app.getDemand().getMemorySize()); > } > {code} > Would people be comfortable with this change? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
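The numbers in the description above can be checked with a short sketch. Demands are in MB, as {{app.getDemand().getMemorySize()}} returns them; the class and method names here are just for illustration, not part of FairScheduler.

```java
// Compares the current log-based size-based weight with the proposed
// square-root version, using the 20 GB and 1000 GB demands from the
// example (demands in MB).
public class SizeBasedWeightSketch {

    // Current trunk behavior: log2(1 + demand)
    static double logWeight(long demandMb) {
        return Math.log1p(demandMb) / Math.log(2);
    }

    // Proposed behavior: sqrt(demand)
    static double sqrtWeight(long demandMb) {
        return Math.sqrt(demandMb);
    }

    public static void main(String[] args) {
        long smallDemand = 20L * 1024;    // 20 GB
        long largeDemand = 1000L * 1024;  // 1000 GB
        System.out.printf("log:  %.1f vs %.1f%n",
            logWeight(smallDemand), logWeight(largeDemand));   // ~14.3 vs ~20.0
        System.out.printf("sqrt: %.1f vs %.1f%n",
            sqrtWeight(smallDemand), sqrtWeight(largeDemand)); // ~143.1 vs ~1011.9
    }
}
```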
[jira] [Comment Edited] (YARN-7391) Consider square root instead of natural log for size-based weight
[ https://issues.apache.org/jira/browse/YARN-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16223996#comment-16223996 ] Steven Rand edited comment on YARN-7391 at 10/29/17 1:27 PM: - [~templedf] and [~yufeigu], thanks for commenting. Apologies for not including my use case in the original description. We run multiple long-running Spark applications, each of which uses Spark's dynamic allocation feature, and therefore has a demand which fluctuates over time. At any point, the demand of any given app can be quite low (e.g., only an AM container), or quite high (e.g., hundreds of executors). Historically, we've run each app in its own leaf queue, since the Fair Scheduler has not always supported preemption inside a leaf queue. We've found that since the fair share of a parent queue is split evenly among all of its active leaf queues, the fair share of each app is the same, regardless of its demand. This causes our apps with higher demand to have fair shares that are too low for them to preempt enough resources to even get close to meeting their demand. If fair share were based on demand, then our apps with lower demand would be unaffected, but our apps with higher demand could have high enough weights to preempt a reasonable number of resources away from apps that are over their fair shares. This problem led us to consider running more apps inside the same leaf queue, which is no longer an issue now that the Fair Scheduler supports preemption between apps in the same leaf queue. We'd hoped to use the size-based weight feature to achieve the goal of the more demanding apps having high enough fair shares to preempt sufficient resources away from other apps. However, in experimenting with this feature, the results were somewhat underwhelming. Yes, the more demanding apps now have higher fair shares, but not by enough to significantly impact allocation. 
Consider, for example, the rather extreme case of 10 apps running in a leaf queue, where 9 of them are requesting 20GB each, and 1 of them is requesting 1024GB. The weight of each of the 9 less demanding apps is about 14.3, and the weight of the highly demanding app is about 20.0. So the highly demanding app winds up with about 13.5% (20/148) of the queue's fair share, despite having a demand that's more than 5x that of the other 9 put together, as opposed to the 10% it would have with size-based weight turned off. I know the example is a bit silly, but I wanted to show that even with huge differences in demand, the current behavior of size-based weight doesn't produce major differences in weights. Does that make sense? Happy to provide more info if helpful.
[jira] [Commented] (YARN-7391) Consider square root instead of natural log for size-based weight
[ https://issues.apache.org/jira/browse/YARN-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16223996#comment-16223996 ] Steven Rand commented on YARN-7391: --- [~templedf] and [~yufeigu], thanks for commenting. Apologies for not including my use case in the original description. We run multiple long-running Spark applications, each of which uses Spark's dynamic allocation feature, and therefore has a demand which fluctuates over time. At any point, the demand of any given app can be quite low (e.g., only an AM container), or quite high (e.g., hundreds of executors). Historically, we've run each app in its own leaf queue, since the Fair Scheduler has not always supported preemption inside a leaf queue. We've found that since the fair share of a parent queue is split evenly among all of its active leaf queues, the fair share of each app is the same, regardless of its demand. This causes our apps with higher demand to have fair shares that are too low for them to preempt enough resources to even get close to meeting their demand. If fair share were based on demand, then our apps with lower demand would be unaffected, but our apps with higher demand could have high enough weights to preempt a reasonable number of resources away from apps that are over their fair shares. This problem led us to consider running more apps inside the same leaf queue, which is no longer an issue now that the Fair Scheduler supports preemption inside a leaf queue. We'd hoped to use the size-based weight feature to achieve the goal of the more demanding apps having high enough fair shares to preempt sufficient resources away from other apps. However, in experimenting with this feature, the results were somewhat underwhelming. Yes, the more demanding apps now have higher fair shares, but not by enough to significantly impact allocation. 
Consider, for example, the rather extreme case of 10 apps running in a leaf queue, where 9 of them are requesting 20GB each, and 1 of them is requesting 1024GB. The weight of each of the 9 less demanding apps is about 14.3, and the weight of the highly demanding app is about 20.0. So the highly demanding app winds up with about 13.5% (20/148) of the queue's fair share, despite having a demand that's more than 5x that of the other 9 put together, as opposed to the 10% it would have with size-based weight turned off. I know the example is a bit silly, but I wanted to show that even with huge differences in demand, the current behavior of size-based weight doesn't produce major differences in weights. Does that make sense? Happy to provide more info if helpful. > Consider square root instead of natural log for size-based weight > - > > Key: YARN-7391 > URL: https://issues.apache.org/jira/browse/YARN-7391 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 3.0.0-beta1 >Reporter: Steven Rand > > Currently for size-based weight, we compute the weight of an app using this > code from > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L377: > {code} > if (sizeBasedWeight) { > // Set weight based on current memory demand > weight = Math.log1p(app.getDemand().getMemorySize()) / Math.log(2); > } > {code} > Because the natural log function grows slowly, the weights of two apps with > hugely different memory demands can be quite similar. For example, {{weight}} > evaluates to 14.3 for an app with a demand of 20 GB, and evaluates to 19.9 > for an app with a demand of 1000 GB. The app with the much larger demand will > still have a higher weight, but not by a large amount relative to the sum of > those weights. 
> I think it's worth considering a switch to a square root function, which will > grow more quickly. In the above example, the app with a demand of 20 GB now > has a weight of 143, while the app with a demand of 1000 GB now has a weight > of 1012. These weights seem more reasonable relative to each other given the > difference in demand between the two apps. > The above example is admittedly a bit extreme, but I believe that a square > root function would also produce reasonable results in general. > The code I have in mind would look something like: > {code} > if (sizeBasedWeight) { > // Set weight based on current memory demand > weight = Math.sqrt(app.getDemand().getMemorySize()); > } > {code} > Would people be comfortable with this change? -- This message was sent by Atlassian JIRA (v6.4.14#64029) -
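The "about 13.5%" figure from the 10-app example can be reproduced in a few lines. This is a sketch under the assumption that fair share within a leaf queue is divided in proportion to weights; the class name is hypothetical.

```java
// Reproduces the arithmetic in the comment: 10 apps in one leaf queue,
// nine demanding 20 GB and one demanding 1024 GB, under the current
// log-based size-based weight (demands in MB).
public class FairShareSplitSketch {

    static double logWeight(long demandMb) {
        return Math.log1p(demandMb) / Math.log(2);
    }

    public static void main(String[] args) {
        double smallWeight = logWeight(20L * 1024);    // ~14.3 each
        double bigWeight = logWeight(1024L * 1024);    // ~20.0
        double total = 9 * smallWeight + bigWeight;    // ~148.9
        // Assuming the queue's fair share is split in proportion to
        // weight, the highly demanding app gets only roughly 13-14% of
        // the queue despite dominating the aggregate demand.
        System.out.printf("big app share: %.1f%%%n", 100 * bigWeight / total);
    }
}
```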
[jira] [Created] (YARN-7391) Consider square root instead of natural log for size-based weight
Steven Rand created YARN-7391: - Summary: Consider square root instead of natural log for size-based weight Key: YARN-7391 URL: https://issues.apache.org/jira/browse/YARN-7391 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 3.0.0-beta1 Reporter: Steven Rand Currently for size-based weight, we compute the weight of an app using this code from https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L377: {code} if (sizeBasedWeight) { // Set weight based on current memory demand weight = Math.log1p(app.getDemand().getMemorySize()) / Math.log(2); } {code} Because the natural log function grows slowly, the weights of two apps with hugely different memory demands can be quite similar. For example, {{weight}} evaluates to 14.3 for an app with a demand of 20 GB, and evaluates to 19.9 for an app with a demand of 1000 GB. The app with the much larger demand will still have a higher weight, but not by a large amount relative to the sum of those weights. I think it's worth considering a switch to a square root function, which will grow more quickly. In the above example, the app with a demand of 20 GB now has a weight of 143, while the app with a demand of 1000 GB now has a weight of 1012. These weights seem more reasonable relative to each other given the difference in demand between the two apps. The above example is admittedly a bit extreme, but I believe that a square root function would also produce reasonable results in general. The code I have in mind would look something like: {code} if (sizeBasedWeight) { // Set weight based on current memory demand weight = Math.sqrt(app.getDemand().getMemorySize()); } {code} Would people be comfortable with this change? 
[jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node
[ https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216201#comment-16216201 ] Steven Rand commented on YARN-4227: --- Maybe we could have ClusterNodeTracker#getNode check to see if {{nodes.get(nodeId)}} returns null, and if it does, instead log a warning and return a special subclass of {{FSSchedulerNode}} that overrides all methods to be no-ops? I know it's not pretty, but the advantage is that we don't have to check for null in a bunch of different places. Also, after looking more closely, the particular NPE that I'm seeing turns out to have been fixed by YARN-6432. However, I still think that we want a generic solution so as to be protected against access of unhealthy nodes going forward. > FairScheduler: RM quits processing expired container from a removed node > > > Key: YARN-4227 > URL: https://issues.apache.org/jira/browse/YARN-4227 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.3.0, 2.5.0, 2.7.1 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Critical > Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, > YARN-4227.patch > > > Under some circumstances the node is removed before an expired container > event is processed causing the RM to exit: > {code} > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: > Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_1436927988321_1307950_01_12 Container Transitioned from > ACQUIRED to EXPIRED > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: > Completed container: container_1436927988321_1307950_01_12 in state: > EXPIRED event:EXPIRE > 2015-10-04 21:14:01,063 INFO > 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op >OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS > APPID=application_1436927988321_1307950 > CONTAINERID=container_1436927988321_1307950_01_12 > 2015-10-04 21:14:01,063 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type CONTAINER_EXPIRED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585) > at java.lang.Thread.run(Thread.java:745) > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {code} > The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 > and 2.6.0 by different customers. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
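The no-op-node idea suggested above is essentially the null-object pattern. Below is a minimal hedged sketch of it — the interface and classes are toy stand-ins for ClusterNodeTracker and FSSchedulerNode, whose real interfaces are far larger — showing how callers such as the preemption thread would no longer need null checks.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch (hypothetical names): instead of returning null for a
// node removed between lookups, getNode returns a shared no-op
// instance, so every call site is safe without an explicit null check.
public class NodeTrackerSketch {

    interface SchedulerNode {
        long getAvailableMb();
        void reserve(long mb);
    }

    static class LiveNode implements SchedulerNode {
        private long availableMb;
        LiveNode(long mb) { availableMb = mb; }
        public long getAvailableMb() { return availableMb; }
        public void reserve(long mb) { availableMb -= mb; }
    }

    // Shared stand-in for removed nodes: every method is a no-op.
    static final SchedulerNode NOOP_NODE = new SchedulerNode() {
        public long getAvailableMb() { return 0; }
        public void reserve(long mb) { /* node is gone; ignore */ }
    };

    private final Map<String, SchedulerNode> nodes = new ConcurrentHashMap<>();

    void addNode(String id, long mb) { nodes.put(id, new LiveNode(mb)); }
    void removeNode(String id) { nodes.remove(id); }

    SchedulerNode getNode(String id) {
        SchedulerNode node = nodes.get(id);
        if (node == null) {
            // Was: return null, which is what let the NPE above happen.
            System.err.println("Node " + id + " already removed; using no-op node");
            return NOOP_NODE;
        }
        return node;
    }

    public static void main(String[] args) {
        NodeTrackerSketch tracker = new NodeTrackerSketch();
        tracker.addNode("n1", 8192);
        tracker.removeNode("n1");
        // Safe even though n1 is gone: no NullPointerException.
        SchedulerNode n = tracker.getNode("n1");
        n.reserve(1024);
        System.out.println("available: " + n.getAvailableMb());
    }
}
```

The trade-off, as noted, is that silently ignoring operations on removed nodes is "not pretty"; the warning log is what keeps the race observable.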
[jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node
[ https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216148#comment-16216148 ] Steven Rand commented on YARN-4227: --- Sorry, I was mistaken when I said the patch attached to this JIRA prevents the NPE. Unfortunately the FSPreemptionThread accesses nodes at multiple points, each of which is a new opportunity for the race condition to occur and cause an NPE. It seems impractical to wrap each node access in an {{if (node != null)}} block, though admittedly I don't have any better ideas right now. Are there alternate solutions that I'm failing to consider that would prevent the race condition from happening? Happy to submit a patch if anyone has ideas. > FairScheduler: RM quits processing expired container from a removed node > > > Key: YARN-4227 > URL: https://issues.apache.org/jira/browse/YARN-4227 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.3.0, 2.5.0, 2.7.1 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Critical > Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, > YARN-4227.patch > > > Under some circumstances the node is removed before an expired container > event is processed causing the RM to exit: > {code} > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: > Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_1436927988321_1307950_01_12 Container Transitioned from > ACQUIRED to EXPIRED > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: > Completed container: container_1436927988321_1307950_01_12 in state: > EXPIRED event:EXPIRE > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op >OPERATION=AM 
Released Container TARGET=SchedulerApp RESULT=SUCCESS > APPID=application_1436927988321_1307950 > CONTAINERID=container_1436927988321_1307950_01_12 > 2015-10-04 21:14:01,063 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type CONTAINER_EXPIRED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585) > at java.lang.Thread.run(Thread.java:745) > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {code} > The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 > and 2.6.0 by different customers. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7290) canContainerBePreempted can return true when it shouldn't
[ https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210418#comment-16210418 ] Steven Rand commented on YARN-7290: --- Thanks [~templedf]. For what it's worth, I was able to repro this on a live cluster as well as in the test. I let one spark-shell use the entire cluster, and then started a second spark-shell. The second-spark shell was able to preempt all of the first one's containers, including the Application Master. After I applied the patch, the second spark-shell was only able to preempt half of the cluster's resources away from the first one. > canContainerBePreempted can return true when it shouldn't > - > > Key: YARN-7290 > URL: https://issues.apache.org/jira/browse/YARN-7290 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 3.0.0-beta1 >Reporter: Steven Rand >Assignee: Steven Rand > Attachments: YARN-7290-failing-test.patch, YARN-7290.001.patch, > YARN-7290.002.patch > > > In FSAppAttempt#canContainerBePreempted, we make sure that preempting the > given container would not put the app below its fair share: > {code} > // Check if the app's allocation will be over its fairshare even > // after preempting this container > Resource usageAfterPreemption = Resources.clone(getResourceUsage()); > // Subtract resources of containers already queued for preemption > synchronized (preemptionVariablesLock) { > Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted); > } > // Subtract this container's allocation to compute usage after preemption > Resources.subtractFrom( > usageAfterPreemption, container.getAllocatedResource()); > return !isUsageBelowShare(usageAfterPreemption, getFairShare()); > {code} > However, this only considers one container in isolation, and fails to > consider containers for the same app that we already added to > {{preemptableContainers}} in > FSPreemptionThread#identifyContainersToPreemptOnNode. 
Therefore we can have a > case where we preempt multiple containers from the same app, none of which by > itself puts the app below fair share, but which cumulatively do so. > I've attached a patch with a test to show this behavior. The flow is: > 1. Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated > all the resources (8g and 8vcores) > 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 > containers, each of which is 3g and 3vcores in size. At this point both > greedyApp and starvingApp have a fair share of 4g (with DRF not in use). > 3. For the first container requested by starvedApp, we (correctly) preempt 3 > containers from greedyApp, each of which is 1g and 1vcore. > 4. For the second container requested by starvedApp, we again (this time > incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below > its fair share, but happens anyway because all six times that we call > {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the > value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using > debugger). > So in addition to accounting for {{resourcesToBePreempted}}, we also need to > account for containers that we're already planning on preempting in > FSPreemptionThread#identifyContainersToPreemptOnNode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7290) canContainerBePreempted can return true when it shouldn't
[ https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-7290: -- Attachment: YARN-7290.002.patch Adding a new patch to make checkstyles happy. The tests in TestOpportunisticContainerAllocatorAMService all pass for me locally despite the failure in the last Jenkins run. > canContainerBePreempted can return true when it shouldn't > - > > Key: YARN-7290 > URL: https://issues.apache.org/jira/browse/YARN-7290 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 3.0.0-beta1 >Reporter: Steven Rand >Assignee: Steven Rand > Attachments: YARN-7290.001.patch, YARN-7290.002.patch, > YARN-7290-failing-test.patch > > > In FSAppAttempt#canContainerBePreempted, we make sure that preempting the > given container would not put the app below its fair share: > {code} > // Check if the app's allocation will be over its fairshare even > // after preempting this container > Resource usageAfterPreemption = Resources.clone(getResourceUsage()); > // Subtract resources of containers already queued for preemption > synchronized (preemptionVariablesLock) { > Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted); > } > // Subtract this container's allocation to compute usage after preemption > Resources.subtractFrom( > usageAfterPreemption, container.getAllocatedResource()); > return !isUsageBelowShare(usageAfterPreemption, getFairShare()); > {code} > However, this only considers one container in isolation, and fails to > consider containers for the same app that we already added to > {{preemptableContainers}} in > FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a > case where we preempt multiple containers from the same app, none of which by > itself puts the app below fair share, but which cumulatively do so. > I've attached a patch with a test to show this behavior. The flow is: > 1. 
Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated > all the resources (8g and 8vcores) > 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 > containers, each of which is 3g and 3vcores in size. At this point both > greedyApp and starvingApp have a fair share of 4g (with DRF not in use). > 3. For the first container requested by starvedApp, we (correctly) preempt 3 > containers from greedyApp, each of which is 1g and 1vcore. > 4. For the second container requested by starvedApp, we again (this time > incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below > its fair share, but happens anyway because all six times that we call > {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the > value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using > debugger). > So in addition to accounting for {{resourcesToBePreempted}}, we also need to > account for containers that we're already planning on preempting in > FSPreemptionThread#identifyContainersToPreemptOnNode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7290) canContainerBePreempted can return true when it shouldn't
[ https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-7290: -- Attachment: YARN-7290.001.patch Added a patch which I _think_ fixes both issues. All tests in {{TestFairSchedulerPreemption}} pass for me locally, including the new one, but the details here are tricky. > canContainerBePreempted can return true when it shouldn't > - > > Key: YARN-7290 > URL: https://issues.apache.org/jira/browse/YARN-7290 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 3.0.0-beta1 >Reporter: Steven Rand >Assignee: Steven Rand > Attachments: YARN-7290.001.patch, YARN-7290-failing-test.patch > > > In FSAppAttempt#canContainerBePreempted, we make sure that preempting the > given container would not put the app below its fair share: > {code} > // Check if the app's allocation will be over its fairshare even > // after preempting this container > Resource usageAfterPreemption = Resources.clone(getResourceUsage()); > // Subtract resources of containers already queued for preemption > synchronized (preemptionVariablesLock) { > Resources.subtractFrom(usageAfterPreemption, resourcesToBePreempted); > } > // Subtract this container's allocation to compute usage after preemption > Resources.subtractFrom( > usageAfterPreemption, container.getAllocatedResource()); > return !isUsageBelowShare(usageAfterPreemption, getFairShare()); > {code} > However, this only considers one container in isolation, and fails to > consider containers for the same app that we already added to > {{preemptableContainers}} in > FSPreemptionThread#identifyContainersToPreemptOnNode. Therefore we can have a > case where we preempt multiple containers from the same app, none of which by > itself puts the app below fair share, but which cumulatively do so. > I've attached a patch with a test to show this behavior. The flow is: > 1. 
Initially greedyApp runs in {{root.preemptable.child-1}} and is allocated > all the resources (8g and 8vcores) > 2. Then starvingApp runs in {{root.preemptable.child-2}} and requests 2 > containers, each of which is 3g and 3vcores in size. At this point both > greedyApp and starvingApp have a fair share of 4g (with DRF not in use). > 3. For the first container requested by starvingApp, we (correctly) preempt 3 > containers from greedyApp, each of which is 1g and 1vcore. > 4. For the second container requested by starvingApp, we again (this time > incorrectly) preempt 3 containers from greedyApp. This puts greedyApp below > its fair share, but happens anyway because all six times that we call > {{return !isUsageBelowShare(usageAfterPreemption, getFairShare());}}, the > value of {{usageAfterPreemption}} is 7g and 7vcores (confirmed using > debugger). > So in addition to accounting for {{resourcesToBePreempted}}, we also need to > account for containers that we're already planning on preempting in > FSPreemptionThread#identifyContainersToPreemptOnNode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
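To make the six-checks arithmetic concrete, here is a minimal sketch in plain Java (class and method names such as `canPreemptIsolated` are illustrative stand-ins, not the actual FSAppAttempt/FSPreemptionThread API): the per-container check subtracts only the single candidate container, so it sees 8g - 1g = 7g, above the 4g fair share, on every one of the six calls, while cumulative accounting stops preempting once usage would drop below fair share.

```java
// Hedged sketch: the 8g/4g/1g numbers come from the JIRA description;
// the method names are illustrative, not Hadoop's real API.
public class PreemptionCheckDemo {
    static final int FAIR_SHARE_GB = 4;
    static final int USAGE_GB = 8;
    static final int CONTAINER_GB = 1;

    /** Buggy check: considers each candidate container in isolation. */
    static boolean canPreemptIsolated(int alreadyChosenGb) {
        int usageAfter = USAGE_GB - CONTAINER_GB; // alreadyChosenGb ignored -> the bug
        return usageAfter >= FAIR_SHARE_GB;
    }

    /** Fixed check: also subtracts containers already chosen in this pass. */
    static boolean canPreemptCumulative(int alreadyChosenGb) {
        int usageAfter = USAGE_GB - alreadyChosenGb - CONTAINER_GB;
        return usageAfter >= FAIR_SHARE_GB;
    }

    /** Runs the six candidate checks and returns how many GB get preempted. */
    static int countPreempted(boolean cumulative) {
        int chosenGb = 0;
        for (int i = 0; i < 6; i++) { // six 1g candidate containers, as in the report
            boolean ok = cumulative ? canPreemptCumulative(chosenGb)
                                    : canPreemptIsolated(chosenGb);
            if (ok) chosenGb += CONTAINER_GB;
        }
        return chosenGb;
    }

    public static void main(String[] args) {
        // Isolated check preempts all 6g, pushing the app below its 4g share;
        // cumulative accounting stops at 4g preempted.
        System.out.println("isolated: " + countPreempted(false) + "g");
        System.out.println("cumulative: " + countPreempted(true) + "g");
    }
}
```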
[jira] [Commented] (YARN-7290) canContainerBePreempted can return true when it shouldn't
[ https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192324#comment-16192324 ] Steven Rand commented on YARN-7290: --- An additional problem is that we call {{app.trackContainerForPreemption}} in {{preemptContainers}}, so after {{identifyContainersToPreempt}} has returned. Therefore after we've finished iterating through one container in the value of {{rr.getNumContainers()}}, we will have added some containers to {{containersToPreempt}}, but {{resourcesToBePreempted}} will not have been updated for any app. This allows subsequent calls to {{canContainerBePreempted}} in the same for loop to return {{true}} incorrectly, since we've already decided to preempt some containers, but the apps aren't aware of it yet.
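One way to address the late bookkeeping described in the comment above is to update the per-app "marked for preemption" total inside the identification loop itself, so later checks in the same pass see it. The sketch below shows only that idea; `App`, `markedForPreemptionGb`, and `identifyContainersToPreempt` are hypothetical names, not the actual patch or Hadoop API.

```java
// Hedged sketch of moving preemption bookkeeping into the identification
// loop, rather than after identifyContainersToPreempt has returned.
public class IdentifyPreemptionSketch {
    static class App {
        final String name;
        int usageGb;
        int markedForPreemptionGb; // updated as soon as a container is chosen
        App(String name, int usageGb) { this.name = name; this.usageGb = usageGb; }

        boolean canContainerBePreempted(int containerGb, int fairShareGb) {
            // Also subtract resources already marked in THIS identification pass.
            return usageGb - markedForPreemptionGb - containerGb >= fairShareGb;
        }
    }

    /** Greedily chooses 1g containers until the app's fair share blocks further picks. */
    static int identifyContainersToPreempt(App app, int neededGb, int fairShareGb) {
        int chosen = 0;
        while (chosen < neededGb && app.canContainerBePreempted(1, fairShareGb)) {
            app.markedForPreemptionGb += 1; // bookkeeping inside the loop, not after it
            chosen++;
        }
        return chosen;
    }

    public static void main(String[] args) {
        App greedy = new App("greedyApp", 8); // 8g usage, 4g fair share
        System.out.println(identifyContainersToPreempt(greedy, 3, 4)); // first request: 3
        System.out.println(identifyContainersToPreempt(greedy, 3, 4)); // second: only 1 fits
    }
}
```

With the bookkeeping inside the loop, the second starved request can no longer take three more containers from the same app.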
[jira] [Assigned] (YARN-7290) canContainerBePreempted can return true when it shouldn't
[ https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand reassigned YARN-7290: - Assignee: Steven Rand
[jira] [Updated] (YARN-7290) canContainerBePreempted can return true when it shouldn't
[ https://issues.apache.org/jira/browse/YARN-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-7290: -- Attachment: YARN-7290-failing-test.patch
[jira] [Created] (YARN-7290) canContainerBePreempted can return true when it shouldn't
Steven Rand created YARN-7290: - Summary: canContainerBePreempted can return true when it shouldn't Key: YARN-7290 URL: https://issues.apache.org/jira/browse/YARN-7290 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 3.0.0-beta1 Reporter: Steven Rand
[jira] [Commented] (YARN-5742) Serve aggregated logs of historical apps from timeline service
[ https://issues.apache.org/jira/browse/YARN-5742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162609#comment-16162609 ] Steven Rand commented on YARN-5742: --- Would it also be reasonable for the Timeline Service to enforce retention on aggregated logs? As YARN-2985 points out, there's currently no retention unless the MR JHS is deployed. I was going to try to write a patch that moves retention into the Application History Server, but wasn't sure whether it belongs there or in the Timeline Service. > Serve aggregated logs of historical apps from timeline service > -- > > Key: YARN-5742 > URL: https://issues.apache.org/jira/browse/YARN-5742 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Varun Saxena >Assignee: Rohith Sharma K S > Attachments: YARN-5742-POC-v0.patch > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6956) preemption may only consider resource requests for one node
[ https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154671#comment-16154671 ] Steven Rand commented on YARN-6956: --- Friendly ping [~kasha] and/or [~templedf]. I'll fix the checkstyle issues in the next patch, but wanted to gather other feedback as well. > preemption may only consider resource requests for one node > --- > > Key: YARN-6956 > URL: https://issues.apache.org/jira/browse/YARN-6956 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.9.0, 3.0.0-beta1 > Environment: CDH 5.11.0 >Reporter: Steven Rand >Assignee: Steven Rand > Attachments: YARN-6956.001.patch > > > I'm observing the following series of events on a CDH 5.11.0 cluster, which > seem to be possible after YARN-6163: > 1. An application is considered to be starved, so {{FSPreemptionThread}} > calls {{identifyContainersToPreempt}}, and that calls > {{FSAppAttempt#getStarvedResourceRequests}} to get a list of > {{ResourceRequest}} instances that are enough to address the app's starvation. > 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is > enough to address the app's starvation, so we break out of the loop over > {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180. > We return only this one {{ResourceRequest}} back to the > {{identifyContainersToPreempt}} method. > 3. It turns out that this particular {{ResourceRequest}} happens to have a > value for {{getResourceName}} that identifies a specific node in the cluster. > This causes preemption to only consider containers on that node, and not the > rest of the cluster. > [~kasha], does that make sense? 
I'm happy to submit a patch if I'm understanding the problem correctly.
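The selection problem described above (breaking out of the loop on the first sufficient {{ResourceRequest}}, which may be pinned to one node) suggests preferring a request whose resource name is ANY when one exists. The sketch below shows only that selection idea; this `ResourceRequest` is a toy stand-in, not `org.apache.hadoop.yarn.api.records.ResourceRequest`, and `pickRequest` is a hypothetical helper, not the actual fix.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: prefer a cluster-wide (ANY) request over a node-specific one
// when deciding which starved request preemption should try to satisfy.
public class StarvedRequestSketch {
    static final String ANY = "*"; // mirrors ResourceRequest.ANY semantics

    static class ResourceRequest {
        final String resourceName;
        final int memoryGb;
        ResourceRequest(String resourceName, int memoryGb) {
            this.resourceName = resourceName;
            this.memoryGb = memoryGb;
        }
    }

    /**
     * Instead of returning the first request large enough to cover the
     * starvation, keep scanning and prefer an ANY request, so preemption
     * can consider containers anywhere in the cluster.
     */
    static ResourceRequest pickRequest(List<ResourceRequest> requests, int starvedGb) {
        ResourceRequest nodeLocalFallback = null;
        for (ResourceRequest rr : requests) {
            if (rr.memoryGb < starvedGb) continue;      // not enough to help
            if (ANY.equals(rr.resourceName)) return rr; // cluster-wide: best choice
            if (nodeLocalFallback == null) nodeLocalFallback = rr;
        }
        return nodeLocalFallback; // only a node-specific request was available
    }

    public static void main(String[] args) {
        List<ResourceRequest> rrs = new ArrayList<>();
        rrs.add(new ResourceRequest("node-17.example.com", 4)); // node-specific, seen first
        rrs.add(new ResourceRequest(ANY, 4));
        System.out.println(pickRequest(rrs, 4).resourceName); // picks "*"
    }
}
```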
[jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node
[ https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153423#comment-16153423 ] Steven Rand commented on YARN-4227: --- [~wilfreds], I can rebase the patch if you like. It seems to be working quite nicely by the way -- we applied it to a cluster which was periodically exhibiting this problem and haven't seen it since. > FairScheduler: RM quits processing expired container from a removed node > > > Key: YARN-4227 > URL: https://issues.apache.org/jira/browse/YARN-4227 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.3.0, 2.5.0, 2.7.1 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Critical > Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, > YARN-4227.patch > > > Under some circumstances the node is removed before an expired container > event is processed causing the RM to exit: > {code} > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: > Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_1436927988321_1307950_01_12 Container Transitioned from > ACQUIRED to EXPIRED > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: > Completed container: container_1436927988321_1307950_01_12 in state: > EXPIRED event:EXPIRE > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op >OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS > APPID=application_1436927988321_1307950 > CONTAINERID=container_1436927988321_1307950_01_12 > 2015-10-04 21:14:01,063 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type CONTAINER_EXPIRED to the scheduler > 
java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585)
>         at java.lang.Thread.run(Thread.java:745)
> 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 and 2.6.0 by different customers.
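The crash pattern above comes down to dereferencing a node that was removed between the container's expiry and the handling of the expiry event. A minimal sketch of the defensive shape, assuming a node lookup that can return null (the `Node` class, `completedContainer` signature, and map-based registry here are illustrative, not FairScheduler's internals):

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the race: a container-expired event arrives after its
// node was removed. Names are illustrative, not Hadoop's real API.
public class CompletedContainerSketch {
    static class Node {
        final String id;
        Node(String id) { this.id = id; }
    }

    private final Map<String, Node> nodes = new HashMap<>();

    void addNode(String id) { nodes.put(id, new Node(id)); }
    void removeNode(String id) { nodes.remove(id); }

    /** Returns false (instead of throwing an NPE) when the node is already gone. */
    boolean completedContainer(String nodeId) {
        Node node = nodes.get(nodeId);
        if (node == null) {
            // The node was removed while the expiry event was in flight; skip
            // the container rather than letting an NPE kill the event thread.
            return false;
        }
        // ... release the container on 'node' here ...
        return true;
    }

    public static void main(String[] args) {
        CompletedContainerSketch s = new CompletedContainerSketch();
        s.addNode("n1");
        s.removeNode("n1");                             // node lost before the event
        System.out.println(s.completedContainer("n1")); // false, no crash
    }
}
```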
[jira] [Commented] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares
[ https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137446#comment-16137446 ] Steven Rand commented on YARN-6960: --- Thanks, Daniel. Having thought about this some more, I don't think that either of the two patches I've posted is a good solution. In the first patch, inactive queues have fair shares of zero, and AM containers are subject to preemption even when running in high-priority queues. And in the second patch, applications running in idle queues define what their fair shares are irrespective of cluster-side settings, which doesn't make sense. I'll think about this some more and try to come up with a better idea, but I'd also be quite interested in hearing your opinion and those of others. > definition of active queue allows idle long-running apps to distort fair > shares > --- > > Key: YARN-6960 > URL: https://issues.apache.org/jira/browse/YARN-6960 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.8.1, 3.0.0-alpha4 >Reporter: Steven Rand >Assignee: Steven Rand > Attachments: YARN-6960.001.patch, YARN-6960.002.patch > > > YARN-2026 introduced the notion of only considering active queues when > computing the fair share of each queue. The definition of an active queue is > a queue with at least one runnable app: > {code} > public boolean isActive() { > return getNumRunnableApps() > 0; > } > {code} > One case that this definition of activity doesn't account for is that of > long-running applications that scale dynamically. Such an application might > request many containers when jobs are running, but scale down to very few > containers, or only the AM container, when no jobs are running. > Even when such an application has scaled down to a negligible amount of > demand and utilization, the queue that it's in is still considered to be > active, which defeats the purpose of YARN-2026. For example, consider this > scenario: > 1. 
We have queues {{root.a}}, {{root.b}}, {{root.c}}, and {{root.d}}, all of > which have the same weight. > 2. Queues {{root.a}} and {{root.b}} contain long-running applications that > currently have only one container each (the AM). > 3. An application in queue {{root.c}} starts, and uses the whole cluster > except for the small amount in use by {{root.a}} and {{root.b}}. An > application in {{root.d}} starts, and has a high enough demand to be able to > use half of the cluster. Because all four queues are active, the app in > {{root.d}} can only preempt the app in {{root.c}} up to roughly 25% of the > cluster's resources, while the app in {{root.c}} keeps about 75%. > Ideally in this example, the app in {{root.d}} would be able to preempt the > app in {{root.c}} up to 50% of the cluster, which would be possible if the > idle apps in {{root.a}} and {{root.b}} didn't cause those queues to be > considered active. > One way to address this is to update the definition of an active queue to be > a queue containing 1 or more non-AM containers. This way if all apps in a > queue scale down to only the AM, other queues' fair shares aren't affected. > The benefit of this approach is that it's quite simple. The downside is that > it doesn't account for apps that are idle and using almost no resources, but > still have at least one non-AM container. > There are a couple of other options that seem plausible to me, but they're > much more complicated, and it seems to me that this proposal makes good > progress while adding minimal extra complexity. > Does this seem like a reasonable change? I'm certainly open to better ideas > as well. > Thanks, > Steve -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
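The two candidate activity definitions discussed above (any runnable app vs. at least one non-AM container) can be contrasted in a small sketch; this `Queue` and its fields are hypothetical stand-ins for FSLeafQueue state, not the proposed patch itself.

```java
// Hedged sketch contrasting the current and proposed definitions of an
// "active" queue from the YARN-6960 discussion.
public class QueueActivitySketch {
    static class Queue {
        int runnableApps;
        int nonAmContainers; // containers other than the application masters

        /** Current definition: any runnable app keeps the queue active. */
        boolean isActiveByApps() { return runnableApps > 0; }

        /** Proposed definition: only non-AM containers count as activity. */
        boolean isActiveByContainers() { return nonAmContainers > 0; }
    }

    public static void main(String[] args) {
        Queue idleLongRunning = new Queue();
        idleLongRunning.runnableApps = 1;    // long-running app scaled down...
        idleLongRunning.nonAmContainers = 0; // ...to just its AM container
        // Under the current rule the queue still distorts fair shares;
        // under the proposed rule it is ignored when computing them.
        System.out.println(idleLongRunning.isActiveByApps());
        System.out.println(idleLongRunning.isActiveByContainers());
    }
}
```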
[jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node
[ https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135938#comment-16135938 ] Steven Rand commented on YARN-4227: --- I'm seeing a similar issue on what's roughly branch-2 (CDH 5.11.0), with the error being:
{code}
2017-06-27 16:32:39,381 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Preemption Timer,5,main] threw an Exception.
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:687)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread$PreemptContainersTask.run(FSPreemptionThread.java:230)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
{code}
This error kills the FSPreemptionThread, and thereby crashes the RM. It seems to be correlated with NodeManagers being marked unhealthy due to lack of local disk space during large shuffles. I haven't confirmed, but presumably the unhealthy nodes are removed while we're waiting for the lock, and no longer exist when we call {{releaseContainer}}. I'm curious as to whether others are seeing this as well on recent versions, in which case maybe this is worth reopening?
[jira] [Updated] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares
[ https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-6960: -- Attachment: YARN-6960.002.patch Attaching a slightly modified patch that sets the fair share of an inactive queue equal to its current utilization. This doesn't change the behavior for queues with no running applications, since the fair share before the patch and with the patch are both equal to zero. It does protect AM containers in queues that are inactive by the new definition from being preempted though, since queues containing those AMs are no longer over their fair shares.
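The 002-patch idea described above can be sketched roughly as follows (a simplified model with equal queue weights; the function name and numbers are illustrative, not the actual patch): an inactive queue's fair share is pinned to its current usage, so its AM containers are no longer "over fair share" and thus not preemptable, while the remaining capacity is divided among the active queues.

```java
// Hedged sketch: fair share of an inactive queue = its current usage;
// active queues split what remains evenly (equal weights assumed).
public class InactiveFairShareSketch {
    static int[] fairShares(int clusterGb, int[] usageGb, boolean[] active) {
        int[] shares = new int[usageGb.length];
        int remaining = clusterGb;
        int activeCount = 0;
        for (int i = 0; i < usageGb.length; i++) {
            if (active[i]) {
                activeCount++;
            } else {
                shares[i] = usageGb[i];   // inactive queue keeps exactly its usage
                remaining -= usageGb[i];
            }
        }
        if (activeCount == 0) return shares;
        for (int i = 0; i < usageGb.length; i++) {
            if (active[i]) shares[i] = remaining / activeCount;
        }
        return shares;
    }

    public static void main(String[] args) {
        // root.a / root.b idle with 1g AMs; root.c / root.d active on a 16g cluster.
        int[] shares = fairShares(16, new int[]{1, 1, 13, 1},
                                  new boolean[]{false, false, true, true});
        // Idle queues keep their 1g; root.c and root.d split the other 14g.
        System.out.println(shares[0] + " " + shares[2] + " " + shares[3]); // 1 7 7
    }
}
```

This matches the example from the description: {{root.d}} can now preempt {{root.c}} toward an even split instead of being capped near a quarter of the cluster.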
[jira] [Commented] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares
[ https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16134416#comment-16134416 ] Steven Rand commented on YARN-6960: --- [~dan...@cloudera.com], I've uploaded a patch proposing a new definition of queue activity. It still needs tests, but I wanted to first see how the community feels about this change, and revise it as necessary based on feedback before writing tests for it. My understanding of a queue's demand is that it's the cumulative current usage of all apps in the queue plus the cumulative additional resources requested by all apps in the queue. Therefore, if no apps are requesting additional resources, the demand will be equal to the usage of the AMs. Then, as soon as any app attempts to do anything, its demand will be greater than the AM usage, and the queue will become active. I've tested this patch and it seems to have the desired effect. Going back to the example in the description, {{root.c}} and {{root.d}} have equal fair shares despite the idle applications in {{root.a}} and {{root.b}}. > definition of active queue allows idle long-running apps to distort fair > shares > --- > > Key: YARN-6960 > URL: https://issues.apache.org/jira/browse/YARN-6960 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.8.1, 3.0.0-alpha4 >Reporter: Steven Rand >Assignee: Steven Rand > Attachments: YARN-6960.001.patch
[jira] [Updated] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares
[ https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-6960: -- Attachment: YARN-6960.001.patch > definition of active queue allows idle long-running apps to distort fair > shares > --- > > Key: YARN-6960 > URL: https://issues.apache.org/jira/browse/YARN-6960 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.8.1, 3.0.0-alpha4 >Reporter: Steven Rand >Assignee: Steven Rand > Attachments: YARN-6960.001.patch
[jira] [Comment Edited] (YARN-6956) preemption may only consider resource requests for one node
[ https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125041#comment-16125041 ] Steven Rand edited comment on YARN-6956 at 8/13/17 8:30 PM: Thanks for the clarifications. All three of those suggestions make sense to me. I've attached a patch for considering a configurable number of RRs. It seems simplest to me to create separate JIRAs for prioritizing the RR(s) to check and honoring delay scheduling in preemption -- does that seem reasonable? EDIT: A couple of questions I had about the patch: * I don't have a good sense for how to pick the default number of RRs to look at, and the choice of 10 for {{MIN_RESOURCE_REQUESTS_FOR_PREEMPTION_DEFAULT}} was fairly arbitrary. Happy to change that to something more reasonable if someone else has better intuition there. * If adding a new configuration point as in the patch makes sense, where should I add docs for it? My guess is {{yarn-default.xml}}, but I wasn't completely sure. was (Author: steven rand): Thanks for the clarifications. All three of those suggestions make sense to me. I've attached a patch for considering a configurable number of RRs. It seems simplest to me to create separate JIRAs for prioritizing the RR(s) to check and honoring delay scheduling in preemption -- does that seem reasonable? > preemption may only consider resource requests for one node > --- > > Key: YARN-6956 > URL: https://issues.apache.org/jira/browse/YARN-6956 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.9.0, 3.0.0-beta1 > Environment: CDH 5.11.0 >Reporter: Steven Rand >Assignee: Steven Rand > Attachments: YARN-6956.001.patch > > > I'm observing the following series of events on a CDH 5.11.0 cluster, which > seem to be possible after YARN-6163: > 1. 
An application is considered to be starved, so {{FSPreemptionThread}} > calls {{identifyContainersToPreempt}}, and that calls > {{FSAppAttempt#getStarvedResourceRequests}} to get a list of > {{ResourceRequest}} instances that are enough to address the app's starvation. > 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is > enough to address the app's starvation, so we break out of the loop over > {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180. > We return only this one {{ResourceRequest}} back to the > {{identifyContainersToPreempt}} method. > 3. It turns out that this particular {{ResourceRequest}} happens to have a > value for {{getResourceName}} that identifies a specific node in the cluster. > This causes preemption to only consider containers on that node, and not the > rest of the cluster. > [~kasha], does that make sense? I'm happy to submit a patch if I'm > understanding the problem correctly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
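The attached patch makes the number of ResourceRequests considered configurable rather than stopping at the first RR that covers the starvation. A hedged sketch of that idea follows; the names and the default of 10 echo the discussion, but the code is a simplification, not the actual `FSAppAttempt#getStarvedResourceRequests` implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the fix discussed above: keep collecting candidate
// ResourceRequests until the starvation is covered AND at least maxRequests
// entries have been examined, so preemption isn't pinned to a single node.
// All names here are illustrative, not taken from the Hadoop source.
public class StarvedRequestsSketch {
    public static final int MIN_RESOURCE_REQUESTS_FOR_PREEMPTION_DEFAULT = 10;

    public static class ResourceRequest {
        public final String resourceName; // a node, a rack, or "*" (any host)
        public final int memoryMb;
        public ResourceRequest(String resourceName, int memoryMb) {
            this.resourceName = resourceName;
            this.memoryMb = memoryMb;
        }
    }

    public static List<ResourceRequest> getStarvedRequests(
            List<ResourceRequest> all, int starvedMb, int maxRequests) {
        List<ResourceRequest> picked = new ArrayList<>();
        int coveredMb = 0;
        for (ResourceRequest rr : all) {
            // Old behavior is equivalent to maxRequests = 1: break as soon as
            // one RR covers the starvation, even if it names a single node.
            if (picked.size() >= maxRequests && coveredMb >= starvedMb) {
                break;
            }
            picked.add(rr);
            coveredMb += rr.memoryMb;
        }
        return picked;
    }

    public static void main(String[] args) {
        List<ResourceRequest> all = new ArrayList<>();
        all.add(new ResourceRequest("node17", 4096)); // node-local: covers starvation alone
        all.add(new ResourceRequest("/rack1", 4096));
        all.add(new ResourceRequest("*", 4096));

        System.out.println(getStarvedRequests(all, 4096, 1).size()); // prints "1"
        System.out.println(getStarvedRequests(all, 4096, 3).size()); // prints "3"
    }
}
```

With only one RR returned (the old behavior), preemption may be restricted to `node17`; considering several RRs lets the rack-level and `*` requests serve as fallbacks.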
[jira] [Commented] (YARN-6956) preemption may only consider resource requests for one node
[ https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125041#comment-16125041 ] Steven Rand commented on YARN-6956: --- Thanks for the clarifications. All three of those suggestions make sense to me. I've attached a patch for considering a configurable number of RRs. It seems simplest to me to create separate JIRAs for prioritizing the RR(s) to check and honoring delay scheduling in preemption -- does that seem reasonable? > preemption may only consider resource requests for one node > --- > > Key: YARN-6956 > URL: https://issues.apache.org/jira/browse/YARN-6956 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.9.0, 3.0.0-beta1 > Environment: CDH 5.11.0 >Reporter: Steven Rand > Attachments: YARN-6956.001.patch
[jira] [Assigned] (YARN-6956) preemption may only consider resource requests for one node
[ https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand reassigned YARN-6956: - Assignee: Steven Rand > preemption may only consider resource requests for one node > --- > > Key: YARN-6956 > URL: https://issues.apache.org/jira/browse/YARN-6956 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.9.0, 3.0.0-beta1 > Environment: CDH 5.11.0 >Reporter: Steven Rand >Assignee: Steven Rand > Attachments: YARN-6956.001.patch
[jira] [Updated] (YARN-6956) preemption may only consider resource requests for one node
[ https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-6956: -- Attachment: YARN-6956.001.patch > preemption may only consider resource requests for one node > --- > > Key: YARN-6956 > URL: https://issues.apache.org/jira/browse/YARN-6956 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.9.0, 3.0.0-beta1 > Environment: CDH 5.11.0 >Reporter: Steven Rand > Attachments: YARN-6956.001.patch
[jira] [Commented] (YARN-6956) preemption may only consider resource requests for one node
[ https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119135#comment-16119135 ] Steven Rand commented on YARN-6956: --- Hi [~kasha], thanks for the suggestions. I would definitely like to contribute. A couple of questions to make sure I understand: * For prioritizing the RR to check, does that mean sorting the RRs for an app by the value of {{getPriority()}}, and checking the highest-priority one first? And if there are multiple RRs with the same priority, is the suggestion to choose the one that's requesting the smallest amount of resources? If so, how do we avoid preempting small amounts at a time and taking a long time to satisfy starvation? Or is it the responsibility of the app not to prioritize many small RRs? * Considering more than one RR definitely seems like a good idea. Is it reasonable to make sure to include at least one RR for which locality is relaxed, and/or the RR is for a rack or {{*}}, in the list of RRs that we check, even if that means checking a lower-priority RR? (Assuming, of course, that there is at least one such RR.) * Honoring delay scheduling for preemption makes sense -- I don't have any questions about that one. > preemption may only consider resource requests for one node > --- > > Key: YARN-6956 > URL: https://issues.apache.org/jira/browse/YARN-6956 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.9.0, 3.0.0-beta1 > Environment: CDH 5.11.0 >Reporter: Steven Rand
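The ordering floated in the questions above (highest priority first, smaller ask among equal priorities) could look roughly like this. It is a sketch under stated assumptions: in YARN a lower priority integer is generally more urgent, and every name below is invented for illustration, not taken from the scheduler code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the RR ordering discussed above: examine requests
// highest-priority first (lower integer = higher priority), breaking ties
// by the smaller ask. Not the actual FairScheduler implementation.
public class RequestOrderSketch {
    public static class RR {
        public final int priority;        // lower value = higher priority
        public final int memoryMb;
        public final String resourceName; // node, rack, or "*"
        public RR(int priority, int memoryMb, String resourceName) {
            this.priority = priority;
            this.memoryMb = memoryMb;
            this.resourceName = resourceName;
        }
    }

    public static List<RR> ordered(List<RR> rrs) {
        List<RR> copy = new ArrayList<>(rrs);
        copy.sort(Comparator.<RR>comparingInt(r -> r.priority)
                            .thenComparingInt(r -> r.memoryMb));
        return copy;
    }

    public static void main(String[] args) {
        List<RR> rrs = new ArrayList<>();
        rrs.add(new RR(2, 1024, "node42"));
        rrs.add(new RR(1, 8192, "*"));
        rrs.add(new RR(1, 2048, "/rack1"));

        // Highest priority (1) first; among those, the smaller 2048 MB ask.
        System.out.println(ordered(rrs).get(0).resourceName); // prints "/rack1"
    }
}
```

The question of always including at least one rack-level or `*` request in the checked set would be an additional filter on top of this ordering.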
[jira] [Commented] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares
[ https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16118146#comment-16118146 ] Steven Rand commented on YARN-6960: --- Yep, that concern is definitely valid. I wrote a patch that implements this definition of activity, and ran into exactly the problem you're describing while testing it. A new proposal then would be that a leaf queue is active if either of these conditions is met: * There is at least one non-AM container running in the queue * The cumulative demand of applications in the queue is greater than zero That way, in the example you give above, the fair share of {{root.a}} becomes 1/3 as soon as it attempts to run another job. Backing up a step to the use case, we have interactive Spark applications the expectation for which is that results are returned to the user on the order of seconds, or at worst a few minutes (assuming that the query is reasonable). We don't want to have to create a new {{SparkContext}} and upload + localize JARs for each query, since that would inflate query execution time, so one of these applications will keep the same {{SparkContext}} around indefinitely, and will thus be a long-running YARN application. When one of these apps isn't running any queries/jobs, it'll scale down its executor count to make room for other YARN applications. So sometimes we wind up with multiple YARN applications with minimal resource usage and no demand, and we've observed that this causes unequal distribution of resources between other running applications, even though they're in equally weighted queues. The example in the description is kind of silly/simplistic, but it's essentially what we see happen. 
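The two-condition proposal above can be sketched as follows. Note the assumption: "demand" is modeled here as resources requested beyond what an idle AM already uses, and the class and field names are invented rather than taken from the patch or `FSLeafQueue`.

```java
// Minimal model of the revised proposal: a leaf queue is active if it runs
// at least one non-AM container OR has nonzero outstanding demand.
// "pendingDemandMb" approximates demand beyond idle AM usage; names and
// semantics are assumptions for this sketch, not the actual patch.
public class RevisedActivitySketch {
    public static class LeafQueue {
        public int nonAmContainers;
        public long pendingDemandMb;

        public boolean isActive() {
            return nonAmContainers > 0 || pendingDemandMb > 0;
        }
    }

    public static void main(String[] args) {
        LeafQueue a = new LeafQueue(); // root.a: long-running app, AM only, idle
        System.out.println(a.isActive()); // prints "false": fair shares unaffected

        a.pendingDemandMb = 4096; // the app runs a query and asks for executors
        System.out.println(a.isActive()); // prints "true": root.a regains its share
    }
}
```

This captures the interactive-Spark use case described above: an idle application holding only its `SparkContext` AM leaves the queue inactive, but the queue becomes active again the moment a query creates demand.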
> definition of active queue allows idle long-running apps to distort fair > shares > --- > > Key: YARN-6960 > URL: https://issues.apache.org/jira/browse/YARN-6960 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.8.1, 3.0.0-alpha4 >Reporter: Steven Rand >Assignee: Steven Rand
[jira] [Created] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares
Steven Rand created YARN-6960: - Summary: definition of active queue allows idle long-running apps to distort fair shares Key: YARN-6960 URL: https://issues.apache.org/jira/browse/YARN-6960 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 3.0.0-alpha4, 2.8.1 Reporter: Steven Rand Assignee: Steven Rand YARN-2026 introduced the notion of only considering active queues when computing the fair share of each queue. The definition of an active queue is a queue with at least one runnable app: {code} public boolean isActive() { return getNumRunnableApps() > 0; } {code} One case that this definition of activity doesn't account for is that of long-running applications that scale dynamically. Such an application might request many containers when jobs are running, but scale down to very few containers, or only the AM container, when no jobs are running. Even when such an application has scaled down to a negligible amount of demand and utilization, the queue that it's in is still considered to be active, which defeats the purpose of YARN-2026. For example, consider this scenario: 1. We have queues {{root.a}}, {{root.b}}, {{root.c}}, and {{root.d}}, all of which have the same weight. 2. Queues {{root.a}} and {{root.b}} contain long-running applications that currently have only one container each (the AM). 3. An application in queue {{root.c}} starts, and uses the whole cluster except for the small amount in use by {{root.a}} and {{root.b}}. An application in {{root.d}} starts, and has a high enough demand to be able to use half of the cluster. Because all four queues are active, the app in {{root.d}} can only preempt the app in {{root.c}} up to roughly 25% of the cluster's resources, while the app in {{root.c}} keeps about 75%. 
Ideally in this example, the app in {{root.d}} would be able to preempt the app in {{root.c}} up to 50% of the cluster, which would be possible if the idle apps in {{root.a}} and {{root.b}} didn't cause those queues to be considered active. One way to address this is to update the definition of an active queue to be a queue containing 1 or more non-AM containers. This way if all apps in a queue scale down to only the AM, other queues' fair shares aren't affected. The benefit of this approach is that it's quite simple. The downside is that it doesn't account for apps that are idle and using almost no resources, but still have at least one non-AM container. There are a couple of other options that seem plausible to me, but they're much more complicated, and it seems to me that this proposal makes good progress while adding minimal extra complexity. Does this seem like a reasonable change? I'm certainly open to better ideas as well. Thanks, Steve
[jira] [Comment Edited] (YARN-6956) preemption may only consider resource requests for one node
[ https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16115847#comment-16115847 ] Steven Rand edited comment on YARN-6956 at 8/6/17 4:35 PM: --- Hi [~dan...@cloudera.com], thanks for the quick reply and explanation. That concern definitely makes sense, and in general YARN-6163 seems like a good change. However, what I'm seeing is that only considering RRs for one node actually causes some of my apps to remain starved for quite a long time. The series of events that happens in a loop is: 1. The app is correctly considered to be starved 2. The app has many RRs, several of which can be satisfied, but only one RR is actually considered for preemption as per this JIRA's description 3. That particular RR happens to be for a node on which no containers can be preempted for the app, so the app remains starved Since the order of the list of RRs is the same each time through the loop, the same RR is always considered, no containers are preempted, and the app remains starved, even though it has other RRs that could be satisfied. I haven't thought enough yet about what a solution would look like, but it seems like we should be able to keep the benefits of YARN-6163 while also avoiding this issue. I'll try to have a patch within the next few days if people agree that we should change the behavior. was (Author: steven rand): Hi [~dan...@cloudera.com], thanks for the quick reply and explanation. That concern definitely makes sense, and in general YARN-6163 seems like a good change. However, what I'm seeing is that only considering RRs for one node actually causes some of my apps to remain starved for quite a long time. The series of events that happens in a loop is: 1. The app is correctly considered to be starved 2. The app has many RRs, several of which can be satisfied, but only one RR is actually considered for preemption as per this JIRA's description 3. 
That particular RR happens to be for a node on which the no containers can be preempted for the app, so the app remains starved Since the order of the list of RRs is the same each time through the loop, the same RR is always considered, no containers are preempted, and the app remains starved, even though it has other RRs that could be satisfied. I haven't thought enough yet about what a solution would look like, but it seems like we should be able to keep the benefits of YARN-6163 while also avoiding this issue. I'll try to have a patch within the next few days if people agree that we should change the behavior. > preemption may only consider resource requests for one node > --- > > Key: YARN-6956 > URL: https://issues.apache.org/jira/browse/YARN-6956 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.9.0, 3.0.0-beta1 > Environment: CDH 5.11.0 >Reporter: Steven Rand > > I'm observing the following series of events on a CDH 5.11.0 cluster, which > seem to be possible after YARN-6163: > 1. An application is considered to be starved, so {{FSPreemptionThread}} > calls {{identifyContainersToPreempt}}, and that calls > {{FSAppAttempt#getStarvedResourceRequests}} to get a list of > {{ResourceRequest}} instances that are enough to address the app's starvation. > 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is > enough to address the app's starvation, so we break out of the loop over > {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180. > We return only this one {{ResourceRequest}} back to the > {{identifyContainersToPreempt}} method. > 3. 
It turns out that this particular {{ResourceRequest}} happens to have a > value for {{getResourceName}} that identifies a specific node in the cluster. > This causes preemption to only consider containers on that node, and not the > rest of the cluster. > [~kasha], does that make sense? I'm happy to submit a patch if I'm > understanding the problem correctly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
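The loop-break behavior described in this report can be sketched in a few lines. This is not the actual FSAppAttempt code: the {{ResourceRequest}} class here is a simplified stand-in, and the node name is invented. It only illustrates how returning as soon as one request covers the starvation can pin preemption to a single node:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the logic in FSAppAttempt#getStarvedResourceRequests;
// the class and field names are illustrative, not the real Hadoop types.
public class PreemptionSketch {
  static class ResourceRequest {
    final String resourceName; // "*" means any node; otherwise a specific host
    final int memoryMb;
    ResourceRequest(String resourceName, int memoryMb) {
      this.resourceName = resourceName;
      this.memoryMb = memoryMb;
    }
  }

  // Collect requests until the app's starvation is covered, breaking out as
  // soon as it is -- so a single sufficiently large request ends the loop
  // after one iteration.
  static List<ResourceRequest> getStarvedResourceRequests(
      List<ResourceRequest> all, int starvationMb) {
    List<ResourceRequest> result = new ArrayList<>();
    for (ResourceRequest rr : all) {
      result.add(rr);
      starvationMb -= rr.memoryMb;
      if (starvationMb <= 0) {
        break;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // A node-local request happens to come first and is big enough by itself,
    // so the "*" (any-node) request that could be satisfied anywhere is never
    // returned, and preemption only looks at containers on that one node.
    List<ResourceRequest> starved = getStarvedResourceRequests(
        Arrays.asList(new ResourceRequest("node-17.example.com", 4096),
                      new ResourceRequest("*", 4096)),
        4096);
    System.out.println(starved.size() + " " + starved.get(0).resourceName);
    // prints: 1 node-17.example.com
  }
}
```

If no container on that one node can be preempted for the app, the same node-local request is selected again on the next pass, which is the starvation loop the comments above describe.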
[jira] [Commented] (YARN-6956) preemption may only consider resource requests for one node
[ https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16115847#comment-16115847 ] Steven Rand commented on YARN-6956: --- Hi [~dan...@cloudera.com], thanks for the quick reply and explanation. That concern definitely makes sense, and in general YARN-6163 seems like a good change. However, what I'm seeing is that only considering RRs for one node actually causes some of my apps to remain starved for quite a long time. The series of events that happens in a loop is: 1. The app is correctly considered to be starved 2. The app has many RRs, several of which can be satisfied, but only one RR is actually considered for preemption as per this JIRA's description 3. That particular RR happens to be for a node on which no containers can be preempted for the app, so the app remains starved. Since the order of the list of RRs is the same each time through the loop, the same RR is always considered, no containers are preempted, and the app remains starved, even though it has other RRs that could be satisfied. I haven't thought enough yet about what a solution would look like, but it seems like we should be able to keep the benefits of YARN-6163 while also avoiding this issue. I'll try to have a patch within the next few days if people agree that we should change the behavior. > preemption may only consider resource requests for one node > --- > > Key: YARN-6956 > URL: https://issues.apache.org/jira/browse/YARN-6956 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.9.0, 3.0.0-beta1 > Environment: CDH 5.11.0 >Reporter: Steven Rand > > I'm observing the following series of events on a CDH 5.11.0 cluster, which > seem to be possible after YARN-6163: > 1.
An application is considered to be starved, so {{FSPreemptionThread}} > calls {{identifyContainersToPreempt}}, and that calls > {{FSAppAttempt#getStarvedResourceRequests}} to get a list of > {{ResourceRequest}} instances that are enough to address the app's starvation. > 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is > enough to address the app's starvation, so we break out of the loop over > {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180. > We return only this one {{ResourceRequest}} back to the > {{identifyContainersToPreempt}} method. > 3. It turns out that this particular {{ResourceRequest}} happens to have a > value for {{getResourceName}} that identifies a specific node in the cluster. > This causes preemption to only consider containers on that node, and not the > rest of the cluster. > [~kasha], does that make sense? I'm happy to submit a patch if I'm > understanding the problem correctly.
[jira] [Updated] (YARN-6956) preemption may only consider resource requests for one node
[ https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-6956: -- Description: I'm observing the following series of events on a CDH 5.11.0 cluster, which seem to be possible after YARN-6163: 1. An application is considered to be starved, so {{FSPreemptionThread}} calls {{identifyContainersToPreempt}}, and that calls {{FSAppAttempt#getStarvedResourceRequests}} to get a list of {{ResourceRequest}} instances that are enough to address the app's starvation. 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is enough to address the app's starvation, so we break out of the loop over {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180. We return only this one {{ResourceRequest}} back to the {{identifyContainersToPreempt}} method. 3. It turns out that this particular {{ResourceRequest}} happens to have a value for {{getResourceName}} that identifies a specific node in the cluster. This causes preemption to only consider containers on that node, and not the rest of the cluster. [~kasha], does that make sense? I'm happy to submit a patch if I'm understanding the problem correctly. was: I'm observing the following series of events on a CDH 5.11.0 cluster, which seem to be possible after https://issues.apache.org/jira/browse/YARN-6163: 1. An application is considered to be starved, so {{FSPreemptionThread}} calls {{identifyContainersToPreempt}}, and that calls {{FSAppAttempt#getStarvedResourceRequests}} to get a list of {{ResourceRequest}} instances that are enough to address the app's starvation. 2. 
The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is enough to address the app's starvation, so we break out of the loop over {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180. We return only this one {{ResourceRequest}} back to the {{identifyContainersToPreempt}} method. 3. It turns out that this particular {{ResourceRequest}} happens to have a value for {{getResourceName}} that identifies a specific node in the cluster. This causes preemption to only consider containers on that node, and not the rest of the cluster. [~kasha], does that make sense? I'm happy to submit a patch if I'm understanding the problem correctly. > preemption may only consider resource requests for one node > --- > > Key: YARN-6956 > URL: https://issues.apache.org/jira/browse/YARN-6956 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.9.0, 3.0.0-beta1 > Environment: CDH 5.11.0 >Reporter: Steven Rand > > I'm observing the following series of events on a CDH 5.11.0 cluster, which > seem to be possible after YARN-6163: > 1. An application is considered to be starved, so {{FSPreemptionThread}} > calls {{identifyContainersToPreempt}}, and that calls > {{FSAppAttempt#getStarvedResourceRequests}} to get a list of > {{ResourceRequest}} instances that are enough to address the app's starvation. > 2. 
The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is > enough to address the app's starvation, so we break out of the loop over > {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180. > We return only this one {{ResourceRequest}} back to the > {{identifyContainersToPreempt}} method. > 3. It turns out that this particular {{ResourceRequest}} happens to have a > value for {{getResourceName}} that identifies a specific node in the cluster. > This causes preemption to only consider containers on that node, and not the > rest of the cluster. > [~kasha], does that make sense? I'm happy to submit a patch if I'm > understanding the problem correctly.
[jira] [Updated] (YARN-6956) preemption may only consider resource requests for one node
[ https://issues.apache.org/jira/browse/YARN-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-6956: -- Description: I'm observing the following series of events on a CDH 5.11.0 cluster, which seem to be possible after https://issues.apache.org/jira/browse/YARN-6163: 1. An application is considered to be starved, so {{FSPreemptionThread}} calls {{identifyContainersToPreempt}}, and that calls {{FSAppAttempt#getStarvedResourceRequests}} to get a list of {{ResourceRequest}} instances that are enough to address the app's starvation. 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is enough to address the app's starvation, so we break out of the loop over {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180. We return only this one {{ResourceRequest}} back to the {{identifyContainersToPreempt}} method. 3. It turns out that this particular {{ResourceRequest}} happens to have a value for {{getResourceName}} that identifies a specific node in the cluster. This causes preemption to only consider containers on that node, and not the rest of the cluster. [~kasha], does that make sense? I'm happy to submit a patch if I'm understanding the problem correctly. was: I'm observing the following series of events on a CDH 5.11.0 cluster, which seem to be possible after https://issues.apache.org/jira/browse/YARN-6163: 1. An application is considered to be starved, so {{FSPreemptionThread}} calls {{identifyContainersToPreempt}}, and that calls {{FSAppAttempt#getStarvedResourceRequests}} to get a list of {{ResourceRequest}} instances that are enough to address the app's starvation. 2. 
The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is enough to address the app's starvation, so we break out of the loop over {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180. We return only this one {{ResourceRequest}} back to the {{identifyContainersToPreempt}} method. 3. It turns out that this particular {{ResourceRequest}} happens to have a value for {{getResourceName}} that identifies a specific node in the cluster. This causes preemption to only consider containers on that node, and not the rest of the cluster. [~kasha], does that make sense? I'm happy to submit a patch if I'm understanding the problem correctly. > preemption may only consider resource requests for one node > --- > > Key: YARN-6956 > URL: https://issues.apache.org/jira/browse/YARN-6956 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.9.0, 3.0.0-beta1 > Environment: CDH 5.11.0 >Reporter: Steven Rand > > I'm observing the following series of events on a CDH 5.11.0 cluster, which > seem to be possible after https://issues.apache.org/jira/browse/YARN-6163: > 1. An application is considered to be starved, so {{FSPreemptionThread}} > calls {{identifyContainersToPreempt}}, and that calls > {{FSAppAttempt#getStarvedResourceRequests}} to get a list of > {{ResourceRequest}} instances that are enough to address the app's starvation. > 2.
The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is > enough to address the app's starvation, so we break out of the loop over > {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180. > We return only this one {{ResourceRequest}} back to the > {{identifyContainersToPreempt}} method. > 3. It turns out that this particular {{ResourceRequest}} happens to have a > value for {{getResourceName}} that identifies a specific node in the cluster. > This causes preemption to only consider containers on that node, and not the > rest of the cluster. > [~kasha], does that make sense? I'm happy to submit a patch if I'm > understanding the problem correctly.
[jira] [Created] (YARN-6956) preemption may only consider resource requests for one node
Steven Rand created YARN-6956: - Summary: preemption may only consider resource requests for one node Key: YARN-6956 URL: https://issues.apache.org/jira/browse/YARN-6956 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.9.0, 3.0.0-beta1 Environment: CDH 5.11.0 Reporter: Steven Rand I'm observing the following series of events on a CDH 5.11.0 cluster, which seem to be possible after https://issues.apache.org/jira/browse/YARN-6163: 1. An application is considered to be starved, so {{FSPreemptionThread}} calls {{identifyContainersToPreempt}}, and that calls {{FSAppAttempt#getStarvedResourceRequests}} to get a list of {{ResourceRequest}} instances that are enough to address the app's starvation. 2. The first {{ResourceRequest}} that {{getStarvedResourceRequests}} sees is enough to address the app's starvation, so we break out of the loop over {{appSchedulingInfo.getAllResourceRequests()}} after only one iteration: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L1180. We return only this one {{ResourceRequest}} back to the {{identifyContainersToPreempt}} method. 3. It turns out that this particular {{ResourceRequest}} happens to have a value for {{getResourceName}} that identifies a specific node in the cluster. This causes preemption to only consider containers on that node, and not the rest of the cluster. [~kasha], does that make sense? I'm happy to submit a patch if I'm understanding the problem correctly.
[jira] [Commented] (YARN-2985) YARN should support to delete the aggregated logs for Non-MapReduce applications
[ https://issues.apache.org/jira/browse/YARN-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968609#comment-15968609 ] Steven Rand commented on YARN-2985: --- [~jlowe], thanks for the thoughtful response. Based on that information, it seems like the most straightforward way to proceed, at least for branch-2, is to add a configuration option for running the deletion service in only the timeline server, and not the JHS. Something like {{yarn.log-aggregation.run-in-timeline-server}} that defaults to {{false}} for backcompat, but when set to {{true}}, prevents the JHS from performing retention, and tells the timeline server to do it instead. Does that seem reasonable? If so I'll update the patch to do that, but certainly open to alternatives if there's a better way. For trunk, I imagine it might be worth just removing retention from the JHS and moving it to the timeline server entirely, since my understanding is that the timeline server is supposed to replace the JHS, even for deployments that only run MR jobs, and 3.0 seems like a reasonable enough point at which to require the switch from JHS to timeline server. I might be misunderstanding the relationship between the two though, so please correct me if that doesn't make sense. > YARN should support to delete the aggregated logs for Non-MapReduce > applications > > > Key: YARN-2985 > URL: https://issues.apache.org/jira/browse/YARN-2985 > Project: Hadoop YARN > Issue Type: New Feature > Components: log-aggregation, nodemanager >Affects Versions: 2.8.0 >Reporter: Xu Yang >Assignee: Steven Rand > Attachments: YARN-2985-branch-2-001.patch > > > Before Hadoop 2.6, the LogAggregationService is started in NodeManager. But > the AggregatedLogDeletionService is started in mapreduce`s JobHistoryServer. > Therefore, the Non-MapReduce application can aggregate their logs to HDFS, > but can not delete those logs. 
Need the NodeManager take over the function of > aggregated log deletion.
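For readers following along, the option proposed in the comment above might look like the following in {{yarn-site.xml}}. This is only a sketch of the proposal as worded in the comment: the property name {{yarn.log-aggregation.run-in-timeline-server}} comes from the comment itself and is not necessarily what was (or would be) committed.

{code}
<!-- Sketch of the proposed option; name taken from the comment above,
     not from a committed yarn-default.xml. Defaults to false for backcompat,
     in which case the JHS keeps performing log retention as before. -->
<property>
  <name>yarn.log-aggregation.run-in-timeline-server</name>
  <value>true</value>
</property>
{code}

When set to {{true}}, per the proposal, the Timeline Server would perform aggregated-log retention and the JHS would stop doing so.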
[jira] [Updated] (YARN-2985) YARN should support to delete the aggregated logs for Non-MapReduce applications
[ https://issues.apache.org/jira/browse/YARN-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated YARN-2985: -- Attachment: YARN-2985-branch-2-001.patch Attaching a patch for branch-2. I've tested this experimentally by deploying a patched Timeline Server to a cluster, running a Spark job on that cluster, and validating that the aggregated logs disappeared from HDFS after the configured amount of time had elapsed. The Timeline Server's logs confirm that it performed the deletion. I'm not sure how to add tests though. The existing tests for the {{TestAggregatedLogDeletionService}} are good enough to test that the service works -- the more interesting thing is verifying that when a Timeline Server is deployed, log aggregation is enforced for non-MR applications. I don't know how to test non-MR applications from the hadoop-yarn project tests though. > YARN should support to delete the aggregated logs for Non-MapReduce > applications > > > Key: YARN-2985 > URL: https://issues.apache.org/jira/browse/YARN-2985 > Project: Hadoop YARN > Issue Type: New Feature > Components: log-aggregation, nodemanager >Reporter: Xu Yang >Assignee: Steven Rand > Attachments: YARN-2985-branch-2-001.patch > > > Before Hadoop 2.6, the LogAggregationService is started in NodeManager. But > the AggregatedLogDeletionService is started in mapreduce`s JobHistoryServer. > Therefore, the Non-MapReduce application can aggregate their logs to HDFS, > but can not delete those logs. Need the NodeManager take over the function of > aggregated log deletion.
[jira] [Assigned] (YARN-2985) YARN should support to delete the aggregated logs for Non-MapReduce applications
[ https://issues.apache.org/jira/browse/YARN-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand reassigned YARN-2985: - Assignee: Steven Rand > YARN should support to delete the aggregated logs for Non-MapReduce > applications > > > Key: YARN-2985 > URL: https://issues.apache.org/jira/browse/YARN-2985 > Project: Hadoop YARN > Issue Type: New Feature > Components: log-aggregation, nodemanager >Reporter: Xu Yang >Assignee: Steven Rand > > Before Hadoop 2.6, the LogAggregationService is started in NodeManager. But > the AggregatedLogDeletionService is started in mapreduce`s JobHistoryServer. > Therefore, the Non-MapReduce application can aggregate their logs to HDFS, > but can not delete those logs. Need the NodeManager take over the function of > aggregated log deletion.
[jira] [Resolved] (YARN-6120) add retention of aggregated logs to Timeline Server
[ https://issues.apache.org/jira/browse/YARN-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand resolved YARN-6120. --- Resolution: Duplicate I now have the ability to submit a patch for YARN-2985, so this duplicate JIRA is unnecessary. > add retention of aggregated logs to Timeline Server > --- > > Key: YARN-6120 > URL: https://issues.apache.org/jira/browse/YARN-6120 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation, timelineserver >Affects Versions: 2.7.3 >Reporter: Steven Rand > Attachments: YARN-6120.001.patch > > > The MR History Server performs retention of aggregated logs for MapReduce > applications. However, there is no way of enforcing retention on aggregated > logs for other types of applications. This JIRA proposes to add log retention > to the Timeline Server. > Also, this is arguably a duplicate of > https://issues.apache.org/jira/browse/YARN-2985, but I could not find a way > to attach a patch for that issue. If someone closes this as a duplicate, > could you please assign that issue to me?
[jira] [Commented] (YARN-6308) Fix TestAMRMClient compilation errors
[ https://issues.apache.org/jira/browse/YARN-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902269#comment-15902269 ] Steven Rand commented on YARN-6308: --- Attached a new patch to HADOOP-14062, though I think this issue should have been fixed by the previous patch being reverted. > Fix TestAMRMClient compilation errors > - > > Key: YARN-6308 > URL: https://issues.apache.org/jira/browse/YARN-6308 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha3 >Reporter: Manoj Govindassamy > > Looks like fixes committed for HADOOP-14062 and YARN-6218 had conflicts and > left TestAMRMClient in a dangling state with compilation errors. > TestAMRMClient needs a fix. > {code} > [ERROR] COMPILATION ERROR : > [ERROR] > /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[145,5] > non-static variable yarnCluster cannot be referenced from a static context > [ERROR] > /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[145,71] > non-static variable nodeCount cannot be referenced from a static context > [ERROR] > /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[146,5] > non-static variable yarnCluster cannot be referenced from a static context > .. > .. 
> [ERROR] > /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[204,9] > non-static variable attemptId cannot be referenced from a static context > [ERROR] > /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[207,20] > non-static variable attemptId cannot be referenced from a static context > [ERROR] > /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[206,13] > non-static variable yarnCluster cannot be referenced from a static context > [ERROR] > /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[874,5] > cannot find symbol > [ERROR] symbol: method tearDown() > [ERROR] location: class org.apache.hadoop.yarn.client.api.impl.TestAMRMClient > [ERROR] > /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[876,5] > cannot find symbol > [ERROR] symbol: method startApp() > [ERROR] location: class org.apache.hadoop.yarn.client.api.impl.TestAMRMClient > [ERROR] > /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[881,5] > cannot find symbol > [ERROR] symbol: method tearDown() > [ERROR] location: class org.apache.hadoop.yarn.client.api.impl.TestAMRMClient > [ERROR] > /Users/manoj/work/ups-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java:[885,5] > cannot find symbol > [ERROR] symbol: method startApp() > [ERROR] location: class org.apache.hadoop.yarn.client.api.impl.TestAMRMClient > [ERROR] 
-> [Help 1] > [ERROR] > {code}
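The errors quoted above are all instances of the standard javac diagnostic "non-static variable X cannot be referenced from a static context", which typically appears when a refactor changes members between static and non-static without updating their call sites. A minimal, self-contained reproduction (illustrative only, unrelated to the actual TestAMRMClient code):

```java
public class StaticContextDemo {
  int counter = 0; // instance field: each object gets its own copy

  public static void main(String[] args) {
    // Uncommenting the next line reproduces the compiler error seen above:
    //   non-static variable counter cannot be referenced from a static context
    // counter++;

    // Static code must reach instance members through an instance:
    StaticContextDemo demo = new StaticContextDemo();
    demo.counter++;
    System.out.println(demo.counter); // prints: 1
  }
}
```

This is why reverting the conflicting patch resolves the compilation failures: it restores agreement between the declarations (static vs. instance) and the code that references them.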