[jira] [Commented] (YARN-7291) Better input parsing for resource in allocation file
[ https://issues.apache.org/jira/browse/YARN-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911814#comment-16911814 ] Wilfred Spiegelenburg commented on YARN-7291: - The change looks good, +1 (non-binding). All old tests still pass and new ones were added, so we should not have regressed. > Better input parsing for resource in allocation file > > > Key: YARN-7291 > URL: https://issues.apache.org/jira/browse/YARN-7291 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 3.1.0 >Reporter: Yufei Gu >Assignee: Zoltan Siegl >Priority: Minor > Labels: newbie > Attachments: YARN-7291.001.patch, YARN-7291.002.patch, > YARN-7291.003.patch, YARN-7291.004.patch, YARN-7291.005.patch > > > When you set a max/min share for queues in the fair scheduler allocation file, > "1024 mb, 2 4 vcores" is parsed the same as "1024 mb, 4 vcores" without any > issue, and similarly "50% memory, 50% 100%cpu" is parsed the same as "50% > memory, 100%cpu". That is confusing. We should fix it. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
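A stricter parser can reject the malformed variants the issue describes. The sketch below is illustrative only (the class and method names are hypothetical, not the actual FairScheduler parsing code): it accepts exactly one value per resource type, so stray extra tokens such as the "2" in "1024 mb, 2 4 vcores" fail to match.

```java
import java.util.regex.Pattern;

public class ResourceParseCheck {
    // Accept exactly "<number> mb, <number> vcores" or the percentage
    // forms "<number>% memory, <number>%cpu"; anything else is rejected.
    private static final Pattern RES = Pattern.compile(
        "\\s*(\\d+)\\s*(mb|%\\s*memory)\\s*,\\s*(\\d+)\\s*(vcores|%\\s*cpu)\\s*");

    static boolean isValid(String resource) {
        // matches() requires the whole string to match, so an extra
        // token like "2 4 vcores" cannot sneak through.
        return RES.matcher(resource.toLowerCase()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("1024 mb, 4 vcores"));    // true
        System.out.println(isValid("1024 mb, 2 4 vcores"));  // false
    }
}
```

A full-string match is the key design choice here: lenient number extraction silently drops the stray token, which is exactly the confusing behaviour being reported.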
[jira] [Commented] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899703#comment-16899703 ] Wilfred Spiegelenburg commented on YARN-1655: - The "new" checkstyle issue is not really new and is triggered by a layout change. Renaming the stateMachineFactory to comply would be a far bigger change. I already fixed up a lot of the layout issues in the RMContainerImpl class and will leave this one alone. The second test failure is known as YARN-9333. [~snemeth] can you have a look please? > Add implementations to FairScheduler to support increase/decrease container > resource > > > Key: YARN-1655 > URL: https://issues.apache.org/jira/browse/YARN-1655 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-1655.001.patch, YARN-1655.002.patch, > YARN-1655.003.patch, YARN-1655.004.patch, YARN-1655.005.patch > >
[jira] [Updated] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-1655: Attachment: YARN-1655.005.patch
[jira] [Commented] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898466#comment-16898466 ] Wilfred Spiegelenburg commented on YARN-1655: - The failed unit test is flaky as per YARN-8433 and not related to this change. I fixed up the checkstyle issues. Most of the change in RMContainerImpl is a layout change to clean up the incorrect state machine layout. [^YARN-1655.005.patch]
[jira] [Commented] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897819#comment-16897819 ] Wilfred Spiegelenburg commented on YARN-1655: - Thank you for the feedback [~snemeth], sorry that it took this long. I have updated the patch and fixed all the remarks. All except point 4 are straightforward changes. To fix point 4 I did the following: - made a new {{allocate}} method in the MockRM that takes no arguments and calls the real allocate with _nulls_ - updated the calls in the test code to use the new method and added a comment on what it does (i.e. process outstanding requests) - split the other {{allocate}} call in the test code into two steps: a separate allocation of the request and a call to {{allocate}} on the app master That should clear point 4 up. [^YARN-1655.004.patch]
[jira] [Updated] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-1655: Attachment: YARN-1655.004.patch
[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897745#comment-16897745 ] Wilfred Spiegelenburg commented on YARN-7621: - You will need to support duplicate leaf queue names in the CS. As [~cane] mentioned in his update, the FS has the concept of a real hierarchy. This means that you can have the following config:
{code:java}
               root
         +------^------+
     parent1        parent2
     +--^--+        +--^--+
 childA   childB childA   childB
{code}
Stripping off the last part of the queue path thus collapses the structure and causes issues. Applications that ran in different queues now end up in the same queue. If the parent queue ACLs or resource settings differ, the problem is even bigger. This could also break the placement rules currently used in the FS, for example when I generate a queue dynamically via a placement rule that has a parent rule set. The way the queue hierarchy is implemented in the CS needs to be updated to remove the limitation that every leaf queue name must be unique. This is more work than what is covered in this patch. > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: fs2cs > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference in the queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler. > FairScheduler needs a queue path but CapacityScheduler needs a queue name. There > is no doubt about the correctness of the queue definition for CapacityScheduler, > because it does not allow duplicate leaf queue names, but it's hard to switch > between FairScheduler and CapacityScheduler. 
I propose to support submitting > apps with queue path for CapacityScheduler, to make the interface clearer and > the scheduler switch smooth.
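The collapse described in the comment above can be shown in a few lines: grouping the four fully qualified FS queues from the diagram by their bare leaf name makes distinct queues collide. This is a standalone sketch, not YARN code; the queue paths are the hypothetical ones from the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LeafNameCollision {
    // Group fully qualified queue paths by their bare leaf name, which is
    // effectively what submitting by short queue name does.
    static Map<String, List<String>> byLeafName(List<String> queues) {
        Map<String, List<String>> byLeaf = new HashMap<>();
        for (String q : queues) {
            String leaf = q.substring(q.lastIndexOf('.') + 1);
            byLeaf.computeIfAbsent(leaf, k -> new ArrayList<>()).add(q);
        }
        return byLeaf;
    }

    public static void main(String[] args) {
        Map<String, List<String>> collisions = byLeafName(Arrays.asList(
            "root.parent1.childA", "root.parent1.childB",
            "root.parent2.childA", "root.parent2.childB"));
        // childA and childB each map to two different queues: the hierarchy
        // has collapsed and the placement is ambiguous.
        System.out.println(collisions.get("childA")); // two distinct paths
    }
}
```

With different ACLs or resource settings on parent1 and parent2, either choice for the ambiguous leaf is wrong for some application, which is the core of the objection.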
[jira] [Resolved] (YARN-9516) move application between queues,not check target queue acl permission
[ https://issues.apache.org/jira/browse/YARN-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YARN-9516. - Resolution: Duplicate This has been fixed in 3.0 by YARN-5554 (MoveApplicationAcrossQueues does not check user permission on the target queue). > move application between queues,not check target queue acl permission > - > > Key: YARN-9516 > URL: https://issues.apache.org/jira/browse/YARN-9516 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: powerinf >Priority: Critical > > User test1 can submit an application to queue root.test.test1, but not to > queue root.test.test2. When I submit an application to queue root.test.test1 > as user test1 and try to move the application to root.test.test2, the move > succeeds: the ACL permission on the target queue is not checked.
[jira] [Commented] (YARN-9431) Fix flaky junit test fair.TestAppRunnability after YARN-8967
[ https://issues.apache.org/jira/browse/YARN-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807337#comment-16807337 ] Wilfred Spiegelenburg commented on YARN-9431: - Thank you [~giovanni.fumarola] for the commit and [~pbacsko] for confirming the fix > Fix flaky junit test fair.TestAppRunnability after YARN-8967 > > > Key: YARN-9431 > URL: https://issues.apache.org/jira/browse/YARN-9431 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, test >Affects Versions: 3.3.0 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Minor > Fix For: 3.3.0 > > Attachments: YARN-9431.001.patch > > > In YARN-4901 one of the scheduler tests failed. This seems to be linked to > the changes around the placement rules introduced in YARN-8967. > Applications submitted in the tests are accepted and rejected at the same > time: > {code} > 2019-04-01 12:00:57,269 INFO [main] fair.FairScheduler > (FairScheduler.java:addApplication(540)) - Accepted application > application_0_0001 from user: user1, in queue: root.user1, currently num of > applications: 1 > 2019-04-01 12:00:57,269 INFO [AsyncDispatcher event handler] > fair.FairScheduler (FairScheduler.java:rejectApplicationWithMessage(1344)) - > Reject application application_0_0001 submitted by user user1 application > rejected by placement rules. > {code} > This should never happen and is most likely due to the way the tests > generate the application and events.
[jira] [Created] (YARN-9431) flaky junit test fair.TestAppRunnability after YARN-8967
Wilfred Spiegelenburg created YARN-9431: --- Summary: flaky junit test fair.TestAppRunnability after YARN-8967 Key: YARN-9431 URL: https://issues.apache.org/jira/browse/YARN-9431 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, test Affects Versions: 3.3.0 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg In YARN-4901 one of the scheduler tests failed. This seems to be linked to the changes around the placement rules introduced in YARN-8967. Applications submitted in the tests are accepted and rejected at the same time: {code} 2019-04-01 12:00:57,269 INFO [main] fair.FairScheduler (FairScheduler.java:addApplication(540)) - Accepted application application_0_0001 from user: user1, in queue: root.user1, currently num of applications: 1 2019-04-01 12:00:57,269 INFO [AsyncDispatcher event handler] fair.FairScheduler (FairScheduler.java:rejectApplicationWithMessage(1344)) - Reject application application_0_0001 submitted by user user1 application rejected by placement rules. {code} This should never happen and is most likely due to the way the tests generate the application and events.
[jira] [Commented] (YARN-4901) MockRM should clear the QueueMetrics when it starts
[ https://issues.apache.org/jira/browse/YARN-4901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806326#comment-16806326 ] Wilfred Spiegelenburg commented on YARN-4901: - I have run the test over 2500 times and cannot reproduce the failure. I do see some weird things in my local run which could explain the failure. Opened a new jira for this: [YARN-9431|https://issues.apache.org/jira/browse/YARN-9431] > MockRM should clear the QueueMetrics when it starts > --- > > Key: YARN-4901 > URL: https://issues.apache.org/jira/browse/YARN-4901 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Daniel Templeton >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-4901-001.patch > > > The {{ResourceManager}} rightly assumes that when it starts, it's starting > from naught. The {{MockRM}}, however, violates that assumption. For > example, in {{TestNMReconnect}}, each test method creates a new {{MockRM}} > instance. The {{QueueMetrics.queueMetrics}} field is static, which means > that when multiple {{MockRM}} instances are created, the {{QueueMetrics}} > bleed over. Having the MockRM clear the {{QueueMetrics}} when it starts > should resolve the issue. I haven't looked yet at scope to see how easy > that is to do.
[jira] [Created] (YARN-9417) Implement FS equivalent of AppNameMappingPlacementRule
Wilfred Spiegelenburg created YARN-9417: --- Summary: Implement FS equivalent of AppNameMappingPlacementRule Key: YARN-9417 URL: https://issues.apache.org/jira/browse/YARN-9417 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.3.0 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg The AppNameMappingPlacementRule is only available for the CS. We need the same kind of rule for the FS. The rule should use the application name as set in the submission context. This allows spark, mr or tez jobs to be run in their own queues.
[jira] [Comment Edited] (YARN-9416) Add filter options to FS placement rules
[ https://issues.apache.org/jira/browse/YARN-9416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802761#comment-16802761 ] Wilfred Spiegelenburg edited comment on YARN-9416 at 3/27/19 1:22 PM: -- The proposal is to add a new child entry to all rules, like the parent rule we have now. Name of the xml node: * filter Names of the supported attributes: * type (_allow_ or _deny_) * users (comma separated list) * groups (comma separated _ordered_ list) The type attribute is required. Either the users or the groups attribute can be omitted or left empty. If both are left empty the filter is ignored. The ordering only has an impact on the secondary group rule, and thus the group filter, in combination with the _allow_ type. That is the only rule that loops over a number of values that are returned in a random order by the OS. The order in which the list is specified will be the order in which the secondary groups are evaluated in the rule. When a rule has a filter set we check the filter before we decide if the queue found will be returned. This is independent of the ACLs. was (Author: wilfreds): The proposal is to add a new child entry to all rules, like the parent rule we have now. Name of the xml node: * userfilter * groupfilter Name of the attributes supported for each: * type (order, allow or deny) * members (comma separated ordered list) When a rule has a filter set we check the filter before we decide if the queue found will be returned. This is independent of the ACLs. > Add filter options to FS placement rules > > > Key: YARN-9416 > URL: https://issues.apache.org/jira/browse/YARN-9416 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 3.3.0 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > > The placement rules should allow filtering of the groups and/or users that > match the rule. 
> In the case of the user rule you might want it to only match if the user is a > member of a specific group. Another example would be to only allow specific > users to match the specified rule.
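The proposed filter semantics can be sketched as follows. The attribute names (type, users, groups) come from the proposal above; the class itself and its API are hypothetical, not YARN code. Assumed behaviour: an empty filter is ignored, _allow_ passes only matching users, _deny_ passes only non-matching users, and the check runs after a rule found a queue but before the queue is returned.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PlacementRuleFilter {
    enum Type { ALLOW, DENY }

    private final Type type;
    private final Set<String> users;
    private final Set<String> groups;

    PlacementRuleFilter(Type type, List<String> users, List<String> groups) {
        this.type = type;
        this.users = new HashSet<>(users);
        this.groups = new HashSet<>(groups);
    }

    // Called after a rule found a queue, before the queue is returned.
    boolean matches(String user, List<String> userGroups) {
        if (users.isEmpty() && groups.isEmpty()) {
            return true; // both attributes empty: the filter is ignored
        }
        boolean member = users.contains(user)
            || userGroups.stream().anyMatch(groups::contains);
        return type == Type.ALLOW ? member : !member;
    }

    public static void main(String[] args) {
        PlacementRuleFilter deny = new PlacementRuleFilter(
            Type.DENY, Arrays.asList("alice"), Arrays.asList());
        System.out.println(deny.matches("alice", Arrays.asList("dev"))); // false
        System.out.println(deny.matches("bob", Arrays.asList("dev")));   // true
    }
}
```

This keeps the filter orthogonal to the ACLs, matching the last sentence of the proposal: the filter decides whether the rule's result is usable at all, while ACL evaluation remains a separate step.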
[jira] [Commented] (YARN-9416) Add filter options to FS placement rules
[ https://issues.apache.org/jira/browse/YARN-9416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802761#comment-16802761 ] Wilfred Spiegelenburg commented on YARN-9416: - The proposal is to add a new child entry to all rules, like the parent rule we have now. Name of the xml node: * userfilter * groupfilter Name of the attributes supported for each: * type (order, allow or deny) * members (comma separated ordered list) When a rule has a filter set we check the filter before we decide if the queue found will be returned. This is independent of the ACLs.
[jira] [Created] (YARN-9416) Add filter options to FS placement rules
Wilfred Spiegelenburg created YARN-9416: --- Summary: Add filter options to FS placement rules Key: YARN-9416 URL: https://issues.apache.org/jira/browse/YARN-9416 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 3.3.0 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg The placement rules should allow filtering of the groups and/or users that match the rule. In the case of the user rule you might want it to only match if the user is a member of a specific group. Another example would be to only allow specific users to match the specified rule.
[jira] [Commented] (YARN-8793) QueuePlacementPolicy bind more information to assigning result
[ https://issues.apache.org/jira/browse/YARN-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802721#comment-16802721 ] Wilfred Spiegelenburg commented on YARN-8793: - The PlacementRule and PlacementManager interfaces have standardised how a chain is terminated and what is communicated back. The FS has moved to using those interfaces to handle queue placements. Placements are handled outside the scheduler. > QueuePlacementPolicy bind more information to assigning result > -- > > Key: YARN-8793 > URL: https://issues.apache.org/jira/browse/YARN-8793 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Affects Versions: 3.1.1 >Reporter: Shuai Zhang >Assignee: Shuai Zhang >Priority: Major > Attachments: YARN-8793.001.patch, YARN-8793.002.patch, > YARN-8793.003.patch, YARN-8793.004.patch, YARN-8793.005.patch, > YARN-8793.006.patch, YARN-8793.007.patch, YARN-8793.008.patch > > > Fair scheduler's QueuePlacementPolicy should bind more information to the > assigning result: > # Whether to terminate the chain of responsibility > # The reason to reject a request
[jira] [Resolved] (YARN-5387) FairScheduler: add the ability to specify a parent queue to all placement rules
[ https://issues.apache.org/jira/browse/YARN-5387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YARN-5387. - Resolution: Implemented This has been included as part of the YARN-8967 changes. Documentation is still outstanding and will be added as part of YARN-9415. > FairScheduler: add the ability to specify a parent queue to all placement > rules > --- > > Key: YARN-5387 > URL: https://issues.apache.org/jira/browse/YARN-5387 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Labels: supportability > > In the current placement policy all rules generate a queue name under > the root. The only exception is the nestedUserQueue rule. This rule allows a > queue to be created under a parent queue defined by a second rule. > Instead of creating new rule variants to also allow nested groups, secondary groups > or nested queues, we should generalise this by > allowing a parent attribute to be specified on each rule, like the create flag. > The optional parent attribute for a rule should allow the following values: > - empty (which is the same as not specifying the attribute) > - a rule > - a fixed value (with or without the root prefix)
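The three allowed parent values from the description can be sketched as a small resolver. This is a hypothetical helper, not the implementation that landed with YARN-8967: the empty case falls through to root, a fixed value is normalised with the root prefix, and a nested-rule parent would be evaluated recursively (elided here).

```java
public class ParentResolver {
    // Resolve the optional parent attribute of a placement rule.
    // Only the empty and fixed-value cases are shown; a nested rule
    // would be evaluated recursively to produce the parent path.
    static String resolveParent(String parent) {
        if (parent == null || parent.isEmpty()) {
            return "root"; // same as not specifying the attribute
        }
        return parent.startsWith("root.") ? parent : "root." + parent;
    }

    public static void main(String[] args) {
        System.out.println(resolveParent(""));         // root
        System.out.println(resolveParent("dev"));      // root.dev
        System.out.println(resolveParent("root.dev")); // root.dev
    }
}
```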
[jira] [Commented] (YARN-8795) QueuePlacementRule move to separate files
[ https://issues.apache.org/jira/browse/YARN-8795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802708#comment-16802708 ] Wilfred Spiegelenburg commented on YARN-8795: - The rules were moved as part of the switch to the new interface. The rules now all use the PlacementRule interface and are located in their own files. > QueuePlacementRule move to separate files > - > > Key: YARN-8795 > URL: https://issues.apache.org/jira/browse/YARN-8795 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Affects Versions: 3.1.1 >Reporter: Shuai Zhang >Assignee: Shuai Zhang >Priority: Major > Attachments: YARN-8795.002.patch, YARN-8795.003.patch, > YARN-8795.004.patch > >
[jira] [Commented] (YARN-8792) Revisit FairScheduler QueuePlacementPolicy
[ https://issues.apache.org/jira/browse/YARN-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802707#comment-16802707 ] Wilfred Spiegelenburg commented on YARN-8792: - None of these changes fit the integrated way the rules are now implemented in the FS and CS. As part of YARN-8948 and YARN-9298, finally integrated in YARN-8967, this has been changed. Both schedulers now use the same placement manager and placement rule code. The placement of the application in a queue has also been moved out of the FS. > Revisit FairScheduler QueuePlacementPolicy > --- > > Key: YARN-8792 > URL: https://issues.apache.org/jira/browse/YARN-8792 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 3.1.1 >Reporter: Shuai Zhang >Assignee: Shuai Zhang >Priority: Major > > Fair scheduler uses `QueuePlacementPolicy` to map a request to a queue. There > are several problems: > # The termination of the responsibility chain should be bound to the assigning > result instead of the rule. > # It should provide a reason when rejecting a request. > # More rules are still needed: > ## RejectNonLeafQueue > ## RejectDefaultQueue > ## RejectUsers > ## RejectQueues > ## DefaultByUser
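The first two points in the description can be sketched as a result object that carries the queue, a terminal flag, and a rejection reason, so termination is bound to the assigning result rather than to the rule. This is a standalone illustration with hypothetical names, not the PlacementManager code.

```java
import java.util.Arrays;
import java.util.List;

public class RuleChainSketch {
    static final class Result {
        final String queue;     // null when this rule made no placement
        final boolean terminal; // stop evaluating further rules?
        final String reason;    // populated when a request is rejected
        Result(String queue, boolean terminal, String reason) {
            this.queue = queue;
            this.terminal = terminal;
            this.reason = reason;
        }
    }

    interface Rule {
        Result assign(String user);
    }

    // Walk the chain; the RESULT decides whether the chain stops.
    static Result place(List<Rule> chain, String user) {
        for (Rule rule : chain) {
            Result result = rule.assign(user);
            if (result.queue != null || result.terminal) {
                return result;
            }
        }
        return new Result(null, true, "no rule matched user " + user);
    }

    public static void main(String[] args) {
        Rule rejectAdmin = u -> u.equals("admin")
            ? new Result(null, true, "admin may not submit")
            : new Result(null, false, null);
        Rule userQueue = u -> new Result("root." + u, true, null);

        Result r = place(Arrays.asList(rejectAdmin, userQueue), "alice");
        System.out.println(r.queue); // root.alice
    }
}
```

A caller that receives a null queue always gets a non-null reason back, which addresses the second point: rejections become explainable instead of silent.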
[jira] [Resolved] (YARN-2257) Add user to queue mappings to automatically place users' apps into specific queues
[ https://issues.apache.org/jira/browse/YARN-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg resolved YARN-2257. - Resolution: Duplicate This has been fixed as part of YARN-8948 and YARN-9298, finally integrated in YARN-8967. Both schedulers use the same placement manager and placement rule code. The rules differ between the two schedulers as the FS uses a slightly different setup with rule chaining and creation of queues that do not exist. The fix is in 3.3 and later: marking this as a duplicate of YARN-8967 > Add user to queue mappings to automatically place users' apps into specific > queues > -- > > Key: YARN-2257 > URL: https://issues.apache.org/jira/browse/YARN-2257 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Patrick Liu >Assignee: Vinod Kumar Vavilapalli >Priority: Major > Labels: features > > Currently, the fair-scheduler supports two modes: a default queue, or an individual > queue for each user. > Apparently, the default queue is not a good option, because resources > cannot be managed per user or group. > However, an individual queue for each user is not good enough either, especially when > connecting yarn with hive. There will be an increasing number of hive users in a corporate > environment. If we create a queue for every user, resource management will be > hard to maintain. > I think the problem can be solved like this: > 1. Define user->queue mappings in Fair-Scheduler.xml. Inside each queue, use > aclSubmitApps to control users' ability to submit. > 2. Each time a user submits an app to yarn, if the user is mapped to a queue, > the app will be scheduled to that queue; otherwise, the app will be submitted > to the default queue. > 3. If the user cannot pass the aclSubmitApps limits, the app will not be accepted.
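The three steps in the description above can be sketched as follows. The mapping and ACL structures are simplified stand-ins for the allocation-file configuration, and the class name is hypothetical, not FairScheduler code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class UserQueueMappingSketch {
    // Returns the queue the app lands in, or null when aclSubmitApps rejects it.
    static String placeApp(String user,
                           Map<String, String> userToQueue,
                           Map<String, Set<String>> aclSubmitApps) {
        // Steps 1+2: use the mapped queue when there is one, else the default queue.
        String queue = userToQueue.getOrDefault(user, "default");
        // Step 3: the app is only accepted if the user passes aclSubmitApps.
        Set<String> acl = aclSubmitApps.get(queue);
        if (acl != null && !acl.contains("*") && !acl.contains(user)) {
            return null; // rejected: user fails the ACL on the target queue
        }
        return queue;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("hive1", "root.etl");
        Map<String, Set<String>> acls = new HashMap<>();
        acls.put("root.etl", new HashSet<>(Arrays.asList("hive1")));
        System.out.println(placeApp("hive1", mapping, acls)); // root.etl
        System.out.println(placeApp("joe", mapping, acls));   // default
    }
}
```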
[jira] [Created] (YARN-9415) Document FS placement rule changes from YARN-8967
Wilfred Spiegelenburg created YARN-9415: --- Summary: Document FS placement rule changes from YARN-8967 Key: YARN-9415 URL: https://issues.apache.org/jira/browse/YARN-9415 Project: Hadoop YARN Issue Type: Improvement Components: documentation, fairscheduler Affects Versions: 3.3.0 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg With the changes introduced by YARN-8967 we now allow parent rules on all existing rules. This should be documented.
[jira] [Assigned] (YARN-6567) Flexible Workload Management
[ https://issues.apache.org/jira/browse/YARN-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg reassigned YARN-6567: --- Assignee: Wilfred Spiegelenburg > Flexible Workload Management > > > Key: YARN-6567 > URL: https://issues.apache.org/jira/browse/YARN-6567 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ajai Omtri >Assignee: Wilfred Spiegelenburg >Priority: Minor > Labels: features > > Yarn workload management can be a little more dynamic. > 1. Create a yarn pool by specifying more than one secondary AD group. > Scenario: > In a multi-tenant cluster there can be hundreds of AD groups per tenant and > hundreds of users per AD group. We want a way to group similar workloads into a > single yarn pool by specifying multiple secondary AD groups. > Ex: All the ETL workloads of tenants need to go into one yarn pool. This > requires adding all ETL related AD groups to one yarn pool. > 2. Demotions > Scenario: A particular workload/job has been started in a high priority yarn > pool based on the assumption that it would finish quickly, but due to a > data issue or a change in the code/query it is now running longer and > consuming large amounts of resources for a long time. In this case we want to > demote it to a yarn pool with a lower resource allocation. We don't want this one > run-away workload/job to dominate the cluster because our assumption was > wrong. > Ex: If any workload in the yarn pool runs for X minutes and/or consumes Y > resources, either alert me or push it to another yarn pool. The user can keep > demoting and can push to a yarn pool which has capped resources - like a > penalty box.
[jira] [Commented] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801415#comment-16801415 ] Wilfred Spiegelenburg commented on YARN-8967: - Thank you [~yufeigu] for the commit. > Change FairScheduler to use PlacementRule interface > --- > > Key: YARN-8967 > URL: https://issues.apache.org/jira/browse/YARN-8967 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, fairscheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-8967.001.patch, YARN-8967.002.patch, > YARN-8967.003.patch, YARN-8967.004.patch, YARN-8967.005.patch, > YARN-8967.006.patch, YARN-8967.007.patch, YARN-8967.008.patch, > YARN-8967.009.patch, YARN-8967.010.patch, YARN-8967.011.patch > > > The PlacementRule interface was introduced to be used by all schedulers as > per YARN-3635. The CapacityScheduler is using it but the FairScheduler is not > and is using its own rule definition. > YARN-8948 cleans up the implementation and removes the CS references which > should allow this change to go through. > This would be the first step in using one placement rule engine for both > schedulers.
[jira] [Commented] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799020#comment-16799020 ] Wilfred Spiegelenburg commented on YARN-8967: - 3) I need two pieces of information back from the child when we have a rule; that is what was hampering the simple move. I was also hesitant because of the possibility of adding new child nodes, besides the parent rule, specifically for introducing filters on some of the rules. I think the use of a method for just retrieving the element is simple enough and does not hamper the changes I have been looking at. 5) Yes, they should have been private and final. Updated the patch with the two changes: [^YARN-8967.011.patch]
[jira] [Updated] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-8967: Attachment: YARN-8967.011.patch
[jira] [Commented] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798645#comment-16798645 ] Wilfred Spiegelenburg commented on YARN-8967: - The junit test failure is not related. The checkstyle warning is from this patch, but it makes the internal class RuleMap so much simpler that I propose we leave it as is. [~yufeigu]: the checkstyle warning was why I introduced the getters etc., which is the basis for your earlier comment number 5).
[jira] [Commented] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797873#comment-16797873 ] Wilfred Spiegelenburg commented on YARN-8967: - 1) I missed that one too, fixed now. 3) The two for loops run over different lists. Take this example: {code} {code} The first for loop runs over the top-level list of nodes (entries: specified and nestedQueue). The second loop runs over the children of each entry in that list. You cannot see the children of the top-level nodes until you call {{getChildren()}} on them, and for that you need to cast the Node to an Element. I thus cannot collapse the two loops into one. The list also does not have an iterator, so it cannot be changed to a for-each construct. The XML files also return child Nodes that are not of the Element type even for a correct configuration, which means we have to filter while traversing the list. 4) I added the same test case. We now correctly handle that case, as well as a nestedUserQueue with no parent rule. I have great difficulty removing the create and init for the first rule, as at the point I find the first rule I don't know whether I am going to find a second one. I would need to wait until after the loop to create/init, which makes the code even more complex. 5) I had that to start with and changed it because the IDE kept complaining. Not sure why, but it works now without complaints and without the getter methods. I might have had slightly different access modifiers. It looks far more like a wrapper class now. I also found that we do not correctly test and handle the cases where entries are not _rules_. I updated the test cases and found that we had a possible NPE due to the way we process the policy. These cases are covered in {{testBrokenRules()}} and the updated tests in {{testNestedUserQueueParsingErrors()}}.
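The two-loop traversal constraint described in point 3) can be sketched with plain DOM code. The XML snippet, class, and method names below are illustrative assumptions (the empty {code} block above elided the real example), not the actual allocation-file parser; it only shows why a NodeList forces indexed loops and an Element cast before the children become visible:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class NodeListWalk {
    // NodeList has no iterator, so both loops index explicitly, and every
    // Node must be filtered and cast to Element before its children are used.
    public static int countElementChildren(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList topLevel = doc.getDocumentElement().getChildNodes();
        int count = 0;
        for (int i = 0; i < topLevel.getLength(); i++) {
            Node node = topLevel.item(i);
            if (!(node instanceof Element)) {
                continue; // skip text/comment nodes, present even in valid config
            }
            NodeList children = ((Element) node).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                if (children.item(j) instanceof Element) {
                    count++; // a nested rule element
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<queuePlacementPolicy>"
            + "<rule name=\"specified\"/>"
            + "<rule name=\"nestedUserQueue\"><rule name=\"primaryGroup\"/></rule>"
            + "</queuePlacementPolicy>";
        System.out.println(countElementChildren(xml)); // prints 1
    }
}
```

Only the nested rule under nestedUserQueue is counted; the children of the outer list are invisible until each top-level Node has been cast.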
[jira] [Updated] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-8967: Attachment: YARN-8967.010.patch
[jira] [Commented] (YARN-9278) Shuffle nodes when selecting to be preempted nodes
[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796823#comment-16796823 ] Wilfred Spiegelenburg commented on YARN-9278: - Thanks for the update [~uranus]. I don't think that we can add a lot of tests, as it would become really difficult to inspect the results. The three tests I can think of are: * setting the batch value to 0 (-1 is the default and we do a test with that). * setting the batch value to something larger than the number of nodes, to show that we do not run off the end of the node list and somehow fail. * setting the batch value to {{#NMs -1}} (i.e. batch=4 in a 5 node cluster) and doing multiple runs. Even with bad randomness we should flip over the end of the list. None of the tests should fail while iterating the nodes. > Shuffle nodes when selecting to be preempted nodes > -- > > Key: YARN-9278 > URL: https://issues.apache.org/jira/browse/YARN-9278 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > Attachments: YARN-9278.001.patch, YARN-9278.002.patch, > YARN-9278.003.patch > > > We should *shuffle* the nodes to avoid some nodes being preempted frequently. > Also, we should *limit* the number of nodes to make preemption more efficient. > Just like this, > {code:java} > // we should not iterate all nodes, that will be very slow > long maxTryNodeNum = > context.getPreemptionConfig().getToBePreemptedNodeMaxNumOnce(); > if (potentialNodes.size() > maxTryNodeNum){ > Collections.shuffle(potentialNodes); > List<FSSchedulerNode> newPotentialNodes = new ArrayList<>(); > for (int i = 0; i < maxTryNodeNum; i++){ > newPotentialNodes.add(potentialNodes.get(i)); > } > potentialNodes = newPotentialNodes; > } > {code}
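For reference, the shuffle-and-limit idea under discussion can be reduced to a stand-alone sketch that makes the three scenarios above easy to exercise. The method name and the "a value of 0 or less disables the limit" convention are assumptions for illustration, not the patch's actual API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PreemptionNodePicker {
    // Shuffle the candidate nodes and keep at most maxTryNodeNum of them so a
    // single preemption round never scans every node in a large cluster.
    // A value <= 0 (covering the -1 default discussed above) disables the limit.
    public static <T> List<T> pickCandidates(List<T> potentialNodes,
                                             int maxTryNodeNum) {
        if (maxTryNodeNum <= 0 || potentialNodes.size() <= maxTryNodeNum) {
            return potentialNodes; // limit disabled, or list already small enough
        }
        List<T> shuffled = new ArrayList<>(potentialNodes); // don't mutate input
        Collections.shuffle(shuffled);
        return shuffled.subList(0, maxTryNodeNum);
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("n1", "n2", "n3", "n4", "n5");
        System.out.println(pickCandidates(nodes, -1).size()); // prints 5: disabled
        System.out.println(pickCandidates(nodes, 10).size()); // prints 5: batch > #nodes
        System.out.println(pickCandidates(nodes, 4).size());  // prints 4: batch = #NMs - 1
    }
}
```

Copying before shuffling keeps the tracker's node list untouched, which matters when the preemption loop runs repeatedly.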
[jira] [Commented] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795640#comment-16795640 ] Wilfred Spiegelenburg commented on YARN-8967: - Cleaned up the checkstyle issues and fixed the junit test failures. Also removed a partial diff that crept in from YARN-9314.
[jira] [Updated] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-8967: Attachment: YARN-8967.009.patch
[jira] [Commented] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795049#comment-16795049 ] Wilfred Spiegelenburg commented on YARN-8967: - Thank you for the review [~yufeigu] 1) yes, it did clean up nicely 2) The class is marked as {{@Unstable}}; that should cover the change. Leaving the old constructors in could allow you to create a new {{AllocationFileLoaderService}} without a scheduler reference. That would cause an NPE on scheduler init and every single time the reload thread runs, leaving the RM in a failed state. I don't think it would be wise to leave them in. Based on all this I do think I need to file a follow-up jira to fix the Hive SHIM that uses the policy at the moment and move it to the new code in a backward-compatible way. 3) fixed that 4) fixed that 5) The difference between recovery and normal is just two if statements: in the first we ignore an empty context on recovery, and the second one is to not generate an event on recovery. Moving the code out would not help. The checks are on opposite sides of the method and simple. 6) We could still have an empty queue; that was why I left it. I just noticed that that case would be caught by {{getLeafQueue}}, so we should be OK with removing it. 7) fixed that, it should have been removed 1) I have chosen to use the utility class solution and clean up a bit more. Keeping the QueuePlacementPolicy around in the allocation does not really help, as the rules are really only relevant in the QueuePlacementManager in the new setup. There is no logic besides the rule list, which is not 1:1 with the config, that we could keep around. 2) fixed the reference (I used javadoc as there was nothing for other comments, now it is just a plain comment) 3) removed the comment and code 4) fixed 5) the tests look really similar but they are not. They test slight variations: the first two checks make sure the specified rule and create-user rule trigger correctly. The last two make sure that the specified rule triggers but not the user rule, and that the default rule does catch it correctly. 6) fixed that, I left it at first with a view to possible extension later with other bits. I now moved the parent create code out and left the loop for elements, which clears things up. 7) added a RuleMap class based on the suggestion 8) I think it is better to file a follow-up jira, as the same has happened in all new rule classes. We must have overlooked them in the previous jira when we did the cleanup. I checked and the exception is logged in the client service so it can be done.
[jira] [Comment Edited] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795049#comment-16795049 ] Wilfred Spiegelenburg edited comment on YARN-8967 at 3/18/19 2:15 PM: -- Thank you for the review [~yufeigu] AllocationFileLoaderService file: 1) yes, it did clean up nicely 2) The class is marked as {{@Unstable}}; that should cover the change. Leaving the old constructors in could allow you to create a new {{AllocationFileLoaderService}} without a scheduler reference. That would cause an NPE on scheduler init and every single time the reload thread runs, leaving the RM in a failed state. I don't think it would be wise to leave them in. _Based on all this I do think I need to file a follow-up jira to fix the Hive SHIM that uses the policy at the moment and move it to the new code in a backward-compatible way._ 3) fixed that 4) fixed that 5) The difference between recovery and normal is just two if statements: in the first we ignore an empty context on recovery, and the second one is to not generate an event on recovery. Moving the code out would not help. The checks are on opposite sides of the method and simple. 6) We could still have an empty queue; that was why I left it. I just noticed that that case would be caught by {{getLeafQueue}}, so we should be OK with removing it. 7) fixed that, it should have been removed QueuePlacementPolicy file: 1) I have chosen to use the utility class solution and clean up a bit more. Keeping the QueuePlacementPolicy around in the allocation does not really help, as the rules are really only relevant in the QueuePlacementManager in the new setup. There is no logic besides the rule list, which is not 1:1 with the config, that we could keep around. 2) fixed the reference (I used javadoc as there was nothing for other comments, now it is just a plain comment) 3) removed the comment and code 4) fixed 5) the tests look really similar but they are not. They test slight variations: the first two checks make sure the specified rule and create-user rule trigger correctly. The last two make sure that the specified rule triggers but not the user rule, and that the default rule does catch it correctly. 6) fixed that, I left it at first with a view to possible extension later with other bits. I now moved the parent create code out and left the loop for elements, which clears things up. 7) added a RuleMap class based on the suggestion 8) I think it is better to file a follow-up jira, as the same has happened in all new rule classes. We must have overlooked them in the previous jira when we did the cleanup. I checked and the exception is logged in the client service so it can be done.
[jira] [Updated] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-8967: Attachment: YARN-8967.008.patch
[jira] [Commented] (YARN-9278) Shuffle nodes when selecting to be preempted nodes
[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16794653#comment-16794653 ] Wilfred Spiegelenburg commented on YARN-9278: - Thank you for the update [~uranus]. The code change looks good; a couple of minor remarks: * We need to either add some tests or explain why we cannot add tests. * Can you fix the newly introduced checkstyle issues please? * The text for the property is much better, however this part does not make sense to me: {{The max trial nodes num to identify containers for one starved container}} I think you want to say: {{The maximum number of NodeManagers to check per pre-emption check for one starved container.}}
[jira] [Commented] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791169#comment-16791169 ] Wilfred Spiegelenburg commented on YARN-8967: - [~haibochen] or [~templedf] could either of you review this please?
[jira] [Commented] (YARN-9278) Shuffle nodes when selecting to be preempted nodes
[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790501#comment-16790501 ] Wilfred Spiegelenburg commented on YARN-9278: - I have a couple of comments on the patch. * I think we need a default that turns this functionality off. An administrator should actively turn this on. A default of -1 or 0 is better, and we should skip the calculation for that value. * Your code does not wrap at the end of the list. That is why I changed it to a do-while loop. As an example: I have a batch size of 100 and I have 350 nodes. I could start at node 300 and would still want to check 100 nodes, i.e. nodes 300-349 and then 0-49. We should never check a node twice and never go past the starting point. I updated the example code with a wrapping end calculation. * We need to cater for a setup that has the batch value set and a node list that, for whatever reason or at whatever point in time, is smaller than the batch size. Your code does not handle this. We should just process all nodes at that point and not stop at the end of the list. * The text for the property is not clear at all. (See below) * Please look at adding a test for this change. Text for the property remarks: {code} The max trial nodes num to identify containers for one starved container. Defaults to 0. {code} It does not explain what it does and why it is there. It is too cryptic. We should explain the following: * what it is used for (use a partial list of nodes to check for preemptable containers in a large cluster) * when it should be used, or when it takes effect. * what the impact is: AM container impact.
[jira] [Commented] (YARN-9344) FS should not reserve when container capability is bigger than node total resource
[ https://issues.apache.org/jira/browse/YARN-9344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790510#comment-16790510 ] Wilfred Spiegelenburg commented on YARN-9344: - Thank you [~uranus] for adding the test. Could you clean up the test a bit? There is an unused queue in the config, and the {{}} distracts from the real test; it should work without it. The asserts in the junit test are much clearer when they have a message that is printed when the test fails, for example: {code} assertEquals("Application has live containers and it should have none", 0, scheduler.getSchedulerApp(attId1).getLiveContainers().size()); {code} The test only checks memory; we should also cover the other resource types, not just the memory resource (vcores, custom resource types). > FS should not reserve when container capability is bigger than node total > resource > -- > > Key: YARN-9344 > URL: https://issues.apache.org/jira/browse/YARN-9344 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > Attachments: YARN-9344.001.patch, YARN-9344.002.patch, > YARN-9344.003.patch, YARN-9344.004.patch
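The two review points can be sketched together. The map of resource-type names below is a hypothetical stand-in for the real scheduler objects; the point is only the assertion style, where the failure message names the resource type so memory, vcores, and custom types are all covered:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResourceAssertExample {
    // Fail with a message naming the offending resource type, mirroring the
    // assertEquals(message, expected, actual) style suggested in the review.
    public static void assertNoneAllocated(Map<String, Long> allocatedByType) {
        for (Map.Entry<String, Long> e : allocatedByType.entrySet()) {
            if (e.getValue() != 0L) {
                throw new AssertionError("Application has allocated "
                    + e.getValue() + " " + e.getKey()
                    + " and it should have none");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Long> allocated = new LinkedHashMap<>();
        allocated.put("memory-mb", 0L); // memory
        allocated.put("vcores", 0L);    // cpu
        allocated.put("gpu", 0L);       // a custom resource type
        assertNoneAllocated(allocated); // passes: every type checked, not just memory
        System.out.println("ok");       // prints ok
    }
}
```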
[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes
[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780101#comment-16780101 ] Wilfred Spiegelenburg edited comment on YARN-9278 at 3/12/19 12:22 PM: --- Two things: * I still think limiting the number of nodes is something we need to approach with care. * randomising a 10,000-entry list each time we pre-empt will also become expensive. I was thinking more of something like this: {code:java} int preEmptionBatchSize = conf.getPreEmptionBatchSize(); List potentialNodes = scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName()); int size = potentialNodes.size(); int stop = 0; int current = 0; // find a start point somewhere in the list if it is long if (size > preEmptionBatchSize) { Random rand = new Random(); current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize; stop = current; } do { FSSchedulerNode mine = potentialNodes.get(current); // Identify the containers current++; // flip at the end of the list if (current >= size) { current = 0; } } while (current != stop); {code} Pre-emption runs in a loop and we could be considering different applications one after the other. Shuffling that node list continually is not good from a performance perspective. A simple cut-in like the above gives the same kind of behaviour. We could then still limit the number of "batches" we process. With some more smarts the stop condition could be based on the fact that we have processed, as an example, 10 * the batch size in nodes (a batch of nodes could be deemed equivalent to the number of nodes in a rack): {code} stop = ((10 * preEmptionBatchSize) > size) ? current : (((10 * preEmptionBatchSize) + current) % size); {code} That gives a lot of flexibility and still a decent performance in a large cluster. was (Author: wilfreds): Two things: * I still think limiting the number of nodes is something we need to approach with care. 
* randomising a 10,000 entry long list each time we pre-empt will also become expensive. I was thinking more of something like this: {code:java} int preEmptionBatchSize = conf.getPreEmptionBatchSize(); List potentialNodes = scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName()); int size = potentialNodes.size(); int stop = 0; int current = 0; // find a start point somewhere in the list if it is long if (size > preEmptionBatchSize) { Random rand = new Random(); current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize; } do { FSSchedulerNode mine = potentialNodes.get(current); // Identify the containers current++; // flip at the end of the list if (current > size) { current = 0; } } while (current != stop); {code} Pre-emption runs in a loop and we could be considering different applications one after the other. Shuffling that node list continually is not good from a performance perspective. A simple cut in like above gives the same kind of behaviour. We could then still limit the number of "batches" we process. With some more smarts the stop condition could be based on the fact that we have processed as an example 10 * the batch size in nodes (a batch of nodes could be deemed equivalent with the number of nodes in a rack): {code} stop = ((10 * preEmptionBatchSize) > size) ? current : (((10 * preEmptionBatchSize) + current) % size);); {code} That gives a lot of flexibility and still a decent performance in a large cluster. > Shuffle nodes when selecting to be preempted nodes > -- > > Key: YARN-9278 > URL: https://issues.apache.org/jira/browse/YARN-9278 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > Attachments: YARN-9278.001.patch > > > We should *shuffle* the nodes to avoid some nodes being preempted frequently. > Also, we should *limit* the num of nodes to make preemption more efficient. 
> Just like this, > {code:java} > // we should not iterate all nodes, that will be very slow > long maxTryNodeNum = > context.getPreemptionConfig().getToBePreemptedNodeMaxNumOnce(); > if (potentialNodes.size() > maxTryNodeNum){ > Collections.shuffle(potentialNodes); > List newPotentialNodes = new ArrayList(); > for (int i = 0; i < maxTryNodeNum; i++){ > newPotentialNodes.add(potentialNodes.get(i)); > } > potentialNodes = newPotentialNodes; > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail:
[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes
[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780101#comment-16780101 ] Wilfred Spiegelenburg edited comment on YARN-9278 at 3/12/19 12:25 PM: --- Two things: * I still think limiting the number of nodes is something we need to approach with care. * randomising a 10,000-entry list each time we pre-empt will also become expensive. I was thinking more of something like this: {code:java} int preEmptionBatchSize = conf.getPreEmptionBatchSize(); List potentialNodes = scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName()); int size = potentialNodes.size(); int stop = 0; int current = 0; // find a start point somewhere in the list if it is long if (size > preEmptionBatchSize) { Random rand = new Random(); current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize; stop = (preEmptionBatchSize > size) ? current : ((current + preEmptionBatchSize) % size); } do { FSSchedulerNode mine = potentialNodes.get(current); // Identify the containers current++; // flip at the end of the list if (current >= size) { current = 0; } } while (current != stop); {code} Pre-emption runs in a loop and we could be considering different applications one after the other. Shuffling that node list continually is not good from a performance perspective. A simple cut-in like the above gives the same kind of behaviour. We could then still limit the number of "batches" we process. With some more smarts the stop condition could be based on the fact that we have processed, as an example, 10 * the batch size in nodes (a batch of nodes could be deemed equivalent to the number of nodes in a rack): {code} stop = ((10 * preEmptionBatchSize) > size) ? current : (((10 * preEmptionBatchSize) + current) % size); {code} That gives a lot of flexibility and still a decent performance in a large cluster. 
was (Author: wilfreds): Two things: * I still think limiting the number of nodes is something we need to approach with care. * randomising a 10,000 entry long list each time we pre-empt will also become expensive. I was thinking more of something like this: {code:java} int preEmptionBatchSize = conf.getPreEmptionBatchSize(); List potentialNodes = scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName()); int size = potentialNodes.size(); int stop = 0; int current = 0; // find a start point somewhere in the list if it is long if (size > preEmptionBatchSize) { Random rand = new Random(); current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize; stop = current; } do { FSSchedulerNode mine = potentialNodes.get(current); // Identify the containers current++; // flip at the end of the list if (current > size) { current = 0; } } while (current != stop); {code} Pre-emption runs in a loop and we could be considering different applications one after the other. Shuffling that node list continually is not good from a performance perspective. A simple cut in like above gives the same kind of behaviour. We could then still limit the number of "batches" we process. With some more smarts the stop condition could be based on the fact that we have processed as an example 10 * the batch size in nodes (a batch of nodes could be deemed equivalent with the number of nodes in a rack): {code} stop = ((10 * preEmptionBatchSize) > size) ? current : (((10 * preEmptionBatchSize) + current) % size);); {code} That gives a lot of flexibility and still a decent performance in a large cluster. 
> Shuffle nodes when selecting to be preempted nodes > -- > > Key: YARN-9278 > URL: https://issues.apache.org/jira/browse/YARN-9278 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > Attachments: YARN-9278.001.patch > > > We should *shuffle* the nodes to avoid some nodes being preempted frequently. > Also, we should *limit* the num of nodes to make preemption more efficient. > Just like this, > {code:java} > // we should not iterate all nodes, that will be very slow > long maxTryNodeNum = > context.getPreemptionConfig().getToBePreemptedNodeMaxNumOnce(); > if (potentialNodes.size() > maxTryNodeNum){ > Collections.shuffle(potentialNodes); > List newPotentialNodes = new ArrayList(); > for (int i = 0; i < maxTryNodeNum; i++){ > newPotentialNodes.add(potentialNodes.get(i)); > } > potentialNodes = newPotentialNodes; > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To
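The batch cut-in loop discussed in the comments above can be made into a runnable sketch. This is an illustration under assumptions: `batchFrom` is a hypothetical helper name, a generic element type stands in for `FSSchedulerNode`, and the wrap test uses `>=` because `current == size` would already be past the end of the list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class BatchCutInIterator {

    // Visit one batch of nodes starting at a random batch boundary, wrapping
    // at the end of the list; this avoids shuffling the full node list on
    // every preemption pass.
    static <T> List<T> batchFrom(List<T> nodes, int batchSize, Random rand) {
        int size = nodes.size();
        if (size <= batchSize) {
            return new ArrayList<>(nodes); // short list: just take everything
        }
        // Random cut-in point aligned to a batch boundary.
        int current = rand.nextInt(size / batchSize) * batchSize;
        int stop = (current + batchSize) % size;
        List<T> visited = new ArrayList<>();
        do {
            visited.add(nodes.get(current));
            current++;
            if (current >= size) { // wrap at the end of the list
                current = 0;
            }
        } while (current != stop);
        return visited;
    }

    public static void main(String[] args) {
        List<Integer> nodes = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            nodes.add(i);
        }
        System.out.println(batchFrom(nodes, 4, new Random(42)).size()); // prints 4
    }
}
```

Each call touches exactly one batch; repeated calls with different random cut-ins spread the preemption load across the cluster without the cost of `Collections.shuffle` on a 10,000-entry list.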
[jira] [Commented] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level
[ https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786734#comment-16786734 ] Wilfred Spiegelenburg commented on YARN-9314: - Hi [~fengyongshe], thank you for filing this and providing a patch. I have a couple of comments: * The text in the exception needs clarification: {{queuename (" + queueName + ") repeated defining in Allocation File}}; something like this is clearer: {{queue name (" + queueName + ") is defined multiple times, queues can only be defined once.}} * The {{exists}} method can be simplified: {code} public boolean exists(String queueName) { for (FSQueueType queueType : FSQueueType.values()) { if (configuredQueues.get(queueType).contains(queueName)) { return true; } } return false; } {code} * Instead of checking the text of the message in the exception it is better to use {{(expected = AllocationConfigurationException.class)}} on the test. If we change the text, the test would still pass, making maintenance easier. We already do that in a number of tests, {{testQueueAlongsideRoot}} being one example. * The patch introduces a number of new checkstyle issues which should be fixed. > Fair Scheduler: Queue Info mistake when configured same queue name at same > level > > > Key: YARN-9314 > URL: https://issues.apache.org/jira/browse/YARN-9314 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: fengyongshe >Priority: Major > Attachments: Fair Scheduler Mistake when configured same queue at > same level.png, YARN-9341.patch > > > The Queue Info is configured in fair-scheduler.xml like below > > {color:#ff}{color} > 3072mb,3vcores > 4096mb,4vcores > > 1024mb,1vcores > 2048mb,2vcores > Charlie > > > {color:#ff}{color} > 1024mb,1vcores > 2048mb,2vcores > > > {color:#33}The Queue root.deva configured last will override existing > root.deva{color}{color:#33} in root.deva.sample, like the > {color}attachment > > root.deva > ||Used Resources:|| > ||Min Resources:|. => should be <3072mb,3vcores>| > ||Max Resources:|. => should be <4096mb,4vcores>| > ||Reserved Resources:|| > ||Steady Fair Share:|| > ||Instantaneous Fair Share:|| > > root.deva.sample > ||Min Resources:|| > ||Max Resources:|| > ||Reserved Resources:|| > ||Steady Fair Share:|| > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
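The simplified {{exists}} method from the review, together with the duplicate-queue rejection it enables, can be sketched as a standalone class. This is illustrative only: `QueueNameRegistry` and `define` are hypothetical names, the two-value `FSQueueType` enum is a stand-in for the real scheduler enum, and `IllegalArgumentException` stands in for the scheduler's `AllocationConfigurationException`.

```java
import java.util.EnumMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class QueueNameRegistry {

    enum FSQueueType { LEAF, PARENT } // stand-in for the real enum

    private final Map<FSQueueType, Set<String>> configuredQueues =
        new EnumMap<>(FSQueueType.class);

    QueueNameRegistry() {
        for (FSQueueType t : FSQueueType.values()) {
            configuredQueues.put(t, new HashSet<>());
        }
    }

    // Scan every queue type's configured names, as suggested in the review.
    boolean exists(String queueName) {
        for (FSQueueType queueType : FSQueueType.values()) {
            if (configuredQueues.get(queueType).contains(queueName)) {
                return true;
            }
        }
        return false;
    }

    // Reject a duplicate definition instead of silently overriding the
    // earlier one, which is the bug YARN-9314 describes.
    void define(FSQueueType type, String queueName) {
        if (exists(queueName)) {
            throw new IllegalArgumentException("queue name (" + queueName
                + ") is defined multiple times, queues can only be defined once.");
        }
        configuredQueues.get(type).add(queueName);
    }

    public static void main(String[] args) {
        QueueNameRegistry reg = new QueueNameRegistry();
        reg.define(FSQueueType.PARENT, "root.deva");
        reg.define(FSQueueType.LEAF, "root.deva.sample");
        System.out.println(reg.exists("root.deva")); // prints true
    }
}
```

With this check in place, a second definition of `root.deva` fails fast at configuration-load time instead of overriding the queue that was parsed first.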
[jira] [Commented] (YARN-9343) Replace isDebugEnabled with SLF4J parameterized log messages
[ https://issues.apache.org/jira/browse/YARN-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786572#comment-16786572 ] Wilfred Spiegelenburg commented on YARN-9343: - yes I am fine with that. This patch is big enough to leave it like this. I did not see any issues beside the ones to open new jiras for in the latest patch +1 (non binding) > Replace isDebugEnabled with SLF4J parameterized log messages > > > Key: YARN-9343 > URL: https://issues.apache.org/jira/browse/YARN-9343 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9343-001.patch, YARN-9343-002.patch, > YARN-9343-003.patch > > > Replace isDebugEnabled with SLF4J parameterized log messages. > https://www.slf4j.org/faq.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9343) Replace isDebugEnabled with SLF4J parameterized log messages
[ https://issues.apache.org/jira/browse/YARN-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786336#comment-16786336 ] Wilfred Spiegelenburg edited comment on YARN-9343 at 3/7/19 3:28 AM: - Thank you for the update [~Prabhu Joseph] I can see that we still have 200+ {{LOG.isDebugEnabled()}} calls in the code. two things: # There are a lot of simple one parameter calls which could easily be converted to unguarded calls, examples: ** NvidiaDockerV1CommandPlugin.java ** FSParentQueue.java ** Application.java # Some of the calls to {{LOG.debug}} that are guarded inside those guards have not been changed to parameterised calls yet. Do you want to file a followup jira for that or should that also be part of these changes? was (Author: wilfreds): Thank you for the update [~Prabhu Joseph] I can see that we still have 200+ {{LOG.isDebugEnabled()}} calls in the code. two things: # There are a lot of simple one parameter calls which could easily be converted to unguarded calls, examples: * NvidiaDockerV1CommandPlugin.java * FSParentQueue.java * Application.java # Some of the calls to {{LOG.debug}} that are guarded inside those have not been changed to parameterised calls yet. Do you want to file a followup jira for that or should that also be part of these changes? > Replace isDebugEnabled with SLF4J parameterized log messages > > > Key: YARN-9343 > URL: https://issues.apache.org/jira/browse/YARN-9343 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9343-001.patch, YARN-9343-002.patch, > YARN-9343-003.patch > > > Replace isDebugEnabled with SLF4J parameterized log messages. > https://www.slf4j.org/faq.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9343) Replace isDebugEnabled with SLF4J parameterized log messages
[ https://issues.apache.org/jira/browse/YARN-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786336#comment-16786336 ] Wilfred Spiegelenburg commented on YARN-9343: - Thank you for the update [~Prabhu Joseph] I can see that we still have 200+ {{LOG.isDebugEnabled()}} calls in the code. two things: # There are a lot of simple one parameter calls which could easily be converted to unguarded calls, examples: * NvidiaDockerV1CommandPlugin.java * FSParentQueue.java * Application.java # Some of the calls to {{LOG.debug}} that are guarded inside those have not been changed to parameterised calls yet. Do you want to file a followup jira for that or should that also be part of these changes? > Replace isDebugEnabled with SLF4J parameterized log messages > > > Key: YARN-9343 > URL: https://issues.apache.org/jira/browse/YARN-9343 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9343-001.patch, YARN-9343-002.patch, > YARN-9343-003.patch > > > Replace isDebugEnabled with SLF4J parameterized log messages. > https://www.slf4j.org/faq.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
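The guarded-versus-parameterized distinction discussed in the YARN-9343 comments can be sketched without pulling in SLF4J itself. The `Log` interface below is a minimal stand-in for `org.slf4j.Logger` (whose real parameterized call is `debug(String format, Object... args)`), and the `{}` substitution is a simplified emulation of what SLF4J does lazily.

```java
public class Slf4jStyleSketch {

    interface Log {
        boolean isDebugEnabled();
        void debug(String format, Object... args);
    }

    static String lastMessage;

    static final Log LOG = new Log() {
        public boolean isDebugEnabled() {
            return true;
        }
        public void debug(String format, Object... args) {
            // Emulate SLF4J's {} substitution; the real logger only does
            // this work when the debug level is actually enabled.
            String msg = format;
            for (Object a : args) {
                msg = msg.replaceFirst("\\{\\}", String.valueOf(a));
            }
            lastMessage = msg;
        }
    };

    public static void main(String[] args) {
        Object rsrc = "resource-1";
        // Old style: guard needed because string concatenation is eager.
        if (LOG.isDebugEnabled()) {
            LOG.debug("Skip downloading resource: " + rsrc);
        }
        // New style: parameterized call, no guard; formatting is deferred
        // to the logger, so the guard becomes redundant for cheap arguments.
        LOG.debug("Skip downloading resource: {}", rsrc);
        System.out.println(lastMessage);
    }
}
```

This is why simple one-parameter calls can safely drop their `isDebugEnabled()` guard, as the review suggests for classes like `FSParentQueue`.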
[jira] [Commented] (YARN-9344) FS should not reserve when container capability is bigger than node total resource
[ https://issues.apache.org/jira/browse/YARN-9344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786314#comment-16786314 ] Wilfred Spiegelenburg commented on YARN-9344: - The test failures are also not related: TestApplicationMasterServiceFair failed because it ran with the CapacityScheduler... Not sure what happened there. [~uranus] This change should be easily testable in a junit test. We should not have a -1 from test4tests. Can you please add tests to TestFSAppAttempt to make sure that this is working as expected? > FS should not reserve when container capability is bigger than node total > resource > -- > > Key: YARN-9344 > URL: https://issues.apache.org/jira/browse/YARN-9344 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > Attachments: YARN-9344.001.patch, YARN-9344.002.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786306#comment-16786306 ] Wilfred Spiegelenburg commented on YARN-8967: - Fixed the newly introduced checkstyle issues. The build should now not have any white space issues anymore. Test failures are not related to the patch, uploading patch 007. > Change FairScheduler to use PlacementRule interface > --- > > Key: YARN-8967 > URL: https://issues.apache.org/jira/browse/YARN-8967 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, fairscheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-8967.001.patch, YARN-8967.002.patch, > YARN-8967.003.patch, YARN-8967.004.patch, YARN-8967.005.patch, > YARN-8967.006.patch, YARN-8967.007.patch > > > The PlacementRule interface was introduced to be used by all schedulers as > per YARN-3635. The CapacityScheduler is using it but the FairScheduler is not > and is using its own rule definition. > YARN-8948 cleans up the implementation and removes the CS references which > should allow this change to go through. > This would be the first step in using one placement rule engine for both > schedulers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-8967: Attachment: YARN-8967.007.patch > Change FairScheduler to use PlacementRule interface > --- > > Key: YARN-8967 > URL: https://issues.apache.org/jira/browse/YARN-8967 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, fairscheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-8967.001.patch, YARN-8967.002.patch, > YARN-8967.003.patch, YARN-8967.004.patch, YARN-8967.005.patch, > YARN-8967.006.patch, YARN-8967.007.patch > > > The PlacementRule interface was introduced to be used by all schedulers as > per YARN-3635. The CapacityScheduler is using it but the FairScheduler is not > and is using its own rule definition. > YARN-8948 cleans up the implementation and removes the CS references which > should allow this change to go through. > This would be the first step in using one placement rule engine for both > schedulers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9326) Fair Scheduler configuration defaults are not documented in case of min and maxResources
[ https://issues.apache.org/jira/browse/YARN-9326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786290#comment-16786290 ] Wilfred Spiegelenburg commented on YARN-9326: - The white space issues are fixed via YARN-9348. A new build should not show them anymore. The text looks good to me now. [~templedf] you did a lot of the work around resource types. Does this change look good to you from that perspective or should we extend the new format examples with a resource type tag like this to make it really clear: {code} "vcores=X, memory-mb=Y, GPU=5" {code} > Fair Scheduler configuration defaults are not documented in case of min and > maxResources > > > Key: YARN-9326 > URL: https://issues.apache.org/jira/browse/YARN-9326 > Project: Hadoop YARN > Issue Type: Improvement > Components: docs, documentation, fairscheduler, yarn >Affects Versions: 3.2.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Attachments: YARN-9326.001.patch, YARN-9326.002.patch, > YARN-9326.003.patch, YARN-9326.004.patch, YARN-9326.005.patch > > > The FairScheduler's configuration has the following defaults (from the code: > javadoc): > {noformat} > In new style resources, any resource that is not specified will be set to > missing or 0%, as appropriate. Also, in the new style resources, units are > not allowed. Units are assumed from the resource manager's settings for the > resources when the value isn't a percentage. The missing parameter is only > used in the case of new style resources without percentages. With new style > resources with percentages, any missing resources will be assumed to be 100% > because percentages are only used with maximum resource limits. > {noformat} > This is not documented in the hadoop yarn site FairScheduler.html. It is > quite intuitive, but still needs to be documented though. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-8967: Attachment: (was: YARN-8967.006.patch) > Change FairScheduler to use PlacementRule interface > --- > > Key: YARN-8967 > URL: https://issues.apache.org/jira/browse/YARN-8967 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, fairscheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-8967.001.patch, YARN-8967.002.patch, > YARN-8967.003.patch, YARN-8967.004.patch, YARN-8967.005.patch, > YARN-8967.006.patch > > > The PlacementRule interface was introduced to be used by all schedulers as > per YARN-3635. The CapacityScheduler is using it but the FairScheduler is not > and is using its own rule definition. > YARN-8948 cleans up the implementation and removes the CS references which > should allow this change to go through. > This would be the first step in using one placement rule engine for both > schedulers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-8967: Attachment: YARN-8967.006.patch > Change FairScheduler to use PlacementRule interface > --- > > Key: YARN-8967 > URL: https://issues.apache.org/jira/browse/YARN-8967 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, fairscheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-8967.001.patch, YARN-8967.002.patch, > YARN-8967.003.patch, YARN-8967.004.patch, YARN-8967.005.patch, > YARN-8967.006.patch, YARN-8967.006.patch > > > The PlacementRule interface was introduced to be used by all schedulers as > per YARN-3635. The CapacityScheduler is using it but the FairScheduler is not > and is using its own rule definition. > YARN-8948 cleans up the implementation and removes the CS references which > should allow this change to go through. > This would be the first step in using one placement rule engine for both > schedulers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9343) Replace isDebugEnabled with SLF4J parameterized log messages
[ https://issues.apache.org/jira/browse/YARN-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785182#comment-16785182 ] Wilfred Spiegelenburg commented on YARN-9343: - Hi [~Prabhu Joseph] I see a lot of changes between patch 1 and patch 2. Patch 2 contains about 50 more files that have changed. Can you explain what was done? I see that there are a large number of new files in patch 2, but some files that were in patch 1 are missing from patch 2: * 55 new files in patch 2 * 5 files removed from patch 2 > Replace isDebugEnabled with SLF4J parameterized log messages > > > Key: YARN-9343 > URL: https://issues.apache.org/jira/browse/YARN-9343 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9343-001.patch, YARN-9343-002.patch > > > Replace isDebugEnabled with SLF4J parameterized log messages. > https://www.slf4j.org/faq.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785164#comment-16785164 ] Wilfred Spiegelenburg commented on YARN-9298: - Thank you [~yufeigu] I will follow up with the real integration as part of YARN-8967. > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9298.001.patch, YARN-9298.002.patch, > YARN-9298.003.patch, YARN-9298.004.patch, YARN-9298.005.patch, > YARN-9298.006.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9341) Reentrant lock() before try
[ https://issues.apache.org/jira/browse/YARN-9341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785162#comment-16785162 ] Wilfred Spiegelenburg commented on YARN-9341: - So we do have one and also one {{lockInterruptibly()}} in another part. The change as proposed by [~Prabhu Joseph] has left those two unchanged; neither is covered under the description of the JIRA either. It just talks about the {{lock()}} cases. The only replacements that have been made are the direct calls to {{lock()}} in the patch; neither of the two other ones has been touched. That is what I based my +1 on. > Reentrant lock() before try > --- > > Key: YARN-9341 > URL: https://issues.apache.org/jira/browse/YARN-9341 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.1.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Minor > Attachments: YARN-9341-001.patch > > > As a best practice - Reentrant lock has to be acquired before try clause. > https://stackoverflow.com/questions/31058681/java-locking-structure-best-pattern > There are many places where lock is obtained inside try. > {code} > try { >this.writeLock.lock(); > > } finally { > this.writeLock.unlock(); > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9344) FS should not reserve when container capability is bigger than node total resource
[ https://issues.apache.org/jira/browse/YARN-9344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785103#comment-16785103 ] Wilfred Spiegelenburg commented on YARN-9344: - Hi [~uranus], yes this is a problem, nice catch. However, don't we have a more generic problem with the fact that we offer the node to this application attempt at all? The reservation is one thing, but I think we should shortcut this assignment completely. If the specific request does not fit at all we need to move to the next request for the application attempt. That would mean we need to move it one call up, into {{assignContainer(FSSchedulerNode node, boolean reserved)}}, instead of where it is now. Does that make sense to you? > FS should not reserve when container capability is bigger than node total > resource > -- > > Key: YARN-9344 > URL: https://issues.apache.org/jira/browse/YARN-9344 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > Attachments: YARN-9344.001.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9341) Reentrant lock() before try
[ https://issues.apache.org/jira/browse/YARN-9341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783987#comment-16783987 ] Wilfred Spiegelenburg commented on YARN-9341: - An {{IllegalMonitorStateException}} can only happen on unlock if the current thread is not the owner of the lock. I don't think we use {{tryLock}} or {{lockInterruptibly}} anywhere in our code and thus do not need to worry about the {{IllegalMonitorStateException}}. When you call {{lock()}}, the thread is blocked until it acquires the lock. We should thus never proceed beyond the lock line, and the finally clause should never be executed until after the thread has the lock. The change proposed even follows the Java [API doc|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/ReentrantLock.html] for the locking. +1 (non binding) > Reentrant lock() before try > --- > > Key: YARN-9341 > URL: https://issues.apache.org/jira/browse/YARN-9341 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.1.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Minor > Attachments: YARN-9341-001.patch > > > As a best practice - Reentrant lock has to be acquired before try clause. > https://stackoverflow.com/questions/31058681/java-locking-structure-best-pattern > There are many places where lock is obtained inside try. > {code} > try { >this.writeLock.lock(); > > } finally { > this.writeLock.unlock(); > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
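The lock-before-try pattern the YARN-9341 discussion converges on can be shown in a minimal self-contained class (the `increment` method and counter are illustrative, not YARN code). Acquiring outside the try guarantees the `finally` block, and therefore `unlock()`, only runs once the lock is actually held, which is exactly what rules out the `IllegalMonitorStateException` case mentioned above.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class LockBeforeTry {

    private final Lock writeLock = new ReentrantLock();
    private int value;

    int increment() {
        // Acquire BEFORE the try block, matching the ReentrantLock javadoc
        // idiom. If lock() itself threw, finally would not run, so we could
        // never unlock a lock we do not hold.
        writeLock.lock();
        try {
            return ++value;
        } finally {
            writeLock.unlock();
        }
    }

    public static void main(String[] args) {
        LockBeforeTry counter = new LockBeforeTry();
        System.out.println(counter.increment()); // prints 1
    }
}
```

The anti-pattern from the issue description (`lock()` as the first statement inside the try) behaves the same on the happy path; the difference only shows when acquisition fails before the lock is held.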
[jira] [Commented] (YARN-9343) Replace isDebugEnabled with SLF4J parameterized log messages
[ https://issues.apache.org/jira/browse/YARN-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783952#comment-16783952 ] Wilfred Spiegelenburg commented on YARN-9343: - Hi [~Prabhu Joseph], thank you for this update. I have looked at it a couple of times and just updated the parts that I touched. It is good to have this done globally. I do have some remarks: * I saw an inconsistency in how we log exceptions. In some places we use {{debug(ex.getMessage());}} while in others we just use {{debug({}, ex);}}; it would be good to come to a standard way of logging them. * Again for consistency's sake: in the case that we just log the exception it would be nice to add that to the message text itself so we know that it is ignored; we do it in a number of places but not everywhere. * In {{CombinedResourceCalculator}} we have two consecutive LOG.debug statements in the diff, but only one is replaced. * Do we need to use {{String.valueOf(pullImageTimeMs)}} in {{DockerLinuxContainerRuntime}}, or can we not just pass the object? * In {{ResourceLocalizationService}} you have missed an object reference in the text:
{code:java}
LOG.debug("Skip downloading resource: {} since it's in" + " state: ", key, rsrc.getState());
{code}
* In {{AmIpFilter}} you have removed the guard but not changed the format string etc.
{code}
LOG.debug("Could not find " + WebAppProxyServlet.PROXY_USER_COOKIE_NAME + " cookie, so user will not be set");
{code}
I saw a couple of cases in which we are doing expensive operations in preparing the objects just for logging. Should we not keep the guard around them to prevent the overhead: * TimelineUtils.dumpTimelineRecordtoJSON(entity) * Arrays.toString(fullCommandArray) * StringUtils.join(",", assignedResources) Can you also check the checkstyle issues and clean up the line-break string concatenations you are using? 
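The two logging patterns being asked for above can be sketched as follows. To keep the example self-contained and testable, {{MiniLogger}} is a tiny hypothetical stand-in mimicking the SLF4J parameterized API; real code would use {{org.slf4j.Logger}}:

```java
import java.util.ArrayList;
import java.util.List;

// Tiny stand-in that mimics the SLF4J parameterized-logging API so this
// sketch is self-contained; real code would use org.slf4j.Logger.
final class MiniLogger {
    final List<String> lines = new ArrayList<>();
    private final boolean debugEnabled;

    MiniLogger(boolean debugEnabled) { this.debugEnabled = debugEnabled; }

    boolean isDebugEnabled() { return debugEnabled; }

    void debug(String format, Object... args) {
        if (!debugEnabled) {
            return; // arguments are never formatted when DEBUG is off
        }
        String msg = format;
        for (Object arg : args) {
            msg = msg.replaceFirst("\\{\\}", String.valueOf(arg));
        }
        lines.add(msg);
    }
}

final class ResourceLogging {
    // Parameterized form: one {} placeholder per argument, and no
    // isDebugEnabled() guard needed, because nothing is formatted
    // unless DEBUG is actually on.
    static void logSkip(MiniLogger log, String key, String state) {
        log.debug("Skip downloading resource: {} since it's in state: {}", key, state);
    }

    // Keep the guard only when preparing the argument is itself expensive,
    // e.g. joining a large collection or serialising an object to JSON.
    static void logExpensive(MiniLogger log, List<String> assignedResources) {
        if (log.isDebugEnabled()) {
            log.debug("Assigned resources: {}", String.join(",", assignedResources));
        }
    }
}
```

The first method shows why the guard can usually be dropped; the second shows the cases called out above (dumpTimelineRecordtoJSON, Arrays.toString, StringUtils.join) where the guard should stay.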
> Replace isDebugEnabled with SLF4J parameterized log messages > > > Key: YARN-9343 > URL: https://issues.apache.org/jira/browse/YARN-9343 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9343-001.patch > > > Replace isDebugEnabled with SLF4J parameterized log messages. > https://www.slf4j.org/faq.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782932#comment-16782932 ] Wilfred Spiegelenburg commented on YARN-9298: - # saw that, had them already fixed in a new version # fixed that one and also made the {{QueueManager}} private and introduced a getter for it. It is only set in the class itself and needed outside when the parent rule is run (that fixes the 3rd checkstyle issue) # It should have been true from the start, changed the init to true. Removed two unneeded casts also in the {{setConfig}} method. I think that is it [~yufeigu] > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch, YARN-9298.002.patch, > YARN-9298.003.patch, YARN-9298.004.patch, YARN-9298.005.patch, > YARN-9298.006.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-9298: Attachment: YARN-9298.006.patch > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch, YARN-9298.002.patch, > YARN-9298.003.patch, YARN-9298.004.patch, YARN-9298.005.patch, > YARN-9298.006.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6487) FairScheduler: remove continuous scheduling (YARN-1010)
[ https://issues.apache.org/jira/browse/YARN-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782897#comment-16782897 ] Wilfred Spiegelenburg commented on YARN-6487: - It works both ways: before we can schedule we need to update the current usage and shares etc. This runs in an update thread. Continuous scheduling triggers that update. The heartbeats, when they are processed, do the same. This updating requires a lock on the scheduler, as does the scheduling process itself. The extra update demand is the trigger. So you get into a state where the heartbeat, the updates and the scheduling itself are all waiting for the lock. The larger the number of nodes, the larger the number of applications (in most cases) and the larger the number of queues (again in most cases). All this combined causes processing to start lagging, and continuous scheduling really loses its function. Node numbers influence continuous scheduling and the other way around. > FairScheduler: remove continuous scheduling (YARN-1010) > --- > > Key: YARN-6487 > URL: https://issues.apache.org/jira/browse/YARN-6487 > Project: Hadoop YARN > Issue Type: Task > Components: fairscheduler >Affects Versions: 2.7.0 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > > Remove deprecated FairScheduler continuous scheduler code -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9326) Fair Scheduler configuration defaults are not documented in case of min and maxResources
[ https://issues.apache.org/jira/browse/YARN-9326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781720#comment-16781720 ] Wilfred Spiegelenburg commented on YARN-9326: - Thank you for the update [~adam.antal] I see the same message when building with those options. For remark 4: it should have been {{maxContainerAllocation}} as you said. Please update the text for the vcores. For all the {{max}} settings, if the new definition is used, all resource types are given and/or set. Old ones will only set memory and cores and leave unspecified ones set to 0. All, including the unspecified ones, are checked recursively up the queue tree. The *root* queue values are set via yarn.scheduler.maximum* and the resource type config. I might not have been completely clear in my comment #6. I am missing the fact that the {{maxResources}} limit is also enforced recursively. A queue will not be assigned a container if that assignment would put the queue or its parent(s) over the maximum resources. It is the same for maxima assigned via {{maxResources}} on static queues and {{maxChildResources}} for dynamic queues. > Fair Scheduler configuration defaults are not documented in case of min and > maxResources > > > Key: YARN-9326 > URL: https://issues.apache.org/jira/browse/YARN-9326 > Project: Hadoop YARN > Issue Type: Improvement > Components: docs, documentation, fairscheduler, yarn >Affects Versions: 3.2.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Attachments: YARN-9326.001.patch, YARN-9326.002.patch, > YARN-9326.003.patch, YARN-9326.004.patch > > > The FairScheduler's configuration has the following defaults (from the code: > javadoc): > {noformat} > In new style resources, any resource that is not specified will be set to > missing or 0%, as appropriate. Also, in the new style resources, units are > not allowed. Units are assumed from the resource manager's settings for the > resources when the value isn't a percentage. 
The missing parameter is only > used in the case of new style resources without percentages. With new style > resources with percentages, any missing resources will be assumed to be 100% > because percentages are only used with maximum resource limits. > {noformat} > This is not documented in the hadoop yarn site FairScheduler.html. It is > quite intuitive, but still need to be documented though. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
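To make the old-style vs. new-style distinction discussed above concrete, here is a hedged allocation-file fragment. The queue name and values are illustrative only, not taken from the patch; see the FairScheduler documentation for the authoritative syntax:

```xml
<allocations>
  <queue name="analytics">
    <!-- Old style: only memory and vcores can be set; any other
         configured resource type is left at 0. -->
    <minResources>4096 mb, 4 vcores</minResources>
    <!-- New style: resource types are listed by name, units are
         inferred from the resource manager's settings, and
         percentages are only meaningful for maximum limits. -->
    <maxResources>memory-mb=40%, vcores=50%</maxResources>
  </queue>
</allocations>
```

Per the comment above, such a maximum is enforced recursively: an assignment is refused if it would put this queue or any of its parents over their configured maximum.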
[jira] [Commented] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781661#comment-16781661 ] Wilfred Spiegelenburg commented on YARN-9298: - Last version also fixes the {{createQueue}} flag and removes unchecked casts from the test code. [~yufeigu] Please ignore patch 004 and check 005. > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch, YARN-9298.002.patch, > YARN-9298.003.patch, YARN-9298.004.patch, YARN-9298.005.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-9298: Attachment: YARN-9298.005.patch > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch, YARN-9298.002.patch, > YARN-9298.003.patch, YARN-9298.004.patch, YARN-9298.005.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781648#comment-16781648 ] Wilfred Spiegelenburg commented on YARN-9298: - # I added the abstract {{FSPlacementRule}} and moved things into it. I do not want to add the {{FSPlacementRule}} into the {{PlacementFactory}}; because of that I want to keep a blank {{setConfig}} in that definition. I am not happy with the {{createQueue}} that is left there and am still trying to get to a fix for that without too much impact. # That was a good catch; the build and my check of the build did not pick that one up. # I can live with either solution, changed it to your preferred way. > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch, YARN-9298.002.patch, > YARN-9298.003.patch, YARN-9298.004.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-9298: Attachment: YARN-9298.004.patch > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch, YARN-9298.002.patch, > YARN-9298.003.patch, YARN-9298.004.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6487) FairScheduler: remove continuous scheduling (YARN-1010)
[ https://issues.apache.org/jira/browse/YARN-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780108#comment-16780108 ] Wilfred Spiegelenburg commented on YARN-6487: - The removal of continuous scheduling was/is based on performance numbers and locking issues. Continuous scheduling was introduced to help speed up allocating containers in a small cluster that did not have a large number of heartbeats coming in. This would happen in clusters that were running a mixed load of containers with an emphasis on longer running containers. In those clusters the NM heartbeats would hold up assigning containers when a burst of requests came in. The side effect is, however, that when a cluster grows (100+ nodes) the number of heartbeats that need processing starts interfering with the continuous scheduling thread and other internal threads. This does cause thread starvation, and in the worst case scheduling comes to a standstill. The improvements that have been made in the scheduler, which now allow you to assign multiple containers per heartbeat and still spread the load over multiple nodes, have made continuous scheduling unneeded in all but the smallest clusters. In those clusters changing NM heartbeat intervals can be used to work around that. So we really do not need it anymore. If turned on in large clusters it can cause a lot of side effects, which is why we decided to deprecate it. We could think about completely decoupling scheduling from the NM heartbeat to remove the locking, but that would be a far bigger task which affects all schedulers. 
> FairScheduler: remove continuous scheduling (YARN-1010) > --- > > Key: YARN-6487 > URL: https://issues.apache.org/jira/browse/YARN-6487 > Project: Hadoop YARN > Issue Type: Task > Components: fairscheduler >Affects Versions: 2.7.0 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > > Remove deprecated FairScheduler continuous scheduler code -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9278) Shuffle nodes when selecting to be preempted nodes
[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780101#comment-16780101 ] Wilfred Spiegelenburg commented on YARN-9278: - Two things: * I still think limiting the number of nodes is something we need to approach with care. * Randomising a 10,000-entry list each time we pre-empt will also become expensive. I was thinking more of something like this:
{code:java}
int preEmptionBatchSize = conf.getPreEmptionBatchSize();
List<FSSchedulerNode> potentialNodes =
    scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName());
int size = potentialNodes.size();
int current = 0;
// find a start point somewhere in the list if it is long
if (size > preEmptionBatchSize) {
  Random rand = new Random();
  current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize;
}
int stop = current;
do {
  FSSchedulerNode mine = potentialNodes.get(current);
  // Identify the containers
  current++;
  // flip back to the start at the end of the list
  if (current >= size) {
    current = 0;
  }
} while (current != stop);
{code}
Pre-emption runs in a loop and we could be considering different applications one after the other. Shuffling that node list continually is not good from a performance perspective. A simple cut-in like the above gives the same kind of behaviour. We could then still limit the number of "batches" we process. With some more smarts the stop condition could be based on the fact that we have processed, as an example, 10 * the batch size in nodes (a batch of nodes could be deemed equivalent to the number of nodes in a rack):
{code}
stop = ((10 * preEmptionBatchSize) > size) ? current : (((10 * preEmptionBatchSize) + current) % size);
{code}
That gives a lot of flexibility and still a decent performance in a large cluster. 
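The wrap-around iteration sketched above can be written as a small self-contained routine (the names here are illustrative, not the actual FairScheduler API). It visits every index exactly once, starting from a random batch-aligned offset so the same nodes are not always examined first:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class BatchedNodeScan {
    // Return the order in which indices [0, size) would be visited when
    // starting at a random multiple of batchSize and wrapping at the end.
    static List<Integer> visitOrder(int size, int batchSize, Random rand) {
        List<Integer> order = new ArrayList<>(Math.max(size, 0));
        if (size <= 0) {
            return order; // nothing to scan
        }
        int start = 0;
        // only bother with a random cut-in when the list is long
        if (batchSize > 0 && size > batchSize) {
            start = rand.nextInt(size / batchSize) * batchSize;
        }
        int current = start;
        do {
            order.add(current);
            current++;
            if (current >= size) {
                current = 0; // wrap at the end of the list
            }
        } while (current != start);
        return order;
    }
}
```

Because the stop condition compares against the chosen start point, the scan terminates after exactly one full pass regardless of where the cut-in landed.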
> Shuffle nodes when selecting to be preempted nodes > -- > > Key: YARN-9278 > URL: https://issues.apache.org/jira/browse/YARN-9278 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > > We should *shuffle* the nodes to avoid some nodes being preempted frequently. > Also, we should *limit* the num of nodes to make preemption more efficient. > Just like this, > {code:java} > // we should not iterate all nodes, that will be very slow > long maxTryNodeNum = > context.getPreemptionConfig().getToBePreemptedNodeMaxNumOnce(); > if (potentialNodes.size() > maxTryNodeNum){ > Collections.shuffle(potentialNodes); > List newPotentialNodes = new ArrayList(); > for (int i = 0; i < maxTryNodeNum; i++){ > newPotentialNodes.add(potentialNodes.get(i)); > } > potentialNodes = newPotentialNodes; > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779358#comment-16779358 ] Wilfred Spiegelenburg edited comment on YARN-9298 at 2/28/19 2:11 AM: -- 1) oops, copy-paste error, fixed now 2) yep, you're right, replaced the text 3) added 4) The tests we have in YARN-8967 are up a level: they test the rules as part of a list of rules and not really every rule independently. They do not check the rule config/init parts. I have added new tests for all rules in the {{TestPlacementRuleFS}} class for config and init. I would like to leave the placement checks in the policy for clarity. 5) You cannot use a switch with an Object as the input as per the [java docs|https://docs.oracle.com/javase/tutorial/java/nutsandbolts/switch.html]. To do that we would need to switch on a string object compared to the Class name, which I don't think is a good idea as it is discouraged due to false positives/negatives and class loader dependencies. 6) For the {{setConfig()}}: * moving the Object check out will pollute the abstract class with FairScheduler dependencies and two extra {{setConfig()}} methods. Those 2 methods will be _noop_ implementations in the abstract class. I think that is more confusing when you look at it from other schedulers. * The only part that could possibly be pulled out is getting the create flag out, which is done in this version of the patch. 7) I looked at {{initialize()}} but that is not really possible: * Moving the scheduler check out is not possible, especially not into the abstract class. * The check for the parent rule outside the class itself does not make it any cleaner. Two different cases are handled in the same code lines (not allowed and not the same class). Moving them makes it really messy. 
> Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch, YARN-9298.002.patch, > YARN-9298.003.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9326) Fair Scheduler configuration defaults are not documented in case of min and maxResources
[ https://issues.apache.org/jira/browse/YARN-9326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780015#comment-16780015 ] Wilfred Spiegelenburg commented on YARN-9326: - Thank you for the updated patch [~adam.antal] Some further text comments: * For minResources this text is no longer correct when you take into account resource types: {{For the single-resource fairness policy, the vcores value is ignored.}} It should mention that it only uses the memory setting, since we can have more resource types than just two. * This is not correct: {{**maxResources**: maximum resources a queue will allocated.}} It should be something like _can be allocated_, not _will allocated_. * Same text here needs to be fixed: {{**maxChildResources**: maximum resources an ad hoc child queue will allocated.}} * For the maxChildResources this is not correct: {{It's default value is **yarn.scheduler.maximum-allocation-mb**.}} as it ignores types and even the vcores: ** It should mention the vcore equivalent for the yarn config. ** The scheduler max allocation, which is again a resource object and thus can set a limit on all resource types (via the resource type config file). * This sentence should not be in the maxChildResources section: {{In the latter case the units will be inferred from the default units configured for that resource.}} * A child queue limit is enforced recursively: a queue will not be assigned a container if that assignment would put the child queue or its parent(s) over the maximum resources. 
* In the last changed sentence: _or maximum_ change to _or to the maximum_ > Fair Scheduler configuration defaults are not documented in case of min and > maxResources > > > Key: YARN-9326 > URL: https://issues.apache.org/jira/browse/YARN-9326 > Project: Hadoop YARN > Issue Type: Improvement > Components: docs, documentation, fairscheduler, yarn >Affects Versions: 3.2.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Attachments: YARN-9326.001.patch, YARN-9326.002.patch, > YARN-9326.003.patch > > > The FairScheduler's configuration has the following defaults (from the code: > javadoc): > {noformat} > In new style resources, any resource that is not specified will be set to > missing or 0%, as appropriate. Also, in the new style resources, units are > not allowed. Units are assumed from the resource manager's settings for the > resources when the value isn't a percentage. The missing parameter is only > used in the case of new style resources without percentages. With new style > resources with percentages, any missing resources will be assumed to be 100% > because percentages are only used with maximum resource limits. > {noformat} > This is not documented in the hadoop yarn site FairScheduler.html. It is > quite intuitive, but still need to be documented though. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-8967: Attachment: YARN-8967.006.patch > Change FairScheduler to use PlacementRule interface > --- > > Key: YARN-8967 > URL: https://issues.apache.org/jira/browse/YARN-8967 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, fairscheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-8967.001.patch, YARN-8967.002.patch, > YARN-8967.003.patch, YARN-8967.004.patch, YARN-8967.005.patch, > YARN-8967.006.patch > > > The PlacementRule interface was introduced to be used by all schedulers as > per YARN-3635. The CapacityScheduler is using it but the FairScheduler is not > and is using its own rule definition. > YARN-8948 cleans up the implementation and removes the CS references which > should allow this change to go through. > This would be the first step in using one placement rule engine for both > schedulers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-9298: Attachment: YARN-9298.003.patch > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch, YARN-9298.002.patch, > YARN-9298.003.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779358#comment-16779358 ] Wilfred Spiegelenburg commented on YARN-9298: - 1) oops, copy-paste error, fixed now 2) yep, you're right, replaced the text 3) added 4) The tests we have in YARN-8967 are up a level: they test the rules as part of a list of rules and not really every rule independently. They do not check the rule config/init parts. I have added new tests for all rules in the {{TestPlacementRuleFS}} class for config and init. I would like to leave the placement checks in the policy for clarity. 5) You cannot use a switch with an Object as the input as per the [java docs|https://docs.oracle.com/javase/tutorial/java/nutsandbolts/switch.html]. To do that we would need to switch on a string object compared to the Class name, which I don't think is a good idea as it is discouraged due to false positives/negatives and class loader dependencies. 6) For the {{setConfig()}}: * moving the Object check out will pollute the abstract class with FairScheduler dependencies and two extra {{setConfig()}} methods. Those 2 methods will be _noop_ implementations in the abstract class. I think that is more confusing when you look at it from other schedulers. * The only part that could possibly be pulled out is getting the create flag out, which is done in this version of the patch. 7) I looked at {{initialize()}} but that is not really possible: * Moving the scheduler check out is not possible, especially not into the abstract class. * The check for the parent rule outside the class itself does not make it any cleaner. Two different cases are handled in the same code lines (not allowed and not the same class). Moving them makes it really messy. 
> Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch, YARN-9298.002.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9326) Fair Scheduler configuration defaults are not documented in case of min and maxResources
[ https://issues.apache.org/jira/browse/YARN-9326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779296#comment-16779296 ] Wilfred Spiegelenburg commented on YARN-9326: - Thanks [~adam.antal] for following up on YARN-8662 * I looked at the documentation and am missing the changes to the {{minResources}}. As I stated in YARN-8662 the {{minResources}} tag also handles % signs in its definition. That is not mentioned in any style example. We need to add it for new and old style definitions. * New style resources for all settings can use % which is not shown. I am thus still missing the example for new style resources that use the percentage in all settings: {code} vcores=X%, memory-mb=Y% {code} * The other thing that is still not clear in the update is that the only case in which we default the resource types not specified to either 0 or the maximum is when we use the new style resources. That should be combined with making it even clearer that resource types should *not* be used in combination with old style definitions. > Fair Scheduler configuration defaults are not documented in case of min and > maxResources > > > Key: YARN-9326 > URL: https://issues.apache.org/jira/browse/YARN-9326 > Project: Hadoop YARN > Issue Type: Improvement > Components: docs, documentation, fairscheduler, yarn >Affects Versions: 3.2.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Attachments: YARN-9326.001.patch, YARN-9326.002.patch > > > The FairScheduler's configuration has the following defaults (from the code: > javadoc): > {noformat} > In new style resources, any resource that is not specified will be set to > missing or 0%, as appropriate. Also, in the new style resources, units are > not allowed. Units are assumed from the resource manager's settings for the > resources when the value isn't a percentage. The missing parameter is only > used in the case of new style resources without percentages. 
With new style > resources with percentages, any missing resources will be assumed to be 100% > because percentages are only used with maximum resource limits. > {noformat} > This is not documented in the hadoop yarn site FairScheduler.html. It is > quite intuitive, but it still needs to be documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
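To make the requested example concrete, the documentation addition being asked for could look something like the following allocation file fragment (a sketch based only on the behaviour quoted above; the queue name and values are made up):

```xml
<allocations>
  <queue name="example">
    <!-- old style: absolute values with units -->
    <minResources>1024 mb, 1 vcores</minResources>
    <!-- new style without percentages: no units allowed, they are taken
         from the resource manager settings; unspecified resource types
         fall back to the "missing" default -->
    <maxResources>memory-mb=4096, vcores=4</maxResources>
    <!-- new style with percentages: unspecified resource types default
         to 100%, since percentages are only used for maximum limits -->
    <!-- <maxResources>memory-mb=50%, vcores=50%</maxResources> -->
  </queue>
</allocations>
```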
[jira] [Commented] (YARN-9278) Shuffle nodes when selecting to be preempted nodes
[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776502#comment-16776502 ] Wilfred Spiegelenburg commented on YARN-9278: - [~uranus] I can understand that you want to limit the number of nodes to look at for pre-emption in large clusters. It could speed things up in certain cases. However, when I look at the way we identify containers, we already break out of the loop when we get to a node that gives back a container list without AMs: in {{identifyContainersToPreemptForOneContainer}} we break out of the loop checking nodes when {{numAMContainers}} is 0. So we do already break out of the loop looking for suitable nodes. Based on your comment, this change will introduce a trade-off between AMs and nodes. You propose to stop checking nodes even if we still have AMs in the list. In other words, you are willing to accept some AMs in the list even if that has side effects on those applications. I don't think that is a good idea. I do agree with you that for the ANY resource we probably want to do something else and not just grab the first nodes out of the list all the time. The list that comes back from the node tracker is unsorted and just a copy of what is known, without a filter. We should introduce some logic to not just run a for loop over the list from the start. If we use a seeded start point somewhere in the list which moves around, we spread our preemption better. We could base the starting point on the current time (in seconds) and the size of the list returned. I don't think we need that if the list is smaller than a hard-coded number (maybe 50 or 100), but it would really help in large clusters. 
> Shuffle nodes when selecting to be preempted nodes > -- > > Key: YARN-9278 > URL: https://issues.apache.org/jira/browse/YARN-9278 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > > We should *shuffle* the nodes to avoid some nodes being preempted frequently. > Also, we should *limit* the number of nodes to make preemption more efficient. > Just like this, > {code:java} > // we should not iterate all nodes, that will be very slow > long maxTryNodeNum = > context.getPreemptionConfig().getToBePreemptedNodeMaxNumOnce(); > if (potentialNodes.size() > maxTryNodeNum) { > Collections.shuffle(potentialNodes); > List<FSSchedulerNode> newPotentialNodes = new ArrayList<>(); > for (int i = 0; i < maxTryNodeNum; i++) { > newPotentialNodes.add(potentialNodes.get(i)); > } > potentialNodes = newPotentialNodes; > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
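The seeded start point proposed in the comment above could be sketched roughly as follows (a hypothetical helper in plain Java, not the actual FSPreemptionThread code; the class name, method name, and the 100-node threshold are made up):

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the rotating start point described above; not the
// actual FairScheduler preemption code.
public final class SeededNodeScan {
    // Below this size a plain scan is cheap enough; only offset large lists.
    private static final int MIN_SIZE_FOR_OFFSET = 100;

    /** Visit every node exactly once, starting at a time-seeded offset. */
    public static <T> void scan(List<T> nodes, Consumer<T> visitor) {
        int size = nodes.size();
        if (size == 0) {
            return;
        }
        int start = 0;
        if (size >= MIN_SIZE_FOR_OFFSET) {
            // Seed from the current second and the list size so consecutive
            // preemption rounds start at different points in the unsorted list.
            start = (int) ((System.currentTimeMillis() / 1000) % size);
        }
        for (int i = 0; i < size; i++) {
            visitor.accept(nodes.get((start + i) % size));
        }
    }
}
```

Unlike a shuffle, this still visits every node, so no candidate is ever permanently skipped; the offset only spreads which nodes are considered first across rounds.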
[jira] [Commented] (YARN-9323) FSLeafQueue#computeMaxAMResource does not override zero values for custom resources
[ https://issues.apache.org/jira/browse/YARN-9323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774779#comment-16774779 ] Wilfred Spiegelenburg commented on YARN-9323: - Hi [~snemeth] Some comments on this change as it includes a number of changes that are not related to fixing the issue. These changes just increase the size of the fix: * The {{FairScheduler}} change seems to be just a layout change * in the FSLeafQueue we have similar changes around {{setMemorySize}} and {{setVirtualCores}} * {{computeMaxAMResource}} javadoc changes are unneeded * import re-ordering in the TestFSLeafQueue is unneeded These two should be fixed: * checkstyle issue: _MAX_AM_SHARE_ in {{TestFSLeafQueue}} should be final * whitespace issue: line 219 of the patch The rest should wait until we have a test run with YARN-9322 committed > FSLeafQueue#computeMaxAMResource does not override zero values for custom > resources > --- > > Key: YARN-9323 > URL: https://issues.apache.org/jira/browse/YARN-9323 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9323.001.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8662) Fair Scheduler stops scheduling when a queue is configured only CPU and memory
[ https://issues.apache.org/jira/browse/YARN-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774708#comment-16774708 ] Wilfred Spiegelenburg commented on YARN-8662: - Hi [~Sen Zhao], Thank you for filing this and providing a patch. I had some time and finally got around to looking at this for a review. Are you still willing to work on this? It looks like this issue only happens if you use old style resource definitions for the __ entries. The java doc for {{parseResourceConfigValue}} states: {code} * The {@code missing} parameter is only used in the case of new style * resources without percentages. With new style resources with percentages, * any missing resources will be assumed to be 100% because percentages are * only used with maximum resource limits. {code} This means the code is doing what is documented. You are using old style resource definitions. Your change is going to break this as it will now use the missing parameter also for old style resource definitions without percentages. The workaround would be to use the new style declaration, and the maximum would be set according to what you would expect. Old style declarations are there for backwards compatibility. When using resource types you really should be using the new style definitions. If we still want to go down this path and make old style behave more like the new style then there are a number of other changes that need to be made: * make a change similar to what you have now * clean up the java doc * clean up user documentation as minimum can take a percentage which is not documented at all * fix the percentage for old style: we need to handle min resources too as now the min for any custom type is 100% of the cluster. If we do not go down this path we should at least fix the two documentation points and document that you should use the new style definitions for min and max when you use resource types. 
> Fair Scheduler stops scheduling when a queue is configured only CPU and memory > -- > > Key: YARN-8662 > URL: https://issues.apache.org/jira/browse/YARN-8662 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Sen Zhao >Assignee: Sen Zhao >Priority: Major > Attachments: NonResourceToSchedule.png, YARN-8662.001.patch > > > Add a new resource type in resource-types.xml, eg: resource1. > In Fair scheduler when queue's MaxResources is configured like: > {code}4096 mb, 4 vcores{code} > When submit a application which need resource like: > {code} 1536 mb, 1 vcores, 10 resource1{code} > The application will be pending. Because there is no resource1 in this queue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
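The defaulting rules quoted from the {{parseResourceConfigValue}} javadoc above could be illustrated with a small sketch (hypothetical and heavily simplified; the real parser works on the raw configuration string, while this models only the missing-value defaulting for new style definitions):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the javadoc quoted above: with new style
// definitions, unspecified resource types fall back to the caller-supplied
// 'missing' value, except that percentage-based definitions default to 100%.
public final class ResourceDefaultsSketch {
    /** knownTypes: all resource types configured in the resource manager. */
    public static Map<String, String> applyDefaults(Map<String, String> specified,
                                                    String[] knownTypes,
                                                    long missing) {
        boolean percentages =
            specified.values().stream().anyMatch(v -> v.endsWith("%"));
        Map<String, String> result = new HashMap<>(specified);
        for (String type : knownTypes) {
            // Percentages only appear on maximum limits, so default to 100%;
            // otherwise use 'missing' (e.g. 0 for minimums).
            result.putIfAbsent(type, percentages ? "100%" : Long.toString(missing));
        }
        return result;
    }
}
```

Under these rules a maximum of "memory-mb=50%" leaves vcores at 100%, while "memory-mb=4096" with a missing value of 0 leaves vcores at 0, which is exactly the situation the reporter hit with old style definitions.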
[jira] [Commented] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773708#comment-16773708 ] Wilfred Spiegelenburg commented on YARN-9298: - Thank you for the review [~yufeigu], it took a bit longer than expected working on 4 and 5 without polluting the code too much. 1) Done: added to all files changed 2) Added tests for: * FairQueuePlacementUtils * PlacementFactory * PlacementRule (FS added parts) 3) Removed the extra line 4) That is how I started the implementation. I ran into a number of problems while instantiating the rules in the policy and then moved to this model. I have it working now without polluting the factory and/or rule with lots of FS specific classes. 5) Done as part of the rewrite for 4) 6) Updated the javadoc for the method 7) Fixed 8) Removed; the exception is already logged higher up in the stack > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch, YARN-9298.002.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-9298: Attachment: YARN-9298.002.patch > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch, YARN-9298.002.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16768900#comment-16768900 ] Wilfred Spiegelenburg commented on YARN-9298: - [~cheersyang] Can you please check this? > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9308) fairscheduler-statedump.log gets generated regardless of service again after the merge of HDFS-7240
[ https://issues.apache.org/jira/browse/YARN-9308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-9308: Attachment: YARN-9308.001.patch > fairscheduler-statedump.log gets generated regardless of service again after > the merge of HDFS-7240 > --- > > Key: YARN-9308 > URL: https://issues.apache.org/jira/browse/YARN-9308 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, scheduler >Affects Versions: 3.2.0 >Reporter: Akira Ajisaka >Assignee: Wilfred Spiegelenburg >Priority: Blocker > Attachments: YARN-9308.001.patch > > > After the merge of HDFS-7240, YARN-6453 occurred again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9308) fairscheduler-statedump.log gets generated regardless of service again after the merge of HDFS-7240
[ https://issues.apache.org/jira/browse/YARN-9308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16768891#comment-16768891 ] Wilfred Spiegelenburg commented on YARN-9308: - The changes from the [HDFS-7240 git commit fixup|https://github.com/apache/hadoop/commit/2adda92de1535c0472c0df33a145fa1814703f4f] added the log config lines back without the comment marks. I will upload a patch to fix it up again. > fairscheduler-statedump.log gets generated regardless of service again after > the merge of HDFS-7240 > --- > > Key: YARN-9308 > URL: https://issues.apache.org/jira/browse/YARN-9308 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, scheduler >Affects Versions: 3.2.0 >Reporter: Akira Ajisaka >Assignee: Wilfred Spiegelenburg >Priority: Blocker > > After the merge of HDFS-7240, YARN-6453 occurred again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9308) fairscheduler-statedump.log gets generated regardless of service again after the merge of HDFS-7240
[ https://issues.apache.org/jira/browse/YARN-9308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg reassigned YARN-9308: --- Assignee: Wilfred Spiegelenburg > fairscheduler-statedump.log gets generated regardless of service again after > the merge of HDFS-7240 > --- > > Key: YARN-9308 > URL: https://issues.apache.org/jira/browse/YARN-9308 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, scheduler >Affects Versions: 3.2.0 >Reporter: Akira Ajisaka >Assignee: Wilfred Spiegelenburg >Priority: Blocker > > After the merge of HDFS-7240, YARN-6453 occurred again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767164#comment-16767164 ] Wilfred Spiegelenburg commented on YARN-1655: - The junit test failures are not related to this change. [~asuresh] could you please review this as you did the unifying code work? > Add implementations to FairScheduler to support increase/decrease container > resource > > > Key: YARN-1655 > URL: https://issues.apache.org/jira/browse/YARN-1655 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-1655.001.patch, YARN-1655.002.patch, > YARN-1655.003.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767167#comment-16767167 ] Wilfred Spiegelenburg commented on YARN-9298: - The JUnit test failure seems unrelated. The "no tests" result is correct: those will follow with the integration into the scheduler. > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766813#comment-16766813 ] Wilfred Spiegelenburg commented on YARN-8967: - After talking offline with a number of people, the request was to divide this change into two parts due to its size: * _part 1_ for the new rules and changes to the existing PlacementRule code * _part 2_ for the FS changes and integration It is the only way that the change can be split while keeping both parts compiling separately. A new jira YARN-9298 has been opened for _part 1_ and we'll keep this jira for _part 2_. Removing patch available until that one is checked in. It will also allow work to start on enhancing the rules with filters etc., which have existing open jiras. > Change FairScheduler to use PlacementRule interface > --- > > Key: YARN-8967 > URL: https://issues.apache.org/jira/browse/YARN-8967 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, fairscheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-8967.001.patch, YARN-8967.002.patch, > YARN-8967.003.patch, YARN-8967.004.patch, YARN-8967.005.patch > > > The PlacementRule interface was introduced to be used by all schedulers as > per YARN-3635. The CapacityScheduler is using it but the FairScheduler is not > and is using its own rule definition. > YARN-8948 cleans up the implementation and removes the CS references which > should allow this change to go through. > This would be the first step in using one placement rule engine for both > schedulers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9277) Add more restrictions In FairScheduler Preemption
[ https://issues.apache.org/jira/browse/YARN-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766811#comment-16766811 ] Wilfred Spiegelenburg commented on YARN-9277: - I agree with [~Steven Rand]: sorting could be good, but setting a hard no-go could cause issues. Can you also explain how we can pre-empt a container that is owned by the application itself? I thought that we would only allow containers to be pre-empted if the application is over its fair share, and even then only if pre-empting the container would not drop the application below its fair share. The {{FSPreemptionThread.identifyContainersToPreemptOnNode()}} calls {{app.canContainerBePreempted()}} which contains that check, so the container is not added. Since the app we are pre-empting for is under its fair share, any container of the app itself should be filtered out by that. Am I reading this all wrong, or have you found cases where we did pre-empt a container for its own app and it is not working as expected? > Add more restrictions In FairScheduler Preemption > -- > > Key: YARN-9277 > URL: https://issues.apache.org/jira/browse/YARN-9277 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > Attachments: YARN-9277.001.patch, YARN-9277.002.patch > > > > I think we should add more restrictions in fair scheduler preemption. > * We should not preempt self > * We should not preempt high priority job > * We should not preempt container which has been running for a long time. > * ... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
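The fair-share guard described in the comment above can be sketched in isolation (a hypothetical simplification; the real {{canContainerBePreempted}} considers more state than this, and the class and parameter names here are made up):

```java
// Hypothetical sketch of the guard described above; the real check lives in
// FSAppAttempt#canContainerBePreempted and looks at full Resource objects,
// not a single scalar.
public final class PreemptGuardSketch {
    /**
     * A container may only be preempted when the owning application is over
     * its fair share, and taking the container away would not drop the
     * application below its fair share.
     */
    public static boolean canContainerBePreempted(long appUsage,
                                                  long appFairShare,
                                                  long containerSize) {
        return appUsage > appFairShare
            && appUsage - containerSize >= appFairShare;
    }
}
```

With this guard, an app that is under its fair share (as the app we preempt for always is) can never lose one of its own containers, which is why a separate "do not preempt self" restriction appears redundant.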
[jira] [Updated] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-9298: Attachment: YARN-9298.001.patch > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9298) Implement FS placement rules using PlacementRule interface
Wilfred Spiegelenburg created YARN-9298: --- Summary: Implement FS placement rules using PlacementRule interface Key: YARN-9298 URL: https://issues.apache.org/jira/browse/YARN-9298 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Implement existing placement rules of the FS using the PlacementRule interface. Preparation for YARN-8967 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766771#comment-16766771 ] Wilfred Spiegelenburg commented on YARN-1655: - Updated the test to make it more robust. Locally ran all new tests 250 times and have not seen a failure. > Add implementations to FairScheduler to support increase/decrease container > resource > > > Key: YARN-1655 > URL: https://issues.apache.org/jira/browse/YARN-1655 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-1655.001.patch, YARN-1655.002.patch, > YARN-1655.003.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-1655: Attachment: YARN-1655.003.patch > Add implementations to FairScheduler to support increase/decrease container > resource > > > Key: YARN-1655 > URL: https://issues.apache.org/jira/browse/YARN-1655 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-1655.001.patch, YARN-1655.002.patch, > YARN-1655.003.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8655) FairScheduler: FSStarvedApps is not thread safe
[ https://issues.apache.org/jira/browse/YARN-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765949#comment-16765949 ] Wilfred Spiegelenburg commented on YARN-8655: - Hi [~uranus], I am not saying that what we do now is 100% correct. I am only doubting how often this occurs and what the impact on the application and scheduling activities is. Based on the analysis I did, I think we need a solution for this case that has far less impact. Do we know any of the following: How badly does it affect the running applications? Do we pre-empt double what we should? Does not handling this correctly slow down pre-emption? Is there another impact of not handling the edge case? Pre-emption currently runs almost continually and is gated by the {{take()}}: when there is a pre-emption waiting we handle it. The patch changes this into one pre-emption per second. It effectively throttles the pre-emption down from processing applications as they arrive to a slow, scheduled trickle. When I look at how we calculate and decide if the application is marked as minimum share starved, the cases should be limited. Even if the application is fair share starved and the queue is min share starved, we do not automatically mark the application as min share starved. We thus only have this edge case for a small number of applications. Fixing that edge case by slowing down all pre-emption handling is, I think, not right. > FairScheduler: FSStarvedApps is not thread safe > --- > > Key: YARN-8655 > URL: https://issues.apache.org/jira/browse/YARN-8655 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.0.0 >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > Attachments: YARN-8655.002.patch, YARN-8655.patch > > > *FSStarvedApps is not thread safe, this may cause one starved app to be processed > twice in a row.* > For example, when app1 is *fair share starved*, it has been added to > appsToProcess. 
After that, app1 is taken but appBeingProcessed is not yet > updated to app1. At that moment, app1 is *starved by min share*, so the app > is added to appsToProcess again, because appBeingProcessed is null and > appsToProcess does not contain it. > {code:java} > void addStarvedApp(FSAppAttempt app) { > if (!app.equals(appBeingProcessed) && !appsToProcess.contains(app)) { > appsToProcess.add(app); > } > } > FSAppAttempt take() throws InterruptedException { > // Reset appBeingProcessed before the blocking call > appBeingProcessed = null; > // Blocking call to fetch the next starved application > FSAppAttempt app = appsToProcess.take(); > appBeingProcessed = app; > return app; > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8655) FairScheduler: FSStarvedApps is not thread safe
[ https://issues.apache.org/jira/browse/YARN-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765733#comment-16765733 ] Wilfred Spiegelenburg commented on YARN-8655: - Looking at how we get to adding an application to the starved list, I don't think this is a thread safety issue. I do agree that we could process the application twice. Fair share starvation and min share starvation are two different things: the queue is starved for min share and the application is starved for fair share. This does not mean that it is a problem. If the application is starved for fair share, the calculation of the queue min share starvation already takes that fact into account. The {{updateStarvedAppsMinshare()}} deducts any fair share starvation already processed for applications from the possible min share starvation. This means two things for an application that is marked for min share starvation: # the application's fair share starvation is less than the distributed min share starvation of the queue # the application has an outstanding demand that is higher than its fair share starvation The chance is small that an application is starved for fair share, with a demand higher than its fair share starvation, combined with a distributed queue minimum share that is higher than the fair share starvation. It could be worth the fix if it has a high impact. The way you are proposing to fix it in the patch is, however, not the right one. You introduce a {{Thread.sleep()}} call in the pre-emption thread, which is not correct. Currently the pre-emption will happen when a starved app is added and no pre-emption is in progress. With the change there is only 1 pre-emption per second. This is a high impact change and I think we need to come up with a smarter way to handle this case with less of an impact on the pre-emption itself. 
> FairScheduler: FSStarvedApps is not thread safe > --- > > Key: YARN-8655 > URL: https://issues.apache.org/jira/browse/YARN-8655 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.0.0 >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > Attachments: YARN-8655.002.patch, YARN-8655.patch > > > *FSStarvedApps is not thread safe, this may make one starve app is processed > for two times continuously.* > For example, when app1 is *fair share starved*, it has been added to > appsToProcess. After that, app1 is taken but appBeingProcessed is not yet > update to app1. At the moment, app1 is *starved by min share*, so this app > is added to appsToProcess again! Because appBeingProcessed is null and > appsToProcess also have not this one. > {code:java} > void addStarvedApp(FSAppAttempt app) { > if (!app.equals(appBeingProcessed) && !appsToProcess.contains(app)) { > appsToProcess.add(app); > } > } > FSAppAttempt take() throws InterruptedException { > // Reset appBeingProcessed before the blocking call > appBeingProcessed = null; > // Blocking call to fetch the next starved application > FSAppAttempt app = appsToProcess.take(); > appBeingProcessed = app; > return app; > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
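One possible lower-impact direction, instead of the {{Thread.sleep()}} approach criticised above, is to track queued and in-flight apps in a single set so the duplicate check has no window. A rough sketch (hypothetical class, not the real {{FSStarvedApps}}; note the trade-off that a starvation event arriving while the app is still being processed is dropped):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical duplicate-safe variant: one tracking set covers both queued
// and in-flight apps, closing the window in which appBeingProcessed is still
// null after take() returns.
class StarvedAppsSketch<T extends Comparable<T>> {
    private final PriorityBlockingQueue<T> appsToProcess =
        new PriorityBlockingQueue<>();
    private final Set<T> tracked = ConcurrentHashMap.newKeySet();

    void addStarvedApp(T app) {
        // Only queue the app if it is neither queued nor being processed.
        if (tracked.add(app)) {
            appsToProcess.add(app);
        }
    }

    T take() throws InterruptedException {
        // The app stays in 'tracked' while it is processed, so a duplicate
        // cannot sneak in between take() and the start of processing.
        return appsToProcess.take();
    }

    void doneProcessing(T app) {
        tracked.remove(app);  // from here on the app may be re-added
    }

    int queuedCount() {
        return appsToProcess.size();
    }
}
```

The pre-emption thread would call doneProcessing() after finishing an app; no sleep is needed, so handling remains driven by app arrival rather than a fixed schedule.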
[jira] [Commented] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765663#comment-16765663 ] Wilfred Spiegelenburg commented on YARN-1655: - testDecreaseAfterIncreaseWithAllocationExpiration is logged as YARN-5684 testContainersFromPreviousAttemptsWithRMRestart is logged as YARN-8433 Patch updated to fix the checkstyle issues that could be fixed; left the RMContainerImpl as is, as the change lines up with the current indentation > Add implementations to FairScheduler to support increase/decrease container > resource > > > Key: YARN-1655 > URL: https://issues.apache.org/jira/browse/YARN-1655 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-1655.001.patch, YARN-1655.002.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-1655: Attachment: YARN-1655.002.patch > Add implementations to FairScheduler to support increase/decrease container > resource > > > Key: YARN-1655 > URL: https://issues.apache.org/jira/browse/YARN-1655 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-1655.001.patch, YARN-1655.002.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-1655: Attachment: YARN-1655.001.patch > Add implementations to FairScheduler to support increase/decrease container > resource > > > Key: YARN-1655 > URL: https://issues.apache.org/jira/browse/YARN-1655 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-1655.001.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764807#comment-16764807 ] Wilfred Spiegelenburg edited comment on YARN-1655 at 2/11/19 9:50 AM: -- Adding resizing to the FS. Some background around the changes outside the FS: # the {{RMContainerImpl}} logs a message when the temporary containers for resizing are released because they are in the wrong state. The new transitions clean those up. # Normalising requests has been moved from the {{CapacityScheduler}} into the {{AbstractYarnScheduler}} as it is used by both schedulers. # Resizing would only use the ANY request and leave node and rack requests hanging around, which caused the FS to allocate strange containers. {{AppSchedulingInfo}} now allows for cleaning up the unneeded requests from the {{ContainerUpdateContext}}. was (Author: wilfreds): Adding resizing to the FS. Some background around the changes outside the FS: # the {{RMContainerImpl}} logs a message when the temporary containers for resizing are released because they are in the wrong state. The new transitions clean those up # Normalising requests has been moved from the {{CapacityScheduler}} into the {{AbstractYarnScheduler}} as it is used by both schedulers. # Resizing would only use the ANY request and leave node and rack requests hanging around which caused the FS to allocate strange containers. 
{{AppSchedulingInfo}} now allows for cleaning up the unneeded requests from the {{ContainerUpdateContext}} > Add implementations to FairScheduler to support increase/decrease container > resource > > > Key: YARN-1655 > URL: https://issues.apache.org/jira/browse/YARN-1655 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-1655.001.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
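The cleanup described in the comment above — dropping the node- and rack-level requests together with the ANY request when a resize update goes away — can be sketched as follows. This is a minimal illustration only; the class and method names are hypothetical and do not reflect the actual AppSchedulingInfo or ContainerUpdateContext APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: outstanding requests tracked per scheduler key and
// per resource name ("*" for ANY, plus rack and node names). Cancelling an
// update should drop all three levels at once, not just the ANY entry.
public class PendingUpdateRequests {
    // schedulerKey -> (resourceName -> pending container count)
    private final Map<Long, Map<String, Integer>> pending = new HashMap<>();

    public void add(long schedulerKey, String resourceName, int count) {
        pending.computeIfAbsent(schedulerKey, k -> new HashMap<>())
               .merge(resourceName, count, Integer::sum);
    }

    // Buggy behaviour described in the comment: only the ANY ("*") part is
    // removed, leaving node and rack requests lingering.
    public void cancelAnyOnly(long schedulerKey) {
        Map<String, Integer> byName = pending.get(schedulerKey);
        if (byName != null) {
            byName.remove("*");
        }
    }

    // Fixed behaviour: remove every resource-name entry tied to the key.
    public void cancelAll(long schedulerKey) {
        pending.remove(schedulerKey);
    }

    public int outstanding(long schedulerKey) {
        Map<String, Integer> byName = pending.get(schedulerKey);
        if (byName == null) {
            return 0;
        }
        return byName.values().stream().mapToInt(Integer::intValue).sum();
    }
}
```

With three entries (ANY, rack, node) for one key, cancelAnyOnly leaves two requests outstanding, which is the lingering-request symptom; cancelAll leaves none.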
[jira] [Updated] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-8967: Attachment: YARN-8967.005.patch > Change FairScheduler to use PlacementRule interface > --- > > Key: YARN-8967 > URL: https://issues.apache.org/jira/browse/YARN-8967 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, fairscheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-8967.001.patch, YARN-8967.002.patch, > YARN-8967.003.patch, YARN-8967.004.patch, YARN-8967.005.patch > > > The PlacementRule interface was introduced to be used by all schedulers as > per YARN-3635. The CapacityScheduler is using it but the FairScheduler is not > and is using its own rule definition. > YARN-8948 cleans up the implementation and removes the CS references which > should allow this change to go through. > This would be the first step in using one placement rule engine for both > schedulers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759579#comment-16759579 ] Wilfred Spiegelenburg commented on YARN-8967: - Rebased to trunk; the mockito changes prevented the patch from being applied. The diff is basically the same, with just a 2-line difference in one patch chunk for imports: [^YARN-8967.005.patch] > Change FairScheduler to use PlacementRule interface > --- > > Key: YARN-8967 > URL: https://issues.apache.org/jira/browse/YARN-8967 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, fairscheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-8967.001.patch, YARN-8967.002.patch, > YARN-8967.003.patch, YARN-8967.004.patch, YARN-8967.005.patch > > > The PlacementRule interface was introduced to be used by all schedulers as > per YARN-3635. The CapacityScheduler is using it but the FairScheduler is not > and is using its own rule definition. > YARN-8948 cleans up the implementation and removes the CS references which > should allow this change to go through. > This would be the first step in using one placement rule engine for both > schedulers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9262) TestRMAppAttemptTransitions is failing with an NPE
[ https://issues.apache.org/jira/browse/YARN-9262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758145#comment-16758145 ] Wilfred Spiegelenburg commented on YARN-9262: - The failure occurs because no {{Allocation}} object comes back: the {{when}} mock call does not match the arguments. When the AM gets allocated we pass {{null}} values in 3 places, so the {{when}} should look like this: {code} when(scheduler.allocate(any(ApplicationAttemptId.class), any(List.class), any(), any(List.class), any(), any(), any(ContainerUpdates.class))). {code} > TestRMAppAttemptTransitions is failing with an NPE > -- > > Key: YARN-9262 > URL: https://issues.apache.org/jira/browse/YARN-9262 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.0, 3.1.2, 3.3.0 >Reporter: Sunil Govindan >Assignee: lujie >Priority: Critical > > hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions > fails due to an NPE post YARN-9194 > {code} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:1202) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:1182) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:915) > at > 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121) > {code} > cc [~xiaoheipangzi] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
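The matcher mismatch in the comment above — a stub that never fires because type-restricted matchers do not accept the {{null}} arguments the real call passes — can be illustrated without a Mockito dependency. In Mockito 2, {{any(SomeClass.class)}} matches non-null instances of the type only, while plain {{any()}} also matches {{null}}; the predicates below merely mimic that distinction and are not Mockito's implementation.

```java
import java.util.function.Predicate;

// Illustrative stand-ins for Mockito 2 matcher behaviour: the type-restricted
// matcher rejects null, the unrestricted one accepts it. This is why the stub
// in the comment above needs plain any() for the three arguments that are
// null during AM allocation.
public class MatcherDemo {
    // behaves like any(Foo.class): non-null instances of the type only
    public static Predicate<Object> anyOfType(Class<?> type) {
        return v -> v != null && type.isInstance(v);
    }

    // behaves like any(): matches everything, including null
    public static Predicate<Object> anyValue() {
        return v -> true;
    }
}
```

If a stubbed argument position uses the type-restricted form but the production code passes {{null}} there, the stub's argument pattern fails to match, the default {{null}} return comes back, and the caller hits the NPE.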
[jira] [Commented] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758120#comment-16758120 ] Wilfred Spiegelenburg commented on YARN-1655: - I have started working on this already. I have a working code change for trunk based on the changes from YARN-6216. YARN-6216 by itself is not enough to implement the resizing; we do need some FS changes. The only thing that is still bothering me is a new junit test I wrote that keeps failing. The failure is caused by lingering resource requests: there does not seem to be a proper clean up of resource requests in all cases. This only seems to happen when an increase results in a reservation. If we then later cancel the increase request that caused the reservation, it leaves the _node_ and _rack_ requests behind and just removes the _any_ part of the request. This looks similar to what is mentioned in YARN-5540 around leaving requests behind which should not be there. This issue affects both schedulers but does not seem to cause a junit failure in the capacity scheduler. > Add implementations to FairScheduler to support increase/decrease container > resource > > > Key: YARN-1655 > URL: https://issues.apache.org/jira/browse/YARN-1655 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wilfred Spiegelenburg >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg reassigned YARN-1655: --- Assignee: Wilfred Spiegelenburg (was: Sandy Ryza) > Add implementations to FairScheduler to support increase/decrease container > resource > > > Key: YARN-1655 > URL: https://issues.apache.org/jira/browse/YARN-1655 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wilfred Spiegelenburg >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org