[ https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265208#comment-17265208 ]
Gergely Pollak edited comment on YARN-10506 at 1/14/21, 8:29 PM: ----------------------------------------------------------------- Thank you [~wangda] [~pbacsko] for your insights! Actually the queue creation checks were present in the legacy engine ([https://github.com/apache/hadoop/blob/9d5144e66d71551ad6e1a8d3f1670e49ba181999/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/placement/UserGroupMappingPlacementRule.java#L333-L340]), if I see correctly since 3.0. But the first commit seems much older: [https://github.com/apache/hadoop/commit/0987a7b8cbbbb2c1e4c2095821d98a7db19644df] . I've just kept it during the refactor, and as Peter pointed out now we actually build on it, since we need to be able to determine if a rule will be executed successfully (ie. the queue can be created), and if it cannot we can move onto the next rule. This fallback behaviour is part of the FS Placement engine as well, so we need it in order to server FS customers, so it was somewhat lucky that it was already in place. (In the original implementation we just used this check to throw exceptions, which was quite pointless, since the CS would have rejected the application if the queue was uncreatable anyway) So this is why I say we can safely remove those flags from the application placement context, this way we can reduce the code complexity. This way we only need what [~pbacsko] mentioned: {code:java} 1) "create" flag on the rule itself in the JSON - true/false 2) "yarh.scheduler.capacity.<path>.auto-queue-creation-v2.enabled" - true/false{code} Also I'm currently adopting this change in the CSMappingPlacementRule, and I don't use these flags. What I'll need in the future a better, centralised way to determine if a path can be created, and both CS and the MappingEngine should use the same methods, to do it. Currently we are implementing the same logic differently, and it's much harder to maintain the code this way, but during the followup code cleanup jiras we'll take care of this. About the race condition, I think (and I might be very wrong, because race conditions can be really sneaky), we cannot do much about, theoretically it can happen that the Mapping engine determines a rule can be executed, because the target queue exists or there is a proper dynamic parent, then a config change occurs, and by the time we get to the actual creation the queue structure changes, at this point the rejection is the proper action to take since the submission IS against the configuration. And I don't see a way around it since at the time the MappingEngine makes the decision that decision is the correct one, but we shouldn't enforce this decision if the configuration changes. But this race condition should only occur when the user changes the queue configuration, and in this case they should expect failing application submissions for the queues which have been changed or removed. was (Author: shuzirra): Actually the queue creation checks were present in the legacy engine (https://github.com/apache/hadoop/blob/9d5144e66d71551ad6e1a8d3f1670e49ba181999/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/placement/UserGroupMappingPlacementRule.java#L333-L340), if I see correctly since 3.0. But the first commit seems much older: [https://github.com/apache/hadoop/commit/0987a7b8cbbbb2c1e4c2095821d98a7db19644df] . I've just kept it during the refactor, and as Peter pointed out now we actually build on it, since we need to be able to determine if a rule will be executed successfully (ie. the queue can be created), and if it cannot we can move onto the next rule. This fallback behaviour is part of the FS Placement engine as well, so we need it in order to server FS customers, so it was somewhat lucky that it was already in place. (In the original implementation we just used this check to throw exceptions, which was quite pointless, since the CS would have rejected the application if the queue was uncreatable anyway) So this is why I say we can safely remove those flags from the application placement context, this way we can reduce the code complexity. This way we only need what [~pbacsko] mentioned: {code:java} 1) "create" flag on the rule itself in the JSON - true/false 2) "yarh.scheduler.capacity.<path>.auto-queue-creation-v2.enabled" - true/false{code} Also I'm currently adopting this change in the CSMappingPlacementRule, and I don't use these flags. What I'll need in the future a better, centralised way to determine if a path can be created, and both CS and the MappingEngine should use the same methods, to do it. Currently we are implementing the same logic differently, and it's much harder to maintain the code this way, but during the followup code cleanup jiras we'll take care of this. About the race condition, I think (and I might be very wrong, because race conditions can be really sneaky), we cannot do much about, theoretically it can happen that the Mapping engine determines a rule can be executed, because the target queue exists or there is a proper dynamic parent, then a config change occurs, and by the time we get to the actual creation the queue structure changes, at this point the rejection is the proper action to take since the submission IS against the configuration. And I don't see a way around it since at the time the MappingEngine makes the decision that decision is the correct one, but we shouldn't enforce this decision if the configuration changes. But this race condition should only occur when the user changes the queue configuration, and in this case they should expect failing application submissions for the queues which have been changed or removed. > Update queue creation logic to use weight mode and allow the flexible > static/dynamic creation > --------------------------------------------------------------------------------------------- > > Key: YARN-10506 > URL: https://issues.apache.org/jira/browse/YARN-10506 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Benjamin Teke > Assignee: Andras Gyori > Priority: Major > Attachments: YARN-10506-006-10504-010.patch, > YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506-010.patch, > YARN-10506-012.patch, YARN-10506.001.patch, YARN-10506.002.patch, > YARN-10506.003.patch, YARN-10506.004.patch, YARN-10506.005.patch, > YARN-10506.006-combined.patch, YARN-10506.006.patch, YARN-10506.007.patch, > YARN-10506.009.patch, YARN-10506.011.patch > > > The queue creation logic should be updated to use weight mode and support the > flexible creation. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org