[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013171#comment-17013171
 ] 

Szilard Nemeth commented on YARN-9879:
--------------------------------------

Thank you for your feedbacks [~wilfreds], [~pbacsko], [~leftnoteasy] and thans 
[~shuzirra] for moving this forward!

Chiming in a bit late, but I would like to give my opinion as well. Apologies 
for being so lengthy, but I wanted to quote and reply some parts of the 
comments had been given so far.
Please note that some parts of what I'm saying here is taking the FS to CS 
migration into account.

1. Quoting from Wilfred's 
[comment|https://issues.apache.org/jira/browse/YARN-9879?focusedCommentId=16997290&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16997290]:

A. {quote}
If we simply relax the rule that the leaf queue must be unique in the system in 
favour of the fact that a queue must be unique based on the full queue path. 
This does not break existing configurations as the unique leaf queue is also 
unique when you take into account the whole path. That means there is nothing 
for the current clusters that needs to change. Internally the scheduler does 
have to change to make sure that all references use the queue path. This will 
require a lot of changes throughout the scheduler when you look up a queue and 
the way we store the reference if it is not directly to the leaf queue.
{quote}
I do agree with this, existing configurations does not break. However, allowing 
non-unique leaf queues can make the CS behave weirdly and inconsistently as 
queues are being added later on, as [~shuzirra] described this lately in his 
comment. More on this matter later.

B. {quote}

When an application is submitted with just a queue name (not a path) we expect 
that the name is a unique leaf queue name. If that queue does not exist or is 
not uniquely identifiable we reject the application submission. Resolution of 
the real leaf queue follows the same steps as it does now. The queue name in 
the end is converted to the correct leaf queue identified by the a path. For 
existing configurations nothing has changed. Internally we hide all the changes.
{quote}

AND 

C.
{quote}
The important part is applying a new configuration. If the configuration adds a 
leaf queue that is not unique the configuration update currently is rejected. 
With this change we would allow that config to become active. This could break 
existing applications when they try to submit to the leaf queue that is no 
longer unique.
We should at least log and warn clearly in the response of the update. Maybe 
even show it in the UI or we could ask for a confirmation. The first update 
that adds a non unique queue to the configuration should always fail 
complaining loudly. It should then keep warning the user and rejecting the 
update unless a confirmation flag is set to force the update through. After the 
first update that would not be needed anymore.
Reading a config from a file or store which is used to initialise the scheduler 
should not trigger such behaviour. We still should show a warning in the logs 
to make sure it is not lost.
{quote}

About what you said: "This could break existing applications when they try to 
submit to the leaf queue that is no longer unique."
Again, this is really problematic and error-prone, as detailed by [~shuzirra] 
in his comment above. 
We discussed this together with Gergo and I'm trying to give more weight on how 
tricky this case can get and what I'm doing now is just reiterate on Gergo's 
[comment|https://issues.apache.org/jira/browse/YARN-9879?focusedCommentId=17011993&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17011993]

Imagine the following queue setup: 

{code:java}
root
  |____a
  |
  |____b
  |    |____b
  |    |____c
  |    |____d
  |
  |____c
       |____e
       |____f
{code}


If someone is adding root.c.b, queue 'b' is non-unique anymore without even 
knowing or taking care of root.b.b is the other queue 'b'.
Even if we log and warn in the response of the update, adding the root.c.b 
queue can break the app submitted to queue root.b.b.
I can imagine many real-world scenarios where users were submitting apps to a 
queue (root.b.b) with just the leaf queue name ('b') and their app submission 
would get suddenly rejected since another queue do exist with the same name. 
Just consider a user-group mapping and two same usernames exist under different 
organizations/user groups and we can be in the situation what I depicted. I 
also agree to Gergo that we can't expect or assume every user and apps 
submission is guarded by monitoring or warning systems. 
I think it's definitely not a good idea to break existing users and app 
submissions with this change.


2. Quoting from Wangda's 
[comment|https://issues.apache.org/jira/browse/YARN-9879?focusedCommentId=17009975&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17009975]:

{quote}
I personally think it is not a big deal if application reject reasons from RM 
can clearly guide users to use full qualified queue path when duplicated queue 
names exists. It is like if a team has only one Peter we can use the first name 
only otherwise we will add last name to avoid confusion. It isn't 
counter-intuitive to me.
{quote}

How can we know it's not a big deal? As we are not fully familiar how users are 
submitting apps: They can have scripts (or even many scripts), cron jobs, other 
unknown setup and whatnot that triggers app submissions. We can't and shouldn't 
do any non-backward compatible changes. For me, this approach is not an 
acceptable one as described with my ASCII art + description in detail.


3. Quoting from Gergo's 
[comment|https://issues.apache.org/jira/browse/YARN-9879?focusedCommentId=17011993&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17011993]

A.
{quote}
I think this can be a major issue. In a scenario when the user have let's say a 
job which runs daily, and it's been working like that for years, can simply 
break, because an other user, totally unrelated to his team or even department 
creates a queue with the same name. The suddenly this application will start 
failing, and even if we add logs which explicitly tell what is the reason 
behind the application rejection, user one might not even notice it started 
failing. (We shouldn't assume everyone has proper monitoring and warning 
systems). So technically any user who can create a queue, can disable an other 
user's application if it is started by using single queue name reference. And 
it concerns me a bit.
{quote}
Agree with this, the whole point of my comment was based around this fact Gergo 
shared.

B.
{quote}
However Adam Antal had an idea to even fix this issue, and then we can move 
forwards with Wilfred's suggestion:

We should make it possible to flag QUEUES for using full queue name reference, 
and these flags should mean ALL queues under the flagged queue can ONLY be 
referenced by full queue name. For regular queues (non-flagged ones), we would 
still enforce the unique leaf queue policy, while newer or migrating users 
could stick to the full queue reference. This proposal can also helps gradual 
migration, for older CS users buy slowly flagging queues in which the 
applications are already started by queue name.
{quote}

*Some things I have in mind, so listing the ADVANTAGES and some properties of 
this approach:
*
1. First and foremost, duplicate queue names cannot be added under unflagged 
queues.
2. Using the department / user analogy here. We don't have the problem anymore 
that some other department adds a user and this user becomes non-unique as 
other department has the same username in it, so app submissions can work well 
for all departments / users.
As the new department added a user (root.c.b) has their subqueues flagged, app 
submissions should only work with full queue paths here and at the same time, 
the another department had the same user (root.b.b), app submission will still 
work well as all queues under root.c should be referenced only referenced by 
full queue path.
3. Customers can gradually move queues to use full queue path, as much as they 
want.
4. FS to CS migration tool should add the flag to the root queue if moved from 
FS.


4. Quoting from Wangda's 
[comment|https://issues.apache.org/jira/browse/YARN-9879?focusedCommentId=17012282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17012282]:
{quote}
Thanks Gergely Pollak, I think adding a flag (suggestion from Adam Antal) will 
prevent admin to change it accidentally, but it is hard to be understand 
(thinking about a regular Hadoop user). And we need to maintain it in a long 
run.
{quote}

I'm not sure why this is harder to understand than regular queue and scheduler 
config parameters, which can be overly complex to setup.
With this approach, at least we would give something in our user's hands: They 
can use the flag to move queues under to the new way to address queues (queue 
path) gradually. Without the flag-based approach, we don't give users anything 
but we require them to change their submission code everytime a leaf queue is 
added that is already in the hierarchy with the same name. If we don't consider 
this as a hard requirement, apps would fail randomly, from the perspective of 
the users.


*As a summary:
Please note that I did not dive into the CS code to decide what is the 
development cost of such a flag-based solution, this is up to [~shuzirra] to 
give some estimates as the jira is on his name.
However, I wanted to state and share my thoughts: Rejecting apps on submission 
time that worked before and won't work afterwards seems like very far from an 
acceptable solution.*

> Allow multiple leaf queues with the same name in CS
> ---------------------------------------------------
>
>                 Key: YARN-9879
>                 URL: https://issues.apache.org/jira/browse/YARN-9879
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Gergely Pollak
>            Assignee: Gergely Pollak
>            Priority: Major
>         Attachments: DesignDoc_v1.pdf
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to