[ 
https://issues.apache.org/jira/browse/YARN-11114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528191#comment-17528191
 ] 

Szilard Nemeth commented on YARN-11114:
---------------------------------------

Let me update this with my progress.
I can see 3 ways to solve this.


Option 1. Just running apps, no other apps (Current implementation)

The current solution in the PR implements this.
1. Queries running apps by short and full queue names
2. It doesn't / can't query non-running apps by any name other than the 
submitted name. 
For example, if the application is submitted to "root.default", only this 
exact name can be queried, so a query with the value "default" won't return 
the application.
This is a downside of how the queue is stored inside RmAppImpl: only the 
submitted queue name is stored, not both versions (leaf name, full queue path).
As there's a clear way to translate a leaf queue name to the full path and 
vice versa only for running apps, running apps can be queried by both queue 
notations.
However, I don't really like this solution, as it works differently for 
non-running apps due to the shortcoming mentioned above.

Advantages: 
- No API / interface change is required

Disadvantages: 
- Inconsistent API responses for running vs. non-running apps
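
To make the inconsistency concrete, here is a minimal, self-contained sketch of 
the Option 1 behaviour. It is not the code from the PR: the App class, the 
FULL_PATH_BY_NAME map and the query method are hypothetical stand-ins for 
RmAppImpl, the scheduler's queue lookup and ClientRMService#getApplications.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Option1QueueFilterSketch {

  /** Hypothetical stand-in for an application; not the real RmAppImpl. */
  static final class App {
    final String id;
    final String submittedQueue; // stored exactly as given at submission
    final boolean running;

    App(String id, String submittedQueue, boolean running) {
      this.id = id;
      this.submittedQueue = submittedQueue;
      this.running = running;
    }
  }

  /** Assumed queue lookup: short name and full path both map to the full path. */
  static final Map<String, String> FULL_PATH_BY_NAME = new HashMap<>();
  static {
    FULL_PATH_BY_NAME.put("default", "root.default");
    FULL_PATH_BY_NAME.put("root.default", "root.default");
  }

  static String toFullPath(String queue) {
    return FULL_PATH_BY_NAME.getOrDefault(queue, queue);
  }

  /** Returns the IDs of the apps that match the queue filter under Option 1. */
  static List<String> query(List<App> apps, String queueFilter) {
    List<String> ids = new ArrayList<>();
    for (App app : apps) {
      boolean matches = app.running
          // Running apps: both notations resolve to the same full path.
          ? toFullPath(app.submittedQueue).equals(toFullPath(queueFilter))
          // Non-running apps: only the exact submitted name matches.
          : app.submittedQueue.equals(queueFilter);
      if (matches) {
        ids.add(app.id);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    List<App> apps = new ArrayList<>();
    apps.add(new App("finishedApp", "root.default", false));
    apps.add(new App("runningApp", "default", true));
    System.out.println(query(apps, "default"));      // [runningApp]
    System.out.println(query(apps, "root.default")); // [finishedApp, runningApp]
  }
}
```

With this setup, the query for "default" misses the finished app even though 
it ran in the same queue, while the query for "root.default" returns both.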

 

Option 2. Store short queue name / full queue path in RmAppImpl with new fields

This could be achieved in: RMAppManager#createAndPopulateNewRMApp

Advantages
- ClientRMService#getApplications could clearly filter for queue name / full 
queue path, without any hassle.

Disadvantages
- RmAppImpl should be touched and new fields should be added
- Impact on RM State store
- Impact on all schedulers: They need to translate between leaf queue / full 
queue path in order to store both values.
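
As a rough illustration of Option 2, the sketch below stores both notations 
when the app record is created and matches the filter against either of them. 
All names here (AppRecord, FULL_PATH_BY_LEAF, create, matchesQueue) are 
hypothetical; the real change would live in RmAppImpl and the schedulers, and 
the leaf-to-path map only stands in for whatever translation a scheduler 
would provide.

```java
import java.util.HashMap;
import java.util.Map;

final class Option2StoredQueueNamesSketch {

  /** Hypothetical app record holding both queue notations. */
  static final class AppRecord {
    final String id;
    final String queueShortName;
    final String queueFullPath;

    AppRecord(String id, String queueShortName, String queueFullPath) {
      this.id = id;
      this.queueShortName = queueShortName;
      this.queueFullPath = queueFullPath;
    }

    /** The filter matches if it equals either stored notation. */
    boolean matchesQueue(String queueFilter) {
      return queueFilter.equals(queueShortName)
          || queueFilter.equals(queueFullPath);
    }
  }

  /** Assumed scheduler-side translation from leaf queue name to full path. */
  static final Map<String, String> FULL_PATH_BY_LEAF = new HashMap<>();
  static {
    FULL_PATH_BY_LEAF.put("default", "root.default");
  }

  /** Resolves both notations once, at creation time, and stores them. */
  static AppRecord create(String id, String submittedQueue) {
    String fullPath = submittedQueue.contains(".")
        ? submittedQueue
        : FULL_PATH_BY_LEAF.getOrDefault(submittedQueue, submittedQueue);
    String shortName = fullPath.substring(fullPath.lastIndexOf('.') + 1);
    return new AppRecord(id, shortName, fullPath);
  }

  public static void main(String[] args) {
    AppRecord finished = create("finishedApp", "root.default");
    AppRecord running = create("runningApp", "default");
    System.out.println(finished.matchesQueue("default"));     // true
    System.out.println(running.matchesQueue("root.default")); // true
  }
}
```

Because the names are resolved at creation time, filtering no longer depends 
on the app's state, at the cost of the extra fields listed above.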

 


Option 3. Resolve full queue path from leaf queue name and vice-versa
As ClientRMService has a reference to the scheduler (type: YarnScheduler), a 
new method could be added to resolve full queue path from the given queue name.

Advantages
- ClientRMService#getApplications could clearly filter for both queue notations

Disadvantages
- Impact on the YarnScheduler interface
- Impact on all scheduler implementations
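
A minimal sketch of what Option 3 could look like follows. The 
QueuePathResolver interface and its resolveFullQueuePath method are 
hypothetical names for the new scheduler method; MapBackedResolver and 
filterByQueue are likewise illustrative stand-ins for a scheduler 
implementation and for the filtering in ClientRMService#getApplications.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Option3ResolveQueuePathSketch {

  /** Hypothetical method that each scheduler would have to implement. */
  interface QueuePathResolver {
    String resolveFullQueuePath(String queueName);
  }

  /** Illustrative implementation that knows queues by both notations. */
  static final class MapBackedResolver implements QueuePathResolver {
    private final Map<String, String> fullPathByName = new HashMap<>();

    void addQueue(String fullPath) {
      String leaf = fullPath.substring(fullPath.lastIndexOf('.') + 1);
      fullPathByName.put(fullPath, fullPath);
      fullPathByName.put(leaf, fullPath);
    }

    @Override
    public String resolveFullQueuePath(String queueName) {
      // Unknown names are returned unchanged (an assumption of this sketch).
      return fullPathByName.getOrDefault(queueName, queueName);
    }
  }

  /**
   * Compares the resolved full path of the filter with the resolved full
   * path of each app's submitted queue, regardless of the app's state.
   */
  static List<String> filterByQueue(Map<String, String> submittedQueueByAppId,
      String queueFilter, QueuePathResolver resolver) {
    String wanted = resolver.resolveFullQueuePath(queueFilter);
    List<String> ids = new ArrayList<>();
    for (Map.Entry<String, String> e : submittedQueueByAppId.entrySet()) {
      if (resolver.resolveFullQueuePath(e.getValue()).equals(wanted)) {
        ids.add(e.getKey());
      }
    }
    return ids;
  }
}
```

In this sketch the submitted queue name stays the only stored value, and the 
translation happens at query time through the resolver.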

> RMWebServices returns only apps matching exactly the submitted queue name
> -------------------------------------------------------------------------
>
>                 Key: YARN-11114
>                 URL: https://issues.apache.org/jira/browse/YARN-11114
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler, webapp
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I've added 2 testcases that demonstrate the issue with [this 
> commit|https://github.com/szilard-nemeth/hadoop/commit/88dcf40f4dab564477542b8efb82f4f20d132eee].
> 1. With 'testAppsQueryByQueueShortname', there's a finishedApp submitted to 
> "root.default" and there's a runningApp that is submitted to "default".
> The testcase queries the apps by queue name "default" and the response only 
> contains the runningApp, which is submitted to "default" so the other app 
> that is submitted to "root.default" is not returned.
> 2. With 'testAppsQueryByQueueFullname', there's a finishedApp submitted to 
> "root.default" and there's a runningApp that is submitted to "default" (same 
> setup as above).
> The testcase queries the apps by queue name "root.default" (which is the full 
> queue path) and the response only contains the finishedApp, which is 
> submitted to "root.default" so the other app that is submitted to "default" 
> is not returned.
> A trivial conclusion of this is that only those applications are included in 
> the response that exactly match the queue name where the application is 
> submitted to, either specified explicitly at submission or resolved by the 
> placement engine.
> Before YARN-9879 was implemented, Capacity Scheduler was only capable of 
> defining a leaf queue with a specific name in the whole hierarchy once, 
> meaning that leaf queue names were unique.
> For example root.a.testQueue and root.b.testQueue couldn't coexist, as the 
> leaf queue name is the same.
> At this point, I suspected that YARN-9879 was causing this issue. However, 
> as CS didn't allow two leaf queues with the same name before YARN-9879 was 
> merged, queries for "root.default" and "default" could easily have worked, 
> because it was guaranteed that there was only one "default" leaf queue in 
> the hierarchy. I dug a bit further.
> I also noticed that YARN-8659 ([commit 
> link|https://github.com/apache/hadoop/commit/7c13872cbbb6f1b0b1c2dde894885b41186b3797])
>  could have introduced this issue a long time ago, as it removed the iterator 
> logic that queried the applications with method YarnScheduler#getAppsInQueue 
> (see 
> [this|https://github.com/apache/hadoop/commit/7c13872cbbb6f1b0b1c2dde894885b41186b3797#diff-5b432bf3a8eb3e039878300ffb9db1f728226b9e3f63c4eb53be5ed5a833390aL843]).
> Let's follow the implementation of YarnScheduler#getAppsInQueue for CS: 
> 1. First of all, 
> [here|https://github.com/apache/hadoop/blob/4c05d257ba3f3311b5bbc993f6e5e35637487d88/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L2501-L2509]
>  is the method definition.
> [CapacityScheduler#getQueue|https://github.com/apache/hadoop/blob/4c05d257ba3f3311b5bbc993f6e5e35637487d88/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L824-L829]
>  is called from here.
> 2. 
> [CapacityScheduler#getQueue|https://github.com/apache/hadoop/blob/4c05d257ba3f3311b5bbc993f6e5e35637487d88/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L824-L829]
>  is then calling 
> [QueueManager#getQueue|https://github.com/apache/hadoop/blob/da09d68056d4e6a9490ddc6d9ae816b65217e117/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerQueueManager.java#L136-L138].
> 3. 
> [QueueManager#getQueue|https://github.com/apache/hadoop/blob/da09d68056d4e6a9490ddc6d9ae816b65217e117/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerQueueManager.java#L136-L138]
>  is then calling [CSQueueStore#get|#get].
> 4. [CSQueueStore#get|#get] calls the 'getMap' field's getOrDefault method 
> [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L260].
> 4.1 CSQueueStore#getMap (field) stores the Queue objects mapped to their 
> short and full names (e.g. 'default' and 'root.default').
> [CSQueueStore#add|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L122-L152]
>  is the method that is responsible for adding the CSQueue objects.
> 4.2 The first getMap.put call is invoked 
> [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L134]
>  with the full queue name.
> 4.3 The second getMap.put call is invoked via 
> [CSQueueStore#updateGetMapForShortName|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L102-L120]
>  
> [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueStore.java#L113].
> As a conclusion, in 
> [ClientRMService#getApplications|https://github.com/apache/hadoop/blob/d2869940094d330434f3e82d16b1cad3c6023437/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java#L880-L993],
>  the app filtering by queues seems wrong to me. 
> The block that filters by queues is 
> [here|https://github.com/apache/hadoop/blob/d2869940094d330434f3e82d16b1cad3c6023437/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java#L915-L918].
> This should be enhanced by querying the apps from 
> YarnScheduler#getAppsInQueue, as it handles both the short and the full 
> queue names for CS in the end.
> It's crucial not to simply fall back to the logic that was replaced by 
> YARN-8659 ([commit 
> link|https://github.com/apache/hadoop/commit/7c13872cbbb6f1b0b1c2dde894885b41186b3797]), 
> because the original issue there was that rmContext.getRMApps() returns both 
> running and finished apps, while scheduler.getAppsInQueue only returns 
> running apps.
> h2. NOTES
> *NOTE #1:* 
> As there's no way to get the short queue name + the full queue name from 
> RmApp / RmAppImpl, it's currently not possible to compare the queue filter of 
> the RM client request with both types of queue names of the application.
> *NOTE #2:*
> scheduler.getAppsInQueue(queue) will only return running apps, so for running 
> apps, it's possible to retrieve the apps by queue name, and it will work with 
> both short and full names. However, for non-running apps, only the submitted 
> queue name would work for filtering.
> *NOTE #3 (plan for implementation):*
> It would be completely reasonable to consider both running and non-running 
> apps while querying; however, I think it never worked that way.
> Before YARN-8659, only running apps were considered and before YARN-9879, 
> both running + non-running apps were considered but only the stored queue 
> name (in RmAppImpl) was compared to the app filter's queue name, which was 
> either the short or the full queue name.
> All in all, I don't want to change this behavior, and I also think it would 
> make the code more convoluted if RmAppImpl stored both the short and the 
> full queue names.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

