[ 
https://issues.apache.org/jira/browse/YARN-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10787:
----------------------------------
    Fix Version/s: 3.4.0

> Queue submit ACL check is wrong when CS queue is ambiguous
> ----------------------------------------------------------
>
>                 Key: YARN-10787
>                 URL: https://issues.apache.org/jira/browse/YARN-10787
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.4.0
>            Reporter: Szilard Nemeth
>            Assignee: Gergely Pollák
>            Priority: Major
>             Fix For: 3.4.0
>
>         Attachments: YARN-10787.001.patch
>
>
> Let's suppose we have a Capacity Scheduler configuration with 2 or more leaf 
> queues with the same name in the queue hierarchy. That's what we call an 
> ambiguous queue name.
>  Let's also enable ACL checks and define acl_submit_applications / 
> acl_administer_queue configs with the correct value, adding the username to 
> the ACL value there.
> Here's a minimalistic YARN + CS config:
> h2. 1. YARN config snippet:
> {code:java}
> <property><name>yarn.acl.enable</name><value>true</value></property>
> {code}
> h2. 2. CS config snippet:
> {code:java}
> <property>
>       <name>yarn.scheduler.capacity.root.someparent1.queues</name>
>       <value>anyotherqueue1,somequeue,anyotherqueue2</value>
> </property>
> <property>
>       <name>yarn.scheduler.capacity.root.someparent2.queues</name>
>       <value>anyotherqueue3,somequeue,anyotherqueue4</value>
> </property>
> <property>
>       <name>yarn.scheduler.capacity.root.someparent1.somequeue.acl_submit_applications</name>
>       <value>someuser1 </value>
> </property>
> <property>
>       <name>yarn.scheduler.capacity.root.someparent2.somequeue.acl_submit_applications</name>
>       <value>someuser1 </value>
> </property>
> <property>
>       <name>yarn.scheduler.capacity.root.someparent1.somequeue.acl_administer_queue</name>
>       <value>someuser1 </value>
> </property>
> <property>
>       <name>yarn.scheduler.capacity.root.someparent2.somequeue.acl_administer_queue</name>
>       <value>someuser1 </value>
> </property>
> {code}
> So in this case, we have an ambiguous queue named "somequeue" under 2 
> different paths:
>  - root.someparent1.somequeue
>  - root.someparent2.somequeue
> When a user submits an application correctly with the full queue path, e.g. 
> root.someparent1.somequeue, and ACL checking is enabled, YARN still fails to 
> place the application into that queue because it uses the short queue name 
> for the ACL check.
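
To make the notion of an ambiguous queue name concrete, here is a minimal sketch that groups full leaf queue paths by their short name and reports the names that map to more than one path. The class and method names (AmbiguousQueueNames, shortName, findAmbiguous) are hypothetical and not part of Hadoop; the queue paths are taken from the config above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative helper only; this class does not exist in the Hadoop code base.
final class AmbiguousQueueNames {

  // Returns the last path segment, e.g. "root.a.b" -> "b".
  static String shortName(String fullPath) {
    return fullPath.substring(fullPath.lastIndexOf('.') + 1);
  }

  // Groups full paths by short name and keeps only short names
  // that are shared by two or more full paths.
  static Map<String, List<String>> findAmbiguous(List<String> fullPaths) {
    Map<String, List<String>> byShortName = new TreeMap<>();
    for (String path : fullPaths) {
      byShortName.computeIfAbsent(shortName(path), k -> new ArrayList<>()).add(path);
    }
    byShortName.values().removeIf(paths -> paths.size() < 2);
    return byShortName;
  }

  public static void main(String[] args) {
    List<String> leafQueues = Arrays.asList(
        "root.someparent1.anyotherqueue1",
        "root.someparent1.somequeue",
        "root.someparent1.anyotherqueue2",
        "root.someparent2.anyotherqueue3",
        "root.someparent2.somequeue",
        "root.someparent2.anyotherqueue4");
    // prints {somequeue=[root.someparent1.somequeue, root.someparent2.somequeue]}
    System.out.println(findAmbiguous(leafQueues));
  }
}
```
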
> h2. 3. LOG SNIPPET
> {code:java}
> 2021-05-20 22:04:32,031 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.placement.CSMappingPlacementRule:
>  Placement final result 'root.someparent1.somequeue' for application 
> 'application_1621540945412_0001'
>  2021-05-20 22:04:32,031 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Placed 
> application with ID application_1621540945412_0001 in queue: somequeue, 
> original submission queue was: root.someparent1.somequeue
>  2021-05-20 22:04:32,031 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Ambiguous queue reference: somequeue please use full queue path instead.
>  2021-05-20 22:04:32,031 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application 'application_1621540945412_0001' is submitted without priority 
> hence considering default queue/cluster priority: 0
>  2021-05-20 22:04:32,032 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Priority '0' is acceptable in queue : somequeue for application: 
> application_1621540945412_0001
>  2021-05-20 22:04:32,993 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Exception in 
> submitting application_1621540945412_0001
>  org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.security.AccessControlException: User someuser1 does not 
> have permission to submit application_1621540945412_0001 to queue somequeue
>  at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
> {code}
> h2. 4. FULL STACKTRACE:
> {code:java}
>  org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.security.AccessControlException: User someuser1 does not 
> have permission to submit application_1621540945412_0001 to queue somequeue
>       at 
> org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:433)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:330)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:650)
>       at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
>       at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
> Caused by: org.apache.hadoop.security.AccessControlException: User someuser1 
> does not have permission to submit application_1621540945412_0001 to queue 
> somequeue
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:436)
>       ... 12 more
> 2021-05-20 22:04:32,994 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=someuser1   
>    IP=172.17.61.133        OPERATION=Submit Application Request    
> TARGET=ClientRMService  RESULT=FAILURE  DESCRIPTION=Exception in submitting 
> application PERMISSIONS=org.apache.hadoop.security.AccessControlException: 
> User someuser1 does not have permission to submit 
> application_1621540945412_0001 to queue somequeue      
> APPID=application_1621540945412_0001    QUEUENAME=somequeue
> {code}
> h1. DETAILS:
> *1. The whole thing happens in RMAppManager#createAndPopulateNewRMApp:*
>  Class / method: 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager#createAndPopulateNewRMApp
> [LINK|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L407]
> *2. RMAppManager#copyPlacementQueueToSubmissionContext is called* for 
> new applications, i.e. when the application is submitted in the normal way 
> rather than being recovered:
>  Class / method: 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager#copyPlacementQueueToSubmissionContext
> [Called 
> at|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L420]
> [Method 
> link|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L991]
> The problem is that copyPlacementQueueToSubmissionContext sets the queue of 
> context (ApplicationSubmissionContext object) from placementContext.getQueue 
> (ApplicationPlacementContext object). If placementContext holds the queue 
> name in the short form, this overrides the original submission queue 
> value, even if that value was the full queue path.
>  An example of a generated log from this method:
> {code:java}
>  2021-05-20 22:04:32,031 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Placed 
> application with ID application_1621540945412_0001 in queue: somequeue, 
> original submission queue was: root.someparent1.somequeue
> {code}
> *3. The problematic code block is here:* [Code 
> block|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L446-L475]
> 3.1 First, the short queuename will be gathered from submissionContext, as it 
> was overridden by 'copyPlacementQueueToSubmissionContext': 
> [Link|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L448]
>  This is bad design, as the code here relies on the queue name having been 
> overridden in the submission context object.
> 3.2 Since the queue name will be in the short form and it's ambiguous, the 
> call to 
> [scheduler.getQueue()|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L450]
>  will return null, as it's implemented like this by design: If the queue name 
> is ambiguous, it returns null.
> 3.3 The condition that checks whether csqueue is null AND placementContext is 
> not null evaluates to true 
> [here|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L452]
> *3.4. The Parent queue will be queried from CS* by the parent queue name of 
> the placement context: 
> [Link|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L456]
> *3.5 Finally, the ACL check fails* as csqueue is the queue object of the 
> parent queue of 'root.someparent1.somequeue', i.e. the queue 
> 'root.someparent1'.
>  In this case, the user doesn't have a submission ACL set for the parent 
> queue, only for the leaf queue, so the ACL check fails.
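
Steps 3.2 to 3.5 can be reproduced with a heavily simplified model. The sketch below is not Hadoop code: AclCheckSketch, addQueue, resolve and canSubmit are hypothetical names, ACLs are reduced to a flat set of user names per queue, and ACL inheritance is ignored. It only illustrates why a lookup by an ambiguous short name ends up checking the parent queue's ACL.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the scheduler's queue lookup and ACL check; not Hadoop code.
final class AclCheckSketch {
  // Full queue path -> users allowed to submit (assumption: flat ACL, no inheritance).
  private final Map<String, Set<String>> submitAcls = new HashMap<>();
  // Short name -> full path, only for short names that are unique so far.
  private final Map<String, String> uniqueShortNames = new HashMap<>();
  private final Set<String> ambiguousShortNames = new HashSet<>();

  void addQueue(String fullPath, String... allowedUsers) {
    submitAcls.put(fullPath, new HashSet<>(Arrays.asList(allowedUsers)));
    String shortName = fullPath.substring(fullPath.lastIndexOf('.') + 1);
    if (ambiguousShortNames.contains(shortName)) {
      return;
    }
    String previous = uniqueShortNames.putIfAbsent(shortName, fullPath);
    if (previous != null && !previous.equals(fullPath)) {
      // A second full path shares this short name: it is now ambiguous.
      uniqueShortNames.remove(shortName);
      ambiguousShortNames.add(shortName);
    }
  }

  // Mirrors the behaviour described in 3.2: returns null for an
  // ambiguous (or unknown) short name, the full path otherwise.
  String resolve(String name) {
    if (submitAcls.containsKey(name)) {
      return name; // already a full path
    }
    return uniqueShortNames.get(name);
  }

  // Mirrors 3.3 to 3.5: if the queue cannot be resolved and a parent is
  // known, the ACL of the parent queue is checked instead.
  boolean canSubmit(String user, String queueName, String parentPath) {
    String queue = resolve(queueName);
    if (queue == null && parentPath != null) {
      queue = resolve(parentPath);
    }
    return queue != null && submitAcls.get(queue).contains(user);
  }

  public static void main(String[] args) {
    AclCheckSketch cs = new AclCheckSketch();
    cs.addQueue("root.someparent1"); // no submit ACL for someuser1
    cs.addQueue("root.someparent2");
    cs.addQueue("root.someparent1.somequeue", "someuser1");
    cs.addQueue("root.someparent2.somequeue", "someuser1");

    // Short name: lookup returns null, the parent ACL is checked, access is denied.
    System.out.println(cs.canSubmit("someuser1", "somequeue", "root.someparent1")); // false
    // Full path: the leaf queue is found and its ACL allows the user.
    System.out.println(cs.canSubmit("someuser1", "root.someparent1.somequeue", "root.someparent1")); // true
  }
}
```
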
> h2. LIST OF THINGS TO FIX / DO:
>  - Add a unit testcase that replicates the above config and the issue.
>  - Rename copyPlacementQueueToSubmissionContext: This method does not really 
> copy anything, it simply overrides the queue value.
>  - Add Debug log to print csqueue object before the authorization code: [Auth 
> code 
> block|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L459-L475]
>  - Fix log messages: As 'copyPlacementQueueToSubmissionContext' overrides 
> (rather than copies) the original queue name with the queue name from the 
> PlacementContext, all calls to submissionContext.getQueue() will return the 
> short queue name. This results in very misleading log messages as well, 
> including the exception message itself:
> {code:java}
>  org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.security.AccessControlException: User someuser1 does not 
> have permission to submit application_1621540945412_0001 to queue somequeue
> {code}
> All log messages should print the original submission queue, if possible.
>  - Actual code fix for the issue: Use full queue path to get the queue object.
>  Again, this is the code block where the fix should happen: 
> [LINK|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L447-L458]
> 'queueName' should have the value set from: 
> *org.apache.hadoop.yarn.server.resourcemanager.placement.ApplicationPlacementContext#getFullQueuePath.*
> The equivalent of this in the linked code block:
> {code:java}
> placementContext.getFullQueuePath()
> {code}
> This should happen only if placementContext is not null.
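
A minimal sketch of the proposed selection logic follows. It is not the actual patch: QueueNameForAclCheck and queueNameForLookup are hypothetical names, and the nested PlacementContext class is a reduced stand-in for ApplicationPlacementContext whose getFullQueuePath implementation here is an assumption (parent path joined to the queue name with a dot).

```java
// Illustrative only; not the code of the attached patch.
final class QueueNameForAclCheck {

  // Hypothetical, reduced stand-in for ApplicationPlacementContext.
  static final class PlacementContext {
    private final String queue;
    private final String parentQueue; // may be null

    PlacementContext(String queue, String parentQueue) {
      this.queue = queue;
      this.parentQueue = parentQueue;
    }

    String getQueue() {
      return queue;
    }

    // Assumption: the full path is "<parent>.<queue>" when a parent is set.
    String getFullQueuePath() {
      return parentQueue == null ? queue : parentQueue + "." + queue;
    }
  }

  // Uses the full queue path when a placement context exists, and falls
  // back to the queue name from the submission context otherwise.
  static String queueNameForLookup(String submissionQueue, PlacementContext placementContext) {
    return placementContext != null
        ? placementContext.getFullQueuePath()
        : submissionQueue;
  }

  public static void main(String[] args) {
    PlacementContext ctx = new PlacementContext("somequeue", "root.someparent1");
    System.out.println(queueNameForLookup("somequeue", ctx));  // root.someparent1.somequeue
    System.out.println(queueNameForLookup("somequeue", null)); // somequeue
  }
}
```

With the full path as the lookup key, the queue lookup is no longer affected by the short name being ambiguous.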
> h2. LONG TERM FIX:
> Investigate if it's possible to eliminate 
> copyPlacementQueueToSubmissionContext.
>  This could introduce nasty backward-incompatible issues with recovery, so it 
> should be thought through very carefully.


