[ https://issues.apache.org/jira/browse/YARN-10787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Szilard Nemeth updated YARN-10787:
----------------------------------
    Fix Version/s: 3.4.0

> Queue submit ACL check is wrong when CS queue is ambiguous
> ----------------------------------------------------------
>
> Key: YARN-10787
> URL: https://issues.apache.org/jira/browse/YARN-10787
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.4.0
> Reporter: Szilard Nemeth
> Assignee: Gergely Pollák
> Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10787.001.patch
>
>
> Let's suppose we have a Capacity Scheduler configuration with 2 or more leaf
> queues with the same name in the queue hierarchy. That's what we call an
> ambiguous queue name.
> Let's also enable ACL checks and define acl_submit_applications /
> acl_administer_queue configs with the correct value, adding the username to
> the ACL value there.
> Here's a minimalistic YARN + CS config:
> h2. 1. YARN config snippet:
> {code:java}
> <property><name>yarn.acl.enable</name><value>true</value></property>
> {code}
> h2. 2. CS config snippet:
> {code:java}
> <property>
>   <name>yarn.scheduler.capacity.root.someparent1.queues</name>
>   <value>anyotherqueue1,somequeue,anyotherqueue2</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.root.someparent2.queues</name>
>   <value>anyotherqueue3,somequeue,anyotherqueue4</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.root.someparent1.somequeue.acl_submit_applications</name>
>   <value>someuser1 </value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.root.someparent2.somequeue.acl_submit_applications</name>
>   <value>someuser1 </value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.root.someparent1.somequeue.acl_administer_queue</name>
>   <value>someuser1 </value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.root.someparent2.somequeue.acl_administer_queue</name>
>   <value>someuser1 </value>
> </property>
> {code}
> So in this case, we have an ambiguous queue named "somequeue" under 2
> different paths:
> - root.someparent1.somequeue
> - root.someparent2.somequeue
> When a user correctly submits an application with the full queue path, e.g.
> root.someparent1.somequeue, YARN still fails to place the application into
> that queue and uses the short name when ACL checking is enabled.
> h2. 3. LOG SNIPPET
> {code:java}
> 2021-05-20 22:04:32,031 DEBUG org.apache.hadoop.yarn.server.resourcemanager.placement.CSMappingPlacementRule: Placement final result 'root.someparent1.somequeue' for application 'application_1621540945412_0001'
> 2021-05-20 22:04:32,031 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Placed application with ID application_1621540945412_0001 in queue: somequeue, original submission queue was: root.someparent1.somequeue
> 2021-05-20 22:04:32,031 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Ambiguous queue reference: somequeue please use full queue path instead.
> 2021-05-20 22:04:32,031 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application 'application_1621540945412_0001' is submitted without priority hence considering default queue/cluster priority: 0
> 2021-05-20 22:04:32,032 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Priority '0' is acceptable in queue : somequeue for application: application_1621540945412_0001
> 2021-05-20 22:04:32,993 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Exception in submitting application_1621540945412_0001
> org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User someuser1 does not have permission to submit application_1621540945412_0001 to queue somequeue
> at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
> {code}
> h2. 4. FULL STACKTRACE:
> {code:java}
> org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User someuser1 does not have permission to submit application_1621540945412_0001 to queue somequeue
> at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:433)
> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:330)
> at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:650)
> at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
> at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
> Caused by: org.apache.hadoop.security.AccessControlException: User someuser1 does not have permission to submit application_1621540945412_0001 to queue somequeue
> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:436)
> ... 12 more
> 2021-05-20 22:04:32,994 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=someuser1 IP=172.17.61.133 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=FAILURE DESCRIPTION=Exception in submitting application PERMISSIONS=org.apache.hadoop.security.AccessControlException: User someuser1 does not have permission to submit application_1621540945412_0001 to queue somequeue APPID=application_1621540945412_0001 QUEUENAME=somequeue
> {code}
> h1. DETAILS:
> *1. The whole thing happens in RMAppManager#createAndPopulateNewRMApp:*
> Class / method:
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager#createAndPopulateNewRMApp
> [LINK|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L407]
> *2. RMAppManager#copyPlacementQueueToSubmissionContext is called* for
> new applications, meaning we are not recovering and the application was
> submitted in the normal way:
> Class / method:
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager#copyPlacementQueueToSubmissionContext
> [Called at|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L420]
> [Method link|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L991]
> The problem is that copyPlacementQueueToSubmissionContext sets the queue of
> context (ApplicationSubmissionContext object) from placementContext.getQueue
> (ApplicationPlacementContext object).
> If placementContext holds the queue name in the short form, this overrides
> the original submission queue value, which in our example was the full
> queue path.
> An example of a generated log from this method:
> {code:java}
> 2021-05-20 22:04:32,031 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Placed application with ID application_1621540945412_0001 in queue: somequeue, original submission queue was: root.someparent1.somequeue
> {code}
> *3. The problematic code block is here:*
> [Code block|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L446-L475]
> 3.1 First, the short queue name is read from submissionContext, as it
> was overridden by 'copyPlacementQueueToSubmissionContext':
> [Link|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L448]
> This is bad design, as here we are relying on the fact that the queue name
> was overridden in the submission context object.
> 3.2 Since the queue name is in the short form and it's ambiguous, the
> call to
> [scheduler.getQueue()|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L450]
> returns null, as it's implemented like this by design: If the queue name
> is ambiguous, it returns null.
> 3.3 The condition checking that csqueue is null AND placementContext is not
> null evaluates to true
> [here|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L452]
> *3.4 The parent queue will be queried from CS* by the parent queue name of
> the placement context:
> [Link|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L456]
> *3.5 Finally, the ACL check fails* as csqueue is the queue object of the
> parent queue of the queue 'root.someparent1.somequeue', which is the
> queue 'root.someparent1'.
> In this case, the user doesn't have a submission ACL set for the parent
> queue, only for the leaf queue, so the ACL check fails.
> h2. LIST OF THINGS TO FIX / DO:
> - Add a unit testcase that replicates the above config and the issue.
> - Rename copyPlacementQueueToSubmissionContext: This method does not really
> copy anything, it simply overrides the queue value.
> - Add a debug log to print the csqueue object before the authorization code:
> [Auth code block|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L459-L475]
> - Fix log messages: As 'copyPlacementQueueToSubmissionContext' overrides
> (not copies) the original queue name with the queue name from the
> PlacementContext, all calls to submissionContext.getQueue() will return the
> short queue name.
> This results in very misleading log messages as well,
> including the exception message itself:
> {code:java}
> org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User someuser1 does not have permission to submit application_1621540945412_0001 to queue somequeue
> {code}
> All log messages should print the original submission queue, if possible.
> - Actual code fix for the issue: Use the full queue path to get the queue
> object.
> Again, this is the code block where the fix should happen:
> [LINK|https://github.com/apache/hadoop/blob/2541efa496ba0e7e096ee5ec3c08d64b62036402/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L447-L458]
> 'queueName' should have the value set from:
> *org.apache.hadoop.yarn.server.resourcemanager.placement.ApplicationPlacementContext#getFullQueuePath.*
> The equivalent of this in the linked code block:
> {code:java}
> placementContext.getFullQueuePath()
> {code}
> This should happen only if placementContext is not null.
> h2. LONG TERM FIX:
> Investigate if it's possible to eliminate
> copyPlacementQueueToSubmissionContext.
> This could introduce nasty backward-incompatibility issues with recovery,
> so it should be thought through really carefully.
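
The failure mode and the proposed fix described in the issue can be illustrated with a small standalone sketch. It does not use the real RMAppManager or CapacityScheduler classes: AmbiguousQueueAclSketch and its methods (addQueue, resolve, canSubmitBuggy, canSubmitFixed, sampleConfig) are hypothetical names made up for this illustration. It also assumes that parent queues carry an empty submit ACL and that ACLs are not inherited, and it leaves out the anyotherqueue* leaves. The only behavior taken from the description is that a lookup by an ambiguous short name returns null, after which the check falls back to the parent queue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the queue lookup and ACL check; not the Hadoop API. */
public class AmbiguousQueueAclSketch {

  /** Full queue path -> users allowed to submit (no inheritance modeled). */
  private final Map<String, Set<String>> submitAcls = new HashMap<>();
  /** Short queue name -> every full path ending in that name. */
  private final Map<String, List<String>> shortNameIndex = new HashMap<>();

  public void addQueue(String fullPath, Set<String> submitUsers) {
    submitAcls.put(fullPath, submitUsers);
    String shortName = fullPath.substring(fullPath.lastIndexOf('.') + 1);
    shortNameIndex.computeIfAbsent(shortName, k -> new ArrayList<>()).add(fullPath);
  }

  /** Returns the full path for a name, or null if it is unknown or ambiguous. */
  public String resolve(String name) {
    if (submitAcls.containsKey(name)) {
      return name; // already a full path
    }
    List<String> candidates = shortNameIndex.get(name);
    if (candidates == null || candidates.size() != 1) {
      return null; // unknown, or an ambiguous short name
    }
    return candidates.get(0);
  }

  /** Flow from steps 3.1 to 3.5: short name first, then fall back to the parent. */
  public boolean canSubmitBuggy(String user, String shortName, String parentPath) {
    String resolved = resolve(shortName);
    if (resolved == null) {
      resolved = resolve(parentPath); // ACL is now checked on the parent queue
    }
    return resolved != null && submitAcls.get(resolved).contains(user);
  }

  /** Proposed flow: resolve the queue by its full path. */
  public boolean canSubmitFixed(String user, String fullPath) {
    String resolved = resolve(fullPath);
    return resolved != null && submitAcls.get(resolved).contains(user);
  }

  /** Mirrors the CS config snippet, reduced to the queues that matter here. */
  public static AmbiguousQueueAclSketch sampleConfig() {
    AmbiguousQueueAclSketch cs = new AmbiguousQueueAclSketch();
    cs.addQueue("root", Collections.<String>emptySet());
    cs.addQueue("root.someparent1", Collections.<String>emptySet());
    cs.addQueue("root.someparent2", Collections.<String>emptySet());
    cs.addQueue("root.someparent1.somequeue", Collections.singleton("someuser1"));
    cs.addQueue("root.someparent2.somequeue", Collections.singleton("someuser1"));
    return cs;
  }

  public static void main(String[] args) {
    AmbiguousQueueAclSketch cs = sampleConfig();
    System.out.println("short-name lookup allows submit: "
        + cs.canSubmitBuggy("someuser1", "somequeue", "root.someparent1"));
    System.out.println("full-path lookup allows submit: "
        + cs.canSubmitFixed("someuser1", "root.someparent1.somequeue"));
  }
}
```

In this sketch the short-name flow denies someuser1 because the check lands on root.someparent1, while the full-path flow reaches root.someparent1.somequeue and allows the submission. A unit test for the real fix would need to exercise the actual RMAppManager code path rather than this model.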