Wang, Xinglong created YARN-9980:
------------------------------------

             Summary: App hangs in accepted when moved from DEFAULT_PARTITION queue to an exclusive partition queue
                 Key: YARN-9980
                 URL: https://issues.apache.org/jira/browse/YARN-9980
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Wang, Xinglong
            Assignee: Wang, Xinglong
         Attachments: Screen Shot 2019-11-14 at 5.11.39 PM.png

App hangs in accepted state when moved from a DEFAULT_PARTITION queue to an exclusive partition queue.

Queue layout:

queue_root
    queue_a ----- default_partition
    queue_b ----- exclusive partition x, default partition is x

When an app is submitted to queue_a with AM_LABEL_EXPRESSION unset, the RM assigns default_partition as the AM_LABEL_EXPRESSION for this app; it then gets am1 and runs. If the app is later moved to queue_b and am1 is preempted, killed, or fails, the RM schedules another attempt, am2, if the AM retry count allows. This time, however, the resource request for am2 still carries AM_LABEL_EXPRESSION = default_partition. Because queue_b has no resources in default_partition, the request can never be satisfied, and the app stays in the accepted state forever in the RM UI.

My understanding is that, since the app was submitted with no AM_LABEL_EXPRESSION, and the code base already allows such an app to run in the current queue's default partition, the move-queue scenario should also let the app run successfully. That means am2 should get resources from queue_b's default partition x instead of pending forever.

In our production environment, we have a landing queue on default_partition, and a routing mechanism that moves apps from this queue to other queues, including queues with exclusive partitions.

!Screen Shot 2019-11-14 at 5.11.39 PM.png!
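
To make the difference between the reported behaviour and the requested behaviour concrete, here is a minimal model in Python. It is not YARN code: the `Queue` class and the `resolve_am_label` and `can_schedule` functions are hypothetical names invented for this sketch, and partitions are modelled as plain strings. It only assumes what the report describes, namely that an unset AM_LABEL_EXPRESSION falls back to a queue's default partition and that a request is schedulable only if the queue has resources in the requested partition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Queue:
    """Hypothetical stand-in for a scheduler queue (not a YARN class)."""
    name: str
    default_partition: str   # partition used when the app sets no label
    partitions: frozenset    # partitions in which this queue has resources


def resolve_am_label(submitted_label: Optional[str], queue: Queue) -> str:
    """Return the partition to request for an AM container.

    An explicitly submitted label is always kept. An unset label (None)
    falls back to the default partition of the queue passed in.
    """
    if submitted_label is not None:
        return submitted_label
    return queue.default_partition


def can_schedule(label: str, queue: Queue) -> bool:
    """A request is satisfiable only if the queue has that partition."""
    return label in queue.partitions


queue_a = Queue("queue_a", "default_partition", frozenset({"default_partition"}))
queue_b = Queue("queue_b", "x", frozenset({"x"}))

# Reported behaviour: the label is resolved once, against the submit queue,
# and reused for am2 after the app has moved to queue_b.
pinned_label = resolve_am_label(None, queue_a)
am2_stuck = not can_schedule(pinned_label, queue_b)

# Requested behaviour: the label is resolved again for each AM attempt,
# against the queue the app currently lives in.
am2_label = resolve_am_label(None, queue_b)
am2_runs = can_schedule(am2_label, queue_b)
```

In this model `am2_stuck` is true for the pinned label and `am2_runs` is true for the per-attempt label, which mirrors the pending-forever case and the outcome the report asks for. An app that sets its label explicitly is unaffected, since `resolve_am_label` never overrides a submitted value.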