YCozy created YARN-10231:
----------------------------

             Summary: When a NM is partitioned away, YARN service will complain about "Queue's AM resource limit exceeded"
                 Key: YARN-10231
                 URL: https://issues.apache.org/jira/browse/YARN-10231
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.3.0
            Reporter: YCozy

We were testing YARN's RM failover code under network partition and observed the following failure. We think this is a bug and would like to confirm it with you.

We were testing the following scenario:
# Start a YARN cluster with two RMs (e.g., RM1 and RM2) and one NM.
# Make RM1 active.
# Start a YARN service, e.g., the built-in sleeper service. Name it sleeper1.
# Fail over from RM1 to RM2.
# Stop sleeper1 and start another YARN service, e.g., the sleeper service again, and call it sleeper2.

When no network partition happens, everything is fine (e.g., sleeper2 starts successfully). However, if the NM is partitioned away after the RM failover, sleeper2 fails to start. After polling sleeper2's status for 30 seconds, its application report is still as follows:

{code:java}
Application Report :
	Application-Id : application_4_0001
	Application-Name : sleeper2
	Application-Type : yarn-service
	User : root
	Queue : default
	Application Priority : 0
	Start-Time : 1585525063950
	Finish-Time : 0
	Progress : 0%
	State : ACCEPTED
	Final-State : UNDEFINED
	Tracking-URL : N/A
	RPC Port : -1
	AM Host : N/A
	Aggregate Resource Allocation : 0 MB-seconds, 0 vcore-seconds
	Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
	Log Aggregation Status : DISABLED
	Diagnostics : [Sun Mar 29 23:37:44 +0000 2020] Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded. Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:1024, vCores:1>; User AM Resource Limit of the queue = <memory:1024, vCores:1>; Queue AM Resource Usage = <memory:1024, vCores:1>;
	Unmanaged Application : false
	Application Node Label Expression : <Not set>
	AM container Node Label Expression : <DEFAULT_PARTITION>
	TimeoutType : LIFETIME
	ExpiryTime : UNLIMITED
	RemainingTime : -1seconds
{code}

Since the only fault injected is a network partition, the queue's AM resource limit shouldn't be exceeded.

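To make the numbers in the Diagnostics line easier to read, here is a minimal sketch of the arithmetic they imply. It is a simplified model and not the scheduler's actual code: the class name AmLimitSketch and the method fitsUnderAmLimit are hypothetical, only memory is considered, and any special cases the real scheduler applies are ignored. The three values are taken from the report above. One possible reading, which the report alone does not confirm, is that the AM of the stopped sleeper1 is still counted in the queue's AM usage.

```java
public class AmLimitSketch {

    /**
     * Hypothetical helper: returns true if a new AM asking for requestMb
     * would still fit under the queue's AM limit, given current AM usage.
     */
    public static boolean fitsUnderAmLimit(long usedMb, long requestMb, long limitMb) {
        return usedMb + requestMb <= limitMb;
    }

    public static void main(String[] args) {
        long limitMb = 1024;   // "Queue Resource Limit for AM" from the report
        long requestMb = 1024; // "AM Resource Request" from the report
        long usedMb = 1024;    // "Queue AM Resource Usage" from the report

        // With the reported usage, the new AM does not fit: 1024 + 1024 > 1024.
        System.out.println("reported usage: " + fitsUnderAmLimit(usedMb, requestMb, limitMb));

        // If the usage had dropped to 0 after sleeper1 stopped, the same request would fit.
        System.out.println("zero usage:     " + fitsUnderAmLimit(0, requestMb, limitMb));
    }
}
```

Under this model, sleeper2 could only be activated once the reported Queue AM Resource Usage drops below its current value of 1024 MB.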
We can reliably reproduce this bug using our fault injection engine. Please let us know if you need any info for debugging.