YCozy created YARN-10231:
----------------------------

             Summary: When a NM is partitioned away, YARN service will complain 
about "Queue's AM resource limit exceeded" 
                 Key: YARN-10231
                 URL: https://issues.apache.org/jira/browse/YARN-10231
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.3.0
            Reporter: YCozy


We were testing YARN's RM failover code under network partition, and we 
observed the following failure. We think this is a bug and would like to 
confirm with you.

Basically, we were testing the following scenario:
 # Start a YARN cluster with two RMs (e.g., RM1 and RM2) and one NM.
 # Make RM1 active.
 # Start a YARN service, e.g., the built-in sleeper service. Name it sleeper1.
 # Fail over from RM1 to RM2.
 # Stop sleeper1 and start another YARN service, e.g., the sleeper service 
again, and call it sleeper2.

When no network partition happens, everything is fine (e.g., sleeper2 can start 
successfully).

However, if the NM is partitioned away after the RM failover, sleeper2 fails to 
start. Even after we poll sleeper2's status for 30 seconds, its application 
report still shows the ACCEPTED state:
{code:java}
Application Report :
    Application-Id : application_4_0001
    Application-Name : sleeper2
    Application-Type : yarn-service
    User : root
    Queue : default
    Application Priority : 0
    Start-Time : 1585525063950
    Finish-Time : 0
    Progress : 0%
    State : ACCEPTED 
    Final-State : UNDEFINED 
    Tracking-URL : N/A 
    RPC Port : -1 
    AM Host : N/A 
    Aggregate Resource Allocation : 0 MB-seconds, 0 vcore-seconds
    Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds 
    Log Aggregation Status : DISABLED
    Diagnostics : [Sun Mar 29 23:37:44 +0000 2020] Application is added to the 
scheduler and is not yet activated. Queue's AM resource limit exceeded.  
Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = 
<memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:1024, vCores:1>; 
User AM Resource Limit of the queue = <memory:1024, vCores:1>; Queue AM 
Resource Usage = <memory:1024, vCores:1>;  
    Unmanaged Application : false 
    Application Node Label Expression : <Not set> 
    AM container Node Label Expression : <DEFAULT_PARTITION> 
    TimeoutType : LIFETIME ExpiryTime : UNLIMITED RemainingTime : -1seconds
{code}
Since the only fault injected is a network partition, the queue's AM resource 
limit shouldn't be exceeded.
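
To make the numbers in the Diagnostics field concrete, here is a minimal sketch of the arithmetic they imply. It is a simplified model, not the scheduler's actual code: the `Resource` class and `can_activate_am` function are hypothetical names, and the rule assumed here is that an AM is activated only if the queue's current AM usage plus the new AM request fits within the queue's AM limit in every dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for a <memory, vCores> pair from the report."""
    memory_mb: int
    vcores: int

    def __add__(self, other: "Resource") -> "Resource":
        return Resource(self.memory_mb + other.memory_mb,
                        self.vcores + other.vcores)

    def fits_in(self, limit: "Resource") -> bool:
        # Every dimension must fit, not just memory.
        return (self.memory_mb <= limit.memory_mb
                and self.vcores <= limit.vcores)


def can_activate_am(am_request: Resource, am_used: Resource,
                    am_limit: Resource) -> bool:
    """Assumed rule: usage plus the new request must fit in the AM limit."""
    return (am_used + am_request).fits_in(am_limit)


# Values copied from the Diagnostics field of the application report.
request = Resource(1024, 1)  # AM Resource Request
limit = Resource(1024, 1)    # Queue Resource Limit for AM
used = Resource(1024, 1)     # Queue AM Resource Usage

# With 1024 MB already counted as used, the request cannot fit.
blocked_as_reported = not can_activate_am(request, used, limit)

# If the queue's AM usage were zero, the same request would fit.
would_start_if_released = can_activate_am(request, Resource(0, 0), limit)

print(blocked_as_reported, would_start_if_released)
```

Under this model, sleeper2 is blocked only because the queue already reports 1024 MB and 1 vcore of AM usage while sleeper2 itself has not been activated. One possible reading is that the usage still includes an AM from before the partition (for example, sleeper1's), but the report alone does not confirm that.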

We can reliably reproduce this bug using our fault injection engine. Please let 
us know if you need any info for debugging.
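
For anyone reproducing this without the fault injection engine, the 30-second status check described above can be sketched as a bounded poll. This is only an illustration and not the engine's code: `poll_until_running` and the `get_state` callable are hypothetical, and how the state is actually fetched (CLI, REST, or client API) is left to the caller.

```python
import time
from typing import Callable


def poll_until_running(get_state: Callable[[], str],
                       timeout_s: float = 30.0,
                       interval_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_state() until it returns "RUNNING" or the timeout elapses.

    Returns the last observed state, so a caller can tell a stuck
    application (e.g. still "ACCEPTED") from one that started.
    """
    deadline = time.monotonic() + timeout_s
    state = get_state()
    while state != "RUNNING" and time.monotonic() < deadline:
        sleep(interval_s)
        state = get_state()
    return state


if __name__ == "__main__":
    # Stand-in for a real status lookup: the application never leaves ACCEPTED.
    final_state = poll_until_running(lambda: "ACCEPTED",
                                     timeout_s=0.05, interval_s=0.01)
    print(final_state)
```

In the failure reported here, such a poll would return "ACCEPTED" after the timeout instead of "RUNNING".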



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
