Lucas Winkelmann created YARN-5374:
--------------------------------------

             Summary: Preemption causing communication loop
                 Key: YARN-5374
                 URL: https://issues.apache.org/jira/browse/YARN-5374
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacityscheduler, nodemanager, resourcemanager, yarn
    Affects Versions: 2.7.1
         Environment: Yarn version: Hadoop 2.7.1-amzn-0

AWS EMR Cluster running:
1 x r3.8xlarge (Master)
52 x r3.8xlarge (Core)

Spark version: 1.6.0
Scala version: 2.10.5
Java version: 1.8.0_51

Input size: ~10 TB
Input coming from S3

Queue Configuration:
Dynamic allocation: enabled
Preemption: enabled
Q1: 70% capacity with max of 100%
Q2: 30% capacity with max of 100%
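
For reference, the queue and preemption settings above correspond roughly to the properties sketched below. This is only a sketch: the queue names Q1/Q2 and the values are taken from the description above, not copied from the actual cluster configuration.

    <!-- capacity-scheduler.xml (sketch) -->
    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>Q1,Q2</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.Q1.capacity</name>
      <value>70</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.Q1.maximum-capacity</name>
      <value>100</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.Q2.capacity</name>
      <value>30</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.Q2.maximum-capacity</name>
      <value>100</value>
    </property>

    <!-- yarn-site.xml: enables CapacityScheduler preemption -->
    <property>
      <name>yarn.resourcemanager.scheduler.monitor.enable</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.monitor.policies</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
    </property>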

Job Configuration:
Driver memory = 10g
Executor cores = 6
Executor memory = 10g
Deploy mode = cluster
Master = yarn
maxResultSize = 4g
Shuffle manager = hash
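
The jobs are submitted with settings equivalent to the sketch below. The queue name, main class, and jar are placeholders; note that dynamic allocation also requires the external shuffle service to be running on the NodeManagers.

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --queue Q1 \
      --driver-memory 10g \
      --executor-memory 10g \
      --executor-cores 6 \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.driver.maxResultSize=4g \
      --conf spark.shuffle.manager=hash \
      --class com.example.Job1 job1.jar
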
            Reporter: Lucas Winkelmann
            Priority: Blocker


Here is the scenario:
I launch job 1 into Q1 and allow it to grow to 100% cluster utilization.
I wait 15-30 minutes (with 100% of the cluster available this job takes about 
1 hour to complete, so job 1 is roughly 25-50% done). Note that if I wait less 
time the issue sometimes does not occur; it appears to happen only once job 1 
is at least 25% complete.
I launch job 2 into Q2, and preemption occurs on Q1, shrinking job 1 back down 
to 70% of cluster utilization.
At this point job 1 essentially halts progress, while job 2 continues to execute 
as normal and finishes. Job 1 then either:
- Fails its attempt and restarts. By the time the attempt fails, job 2 is 
already complete, so the second attempt has the full cluster available and 
finishes.
- Remains at its current progress and simply never finishes (I have waited 
~6 hrs before finally killing the application).
 
Looking into the error log, the following message is repeated constantly:
WARN NettyRpcEndpointRef: Error sending message [message = 
RemoveExecutor(454,Container container_1468422920649_0001_01_000594 on host: 
ip-NUMBERS.ec2.internal was preempted.)] in X attempts
 
My observations lead me to believe that the application master does not know 
this container has been killed and keeps trying to send the RemoveExecutor 
message, until it either fails the attempt or simply keeps retrying 
indefinitely.
 
I have searched extensively online for anyone else experiencing this issue but 
have found nothing.


