[ https://issues.apache.org/jira/browse/YARN-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375761#comment-15375761 ]
Lucas Winkelmann commented on YARN-5374:
----------------------------------------

I will go ahead and file a Spark JIRA ticket now.

> Preemption causing communication loop
> -------------------------------------
>
>                 Key: YARN-5374
>                 URL: https://issues.apache.org/jira/browse/YARN-5374
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, nodemanager, resourcemanager, yarn
>    Affects Versions: 2.7.1
>        Environment: Yarn version: Hadoop 2.7.1-amzn-0
> AWS EMR cluster running:
> 1 x r3.8xlarge (Master)
> 52 x r3.8xlarge (Core)
> Spark version: 1.6.0
> Scala version: 2.10.5
> Java version: 1.8.0_51
> Input size: ~10 TB
> Input coming from S3
> Queue configuration:
> Dynamic allocation: enabled
> Preemption: enabled
> Q1: 70% capacity with max of 100%
> Q2: 30% capacity with max of 100%
> Job configuration:
> Driver memory = 10g
> Executor cores = 6
> Executor memory = 10g
> Deploy mode = cluster
> Master = yarn
> maxResultSize = 4g
> Shuffle manager = hash
>            Reporter: Lucas Winkelmann
>            Priority: Blocker
>
> Here is the scenario:
> I launch job 1 into Q1 and allow it to grow to 100% cluster utilization.
> I wait between 15 and 30 minutes. (For this job to complete with 100% of the
> cluster available takes about 1 hr, so job 1 is between 25% and 50% complete.)
> Note that if I wait less time, the issue sometimes does not occur; it appears
> to happen only after job 1 is at least 25% complete.
> I launch job 2 into Q2, and preemption occurs on Q1, shrinking job 1 to
> allow 70% cluster utilization.
> At this point job 1 essentially halts progress while job 2 continues to
> execute as normal and finishes. Job 1 then either:
> - Fails its attempt and restarts. By the time this attempt fails, the other
> job is already complete, so the second attempt has the full cluster
> available and finishes.
> - Remains at its current progress and simply does not finish (I have
> waited ~6 hrs before finally killing the application).
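The halting behavior described above is consistent with a send-retry loop that never gives up on a peer that no longer exists. As a hypothetical, self-contained illustration (not Spark's actual code; the class `RetryLoopSketch`, the methods, and the `maxAttempts` parameter are all invented for this sketch), a bounded retry that logs each failure and then reports the peer as gone avoids hanging indefinitely:

```java
// Hypothetical sketch of the retry behavior in this report; this is not
// Spark's NettyRpcEndpointRef implementation, and all names are invented.
public class RetryLoopSketch {

    // Stands in for an RPC send to an executor whose container was
    // preempted: the peer is gone, so the send always fails.
    public static boolean send(String message) {
        return false;
    }

    // Bounded retry: log each failure, then give up after maxAttempts so
    // the caller can mark the executor as lost instead of looping forever.
    public static int trySend(String message, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (send(message)) {
                return attempt; // delivered on this attempt
            }
            System.out.println("WARN Error sending message [" + message
                    + "] in " + attempt + " attempts");
        }
        return -1; // gave up; caller should treat the executor as removed
    }

    public static void main(String[] args) {
        int result = trySend("RemoveExecutor(454)", 3);
        System.out.println(result == -1
                ? "gave up after bounded retries; executor marked as lost"
                : "message delivered");
    }
}
```

If instead the failure path re-enqueues the same message without a cap (or without ever marking the executor as removed), the driver keeps emitting the WARN line quoted below for as long as the application runs, which matches the observed hang.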
>
> Looking into the error log, there is this constant error message:
> WARN NettyRpcEndpointRef: Error sending message [message =
> RemoveExecutor(454,Container container_1468422920649_0001_01_000594 on host:
> ip-NUMBERS.ec2.internal was preempted.)] in X attempts
>
> My observations have led me to believe that the application master does not
> know that this container was killed, and it keeps asking the container to
> remove the executor until it either fails the attempt or continues retrying
> indefinitely.
>
> I have done much digging online for anyone else experiencing this issue but
> have come up with nothing.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org