[ https://issues.apache.org/jira/browse/YARN-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280191#comment-15280191 ]
Nathan Roberts commented on YARN-5039:
--------------------------------------

Thanks [~milesc]. This seems to be an Amazon EMR thing (unless I'm misunderstanding the log messages). Here are the important pieces:

Every time the scheduler tries to schedule on a node with sufficient room, it bails out, claiming the node is not the right type of EMR node:

{noformat}
# egrep -i "node being looked for|is excluded" whole-scheduler-at-debug.log
2016-05-11 00:55:46,818 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Node being looked for scheduling ip-10-12-40-239.us-west-2.compute.internal:8041 availableResource: <memory:241664, vCores:64>
2016-05-11 00:55:46,819 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerAppUtils (ResourceManager Event Processor): node ip-10-12-40-239.us-west-2.compute.internal with emrlabel:TASK is excluded to request with emrLabel:MASTER,CORE
{noformat}

And below you can see it consider the 0041 application; everything looks promising until the node is excluded. This is an EMR-specific check, which is why it wasn't making much sense how this could happen.
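As an aside, the exclusion implied by that log line presumably amounts to a simple label-compatibility test: a request tagged emrLabel:MASTER,CORE only matches nodes whose label is in that list. A minimal sketch (function and parameter names are hypothetical; Amazon's actual patch is not public):

```python
# Hypothetical reconstruction of the EMR-specific check implied by
# "node ... with emrlabel:TASK is excluded to request with emrLabel:MASTER,CORE".
# Names are invented for illustration; the real EMR scheduler patch is not public.

def is_excluded(node_label: str, request_labels: str) -> bool:
    """Return True if the node's EMR label is absent from the request's
    comma-separated list of acceptable labels."""
    allowed = request_labels.split(",")
    return node_label not in allowed

# The TASK node from the debug log is excluded from a MASTER,CORE request,
# so the scheduler skips it even though it has plenty of free resources.
print(is_excluded("TASK", "MASTER,CORE"))   # True  -> node skipped
print(is_excluded("CORE", "MASTER,CORE"))   # False -> node would be eligible
```

Under this reading, the scheduler's resource checks all pass and only the label test fails, which matches the debug output above.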
{noformat}
2016-05-11 00:55:46,819 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue (ResourceManager Event Processor): pre-assignContainers for application application_1462722347496_0041
2016-05-11 00:55:46,819 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue (ResourceManager Event Processor): User limit computation for ai2service in queue default userLimit=100 userLimitFactor=1.0 required: <memory:50688, vCores:1> consumed: <memory:101376, vCores:2> limit: <memory:4833280, vCores:1> queueCapacity: <memory:4833280, vCores:1> qconsumed: <memory:3107648, vCores:55> currentCapacity: <memory:4833280, vCores:1> activeUsers: 1 clusterCapacity: <memory:4833280, vCores:1280>
2016-05-11 00:55:46,819 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt (ResourceManager Event Processor): showRequests: application=application_1462722347496_0041 headRoom=<memory:1725632, vCores:1> currentConsumption=0
2016-05-11 00:55:46,819 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt (ResourceManager Event Processor): showRequests: application=application_1462722347496_0041 request={Priority: 0, Capability: <memory:50688, vCores:1>, # Containers: 1, Location: *, Relax Locality: true}
2016-05-11 00:55:46,819 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue (ResourceManager Event Processor): needsContainers: app.#re-reserve=636 reserved=2 nodeFactor=0.20974576 minAllocFactor=0.99986756 starvation=251
2016-05-11 00:55:46,819 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue (ResourceManager Event Processor): User limit computation for ai2service in queue default userLimit=100 userLimitFactor=1.0 required: <memory:50688, vCores:1> consumed: <memory:101376, vCores:2> limit: <memory:4833280, vCores:1> queueCapacity: <memory:4833280, vCores:1> qconsumed: <memory:3107648, vCores:55> currentCapacity: <memory:4833280, vCores:1> activeUsers: 1 clusterCapacity: <memory:4833280, vCores:1280>
2016-05-11 00:55:46,819 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue (ResourceManager Event Processor): Headroom calculation for user ai2service: userLimit=<memory:4833280, vCores:1> queueMaxAvailRes=<memory:4833280, vCores:1> consumed=<memory:101376, vCores:2> headroom=<memory:1725632, vCores:1>
2016-05-11 00:55:46,819 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerAppUtils (ResourceManager Event Processor): node ip-10-12-40-239.us-west-2.compute.internal with emrlabel:TASK is excluded to request with emrLabel:MASTER,CORE
{noformat}

I suspect EMR does not want to schedule AMs on nodes that are more likely to go away (TASK nodes). Once the AM is running, though, the application takes off. Maybe someone from Amazon can chime in? cc [~danzhi]

> Applications ACCEPTED but not starting
> --------------------------------------
>
> Key: YARN-5039
> URL: https://issues.apache.org/jira/browse/YARN-5039
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Miles Crawford
> Attachments: Screen Shot 2016-05-04 at 1.57.19 PM.png, Screen Shot 2016-05-04 at 2.41.22 PM.png, capacity-scheduler-at-debug.log.gz, queue-config.log, resource-manager-application-starts.log.gz, whole-scheduler-at-debug.log.gz, yarn-yarn-resourcemanager-ip-10-12-47-144.log.gz
>
> Often when we submit applications to an incompletely utilized cluster, they sit, unable to start for no apparent reason.
> There are multiple nodes in the cluster with available resources, but the resourcemanager logs show that scheduling is being skipped. Is scheduling being skipped because the application itself has reserved the node?
> I'm not sure how to interpret this log output:
> {code}
> 2016-05-04 20:19:21,315 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Trying to fulfill reservation for application application_1462291866507_0025 on node: ip-10-12-43-54.us-west-2.compute.internal:8041
> 2016-05-04 20:19:21,316 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue (ResourceManager Event Processor): Reserved container application=application_1462291866507_0025 resource=<memory:50688, vCores:1> queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:1894464, vCores:33>, usedCapacity=0.7126589, absoluteUsedCapacity=0.7126589, numApps=2, numContainers=33 usedCapacity=0.7126589 absoluteUsedCapacity=0.7126589 used=<memory:1894464, vCores:33> cluster=<memory:2658304, vCores:704>
> 2016-05-04 20:19:21,316 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Skipping scheduling since node ip-10-12-43-54.us-west-2.compute.internal:8041 is reserved by application appattempt_1462291866507_0025_000001
> 2016-05-04 20:19:22,232 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Trying to fulfill reservation for application application_1462291866507_0025 on node: ip-10-12-43-53.us-west-2.compute.internal:8041
> 2016-05-04 20:19:22,232 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue (ResourceManager Event Processor): Reserved container application=application_1462291866507_0025 resource=<memory:50688, vCores:1> queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:1894464, vCores:33>, usedCapacity=0.7126589, absoluteUsedCapacity=0.7126589, numApps=2, numContainers=33 usedCapacity=0.7126589 absoluteUsedCapacity=0.7126589 used=<memory:1894464, vCores:33> cluster=<memory:2658304, vCores:704>
> 2016-05-04 20:19:22,232 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Skipping scheduling since node ip-10-12-43-53.us-west-2.compute.internal:8041 is reserved by application appattempt_1462291866507_0025_000001
> 2016-05-04 20:19:22,316 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Trying to fulfill reservation for application application_1462291866507_0025 on node: ip-10-12-43-54.us-west-2.compute.internal:8041
> 2016-05-04 20:19:22,316 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue (ResourceManager Event Processor): Reserved container application=application_1462291866507_0025 resource=<memory:50688, vCores:1> queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:1894464, vCores:33>, usedCapacity=0.7126589, absoluteUsedCapacity=0.7126589, numApps=2, numContainers=33 usedCapacity=0.7126589 absoluteUsedCapacity=0.7126589 used=<memory:1894464, vCores:33> cluster=<memory:2658304, vCores:704>
> 2016-05-04 20:19:22,316 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (ResourceManager Event Processor): Skipping scheduling since node ip-10-12-43-54.us-west-2.compute.internal:8041 is reserved by application appattempt_1462291866507_0025_000001
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org