[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized

niu (JIRA) Mon, 03 Sep 2018 18:30:40 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602522#comment-16602522
 ]


niu commented on YARN-8513:
---------------------------

Debug dump:
{code:java}
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing 
shopee-test-cluster04:45454 of type STATUS_UPDATE
2018-09-03 11:44:11,175 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: 
Dispatching the event 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType:
 NODE_UPDATE
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: 
nodeUpdate: shopee-test-cluster04:45454 cluster capacity: <memory:1351680, 
vCores:240>
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: 
Node being looked for scheduling shopee-test-cluster04:45454 availableResource: 
<memory:82944, vCores:77>
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Trying to schedule on node: shopee-test-cluster04, available: <memory:82944, 
vCores:77>
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Trying to assign containers to child-queue of root
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 Check assign to queue: root nodePartition: , usedResources: <memory:1095680, 
vCores:8>, clusterResources: <memory:1351680, vCores:240>, currentUsedCapacity: 
0.81060606, max-capacity: 1.0
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
printChildQueues - queue: root child-queues: root.dwusedCapacity=(1.1842697),  
label=(*)root.devusedCapacity=(0.016571993),  label=(*)
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Trying to assign to queue: root.dev stats: dev: capacity=0.32, 
absoluteCapacity=0.32, usedResources=<memory:7168, vCores:1>, 
usedCapacity=0.016571993, absoluteUsedCapacity=0.0053030304, numApps=1, 
numContainers=1
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
assignContainers: partition= #applications=1
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 Check assign to queue: dev nodePartition: , usedResources: <memory:7168, 
vCores:1>, clusterResources: <memory:1351680, vCores:240>, currentUsedCapacity: 
0.0053030304, max-capacity: 0.6
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager: 
userLimit is fetched. userLimit=<memory:432640, vCores:77>, 
userSpecificUserLimit=<memory:432640, vCores:77>, 
schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Headroom calculation for user work:  userLimit=<memory:432640, vCores:77> 
queueMaxAvailRes=<memory:811008, vCores:144> consumed=<memory:7168, vCores:1> 
partition=
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 pre-assignContainers for application application_1535930391687_0019
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager: 
userLimit is fetched. userLimit=<memory:432640, vCores:77>, 
userSpecificUserLimit=<memory:432640, vCores:77>, 
schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1535930391687_0019 
headRoom=<memory:425472, vCores:76> currentConsumption=7168
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalitySchedulingPlacementSet:
         Request={AllocationRequestId: 0, Priority: 1, Capability: 
<memory:360448, vCores:2>, # Containers: 3, Location: *, Relax Locality: true, 
Execution Type Request: null, Node Label Expression: }
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator:
 assignContainers: node=shopee-test-cluster04 
application=application_1535930391687_0019 priority=1 
pendingAsk=<per-allocation-resource=<memory:360448, vCores:2>,repeat=3> 
type=OFF_SWITCH
2018-09-03 11:44:11,175 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
 Reserved container  application=application_1535930391687_0019 
resource=<memory:360448, vCores:2> 
queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@65ed660
 cluster=<memory:1351680, vCores:240>
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
post-assignContainers for application application_1535930391687_0019
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager: 
userLimit is fetched. userLimit=<memory:432640, vCores:77>, 
userSpecificUserLimit=<memory:432640, vCores:77>, 
schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1535930391687_0019 
headRoom=<memory:425472, vCores:76> currentConsumption=7168
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalitySchedulingPlacementSet:
         Request={AllocationRequestId: 0, Priority: 1, Capability: 
<memory:360448, vCores:2>, # Containers: 3, Location: *, Relax Locality: true, 
Execution Type Request: null, Node Label Expression: }
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.dev stats: dev: capacity=0.32, absoluteCapacity=0.32, 
usedResources=<memory:7168, vCores:1>, usedCapacity=0.016571993, 
absoluteUsedCapacity=0.0053030304, numApps=1, numContainers=1 --> 
<memory:360448, vCores:2>, OFF_SWITCH
2018-09-03 11:44:11,175 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
assignedContainer queue=root usedCapacity=0.81060606 
absoluteUsedCapacity=0.81060606 used=<memory:1095680, vCores:8> 
cluster=<memory:1351680, vCores:240>
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
ParentQ=root assignedSoFarInThisIteration=<memory:360448, vCores:2> 
usedCapacity=0.81060606 absoluteUsedCapacity=0.81060606
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Try to commit allocation proposal=New 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
         RESERVED=[(Application=appattempt_1535930391687_0019_000001; 
Node=shopee-test-cluster04:45454; Resource=<memory:360448, vCores:2>)]
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager: 
userLimit is fetched. userLimit=<memory:432640, vCores:77>, 
userSpecificUserLimit=<memory:432640, vCores:77>, 
schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Headroom calculation for user work:  userLimit=<memory:432640, vCores:77> 
queueMaxAvailRes=<memory:811008, vCores:144> consumed=<memory:7168, vCores:1> 
partition=
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 Used resource=<memory:1095680, vCores:8> exceeded maxResourceLimit of the 
queue =<memory:1351680, vCores:240>
2018-09-03 11:44:11,175 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Failed to accept allocation proposal
2018-09-03 11:44:11,175 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Assigned maximum number of off-switch containers: 1, assignments so far: 
resource:<memory:360448, vCores:2>; type:OFF_SWITCH; excessReservation:null; 
applicationid:null; skipped:NONE; fulfilled reservation:false; 
allocations(count/resource):0/<memory:0, vCores:0>; 
reservations(count/resource):1/<memory:360448, vCores:2>
2018-09-03 11:44:11,287 DEBUG org.apache.hadoop.ipc.Server:  got #68890
2018-09-03 11:44:11,287 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 
30 on 8031: Call#68890 Retry#0 
org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 
10.65.205.151:60900 for RpcKind RPC_PROTOCOL_BUFFER
2018-09-03 11:44:11,287 DEBUG org.apache.hadoop.security.UserGroupInformation: 
PrivilegedAction as:work (auth:SIMPLE) 
from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
2018-09-03 11:44:11,288 DEBUG org.apache.hadoop.ipc.Server: Served: 
nodeHeartbeat, queueTime= 1 procesingTime= 0
2018-09-03 11:44:11,288 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: 
Dispatching the event 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStatusEvent.EventType:
 STATUS_UPDATE
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing 
shopee-test-cluster03:45454 of type STATUS_UPDATE
2018-09-03 11:44:11,288 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 
30 on 8031: responding to Call#68890 Retry#0 
org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 
10.65.205.151:60900
2018-09-03 11:44:11,288 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: 
Dispatching the event 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType:
 NODE_UPDATE
2018-09-03 11:44:11,288 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 
30 on 8031: responding to Call#68890 Retry#0 
org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 
10.65.205.151:60900 Wrote 42 bytes.
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: 
nodeUpdate: shopee-test-cluster03:45454 cluster capacity: <memory:1351680, 
vCores:240>
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: 
Node being looked for scheduling shopee-test-cluster03:45454 availableResource: 
<memory:90112, vCores:78>
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Trying to schedule on node: shopee-test-cluster03, available: <memory:90112, 
vCores:78>
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Trying to assign containers to child-queue of root
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 Check assign to queue: root nodePartition: , usedResources: <memory:1095680, 
vCores:8>, clusterResources: <memory:1351680, vCores:240>, currentUsedCapacity: 
0.81060606, max-capacity: 1.0
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
printChildQueues - queue: root child-queues: root.dwusedCapacity=(1.1842697),  
label=(*)root.devusedCapacity=(0.016571993),  label=(*)
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Trying to assign to queue: root.dev stats: dev: capacity=0.32, 
absoluteCapacity=0.32, usedResources=<memory:7168, vCores:1>, 
usedCapacity=0.016571993, absoluteUsedCapacity=0.0053030304, numApps=1, 
numContainers=1
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
assignContainers: partition= #applications=1
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 Check assign to queue: dev nodePartition: , usedResources: <memory:7168, 
vCores:1>, clusterResources: <memory:1351680, vCores:240>, currentUsedCapacity: 
0.0053030304, max-capacity: 0.6
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager: 
userLimit is fetched. userLimit=<memory:432640, vCores:77>, 
userSpecificUserLimit=<memory:432640, vCores:77>, 
schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Headroom calculation for user work:  userLimit=<memory:432640, vCores:77> 
queueMaxAvailRes=<memory:811008, vCores:144> consumed=<memory:7168, vCores:1> 
partition=
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
 pre-assignContainers for application application_1535930391687_0019
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager: 
userLimit is fetched. userLimit=<memory:432640, vCores:77>, 
userSpecificUserLimit=<memory:432640, vCores:77>, 
schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1535930391687_0019 
headRoom=<memory:425472, vCores:76> currentConsumption=7168
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalitySchedulingPlacementSet:
         Request={AllocationRequestId: 0, Priority: 1, Capability: 
<memory:360448, vCores:2>, # Containers: 3, Location: *, Relax Locality: true, 
Execution Type Request: null, Node Label Expression: }
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator:
 assignContainers: node=shopee-test-cluster03 
application=application_1535930391687_0019 priority=1 
pendingAsk=<per-allocation-resource=<memory:360448, vCores:2>,repeat=3> 
type=OFF_SWITCH
2018-09-03 11:44:11,288 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
 Reserved container  application=application_1535930391687_0019 
resource=<memory:360448, vCores:2> 
queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@65ed660
 cluster=<memory:1351680, vCores:240>
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
post-assignContainers for application application_1535930391687_0019
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager: 
userLimit is fetched. userLimit=<memory:432640, vCores:77>, 
userSpecificUserLimit=<memory:432640, vCores:77>, 
schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
 showRequests: application=application_1535930391687_0019 
headRoom=<memory:425472, vCores:76> currentConsumption=7168
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalitySchedulingPlacementSet:
         Request={AllocationRequestId: 0, Priority: 1, Capability: 
<memory:360448, vCores:2>, # Containers: 3, Location: *, Relax Locality: true, 
Execution Type Request: null, Node Label Expression: }
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
Assigned to queue: root.dev stats: dev: capacity=0.32, absoluteCapacity=0.32, 
usedResources=<memory:7168, vCores:1>, usedCapacity=0.016571993, 
absoluteUsedCapacity=0.0053030304, numApps=1, numContainers=1 --> 
<memory:360448, vCores:2>, OFF_SWITCH
2018-09-03 11:44:11,288 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
assignedContainer queue=root usedCapacity=0.81060606 
absoluteUsedCapacity=0.81060606 used=<memory:1095680, vCores:8> 
cluster=<memory:1351680, vCores:240>
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
ParentQ=root assignedSoFarInThisIteration=<memory:360448, vCores:2> 
usedCapacity=0.81060606 absoluteUsedCapacity=0.81060606
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Try to commit allocation proposal=New 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
         RESERVED=[(Application=appattempt_1535930391687_0019_000001; 
Node=shopee-test-cluster03:45454; Resource=<memory:360448, vCores:2>)]
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager: 
userLimit is fetched. userLimit=<memory:432640, vCores:77>, 
userSpecificUserLimit=<memory:432640, vCores:77>, 
schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Headroom calculation for user work:  userLimit=<memory:432640, vCores:77> 
queueMaxAvailRes=<memory:811008, vCores:144> consumed=<memory:7168, vCores:1> 
partition=
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 Used resource=<memory:1095680, vCores:8> exceeded maxResourceLimit of the 
queue =<memory:1351680, vCores:240>
2018-09-03 11:44:11,288 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Failed to accept allocation proposal
2018-09-03 11:44:11,288 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Assigned maximum number of off-switch containers: 1, assignments so far: 
resource:<memory:360448, vCores:2>; type:OFF_SWITCH; excessReservation:null; 
applicationid:null; skipped:NONE; fulfilled reservation:false; 
allocations(count/resource):0/<memory:0, vCores:0>; 
reservations(count/resource):1/<memory:360448, vCores:2>
2018-09-03 11:44:11,700 DEBUG org.apache.hadoop.ipc.Server:  got #440
2018-09-03 11:44:11,700 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 
2 on 8032: Call#440 Retry#0 
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
from 10.65.205.148:48970 for RpcKind RPC_PROTOCOL_BUFFER
2018-09-03 11:44:11,700 DEBUG org.apache.hadoop.security.UserGroupInformation: 
PrivilegedAction as:work (auth:SIMPLE) 
from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
2018-09-03 11:44:11,700 DEBUG 
org.apache.hadoop.yarn.server.security.ApplicationACLsManager: Verifying 
access-type VIEW_APP for work (auth:SIMPLE) on application 
application_1535930391687_0019 owned by work
2018-09-03 11:44:11,701 DEBUG org.apache.hadoop.ipc.Server: Served: 
getApplicationReport, queueTime= 0 procesingTime= 1
2018-09-03 11:44:11,701 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 
2 on 8032: responding to Call#440 Retry#0 
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
from 10.65.205.148:48970
2018-09-03 11:44:11,701 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 
2 on 8032: responding to Call#440 Retry#0 
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
from 10.65.205.148:48970 Wrote 358 bytes.
2018-09-03 11:44:11,989 DEBUG org.apache.hadoop.ipc.Server:  got #3118
2018-09-03 11:44:11,990 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 
27 on 8032: Call#3118 Retry#0 
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
from 10.65.205.148:48370 for RpcKind RPC_PROTOCOL_BUFFER
2018-09-03 11:44:11,990 DEBUG org.apache.hadoop.security.UserGroupInformation: 
PrivilegedAction as:work (auth:SIMPLE) 
from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
2018-09-03 11:44:11,990 DEBUG 
org.apache.hadoop.yarn.server.security.ApplicationACLsManager: Verifying 
access-type VIEW_APP for work (auth:SIMPLE) on application 
application_1535930391687_0012 owned by work
2018-09-03 11:44:11,990 DEBUG org.apache.hadoop.ipc.Server: Served: 
getApplicationReport, queueTime= 1 procesingTime= 0
2018-09-03 11:44:11,990 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 
27 on 8032: responding to Call#3118 Retry#0 
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
from 10.65.205.148:48370
2018-09-03 11:44:11,990 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 
27 on 8032: responding to Call#3118 Retry#0 
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
from 10.65.205.148:48370 Wrote 361 bytes.
2018-09-03 11:44:12,005 DEBUG org.apache.hadoop.ipc.Server:  got #502725
2018-09-03 11:44:12,005 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 
31 on 8031: Call#502725 Retry#0 
org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 
10.65.205.150:38836 for RpcKind RPC_PROTOCOL_BUFFER
2018-09-03 11:44:12,005 DEBUG org.apache.hadoop.security.UserGroupInformation: 
PrivilegedAction as:work (auth:SIMPLE) 
from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
2018-09-03 11:44:12,006 DEBUG org.apache.hadoop.ipc.Server: Served: 
nodeHeartbeat, queueTime= 1 procesingTime= 0
2018-09-03 11:44:12,006 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 
31 on 8031: responding to Call#502725 Retry#0 
org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 
10.65.205.150:38836
2018-09-03 11:44:12,006 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: 
Dispatching the event 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStatusEvent.EventType:
 STATUS_UPDATE
2018-09-03 11:44:12,006 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 
31 on 8031: responding to Call#502725 Retry#0 
org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 
10.65.205.150:38836 Wrote 42 bytes.
2018-09-03 11:44:12,006 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing 
shopee-test-cluster02:45454 of type STATUS_UPDATE
2018-09-03 11:44:12,006 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: 
Dispatching the event 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType:
 NODE_UPDATE
2018-09-03 11:44:12,006 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: 
nodeUpdate: shopee-test-cluster02:45454 cluster capacity: <memory:1351680, 
vCores:240>
2018-09-03 11:44:12,006 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: 
Node being looked for scheduling shopee-test-cluster02:45454 availableResource: 
<memory:82944, vCores:77>
2018-09-03 11:44:12,006 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Trying to schedule on node: shopee-test-cluster02, available: <memory:82944, 
vCores:77>{code}
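
The rejection in the dump can be checked by hand. The sketch below is only one reading of the figures logged above. It assumes, without confirmation from the log, that the commit step adds the proposed reservation to root's current usage before comparing against the queue limit. The function names are hypothetical and are not YARN APIs.

```python
# Figures copied from the DEBUG dump above (memory units as logged by YARN).
CLUSTER_MEMORY = 1351680    # clusterResources, also root's maxResourceLimit
ROOT_USED_MEMORY = 1095680  # root usedResources
REQUESTED_MEMORY = 360448   # per-container ask of application_1535930391687_0019
NODE_AVAILABLE = {
    "shopee-test-cluster04": 82944,
    "shopee-test-cluster03": 90112,
    "shopee-test-cluster02": 82944,
}


def fits_on_node(available, requested):
    """True if the node has enough free memory to allocate the container now."""
    return requested <= available


def within_queue_limit(used, proposed, limit):
    """Assumed commit-time check: current usage plus the proposal must fit the limit."""
    return used + proposed <= limit


# No node can host the request, so the allocator can only reserve.
nodes_that_fit = [
    node for node, free in NODE_AVAILABLE.items()
    if fits_on_node(free, REQUESTED_MEMORY)
]

# Usage root would reach if the reservation were counted against it.
projected = ROOT_USED_MEMORY + REQUESTED_MEMORY
accepted = within_queue_limit(ROOT_USED_MEMORY, REQUESTED_MEMORY, CLUSTER_MEMORY)

print("nodes that fit:", nodes_that_fit)
print("projected root usage:", projected, "limit:", CLUSTER_MEMORY)
print("proposal accepted:", accepted)
```

With these figures no node can host the 360448 request directly, so it is reserved, and 1095680 + 360448 = 1456128 exceeds 1351680. Under the stated assumption, that would be consistent with the same reservation proposal being rejected on each node update shown in the dump.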


> CapacityScheduler infinite loop when queue is near fully utilized
> -----------------------------------------------------------------
>
>                 Key: YARN-8513
>                 URL: https://issues.apache.org/jira/browse/YARN-8513
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, yarn
>    Affects Versions: 3.1.0, 2.9.1
>         Environment: Ubuntu 14.04.5 and 16.04.4
> YARN is configured with one label and 5 queues.
>            Reporter: Chen Yufei
>            Priority: Major
>         Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, 
> jstack-5.log, top-during-lock.log, top-when-normal.log, yarn3-jstack1.log, 
> yarn3-jstack2.log, yarn3-jstack3.log, yarn3-jstack4.log, yarn3-jstack5.log, 
> yarn3-resourcemanager.log, yarn3-top
>
>
> ResourceManager sometimes stops responding to any request when a queue is 
> nearly fully utilized. Sending SIGTERM does not stop the RM; only SIGKILL 
> does. After the RM restarts, it recovers running jobs and accepts new ones.
>  
> CapacityScheduler appears to be in an infinite loop, printing the 
> following log messages (more than 25,000 lines per second):
>  
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.99816763 
> absoluteUsedCapacity=0.99816763 used=<memory:16170624, vCores:1577> 
> cluster=<memory:29441544, vCores:5792>}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1530619767030_1652_000001 
> container=null 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
>  clusterResource=<memory:29441544, vCores:5792> type=NODE_LOCAL 
> requestedPartition=}}
>  
> I have encountered this problem several times since upgrading to YARN 2.9.1, 
> while the same configuration works fine under version 2.7.3.
>  
> YARN-4477 is an infinite loop bug in FairScheduler; I am not sure whether 
> this is a similar problem.
>  
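
The rate quoted in the description (more than 25,000 lines in a second) can be measured from a ResourceManager log by counting rejection messages per timestamp second. The sketch below assumes each log record sits on one unwrapped line, unlike the wrapped text in this mail. The helper names and the threshold are hypothetical, and the sample simply repeats messages quoted in the issue.

```python
import re
from collections import Counter

# Matches the leading "YYYY-MM-DD HH:MM:SS,mmm " timestamp and captures it to the second.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ")
NEEDLE = "Failed to accept allocation proposal"


def failures_per_second(lines, needle=NEEDLE):
    """Count lines containing `needle`, grouped by timestamp second."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and needle in line:
            counts[match.group(1)] += 1
    return counts


def suspicious_seconds(counts, threshold):
    """Return the seconds whose failure count reaches the (hypothetical) threshold."""
    return sorted(sec for sec, n in counts.items() if n >= threshold)


# Illustrative sample: unwrapped copies of messages quoted in the issue.
prefix = "2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity."
sample = [
    prefix + "ParentQueue: assignedContainer queue=root usedCapacity=0.99816763",
    prefix + "CapacityScheduler: Failed to accept allocation proposal",
    prefix + "CapacityScheduler: Failed to accept allocation proposal",
    prefix + "CapacityScheduler: Failed to accept allocation proposal",
]

counts = failures_per_second(sample)
print(dict(counts))
print(suspicious_seconds(counts, threshold=3))
```

On a real log, passing an open file object as `lines` would work the same way; a threshold in the thousands would match the rate the reporter describes.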



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
