[jira] [Commented] (YARN-10492) deadlock in rm
    [ https://issues.apache.org/jira/browse/YARN-10492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235828#comment-17235828 ]

jufeng li commented on YARN-10492:
----------------------------------

We are running the same version (HDP-3.1.5.0-152, hadoop3.1) and hit the same issue. I have solved it; do you want the patch?

> deadlock in rm
> --------------
>
> Key: YARN-10492
> URL: https://issues.apache.org/jira/browse/YARN-10492
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.1
> Reporter: brick yang
> Priority: Critical
> Labels: 3.1.1
>
> version: HDP-3.1.5.0-152 hadoop3.1
> capacity scheduler
> yarn sometimes does not change to active
> we found that the jstack dump shows a deadlock:
> "IPC Server handler 44 on 8030" #316 daemon prio=5 os_prio=0 tid=0x7fee8216e800 nid=0x63edc waiting for monitor entry [0x7fee09633000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:323)
>     - waiting to lock <0x00043e2e19d0> (a org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock)
>     at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.finishApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:75)
>     at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:97)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>
> "IPC Server handler 8 on 8030" #280 daemon prio=5 os_prio=0 tid=0x7fee83823800 nid=0x63eb8 waiting on condition [0x7fee0ba57000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x0003c0d0d6c0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>     at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1664)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1997)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:676)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.releaseContainers(AbstractYarnScheduler.java:753)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocate(CapacityScheduler.java:1182)
>     at org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:279)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.SchedulerPlacementProcessor.allocate(SchedulerPlacementProcessor.java:53)
>     at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>     at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:433)
>     - locked <0x00043e2e19d0> (a org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock)
>     at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>     at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server
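In the quoted dump, handler 8 holds the AllocateResponseLock monitor while parked on a ReentrantReadWriteLock write lock, and handler 44 is blocked on that same monitor. The excerpt does not show which thread holds the read-write lock, so the following is only a toy reproduction of that lock shape, not ResourceManager code: the class name `DeadlockProbe`, the two thread names, and the assumption that the lock holder in turn wants the monitor are all hypothetical. It shows how the JDK's `ThreadMXBean.findDeadlockedThreads()` reports a cycle that mixes an object monitor with a write lock.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class DeadlockProbe {

    /** Builds a two-thread lock cycle and returns how many of those two threads the JVM reports as deadlocked. */
    static int countDeadlockedThreads() {
        // Hypothetical stand-ins, not the real ResourceManager objects.
        final Object responseLock = new Object();                               // plays the AllocateResponseLock monitor
        final ReentrantReadWriteLock queueLock = new ReentrantReadWriteLock();  // plays the lock handler 8 parks on
        final CountDownLatch firstLocksHeld = new CountDownLatch(2);

        // Same shape as "IPC Server handler 8": holds the monitor, then wants the write lock.
        Thread allocateLike = new Thread(() -> {
            synchronized (responseLock) {
                firstLocksHeld.countDown();
                awaitQuietly(firstLocksHeld);
                queueLock.writeLock().lock();   // parks forever
            }
        }, "allocate-like");

        // Assumed counterpart: holds the write lock, then wants the monitor.
        Thread holderLike = new Thread(() -> {
            queueLock.writeLock().lock();
            firstLocksHeld.countDown();
            awaitQuietly(firstLocksHeld);
            synchronized (responseLock) {       // blocks forever
                queueLock.writeLock().unlock();
            }
        }, "holder-like");

        // Daemon threads, so the stuck pair does not keep the JVM alive.
        allocateLike.setDaemon(true);
        holderLike.setDaemon(true);
        allocateLike.start();
        holderLike.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 100; attempt++) {
            long[] ids = mx.findDeadlockedThreads();   // null when no cycle exists
            int mine = 0;
            if (ids != null) {
                for (long id : ids) {
                    if (id == allocateLike.getId() || id == holderLike.getId()) {
                        mine++;
                    }
                }
            }
            if (mine == 2) {
                return mine;
            }
            sleepQuietly(50);
        }
        return 0;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads found: " + countDeadlockedThreads());
    }
}
```

The locks are local to the method so that repeated calls build independent cycles. A cycle that runs through a read lock rather than a write lock would not necessarily be reported this way, so an empty result does not rule out a hang.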
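[jira] [Updated] (YARN-10483) yarn hangs, jobs cannot be submitted, and it recovers only after switching the active RM or restarting
    [ https://issues.apache.org/jira/browse/YARN-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jufeng li updated YARN-10483:
-----------------------------
    Description: YARN hangs intermittently and new jobs cannot be submitted. The jstack logs show capacity scheduler threads waiting indefinitely for a lock, while the RM's CPU, memory, network, and disk are all normal. The problem is almost certainly in the capacity scheduler's internal locking. RM jstack logs from both the normal state and the hung state have been uploaded. I hope someone can fix this; the bug is serious and makes production unusable. Nobody has answered; I will come back and ask again later.  (was: YARN hangs intermittently and new jobs cannot be submitted. The jstack logs show capacity scheduler threads waiting indefinitely for a lock, while the RM's CPU, memory, network, and disk are all normal. The problem is almost certainly in the capacity scheduler's internal locking. RM jstack logs from both the normal state and the hung state have been uploaded. I hope someone can fix this; the bug is serious and makes production unusable. If nobody answers, I will come back and ask again later.)

> yarn hangs, jobs cannot be submitted, and it recovers only after switching the active RM or restarting
> ------------------------------------------------------------------------------------------------------
>
> Key: YARN-10483
> URL: https://issues.apache.org/jira/browse/YARN-10483
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, capacityscheduler, resourcemanager, RM
> Affects Versions: 3.1.1
> Reporter: jufeng li
> Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
> YARN hangs intermittently and new jobs cannot be submitted. The jstack logs show capacity scheduler threads waiting indefinitely for a lock, while the RM's CPU, memory, network, and disk are all normal. The problem is almost certainly in the capacity scheduler's internal locking. RM jstack logs from both the normal state and the hung state have been uploaded. I hope someone can fix this; the bug is serious and makes production unusable. Nobody has answered; I will come back and ask again later.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)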
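The issue attaches one RM jstack dump from the normal state and one from the hung state. A quick first comparison of two such dumps is to tally threads by state; the sketch below assumes only the standard `java.lang.Thread.State: ...` line that jstack prints for each Java thread. The class name `JstackStateSummary` and its method are hypothetical, and nothing here is taken from the attached files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JstackStateSummary {

    // Matches e.g. "java.lang.Thread.State: BLOCKED (on object monitor)" and captures "BLOCKED".
    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    /** Counts how many threads in the dump text are in each thread state. */
    static Map<String, Integer> countStates(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher matcher = STATE.matcher(dump);
        while (matcher.find()) {
            counts.merge(matcher.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JstackStateSummary RM_normal_state.stack RM_unnormal_state.stack
        for (String file : args) {
            String dump = new String(Files.readAllBytes(Paths.get(file)));
            System.out.println(file + ": " + countStates(dump));
        }
    }
}
```

A jump in BLOCKED or WAITING counts between the two files only narrows the search; the stack frames of those threads still have to be read to see which lock they are queued on.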
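[jira] [Updated] (YARN-10483) yarn hangs, jobs cannot be submitted, and it recovers only after switching the active RM or restarting
    [ https://issues.apache.org/jira/browse/YARN-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jufeng li updated YARN-10483:
-----------------------------
    Description: YARN hangs intermittently and new jobs cannot be submitted. The jstack logs show capacity scheduler threads waiting indefinitely for a lock, while the RM's CPU, memory, network, and disk are all normal. The problem is almost certainly in the capacity scheduler's internal locking. RM jstack logs from both the normal state and the hung state have been uploaded. I hope someone can fix this; the bug is serious and makes production unusable. If nobody answers, I will come back and ask again later.  (was: YARN hangs intermittently and new jobs cannot be submitted. The jstack logs show capacity scheduler threads waiting indefinitely for a lock, while the RM's CPU, memory, network, and disk are all normal. The problem is almost certainly in the capacity scheduler's internal locking [sic: the original had the typo 除了 for 出了 here]. RM jstack logs from both the normal state and the hung state have been uploaded. I hope someone can fix this; the bug is serious and makes production unusable. If nobody answers, I will come back and ask again later.)

> yarn hangs, jobs cannot be submitted, and it recovers only after switching the active RM or restarting
> ------------------------------------------------------------------------------------------------------
>
> Key: YARN-10483
> URL: https://issues.apache.org/jira/browse/YARN-10483
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, capacityscheduler, resourcemanager, RM
> Affects Versions: 3.1.1
> Reporter: jufeng li
> Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
> YARN hangs intermittently and new jobs cannot be submitted. The jstack logs show capacity scheduler threads waiting indefinitely for a lock, while the RM's CPU, memory, network, and disk are all normal. The problem is almost certainly in the capacity scheduler's internal locking. RM jstack logs from both the normal state and the hung state have been uploaded. I hope someone can fix this; the bug is serious and makes production unusable. If nobody answers, I will come back and ask again later.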
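[jira] [Updated] (YARN-10483) yarn hangs, jobs cannot be submitted, and it recovers only after switching the active RM or restarting
    [ https://issues.apache.org/jira/browse/YARN-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jufeng li updated YARN-10483:
-----------------------------
    Description: YARN hangs intermittently and new jobs cannot be submitted. The jstack logs show capacity scheduler threads waiting indefinitely for a lock, while the RM's CPU, memory, network, and disk are all normal. The problem is almost certainly in the capacity scheduler's internal locking. RM jstack logs from both the normal state and the hung state have been uploaded. I hope someone can fix this; the bug is serious and makes production unusable. If nobody answers, I will come back and ask again later.  (was: YARN hangs intermittently and new jobs cannot be submitted. The jstack logs show capacity scheduler threads waiting indefinitely for a lock, while the RM's CPU, memory, network, and disk are all normal. The problem is almost certainly in the capacity scheduler's internal locking. The jstack logs have been uploaded. I hope someone can fix this; the bug is serious and makes production unusable. If nobody answers, I will come back and ask again later.)

> yarn hangs, jobs cannot be submitted, and it recovers only after switching the active RM or restarting
> ------------------------------------------------------------------------------------------------------
>
> Key: YARN-10483
> URL: https://issues.apache.org/jira/browse/YARN-10483
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, capacityscheduler, resourcemanager, RM
> Affects Versions: 3.1.1
> Reporter: jufeng li
> Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
> YARN hangs intermittently and new jobs cannot be submitted. The jstack logs show capacity scheduler threads waiting indefinitely for a lock, while the RM's CPU, memory, network, and disk are all normal. The problem is almost certainly in the capacity scheduler's internal locking. RM jstack logs from both the normal state and the hung state have been uploaded. I hope someone can fix this; the bug is serious and makes production unusable. If nobody answers, I will come back and ask again later.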
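[jira] [Updated] (YARN-10483) yarn hangs, jobs cannot be submitted, and it recovers only after switching the active RM or restarting
    [ https://issues.apache.org/jira/browse/YARN-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jufeng li updated YARN-10483:
-----------------------------
    Attachment: RM_unnormal_state.stack

> yarn hangs, jobs cannot be submitted, and it recovers only after switching the active RM or restarting
> ------------------------------------------------------------------------------------------------------
>
> Key: YARN-10483
> URL: https://issues.apache.org/jira/browse/YARN-10483
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, capacityscheduler, resourcemanager, RM
> Affects Versions: 3.1.1
> Reporter: jufeng li
> Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
> YARN hangs intermittently and new jobs cannot be submitted. The jstack logs show capacity scheduler threads waiting indefinitely for a lock, while the RM's CPU, memory, network, and disk are all normal. The problem is almost certainly in the capacity scheduler's internal locking. The jstack logs have been uploaded. I hope someone can fix this; the bug is serious and makes production unusable. If nobody answers, I will come back and ask again later.

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org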
[jira] [Created] (YARN-10483) yarn hangs, jobs cannot be submitted, and it recovers only after switching the active RM or restarting
jufeng li created YARN-10483:
-----------------------------

Summary: yarn hangs, jobs cannot be submitted, and it recovers only after switching the active RM or restarting
Key: YARN-10483
URL: https://issues.apache.org/jira/browse/YARN-10483
Project: Hadoop YARN
Issue Type: Bug
Components: capacity scheduler, capacityscheduler, resourcemanager, RM
Affects Versions: 3.1.1
Reporter: jufeng li
Attachments: RM_normal_state.stack, RM_unnormal_state.stack

YARN hangs intermittently and new jobs cannot be submitted. The jstack logs show capacity scheduler threads waiting indefinitely for a lock, while the RM's CPU, memory, network, and disk are all normal. The problem is almost certainly in the capacity scheduler's internal locking. The jstack logs have been uploaded. I hope someone can fix this; the bug is serious and makes production unusable. If nobody answers, I will come back and ask again later.
[jira] [Updated] (YARN-10483) yarn hangs, jobs cannot be submitted, and it recovers only after switching the active RM or restarting
    [ https://issues.apache.org/jira/browse/YARN-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jufeng li updated YARN-10483:
-----------------------------
    Attachment: RM_normal_state.stack

> yarn hangs, jobs cannot be submitted, and it recovers only after switching the active RM or restarting
> ------------------------------------------------------------------------------------------------------
>
> Key: YARN-10483
> URL: https://issues.apache.org/jira/browse/YARN-10483
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, capacityscheduler, resourcemanager, RM
> Affects Versions: 3.1.1
> Reporter: jufeng li
> Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
> YARN hangs intermittently and new jobs cannot be submitted. The jstack logs show capacity scheduler threads waiting indefinitely for a lock, while the RM's CPU, memory, network, and disk are all normal. The problem is almost certainly in the capacity scheduler's internal locking. The jstack logs have been uploaded. I hope someone can fix this; the bug is serious and makes production unusable. If nobody answers, I will come back and ask again later.
[jira] [Commented] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
    [ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226463#comment-17226463 ]

jufeng li commented on YARN-10440:
----------------------------------

[~joepep] I set it, but it had no effect. You can check my jstack log.

> resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
> -------------------------------------------------------------------------------------------
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.1
> Reporter: jufeng li
> Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes are normal.
> I can open x:8088/cluster/apps/RUNNING but not x:8088/cluster/scheduler. Apps that were already submitted cannot finish, and new apps cannot be submitted. Everything hangs except the RM and NM server processes. How can I fix this? Help me, please!
>
> here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer
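The quoted log repeats the same "Failed to accept allocation proposal" message for one application attempt several times within a single millisecond. To quantify that kind of retry loop across a full RM log, the failures can be tallied per timestamp. This is a minimal sketch, assuming one log record per line and the timestamp layout shown in the excerpt; the class name `CommitFailureTally` and its method are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommitFailureTally {

    // Captures the leading timestamp (e.g. "2020-09-17 00:22:25,679") of a tryCommit failure record.
    private static final Pattern FAILURE = Pattern.compile(
            "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}) INFO .*Failed to accept allocation proposal");

    /** Returns, in first-seen order, how many failure records carry each millisecond timestamp. */
    static Map<String, Integer> failuresPerMillisecond(List<String> lines) {
        Map<String, Integer> tally = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher matcher = FAILURE.matcher(line);
            if (matcher.find()) {
                tally.merge(matcher.group(1), 1, Integer::sum);
            }
        }
        return tally;
    }

    public static void main(String[] args) {
        // Two records copied from the excerpt above; a real run would read the RM log file instead.
        List<String> sample = Arrays.asList(
                "2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal",
                "2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal");
        System.out.println(failuresPerMillisecond(sample));
    }
}
```

Many failures under one timestamp suggest the scheduler is retrying the same proposal in a tight loop, but the tally alone does not say why each commit was rejected.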
[jira] [Commented] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
    [ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226002#comment-17226002 ]

jufeng li commented on YARN-10440:
----------------------------------

This issue happened hours ago. I uploaded the RM JVM stack.

> resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
> -------------------------------------------------------------------------------------------
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.1
> Reporter: jufeng li
> Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes are normal.
> I can open x:8088/cluster/apps/RUNNING but not x:8088/cluster/scheduler. Apps that were already submitted cannot finish, and new apps cannot be submitted. Everything hangs except the RM and NM server processes. How can I fix this? Help me, please!
>
> here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer applicati
[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
    [ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jufeng li updated YARN-10440:
-----------------------------
    Attachment:     (was: rm_2020-09-26-2.dump)

> resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
> -------------------------------------------------------------------------------------------
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.1
> Reporter: jufeng li
> Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes are normal.
> I can open x:8088/cluster/apps/RUNNING but not x:8088/cluster/scheduler. Apps that were already submitted cannot finish, and new apps cannot be submitted. Everything hangs except the RM and NM server processes. How can I fix this? Help me, please!
>
> here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null
[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
    [ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jufeng li updated YARN-10440:
-----------------------------
    Attachment: RM_unnormal_state.stack

> resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
> -------------------------------------------------------------------------------------------
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.1
> Reporter: jufeng li
> Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes are normal.
> I can open x:8088/cluster/apps/RUNNING but not x:8088/cluster/scheduler. Apps that were already submitted cannot finish, and new apps cannot be submitted. Everything hangs except the RM and NM server processes. How can I fix this? Help me, please!
>
> here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=ti
[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
[ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jufeng li updated YARN-10440:
-----------------------------
    Attachment: RM_normal_state.stack
> resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-10440
>                 URL: https://issues.apache.org/jira/browse/YARN-10440
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.1
>            Reporter: jufeng li
>            Priority: Blocker
>         Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> The RM hangs and I cannot submit any new jobs, although the RM and NM processes are running normally.
> I can open x:8088/cluster/apps/RUNNING but not x:8088/cluster/scheduler. Applications that were already
> submitted cannot finish, and new applications cannot be submitted. Everything hangs except the RM and NM
> server processes themselves. How can I fix this? Please help!
>
> Here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tian
[jira] [Updated] (YARN-10482) Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal
[ https://issues.apache.org/jira/browse/YARN-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jufeng li updated YARN-10482:
-----------------------------
    Description: The Capacity Scheduler seems to be locked: no new jobs can be submitted to the RM, and manually switching the active RM returns things to normal. This is a serious bug! I checked the stack log and found some information about *ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps taken while the RM was in its normal and abnormal states. The RM hangs forever until I restart it or manually switch the active RM!  (was: The Capacity Scheduler seems to be locked: no new jobs can be submitted to the RM, and manually switching the active RM returns things to normal. This is a serious bug! I checked the stack log and found some information about *ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps taken while the RM was in its normal and abnormal states.)
> Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10482
>                 URL: https://issues.apache.org/jira/browse/YARN-10482
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler, resourcemanager, RM
>    Affects Versions: 3.1.1
>            Reporter: jufeng li
>            Priority: Blocker
>         Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> The Capacity Scheduler seems to be locked: no new jobs can be submitted to the RM, and manually switching
> the active RM returns things to normal. This is a serious bug! I checked the stack log and found some
> information about *ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps taken while
> the RM was in its normal and abnormal states. The RM hangs forever until I restart it or manually switch
> the active RM!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10482) Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal
[ https://issues.apache.org/jira/browse/YARN-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jufeng li updated YARN-10482:
-----------------------------
    Attachment:     (was: RM_normal_state.stack)
> Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10482
>                 URL: https://issues.apache.org/jira/browse/YARN-10482
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler, resourcemanager, RM
>    Affects Versions: 3.1.1
>            Reporter: jufeng li
>            Priority: Blocker
>         Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> The Capacity Scheduler seems to be locked: no new jobs can be submitted to the RM, and manually switching
> the active RM returns things to normal. This is a serious bug! I checked the stack log and found some
> information about *ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps taken while
> the RM was in its normal and abnormal states.
[jira] [Updated] (YARN-10482) Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal
[ https://issues.apache.org/jira/browse/YARN-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jufeng li updated YARN-10482:
-----------------------------
    Attachment: RM_unnormal_state.stack
> Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10482
>                 URL: https://issues.apache.org/jira/browse/YARN-10482
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler, resourcemanager, RM
>    Affects Versions: 3.1.1
>            Reporter: jufeng li
>            Priority: Blocker
>         Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> The Capacity Scheduler seems to be locked: no new jobs can be submitted to the RM, and manually switching
> the active RM returns things to normal. This is a serious bug! I checked the stack log and found some
> information about *ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps taken while
> the RM was in its normal and abnormal states.
[jira] [Updated] (YARN-10482) Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal
[ https://issues.apache.org/jira/browse/YARN-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jufeng li updated YARN-10482:
-----------------------------
    Attachment: RM_normal_state.stack
> Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10482
>                 URL: https://issues.apache.org/jira/browse/YARN-10482
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler, resourcemanager, RM
>    Affects Versions: 3.1.1
>            Reporter: jufeng li
>            Priority: Blocker
>         Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> The Capacity Scheduler seems to be locked: no new jobs can be submitted to the RM, and manually switching
> the active RM returns things to normal. This is a serious bug! I checked the stack log and found some
> information about *ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps taken while
> the RM was in its normal and abnormal states.
[jira] [Updated] (YARN-10482) Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal
[ https://issues.apache.org/jira/browse/YARN-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jufeng li updated YARN-10482:
-----------------------------
    Attachment: RM_normal_state.stack
> Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10482
>                 URL: https://issues.apache.org/jira/browse/YARN-10482
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler, resourcemanager, RM
>    Affects Versions: 3.1.1
>            Reporter: jufeng li
>            Priority: Blocker
>         Attachments: RM_normal_state.stack
>
>
> The Capacity Scheduler seems to be locked: no new jobs can be submitted to the RM, and manually switching
> the active RM returns things to normal. This is a serious bug! I checked the stack log and found some
> information about *ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps taken while
> the RM was in its normal and abnormal states.
[jira] [Created] (YARN-10482) Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal
jufeng li created YARN-10482:
--------------------------------
             Summary: Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal
                 Key: YARN-10482
                 URL: https://issues.apache.org/jira/browse/YARN-10482
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler, capacityscheduler, resourcemanager, RM
    Affects Versions: 3.1.1
            Reporter: jufeng li
The Capacity Scheduler seems to be locked: no new jobs can be submitted to the RM, and manually switching the active RM returns things to normal. This is a serious bug! I checked the stack log and found some information about *ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps taken while the RM was in its normal and abnormal states.
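The report above mentions finding *ReentrantReadWriteLock* entries in the RM stack dumps. As a rough illustration of how a JVM can be asked whether it sees a lock cycle, here is a minimal sketch using the standard `java.lang.management` API. The class name `DeadlockProbe` is hypothetical and is not part of Hadoop or YARN. The probe only inspects the JVM it runs in, so checking a live RM this way would need it to run in-process or be adapted to a JMX connection, which is an assumption beyond what the issue describes.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop or YARN.
final class DeadlockProbe {

    // Returns one line per thread that this JVM reports as part of a lock cycle.
    static List<String> findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean monitors = mx.isObjectMonitorUsageSupported();
        boolean synchronizers = mx.isSynchronizerUsageSupported();

        // findDeadlockedThreads() covers object monitors and ownable synchronizers
        // (for example the write lock of a ReentrantReadWriteLock); fall back to
        // the monitor-only check when synchronizer monitoring is unavailable.
        long[] ids = synchronizers
                ? mx.findDeadlockedThreads()
                : mx.findMonitorDeadlockedThreads();

        List<String> report = new ArrayList<>();
        if (ids == null) {
            return report; // the JVM found no cycle
        }
        for (ThreadInfo info : mx.getThreadInfo(ids, monitors, synchronizers)) {
            if (info == null) {
                continue; // thread exited between the two calls
            }
            report.add(info.getThreadName()
                    + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> report = findDeadlocks();
        if (report.isEmpty()) {
            System.out.println("no deadlock cycle detected");
        } else {
            report.forEach(System.out::println);
        }
    }
}
```

An empty result does not rule out a hang: a thread that holds a lock and never releases it is not a cycle, and waits behind read-lock holders may not be reported, since read locks have no single owner to track. Comparing stack dumps from the normal and abnormal states, as the reporter did, remains the more direct evidence.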
[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
[ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jufeng li updated YARN-10440:
-----------------------------
    Attachment: rm_2020-09-26-2.dump
> resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-10440
>                 URL: https://issues.apache.org/jira/browse/YARN-10440
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.1
>            Reporter: jufeng li
>            Priority: Blocker
>         Attachments: rm_2020-09-26-2.dump
>
>
> The RM hangs and I cannot submit any new jobs, although the RM and NM processes are running normally.
> I can open x:8088/cluster/apps/RUNNING but not x:8088/cluster/scheduler. Applications that were already
> submitted cannot finish, and new applications cannot be submitted. Everything hangs except the RM and NM
> server processes themselves. How can I fix this? Please help!
>
> Here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCo
[jira] [Commented] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
[ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202478#comment-17202478 ]
jufeng li commented on YARN-10440:
----------------------------------
I set yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments=100, but it happened again last night.
> resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-10440
>                 URL: https://issues.apache.org/jira/browse/YARN-10440
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.1
>            Reporter: jufeng li
>            Priority: Blocker
>
> The RM hangs and I cannot submit any new jobs, although the RM and NM processes are running normally.
> I can open x:8088/cluster/apps/RUNNING but not x:8088/cluster/scheduler. Applications that were already
> submitted cannot finish, and new applications cannot be submitted. Everything hangs except the RM and NM
> server processes themselves. How can I fix this? Please help!
>
> Here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=a
[jira] [Reopened] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
[ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jufeng li reopened YARN-10440:
------------------------------
> resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-10440
>                 URL: https://issues.apache.org/jira/browse/YARN-10440
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.1
>            Reporter: jufeng li
>            Priority: Blocker
>
> The RM hangs and I cannot submit any new jobs, although the RM and NM processes are running normally.
> I can open x:8088/cluster/apps/RUNNING but not x:8088/cluster/scheduler. Applications that were already
> submitted cannot finish, and new applications cannot be submitted. Everything hangs except the RM and NM
> server processes themselves. How can I fix this? Please help!
>
> Here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO cap
[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
[ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jufeng li updated YARN-10440:
-----------------------------
    Description:
The RM hangs and I cannot submit any new jobs, although the RM and NM processes are running normally. I can open x:8088/cluster/apps/RUNNING but not x:8088/cluster/scheduler. Applications that were already submitted cannot finish, and new applications cannot be submitted. Everything hangs except the RM and NM server processes themselves. How can I fix this? Please help!
Here is the log:
{code:java}
// code placeholder
ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContai
[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
[ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jufeng li updated YARN-10440:
-
Description:
RM hangs, and I cannot submit any new jobs, but the RM and NM processes are running normally. I can open x:8088/cluster/apps/RUNNING but cannot open x:8088/cluster/scheduler. Apps that were already submitted never finish, and new apps cannot be submitted. Everything hangs, although the RM and NM servers themselves stay up. How can I fix this? Please help!

Here is the log:
{code:java}
ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer appl
[jira] [Commented] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
[ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197472#comment-17197472 ]

jufeng li commented on YARN-10440:
--

I restarted the RM and it recovered, so the information is gone. However, I dumped the heap last time.

> resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
> --
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.1
> Reporter: jufeng li
> Priority: Blocker
>
> RM hangs, and I cannot submit any new jobs, but the RM and NM processes are running normally.
> I can open x:8088/cluster/apps/RUNNING but cannot open x:8088/cluster/scheduler.
> Apps that were already submitted never finish, and new apps cannot be submitted.
> Everything hangs, although the RM and NM servers themselves stay up.
> How can I fix this? Please help!

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
jufeng li created YARN-10440:

Summary: resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
Key: YARN-10440
URL: https://issues.apache.org/jira/browse/YARN-10440
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 3.1.1
Reporter: jufeng li

RM hangs, and I cannot submit any new jobs, but the RM and NM processes are running normally. I can open x:8088/cluster/apps/RUNNING but cannot open x:8088/cluster/scheduler. Apps that were already submitted never finish, and new apps cannot be submitted. Everything hangs, although the RM and NM servers themselves stay up. How can I fix this? Please help!