[jira] [Commented] (YARN-10492) deadlock in rm

2020-11-19 Thread jufeng li (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235828#comment-17235828
 ] 

jufeng li commented on YARN-10492:
--

We are using the same version (HDP-3.1.5.0-152, Hadoop 3.1) and hit the same 
issue. I have solved it; do you want the patch?

> deadlock in rm 
> ---
>
> Key: YARN-10492
> URL: https://issues.apache.org/jira/browse/YARN-10492
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: brick yang
>Priority: Critical
>  Labels: 3.1.1
>
> version: HDP-3.1.5.0-152, Hadoop 3.1
> scheduler: capacity scheduler
> YARN sometimes fails to transition to active.
> We found a deadlock in the jstack dump:
> "IPC Server handler 44 on 8030" #316 daemon prio=5 os_prio=0 
> tid=0x7fee8216e800 nid=0x63edc waiting for monitor entry 
> [0x7fee09633000]
>  java.lang.Thread.State: BLOCKED (on object monitor)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:323)
>  - waiting to lock <0x00043e2e19d0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.finishApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:75)
>  at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:97)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>
> "IPC Server handler 8 on 8030" #280 daemon prio=5 os_prio=0 
> tid=0x7fee83823800 nid=0x63eb8 waiting on condition [0x7fee0ba57000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0003c0d0d6c0> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1664)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1997)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:676)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.releaseContainers(AbstractYarnScheduler.java:753)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocate(CapacityScheduler.java:1182)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:279)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.SchedulerPlacementProcessor.allocate(SchedulerPlacementProcessor.java:53)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:433)
>  - locked <0x00043e2e19d0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>  at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server
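
In the two stacks quoted above, handler 8 holds the AllocateResponseLock 
monitor (locked in ApplicationMasterService.allocate) while parked on the 
LeafQueue write lock, and handler 44 is blocked on that same monitor. The 
thread that holds the LeafQueue lock is not visible in the truncated dump. 
As a minimal sketch, and not the patch mentioned in the comment, a JVM can be 
probed for lock cycles with the standard ThreadMXBean API. The class name 
DeadlockProbe and its method are hypothetical names chosen for illustration. 
ThreadMXBean only reports cycles through object monitors and exclusively 
owned synchronizers, so a wait chain that runs through a read lock would not 
show up here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it is not part of Hadoop.
class DeadlockProbe {

    // Returns one line per deadlocked thread, or an empty list if the JVM
    // reports no cycle.
    static List<String> describeDeadlockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Looks for cycles over object monitors and ownable synchronizers;
        // returns null when no deadlocked threads are found.
        long[] ids = mx.findDeadlockedThreads();
        List<String> report = new ArrayList<>();
        if (ids == null) {
            return report;
        }
        // Request locked monitors and locked synchronizers for each thread.
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info == null) {
                continue; // the thread exited between the two calls
            }
            report.add(info.getThreadName()
                    + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> report = describeDeadlockedThreads();
        if (report.isEmpty()) {
            System.out.println("no deadlock detected");
        } else {
            report.forEach(System.out::println);
        }
    }
}
```

Run inside the affected process (for example from a diagnostic thread), this 
gives the same kind of information as the lock lines in a jstack dump.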

[jira] [Updated] (YARN-10483) YARN hangs, jobs cannot be submitted; recovers only after switching the active RM or restarting

2020-11-04 Thread jufeng li (Jira)

 [ 
https://issues.apache.org/jira/browse/YARN-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10483:
-
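Description: YARN hangs intermittently and new jobs cannot be submitted. The 
jstack output shows capacity scheduler threads waiting indefinitely for a 
lock, while the RM's CPU, memory, network, and disk are all normal. The 
problem is almost certainly in the capacity scheduler's internal locking. 
jstack dumps of the RM in the normal state and in the hung state have been 
uploaded. I hope someone can fix this; the bug is serious and makes 
production unusable. Nobody has answered; I will ask again later.
  (was: YARN hangs intermittently and new jobs cannot be submitted. The 
jstack output shows capacity scheduler threads waiting indefinitely for a 
lock, while the RM's CPU, memory, network, and disk are all normal. The 
problem is almost certainly in the capacity scheduler's internal locking. 
jstack dumps of the RM in the normal state and in the hung state have been 
uploaded. I hope someone can fix this; the bug is serious and makes 
production unusable. If nobody answers, I will ask again later.)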

> YARN hangs, jobs cannot be submitted; recovers only after switching the active RM or restarting
> --
>
> Key: YARN-10483
> URL: https://issues.apache.org/jira/browse/YARN-10483
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler, resourcemanager, 
> RM
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
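> YARN hangs intermittently and new jobs cannot be submitted. The jstack 
> output shows capacity scheduler threads waiting indefinitely for a lock, 
> while the RM's CPU, memory, network, and disk are all normal. The problem 
> is almost certainly in the capacity scheduler's internal locking. jstack 
> dumps of the RM in the normal state and in the hung state have been 
> uploaded. I hope someone can fix this; the bug is serious and makes 
> production unusable. Nobody has answered; I will ask again later.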



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (YARN-10483) YARN hangs, jobs cannot be submitted; recovers only after switching the active RM or restarting

2020-11-04 Thread jufeng li (Jira)

 [ 
https://issues.apache.org/jira/browse/YARN-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10483:
-
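Description: YARN hangs intermittently and new jobs cannot be submitted. The 
jstack output shows capacity scheduler threads waiting indefinitely for a 
lock, while the RM's CPU, memory, network, and disk are all normal. The 
problem is almost certainly in the capacity scheduler's internal locking. 
jstack dumps of the RM in the normal state and in the hung state have been 
uploaded. I hope someone can fix this; the bug is serious and makes 
production unusable. If nobody answers, I will ask again later.
  (was: YARN hangs intermittently and new jobs cannot be submitted. The 
jstack output shows capacity scheduler threads waiting indefinitely for a 
lock, while the RM's CPU, memory, network, and disk are all normal. The 
problem is almost certainly in the capacity scheduler's internal locking 
[sic: 除了问题, corrected to 出了问题]. jstack dumps of the RM in the normal 
state and in the hung state have been uploaded. I hope someone can fix this; 
the bug is serious and makes production unusable. If nobody answers, I will 
ask again later.)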


[jira] [Updated] (YARN-10483) YARN hangs, jobs cannot be submitted; recovers only after switching the active RM or restarting

2020-11-04 Thread jufeng li (Jira)

 [ 
https://issues.apache.org/jira/browse/YARN-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10483:
-
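Description: YARN hangs intermittently and new jobs cannot be submitted. The 
jstack output shows capacity scheduler threads waiting indefinitely for a 
lock, while the RM's CPU, memory, network, and disk are all normal. The 
problem is almost certainly in the capacity scheduler's internal locking. 
jstack dumps of the RM in the normal state and in the hung state have been 
uploaded. I hope someone can fix this; the bug is serious and makes 
production unusable. If nobody answers, I will ask again later.
  (was: YARN hangs intermittently and new jobs cannot be submitted. The 
jstack output shows capacity scheduler threads waiting indefinitely for a 
lock, while the RM's CPU, memory, network, and disk are all normal. The 
problem is almost certainly in the capacity scheduler's internal locking. 
The jstack logs have been uploaded. I hope someone can fix this; the bug is 
serious and makes production unusable. If nobody answers, I will ask again 
later.)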


[jira] [Updated] (YARN-10483) YARN hangs, jobs cannot be submitted; recovers only after switching the active RM or restarting

2020-11-04 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10483:
-
Attachment: RM_unnormal_state.stack


-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10483) YARN hangs, jobs cannot be submitted; recovers only after switching the active RM or restarting

2020-11-04 Thread jufeng li (Jira)
jufeng li created YARN-10483:


 Summary: YARN hangs, jobs cannot be submitted; recovers only after switching the active RM or restarting
 Key: YARN-10483
 URL: https://issues.apache.org/jira/browse/YARN-10483
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, capacityscheduler, resourcemanager, RM
Affects Versions: 3.1.1
Reporter: jufeng li
 Attachments: RM_normal_state.stack, RM_unnormal_state.stack

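YARN hangs intermittently and new jobs cannot be submitted. The jstack 
output shows capacity scheduler threads waiting indefinitely for a lock, 
while the RM's CPU, memory, network, and disk are all normal. The problem is 
almost certainly in the capacity scheduler's internal locking. The jstack 
logs have been uploaded. I hope someone can fix this; the bug is serious and 
makes production unusable. If nobody answers, I will ask again later.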






[jira] [Updated] (YARN-10483) YARN hangs, jobs cannot be submitted; recovers only after switching the active RM or restarting

2020-11-04 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10483:
-
Attachment: RM_normal_state.stack




[jira] [Commented] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-11-04 Thread jufeng li (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226463#comment-17226463
 ] 

jufeng li commented on YARN-10440:
--

[~joepep] I set it, but it had no effect. You can check my jstack log.

> resource manager hangs,and i cannot submit any new jobs,but rm and nm 
> processes are normal
> --
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes 
> look normal. I can open x:8088/cluster/apps/RUNNING but not 
> x:8088/cluster/scheduler. Apps that were already submitted never finish, 
> and new apps cannot be submitted. Everything hangs, yet the RM and NM 
> servers stay up. How can I fix this? Please help!
>  
> here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang 
> clusterResource= type=NODE_LOCAL 
> requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer 

[jira] [Commented] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-11-04 Thread jufeng li (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226002#comment-17226002
 ] 

jufeng li commented on YARN-10440:
--

This issue happened hours ago. I have uploaded the RM JVM stack.


[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-11-04 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10440:
-
Attachment: (was: rm_2020-09-26-2.dump)


[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-11-04 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10440:
-
Attachment: RM_unnormal_state.stack


[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-11-04 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10440:
-
Attachment: RM_normal_state.stack

> resource manager hangs,and i cannot submit any new jobs,but rm and nm 
> processes are normal
> --
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes 
> are running normally. I can open  x:8088/cluster/apps/RUNNING but cannot open 
> x:8088/cluster/scheduler. Applications already submitted cannot finish, and 
> new applications cannot be submitted. Everything hangs, yet the RM and NM 
> servers stay up. How can I fix this? Please help!
>  
> here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang 
> clusterResource= type=NODE_LOCAL 
> requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tian

[jira] [Updated] (YARN-10482) Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal

2020-11-04 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10482:
-
Description: The Capacity Scheduler seems to be locked: the RM cannot accept 
any new jobs, and manually switching the active RM returns it to normal. This 
is a serious bug! I checked the stack dump and found some information about 
*ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps 
taken while the RM was in its normal and abnormal states. The RM hangs forever 
until I restart it or manually switch the active RM!  (was: The Capacity 
Scheduler seems to be locked: the RM cannot accept any new jobs, and manually 
switching the active RM returns it to normal. This is a serious bug! I checked 
the stack dump and found some information about *ReentrantReadWriteLock*. Can 
anyone solve this issue? I uploaded stack dumps taken while the RM was in its 
normal and abnormal states.)

> Capacity Scheduler seems locked,RM cannot submit any new job,and change 
> active RM  manually return to normal
> 
>
> Key: YARN-10482
> URL: https://issues.apache.org/jira/browse/YARN-10482
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler, resourcemanager, 
> RM
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> The Capacity Scheduler seems to be locked: the RM cannot accept any new 
> jobs, and manually switching the active RM returns it to normal. This is a 
> serious bug! I checked the stack dump and found some information about 
> *ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps 
> taken while the RM was in its normal and abnormal states. The RM hangs 
> forever until I restart it or manually switch the active RM!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10482) Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal

2020-11-04 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10482:
-
Attachment: (was: RM_normal_state.stack)

> Capacity Scheduler seems locked,RM cannot submit any new job,and change 
> active RM  manually return to normal
> 
>
> Key: YARN-10482
> URL: https://issues.apache.org/jira/browse/YARN-10482
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler, resourcemanager, 
> RM
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> The Capacity Scheduler seems to be locked: the RM cannot accept any new 
> jobs, and manually switching the active RM returns it to normal. This is a 
> serious bug! I checked the stack dump and found some information about 
> *ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps 
> taken while the RM was in its normal and abnormal states.






[jira] [Updated] (YARN-10482) Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal

2020-11-04 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10482:
-
Attachment: RM_unnormal_state.stack

> Capacity Scheduler seems locked,RM cannot submit any new job,and change 
> active RM  manually return to normal
> 
>
> Key: YARN-10482
> URL: https://issues.apache.org/jira/browse/YARN-10482
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler, resourcemanager, 
> RM
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> The Capacity Scheduler seems to be locked: the RM cannot accept any new 
> jobs, and manually switching the active RM returns it to normal. This is a 
> serious bug! I checked the stack dump and found some information about 
> *ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps 
> taken while the RM was in its normal and abnormal states.






[jira] [Updated] (YARN-10482) Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal

2020-11-04 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10482:
-
Attachment: RM_normal_state.stack

> Capacity Scheduler seems locked,RM cannot submit any new job,and change 
> active RM  manually return to normal
> 
>
> Key: YARN-10482
> URL: https://issues.apache.org/jira/browse/YARN-10482
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler, resourcemanager, 
> RM
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> The Capacity Scheduler seems to be locked: the RM cannot accept any new 
> jobs, and manually switching the active RM returns it to normal. This is a 
> serious bug! I checked the stack dump and found some information about 
> *ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps 
> taken while the RM was in its normal and abnormal states.






[jira] [Updated] (YARN-10482) Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal

2020-11-04 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10482:
-
Attachment: RM_normal_state.stack

> Capacity Scheduler seems locked,RM cannot submit any new job,and change 
> active RM  manually return to normal
> 
>
> Key: YARN-10482
> URL: https://issues.apache.org/jira/browse/YARN-10482
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler, resourcemanager, 
> RM
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: RM_normal_state.stack
>
>
> The Capacity Scheduler seems to be locked: the RM cannot accept any new 
> jobs, and manually switching the active RM returns it to normal. This is a 
> serious bug! I checked the stack dump and found some information about 
> *ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps 
> taken while the RM was in its normal and abnormal states.






[jira] [Created] (YARN-10482) Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal

2020-11-04 Thread jufeng li (Jira)
jufeng li created YARN-10482:


 Summary: Capacity Scheduler seems locked,RM cannot submit any new 
job,and change active RM  manually return to normal
 Key: YARN-10482
 URL: https://issues.apache.org/jira/browse/YARN-10482
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, capacityscheduler, resourcemanager, RM
Affects Versions: 3.1.1
Reporter: jufeng li


The Capacity Scheduler seems to be locked: the RM cannot accept any new jobs, 
and manually switching the active RM returns it to normal. This is a serious 
bug! I checked the stack dump and found some information about 
*ReentrantReadWriteLock*. Can anyone solve this issue? I uploaded stack dumps 
taken while the RM was in its normal and abnormal states.
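
For anyone comparing the two stack dumps: the sketch below is not a diagnosis 
of this issue, only a minimal illustration of how a JVM can be asked whether 
threads are deadlocked on locks such as ReentrantReadWriteLock. It uses the 
standard ThreadMXBean.findDeadlockedThreads() call. The class name 
LockDeadlockProbe, the helper names, and the two demo locks are hypothetical 
and have nothing to do with the RM code. Note that this check follows 
exclusively owned locks (such as a write lock); a hang where the lock is held 
only by readers may not be reported, so an empty result does not rule out a 
lock problem.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockDeadlockProbe {

    /** Returns info for threads the JVM reports as deadlocked, or an empty array. */
    public static ThreadInfo[] findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Covers object monitors and ownable synchronizers; null means none found.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            return new ThreadInfo[0];
        }
        // Include locked monitors and locked synchronizers in the report.
        return mx.getThreadInfo(ids, true, true);
    }

    /** Hypothetical demo: two daemon threads take two write locks in opposite order. */
    public static int demoDeadlockedThreadCount() throws InterruptedException {
        ReentrantReadWriteLock a = new ReentrantReadWriteLock();
        ReentrantReadWriteLock b = new ReentrantReadWriteLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);

        Thread t1 = new Thread(() -> lockInOrder(a, b, bothHoldFirst), "demo-1");
        Thread t2 = new Thread(() -> lockInOrder(b, a, bothHoldFirst), "demo-2");
        t1.setDaemon(true); // daemon threads, so the stuck demo does not block JVM exit
        t2.setDaemon(true);
        t1.start();
        t2.start();

        // Poll for up to about five seconds until the JVM reports the cycle.
        for (int i = 0; i < 100; i++) {
            ThreadInfo[] infos = findDeadlocks();
            if (infos.length > 0) {
                return infos.length;
            }
            Thread.sleep(50);
        }
        return 0;
    }

    private static void lockInOrder(ReentrantReadWriteLock first,
                                    ReentrantReadWriteLock second,
                                    CountDownLatch bothHoldFirst) {
        first.writeLock().lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();      // wait until both threads hold their first lock
            second.writeLock().lock();  // never returns: the other thread holds this lock
            second.writeLock().unlock();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + demoDeadlockedThreadCount());
        for (ThreadInfo info : findDeadlocks()) {
            System.out.println(info.getThreadName() + " waits on " + info.getLockName());
        }
    }
}
```

The same information appears in a jstack dump under "Locked ownable 
synchronizers", so the attached dumps can be read the same way without running 
any code inside the RM.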






[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-09-26 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10440:
-
Attachment: rm_2020-09-26-2.dump

> resource manager hangs,and i cannot submit any new jobs,but rm and nm 
> processes are normal
> --
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: rm_2020-09-26-2.dump
>
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes 
> are running normally. I can open  x:8088/cluster/apps/RUNNING but cannot open 
> x:8088/cluster/scheduler. Applications already submitted cannot finish, and 
> new applications cannot be submitted. Everything hangs, yet the RM and NM 
> servers stay up. How can I fix this? Please help!
>  
> here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang 
> clusterResource= type=NODE_LOCAL 
> requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCo

[jira] [Commented] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-09-25 Thread jufeng li (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202478#comment-17202478
 ] 

jufeng li commented on YARN-10440:
--

I set 
yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments=100, 
but it happened again last night.

> resource manager hangs,and i cannot submit any new jobs,but rm and nm 
> processes are normal
> --
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes 
> are running normally. I can open  x:8088/cluster/apps/RUNNING but cannot open 
> x:8088/cluster/scheduler. Applications already submitted cannot finish, and 
> new applications cannot be submitted. Everything hangs, yet the RM and NM 
> servers stay up. How can I fix this? Please help!
>  
> here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang 
> clusterResource= type=NODE_LOCAL 
> requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=a

[jira] [Reopened] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-09-25 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li reopened YARN-10440:
--

> resource manager hangs,and i cannot submit any new jobs,but rm and nm 
> processes are normal
> --
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes 
> are running normally. I can open  x:8088/cluster/apps/RUNNING but cannot open 
> x:8088/cluster/scheduler. Applications already submitted cannot finish, and 
> new applications cannot be submitted. Everything hangs, yet the RM and NM 
> servers stay up. How can I fix this? Please help!
>  
> here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang 
> clusterResource= type=NODE_LOCAL 
> requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  cap

[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-09-17 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10440:
-
Description: 
The RM hangs and I cannot submit any new jobs, but the RM and NM processes are 
running normally. I can open  x:8088/cluster/apps/RUNNING but cannot open 
x:8088/cluster/scheduler. Applications already submitted cannot finish, and 
new applications cannot be submitted. Everything hangs, yet the RM and NM 
servers stay up. How can I fix this? Please help!

 

here is the log:
{code:java}
// code placeholder
ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang 
clusterResource= type=NODE_LOCAL 
requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContai

[jira] [Updated] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-09-17 Thread jufeng li (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jufeng li updated YARN-10440:
-
Description: 
The RM hangs and I cannot submit any new jobs, but the RM and NM processes are 
running normally. I can open x:8088/cluster/apps/RUNNING but cannot open 
x:8088/cluster/scheduler. Apps that were already submitted never finish, and new 
apps cannot be submitted. Everything hangs except the RM and NM server processes 
themselves. How can I fix this? Please help!

 

here is the log:
{code:java}
ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang 
clusterResource= type=NODE_LOCAL 
requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer application attempt=appattempt_1600074574138_66297_01 
container=null queue=tianqiwang clusterResource= 
type=NODE_LOCAL requestedPartition=
2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
(CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
(AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
assignedContainer appl

[jira] [Commented] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-09-17 Thread jufeng li (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197472#comment-17197472
 ] 

jufeng li commented on YARN-10440:
--

I restarted the RM and it recovered, so the information is gone. But I did dump 
the heap last time.

> resource manager hangs,and i cannot submit any new jobs,but rm and nm 
> processes are normal
> --
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.1
> Reporter: jufeng li
> Priority: Blocker
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes 
> are running normally. I can open x:8088/cluster/apps/RUNNING but cannot open 
> x:8088/cluster/scheduler. Apps that were already submitted never finish, and 
> new apps cannot be submitted. Everything hangs except the RM and NM server 
> processes themselves. How can I fix this? Please help!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-09-17 Thread jufeng li (Jira)
jufeng li created YARN-10440:


 Summary: resource manager hangs,and i cannot submit any new 
jobs,but rm and nm processes are normal
 Key: YARN-10440
 URL: https://issues.apache.org/jira/browse/YARN-10440
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.1.1
Reporter: jufeng li


The RM hangs and I cannot submit any new jobs, but the RM and NM processes are 
running normally. I can open x:8088/cluster/apps/RUNNING but cannot open 
x:8088/cluster/scheduler. Apps that were already submitted never finish, and new 
apps cannot be submitted. Everything hangs except the RM and NM server processes 
themselves. How can I fix this? Please help!


