[jira] [Commented] (YARN-9673) RMStateStore writeLock make app waste more time

2022-04-15 Thread chan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522744#comment-17522744
 ] 

chan commented on YARN-9673:


[~chaosju] Thanks. I have since fixed this problem: I changed the state-store policy property 
so that application metadata is stored in ZooKeeper.
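
For reference, a minimal yarn-site.xml sketch of that change, assuming the fix means switching the RM state store to the ZooKeeper-based store (the ZooKeeper quorum below is a placeholder):
{code:xml}
<!-- yarn-site.xml: enable RM recovery and keep application metadata in ZooKeeper -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <!-- placeholder addresses; point this at your own ZooKeeper ensemble -->
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
{code}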

> RMStateStore writeLock make app waste more time
> ---
>
> Key: YARN-9673
> URL: https://issues.apache.org/jira/browse/YARN-9673
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.3
>Reporter: chan
>Priority: Blocker
>
> We have 1000 nodes in the cluster. Recently I found that when many tasks are 
> submitted to the resourcemanager, an application takes 5-8 minutes from NEW 
> to NEW_SAVING state, and an appattempt takes almost the same time from 
> ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in 
> RMStateStore#handleStoreEvent, both methods will call this method
> Anyone has encountered the same problem?
>  
> protected void handleStoreEvent(RMStateStoreEvent event) {
>   this.writeLock.lock();
>   try {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug("Processing event of type " + event.getType());
>     }
>     final RMStateStoreState oldState = getRMStateStoreState();
>     this.stateMachine.doTransition(event.getType(), event);
>     if (oldState != getRMStateStoreState()) {
>       LOG.info("RMStateStore state change from " + oldState + " to "
>           + getRMStateStoreState());
>     }
>   } catch (InvalidStateTransitonException e) {
>     LOG.error("Can't handle this event at current state", e);
>   } finally {
>     this.writeLock.unlock();
>   }
> }



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-11-06 Thread chan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227231#comment-17227231
 ] 

chan commented on YARN-10440:
-

[~Jufeng] Maybe you can disable the preemption monitor first. I think it most likely hangs 
in CapacityScheduler#allocateContainersToNode#canAllocateMore, because that is the only 
path that can turn into a dead loop.
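
If "disable the preemption monitor" means switching off the scheduler monitors that drive capacity preemption, a minimal yarn-site.xml sketch would be (a diagnostic step only, not a fix):
{code:xml}
<!-- yarn-site.xml: turn off the scheduling monitors, including capacity preemption -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>false</value>
</property>
{code}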

> resource manager hangs,and i cannot submit any new jobs,but rm and nm 
> processes are normal
> --
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> RM hangs,and i cannot submit any new jobs,but RM and NM processes are normal. 
> I can open  x:8088/cluster/apps/RUNNING but can not 
> x:8088/cluster/scheduler.Those apps submited can not end itself and new 
> apps can not be submited.just everything hangs but not RM,NM server. How can 
> I fix this?help me,please!
>  
> here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang 
> clusterResource= type=NODE_LOCAL 
> requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  

[jira] [Commented] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-11-04 Thread chan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226008#comment-17226008
 ] 

chan commented on YARN-10440:
-

[~Jufeng] Did you set the config 
yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled to 
false?
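
For reference, a sketch of the capacity-scheduler.xml entry being asked about (the same property appears in the configuration suggested elsewhere in this thread):
{code:xml}
<!-- capacity-scheduler.xml: allow at most one container assignment per node heartbeat -->
<property>
  <name>yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled</name>
  <value>false</value>
</property>
{code}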

 

> resource manager hangs,and i cannot submit any new jobs,but rm and nm 
> processes are normal
> --
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> RM hangs,and i cannot submit any new jobs,but RM and NM processes are normal. 
> I can open  x:8088/cluster/apps/RUNNING but can not 
> x:8088/cluster/scheduler.Those apps submited can not end itself and new 
> apps can not be submited.just everything hangs but not RM,NM server. How can 
> I fix this?help me,please!
>  
> here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang 
> clusterResource= type=NODE_LOCAL 
> requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> 

[jira] [Commented] (YARN-7651) branch-2 application master (MR) cannot run in 3.1 cluster

2020-10-30 Thread chan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223502#comment-17223502
 ] 

chan commented on YARN-7651:


Hey [~sunilg], I think your cluster has jars from multiple versions mixed together; you 
could make sure all nodes run the same version.

> branch-2 application master (MR) cannot run in 3.1 cluster
> --
>
> Key: YARN-7651
> URL: https://issues.apache.org/jira/browse/YARN-7651
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.0
>Reporter: Sunil G
>Priority: Blocker
>
> {noformat}
> 2017-12-13 19:21:20,452 WARN [main] org.apache.hadoop.util.NativeCodeLoader: 
> Unable to load native-hadoop library for your platform... using builtin-java 
> classes where applicable
> 2017-12-13 19:21:20,481 FATAL [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
> java.lang.RuntimeException: Unable to determine current user
> at 
> org.apache.hadoop.conf.Configuration$Resource.getRestrictParserDefault(Configuration.java:253)
> at 
> org.apache.hadoop.conf.Configuration$Resource.(Configuration.java:219)
> at 
> org.apache.hadoop.conf.Configuration$Resource.(Configuration.java:211)
> at 
> org.apache.hadoop.conf.Configuration.addResource(Configuration.java:876)
> at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1571)
> Caused by: java.io.IOException: Exception reading 
> /Users/sunilgovindan/install/hadoop/tmp/nm-local-dir/usercache/sunilgovindan/appcache/application_1513172966925_0001/container_1513172966925_0001_01_01/container_tokens
> at 
> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:208)
> at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:870)
> at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
> at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
> at 
> org.apache.hadoop.conf.Configuration$Resource.getRestrictParserDefault(Configuration.java:251)
> ... 4 more
> Caused by: java.io.IOException: Unknown version 1 in token storage.
> at 
> org.apache.hadoop.security.Credentials.readTokenStorageStream(Credentials.java:226)
> at 
> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:205)
> ... 8 more
> 2017-12-13 19:21:20,484 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting 
> with status 1
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-10-30 Thread chan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223495#comment-17223495
 ] 

chan commented on YARN-10440:
-

[~Jufeng] I have run into this problem before and fixed it by setting the config below; hope it helps!
{code:xml}
<!-- capacity-scheduler.xml -->
<property>
  <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled</name>
  <value>false</value>
</property>
{code}
 

> resource manager hangs,and i cannot submit any new jobs,but rm and nm 
> processes are normal
> --
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: rm_2020-09-26-2.dump
>
>
> RM hangs,and i cannot submit any new jobs,but RM and NM processes are normal. 
> I can open  x:8088/cluster/apps/RUNNING but can not 
> x:8088/cluster/scheduler.Those apps submited can not end itself and new 
> apps can not be submited.just everything hangs but not RM,NM server. How can 
> I fix this?help me,please!
>  
> here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang 
> clusterResource= type=NODE_LOCAL 
> requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept 

[jira] [Commented] (YARN-9604) RM Shutdown with FATAL Exception

2020-09-14 Thread chan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195240#comment-17195240
 ] 

chan commented on YARN-9604:


I think it may be caused by the AM releasing the container: the AM-initiated 
container-release path does not take the lock.

> RM Shutdown with FATAL Exception
> 
>
> Key: YARN-9604
> URL: https://issues.apache.org/jira/browse/YARN-9604
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Amithsha
>Priority: Critical
>
> Earlier faced the FATAL exception and got it resolved by adding the following 
> properties:
>   <property>
>     <name>yarn.scheduler.capacity.rack-locality-additional-delay</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>yarn.scheduler.capacity.node-locality-delay</name>
>     <value>0</value>
>   </property>
> https://issues.apache.org/jira/browse/YARN-8462 (patch and description)
>  
> Recently facing same FATAL Exception with different stacktrace.
>  
>  
>  
> 2019-06-06 08:30:38,424 FATAL event.EventDispatcher (?:?(?)) - Error in 
> handling event type NODE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:814)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:876)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:55)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:868)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.allocateFromReservedContainer(LeafQueue.java:1002)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1026)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1274)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1430)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1205)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1067)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1472)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:151)
>  at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-06-06 08:30:38,424 INFO event.EventDispatcher (?:?(?)) - Exiting, bbye..



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10395) ReservedContainer Node is added to blackList of application due to this node can not allocate other container

2020-09-14 Thread chan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chan updated YARN-10395:

Fix Version/s: (was: 2.9.2)
 Target Version/s:   (was: 2.9.2)
Affects Version/s: (was: 2.9.2)

> ReservedContainer Node is added to blackList of application due to this node 
> can not allocate other container
> -
>
> Key: YARN-10395
> URL: https://issues.apache.org/jira/browse/YARN-10395
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: chan
>Priority: Major
> Attachments: Yarn-10395-001.patch
>
>
> Now,if a app reserved a node,but the node is added to app`s blacklist.
> when this node send  heartbeat to resourcemanager,the reserved container 
> allocate fails,it will make this node can not allocate other container even 
> thought this node have enough memory or vcores.so i think we can release this 
> reserved container when the reserved node is in the black list of this app.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10395) ReservedContainer Node is added to blackList of application due to this node can not allocate other container

2020-08-25 Thread chan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chan resolved YARN-10395.
-
Resolution: Fixed

> ReservedContainer Node is added to blackList of application due to this node 
> can not allocate other container
> -
>
> Key: YARN-10395
> URL: https://issues.apache.org/jira/browse/YARN-10395
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.9.2
>Reporter: chan
>Priority: Major
> Fix For: 2.9.2
>
> Attachments: Yarn-10395-001.patch
>
>
> Now,if a app reserved a node,but the node is added to app`s blacklist.
> when this node send  heartbeat to resourcemanager,the reserved container 
> allocate fails,it will make this node can not allocate other container even 
> thought this node have enough memory or vcores.so i think we can release this 
> reserved container when the reserved node is in the black list of this app.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10395) ReservedContainer Node is added to blackList of application due to this node can not allocate other container

2020-08-20 Thread chan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chan updated YARN-10395:

Attachment: Yarn-10395-001.patch

> ReservedContainer Node is added to blackList of application due to this node 
> can not allocate other container
> -
>
> Key: YARN-10395
> URL: https://issues.apache.org/jira/browse/YARN-10395
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.9.2
>Reporter: chan
>Priority: Major
> Attachments: Yarn-10395-001.patch
>
>
> Now,if a app reserved a node,but the node is added to app`s blacklist.
> when this node send  heartbeat to resourcemanager,the reserved container 
> allocate fails,it will make this node can not allocate other container even 
> thought this node have enough memory or vcores.so i think we can release this 
> reserved container when the reserved node is in the black list of this app.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10395) ReservedContainer Node is added to blackList of application due to this node can not allocate other container

2020-08-20 Thread chan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chan updated YARN-10395:

Fix Version/s: 2.9.2

> ReservedContainer Node is added to blackList of application due to this node 
> can not allocate other container
> -
>
> Key: YARN-10395
> URL: https://issues.apache.org/jira/browse/YARN-10395
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.9.2
>Reporter: chan
>Priority: Major
> Fix For: 2.9.2
>
> Attachments: Yarn-10395-001.patch
>
>
> Now,if a app reserved a node,but the node is added to app`s blacklist.
> when this node send  heartbeat to resourcemanager,the reserved container 
> allocate fails,it will make this node can not allocate other container even 
> thought this node have enough memory or vcores.so i think we can release this 
> reserved container when the reserved node is in the black list of this app.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10395) ReservedContainer Node is added to blackList of application due to this node can not allocate other container

2020-08-20 Thread chan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181070#comment-17181070
 ] 

chan commented on YARN-10395:
-

[~yehuanhuan] Yeah, but in CapacityScheduler the node stays reserved for the app as long 
as the app is not satisfied, so I release the reserved container when 
RegularContainerAllocator#checkIfNodeBlackListed does not return null.

> ReservedContainer Node is added to blackList of application due to this node 
> can not allocate other container
> -
>
> Key: YARN-10395
> URL: https://issues.apache.org/jira/browse/YARN-10395
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.9.2
>Reporter: chan
>Priority: Major
>
> Now,if a app reserved a node,but the node is added to app`s blacklist.
> when this node send  heartbeat to resourcemanager,the reserved container 
> allocate fails,it will make this node can not allocate other container even 
> thought this node have enough memory or vcores.so i think we can release this 
> reserved container when the reserved node is in the black list of this app.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10395) ReservedContainer Node is added to blackList of application due to this node can not allocate other container

2020-08-12 Thread chan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chan updated YARN-10395:

Summary: ReservedContainer Node is added to blackList of application due to 
this node can not allocate other container  (was: ReservedContainer Node is 
added to blackList of application due to node due to node can not allocate)

> ReservedContainer Node is added to blackList of application due to this node 
> can not allocate other container
> -
>
> Key: YARN-10395
> URL: https://issues.apache.org/jira/browse/YARN-10395
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.9.2
>Reporter: chan
>Priority: Major
>
> Now,if a app reserved a node,but the node is added to app`s blacklist.
> when this node send  heartbeat to resourcemanager,the reserved container 
> allocate fails,it will make this node can not allocate other container even 
> thought this node have enough memory
> or vcores.so i think we can release this reserved container when the reserved 
> node is in the black list of this app.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10395) ReservedContainer Node is added to blackList of application due to this node can not allocate other container

2020-08-12 Thread chan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chan updated YARN-10395:

Description: 
Now,if a app reserved a node,but the node is added to app`s blacklist.

when this node send  heartbeat to resourcemanager,the reserved container 
allocate fails,it will make this node can not allocate other container even 
thought this node have enough memory or vcores.so i think we can release this 
reserved container when the reserved node is in the black list of this app.

 

 

  was:
Now,if a app reserved a node,but the node is added to app`s blacklist.

when this node send  heartbeat to resourcemanager,the reserved container 
allocate fails,it will make this node can not allocate other container even 
thought this node have enough memory

or vcores.so i think we can release this reserved container when the reserved 
node is in the black list of this app.

 

 


> ReservedContainer Node is added to blackList of application due to this node 
> can not allocate other container
> -
>
> Key: YARN-10395
> URL: https://issues.apache.org/jira/browse/YARN-10395
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.9.2
>Reporter: chan
>Priority: Major
>
> Now,if a app reserved a node,but the node is added to app`s blacklist.
> when this node send  heartbeat to resourcemanager,the reserved container 
> allocate fails,it will make this node can not allocate other container even 
> thought this node have enough memory or vcores.so i think we can release this 
> reserved container when the reserved node is in the black list of this app.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10395) ReservedContainer Node is added to blackList of application due to node due to node can not allocate

2020-08-12 Thread chan (Jira)
chan created YARN-10395:
---

 Summary: ReservedContainer Node is added to blackList of 
application due to node due to node can not allocate
 Key: YARN-10395
 URL: https://issues.apache.org/jira/browse/YARN-10395
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.9.2
Reporter: chan


Now,if a app reserved a node,but the node is added to app`s blacklist.

when this node send  heartbeat to resourcemanager,the reserved container 
allocate fails,it will make this node can not allocate other container even 
thought this node have enough memory

or vcores.so i think we can release this reserved container when the reserved 
node is in the black list of this app.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10394) RACK/NODE_LOCAL Request have same nodelabel as ANY Request

2020-08-11 Thread chan (Jira)
chan created YARN-10394:
---

 Summary: RACK/NODE_LOCAL Request have same nodelabel as ANY Request
 Key: YARN-10394
 URL: https://issues.apache.org/jira/browse/YARN-10394
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.9.2
 Environment: {code:java}
private void updateNodeLabels(ResourceRequest request) {
  String resourceName = request.getResourceName();
  if (resourceName.equals(ResourceRequest.ANY)) {
    ResourceRequest previousAnyRequest = getResourceRequest(resourceName);
    // When there is a change in the ANY request's label expression, we should
    // update the label for all resource requests already added of the same
    // priority as the ANY resource request.
    if ((null == previousAnyRequest)
        || hasRequestLabelChanged(previousAnyRequest, request)) {
      for (ResourceRequest r : resourceRequestMap.values()) {
        if (!r.getResourceName().equals(ResourceRequest.ANY)) {
          r.setNodeLabelExpression(request.getNodeLabelExpression());
        }
      }
    }
  } else {
    // If the resource name is not ANY, its node label will be the same as the
    // ANY request's.
    ResourceRequest anyRequest = getResourceRequest(ResourceRequest.ANY);
    if (anyRequest != null) {
      request.setNodeLabelExpression(anyRequest.getNodeLabelExpression());
    }
  }
}
{code}
 

 
Reporter: chan


LocalitySchedulingPlacementSet.updateNodeLabels makes a RACK/NODE_LOCAL request 
take the same node label as the ANY request instead of 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9673) RMStateStore writeLock make app waste more time

2019-07-14 Thread chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chan updated YARN-9673:
---
Description: 
We have 1000 nodes in the cluster. Recently I found that when many tasks are 
submitted to the resourcemanager, an application takes 5-8 minutes from NEW to 
NEW_SAVING state, and an appattempt takes almost the same time from 
ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in 
RMStateStore#handleStoreEvent, both methods will call this method

Anyone has encountered the same problem?

 

protected void handleStoreEvent(RMStateStoreEvent event) {
 this.writeLock.lock();
 try {

if (LOG.isDebugEnabled())

{ LOG.debug("Processing event of type " + event.getType()); }

final RMStateStoreState oldState = getRMStateStoreState();

this.stateMachine.doTransition(event.getType(), event);

if (oldState != getRMStateStoreState())

{ LOG.info("RMStateStore state change from " + oldState + " to " + 
getRMStateStoreState()); }

} catch (InvalidStateTransitonException e)

{ LOG.error("Can't handle this event at current state", e); }

finally

{ this.writeLock.unlock(); }

}

  was:
We have 1000 nodes in the cluster. Recently I found that when many tasks are 
submitted to the resourcemanager, an application takes 5-8 minutes from NEW to 
NEW_SAVING state, and an appattempt takes almost the same time from 
ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in 
RMStateStore#handleStoreEvent, both methods will call this method, and this 
method is locked. I want to ask why there use writeLock to lock it.

Anyone has encountered the same problem?

 

protected void handleStoreEvent(RMStateStoreEvent event) {
this.writeLock.lock();
try {

if (LOG.isDebugEnabled())

{ LOG.debug("Processing event of type " + event.getType()); }

final RMStateStoreState oldState = getRMStateStoreState();

this.stateMachine.doTransition(event.getType(), event);

if (oldState != getRMStateStoreState())

{ LOG.info("RMStateStore state change from " + oldState + " to " + 
getRMStateStoreState()); }

} catch (InvalidStateTransitonException e)

{ LOG.error("Can't handle this event at current state", e); }

finally

{ this.writeLock.unlock(); }

}


> RMStateStore writeLock make app waste more time
> ---
>
> Key: YARN-9673
> URL: https://issues.apache.org/jira/browse/YARN-9673
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.3
>Reporter: chan
>Priority: Blocker
>
> We have 1000 nodes in the cluster. Recently I found that when many tasks are 
> submitted to the resourcemanager, an application takes 5-8 minutes from NEW 
> to NEW_SAVING state, and an appattempt takes almost the same time from 
> ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in 
> RMStateStore#handleStoreEvent, both methods will call this method
> Anyone has encountered the same problem?
>  
> protected void handleStoreEvent(RMStateStoreEvent event) {
>  this.writeLock.lock();
>  try {
> if (LOG.isDebugEnabled())
> { LOG.debug("Processing event of type " + event.getType()); }
> final RMStateStoreState oldState = getRMStateStoreState();
> this.stateMachine.doTransition(event.getType(), event);
> if (oldState != getRMStateStoreState())
> { LOG.info("RMStateStore state change from " + oldState + " to " + 
> getRMStateStoreState()); }
> } catch (InvalidStateTransitonException e)
> { LOG.error("Can't handle this event at current state", e); }
> finally
> { this.writeLock.unlock(); }
> }



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9673) RMStateStore writeLock make app waste more time

2019-07-14 Thread chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chan updated YARN-9673:
---
Environment: (was: protected void handleStoreEvent(RMStateStoreEvent 
event) {
 this.writeLock.lock();
 try {

 if (LOG.isDebugEnabled()) {
 LOG.debug("Processing event of type " + event.getType());
 }

 final RMStateStoreState oldState = getRMStateStoreState();

 this.stateMachine.doTransition(event.getType(), event);

 if (oldState != getRMStateStoreState()) {
 LOG.info("RMStateStore state change from " + oldState + " to "
 + getRMStateStoreState());
 }

 } catch (InvalidStateTransitonException e) {
 LOG.error("Can't handle this event at current state", e);
 } finally {
 this.writeLock.unlock();
 }
})

> RMStateStore writeLock make app waste more time
> ---
>
> Key: YARN-9673
> URL: https://issues.apache.org/jira/browse/YARN-9673
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.3
>Reporter: chan
>Priority: Blocker
>
> We have 1000 nodes in the cluster. Recently I found that when many tasks are 
> submitted to the resourcemanager, an application takes 5-8 minutes from NEW 
> to NEW_SAVING state, and an appattempt takes almost the same time from 
> ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in 
> RMStateStore#handleStoreEvent, both methods will call this method, and this 
> method is locked. I want to ask why there use writeLock to lock it.
> Anyone has encountered the same problem?
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9673) RMStateStore writeLock make app waste more time

2019-07-14 Thread chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chan updated YARN-9673:
---
Description: 
We have 1000 nodes in the cluster. Recently I found that when many tasks are 
submitted to the resourcemanager, an application takes 5-8 minutes from NEW to 
NEW_SAVING state, and an appattempt takes almost the same time from 
ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in 
RMStateStore#handleStoreEvent, both methods will call this method, and this 
method is locked. I want to ask why there use writeLock to lock it.

Anyone has encountered the same problem?

 

protected void handleStoreEvent(RMStateStoreEvent event) {
this.writeLock.lock();
try {

if (LOG.isDebugEnabled())

{ LOG.debug("Processing event of type " + event.getType()); }

final RMStateStoreState oldState = getRMStateStoreState();

this.stateMachine.doTransition(event.getType(), event);

if (oldState != getRMStateStoreState())

{ LOG.info("RMStateStore state change from " + oldState + " to " + 
getRMStateStoreState()); }

} catch (InvalidStateTransitonException e)

{ LOG.error("Can't handle this event at current state", e); }

finally

{ this.writeLock.unlock(); }

}

  was:
We have 1000 nodes in the cluster. Recently I found that when many tasks are 
submitted to the resourcemanager, an application takes 5-8 minutes from NEW to 
NEW_SAVING state, and an appattempt takes almost the same time from 
ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in 
RMStateStore#handleStoreEvent, both methods will call this method, and this 
method is locked. I want to ask why there use writeLock to lock it.

Anyone has encountered the same problem?

 


> RMStateStore writeLock make app waste more time
> ---
>
> Key: YARN-9673
> URL: https://issues.apache.org/jira/browse/YARN-9673
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.3
>Reporter: chan
>Priority: Blocker
>
> We have 1000 nodes in the cluster. Recently I found that when many tasks are 
> submitted to the resourcemanager, an application takes 5-8 minutes from NEW 
> to NEW_SAVING state, and an appattempt takes almost the same time from 
> ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in 
> RMStateStore#handleStoreEvent, both methods will call this method, and this 
> method is locked. I want to ask why there use writeLock to lock it.
> Anyone has encountered the same problem?
>  
> protected void handleStoreEvent(RMStateStoreEvent event) {
> this.writeLock.lock();
> try {
> if (LOG.isDebugEnabled())
> { LOG.debug("Processing event of type " + event.getType()); }
> final RMStateStoreState oldState = getRMStateStoreState();
> this.stateMachine.doTransition(event.getType(), event);
> if (oldState != getRMStateStoreState())
> { LOG.info("RMStateStore state change from " + oldState + " to " + 
> getRMStateStoreState()); }
> } catch (InvalidStateTransitonException e)
> { LOG.error("Can't handle this event at current state", e); }
> finally
> { this.writeLock.unlock(); }
> }



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9673) RMStateStore writeLock make app waste more time

2019-07-14 Thread chan (JIRA)
chan created YARN-9673:
--

 Summary: RMStateStore writeLock make app waste more time
 Key: YARN-9673
 URL: https://issues.apache.org/jira/browse/YARN-9673
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.7.3
 Environment: protected void handleStoreEvent(RMStateStoreEvent event) {
 this.writeLock.lock();
 try {

 if (LOG.isDebugEnabled()) {
 LOG.debug("Processing event of type " + event.getType());
 }

 final RMStateStoreState oldState = getRMStateStoreState();

 this.stateMachine.doTransition(event.getType(), event);

 if (oldState != getRMStateStoreState()) {
 LOG.info("RMStateStore state change from " + oldState + " to "
 + getRMStateStoreState());
 }

 } catch (InvalidStateTransitonException e) {
 LOG.error("Can't handle this event at current state", e);
 } finally {
 this.writeLock.unlock();
 }
}
Reporter: chan


We have 1000 nodes in the cluster. Recently I found that when many tasks are 
submitted to the resourcemanager, an application takes 5-8 minutes from NEW to 
NEW_SAVING state, and an appattempt takes almost the same time from 
ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in 
RMStateStore#handleStoreEvent, both methods will call this method, and this 
method is locked. I want to ask why there use writeLock to lock it.

Anyone has encountered the same problem?

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org