[jira] [Commented] (YARN-9673) RMStateStore writeLock makes apps waste more time
[ https://issues.apache.org/jira/browse/YARN-9673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522744#comment-17522744 ]

chan commented on YARN-9673:
----------------------------

[~chaosju] Thanks. I have fixed this problem: I changed the state-store policy property so that app metadata is stored in ZooKeeper.

> RMStateStore writeLock makes apps waste more time
> -------------------------------------------------
>
>                 Key: YARN-9673
>                 URL: https://issues.apache.org/jira/browse/YARN-9673
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.7.3
>            Reporter: chan
>            Priority: Blocker
>
> We have 1000 nodes in the cluster. Recently I found that when many tasks are submitted to the ResourceManager, an application takes 5-8 minutes to go from the NEW to the NEW_SAVING state, and an app attempt takes almost the same time to go from ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in RMStateStore#handleStoreEvent; both store operations call this method.
> Has anyone encountered the same problem?
>
> protected void handleStoreEvent(RMStateStoreEvent event) {
>   this.writeLock.lock();
>   try {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug("Processing event of type " + event.getType());
>     }
>     final RMStateStoreState oldState = getRMStateStoreState();
>     this.stateMachine.doTransition(event.getType(), event);
>     if (oldState != getRMStateStoreState()) {
>       LOG.info("RMStateStore state change from " + oldState + " to "
>           + getRMStateStoreState());
>     }
>   } catch (InvalidStateTransitonException e) {
>     LOG.error("Can't handle this event at current state", e);
>   } finally {
>     this.writeLock.unlock();
>   }
> }
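For anyone applying the same workaround, here is a minimal yarn-site.xml sketch of the change chan describes, assuming the stock Hadoop 2.x property names; the ZooKeeper quorum below is a placeholder, not taken from this thread:

{code:xml}
<!-- Enable RM recovery and persist app metadata in ZooKeeper. -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
{code}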
[jira] [Commented] (YARN-10440) ResourceManager hangs and no new jobs can be submitted, but RM and NM processes are normal
[ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227231#comment-17227231 ]

chan commented on YARN-10440:
-----------------------------

[~Jufeng] Maybe you can disable the preemption monitor first. I think it most likely hangs in CapacityScheduler#allocateContainersToNode, in canAllocateMore, because that is the only path there that can turn into a dead loop.

> ResourceManager hangs and no new jobs can be submitted, but RM and NM processes are normal
> -------------------------------------------------------------------------------------------
>
>                 Key: YARN-10440
>                 URL: https://issues.apache.org/jira/browse/YARN-10440
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.1
>            Reporter: jufeng li
>            Priority: Blocker
>         Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes are normal. I can open x:8088/cluster/apps/RUNNING but not x:8088/cluster/scheduler. Apps already submitted cannot finish, and new apps cannot be submitted; everything hangs except the RM and NM server processes themselves. How can I fix this? Help me, please!
>
> Here is the log (the same two lines repeat continuously):
> {code:java}
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> ...
> {code}
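To make the suspected dead loop concrete, here is a self-contained toy model of the per-heartbeat assignment loop. The names canAllocateMore and tryCommit mirror the real methods, but the logic is a deliberately simplified illustration, not the actual CapacityScheduler code:

{code:java}
// Toy model: if every allocation proposal is generated but then rejected at
// commit time, and the multiple-assignments loop has no effective cap, the
// heartbeat handler never exits -- matching the endless "Failed to accept
// allocation proposal" lines in the log above.
public class AssignmentLoopModel {

  static final boolean MULTIPLE_ASSIGNMENTS_ENABLED = true;
  static final int MAX_ASSIGNMENTS = -1; // -1 models "unlimited"

  // Stand-in for CapacityScheduler#tryCommit: always rejects, e.g. because
  // the proposal is built against a reservation that can never be fulfilled.
  static boolean tryCommit() {
    return false;
  }

  // Stand-in for canAllocateMore: only looks at the committed count, so a
  // stream of rejected proposals never moves it toward termination.
  static boolean canAllocateMore(int committed) {
    return MULTIPLE_ASSIGNMENTS_ENABLED
        && (MAX_ASSIGNMENTS <= 0 || committed < MAX_ASSIGNMENTS);
  }

  public static void main(String[] args) {
    int committed = 0;
    for (long i = 1; i <= 1_000_000; i++) {
      if (!canAllocateMore(committed)) {
        System.out.println("loop exited after " + i + " iterations");
        return;
      }
      if (tryCommit()) {
        committed++;
      }
    }
    System.out.println("1,000,000 iterations with no exit reached -> effectively a dead loop");
  }
}
{code}

Disabling or capping multiple assignments per heartbeat (as in the config suggested later in this thread) bounds the loop regardless of commit failures.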
[jira] [Commented] (YARN-10440) ResourceManager hangs and no new jobs can be submitted, but RM and NM processes are normal
[ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226008#comment-17226008 ]

chan commented on YARN-10440:
-----------------------------

[~Jufeng] Did you set yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled to false?

> ResourceManager hangs and no new jobs can be submitted, but RM and NM processes are normal
> -------------------------------------------------------------------------------------------
>
>                 Key: YARN-10440
>                 URL: https://issues.apache.org/jira/browse/YARN-10440
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.1
>            Reporter: jufeng li
>            Priority: Blocker
>         Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes are normal. I can open x:8088/cluster/apps/RUNNING but not x:8088/cluster/scheduler. Apps already submitted cannot finish, and new apps cannot be submitted; everything hangs except the RM and NM server processes themselves. How can I fix this? Help me, please!
>
> Here is the log (the same two lines repeat continuously):
> {code:java}
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> ...
> {code}
[jira] [Commented] (YARN-7651) branch-2 application master (MR) cannot run in 3.1 cluster
[ https://issues.apache.org/jira/browse/YARN-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223502#comment-17223502 ]

chan commented on YARN-7651:
----------------------------

Hey [~sunilg], I think your cluster has jars from multiple versions; you can put every node on the same version.

> branch-2 application master (MR) cannot run in 3.1 cluster
> -----------------------------------------------------------
>
>                 Key: YARN-7651
>                 URL: https://issues.apache.org/jira/browse/YARN-7651
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.0
>            Reporter: Sunil G
>            Priority: Blocker
>
> {noformat}
> 2017-12-13 19:21:20,452 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2017-12-13 19:21:20,481 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
> java.lang.RuntimeException: Unable to determine current user
>         at org.apache.hadoop.conf.Configuration$Resource.getRestrictParserDefault(Configuration.java:253)
>         at org.apache.hadoop.conf.Configuration$Resource.<init>(Configuration.java:219)
>         at org.apache.hadoop.conf.Configuration$Resource.<init>(Configuration.java:211)
>         at org.apache.hadoop.conf.Configuration.addResource(Configuration.java:876)
>         at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1571)
> Caused by: java.io.IOException: Exception reading /Users/sunilgovindan/install/hadoop/tmp/nm-local-dir/usercache/sunilgovindan/appcache/application_1513172966925_0001/container_1513172966925_0001_01_01/container_tokens
>         at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:208)
>         at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:870)
>         at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>         at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>         at org.apache.hadoop.conf.Configuration$Resource.getRestrictParserDefault(Configuration.java:251)
>         ... 4 more
> Caused by: java.io.IOException: Unknown version 1 in token storage.
>         at org.apache.hadoop.security.Credentials.readTokenStorageStream(Credentials.java:226)
>         at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:205)
>         ... 8 more
> 2017-12-13 19:21:20,484 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting with status 1
> {noformat}
[jira] [Commented] (YARN-10440) ResourceManager hangs and no new jobs can be submitted, but RM and NM processes are normal
[ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223495#comment-17223495 ]

chan commented on YARN-10440:
-----------------------------

[~Jufeng] I have run into this problem before and set the following config. Hope it helps!

{code:xml}
<property>
  <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled</name>
  <value>false</value>
</property>
{code}

> ResourceManager hangs and no new jobs can be submitted, but RM and NM processes are normal
> -------------------------------------------------------------------------------------------
>
>                 Key: YARN-10440
>                 URL: https://issues.apache.org/jira/browse/YARN-10440
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.1
>            Reporter: jufeng li
>            Priority: Blocker
>         Attachments: rm_2020-09-26-2.dump
>
> The RM hangs and I cannot submit any new jobs, but the RM and NM processes are normal. I can open x:8088/cluster/apps/RUNNING but not x:8088/cluster/scheduler. Apps already submitted cannot finish, and new apps cannot be submitted; everything hangs except the RM and NM server processes themselves. How can I fix this? Help me, please!
>
> Here is the log (the same two lines repeat continuously):
> {code:java}
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> ...
> {code}
[jira] [Commented] (YARN-9604) RM Shutdown with FATAL Exception
[ https://issues.apache.org/jira/browse/YARN-9604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195240#comment-17195240 ]

chan commented on YARN-9604:
----------------------------

I think it may be caused by the AM releasing the container: the AM release path does not take the scheduler lock.

> RM Shutdown with FATAL Exception
> --------------------------------
>
>                 Key: YARN-9604
>                 URL: https://issues.apache.org/jira/browse/YARN-9604
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.9.0
>            Reporter: Amithsha
>            Priority: Critical
>
> Earlier we faced this FATAL exception, and it was resolved by adding the following properties:
>
> <property>
>   <name>yarn.scheduler.capacity.rack-locality-additional-delay</name>
>   <value>1</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.node-locality-delay</name>
>   <value>0</value>
> </property>
>
> https://issues.apache.org/jira/browse/YARN-8462 (patch and description)
>
> Recently we are facing the same FATAL exception with a different stacktrace:
>
> 2019-06-06 08:30:38,424 FATAL event.EventDispatcher (?:?(?)) - Error in handling event type NODE_UPDATE to the Event Dispatcher
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:814)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:876)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:55)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:868)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.allocateFromReservedContainer(LeafQueue.java:1002)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1026)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1274)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1430)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1205)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1067)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1472)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:151)
>         at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:745)
> 2019-06-06 08:30:38,424 INFO event.EventDispatcher (?:?(?)) - Exiting, bbye..
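A minimal sketch of the kind of race chan is describing (an illustrative model only; the class and field names are hypothetical, not YARN's): the scheduler thread checks the reservation and then dereferences it again, while an unsynchronized release path clears it in between.

{code:java}
import java.util.concurrent.atomic.AtomicReference;

// Model of a check-then-act race on a reserved container.
public class ReleaseRaceModel {
  // Stand-in for a node's reserved-container reference.
  static final AtomicReference<String> reserved = new AtomicReference<>("container_01");

  public static void main(String[] args) throws InterruptedException {
    // Stand-in for the AM release path, which does not hold the scheduler lock.
    Thread amRelease = new Thread(() -> reserved.set(null));

    // Scheduler side: two separate reads of the same reference.
    if (reserved.get() != null) {   // check
      amRelease.start();            // release happens in between (forced here for the demo)
      amRelease.join();
      try {
        System.out.println(reserved.get().length()); // act -> NullPointerException
      } catch (NullPointerException e) {
        System.out.println("NPE: reservation cleared between check and use");
      }
    }
    // Fix sketch: snapshot the reference once into a local variable, or make
    // the release path acquire the same lock as the allocator.
  }
}
{code}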
[jira] [Updated] (YARN-10395) ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
[ https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chan updated YARN-10395:
------------------------
       Fix Version/s:     (was: 2.9.2)
    Target Version/s:     (was: 2.9.2)
    Affects Version/s:     (was: 2.9.2)

> ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10395
>                 URL: https://issues.apache.org/jira/browse/YARN-10395
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: chan
>            Priority: Major
>         Attachments: Yarn-10395-001.patch
>
> Currently, if an app has reserved a node but the node is then added to the app's blacklist, the reserved-container allocation fails every time the node heartbeats to the ResourceManager. This keeps the node from allocating any other container, even though it has enough memory and vcores. So I think we can release the reserved container when the reserved node is in the app's blacklist.
[jira] [Resolved] (YARN-10395) ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
[ https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chan resolved YARN-10395.
-------------------------
    Resolution: Fixed

> ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10395
>                 URL: https://issues.apache.org/jira/browse/YARN-10395
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.9.2
>            Reporter: chan
>            Priority: Major
>             Fix For: 2.9.2
>
>         Attachments: Yarn-10395-001.patch
>
> Currently, if an app has reserved a node but the node is then added to the app's blacklist, the reserved-container allocation fails every time the node heartbeats to the ResourceManager. This keeps the node from allocating any other container, even though it has enough memory and vcores. So I think we can release the reserved container when the reserved node is in the app's blacklist.
[jira] [Updated] (YARN-10395) ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
[ https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chan updated YARN-10395:
------------------------
    Attachment: Yarn-10395-001.patch

> ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10395
>                 URL: https://issues.apache.org/jira/browse/YARN-10395
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.9.2
>            Reporter: chan
>            Priority: Major
>         Attachments: Yarn-10395-001.patch
>
> Currently, if an app has reserved a node but the node is then added to the app's blacklist, the reserved-container allocation fails every time the node heartbeats to the ResourceManager. This keeps the node from allocating any other container, even though it has enough memory and vcores. So I think we can release the reserved container when the reserved node is in the app's blacklist.
[jira] [Updated] (YARN-10395) ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
[ https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chan updated YARN-10395:
------------------------
    Fix Version/s: 2.9.2

> ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10395
>                 URL: https://issues.apache.org/jira/browse/YARN-10395
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.9.2
>            Reporter: chan
>            Priority: Major
>             Fix For: 2.9.2
>
>         Attachments: Yarn-10395-001.patch
>
> Currently, if an app has reserved a node but the node is then added to the app's blacklist, the reserved-container allocation fails every time the node heartbeats to the ResourceManager. This keeps the node from allocating any other container, even though it has enough memory and vcores. So I think we can release the reserved container when the reserved node is in the app's blacklist.
[jira] [Commented] (YARN-10395) ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
[ https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181070#comment-17181070 ]

chan commented on YARN-10395:
-----------------------------

[~yehuanhuan] Yeah, but in the CapacityScheduler the node stays reserved for the app as long as the app remains unsatisfied, so I release the reserved container when RegularContainerAllocator#checkIfNodeBlackListed does not return null.

> ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10395
>                 URL: https://issues.apache.org/jira/browse/YARN-10395
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.9.2
>            Reporter: chan
>            Priority: Major
>
> Currently, if an app has reserved a node but the node is then added to the app's blacklist, the reserved-container allocation fails every time the node heartbeats to the ResourceManager. This keeps the node from allocating any other container, even though it has enough memory and vcores. So I think we can release the reserved container when the reserved node is in the app's blacklist.
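A self-contained model of the proposed behavior (illustrative only; the real change lives in RegularContainerAllocator and uses YARN's own types): on a heartbeat from a reserved node, release the reservation if the reserving app has blacklisted that node, since it can never be fulfilled.

{code:java}
import java.util.HashSet;
import java.util.Set;

public class BlacklistedReservationModel {
  static final class Node {
    final String id;
    String reservedBy; // app holding a reservation on this node, or null
    Node(String id, String reservedBy) { this.id = id; this.reservedBy = reservedBy; }
  }

  static final class App {
    final String id;
    final Set<String> blacklist = new HashSet<>();
    App(String id) { this.id = id; }
  }

  // Called on each heartbeat from a node that carries a reservation.
  static void onNodeHeartbeat(Node node, App app) {
    if (app.id.equals(node.reservedBy) && app.blacklist.contains(node.id)) {
      // The reserving app refuses this node, so the reservation is dead
      // weight: release it so the node can serve other applications.
      System.out.println("Releasing reservation on " + node.id
          + " held by blacklisting app " + app.id);
      node.reservedBy = null;
    }
  }

  public static void main(String[] args) {
    App app = new App("application_1");
    Node node = new Node("node_1", app.id);
    app.blacklist.add(node.id);
    onNodeHeartbeat(node, app); // releases the stuck reservation
  }
}
{code}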
[jira] [Updated] (YARN-10395) ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
[ https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chan updated YARN-10395:
------------------------
    Summary: ReservedContainer node added to application's blacklist makes this node unable to allocate other containers  (was: ReservedContainer node is added to blackList of application, so the node can not allocate)

> ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10395
>                 URL: https://issues.apache.org/jira/browse/YARN-10395
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.9.2
>            Reporter: chan
>            Priority: Major
>
> Currently, if an app has reserved a node but the node is then added to the app's blacklist, the reserved-container allocation fails every time the node heartbeats to the ResourceManager. This keeps the node from allocating any other container, even though it has enough memory and vcores. So I think we can release the reserved container when the reserved node is in the app's blacklist.
[jira] [Updated] (YARN-10395) ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
[ https://issues.apache.org/jira/browse/YARN-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chan updated YARN-10395:
------------------------
    Description:
Currently, if an app has reserved a node but the node is then added to the app's blacklist, the reserved-container allocation fails every time the node heartbeats to the ResourceManager. This keeps the node from allocating any other container, even though it has enough memory and vcores. So I think we can release the reserved container when the reserved node is in the app's blacklist.

  was:
Currently, if an app has reserved a node but the node is then added to the app's blacklist, the reserved-container allocation fails every time the node heartbeats to the ResourceManager. This keeps the node from allocating any other container, even though it has enough memory
or vcores. So I think we can release the reserved container when the reserved node is in the app's blacklist.

> ReservedContainer node added to application's blacklist makes this node unable to allocate other containers
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10395
>                 URL: https://issues.apache.org/jira/browse/YARN-10395
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.9.2
>            Reporter: chan
>            Priority: Major
>
> Currently, if an app has reserved a node but the node is then added to the app's blacklist, the reserved-container allocation fails every time the node heartbeats to the ResourceManager. This keeps the node from allocating any other container, even though it has enough memory and vcores. So I think we can release the reserved container when the reserved node is in the app's blacklist.
[jira] [Created] (YARN-10395) ReservedContainer node is added to blackList of application, so the node can not allocate
chan created YARN-10395:
----------------------------
             Summary: ReservedContainer node is added to blackList of application, so the node can not allocate
                 Key: YARN-10395
                 URL: https://issues.apache.org/jira/browse/YARN-10395
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacityscheduler
    Affects Versions: 2.9.2
            Reporter: chan

Currently, if an app has reserved a node but the node is then added to the app's blacklist, the reserved-container allocation fails every time the node heartbeats to the ResourceManager. This keeps the node from allocating any other container, even though it has enough memory and vcores. So I think we can release the reserved container when the reserved node is in the app's blacklist.
[jira] [Created] (YARN-10394) RACK/NODE_LOCAL requests have the same node label as the ANY request
chan created YARN-10394:
----------------------------
             Summary: RACK/NODE_LOCAL requests have the same node label as the ANY request
                 Key: YARN-10394
                 URL: https://issues.apache.org/jira/browse/YARN-10394
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: scheduler
    Affects Versions: 2.9.2
         Environment:
{code:java}
private void updateNodeLabels(ResourceRequest request) {
  String resourceName = request.getResourceName();
  if (resourceName.equals(ResourceRequest.ANY)) {
    ResourceRequest previousAnyRequest = getResourceRequest(resourceName);

    // When there is a change in the ANY request's label expression, we should
    // update the label for all resource requests already added with the same
    // priority as the ANY resource request.
    if ((null == previousAnyRequest)
        || hasRequestLabelChanged(previousAnyRequest, request)) {
      for (ResourceRequest r : resourceRequestMap.values()) {
        if (!r.getResourceName().equals(ResourceRequest.ANY)) {
          r.setNodeLabelExpression(request.getNodeLabelExpression());
        }
      }
    }
  } else {
    // If the resource name is not ANY, its node label will be the same as the
    // ANY request's.
    ResourceRequest anyRequest = getResourceRequest(ResourceRequest.ANY);
    if (anyRequest != null) {
      request.setNodeLabelExpression(anyRequest.getNodeLabelExpression());
    }
  }
}
{code}
            Reporter: chan

LocalitySchedulingPlacementSet.updateNodeLabels makes RACK/NODE_LOCAL requests take the same node label as the ANY request instead of
[jira] [Updated] (YARN-9673) RMStateStore writeLock makes apps waste more time
[ https://issues.apache.org/jira/browse/YARN-9673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chan updated YARN-9673:
-----------------------
    Description:
We have 1000 nodes in the cluster. Recently I found that when many tasks are submitted to the ResourceManager, an application takes 5-8 minutes to go from the NEW to the NEW_SAVING state, and an app attempt takes almost the same time to go from ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in RMStateStore#handleStoreEvent; both store operations call this method.

Has anyone encountered the same problem?

protected void handleStoreEvent(RMStateStoreEvent event) {
  this.writeLock.lock();
  try {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Processing event of type " + event.getType());
    }
    final RMStateStoreState oldState = getRMStateStoreState();
    this.stateMachine.doTransition(event.getType(), event);
    if (oldState != getRMStateStoreState()) {
      LOG.info("RMStateStore state change from " + oldState + " to "
          + getRMStateStoreState());
    }
  } catch (InvalidStateTransitonException e) {
    LOG.error("Can't handle this event at current state", e);
  } finally {
    this.writeLock.unlock();
  }
}

  was:
We have 1000 nodes in the cluster. Recently I found that when many tasks are submitted to the ResourceManager, an application takes 5-8 minutes to go from the NEW to the NEW_SAVING state, and an app attempt takes almost the same time to go from ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in RMStateStore#handleStoreEvent; both store operations call this method, and this method is locked. I want to ask why a writeLock is used here.

Has anyone encountered the same problem?

protected void handleStoreEvent(RMStateStoreEvent event) {
  this.writeLock.lock();
  try {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Processing event of type " + event.getType());
    }
    final RMStateStoreState oldState = getRMStateStoreState();
    this.stateMachine.doTransition(event.getType(), event);
    if (oldState != getRMStateStoreState()) {
      LOG.info("RMStateStore state change from " + oldState + " to "
          + getRMStateStoreState());
    }
  } catch (InvalidStateTransitonException e) {
    LOG.error("Can't handle this event at current state", e);
  } finally {
    this.writeLock.unlock();
  }
}

> RMStateStore writeLock makes apps waste more time
> -------------------------------------------------
>
>                 Key: YARN-9673
>                 URL: https://issues.apache.org/jira/browse/YARN-9673
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.7.3
>            Reporter: chan
>            Priority: Blocker
>
> We have 1000 nodes in the cluster. Recently I found that when many tasks are submitted to the ResourceManager, an application takes 5-8 minutes to go from the NEW to the NEW_SAVING state, and an app attempt takes almost the same time to go from ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in RMStateStore#handleStoreEvent; both store operations call this method.
> Has anyone encountered the same problem?
>
> protected void handleStoreEvent(RMStateStoreEvent event) {
>   this.writeLock.lock();
>   try {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug("Processing event of type " + event.getType());
>     }
>     final RMStateStoreState oldState = getRMStateStoreState();
>     this.stateMachine.doTransition(event.getType(), event);
>     if (oldState != getRMStateStoreState()) {
>       LOG.info("RMStateStore state change from " + oldState + " to "
>           + getRMStateStoreState());
>     }
>   } catch (InvalidStateTransitonException e) {
>     LOG.error("Can't handle this event at current state", e);
>   } finally {
>     this.writeLock.unlock();
>   }
> }
[jira] [Updated] (YARN-9673) RMStateStore writeLock makes apps waste more time
[ https://issues.apache.org/jira/browse/YARN-9673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chan updated YARN-9673:
-----------------------
    Environment:     (was:
protected void handleStoreEvent(RMStateStoreEvent event) {
  this.writeLock.lock();
  try {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Processing event of type " + event.getType());
    }
    final RMStateStoreState oldState = getRMStateStoreState();
    this.stateMachine.doTransition(event.getType(), event);
    if (oldState != getRMStateStoreState()) {
      LOG.info("RMStateStore state change from " + oldState + " to "
          + getRMStateStoreState());
    }
  } catch (InvalidStateTransitonException e) {
    LOG.error("Can't handle this event at current state", e);
  } finally {
    this.writeLock.unlock();
  }
})

> RMStateStore writeLock makes apps waste more time
> -------------------------------------------------
>
>                 Key: YARN-9673
>                 URL: https://issues.apache.org/jira/browse/YARN-9673
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.7.3
>            Reporter: chan
>            Priority: Blocker
>
> We have 1000 nodes in the cluster. Recently I found that when many tasks are submitted to the ResourceManager, an application takes 5-8 minutes to go from the NEW to the NEW_SAVING state, and an app attempt takes almost the same time to go from ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in RMStateStore#handleStoreEvent; both store operations call this method, and this method is locked. I want to ask why a writeLock is used here.
> Has anyone encountered the same problem?
[jira] [Updated] (YARN-9673) RMStateStore writeLock makes apps waste more time
[ https://issues.apache.org/jira/browse/YARN-9673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chan updated YARN-9673:
-----------------------
    Description:
We have 1000 nodes in the cluster. Recently I found that when many tasks are submitted to the ResourceManager, an application takes 5-8 minutes to go from the NEW to the NEW_SAVING state, and an app attempt takes almost the same time to go from ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in RMStateStore#handleStoreEvent; both store operations call this method, and this method is locked. I want to ask why a writeLock is used here.

Has anyone encountered the same problem?

protected void handleStoreEvent(RMStateStoreEvent event) {
  this.writeLock.lock();
  try {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Processing event of type " + event.getType());
    }
    final RMStateStoreState oldState = getRMStateStoreState();
    this.stateMachine.doTransition(event.getType(), event);
    if (oldState != getRMStateStoreState()) {
      LOG.info("RMStateStore state change from " + oldState + " to "
          + getRMStateStoreState());
    }
  } catch (InvalidStateTransitonException e) {
    LOG.error("Can't handle this event at current state", e);
  } finally {
    this.writeLock.unlock();
  }
}

  was:
We have 1000 nodes in the cluster. Recently I found that when many tasks are submitted to the ResourceManager, an application takes 5-8 minutes to go from the NEW to the NEW_SAVING state, and an app attempt takes almost the same time to go from ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in RMStateStore#handleStoreEvent; both store operations call this method, and this method is locked. I want to ask why a writeLock is used here.

Has anyone encountered the same problem?

> RMStateStore writeLock makes apps waste more time
> -------------------------------------------------
>
>                 Key: YARN-9673
>                 URL: https://issues.apache.org/jira/browse/YARN-9673
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.7.3
>            Reporter: chan
>            Priority: Blocker
>
> We have 1000 nodes in the cluster. Recently I found that when many tasks are submitted to the ResourceManager, an application takes 5-8 minutes to go from the NEW to the NEW_SAVING state, and an app attempt takes almost the same time to go from ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in RMStateStore#handleStoreEvent; both store operations call this method, and this method is locked. I want to ask why a writeLock is used here.
> Has anyone encountered the same problem?
>
> protected void handleStoreEvent(RMStateStoreEvent event) {
>   this.writeLock.lock();
>   try {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug("Processing event of type " + event.getType());
>     }
>     final RMStateStoreState oldState = getRMStateStoreState();
>     this.stateMachine.doTransition(event.getType(), event);
>     if (oldState != getRMStateStoreState()) {
>       LOG.info("RMStateStore state change from " + oldState + " to "
>           + getRMStateStoreState());
>     }
>   } catch (InvalidStateTransitonException e) {
>     LOG.error("Can't handle this event at current state", e);
>   } finally {
>     this.writeLock.unlock();
>   }
> }
[jira] [Created] (YARN-9673) RMStateStore writeLock makes apps waste more time
chan created YARN-9673:
---------------------------
             Summary: RMStateStore writeLock makes apps waste more time
                 Key: YARN-9673
                 URL: https://issues.apache.org/jira/browse/YARN-9673
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
    Affects Versions: 2.7.3
         Environment:
protected void handleStoreEvent(RMStateStoreEvent event) {
  this.writeLock.lock();
  try {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Processing event of type " + event.getType());
    }
    final RMStateStoreState oldState = getRMStateStoreState();
    this.stateMachine.doTransition(event.getType(), event);
    if (oldState != getRMStateStoreState()) {
      LOG.info("RMStateStore state change from " + oldState + " to "
          + getRMStateStoreState());
    }
  } catch (InvalidStateTransitonException e) {
    LOG.error("Can't handle this event at current state", e);
  } finally {
    this.writeLock.unlock();
  }
}
            Reporter: chan

We have 1000 nodes in the cluster. Recently I found that when many tasks are submitted to the ResourceManager, an application takes 5-8 minutes to go from the NEW to the NEW_SAVING state, and an app attempt takes almost the same time to go from ALLOCATED_SAVING to ALLOCATED. I think the problem occurs in RMStateStore#handleStoreEvent; both store operations call this method, and this method is locked. I want to ask why a writeLock is used here.

Has anyone encountered the same problem?
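To see why a single write lock produces minutes-long delays under load, here is a self-contained model (timings are illustrative, not measured on YARN): every store event holds the write lock for the duration of a synchronous store write, so a burst of events is processed strictly one at a time.

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Model of RMStateStore#handleStoreEvent's locking: each event holds the
// write lock across a (simulated) synchronous ZooKeeper/FileSystem write.
public class StoreLockModel {
  static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  static void handleStoreEvent() {
    lock.writeLock().lock();
    try {
      Thread.sleep(50); // pretend one store write takes 50 ms
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    int events = 100; // a burst of app/attempt store events
    for (int i = 0; i < events; i++) {
      handleStoreEvent(); // strictly serialized behind the write lock
    }
    long elapsed = System.currentTimeMillis() - start;
    // ~100 * 50 ms = ~5 s here; with thousands of queued store events on a
    // large cluster, the last app in the queue waits minutes to reach
    // NEW_SAVING, as described above.
    System.out.println("Processed " + events + " events in " + elapsed + " ms");
  }
}
{code}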