[jira] [Comment Edited] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node
[ https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969197#comment-15969197 ]

zhihai xu edited comment on YARN-6396 at 4/14/17 4:02 PM:
--

Thanks for the review [~jianhe] and [~rkanter]! If someone deletes the remote log dir, all the old logs will disappear. That is the more serious issue: recreating the remote log dir won't recover the old log data. This looks like a monitoring problem, and I think it is better handled by a tool outside the NM. It is more efficient to do it in one place than on each NM, of which there can be many thousands in a large cluster. Yes, it's a trade-off between validation and efficiency. Also, restarting the NM will recreate the remote log dir.

was (Author: zxu):
Thanks for the review [~jianhe] and [~rkanter]! If someone deletes the remote log dir, all the old logs will disappear. That is the more serious issue: recreating the remote log dir won't recover the old log data. This looks like a monitoring problem, and I think it is better handled by a tool outside the NM. It is more efficient to do it in one place than on each NM, of which there can be many thousands in a large cluster. Yes, it's a trade-off between validation and efficiency.

> Call verifyAndCreateRemoteLogDir at service initialization instead of
> application initialization to decrease load for name node
> ---
>
>                 Key: YARN-6396
>                 URL: https://issues.apache.org/jira/browse/YARN-6396
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: log-aggregation
>    Affects Versions: 3.0.0-alpha2
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Minor
>         Attachments: YARN-6396.000.patch
>
> Call verifyAndCreateRemoteLogDir at service initialization instead of
> application initialization to decrease load for name node.
> Currently verifyAndCreateRemoteLogDir is called for every application at
> each node before doing log aggregation. This is a non-trivial overhead for
> the name node in a large cluster, since verifyAndCreateRemoteLogDir calls
> getFileStatus. Once the remote log directory has been created successfully,
> it is not necessary to call it again, so it is better to call
> verifyAndCreateRemoteLogDir at LogAggregationService service initialization.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
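The change described in this issue can be sketched with a simplified stand-in (plain java.nio instead of the Hadoop FileSystem API; the class and method split shown here is illustrative, not the actual patch): the directory check runs once at service start and the result is remembered, so per-application initialization no longer issues a getFileStatus-equivalent call on every log-aggregation start.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified model of moving verifyAndCreateRemoteLogDir from per-application
// initialization to one-time service initialization. Names are assumptions.
public class RemoteLogDirVerifier {
    private final Path remoteRootLogDir;
    private volatile boolean remoteDirCreated = false;

    public RemoteLogDirVerifier(Path remoteRootLogDir) {
        this.remoteRootLogDir = remoteRootLogDir;
    }

    // Called once from service initialization.
    public void verifyAndCreateRemoteLogDir() {
        try {
            // No-op if the directory already exists, so this is safe to repeat.
            Files.createDirectories(remoteRootLogDir);
            remoteDirCreated = true;
        } catch (IOException e) {
            remoteDirCreated = false; // leave it to the per-app fallback below
        }
    }

    // Called per application: only retries when the one-time creation failed,
    // instead of checking the directory for every single application.
    public boolean ensureRemoteLogDir() {
        if (!remoteDirCreated) {
            verifyAndCreateRemoteLogDir();
        }
        return remoteDirCreated;
    }
}
```

With this shape, the steady-state cost per application is a volatile read rather than a round trip to the name node.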
[jira] [Commented] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node
[ https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969197#comment-15969197 ]

zhihai xu commented on YARN-6396:
-

Thanks for the review [~jianhe] and [~rkanter]! If someone deletes the remote log dir, all the old logs will disappear. That is the more serious issue: recreating the remote log dir won't recover the old log data. This looks like a monitoring problem, and I think it is better handled by a tool outside the NM. It is more efficient to do it in one place than on each NM, of which there can be many thousands in a large cluster. Yes, it's a trade-off between validation and efficiency.
[jira] [Comment Edited] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node
[ https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961891#comment-15961891 ]

zhihai xu edited comment on YARN-6396 at 4/13/17 7:47 PM:
--

Thanks for the review [~haibochen], [~jianhe], [~rkanter], [~xgong]. Could you also help review the patch? Thanks.

was (Author: zxu):
Thanks for the review [~haibochen], [~rkanter], [~xgong]. Could you also help review the patch? Thanks.
[jira] [Comment Edited] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node
[ https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961891#comment-15961891 ]

zhihai xu edited comment on YARN-6396 at 4/12/17 11:48 PM:
---

Thanks for the review [~haibochen], [~rkanter], [~xgong] Could you also help review the patch? thanks

was (Author: zxu):
Thanks for the review [~haibochen], [~rkanter][~xgong] Could you also help review the patch? thanks
[jira] [Comment Edited] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node
[ https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961891#comment-15961891 ]

zhihai xu edited comment on YARN-6396 at 4/12/17 11:48 PM:
---

Thanks for the review [~haibochen], [~rkanter][~xgong] Could you also help review the patch? thanks

was (Author: zxu):
Thanks for the review [~haibochen], [~xgong] Could you also help review the patch? thanks
[jira] [Comment Edited] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node
[ https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961891#comment-15961891 ]

zhihai xu edited comment on YARN-6396 at 4/8/17 8:29 PM:
-

Thanks for the review [~haibochen], [~xgong] Could you also help review the patch? thanks

was (Author: zxu):
[~xgong] Could you help review the patch? thanks
[jira] [Updated] (YARN-3001) RM dies because of divide by zero
[ https://issues.apache.org/jira/browse/YARN-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-3001:
-

Attachment: YARN-3001.barnch-2.7.patch

> RM dies because of divide by zero
> -
>
>                 Key: YARN-3001
>                 URL: https://issues.apache.org/jira/browse/YARN-3001
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 2.5.1
>            Reporter: hoelog
>            Assignee: Rohith Sharma K S
>         Attachments: YARN-3001.barnch-2.7.patch
>
> RM dies because of divide by zero exception.
> {code}
> 2014-12-31 21:27:05,022 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
> java.lang.ArithmeticException: / by zero
> at org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator.computeAvailableContainers(DefaultResourceCalculator.java:37)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1332)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1218)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1177)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:877)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:656)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:570)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:851)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:900)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:599)
> at java.lang.Thread.run(Thread.java:745)
> 2014-12-31 21:27:05,023 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
[jira] [Commented] (YARN-3001) RM dies because of divide by zero
[ https://issues.apache.org/jira/browse/YARN-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961908#comment-15961908 ]

zhihai xu commented on YARN-3001:
-

We also see this issue on CDH 5.7.2, which is based on the Hadoop 2.6 release plus patches from the Hadoop 2.7 release. I studied most of the code paths and found two potential corner cases which may cause this issue:

1. The maximum allocation can change as nodes are added and removed and the total resource on a node changes. If the maximum allocation transiently becomes 0, this issue can happen, since the following code in CapacityScheduler.allocate will normalize the ResourceRequests in the ask down to 0 when getMaximumResourceCapability returns 0.
{code}
SchedulerUtils.normalizeRequests(
    ask, getResourceCalculator(), getClusterResource(),
    getMinimumResourceCapability(), getMaximumResourceCapability());
{code}

2. The capability from an application's resource request is returned without cloning in LeafQueue.assignContainer, AppSchedulingInfo.cloneResourceRequest, and AppSchedulingInfo.getResource, so the capability in the returned resource request can potentially be changed from outside.

I implemented a patch based on branch-2.7 which fixes the first potential corner case. We have been running it for more than one month, and so far we haven't seen the issue happen with the attached patch.

The stack trace for the exception is:
{code}
2017-02-09 15:36:43,062 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
java.lang.ArithmeticException: / by zero
at org.apache.hadoop.yarn.util.resource.DominantResourceCalculator.computeAvailableContainers(DominantResourceCalculator.java:115)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1536)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1392)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1271)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersInternal(LeafQueue.java:830)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:734)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:586)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:447)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:586)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:447)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1027)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1069)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:114)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:691)
at java.lang.Thread.run(Thread.java:745)
{code}
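The first corner case above can be illustrated with a minimal, self-contained sketch (plain ints standing in for Hadoop Resource objects; the guard shown is a hypothetical fix for illustration, not the attached patch): once a request capability is normalized down to 0, the integer division inside a computeAvailableContainers-style calculation is exactly the `/ by zero` in the stack trace.

```java
// Stand-in for the "available / required" container arithmetic described above.
public class AvailableContainers {
    // Unguarded division, mirroring the failure mode: throws ArithmeticException
    // when the normalized request capability is 0.
    static int unguarded(int availableMemory, int requiredMemory) {
        return availableMemory / requiredMemory;
    }

    // Hypothetical guard: treat a zero-sized request as "no containers fit"
    // rather than crashing the scheduler's event-handling thread.
    static int guarded(int availableMemory, int requiredMemory) {
        if (requiredMemory <= 0) {
            return 0;
        }
        return availableMemory / requiredMemory;
    }
}
```

The point of the guard is that a transiently-zero maximum allocation then degrades to "schedule nothing this heartbeat" instead of killing the ResourceManager.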
[jira] [Commented] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node
[ https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961891#comment-15961891 ]

zhihai xu commented on YARN-6396:
-

[~xgong] Could you help review the patch? thanks
[jira] [Commented] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.
[ https://issues.apache.org/jira/browse/YARN-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960233#comment-15960233 ]

zhihai xu commented on YARN-4095:
-

[~Feng Yuan], I think that for ShuffleHandler we always want it to access all local directories, including the full ones, since output data from mappers may be sitting on the full local directories. Otherwise shuffle may fail because a data or index file can't be found in the good local directories.
{code}
Path indexFileName = lDirAlloc.getLocalPathToRead(
    attemptBase + "/" + INDEX_FILE_NAME, conf);
Path mapOutputFileName = lDirAlloc.getLocalPathToRead(
    attemptBase + "/" + DATA_FILE_NAME, conf);

public Path getLocalPathToRead(String pathStr, Configuration conf)
    throws IOException {
  Context ctx = confChanged(conf);
  int numDirs = ctx.localDirs.length;
  int numDirsSearched = 0;
  //remove the leading slash from the path (to make sure that the uri
  //resolution results in a valid path on the dir being checked)
  if (pathStr.startsWith("/")) {
    pathStr = pathStr.substring(1);
  }
  while (numDirsSearched < numDirs) {
    Path file = new Path(ctx.localDirs[numDirsSearched], pathStr);
    if (ctx.localFS.exists(file)) {
      return file;
    }
    numDirsSearched++;
  }
  //no path found
  throw new DiskErrorException("Could not find " + pathStr + " in any of"
      + " the configured local directories");
}
{code}
I think this may also be the reason why we didn't want to share the same configuration between ShuffleHandler and LocalDirsHandlerService.

> Avoid sharing AllocatorPerContext object in LocalDirAllocator between
> ShuffleHandler and LocalDirsHandlerService.
> -
>
>                 Key: YARN-4095
>                 URL: https://issues.apache.org/jira/browse/YARN-4095
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>             Fix For: 2.8.0, 3.0.0-alpha1
>
>         Attachments: YARN-4095.000.patch, YARN-4095.001.patch
>
> Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share the
> {{AllocatorPerContext}} object in {{LocalDirAllocator}} for the
> configuration {{NM_LOCAL_DIRS}}, because {{AllocatorPerContext}} objects
> are stored in a static TreeMap with the configuration name as the key:
> {code}
> private static Map contexts =
>     new TreeMap();
> {code}
> {{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a
> {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even though they don't use
> the same {{Configuration}} object, they will use the same
> {{AllocatorPerContext}} object. Also, {{LocalDirsHandlerService}} may change
> the {{NM_LOCAL_DIRS}} value in its {{Configuration}} object to exclude full
> and bad local dirs, while {{ShuffleHandler}} always uses the original
> {{NM_LOCAL_DIRS}} value in its {{Configuration}} object. So every time
> {{AllocatorPerContext#confChanged}} is called by {{ShuffleHandler}} after
> {{LocalDirsHandlerService}}, the {{AllocatorPerContext}} needs to be
> reinitialized because the {{NM_LOCAL_DIRS}} value changed. This causes some
> overhead.
> {code}
> String newLocalDirs = conf.get(contextCfgItemName);
> if (!newLocalDirs.equals(savedLocalDirs)) {
> {code}
> So it will be a good improvement to not share the same
> {{AllocatorPerContext}} instance between {{ShuffleHandler}} and
> {{LocalDirsHandlerService}}.
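A simplified model of the getLocalPathToRead loop quoted above (java.nio stand-ins, not the Hadoop LocalDirAllocator API) shows why the shuffle side needs the full, unfiltered directory list: a map-output file may have been written to a directory before it filled up, so the reader must probe every configured directory, not just the currently "good" ones.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative stand-in for searching all configured NM local dirs for a file.
public class LocalPathReader {
    static Path getLocalPathToRead(List<Path> localDirs, String relPath)
            throws IOException {
        if (relPath.startsWith("/")) {
            relPath = relPath.substring(1); // keep the path relative to each dir
        }
        // Probe every directory in order, including "full" ones: the file may
        // have landed there before the directory was excluded for new writes.
        for (Path dir : localDirs) {
            Path candidate = dir.resolve(relPath);
            if (Files.exists(candidate)) {
                return candidate;
            }
        }
        throw new IOException("Could not find " + relPath
            + " in any of the configured local directories");
    }
}
```

If the reader were handed the filtered list that LocalDirsHandlerService maintains for writes, any file on an excluded directory would become unreachable and the shuffle would fail.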
[jira] [Commented] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node
[ https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15942651#comment-15942651 ]

zhihai xu commented on YARN-6396:
-

I attached a patch which calls verifyAndCreateRemoteLogDir in serviceStart, and only calls it again in initApp when the remote log directory failed to be created.
[jira] [Updated] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node
[ https://issues.apache.org/jira/browse/YARN-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-6396:
-

Attachment: YARN-6396.000.patch
[jira] [Created] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node
zhihai xu created YARN-6396:
---

Summary: Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node
Key: YARN-6396
URL: https://issues.apache.org/jira/browse/YARN-6396
Project: Hadoop YARN
Issue Type: Improvement
Components: log-aggregation
Affects Versions: 3.0.0-alpha2
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor

Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node.

Currently verifyAndCreateRemoteLogDir is called for every application at each node before doing log aggregation. This is a non-trivial overhead for the name node in a large cluster, since verifyAndCreateRemoteLogDir calls getFileStatus. Once the remote log directory has been created successfully, it is not necessary to call it again, so it is better to call verifyAndCreateRemoteLogDir at LogAggregationService service initialization.
[jira] [Comment Edited] (YARN-6392) add submit time to Application Summary log
[ https://issues.apache.org/jira/browse/YARN-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15942636#comment-15942636 ]

zhihai xu edited comment on YARN-6392 at 3/27/17 4:45 AM:
--

The test failures are not related to my change.

was (Author: zxu):
The test failures are related to my change.

> add submit time to Application Summary log
> --
>
>                 Key: YARN-6392
>                 URL: https://issues.apache.org/jira/browse/YARN-6392
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha2
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Minor
>         Attachments: YARN-6392.000.patch
>
> Add submit time to the Application Summary log. The application submit time
> is passed to the Application Master in the env variable
> "APP_SUBMIT_TIME_ENV". It is a very important parameter, so it will be
> useful to log it in the Application Summary.
[jira] [Commented] (YARN-6392) add submit time to Application Summary log
[ https://issues.apache.org/jira/browse/YARN-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15942636#comment-15942636 ]

zhihai xu commented on YARN-6392:
-

The test failures are related to my change.
[jira] [Commented] (YARN-6392) add submit time to Application Summary log
[ https://issues.apache.org/jira/browse/YARN-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15942589#comment-15942589 ]

zhihai xu commented on YARN-6392:
-

I attached a patch, YARN-6392.000.patch, which will log submitTime in the Application Summary.
[jira] [Updated] (YARN-6392) add submit time to Application Summary log
[ https://issues.apache.org/jira/browse/YARN-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-6392: Attachment: YARN-6392.000.patch > add submit time to Application Summary log > -- > > Key: YARN-6392 > URL: https://issues.apache.org/jira/browse/YARN-6392 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Minor > Attachments: YARN-6392.000.patch > > > add submit time to Application Summary log, application submit time will be > passed to Application Master in env variable "APP_SUBMIT_TIME_ENV". It is a > very important parameter, So it will be useful to log it in Application > Summary. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-6392) add submit time to Application Summary log
zhihai xu created YARN-6392: --- Summary: add submit time to Application Summary log Key: YARN-6392 URL: https://issues.apache.org/jira/browse/YARN-6392 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 3.0.0-alpha2 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Add submit time to the Application Summary log. The application submit time will be passed to the Application Master in the env variable "APP_SUBMIT_TIME_ENV". It is a very important parameter, so it will be useful to log it in the Application Summary. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
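As a rough illustration of how an ApplicationMaster could consume the value discussed above, the sketch below reads the APP_SUBMIT_TIME_ENV environment variable. The `SubmitTimeReader` class and its fallback behavior are hypothetical helpers for this example, not YARN code; only the env variable name comes from the issue text.

```java
public class SubmitTimeReader {
    // Env variable name quoted in the issue description.
    static final String APP_SUBMIT_TIME_ENV = "APP_SUBMIT_TIME_ENV";

    // Parse the submit time (epoch millis), falling back to -1 when the
    // variable is unset or malformed. Hypothetical helper, not YARN API.
    static long parseSubmitTime(String raw) {
        if (raw == null) {
            return -1L;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    public static void main(String[] args) {
        // In a real AM container the RM would have set this variable.
        long submitTime = parseSubmitTime(System.getenv(APP_SUBMIT_TIME_ENV));
        System.out.println("submitTime=" + submitTime);
    }
}
```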
[jira] [Commented] (YARN-5288) Resource Localization fails due to leftover files
[ https://issues.apache.org/jira/browse/YARN-5288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15344910#comment-15344910 ] zhihai xu commented on YARN-5288: - Thanks for reporting this issue [~yufeigu]! Can YARN-3727 fix your issue? YARN-3727 will delete the leftover files and move on to the next directory if the leftover files are there. > Resource Localization fails due to leftover files > - > > Key: YARN-5288 > URL: https://issues.apache.org/jira/browse/YARN-5288 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.9.0 >Reporter: Yufei Gu >Assignee: Yufei Gu > > NM restart didn't clean up all of the user cache. The leftover files can cause > resource localization failures. > {code} > 2016-06-14 23:09:12,717 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > > java.io.IOException: Rename cannot overwrite non empty destination directory > /data/5/yarn/nm/usercache/xxx/filecache/4567 > at > org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) > at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:236) > at > org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) > at org.apache.hadoop.fs.FileContext.rename(FileContext.java:912) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364) > at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
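The failure mode above (a rename refusing to overwrite a non-empty leftover destination) and the YARN-3727-style fix of cleaning up the destination before renaming can be sketched outside YARN with plain java.nio. The class and method names below are illustrative only, not the NodeManager implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;

public class LeftoverDirDemo {
    // Delete any leftover destination directory recursively, then rename the
    // freshly downloaded source onto it -- the same idea as YARN-3727's
    // cleanup. Illustrative java.nio sketch, not YARN code.
    static void cleanupAndRename(Path src, Path dst) throws IOException {
        if (Files.exists(dst)) {
            // Deepest entries first, so directories are empty when deleted.
            Files.walk(dst)
                 .sorted(Comparator.reverseOrder())
                 .forEach(p -> p.toFile().delete());
        }
        Files.move(src, dst); // would fail if dst were still non-empty
    }

    // Self-contained demo: a non-empty leftover dst blocks the rename until
    // it is cleaned up first. Returns true when the rename succeeded.
    static boolean demo() {
        try {
            Path root = Files.createTempDirectory("yarn5288");
            Path src = Files.createDirectories(root.resolve("4567_tmp"));
            Files.write(src.resolve("resource.jar"), new byte[]{1, 2, 3});
            Path dst = Files.createDirectories(root.resolve("4567"));
            Files.write(dst.resolve("leftover"), new byte[]{9}); // stale file
            cleanupAndRename(src, dst);
            return Files.exists(dst.resolve("resource.jar"));
        } catch (IOException e) {
            return false;
        }
    }
}
```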
[jira] [Commented] (YARN-4979) FSAppAttempt demand calculation considers demands at multiple locality levels different
[ https://issues.apache.org/jira/browse/YARN-4979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297200#comment-15297200 ] zhihai xu commented on YARN-4979: - thanks [~kasha] for reviewing and committing the patch! > FSAppAttempt demand calculation considers demands at multiple locality levels > different > --- > > Key: YARN-4979 > URL: https://issues.apache.org/jira/browse/YARN-4979 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.8.0, 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu > Fix For: 2.9.0 > > Attachments: YARN-4979.001.patch > > > FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand. We > should only count ResourceRequest for ResourceRequest.ANY when calculate > demand. > Because {{hasContainerForNode}} will return false if no container request for > ResourceRequest.ANY and both {{allocateNodeLocal}} and {{allocateRackLocal}} > will also decrease the number of containers for ResourceRequest.ANY. > This issue may cause current memory demand overflow(integer) because > duplicate requests can be on multiple nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252087#comment-15252087 ] zhihai xu commented on YARN-1458: - OK, no problem, you can try it at your convenience. Thanks for finding this issue! > FairScheduler: Zero weight can lead to livelock > --- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Fix For: 2.6.0 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, > YARN-1458.addendum.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, > yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submitted lots of jobs; it is not easy to reproduce. We ran the test cluster > for days to reproduce it. 
The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251356#comment-15251356 ] zhihai xu commented on YARN-1458: - I think FSAppAttempt may add duplicate ResourceRequests to demand, which may cause an integer overflow in the current memory demand. I created YARN-4979 to fix the incorrect demand calculation in FSAppAttempt; it may be the root cause of this issue. > FairScheduler: Zero weight can lead to livelock > --- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Fix For: 2.6.0 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, > YARN-1458.addendum.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, > yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submitted lots of jobs; it is not easy to reproduce. We ran the test cluster > for days to reproduce it. 
The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4979) FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.
[ https://issues.apache.org/jira/browse/YARN-4979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4979: Attachment: YARN-4979.001.patch > FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand. > -- > > Key: YARN-4979 > URL: https://issues.apache.org/jira/browse/YARN-4979 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.8.0, 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4979.001.patch > > > FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand. We > should only count ResourceRequest for ResourceRequest.ANY when calculate > demand. > Because {{hasContainerForNode}} will return false if no container request for > ResourceRequest.ANY and both {{allocateNodeLocal}} and {{allocateRackLocal}} > will also decrease the number of containers for ResourceRequest.ANY. > This issue may cause current memory demand overflow(integer) because > duplicate requests can be on multiple nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4979) FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.
zhihai xu created YARN-4979: --- Summary: FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand. Key: YARN-4979 URL: https://issues.apache.org/jira/browse/YARN-4979 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.2, 2.8.0 Reporter: zhihai xu Assignee: zhihai xu FSAppAttempt adds duplicate ResourceRequests to demand in updateDemand. We should only count the ResourceRequest for ResourceRequest.ANY when calculating demand, because {{hasContainerForNode}} will return false if there is no container request for ResourceRequest.ANY, and both {{allocateNodeLocal}} and {{allocateRackLocal}} will also decrease the number of containers for ResourceRequest.ANY. This issue may cause the current memory demand to overflow (integer) because duplicate requests can be counted on multiple nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
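A minimal sketch of why counting the same request once per node overflows the int demand, while counting only the ResourceRequest.ANY entry does not. The numbers are made up for illustration, and this is not FSAppAttempt code:

```java
public class DemandOverflowDemo {
    // The buggy shape: the same outstanding request is accumulated once per
    // node/rack-local duplicate, so large clusters overflow the int demand.
    static int demandCountingPerNode(int anyRequestMb, int numNodes) {
        int demand = 0;
        for (int i = 0; i < numNodes; i++) {
            demand += anyRequestMb; // duplicate node/rack-local requests
        }
        return demand;
    }

    // The fixed shape: count only the ResourceRequest.ANY entry once.
    static int demandCountingAnyOnly(int anyRequestMb) {
        return anyRequestMb;
    }
}
```

With a 300,000 MB outstanding request duplicated across 10,000 nodes, the per-node sum is 3,000,000,000, which wraps past `Integer.MAX_VALUE` into a negative value.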
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251016#comment-15251016 ] zhihai xu commented on YARN-1458: - Hi [~dwatzke], thanks for reporting this issue. I double-checked the code and found one corner case which can cause this issue; hopefully it is the only case that isn't handled. The corner case is when the current memory demand for the app overflows an integer. If that happens, the weight will become [NaN|https://docs.oracle.com/javase/7/docs/api/java/lang/Math.html#log1p(double)] because the current memory demand is a negative value. {code} weight = Math.log1p(app.getDemand().getMemory()) / Math.log(2); {code} {{getFairShareIfFixed}} treats a NaN weight the same as a positive weight, and {{computeShare}} will always return 0 if the weight is NaN because {{share}} is NaN and {{(int)NaN}} is 0. I attached an addendum patch, YARN-1458.addendum.patch. Could you verify whether it fixes your issue? Thanks! > FairScheduler: Zero weight can lead to livelock > --- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Fix For: 2.6.0 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, > YARN-1458.addendum.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, > yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submitted lots of jobs; it is not easy to reproduce. We ran the test cluster > for days to reproduce it. 
The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
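The NaN corner case described in the comment above can be reproduced in isolation. The weight formula is copied from the comment; `shareFor` is a simplified, hypothetical stand-in for the share computation, only there to show that `(int) NaN` is 0 in Java:

```java
public class NaNWeightDemo {
    // Weight formula quoted in the YARN-1458 comment; demandMemory is the
    // app's memory demand in MB. For any demand below -1 (i.e. after an
    // integer overflow), Math.log1p returns NaN.
    static double weight(int demandMemory) {
        return Math.log1p(demandMemory) / Math.log(2);
    }

    // Simplified stand-in for the share computation (not ComputeFairShares
    // code): a NaN weight makes the share NaN, and the narrowing cast
    // (int) NaN yields 0, so the app is always computed a zero share.
    static int shareFor(double weight, int totalResource) {
        double share = weight * totalResource;
        return (int) share;
    }
}
```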
[jira] [Updated] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-1458: Attachment: YARN-1458.addendum.patch > FairScheduler: Zero weight can lead to livelock > --- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Fix For: 2.6.0 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, > YARN-1458.addendum.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, > yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. 
The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2910) FSLeafQueue can throw ConcurrentModificationException
[ https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248637#comment-15248637 ] zhihai xu commented on YARN-2910: - Linked YARN-2975 to this issue. It looks like we need both YARN-2910 and YARN-2975 to fix this issue completely. > FSLeafQueue can throw ConcurrentModificationException > - > > Key: YARN-2910 > URL: https://issues.apache.org/jira/browse/YARN-2910 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg > Labels: 2.6.1-candidate > Fix For: 2.7.0, 2.6.1 > > Attachments: FSLeafQueue_concurrent_exception.txt, > YARN-2910.004.patch, YARN-2910.1.patch, YARN-2910.2.patch, YARN-2910.3.patch, > YARN-2910.4.patch, YARN-2910.5.patch, YARN-2910.6.patch, YARN-2910.7.patch, > YARN-2910.8.patch, YARN-2910.patch > > > The lists that maintain the runnable and the non-runnable apps are standard > ArrayLists, but there is no guarantee that they will only be manipulated by one > thread in the system. This can lead to the following exception: > {noformat} > 2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN > CONTACTING RM. 
> java.util.ConcurrentModificationException: > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) > at java.util.ArrayList$Itr.next(ArrayList.java:831) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516) > {noformat} > Full stack trace in the attached file. > We should guard against that by using a thread safe version from > java.util.concurrent.CopyOnWriteArrayList -- This message was sent by Atlassian JIRA (v6.3.4#6332)
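The suggested fix can be demonstrated with a small single-threaded sketch: mutating a plain ArrayList while a for-each loop iterates it throws ConcurrentModificationException, whereas a CopyOnWriteArrayList iterator works on a snapshot and completes. This is illustrative code, not the FSLeafQueue implementation:

```java
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {
    // Returns true if iterating the given list while removing from it throws
    // ConcurrentModificationException. The in-loop remove stands in for a
    // second thread mutating the runnable-apps list mid-iteration.
    static boolean throwsCme(List<Integer> apps) {
        try {
            for (Integer app : apps) {
                if (app == 1) {
                    apps.remove(app); // mutation during iteration
                }
            }
            return false; // iteration completed without an exception
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```

An ArrayList fails fast on the next iterator step after the removal; a CopyOnWriteArrayList simply finishes iterating its snapshot, which is why the patch swaps the list type.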
[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
[ https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182647#comment-15182647 ] zhihai xu commented on YARN-4761: - I just committed it to trunk, branch-2, branch-2.8, branch-2.7 and branch-2.6. thanks [~sjlee0] for the contribution and thanks [~rohithsharma] for the review! > NMs reconnecting with changed capabilities can lead to wrong cluster resource > calculations on fair scheduler > > > Key: YARN-4761 > URL: https://issues.apache.org/jira/browse/YARN-4761 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.4 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Fix For: 2.8.0, 2.7.3, 2.6.5 > > Attachments: YARN-4761.01.patch, YARN-4761.02.patch > > > YARN-3802 uncovered an issue with the scheduler where the resource > calculation can be incorrect due to async event handling. It was subsequently > fixed by YARN-4344, but it was never fixed for the fair scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
[ https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182514#comment-15182514 ] zhihai xu commented on YARN-4761: - +1 for the latest patch. The test failures are not related to the patch, and one test failure is the same as YARN-4306. Will commit the patch shortly. > NMs reconnecting with changed capabilities can lead to wrong cluster resource > calculations on fair scheduler > > > Key: YARN-4761 > URL: https://issues.apache.org/jira/browse/YARN-4761 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.4 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > Attachments: YARN-4761.01.patch, YARN-4761.02.patch > > > YARN-3802 uncovered an issue with the scheduler where the resource > calculation can be incorrect due to async event handling. It was subsequently > fixed by YARN-4344, but it was never fixed for the fair scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4761) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations on fair scheduler
[ https://issues.apache.org/jira/browse/YARN-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179066#comment-15179066 ] zhihai xu commented on YARN-4761: - Good finding [~sjlee0]! The same issue could also happen in the fair scheduler; we should decouple the RMNode status from the fair scheduler as well. > NMs reconnecting with changed capabilities can lead to wrong cluster resource > calculations on fair scheduler > > > Key: YARN-4761 > URL: https://issues.apache.org/jira/browse/YARN-4761 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.4 >Reporter: Sangjin Lee >Assignee: Sangjin Lee > > YARN-3802 uncovered an issue with the scheduler where the resource > calculation can be incorrect due to async event handling. It was subsequently > fixed by YARN-4344, but it was never fixed for the fair scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.
[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170481#comment-15170481 ] zhihai xu commented on YARN-4728: - Yes, MAPREDUCE-6513 is possible, but YARN-1680 may be more likely, because blacklisted nodes can occur more easily in your environment than MAPREDUCE-6513, especially with mapreduce.job.reduce.slowstart.completedmaps=1. To see whether it is MAPREDUCE-6513 or YARN-1680, you need to check the log to see whether the reduce task is preempted. If the reduce task is preempted and the map task still can't get resources, it is MAPREDUCE-6513/MAPREDUCE-6514; otherwise, it is YARN-1680. Even if YARN-1680, which triggers the preemption, is fixed, MAPREDUCE-6513 can still happen. > MapReduce job doesn't make any progress for a very very long time after one > Node become unusable. > - > > Key: YARN-4728 > URL: https://issues.apache.org/jira/browse/YARN-4728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.6.0 > Environment: hadoop 2.6.0 > yarn >Reporter: Silnov >Priority: Critical > Original Estimate: 24h > Remaining Estimate: 24h > > I have some nodes running hadoop 2.6.0. > The cluster's configuration remains largely at the defaults. > I run some jobs on the cluster (especially jobs processing a lot of data) > every day. > Sometimes I found my job stuck at the same progress for a very, very long > time, so I had to kill the job manually and re-submit it to the cluster. It > worked well before (re-submit the job and it ran to the end), but something went > wrong today. > After I re-submitted the same task 3 times, its run deadlocked (the > progress doesn't change for a long time, and each time has a different > progress value, e.g. 33.01%, 45.8%, 73.21%). > I began by checking the web UI for hadoop, and found 98 map tasks > suspended while all the running reduce tasks had consumed all the available > memory. 
I stopped yarn, added the configuration below into yarn-site.xml, and > then restarted yarn. > yarn.app.mapreduce.am.job.reduce.rampup.limit > 0.1 > yarn.app.mapreduce.am.job.reduce.preemption.limit > 1.0 > (wanting yarn to preempt the reduce tasks' resources to run the suspended map > tasks) > After restarting yarn, I submitted the job with the property > mapreduce.job.reduce.slowstart.completedmaps=1, > but the same result happened again!! (my job remained at the same progress value for > a very, very long time) > I checked the web UI for hadoop again and found that the suspended map tasks > were created anew with the note: "TaskAttempt killed because it ran on > unusable node node02:21349". > Then I checked the resourcemanager's log and found some useful messages below: > **Deactivating Node node02:21349 as it is now LOST. > **node02:21349 Node Transitioned from RUNNING to LOST. > I think this may happen because my network across the cluster is not good, > which causes the RM not to receive the NM's heartbeat in time. > But I wonder why the yarn framework can't preempt the running reduce > tasks' resources to run the suspended map tasks? (this causes the job to remain at the > same progress value for a very, very long time :( ) > Can anyone help? > Thank you very much! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.
[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160305#comment-15160305 ] zhihai xu commented on YARN-4728: - Thanks for reporting this issue [~Silnov]! It looks like this issue is caused by long timeouts at two levels. This issue is similar to YARN-3944, YARN-4414, YARN-3238 and YARN-3554. You may work around this issue by changing the configuration values: "ipc.client.connect.max.retries.on.timeouts" (default is 45), "ipc.client.connect.timeout" (default is 20,000ms) and "yarn.client.nodemanager-connect.max-wait-ms" (default is 900,000ms). > MapReduce job doesn't make any progress for a very very long time after one > Node become unusable. > - > > Key: YARN-4728 > URL: https://issues.apache.org/jira/browse/YARN-4728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.6.0 > Environment: hadoop 2.6.0 > yarn >Reporter: Silnov >Priority: Critical > Original Estimate: 24h > Remaining Estimate: 24h > > I have some nodes running hadoop 2.6.0. > The cluster's configuration remain default largely. > I run some job on the cluster(especially some job processing a lot of data) > every day. > Sometimes, I found my job remain the same progression for a very very long > time. So I have to kill the job mannually and re-submit it to the cluster. It > works well before(re-submit the job and it run to the end), but something go > wrong today. > After I re-submit the same task for 3 times, its running go deadlock(the > progression doesn't change for a long time, and each time has a different > progress value.e.g.33.01%,45.8%,73.21%). > I begin to check the web UI for the hadoop, then I find there are 98 map > suspend while all the running reduce task have consumed all the avaliable > memory. I stop the yarn and add configuration below into yarn-site.xml and > then restart the yarn. 
> yarn.app.mapreduce.am.job.reduce.rampup.limit > 0.1 > yarn.app.mapreduce.am.job.reduce.preemption.limit > 1.0 > (wanting the yarn to preempt the reduce task's resource to run suspending map > task) > After restart the yarn,I submit the job with the property > mapreduce.job.reduce.slowstart.completedmaps=1. > but the same result happen again!!(my job remain the same progress value for > a very very long time) > I check the web UI for the hadoop again,and find that the suspended map task > is newed with the previous note:"TaskAttempt killed because it ran on > unusable node node02:21349". > Then I check the resourcemanager's log and find some useful messages below: > **Deactivating Node node02:21349 as it is now LOST. > **node02:21349 Node Transitioned from RUNNING to LOST. > I think this may happen because my network across the cluster is not good > which cause the RM don't receive the NM's heartbeat in time. > But I wonder that why the yarn framework can't preempt the running reduce > task's resource to run the suspend map task?(this cause the job remain the > same progress value for a very very long time:( ) > Any one can help? > Thank you very much! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
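The workaround settings named in the comment above can be sketched as configuration overrides. This is only an illustration — the property names come from the comment, but every value below is an assumed starting point, not a recommendation; the ipc.* keys belong in core-site.xml and the NodeManager-connect key in yarn-site.xml:

```xml
<!-- core-site.xml: shrink the IPC connect retry window so a LOST node is
     given up on sooner (values here are illustrative assumptions). -->
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>5</value>       <!-- default is 45 -->
</property>
<property>
  <name>ipc.client.connect.timeout</name>
  <value>10000</value>   <!-- milliseconds per connect attempt -->
</property>

<!-- yarn-site.xml: cap the total time a client waits for a NodeManager. -->
<property>
  <name>yarn.client.nodemanager-connect.max-wait-ms</name>
  <value>180000</value>  <!-- default is 900,000 ms (15 minutes) -->
</property>
```

Shorter timeouts surface an unusable node faster, at the cost of giving up earlier on nodes that are merely slow to respond.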
[jira] [Updated] (YARN-4502) Fix two AM containers get allocated when AM restart
[ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4502: Summary: Fix two AM containers get allocated when AM restart (was: gjfbndbfcjenrgccriejuvcnktllcc) > Fix two AM containers get allocated when AM restart > --- > > Key: YARN-4502 > URL: https://issues.apache.org/jira/browse/YARN-4502 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Vinod Kumar Vavilapalli >Priority: Critical > Fix For: 2.8.0 > > Attachments: YARN-4502-20160114.txt, YARN-4502-20160212.txt > > > Scenario : > * set yarn.resourcemanager.am.max-attempts = 2 > * start dshell application > {code} > yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar > hadoop-yarn-applications-distributedshell-*.jar > -attempt_failures_validity_interval 6 -shell_command "sleep 150" > -num_containers 16 > {code} > * Kill AM pid > * Print container list for 2nd attempt > {code} > yarn container -list appattempt_1450825622869_0001_02 > INFO impl.TimelineClientImpl: Timeline service address: > http://xxx:port/ws/v1/timeline/ > INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10: > Total number of containers :2 > Container-Id Start Time Finish Time > StateHost Node Http Address >LOG-URL > container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa > container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa > {code} > * look for new AM pid > Here, 2nd AM container was suppose to be started on > container_e12_1450825622869_0001_02_01. But AM was not launched on > container_e12_1450825622869_0001_02_01. It was in AQUIRED state. > On other hand, container_e12_1450825622869_0001_02_02 got the AM running. 
> Expected behavior: RM should not start 2 containers for starting AM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4502) Fix two AM containers get allocated when AM restart
[ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129324#comment-15129324 ] zhihai xu commented on YARN-4502: - +1 also. This patch also covers the case where a container receives an RMContainerEventType.EXPIRE event in state RMContainerState.ALLOCATED, which was not covered by YARN-3535. Based on the original suggestion by [~leftnoteasy], it looks like the implementation of {{AbstractYarnScheduler#getApplicationAttempt(ApplicationAttemptId applicationAttemptId)}} is also confusing: it always returns the current application attempt even when the current application attempt doesn't match the given {{applicationAttemptId}}. In contrast, {{RMAppImpl#getRMAppAttempt(ApplicationAttemptId appAttemptId)}} always returns the matching {{RMAppAttempt}}. Should we fix it in a follow-up JIRA? > Fix two AM containers get allocated when AM restart > --- > > Key: YARN-4502 > URL: https://issues.apache.org/jira/browse/YARN-4502 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Vinod Kumar Vavilapalli >Priority: Critical > Fix For: 2.8.0 > > Attachments: YARN-4502-20160114.txt, YARN-4502-20160212.txt > > > Scenario : > * set yarn.resourcemanager.am.max-attempts = 2 > * start dshell application > {code} > yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar > hadoop-yarn-applications-distributedshell-*.jar > -attempt_failures_validity_interval 6 -shell_command "sleep 150" > -num_containers 16 > {code} > * Kill AM pid > * Print container list for 2nd attempt > {code} > yarn container -list appattempt_1450825622869_0001_02 > INFO impl.TimelineClientImpl: Timeline service address: > http://xxx:port/ws/v1/timeline/ > INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10: > Total number of containers :2 > Container-Id Start Time Finish Time > StateHost Node Http Address >LOG-URL > container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 > N/A 
RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa > container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa > {code} > * look for new AM pid > Here, 2nd AM container was suppose to be started on > container_e12_1450825622869_0001_02_01. But AM was not launched on > container_e12_1450825622869_0001_02_01. It was in AQUIRED state. > On other hand, container_e12_1450825622869_0001_02_02 got the AM running. > Expected behavior: RM should not start 2 containers for starting AM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
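The getApplicationAttempt concern raised in the comment above can be shown with a minimal sketch. All names here are hypothetical (this is not the actual YARN scheduler code): a lookup that returns null on an attempt-id mismatch, mirroring the RMAppImpl#getRMAppAttempt behavior, rather than silently handing back the current attempt.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a per-application attempt lookup that only returns
// the attempt when the requested id matches the current one, instead of
// always returning the current attempt as described in the comment.
public class AttemptLookupSketch {
    static class Attempt {
        final String attemptId;
        Attempt(String attemptId) { this.attemptId = attemptId; }
    }

    private final Map<String, Attempt> currentAttemptByApp = new HashMap<>();

    void setCurrentAttempt(String appId, Attempt attempt) {
        currentAttemptByApp.put(appId, attempt);
    }

    // Returns null when the given attempt id is no longer the current one,
    // so callers cannot mistakenly operate on a different attempt.
    Attempt getApplicationAttempt(String appId, String attemptId) {
        Attempt current = currentAttemptByApp.get(appId);
        if (current == null || !current.attemptId.equals(attemptId)) {
            return null;
        }
        return current;
    }
}
```

A caller that receives null can then treat the event as stale, which is the safer default for events addressed to a superseded attempt.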
[jira] [Commented] (YARN-4646) AMRMClient crashed when RM transition from active to standby
[ https://issues.apache.org/jira/browse/YARN-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118816#comment-15118816 ] zhihai xu commented on YARN-4646: - Is this issue fixed in MAPREDUCE-6439? They have the same stack trace. > AMRMClient crashed when RM transition from active to standby > > > Key: YARN-4646 > URL: https://issues.apache.org/jira/browse/YARN-4646 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee > > when RM transition to standby, ApplicationMasterService#allocate() is > interrupted and the exception is passed to AM. > the following is the exception msg: > {quote} > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:266) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:448) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > Caused by: java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) > at > 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:258) > ... 11 more > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:107) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) > at com.sun.proxy.$Proxy35.allocate(Unknown Source) > at > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:274) > at > org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:237) > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnRuntimeException): > java.lang.InterruptedException > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:266) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:448) > 
at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$
[jira] [Commented] (YARN-3446) FairScheduler headroom calculation should exclude nodes in the blacklist
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098617#comment-15098617 ] zhihai xu commented on YARN-3446: - [~kasha], thanks for the review and committing the patch! > FairScheduler headroom calculation should exclude nodes in the blacklist > > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Fix For: 2.9.0 > > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch, > YARN-3446.005.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
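The headroom fix described in this issue can be sketched in a few lines. This is an illustration under assumed names (it is not the actual FairScheduler implementation): the available resources of nodes the application has blacklisted are simply excluded from the sum the AM receives, so the AM no longer sees headroom it can never use.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of blacklist-aware headroom: sum available memory
// only over nodes the application has NOT blacklisted.
public class HeadroomSketch {
    // availableByNode: available memory (MB) per node;
    // blacklist: node names the AM has blacklisted for this application.
    static long headroomMb(Map<String, Long> availableByNode,
                           Set<String> blacklist) {
        long headroom = 0;
        for (Map.Entry<String, Long> e : availableByNode.entrySet()) {
            if (!blacklist.contains(e.getKey())) {
                headroom += e.getValue();  // blacklisted nodes contribute nothing
            }
        }
        return headroom;
    }
}
```

With this accounting, an AM whose only "free" capacity sits on blacklisted nodes sees zero headroom and can decide to preempt reducers instead of hanging.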
[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098347#comment-15098347 ] zhihai xu commented on YARN-3446: - The test failures for TestClientRMTokens and TestAMAuthorization are not related to the patch. Both tests pass in my local build. > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch, > YARN-3446.005.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3446: Attachment: YARN-3446.005.patch > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch, > YARN-3446.005.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097605#comment-15097605 ] zhihai xu commented on YARN-3446: - Thanks for the review [~kasha]! That is a good suggestion. I attached a new patch YARN-3446.005.patch, which addressed your comments. Please review it. > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3697) FairScheduler: ContinuousSchedulingThread can fail to shutdown
[ https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082641#comment-15082641 ] zhihai xu commented on YARN-3697: - [~djp], yes, I just committed it to branch-2.6. thanks > FairScheduler: ContinuousSchedulingThread can fail to shutdown > -- > > Key: YARN-3697 > URL: https://issues.apache.org/jira/browse/YARN-3697 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Fix For: 2.7.2, 2.6.4 > > Attachments: YARN-3697.000.patch, YARN-3697.001.patch > > > FairScheduler: ContinuousSchedulingThread can't be shutdown after stop > sometimes. > The reason is because the InterruptedException is blocked in > continuousSchedulingAttempt > {code} > try { > if (node != null && Resources.fitsIn(minimumAllocation, > node.getAvailableResource())) { > attemptScheduling(node); > } > } catch (Throwable ex) { > LOG.error("Error while attempting scheduling for node " + node + > ": " + ex.toString(), ex); > } > {code} > I saw the following exception after stop: > {code} > 2015-05-17 23:30:43,065 WARN [FairSchedulerContinuousScheduling] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467) > at > 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285) > 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] > fair.FairScheduler 
(FairScheduler.java:continuousSchedulingAttempt(1017)) - > Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 > available= used=: > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467) > at > org.apache.hadoop.yarn.server.re
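The shutdown bug above comes from a catch (Throwable) inside the scheduling loop swallowing the InterruptedException raised at stop time, so the thread's interrupt status is lost and the loop never exits. The pattern of the fix can be sketched as follows (illustrative names, not the actual FairScheduler code): restore the interrupt status inside the catch so the loop condition observes the stop request.

```java
// Hypothetical sketch: a scheduling loop whose catch-all handler re-asserts
// the thread's interrupt status instead of dropping it, so shutdown works.
public class InterruptAwareLoop implements Runnable {

    // Re-assert the interrupt when the throwable is (or wraps) an
    // InterruptedException; otherwise keep scheduling.
    void handle(Throwable t) {
        if (t instanceof InterruptedException
            || t.getCause() instanceof InterruptedException) {
            Thread.currentThread().interrupt();  // restore interrupt status
        }
        // else: log the error and continue (logging omitted in this sketch)
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                attemptScheduling();
            } catch (Throwable t) {
                handle(t);  // without this, the shutdown interrupt is lost
            }
        }
    }

    void attemptScheduling() throws InterruptedException {
        Thread.sleep(10);  // stand-in for one scheduling attempt
    }
}
```

The getCause check matters because, as the stack traces above show, the InterruptedException often arrives wrapped in a YarnRuntimeException.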
[jira] [Updated] (YARN-3697) FairScheduler: ContinuousSchedulingThread can fail to shutdown
[ https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3697: Fix Version/s: 2.6.4 > FairScheduler: ContinuousSchedulingThread can fail to shutdown > -- > > Key: YARN-3697 > URL: https://issues.apache.org/jira/browse/YARN-3697 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Fix For: 2.7.2, 2.6.4 > > Attachments: YARN-3697.000.patch, YARN-3697.001.patch > > > FairScheduler: ContinuousSchedulingThread can't be shutdown after stop > sometimes. > The reason is because the InterruptedException is blocked in > continuousSchedulingAttempt > {code} > try { > if (node != null && Resources.fitsIn(minimumAllocation, > node.getAvailableResource())) { > attemptScheduling(node); > } > } catch (Throwable ex) { > LOG.error("Error while attempting scheduling for node " + node + > ": " + ex.toString(), ex); > } > {code} > I saw the following exception after stop: > {code} > 2015-05-17 23:30:43,065 WARN [FairSchedulerContinuousScheduling] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462) > at > 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285) > 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] > fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - > Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 > available= 
used=: > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContai
[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082601#comment-15082601 ] zhihai xu commented on YARN-3446: - thanks for the review! Just updated the patch at YARN-3446.004.patch. > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3446: Attachment: YARN-3446.004.patch > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch, YARN-3446.004.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because the headroom used in the reducer preemption > calculation includes blacklisted nodes. This makes jobs hang > forever (the ResourceManager does not assign any new containers on > blacklisted nodes, but the availableResource the AM gets from the RM includes the > available resources of the blacklisted nodes). > This issue is similar to YARN-1680, which covers the Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4440) FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time
[ https://issues.apache.org/jira/browse/YARN-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065066#comment-15065066 ] zhihai xu commented on YARN-4440: - Yes, thanks [~leftnoteasy] for committing it to branch-2.8! > FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time > - > > Key: YARN-4440 > URL: https://issues.apache.org/jira/browse/YARN-4440 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Lin Yiqun >Assignee: Lin Yiqun > Fix For: 2.8.0 > > Attachments: YARN-4440.001.patch, YARN-4440.002.patch, > YARN-4440.003.patch > > > It seems there is a bug in the {{FSAppAttempt#getAllowedLocalityLevelByTime}} > method: > {code} > // default level is NODE_LOCAL > if (! allowedLocalityLevel.containsKey(priority)) { > allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL); > return NodeType.NODE_LOCAL; > } > {code} > On the first invocation, the method does not record a time in > lastScheduledContainer, which causes the following code to run on the next > invocation: > {code} > // check waiting time > long waitTime = currentTimeMs; > if (lastScheduledContainer.containsKey(priority)) { > waitTime -= lastScheduledContainer.get(priority); > } else { > waitTime -= getStartTime(); > } > {code} > Here waitTime is computed from the FsApp start time, which is easily larger > than the delay time, so allowedLocality degrades, because the FsApp start time > is much earlier than currentTimeMs. We should record an initial time for the > priority to avoid comparing against the FsApp start time and degrading > allowedLocalityLevel. This problem has an even bigger negative impact on small jobs. > YARN-4399 also discusses some locality-related problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
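The timing bug quoted in the YARN-4440 description can be modeled in a few lines. The class below is an illustrative sketch, not the actual FSAppAttempt source: the map names mirror the real fields, but the structure, the millisecond arguments, and the single RACK_LOCAL degradation step are simplifying assumptions. The fix is the line that records the current time on the first call:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of FSAppAttempt#getAllowedLocalityLevelByTime.
class LocalityWaitSketch {
    enum NodeType { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    private final Map<Integer, Long> lastScheduledContainer = new HashMap<>();
    private final Map<Integer, NodeType> allowedLocalityLevel = new HashMap<>();
    private final long startTime;  // application start time, ms

    LocalityWaitSketch(long startTime) { this.startTime = startTime; }

    NodeType getAllowedLocalityLevelByTime(int priority,
                                           long nodeLocalityDelayMs,
                                           long currentTimeMs) {
        if (!allowedLocalityLevel.containsKey(priority)) {
            // The fix discussed in this issue: record the current time on the
            // first call so the next call measures the wait from here, not
            // from the (much earlier) application start time.
            lastScheduledContainer.put(priority, currentTimeMs);
            allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL);
            return NodeType.NODE_LOCAL;
        }
        long waitTime = currentTimeMs;
        if (lastScheduledContainer.containsKey(priority)) {
            waitTime -= lastScheduledContainer.get(priority);
        } else {
            // Pre-fix fallback: waitTime is inflated by the app's whole age.
            waitTime -= startTime;
        }
        if (waitTime > nodeLocalityDelayMs) {
            allowedLocalityLevel.put(priority, NodeType.RACK_LOCAL);
        }
        return allowedLocalityLevel.get(priority);
    }
}
```

With the initial time recorded, a call shortly after the first one stays NODE_LOCAL; only a genuinely long wait degrades locality.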
[jira] [Commented] (YARN-4439) Clarify NMContainerStatus#toString method.
[ https://issues.apache.org/jira/browse/YARN-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058460#comment-15058460 ] zhihai xu commented on YARN-4439: - Good catch [~jlowe]! Will clean it up, thanks! > Clarify NMContainerStatus#toString method. > -- > > Key: YARN-4439 > URL: https://issues.apache.org/jira/browse/YARN-4439 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Fix For: 2.7.3 > > Attachments: YARN-4439.1.patch, YARN-4439.2.patch, > YARN-4439.appendum-2.7.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4439) Clarify NMContainerStatus#toString method.
[ https://issues.apache.org/jira/browse/YARN-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058263#comment-15058263 ] zhihai xu commented on YARN-4439: - Hi [~jianhe], Could you revert the old patch and create a new patch for branch-2.7 to fix the compilation error? > Clarify NMContainerStatus#toString method. > -- > > Key: YARN-4439 > URL: https://issues.apache.org/jira/browse/YARN-4439 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Fix For: 2.7.3 > > Attachments: YARN-4439.1.patch, YARN-4439.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3857: Affects Version/s: 2.6.2 > Memory leak in ResourceManager with SIMPLE mode > --- > > Key: YARN-3857 > URL: https://issues.apache.org/jira/browse/YARN-3857 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0, 2.6.2 >Reporter: mujunchao >Assignee: mujunchao >Priority: Critical > Labels: patch > Fix For: 2.7.2, 2.6.4 > > Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, > YARN-3857-4.patch, hadoop-yarn-server-resourcemanager.patch > > > We register the ClientTokenMasterKey to prevent a client from holding an invalid > ClientToken after the RM restarts. In SIMPLE mode we register the > Pair , but we never remove it from the HashMap, because unregistering > only runs in secure mode, so memory leaks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
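The leak pattern described in YARN-3857 is generic: registration is unconditional, but cleanup is guarded by a security check that is false in SIMPLE mode. The sketch below is hypothetical (the field and method names are stand-ins, not the RM's actual master-key registry) and shows both the buggy and the fixed cleanup:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative model of the ClientTokenMasterKey registry leak.
class ClientTokenRegistrySketch {
    private final Map<String, byte[]> appToKey = new ConcurrentHashMap<>();
    private final boolean securityEnabled;

    ClientTokenRegistrySketch(boolean securityEnabled) {
        this.securityEnabled = securityEnabled;
    }

    void registerApp(String appId, byte[] masterKey) {
        // Registration happens unconditionally, even in SIMPLE mode.
        appToKey.put(appId, masterKey);
    }

    void finishAppBuggy(String appId) {
        // Pre-fix behavior: cleanup only runs in secure mode, so in SIMPLE
        // mode entries accumulate for every finished app -> memory leak.
        if (securityEnabled) {
            appToKey.remove(appId);
        }
    }

    void finishAppFixed(String appId) {
        // Post-fix behavior: always remove the entry when the app finishes.
        appToKey.remove(appId);
    }

    int size() { return appToKey.size(); }
}
```

The general rule: whatever condition guards insertion into a long-lived map must be no stricter than the condition guarding removal, or entries will never be reclaimed.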
[jira] [Commented] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058256#comment-15058256 ] zhihai xu commented on YARN-4458: - Thanks [~jlowe]! Yes, it makes sense; it will make cherry-picking easier. > Compilation error at branch-2.7 due to getNodeLabelExpression not defined in > NMContainerStatusPBImpl. > - > > Key: YARN-4458 > URL: https://issues.apache.org/jira/browse/YARN-4458 > Project: Hadoop YARN > Issue Type: Bug >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4458.branch-2.7.patch > > > Compilation error at branch-2.7 due to getNodeLabelExpression not defined in > NMContainerStatusPBImpl. This issue only happens for branch-2.7. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4458: Release Note: (was: Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.) > Compilation error at branch-2.7 due to getNodeLabelExpression not defined in > NMContainerStatusPBImpl. > - > > Key: YARN-4458 > URL: https://issues.apache.org/jira/browse/YARN-4458 > Project: Hadoop YARN > Issue Type: Bug >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4458.branch-2.7.patch > > > Compilation error at branch-2.7 due to getNodeLabelExpression not defined in > NMContainerStatusPBImpl. This issue only happens for branch-2.7. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4458: Description: Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl. This issue only happens for branch-2.7. (was: Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.) > Compilation error at branch-2.7 due to getNodeLabelExpression not defined in > NMContainerStatusPBImpl. > - > > Key: YARN-4458 > URL: https://issues.apache.org/jira/browse/YARN-4458 > Project: Hadoop YARN > Issue Type: Bug >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4458.branch-2.7.patch > > > Compilation error at branch-2.7 due to getNodeLabelExpression not defined in > NMContainerStatusPBImpl. This issue only happens for branch-2.7. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4458: Description: Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl. > Compilation error at branch-2.7 due to getNodeLabelExpression not defined in > NMContainerStatusPBImpl. > - > > Key: YARN-4458 > URL: https://issues.apache.org/jira/browse/YARN-4458 > Project: Hadoop YARN > Issue Type: Bug >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4458.branch-2.7.patch > > > Compilation error at branch-2.7 due to getNodeLabelExpression not defined in > NMContainerStatusPBImpl. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4458: Attachment: YARN-4458.branch-2.7.patch > Compilation error at branch-2.7 due to getNodeLabelExpression not defined in > NMContainerStatusPBImpl. > - > > Key: YARN-4458 > URL: https://issues.apache.org/jira/browse/YARN-4458 > Project: Hadoop YARN > Issue Type: Bug >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4458.branch-2.7.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4458: Attachment: (was: YARN-4458.000.patch) > Compilation error at branch-2.7 due to getNodeLabelExpression not defined in > NMContainerStatusPBImpl. > - > > Key: YARN-4458 > URL: https://issues.apache.org/jira/browse/YARN-4458 > Project: Hadoop YARN > Issue Type: Bug >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4458.branch-2.7.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4458: Attachment: YARN-4458.000.patch > Compilation error at branch-2.7 due to getNodeLabelExpression not defined in > NMContainerStatusPBImpl. > - > > Key: YARN-4458 > URL: https://issues.apache.org/jira/browse/YARN-4458 > Project: Hadoop YARN > Issue Type: Bug >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-4458.000.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
[ https://issues.apache.org/jira/browse/YARN-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4458: Release Note: Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl. (was: Compilation error at branch-2.7 due to {{getNodeLabelExpression}} not defined in NMContainerStatusPBImpl.) > Compilation error at branch-2.7 due to getNodeLabelExpression not defined in > NMContainerStatusPBImpl. > - > > Key: YARN-4458 > URL: https://issues.apache.org/jira/browse/YARN-4458 > Project: Hadoop YARN > Issue Type: Bug >Reporter: zhihai xu >Assignee: zhihai xu > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
zhihai xu created YARN-4458: --- Summary: Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl. Key: YARN-4458 URL: https://issues.apache.org/jira/browse/YARN-4458 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4440) FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time
[ https://issues.apache.org/jira/browse/YARN-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057613#comment-15057613 ] zhihai xu commented on YARN-4440: - Committed to trunk and branch-2. Thanks [~linyiqun] for the contribution! > FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time > - > > Key: YARN-4440 > URL: https://issues.apache.org/jira/browse/YARN-4440 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Lin Yiqun >Assignee: Lin Yiqun > Attachments: YARN-4440.001.patch, YARN-4440.002.patch, > YARN-4440.003.patch > > > It seems there is a bug in the {{FSAppAttempt#getAllowedLocalityLevelByTime}} > method: > {code} > // default level is NODE_LOCAL > if (! allowedLocalityLevel.containsKey(priority)) { > allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL); > return NodeType.NODE_LOCAL; > } > {code} > On the first invocation, the method does not record a time in > lastScheduledContainer, which causes the following code to run on the next > invocation: > {code} > // check waiting time > long waitTime = currentTimeMs; > if (lastScheduledContainer.containsKey(priority)) { > waitTime -= lastScheduledContainer.get(priority); > } else { > waitTime -= getStartTime(); > } > {code} > Here waitTime is computed from the FsApp start time, which is easily larger > than the delay time, so allowedLocality degrades, because the FsApp start time > is much earlier than currentTimeMs. We should record an initial time for the > priority to avoid comparing against the FsApp start time and degrading > allowedLocalityLevel. This problem has an even bigger negative impact on small jobs. > YARN-4399 also discusses some locality-related problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057608#comment-15057608 ] zhihai xu commented on YARN-3535: - You are welcome! I think this will be a very critical fix for 2.6.4 release. > Scheduler must re-request container resources when RMContainer transitions > from ALLOCATED to KILLED > --- > > Key: YARN-3535 > URL: https://issues.apache.org/jira/browse/YARN-3535 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, fairscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Peng Zhang >Priority: Critical > Fix For: 2.7.2, 2.6.4 > > Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, > 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, > YARN-3535-002.patch, syslog.tgz, yarn-app.log > > > During rolling update of NM, AM start of container on NM failed. > And then job hang there. > Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057536#comment-15057536 ] zhihai xu commented on YARN-3535: - Yes, this issue exists in 2.6.x; I just committed this patch to branch-2.6. > Scheduler must re-request container resources when RMContainer transitions > from ALLOCATED to KILLED > --- > > Key: YARN-3535 > URL: https://issues.apache.org/jira/browse/YARN-3535 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, fairscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Peng Zhang >Priority: Critical > Fix For: 2.7.2, 2.6.4 > > Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, > 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, > YARN-3535-002.patch, syslog.tgz, yarn-app.log > > > During a rolling update of the NM, the AM failed to start a container on the NM, > and the job then hung. > AM logs attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3535) Scheduler must re-request container resources when RMContainer transitions from ALLOCATED to KILLED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3535: Fix Version/s: 2.6.4 > Scheduler must re-request container resources when RMContainer transitions > from ALLOCATED to KILLED > --- > > Key: YARN-3535 > URL: https://issues.apache.org/jira/browse/YARN-3535 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, fairscheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Peng Zhang >Priority: Critical > Fix For: 2.7.2, 2.6.4 > > Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, > 0005-YARN-3535.patch, 0006-YARN-3535.patch, YARN-3535-001.patch, > YARN-3535-002.patch, syslog.tgz, yarn-app.log > > > During rolling update of NM, AM start of container on NM failed. > And then job hang there. > Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15057421#comment-15057421 ] zhihai xu commented on YARN-3857: - Yes, this issue exists in 2.6.x; I just committed this patch to branch-2.6. > Memory leak in ResourceManager with SIMPLE mode > --- > > Key: YARN-3857 > URL: https://issues.apache.org/jira/browse/YARN-3857 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: mujunchao >Assignee: mujunchao >Priority: Critical > Labels: patch > Fix For: 2.7.2, 2.6.4 > > Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, > YARN-3857-4.patch, hadoop-yarn-server-resourcemanager.patch > > > We register the ClientTokenMasterKey to prevent a client from holding an invalid > ClientToken after the RM restarts. In SIMPLE mode we register the > Pair , but we never remove it from the HashMap, because unregistering > only runs in secure mode, so memory leaks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3857: Fix Version/s: 2.6.4 > Memory leak in ResourceManager with SIMPLE mode > --- > > Key: YARN-3857 > URL: https://issues.apache.org/jira/browse/YARN-3857 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: mujunchao >Assignee: mujunchao >Priority: Critical > Labels: patch > Fix For: 2.7.2, 2.6.4 > > Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, > YARN-3857-4.patch, hadoop-yarn-server-resourcemanager.patch > > > We register the ClientTokenMasterKey to avoid client may hold an invalid > ClientToken after RM restarts. In SIMPLE mode, we register > Pair , But we never remove it from HashMap, as > unregister only runing while in Security mode, so memory leak coming. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4440) FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time
[ https://issues.apache.org/jira/browse/YARN-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056895#comment-15056895 ] zhihai xu commented on YARN-4440: - Good catch! Thanks for working on this issue [~linyiqun]! +1 for the latest patch. The test failures are unrelated to the patch; they were already reported in YARN-4318 and YARN-4306. Will commit it tomorrow if no one objects. > FSAppAttempt#getAllowedLocalityLevelByTime should init the lastScheduler time > - > > Key: YARN-4440 > URL: https://issues.apache.org/jira/browse/YARN-4440 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Lin Yiqun >Assignee: Lin Yiqun > Attachments: YARN-4440.001.patch, YARN-4440.002.patch, > YARN-4440.003.patch > > > It seems there is a bug in the {{FSAppAttempt#getAllowedLocalityLevelByTime}} > method: > {code} > // default level is NODE_LOCAL > if (! allowedLocalityLevel.containsKey(priority)) { > allowedLocalityLevel.put(priority, NodeType.NODE_LOCAL); > return NodeType.NODE_LOCAL; > } > {code} > On the first invocation, the method does not record a time in > lastScheduledContainer, which causes the following code to run on the next > invocation: > {code} > // check waiting time > long waitTime = currentTimeMs; > if (lastScheduledContainer.containsKey(priority)) { > waitTime -= lastScheduledContainer.get(priority); > } else { > waitTime -= getStartTime(); > } > {code} > Here waitTime is computed from the FsApp start time, which is easily larger > than the delay time, so allowedLocality degrades, because the FsApp start time > is much earlier than currentTimeMs. We should record an initial time for the > priority to avoid comparing against the FsApp start time and degrading > allowedLocalityLevel. This problem has an even bigger negative impact on small jobs. > YARN-4399 also discusses some locality-related problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056745#comment-15056745 ] zhihai xu commented on YARN-4209: - This issue won't affect 2.6.x branch, since RMStateStoreState.FENCED state is only added at 2.7.x branch. > RMStateStore FENCED state doesn’t work due to updateFencedState called by > stateMachine.doTransition > --- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Fix For: 2.7.2 > > Attachments: YARN-4209.000.patch, YARN-4209.001.patch, > YARN-4209.002.patch, YARN-4209.branch-2.7.patch > > > RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by > {{stateMachine.doTransition}}. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even after {{notifyStoreOperationFailed}} is called. The only working > case for FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. > For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
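The re-entrancy problem described in YARN-4209 can be reduced to a few lines. The class below is a hypothetical model, not the actual RMStateStore (which uses a generated state machine): it only shows how the outer transition's exit overwrites the FENCED state that a nested call set, so the store ends up ACTIVE even after a failure:

```java
// Minimal sketch of the nested-transition hazard: the outer operation
// computes its target state, a nested call flips the state to FENCED, and
// the outer call clobbers FENCED on the way out. All names are illustrative.
class ReentrantStateMachineSketch {
    enum State { ACTIVE, FENCED }

    State state = State.ACTIVE;

    // Outer public operation, e.g. removing a delegation token.
    void removeToken(boolean storeFails) {
        State next = State.ACTIVE;  // outer transition's target state
        if (storeFails) {
            // Nested "doTransition" triggered by notifyStoreOperationFailed.
            fence();
        }
        // Outer transition exits and overwrites whatever the nested call set.
        state = next;
    }

    private void fence() {
        state = State.FENCED;
    }
}
```

This is why, as the comment explains, the only path where FENCED sticks is one that is not nested inside another transition, such as the ZKRMStateStore VerifyActiveStatusThread.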
[jira] [Commented] (YARN-4344) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations
[ https://issues.apache.org/jira/browse/YARN-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003181#comment-15003181 ] zhihai xu commented on YARN-4344: - +1 for Jason Lowe's suggestion to fix the issue at scheduler side. Using {{SchedulerNode.getTotalResource()}} instead of {{RMNode.getTotalCapability()}} inside Scheduler can better decouple Scheduler from RMNodeImpl state machine. It may also fix some other potential issues. For example, {{CapacityScheduler#addNode}} uses {{nodeManager.getTotalCapability()}} after creating {{FiCaSchedulerNode}}, if {{nodeManager.totalCapability}} is changed by RMNodeImpl state machine right after {{FiCaSchedulerNode}} was created, similar issue may happen. > NMs reconnecting with changed capabilities can lead to wrong cluster resource > calculations > -- > > Key: YARN-4344 > URL: https://issues.apache.org/jira/browse/YARN-4344 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1, 2.6.2 >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Critical > Attachments: YARN-4344.001.patch > > > After YARN-3802, if an NM re-connects to the RM with changed capabilities, > there can arise situations where the overall cluster resource calculation for > the cluster will be incorrect leading to inconsistencies in scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4344) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations
[ https://issues.apache.org/jira/browse/YARN-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001800#comment-15001800 ] zhihai xu commented on YARN-4344: - Thanks for reporting this issue [~vvasudev]! Thanks for the review [~Jason Lowe]! [~rohithsharma] tried to clean up the code at YARN-3286. Based on the following comment from [~jianhe] at YARN-3286, {code} I think this has changed the behavior that without any RM/NM restart features enabled, earlier restarting a node will trigger RM to kill all the containers on this node, but now it won't ? {code} The patch may cause compatibility issue. Maybe we can merge the case {{rmNode.getHttpPort() == newNode.getHttpPort()}} with {{rmNode.getHttpPort() != newNode.getHttpPort()}} for noRunningApps. Thoughts? > NMs reconnecting with changed capabilities can lead to wrong cluster resource > calculations > -- > > Key: YARN-4344 > URL: https://issues.apache.org/jira/browse/YARN-4344 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1, 2.6.2 >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Critical > Attachments: YARN-4344.001.patch > > > After YARN-3802, if an NM re-connects to the RM with changed capabilities, > there can arise situations where the overall cluster resource calculation for > the cluster will be incorrect leading to inconsistencies in scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4256) YARN fair scheduler vcores with decimal values
[ https://issues.apache.org/jira/browse/YARN-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4256: Target Version/s: 2.8.0 (was: 2.7.2) > YARN fair scheduler vcores with decimal values > -- > > Key: YARN-4256 > URL: https://issues.apache.org/jira/browse/YARN-4256 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Prabhu Joseph >Assignee: Jun Gong >Priority: Minor > Fix For: 2.8.0 > > Attachments: YARN-4256.001.patch, YARN-4256.002.patch > > > When the queue with vcores is in decimal value, the value after the decimal > point is taken as vcores by FairScheduler. > For the below queue, > 2 mb,20 vcores,20.25 disks > 3 mb,40.2 vcores,30.25 disks > When many applications submitted parallely into queue, all were in PENDING > state as the vcores is taken as 2 skipping the value 40. > The code FairSchedulerConfiguration.java to Pattern match the vcores has to > be improved in such a way either throw > AllocationConfigurationException("Missing resource") or consider the value > before decimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4256) YARN fair scheduler vcores with decimal values
[ https://issues.apache.org/jira/browse/YARN-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4256: Hadoop Flags: Reviewed > YARN fair scheduler vcores with decimal values > -- > > Key: YARN-4256 > URL: https://issues.apache.org/jira/browse/YARN-4256 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Prabhu Joseph >Assignee: Jun Gong >Priority: Minor > Fix For: 2.7.2 > > Attachments: YARN-4256.001.patch, YARN-4256.002.patch > > > When the queue with vcores is in decimal value, the value after the decimal > point is taken as vcores by FairScheduler. > For the below queue, > 2 mb,20 vcores,20.25 disks > 3 mb,40.2 vcores,30.25 disks > When many applications submitted parallely into queue, all were in PENDING > state as the vcores is taken as 2 skipping the value 40. > The code FairSchedulerConfiguration.java to Pattern match the vcores has to > be improved in such a way either throw > AllocationConfigurationException("Missing resource") or consider the value > before decimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4256) YARN fair scheduler vcores with decimal values
[ https://issues.apache.org/jira/browse/YARN-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969748#comment-14969748 ] zhihai xu commented on YARN-4256: - Committed to trunk and branch-2. Thanks [~Prabhu Joseph] for reporting this issue, [~hex108] for the patch, and [~brahmareddy] for the additional review! > YARN fair scheduler vcores with decimal values > -- > > Key: YARN-4256 > URL: https://issues.apache.org/jira/browse/YARN-4256 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Prabhu Joseph >Assignee: Jun Gong >Priority: Minor > Fix For: 2.7.2 > > Attachments: YARN-4256.001.patch, YARN-4256.002.patch > > > When a queue's vcores value contains a decimal point, FairScheduler takes the > value after the decimal point as the vcores. > For the queues below, > 2 mb,20 vcores,20.25 disks > 3 mb,40.2 vcores,30.25 disks > when many applications were submitted to the queue in parallel, all stayed in the > PENDING state because the vcores value was taken as 2, skipping the 40. > The vcores pattern matching in FairSchedulerConfiguration.java should be > improved to either throw > AllocationConfigurationException("Missing resource") or use the value > before the decimal point. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4256) YARN fair scheduler vcores with decimal values
[ https://issues.apache.org/jira/browse/YARN-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967355#comment-14967355 ] zhihai xu commented on YARN-4256: - +1, LGTM. Will commit tomorrow if no one objects. > YARN fair scheduler vcores with decimal values > -- > > Key: YARN-4256 > URL: https://issues.apache.org/jira/browse/YARN-4256 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Prabhu Joseph >Assignee: Jun Gong >Priority: Minor > Fix For: 2.7.2 > > Attachments: YARN-4256.001.patch, YARN-4256.002.patch > > > When a queue's vcores value contains a decimal point, FairScheduler takes the > value after the decimal point as the vcores. > For the queues below, > 2 mb,20 vcores,20.25 disks > 3 mb,40.2 vcores,30.25 disks > when many applications were submitted to the queue in parallel, all stayed in the > PENDING state because the vcores value was taken as 2, skipping the 40. > The vcores pattern matching in FairSchedulerConfiguration.java should be > improved to either throw > AllocationConfigurationException("Missing resource") or use the value > before the decimal point. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4256) YARN fair scheduler vcores with decimal values
[ https://issues.apache.org/jira/browse/YARN-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965488#comment-14965488 ] zhihai xu commented on YARN-4256: - Thanks for reporting this issue [~Prabhu Joseph]! Thanks for the patch [~hex108]! The patch looks mostly good. Can we change '+' to '*', i.e. (\\.\\d+)? => (\\.\\d*)?, so we can relax the condition to support "1024. mb"? > YARN fair scheduler vcores with decimal values > -- > > Key: YARN-4256 > URL: https://issues.apache.org/jira/browse/YARN-4256 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: Prabhu Joseph >Assignee: Jun Gong >Priority: Minor > Fix For: 2.7.2 > > Attachments: YARN-4256.001.patch > > > When the queue with vcores is in decimal value, the value after the decimal > point is taken as vcores by FairScheduler. > For the below queue, > 2 mb,20 vcores,20.25 disks > 3 mb,40.2 vcores,30.25 disks > When many applications submitted parallely into queue, all were in PENDING > state as the vcores is taken as 2 skipping the value 40. > The code FairSchedulerConfiguration.java to Pattern match the vcores has to > be improved in such a way either throw > AllocationConfigurationException("Missing resource") or consider the value > before decimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
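The effect of the proposed '+' to '*' change can be illustrated with a small sketch. The patterns below are a hypothetical simplification for demonstration only, not the actual expression in FairSchedulerConfiguration.java; the point is only the quantifier in the optional fractional-part group:

```java
import java.util.regex.Pattern;

public class ResourcePatternDemo {
    // Hypothetical simplification: an integer, an optional fractional part,
    // optional whitespace, then a unit. With '+' the fractional group needs
    // at least one digit after the dot; with '*' a bare trailing dot is fine.
    static final Pattern STRICT = Pattern.compile("(\\d+)(\\.\\d+)?\\s*mb");
    static final Pattern RELAXED = Pattern.compile("(\\d+)(\\.\\d*)?\\s*mb");

    public static void main(String[] args) {
        // "1024. mb" has a dot with no digits after it.
        System.out.println(STRICT.matcher("1024. mb").matches());   // false
        System.out.println(RELAXED.matcher("1024. mb").matches());  // true
        System.out.println(RELAXED.matcher("1024.5 mb").matches()); // true
    }
}
```

With the strict pattern, the optional group cannot consume the lone dot, so the match fails on the dot before the unit; the relaxed pattern accepts it.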
[jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node
[ https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963885#comment-14963885 ] zhihai xu commented on YARN-4227: - So far I haven't found any code path other than YARN-3675 that could cause this issue. bq. there are multiple allocations happening in between the removal of the node and the allocation. These allocations happen on other nodes, so this won't help. We need to find the log which shows the node removal and the allocations on the same node. To confirm this issue, we need to find the logs which show which node container_1436927988321_1307950_01_12 was allocated on and when that node was removed. > FairScheduler: RM quits processing expired container from a removed node > > > Key: YARN-4227 > URL: https://issues.apache.org/jira/browse/YARN-4227 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.3.0, 2.5.0, 2.7.1 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Critical > Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, > YARN-4227.patch > > > Under some circumstances the node is removed before an expired container > event is processed causing the RM to exit: > {code} > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: > Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_1436927988321_1307950_01_12 Container Transitioned from > ACQUIRED to EXPIRED > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: > Completed container: container_1436927988321_1307950_01_12 in state: > EXPIRED event:EXPIRE > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op >OPERATION=AM Released
Container TARGET=SchedulerApp RESULT=SUCCESS > APPID=application_1436927988321_1307950 > CONTAINERID=container_1436927988321_1307950_01_12 > 2015-10-04 21:14:01,063 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type CONTAINER_EXPIRED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585) > at java.lang.Thread.run(Thread.java:745) > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {code} > The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 > and 2.6.0 by different customers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node
[ https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958402#comment-14958402 ] zhihai xu commented on YARN-4227: - Is it possible the root cause of this issue is YARN-3675? I think YARN-3675 may cause this issue. If we can get the complete logs for container_1436927988321_1307950_01_12, we may confirm it. Once the node is removed, all the containers allocated on the node are supposed to be killed. The race condition at YARN-3675 may cause a container allocated on a just removed node. > FairScheduler: RM quits processing expired container from a removed node > > > Key: YARN-4227 > URL: https://issues.apache.org/jira/browse/YARN-4227 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.3.0, 2.5.0, 2.7.1 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Critical > Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, > YARN-4227.patch > > > Under some circumstances the node is removed before an expired container > event is processed causing the RM to exit: > {code} > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: > Expired:container_1436927988321_1307950_01_12 Timed out after 600 secs > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_1436927988321_1307950_01_12 Container Transitioned from > ACQUIRED to EXPIRED > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: > Completed container: container_1436927988321_1307950_01_12 in state: > EXPIRED event:EXPIRE > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op >OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS > APPID=application_1436927988321_1307950 > CONTAINERID=container_1436927988321_1307950_01_12 > 2015-10-04 
21:14:01,063 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type CONTAINER_EXPIRED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585) > at java.lang.Thread.run(Thread.java:745) > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {code} > The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 > and 2.6.0 by different customers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4201) AMBlacklist does not work for minicluster
[ https://issues.apache.org/jira/browse/YARN-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952738#comment-14952738 ] zhihai xu commented on YARN-4201: - Committed it to branch-2 and trunk, thanks [~hex108] for the contribution! > AMBlacklist does not work for minicluster > - > > Key: YARN-4201 > URL: https://issues.apache.org/jira/browse/YARN-4201 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Jun Gong >Assignee: Jun Gong > Fix For: 2.8.0 > > Attachments: YARN-4021.001.patch, YARN-4201.002.patch, > YARN-4201.003.patch > > > For minicluster (scheduler.include-port-in-node-name is set to TRUE), > AMBlacklist does not work. It is because RM just puts host to AMBlacklist > whether scheduler.include-port-in-node-name is set or not. In fact RM should > put "host + port" to AMBlacklist when it is set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4201) AMBlacklist does not work for minicluster
[ https://issues.apache.org/jira/browse/YARN-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-4201: Hadoop Flags: Reviewed > AMBlacklist does not work for minicluster > - > > Key: YARN-4201 > URL: https://issues.apache.org/jira/browse/YARN-4201 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-4021.001.patch, YARN-4201.002.patch, > YARN-4201.003.patch > > > For minicluster (scheduler.include-port-in-node-name is set to TRUE), > AMBlacklist does not work. It is because RM just puts host to AMBlacklist > whether scheduler.include-port-in-node-name is set or not. In fact RM should > put "host + port" to AMBlacklist when it is set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3446: Attachment: (was: YARN-3446.003.patch) > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
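The headroom fix described above amounts to subtracting blacklisted nodes' available resources from what is reported to the AM. A minimal sketch of that idea, using assumed names and plain maps rather than the actual FairScheduler data structures:

```java
import java.util.Map;
import java.util.Set;

public class HeadroomDemo {
    // Illustrative headroom calculation: resources available on blacklisted
    // nodes are excluded, since the scheduler will never place this
    // application's containers there, so counting them makes the AM wait
    // for capacity it can never use.
    static long headroomMb(Map<String, Long> availableMbByNode,
                           Set<String> blacklistedNodes) {
        long total = 0;
        for (Map.Entry<String, Long> e : availableMbByNode.entrySet()) {
            if (!blacklistedNodes.contains(e.getKey())) {
                total += e.getValue();
            }
        }
        return total;
    }
}
```

In the hang described above, a reducer-preemption decision based on the unfiltered total never fires because the AM believes enough headroom already exists.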
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3446: Attachment: YARN-3446.003.patch > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4247) Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing events
[ https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952328#comment-14952328 ] zhihai xu commented on YARN-4247: - [~adhoot], thanks for working on this issue. Is this issue fixed by YARN-3361? YARN-3361 removed the {{readLock}} from {{RMAppAttemptImpl#getMasterContainer}}. > Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing > events > - > > Key: YARN-4247 > URL: https://issues.apache.org/jira/browse/YARN-4247 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot >Priority: Blocker > Attachments: YARN-4247.001.patch, YARN-4247.001.patch > > > We see this deadlock in our testing where events do not get processed and we > see this in the logs before the RM dies of OOM {noformat} 2015-10-08 > 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of > event-queue is 1488000 2015-10-08 04:48:01,918 INFO > org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4201) AMBlacklist does not work for minicluster
[ https://issues.apache.org/jira/browse/YARN-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951393#comment-14951393 ] zhihai xu commented on YARN-4201: - +1 for the latest patch. I will wait for one or two days before committing for others to look at the patch. > AMBlacklist does not work for minicluster > - > > Key: YARN-4201 > URL: https://issues.apache.org/jira/browse/YARN-4201 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-4021.001.patch, YARN-4201.002.patch, > YARN-4201.003.patch > > > For minicluster (scheduler.include-port-in-node-name is set to TRUE), > AMBlacklist does not work. It is because RM just puts host to AMBlacklist > whether scheduler.include-port-in-node-name is set or not. In fact RM should > put "host + port" to AMBlacklist when it is set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4201) AMBlacklist does not work for minicluster
[ https://issues.apache.org/jira/browse/YARN-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949991#comment-14949991 ] zhihai xu commented on YARN-4201: - Thanks for the new patch [~hex108]. I think it would be better to check that {{scheduler.getSchedulerNode(nodeId)}} is not null, to avoid an NPE. If {{scheduler.getSchedulerNode(nodeId)}} returns null, the blacklisted node was just removed from the scheduler, and I think it is fine not to add a removed node to the blacklist. > AMBlacklist does not work for minicluster > - > > Key: YARN-4201 > URL: https://issues.apache.org/jira/browse/YARN-4201 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-4021.001.patch, YARN-4201.002.patch > > > For minicluster (scheduler.include-port-in-node-name is set to TRUE), > AMBlacklist does not work. It is because RM just puts host to AMBlacklist > whether scheduler.include-port-in-node-name is set or not. In fact RM should > put "host + port" to AMBlacklist when it is set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
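The suggested null check can be sketched as follows. The interface and method names here are stand-ins for illustration, not the actual YarnScheduler or RMAppAttemptImpl signatures:

```java
import java.util.HashSet;
import java.util.Set;

public class BlacklistDemo {
    /** Hypothetical stand-in for the scheduler lookup discussed above. */
    interface NodeLookup {
        // Returns the scheduler's node name for the id, or null if the
        // node has just been removed from the scheduler.
        String getNodeName(String nodeId);
    }

    static final Set<String> blacklist = new HashSet<>();

    // Only blacklist nodes the scheduler still knows about: a null lookup
    // means the node was just removed, and skipping it both avoids an NPE
    // and keeps removed nodes out of the blacklist.
    static void addToBlacklist(NodeLookup scheduler, String nodeId) {
        String name = scheduler.getNodeName(nodeId);
        if (name != null) {
            blacklist.add(name);
        }
    }
}
```

The key property is that the lookup result is captured once and tested before use, rather than dereferenced directly.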
[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949536#comment-14949536 ] zhihai xu commented on YARN-3943: - Thanks [~jlowe] for the review and committing the patch, greatly appreciated! > Use separate threshold configurations for disk-full detection and > disk-not-full detection. > -- > > Key: YARN-3943 > URL: https://issues.apache.org/jira/browse/YARN-3943 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Fix For: 2.8.0 > > Attachments: YARN-3943.000.patch, YARN-3943.001.patch, > YARN-3943.002.patch > > > Use separate threshold configurations to check when disks become full and > when disks become good. Currently the configuration > "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" > and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are > used to check both when disks become full and when disks become good. It will > be better to use two configurations: one is used when disks become full from > not-full and the other one is used when disks become not-full from full. So > we can avoid oscillating frequently. > For example: we can set the one for disk-full detection higher than the one > for disk-not-full detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
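The two-threshold scheme described above is classic hysteresis: a higher threshold to declare a disk full and a lower one to declare it good again, so a disk hovering near a single cutoff does not flip state on every health check. A minimal sketch of the idea, with illustrative threshold values rather than the actual yarn.nodemanager.disk-health-checker configuration wiring:

```java
public class DiskFullHysteresis {
    // Illustrative example: mark full above 90% utilization, mark good
    // again only once utilization drops below 80%.
    private final double fullThreshold;
    private final double notFullThreshold;
    private boolean full = false;

    public DiskFullHysteresis(double fullThreshold, double notFullThreshold) {
        this.fullThreshold = fullThreshold;
        this.notFullThreshold = notFullThreshold;
    }

    /** Returns true if the disk is considered full after this sample. */
    public boolean update(double usedPercentage) {
        if (!full && usedPercentage > fullThreshold) {
            full = true;                 // crossed the disk-full threshold
        } else if (full && usedPercentage < notFullThreshold) {
            full = false;                // dropped below the disk-not-full threshold
        }
        return full;
    }
}
```

With a single threshold, utilization oscillating around it would repeatedly mark the directory bad and good; the gap between the two thresholds absorbs that noise.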
[jira] [Commented] (YARN-4201) AMBlacklist does not work for minicluster
[ https://issues.apache.org/jira/browse/YARN-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948931#comment-14948931 ] zhihai xu commented on YARN-4201: - Currently {{getSchedulerNode}} is defined at {{AbstractYarnScheduler}}. {{SchedulerAppUtils.isBlacklisted}} uses {{node.getNodeName()}} to check blacklisted node. So it will be good to use the same way to get blacklisted node name. All the configuration and format related to node name will be only in SchedulerNode.java. > AMBlacklist does not work for minicluster > - > > Key: YARN-4201 > URL: https://issues.apache.org/jira/browse/YARN-4201 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-4021.001.patch > > > For minicluster (scheduler.include-port-in-node-name is set to TRUE), > AMBlacklist does not work. It is because RM just puts host to AMBlacklist > whether scheduler.include-port-in-node-name is set or not. In fact RM should > put "host + port" to AMBlacklist when it is set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4201) AMBlacklist does not work for minicluster
[ https://issues.apache.org/jira/browse/YARN-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948914#comment-14948914 ] zhihai xu commented on YARN-4201: - Thanks for the patch [~hex108]! It is a good catch. Should we use {{SchedulerNode#getNodeName}} to get the blacklisted node name? We can add {{getSchedulerNode}} to {{YarnScheduler}}, so we can call {{getSchedulerNode}} to look up the SchedulerNode by NodeId in {{RMAppAttemptImpl}}. > AMBlacklist does not work for minicluster > - > > Key: YARN-4201 > URL: https://issues.apache.org/jira/browse/YARN-4201 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-4021.001.patch > > > For minicluster (scheduler.include-port-in-node-name is set to TRUE), > AMBlacklist does not work. It is because RM just puts host to AMBlacklist > whether scheduler.include-port-in-node-name is set or not. In fact RM should > put "host + port" to AMBlacklist when it is set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948870#comment-14948870 ] zhihai xu commented on YARN-3943: - The checkstyle issues and release audit warnings for the new patch YARN-3943.002.patch were pre-existing. > Use separate threshold configurations for disk-full detection and > disk-not-full detection. > -- > > Key: YARN-3943 > URL: https://issues.apache.org/jira/browse/YARN-3943 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3943.000.patch, YARN-3943.001.patch, > YARN-3943.002.patch > > > Use separate threshold configurations to check when disks become full and > when disks become good. Currently the configuration > "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" > and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are > used to check both when disks become full and when disks become good. It will > be better to use two configurations: one is used when disks become full from > not-full and the other one is used when disks become not-full from full. So > we can avoid oscillating frequently. > For example: we can set the one for disk-full detection higher than the one > for disk-not-full detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947996#comment-14947996 ] zhihai xu commented on YARN-3943: - Thanks [~jlowe]! Yes, the comments are great. Nice catch on the backwards-compatibility problem! I uploaded a new patch, YARN-3943.002.patch, which addresses all your comments. Please review it. > Use separate threshold configurations for disk-full detection and > disk-not-full detection. > -- > > Key: YARN-3943 > URL: https://issues.apache.org/jira/browse/YARN-3943 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3943.000.patch, YARN-3943.001.patch, > YARN-3943.002.patch > > > Use separate threshold configurations to check when disks become full and > when disks become good. Currently the configuration > "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" > and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are > used to check both when disks become full and when disks become good. It will > be better to use two configurations: one is used when disks become full from > not-full and the other one is used when disks become not-full from full. So > we can avoid oscillating frequently. > For example: we can set the one for disk-full detection higher than the one > for disk-not-full detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3943: Attachment: YARN-3943.002.patch > Use separate threshold configurations for disk-full detection and > disk-not-full detection. > -- > > Key: YARN-3943 > URL: https://issues.apache.org/jira/browse/YARN-3943 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3943.000.patch, YARN-3943.001.patch, > YARN-3943.002.patch > > > Use separate threshold configurations to check when disks become full and > when disks become good. Currently the configuration > "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" > and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are > used to check both when disks become full and when disks become good. It will > be better to use two configurations: one is used when disks become full from > not-full and the other one is used when disks become not-full from full. So > we can avoid oscillating frequently. > For example: we can set the one for disk-full detection higher than the one > for disk-not-full detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947014#comment-14947014 ] zhihai xu commented on YARN-3943: - Hi [~jlowe], could you help review the patch? Thanks! > Use separate threshold configurations for disk-full detection and > disk-not-full detection. > -- > > Key: YARN-3943 > URL: https://issues.apache.org/jira/browse/YARN-3943 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3943.000.patch, YARN-3943.001.patch > > > Use separate threshold configurations to check when disks become full and > when disks become good. Currently the configuration > "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" > and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are > used to check both when disks become full and when disks become good. It will > be better to use two configurations: one is used when disks become full from > not-full and the other one is used when disks become not-full from full. So > we can avoid oscillating frequently. > For example: we can set the one for disk-full detection higher than the one > for disk-not-full detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946210#comment-14946210 ] zhihai xu commented on YARN-3943: - The checkstyle issues and release audit warnings were pre-existing. > Use separate threshold configurations for disk-full detection and > disk-not-full detection. > -- > > Key: YARN-3943 > URL: https://issues.apache.org/jira/browse/YARN-3943 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3943.000.patch, YARN-3943.001.patch > > > Use separate threshold configurations to check when disks become full and > when disks become good. Currently the configuration > "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" > and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are > used to check both when disks become full and when disks become good. It will > be better to use two configurations: one is used when disks become full from > not-full and the other one is used when disks become not-full from full. So > we can avoid oscillating frequently. > For example: we can set the one for disk-full detection higher than the one > for disk-not-full detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946207#comment-14946207 ] zhihai xu commented on YARN-4209: - Thanks [~rohithsharma] for reviewing and committing the patch! > RMStateStore FENCED state doesn’t work due to updateFencedState called by > stateMachine.doTransition > --- > > Key: YARN-4209 > URL: https://issues.apache.org/jira/browse/YARN-4209 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Fix For: 2.7.2 > > Attachments: YARN-4209.000.patch, YARN-4209.001.patch, > YARN-4209.002.patch, YARN-4209.branch-2.7.patch > > > RMStateStore FENCED state doesn’t work due to {{updateFencedState}} called by > {{stateMachine.doTransition}}. The reason is > {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded > in {{stateMachine.doTransition}} called from public > API(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So > right after the internal state transition from {{updateFencedState}} changes > the state to FENCED state, the external state transition changes the state > back to ACTIVE state. The end result is that RMStateStore is still in ACTIVE > state even after {{notifyStoreOperationFailed}} is called. The only working > case for FENCED state is {{notifyStoreOperationFailed}} called from > {{ZKRMStateStore#VerifyActiveStatusThread}}. > For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter > external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => > {{notifyStoreOperationFailed}} > =>{{updateFencedState}}=>{{handleStoreEvent}}=> enter internal > {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}} > change state to FENCED => exit external {{stateMachine.doTransition}} change > state to ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
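The re-entrancy problem described above can be sketched with a toy model. The class, enum, and method shapes here are illustrative, not Hadoop's actual StateMachine API: the point is only that the outer transition's final state is written after the nested call returns, so the nested FENCED result is silently reverted:

```java
public class ReentrantTransitionDemo {
    enum State { ACTIVE, FENCED }

    private State state = State.ACTIVE;

    // Simplified model of the bug: the transition body may recursively call
    // doTransition (as updateFencedState does via handleStoreEvent), but the
    // outer call, still on the stack, writes its own final state afterwards.
    void doTransition(State finalState, Runnable body) {
        body.run();         // may set state to FENCED via a nested call
        state = finalState; // outer call overwrites the nested result
    }

    State getState() {
        return state;
    }
}
```

Running the outer transition with a body that fences the store ends in ACTIVE, which mirrors the observed behavior: the store stays ACTIVE even after notifyStoreOperationFailed runs.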
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3446: Attachment: (was: YARN-3446.003.patch) > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3446: Attachment: YARN-3446.003.patch > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3446: Attachment: YARN-3446.003.patch > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > - > > Key: YARN-3446 > URL: https://issues.apache.org/jira/browse/YARN-3446 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-3446.000.patch, YARN-3446.001.patch, > YARN-3446.002.patch, YARN-3446.003.patch > > > FairScheduler HeadRoom calculation should exclude nodes in the blacklist. > MRAppMaster does not preempt the reducers because for Reducer preemption > calculation, headRoom is considering blacklisted nodes. This makes jobs to > hang forever(ResourceManager does not assign any new containers on > blacklisted nodes but availableResource AM get from RM includes blacklisted > nodes available resource). > This issue is similar as YARN-1680 which is for Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
[ https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3446:
Attachment: (was: YARN-3446.003.patch)

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -------------------------------------------------------------------------
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch, YARN-3446.002.patch
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> MRAppMaster does not preempt reducers because the headroom used in the
> reducer-preemption calculation includes blacklisted nodes. This causes jobs
> to hang forever: the ResourceManager does not assign any new containers on
> blacklisted nodes, but the availableResource the AM gets from the RM still
> includes the blacklisted nodes' available resources.
> This issue is similar to YARN-1680, which covers the Capacity Scheduler.
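The headroom fix discussed above can be sketched roughly as follows. This is a minimal illustration, assuming hypothetical `Resource`, `Node`, and `headroom` names; it is not the actual FairScheduler API, only the idea that capacity on blacklisted hosts must be subtracted before the headroom is reported to the AM:

```java
import java.util.List;
import java.util.Set;

class HeadroomSketch {
    // Minimal resource model (memory only) for illustration.
    record Resource(long memoryMb) {
        Resource add(Resource o) { return new Resource(memoryMb + o.memoryMb); }
    }

    record Node(String host, Resource available) {}

    // Headroom = sum of available resources over non-blacklisted nodes only,
    // so the AM never counts capacity it can never actually be granted.
    static Resource headroom(List<Node> nodes, Set<String> blacklist) {
        Resource total = new Resource(0);
        for (Node n : nodes) {
            if (!blacklist.contains(n.host())) {
                total = total.add(n.available());
            }
        }
        return total;
    }
}
```

With this, a job whose AM has blacklisted a node sees a headroom that reflects only schedulable capacity, so reducer preemption triggers instead of waiting on resources that will never arrive.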
[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3943:
Attachment: YARN-3943.001.patch

> Use separate threshold configurations for disk-full detection and disk-not-full detection.
> ------------------------------------------------------------------------------------------
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch
>
> Use separate threshold configurations to check when disks become full and
> when disks become good again. Currently the configurations
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
> and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are
> used for both checks. It would be better to use two configurations: one for
> the transition from not-full to full, and another for the transition from
> full back to not-full, so we can avoid frequent oscillation.
> For example, we can set the threshold for disk-full detection higher than
> the one for disk-not-full detection.
[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3943:
Attachment: (was: YARN-3943.001.patch)

> Use separate threshold configurations for disk-full detection and disk-not-full detection.
> ------------------------------------------------------------------------------------------
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch
>
> Use separate threshold configurations to check when disks become full and
> when disks become good again. Currently the configurations
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
> and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are
> used for both checks. It would be better to use two configurations: one for
> the transition from not-full to full, and another for the transition from
> full back to not-full, so we can avoid frequent oscillation.
> For example, we can set the threshold for disk-full detection higher than
> the one for disk-not-full detection.
[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3943:
Attachment: YARN-3943.001.patch

> Use separate threshold configurations for disk-full detection and disk-not-full detection.
> ------------------------------------------------------------------------------------------
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-3943.000.patch, YARN-3943.001.patch
>
> Use separate threshold configurations to check when disks become full and
> when disks become good again. Currently the configurations
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
> and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are
> used for both checks. It would be better to use two configurations: one for
> the transition from not-full to full, and another for the transition from
> full back to not-full, so we can avoid frequent oscillation.
> For example, we can set the threshold for disk-full detection higher than
> the one for disk-not-full detection.
[jira] [Updated] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
[ https://issues.apache.org/jira/browse/YARN-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3943:
Attachment: (was: YARN-3943.001.patch)

> Use separate threshold configurations for disk-full detection and disk-not-full detection.
> ------------------------------------------------------------------------------------------
>
> Key: YARN-3943
> URL: https://issues.apache.org/jira/browse/YARN-3943
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-3943.000.patch
>
> Use separate threshold configurations to check when disks become full and
> when disks become good again. Currently the configurations
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"
> and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are
> used for both checks. It would be better to use two configurations: one for
> the transition from not-full to full, and another for the transition from
> full back to not-full, so we can avoid frequent oscillation.
> For example, we can set the threshold for disk-full detection higher than
> the one for disk-not-full detection.
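The two-threshold proposal in YARN-3943 is classic hysteresis. A minimal sketch, assuming hypothetical names (`DiskHealthSketch`, `update`) rather than the actual NodeManager disk-health-checker code:

```java
// Hysteresis for disk-full detection: a high watermark marks the disk
// full, a lower watermark marks it good again. A disk whose utilization
// hovers around a single cutoff no longer flips state on every check.
class DiskHealthSketch {
    private final float fullThresholdPct;     // e.g. 90: mark full above this
    private final float notFullThresholdPct;  // e.g. 80: mark good below this
    private boolean full = false;

    DiskHealthSketch(float fullThresholdPct, float notFullThresholdPct) {
        this.fullThresholdPct = fullThresholdPct;
        this.notFullThresholdPct = notFullThresholdPct;
    }

    // Observe the current utilization and return whether the disk is now
    // considered full. Between the two thresholds the previous state sticks.
    boolean update(float utilizationPct) {
        if (!full && utilizationPct > fullThresholdPct) {
            full = true;
        } else if (full && utilizationPct < notFullThresholdPct) {
            full = false;
        }
        return full;
    }
}
```

With a single threshold, a disk at 89.9%/90.1%/89.9% utilization would be marked good, full, good on three consecutive health checks; with the gap, it stays in one state until it clearly crosses the other watermark.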
[jira] [Commented] (YARN-4209) RMStateStore FENCED state doesn’t work due to updateFencedState called by stateMachine.doTransition
[ https://issues.apache.org/jira/browse/YARN-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945553#comment-14945553 ] zhihai xu commented on YARN-4209:
Thanks [~rohithsharma]! Yes, I attached the patch YARN-4209.branch-2.7.patch for branch-2.7.

> RMStateStore FENCED state doesn't work due to {{updateFencedState}} called by {{stateMachine.doTransition}}
> -----------------------------------------------------------------------------------------------------------
>
> Key: YARN-4209
> URL: https://issues.apache.org/jira/browse/YARN-4209
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.2
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-4209.000.patch, YARN-4209.001.patch, YARN-4209.002.patch, YARN-4209.branch-2.7.patch
>
> RMStateStore's FENCED state doesn't work because {{updateFencedState}} is
> called by {{stateMachine.doTransition}}. The reason is that the
> {{stateMachine.doTransition}} call made from {{updateFencedState}} is nested
> inside the {{stateMachine.doTransition}} call made from a public API
> ({{removeRMDelegationToken}}...) or from {{ForwardingEventHandler#handle}}.
> So right after the internal state transition from {{updateFencedState}}
> changes the state to FENCED, the external state transition changes the state
> back to ACTIVE. The end result is that RMStateStore is still in the ACTIVE
> state even after {{notifyStoreOperationFailed}} is called. The only case in
> which the FENCED state works is when {{notifyStoreOperationFailed}} is
> called from {{ZKRMStateStore#VerifyActiveStatusThread}}.
> For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter
> external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} =>
> {{notifyStoreOperationFailed}} => {{updateFencedState}} =>
> {{handleStoreEvent}} => enter internal {{stateMachine.doTransition}} =>
> exit internal {{stateMachine.doTransition}}, changing the state to FENCED =>
> exit external {{stateMachine.doTransition}}, changing the state back to
> ACTIVE.
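The nested-transition problem in YARN-4209 can be reproduced with a few lines. This is an illustrative sketch, not the actual RMStateStore code: the inner transition fences the store, but the outer transition, already in flight, installs its own post-state afterwards and silently undoes it.

```java
// Sketch of the bug: an outer state-machine transition whose body triggers
// an inner transition to FENCED, then clobbers it with its own post-state.
class StoreStateSketch {
    enum State { ACTIVE, FENCED }
    State state = State.ACTIVE;

    // Models notifyStoreOperationFailed -> updateFencedState: an inner
    // doTransition that moves the store to FENCED.
    void notifyStoreOperationFailed() {
        state = State.FENCED;
    }

    // Models the outer stateMachine.doTransition triggered by a public API:
    // the transition body runs (and may fail and fence the store), but the
    // outer transition unconditionally installs its own target state last.
    void outerDoTransition() {
        notifyStoreOperationFailed(); // inner transition: state becomes FENCED
        state = State.ACTIVE;         // outer transition's post-state wins
    }
}
```

After `outerDoTransition()` the store is back in ACTIVE even though the operation failed, which is exactly the behavior the description traces through {{removeRMDelegationToken}}.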