[jira] [Created] (YARN-10535) Make changes in queue placement policy to use auto-queue-placement API in CapacityScheduler
Wangda Tan created YARN-10535:
----------------------------------

Summary: Make changes in queue placement policy to use auto-queue-placement API in CapacityScheduler
Key: YARN-10535
URL: https://issues.apache.org/jira/browse/YARN-10535
Project: Hadoop YARN
Issue Type: Sub-task
Components: capacity scheduler
Reporter: Wangda Tan

Once YARN-10506 is done, we need to call the API from the queue placement policy to create queues.
[jira] [Created] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used
Wangda Tan created YARN-10532:
----------------------------------

Summary: Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used
Key: YARN-10532
URL: https://issues.apache.org/jira/browse/YARN-10532
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Wangda Tan

It's better if we can delete auto-created queues when they are not in use for a period of time (like 5 mins). It will be helpful when we have a large number of auto-created queues (e.g. from 500 users), but only a small subset of queues is actively used.
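A minimal sketch of how such an idle-queue reaper could work. Everything here is illustrative: {{QueueOps}}, {{isIdle}}, and {{removeQueue}} are hypothetical stand-ins for scheduler hooks, not actual CapacityScheduler APIs.

{code:java}
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch only: QueueOps is a hypothetical stand-in for scheduler hooks. */
public class AutoQueueReaper {
  interface QueueOps {
    boolean isIdle(String queuePath);   // no running or pending apps
    void removeQueue(String queuePath); // delete a dynamically created queue
  }

  private final Map<String, Instant> idleSince = new ConcurrentHashMap<>();
  private final Duration expiry; // e.g. Duration.ofMinutes(5)
  private final QueueOps ops;

  AutoQueueReaper(QueueOps ops, Duration expiry) {
    this.ops = ops;
    this.expiry = expiry;
  }

  /** Invoked periodically with the current set of auto-created queues. */
  void sweep(Iterable<String> autoCreatedQueues) {
    Instant now = Instant.now();
    for (String q : autoCreatedQueues) {
      if (!ops.isIdle(q)) {
        idleSince.remove(q); // queue became active again, reset the clock
        continue;
      }
      Instant since = idleSince.computeIfAbsent(q, k -> now);
      if (Duration.between(since, now).compareTo(expiry) >= 0) {
        ops.removeQueue(q);  // idle longer than the expiry window
        idleSince.remove(q);
      }
    }
  }
}
{code}

A scheduled executor could invoke {{sweep}} every few seconds with the current list of auto-created queue paths.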
[jira] [Created] (YARN-10531) Be able to disable user limit factor for CapacityScheduler Leaf Queue
Wangda Tan created YARN-10531:
----------------------------------

Summary: Be able to disable user limit factor for CapacityScheduler Leaf Queue
Key: YARN-10531
URL: https://issues.apache.org/jira/browse/YARN-10531
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Wangda Tan

The user limit factor defines the max cap on how much resource can be consumed by a single user. In the auto queue creation context, it doesn't make much sense to set a user limit factor: initially every queue will set its weight to 1.0, and we want users to consume more resources when possible. It is hard to pre-determine how to set the user limit factor. So it makes more sense to add a new value (like -1) to indicate that the user limit factor is disabled.

The logic that needs to change is below (inside LeafQueue.java):
{code}
Resource maxUserLimit = Resources.none();
if (schedulingMode == SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY) {
  maxUserLimit = Resources.multiplyAndRoundDown(queueCapacity,
      getUserLimitFactor());
} else if (schedulingMode == SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY) {
  maxUserLimit = partitionResource;
}
{code}
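One way the quoted block could change, assuming -1 is adopted as the sentinel that disables the factor; this is a sketch, not the actual patch:

{code:java}
Resource maxUserLimit = Resources.none();
if (schedulingMode == SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY) {
  if (getUserLimitFactor() == -1) {
    // Hypothetical sentinel: user limit factor disabled. Fall back to the
    // whole partition resource, effectively removing the per-user cap;
    // the queue's own max resource still applies.
    maxUserLimit = partitionResource;
  } else {
    maxUserLimit = Resources.multiplyAndRoundDown(queueCapacity,
        getUserLimitFactor());
  }
} else if (schedulingMode == SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY) {
  maxUserLimit = partitionResource;
}
{code}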
[jira] [Created] (YARN-10530) CapacityScheduler ResourceLimits doesn't handle node partition well
Wangda Tan created YARN-10530:
----------------------------------

Summary: CapacityScheduler ResourceLimits doesn't handle node partition well
Key: YARN-10530
URL: https://issues.apache.org/jira/browse/YARN-10530
Project: Hadoop YARN
Issue Type: Bug
Components: capacity scheduler, capacityscheduler
Reporter: Wangda Tan

This is a serious bug that may impact all releases. I need to check further, but I want to log the JIRA so we don't forget.

ResourceLimits objects serve two purposes:

1) When the cluster resource changes (for example, a new node is added, or the scheduler config is reinitialized), we pass ResourceLimits down to the queues via updateClusterResource.

2) When allocating a container, we pass the parent's available resource to the child to make sure the child's allocation won't violate the parent's max resource. For example:
{code}
queue      used  max
--------------------
root       10    20
root.a     8     10
root.a.a1  2     10
root.a.a2  6     10
{code}
Even though a.a1 has 8 resources of headroom (a1.max - a1.used), we can allocate at most 2 resources to a1 because root.a's limit is hit first. This information is passed down from parent queue to child queue during the assignContainers call via ResourceLimits.

However, we only pass one ResourceLimits from the top. For queue initialization, we pass in:
{code}
root.updateClusterResource(clusterResource, new ResourceLimits(
    clusterResource));
{code}
And when we update cluster resource, we only consider the default partition:
{code}
// Update all children
for (CSQueue childQueue : childQueues) {
  // Get ResourceLimits of child queue before assign containers
  ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
      clusterResource, resourceLimits, RMNodeLabelsManager.NO_LABEL, false);
  childQueue.updateClusterResource(clusterResource, childLimits);
}
{code}
The same goes for the allocation logic; we pass in (actually, I found I added a TODO item 5 years ago):
{code}
// Try to use NON_EXCLUSIVE
assignment = getRootQueue().assignContainers(getClusterResource(),
    candidates,
    // TODO, now we only consider limits for parent for non-labeled
    // resources, should consider labeled resources as well.
    new ResourceLimits(labelManager.getResourceByLabel(
        RMNodeLabelsManager.NO_LABEL, getClusterResource())),
    SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
{code}
The good thing is that inside the assignContainers call, we calculate the child limit based on the partition:
{code}
ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
    cluster, limits, candidates.getPartition(), true);
{code}
So I think the problem is: when a named partition has more resources than the default partition, the effective min/max resource of each queue could be wrong.
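For the allocation entry point, a sketch of what a partition-aware root limit might look like, reusing the names from the quoted snippets (this only replaces {{NO_LABEL}} with the candidates' partition; it is an illustration, not the actual fix):

{code:java}
// Compute the root ResourceLimits from the resource of the partition being
// scheduled, instead of always using the default (NO_LABEL) partition.
assignment = getRootQueue().assignContainers(getClusterResource(),
    candidates,
    new ResourceLimits(labelManager.getResourceByLabel(
        candidates.getPartition(), getClusterResource())),
    SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
{code}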
[jira] [Created] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues
Wangda Tan created YARN-10497:
----------------------------------

Summary: Fix an issue in CapacityScheduler which fails to delete queues
Key: YARN-10497
URL: https://issues.apache.org/jira/browse/YARN-10497
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Wangda Tan

We saw an exception when using queue mutation APIs:
{code:java}
2020-11-13 16:47:46,327 WARN org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: CapacityScheduler configuration validation failed:java.io.IOException: Queue root.am2cmQueueSecond not found
{code}
Which comes from this code (inside MutableCSConfigurationProvider):
{code:java}
List<String> siblingQueues = getSiblingQueues(queueToRemove, proposedConf);
if (!siblingQueues.contains(queueName)) {
  throw new IOException("Queue " + queueToRemove + " not found");
}
{code}
If you look at the method:
{code:java}
private List<String> getSiblingQueues(String queuePath, Configuration conf) {
  String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
  String childQueuesKey = CapacitySchedulerConfiguration.PREFIX + parentQueue
      + CapacitySchedulerConfiguration.DOT
      + CapacitySchedulerConfiguration.QUEUES;
  return new ArrayList<>(conf.getStringCollection(childQueuesKey));
}
{code}
And here's the capacity-scheduler.xml I got:
{code:xml}
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default, q1, q2</value>
</property>
{code}
You can notice there are spaces between default, q1, q2. So conf.getStringCollection returns:
{code:java}
default
 q1
...
{code}
The entries after the first keep their leading space, which causes the contains() check to miss when we try to delete the queue.
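A possible fix is {{Configuration#getTrimmedStringCollection}}, which strips the whitespace around each comma-separated entry, so "default, q1, q2" yields ["default", "q1", "q2"] and the contains() check matches again. A sketch of the method with that one change:

{code:java}
private List<String> getSiblingQueues(String queuePath, Configuration conf) {
  String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
  String childQueuesKey = CapacitySchedulerConfiguration.PREFIX + parentQueue
      + CapacitySchedulerConfiguration.DOT
      + CapacitySchedulerConfiguration.QUEUES;
  // getTrimmedStringCollection trims each element, unlike getStringCollection.
  return new ArrayList<>(conf.getTrimmedStringCollection(childQueuesKey));
}
{code}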
[jira] [Created] (YARN-10496) [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler
Wangda Tan created YARN-10496:
----------------------------------

Summary: [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler
Key: YARN-10496
URL: https://issues.apache.org/jira/browse/YARN-10496
Project: Hadoop YARN
Issue Type: New Feature
Components: capacity scheduler
Reporter: Wangda Tan

CapacityScheduler today doesn't support auto queue creation that is flexible enough. The current constraints:
* Only leaf queues can be auto-created.
* A parent can only have either static queues or dynamic ones. This causes multiple constraints. For example:
** It isn't possible to have a VIP user like Alice with a static queue root.user.alice with 50% capacity while the other user queues (under root.user) are created dynamically and share the remaining 50% of resources.
** This implies that there is no possibility to have both dynamically created and static queues at the same time under root.
* In comparison, FairScheduler allows the following scenarios, while Capacity Scheduler doesn't:
** A new queue needs to be created under an existing parent, while the parent already has static queues.
** Nested queue mapping policies, like a nested placement rule of the form {{u:%user:root.%primary_group.%user}}. Here two levels of queues may need to be created: if an application belongs to user _alice_ (who has the primary_group of _engineering_), the scheduler checks whether _root.engineering_ exists and creates it if it doesn't; then it checks whether _root.engineering.alice_ exists, and creates it if it doesn't.

When we try to move users from FairScheduler to CapacityScheduler, these feature gaps block migration from FS to CS.
[jira] [Created] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
Wangda Tan created YARN-10380:
----------------------------------

Summary: Import logic of multi-node allocation in CapacityScheduler
Key: YARN-10380
URL: https://issues.apache.org/jira/browse/YARN-10380
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Wangda Tan

*1) Entry point:*
When we do multi-node allocation, we use the same logic as async scheduling:
{code:java}
// Allocate containers of node [start, end)
for (FiCaSchedulerNode node : nodes) {
  if (current++ >= start) {
    if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
      continue;
    }
    cs.allocateContainersToNode(node.getNodeID(), false);
  }
}
{code}
Is this the most effective way to do multi-node scheduling? Should we allocate based on partitions? In the above logic, if we have thousands of nodes in one partition, we repeatedly access all nodes of the partition thousands of times.

I would suggest making the entry points for node-heartbeat, async scheduling (single node), and async scheduling (multi-node) different. Node-heartbeat and async scheduling (single node) can still be similar and share most of the code.

Async scheduling (multi-node) should iterate partitions first, using pseudocode like the following (a fleshed-out sketch follows below):
{code:java}
for (partition : all partitions) {
  allocateContainersOnMultiNodes(getCandidate(partition))
}
{code}
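A slightly fleshed-out version of that pseudocode; {{getCandidatesForPartition}} and {{allocateContainersOnMultiNodes}} are hypothetical names carried over from the pseudocode, not existing CapacityScheduler methods:

{code:java}
// One scheduling pass per partition: all nodes of the partition are
// considered together, instead of re-walking the node list once per node.
for (String partition : allPartitions) {
  CandidateNodeSet<FiCaSchedulerNode> candidates =
      getCandidatesForPartition(partition);
  allocateContainersOnMultiNodes(candidates);
}
{code}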
[jira] [Resolved] (YARN-10151) Disable Capacity Scheduler's move app between queue functionality
[ https://issues.apache.org/jira/browse/YARN-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan resolved YARN-10151.
-------------------------------
    Resolution: Won't Fix

Thanks folks for commenting about YARN-9838. I think we don't need this change now, given we already have a fix for the reported issue.

> Disable Capacity Scheduler's move app between queue functionality
> ------------------------------------------------------------------
>
> Key: YARN-10151
> URL: https://issues.apache.org/jira/browse/YARN-10151
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Wangda Tan
> Priority: Critical
>
> Saw this happen in many clusters: Capacity Scheduler cannot work correctly
> with the move-app-between-queues feature. It causes weird JMX issues,
> resource accounting issues, etc. In a lot of cases it leaves the RM
> completely hung with available resources negative, and nothing can be
> allocated after that. We should turn off CapacityScheduler's
> move-app-between-queues feature. (see:
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#moveApplication}})
[jira] [Created] (YARN-10170) Should revisit mix-usage of percentage-based and absolute-value-based min/max resource in CapacityScheduler
Wangda Tan created YARN-10170:
----------------------------------

Summary: Should revisit mix-usage of percentage-based and absolute-value-based min/max resource in CapacityScheduler
Key: YARN-10170
URL: https://issues.apache.org/jira/browse/YARN-10170
Project: Hadoop YARN
Issue Type: Bug
Reporter: Wangda Tan

This should be finished after YARN-10169. (If we can get this one done easily, we should do this one instead of YARN-10169.)

Absolute resource means mem=x, vcores=y. Percentage resource means x%. We should not allow a percentage-based child under an absolute-based parent (root is considered percentage-based).
[jira] [Created] (YARN-10167) Need validate c-s.xml after converting
Wangda Tan created YARN-10167:
----------------------------------

Summary: Need validate c-s.xml after converting
Key: YARN-10167
URL: https://issues.apache.org/jira/browse/YARN-10167
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Wangda Tan

Currently we just generate c-s.xml, but we don't validate it. To make sure the c-s.xml is correct after conversion, it's better to initialize the CapacityScheduler using the generated configs. Also, in the tests, we should try to leverage MockRM to validate the generated configs as much as we can.
[jira] [Created] (YARN-10151) Disable Capacity Scheduler's move app between queue functionality
Wangda Tan created YARN-10151:
----------------------------------

Summary: Disable Capacity Scheduler's move app between queue functionality
Key: YARN-10151
URL: https://issues.apache.org/jira/browse/YARN-10151
Project: Hadoop YARN
Issue Type: Bug
Reporter: Wangda Tan

Saw this happen in many clusters: Capacity Scheduler cannot work correctly with the move-app-between-queues feature. It causes weird JMX issues, resource accounting issues, etc. In a lot of cases it leaves the RM completely hung with available resources negative, and nothing can be allocated after that. We should turn off CapacityScheduler's move-app-between-queues feature. (see: {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#moveApplication}})
[jira] [Resolved] (YARN-8975) [Submarine] Use predefined Charset object StandardCharsets.UTF_8 instead of String "UTF-8"
[ https://issues.apache.org/jira/browse/YARN-8975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan resolved YARN-8975.
------------------------------
    Resolution: Fixed
    Hadoop Flags: Reviewed
    Fix Version/s: 3.3.0

Committed to trunk. Thanks [~tangzhankun], and thanks [~ajisakaa] for the reviews.

> [Submarine] Use predefined Charset object StandardCharsets.UTF_8 instead of String "UTF-8"
> -------------------------------------------------------------------------------------------
>
> Key: YARN-8975
> URL: https://issues.apache.org/jira/browse/YARN-8975
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Priority: Trivial
> Fix For: 3.3.0
> Attachments: YARN-8975-trunk.001.patch, YARN-8975-trunk.002.patch
>
> {code:java}
> Writer w = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
> {code}
> could be refactored to the following, improving performance a little by avoiding the charset-name lookup:
> {code:java}
> Writer w = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8);
> {code}
[jira] [Resolved] (YARN-9020) set a wrong AbsoluteCapacity when call ParentQueue#setAbsoluteCapacity
[ https://issues.apache.org/jira/browse/YARN-9020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan resolved YARN-9020.
------------------------------
    Resolution: Duplicate

Thanks [~jutia] for reporting this; it is a valid issue. This is a dup of YARN-8917, and [~Tao Yang] has put up a patch already. Closing this as a dup.

> set a wrong AbsoluteCapacity when call ParentQueue#setAbsoluteCapacity
> -----------------------------------------------------------------------
>
> Key: YARN-9020
> URL: https://issues.apache.org/jira/browse/YARN-9020
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: tianjuan
> Assignee: tianjuan
> Priority: Major
>
> A wrong AbsoluteCapacity is set when calling ParentQueue#setAbsoluteCapacity:
> {code:java}
> private void deriveCapacityFromAbsoluteConfigurations(String label,
>     Resource clusterResource, ResourceCalculator rc, CSQueue childQueue) {
>   // 3. Update absolute capacity as a float based on parent's minResource and
>   // cluster resource.
>   childQueue.getQueueCapacities().setAbsoluteCapacity(label,
>       (float) childQueue.getQueueCapacities().getCapacity()
>           / getQueueCapacities().getAbsoluteCapacity(label));
> {code}
> It should be getCapacity(label) instead of getCapacity():
> {code:java}
>   childQueue.getQueueCapacities().setAbsoluteCapacity(label,
>       (float) childQueue.getQueueCapacities().getCapacity(label)
>           / getQueueCapacities().getAbsoluteCapacity(label));
> {code}
[jira] [Created] (YARN-8993) [Submarine] Add support to run deep learning workload in non-Docker containers
Wangda Tan created YARN-8993:
---------------------------------

Summary: [Submarine] Add support to run deep learning workload in non-Docker containers
Key: YARN-8993
URL: https://issues.apache.org/jira/browse/YARN-8993
Project: Hadoop YARN
Issue Type: Bug
Reporter: Wangda Tan

Submarine now supports Docker containers well, but there is a need to run TF without Docker containers. This JIRA targets orchestration of deep learning workloads in non-Docker containers.
[jira] [Resolved] (YARN-8237) mxnet yarn spec file to add to native service examples
[ https://issues.apache.org/jira/browse/YARN-8237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan resolved YARN-8237.
------------------------------
    Resolution: Duplicate

> mxnet yarn spec file to add to native service examples
> --------------------------------------------------------
>
> Key: YARN-8237
> URL: https://issues.apache.org/jira/browse/YARN-8237
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn-native-services
> Reporter: Sunil Govindan
> Assignee: Sunil Govindan
> Priority: Major
>
> Mxnet can run on YARN. This jira will add the examples, yarnfile, and docker files needed to run Mxnet on YARN.
[jira] [Resolved] (YARN-8238) [Umbrella] YARN deep learning framework examples to run on native service
[ https://issues.apache.org/jira/browse/YARN-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan resolved YARN-8238.
------------------------------
    Resolution: Fixed

Closing as dup of YARN-8135.

> [Umbrella] YARN deep learning framework examples to run on native service
> ---------------------------------------------------------------------------
>
> Key: YARN-8238
> URL: https://issues.apache.org/jira/browse/YARN-8238
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn-native-services
> Reporter: Sunil Govindan
> Assignee: Sunil Govindan
> Priority: Major
>
> Umbrella JIRA to track various deep learning frameworks that can run on YARN native services.
[jira] [Resolved] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized
[ https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan resolved YARN-8513.
------------------------------
    Resolution: Duplicate
    Fix Version/s: (was: 3.2.1)
                   (was: 3.1.2)

Reopened and closing as dup of YARN-8896.

> CapacityScheduler infinite loop when queue is near fully utilized
> ------------------------------------------------------------------
>
> Key: YARN-8513
> URL: https://issues.apache.org/jira/browse/YARN-8513
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, yarn
> Affects Versions: 3.1.0, 2.9.1
> Environment: Ubuntu 14.04.5 and 16.04.4; YARN is configured with one label and 5 queues.
> Reporter: Chen Yufei
> Priority: Major
> Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, jstack-5.log, top-during-lock.log, top-when-normal.log, yarn3-jstack1.log, yarn3-jstack2.log, yarn3-jstack3.log, yarn3-jstack4.log, yarn3-jstack5.log, yarn3-resourcemanager.log, yarn3-top
>
> Sometimes the ResourceManager does not respond to any request when a queue is nearly fully utilized. Sending SIGTERM won't stop the RM; only SIGKILL can. After an RM restart, it can recover running jobs and start accepting new ones.
>
> CapacityScheduler seems to be in an infinite loop printing the following log messages (more than 25,000 lines in a second):
>
> {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.99816763 absoluteUsedCapacity=0.99816763 used= cluster=}}
> {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1530619767030_1652_01 container=null queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943 clusterResource= type=NODE_LOCAL requestedPartition=}}
>
> I encountered this problem several times after upgrading to YARN 2.9.1, while the same configuration works fine under version 2.7.3.
>
> YARN-4477 is an infinite loop bug in FairScheduler; not sure if this is a similar problem.
[jira] [Resolved] (YARN-8896) Limit the maximum number of container assignments per heartbeat
[ https://issues.apache.org/jira/browse/YARN-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan resolved YARN-8896.
------------------------------
    Resolution: Fixed

> Limit the maximum number of container assignments per heartbeat
> ----------------------------------------------------------------
>
> Key: YARN-8896
> URL: https://issues.apache.org/jira/browse/YARN-8896
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 2.9.0, 3.0.0
> Reporter: Weiwei Yang
> Assignee: Zhankun Tang
> Priority: Major
> Fix For: 3.1.2, 3.2.1
> Attachments: YARN-8896-trunk.001.patch
>
> YARN-4161 adds a configuration {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} to control the max number of container assignments per heartbeat; however, the default value is -1. This can cause the CS to get stuck in the while loop, causing issues like YARN-8513. We should change this to a finite number, e.g. 100.
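For reference, capping the value would look like this in capacity-scheduler.xml; the property name is taken from the quoted description, and 100 is the finite value suggested there:

{code:xml}
<property>
  <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments</name>
  <value>100</value>
</property>
{code}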
[jira] [Created] (YARN-8858) CapacityScheduler should respect maximum node resource when per-queue maximum-allocation is being used.
Wangda Tan created YARN-8858:
---------------------------------

Summary: CapacityScheduler should respect maximum node resource when per-queue maximum-allocation is being used.
Key: YARN-8858
URL: https://issues.apache.org/jira/browse/YARN-8858
Project: Hadoop YARN
Issue Type: Bug
Reporter: Sumana Sathish
Assignee: Wangda Tan

This issue happens after YARN-8720. Before that, the AMS used scheduler.getMaximumAllocation to do the normalization; after that, the AMS uses LeafQueue.getMaximumAllocation. The scheduler one consults nodeTracker.getMaximumAllocation, but the LeafQueue one doesn't. We should use scheduler.getMaximumAllocation to cap the per-queue maximum-allocation every time.
[jira] [Created] (YARN-8817) [Submarine] In some cases HDFS is not asked by user when submit job but framework requires user to set HDFS related environments
Wangda Tan created YARN-8817:
---------------------------------

Summary: [Submarine] In some cases HDFS is not asked by user when submit job but framework requires user to set HDFS related environments
Key: YARN-8817
URL: https://issues.apache.org/jira/browse/YARN-8817
Project: Hadoop YARN
Issue Type: Sub-task
Components: submarine
Reporter: Wangda Tan

Users who submit a job can see an error message like:
{code}
18/09/24 23:12:58 ERROR yarnservice.YarnServiceJobSubmitter: When hdfs is being used to read/write models/data. Following envs are required: 1) DOCKER_HADOOP_HDFS_HOME= 2) DOCKER_JAVA_HOME=. You can use --env to pass these envars.
Exception in thread "main" java.io.IOException: Failed to detect HDFS-related environments
{code}
This happens even when HDFS was not requested.
[jira] [Resolved] (YARN-8799) [Submarine] Correct the default directory path in HDFS for "checkout_path"
[ https://issues.apache.org/jira/browse/YARN-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan resolved YARN-8799.
------------------------------
    Resolution: Duplicate

This is duplicated by YARN-8757.

> [Submarine] Correct the default directory path in HDFS for "checkout_path"
> ---------------------------------------------------------------------------
>
> Key: YARN-8799
> URL: https://issues.apache.org/jira/browse/YARN-8799
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Priority: Major
> Fix For: 3.2.0
>
> {code:java}
> yarn jar $HADOOP_BASE_DIR/home/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
>  -verbose \
>  -wait_job_finish \
>  -keep_staging_dir \
>  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-oracle \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.2.0-SNAPSHOT \
>  --name tf-job-001 \
>  --docker_image tangzhankun/tensorflow \
>  --input_path hdfs://default/user/yarn/cifar-10-data \
>  --worker_resources memory=4G,vcores=2 \
>  --worker_launch_cmd "cd /cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0 --train-steps=5"
> {code}
>
> The above script should work, but per my testing the job failed due to an invalid path passed to "--job-dir". It should be a URI starting with "hdfs://".
>
> {code:java}
> 2018-09-19 23:19:34,729 INFO yarnservice.YarnServiceJobSubmitter: Worker command =[cd /cifar10_estimator && python cifar10_main.py --data-dir=hdfs://default/user/yarn/cifar-10-data --job-dir=submarine/jobs/tf-job-001/staging/checkpoint_path --num-gpus=0 --train-steps=2]
> {code}
[jira] [Created] (YARN-8800) Updated documentation of Submarine with latest examples.
Wangda Tan created YARN-8800:
---------------------------------

Summary: Updated documentation of Submarine with latest examples.
Key: YARN-8800
URL: https://issues.apache.org/jira/browse/YARN-8800
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan
[jira] [Created] (YARN-8770) [Submarine] Support using Submarine to submit Pytorch job
Wangda Tan created YARN-8770:
---------------------------------

Summary: [Submarine] Support using Submarine to submit Pytorch job
Key: YARN-8770
URL: https://issues.apache.org/jira/browse/YARN-8770
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Wangda Tan
[jira] [Created] (YARN-8769) [Submarine] Allow user to specify customized quicklink(s) when submit Submarine job
Wangda Tan created YARN-8769:
---------------------------------

Summary: [Submarine] Allow user to specify customized quicklink(s) when submit Submarine job
Key: YARN-8769
URL: https://issues.apache.org/jira/browse/YARN-8769
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan

This will be helpful when a user submits a job and some links need to be shown on YARN UI2 (service page). For example, a user can specify a quick link to the Zeppelin notebook UI when a Zeppelin notebook gets launched.
[jira] [Created] (YARN-8757) [Submarine] Add Tensorboard component when --tensorboard is specified
Wangda Tan created YARN-8757:
---------------------------------

Summary: [Submarine] Add Tensorboard component when --tensorboard is specified
Key: YARN-8757
URL: https://issues.apache.org/jira/browse/YARN-8757
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan
[jira] [Created] (YARN-8756) [Submarine] Properly handle relative path for staging area
Wangda Tan created YARN-8756:
---------------------------------

Summary: [Submarine] Properly handle relative path for staging area
Key: YARN-8756
URL: https://issues.apache.org/jira/browse/YARN-8756
Project: Hadoop YARN
Issue Type: Sub-task
Components: submarine
Reporter: Wangda Tan
Assignee: Wangda Tan

While testing, I found that when a relative path is specified for the checkpoint, the path passed to Tensorflow is wrong. A trick is to get a FileStatus before returning (see the sketch below). Will attach a fix soon.
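A sketch of that trick: resolving a {{FileStatus}} returns a fully qualified path (scheme and authority included), so a relative staging directory becomes an hdfs:// URI before it is handed to Tensorflow. The path below is illustrative only:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path staging = new Path("submarine/jobs/tf-job-001/staging"); // relative
// getFileStatus().getPath() is fully qualified, e.g.
// hdfs://ns1/user/yarn/submarine/jobs/tf-job-001/staging
Path qualified = fs.getFileStatus(staging).getPath();
{code}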
[jira] [Created] (YARN-8716) [Submarine] Support passing Kerberos principal tokens when launching training jobs.
Wangda Tan created YARN-8716:
---------------------------------

Summary: [Submarine] Support passing Kerberos principal tokens when launching training jobs.
Key: YARN-8716
URL: https://issues.apache.org/jira/browse/YARN-8716
Project: Hadoop YARN
Issue Type: Sub-task
Components: submarine
Reporter: Wangda Tan
[jira] [Created] (YARN-8713) [Submarine] Support deploy model serving for existing models
Wangda Tan created YARN-8713:
---------------------------------

Summary: [Submarine] Support deploy model serving for existing models
Key: YARN-8713
URL: https://issues.apache.org/jira/browse/YARN-8713
Project: Hadoop YARN
Issue Type: Sub-task
Components: submarine
Reporter: Wangda Tan

See https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7, {{model deploy}}.
[jira] [Created] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.
Wangda Tan created YARN-8714:
---------------------------------

Summary: [Submarine] Support files/tarballs to be localized for a training job.
Key: YARN-8714
URL: https://issues.apache.org/jira/browse/YARN-8714
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Wangda Tan

See https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7, {{job run --localizations ...}}
[jira] [Created] (YARN-8712) [Submarine] Support create models / versions for training result.
Wangda Tan created YARN-8712:
---------------------------------

Summary: [Submarine] Support create models / versions for training result.
Key: YARN-8712
URL: https://issues.apache.org/jira/browse/YARN-8712
Project: Hadoop YARN
Issue Type: Sub-task
Components: submarine
Reporter: Wangda Tan

As mentioned in https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7, we should be able to have models/versions for models created by a training algorithm. See the design doc for syntax, etc.
[jira] [Created] (YARN-8657) User limit calculation should be read-lock-protected within LeafQueue
Wangda Tan created YARN-8657:
---------------------------------

Summary: User limit calculation should be read-lock-protected within LeafQueue
Key: YARN-8657
URL: https://issues.apache.org/jira/browse/YARN-8657
Project: Hadoop YARN
Issue Type: Bug
Components: capacity scheduler
Reporter: Sumana Sathish
Assignee: Wangda Tan

When async scheduling is enabled, the user limit calculation could be wrong: it is possible that the scheduler calculated a user_limit, but inside {{canAssignToUser}} it has become stale. We need to protect the user limit calculation with the read lock.
[jira] [Created] (YARN-8563) Support users to specify Python/TF package/version/dependencies for training job.
Wangda Tan created YARN-8563:
---------------------------------

Summary: Support users to specify Python/TF package/version/dependencies for training job.
Key: YARN-8563
URL: https://issues.apache.org/jira/browse/YARN-8563
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Wangda Tan

YARN-8561 assumes all Python / Tensorflow dependencies are packed into the docker image. In practice, users don't always want to build a docker image. Instead, a user can provide Python packages / dependencies (like .whl files) plus the Python and TF versions, and Submarine can localize the specified dependencies onto prebuilt base Docker images.
[jira] [Created] (YARN-8561) Add submarine initial implementation: training job submission and job history retrieve.
Wangda Tan created YARN-8561:
---------------------------------

Summary: Add submarine initial implementation: training job submission and job history retrieve.
Key: YARN-8561
URL: https://issues.apache.org/jira/browse/YARN-8561
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan

Added the following parts:
1) A new subcomponent of YARN, under the applications/ project.
2) Tensorflow training job submission, including training (single node and distributed):
 - Support Docker containers.
 - Support GPU isolation.
 - Support YARN registry DNS.
3) Retrieve job history.
[jira] [Created] (YARN-8545) YARN native service should return container if launch failed
Wangda Tan created YARN-8545:
---------------------------------

Summary: YARN native service should return container if launch failed
Key: YARN-8545
URL: https://issues.apache.org/jira/browse/YARN-8545
Project: Hadoop YARN
Issue Type: Task
Reporter: Wangda Tan

In some cases, a container launch may fail, but the container is not properly returned to the RM.

This can happen when the AM tries to prepare the container launch context but fails without sending the launch context to the NM (once the launch context is sent to the NM, the NM reports the failed container to the RM).

Exception like:
{code:java}
java.io.FileNotFoundException: File does not exist: hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
	at org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
	at org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
	at org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
	at org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}
[jira] [Created] (YARN-8506) Make GetApplicationsRequestPBImpl thread safe
Wangda Tan created YARN-8506:
---------------------------------

Summary: Make GetApplicationsRequestPBImpl thread safe
Key: YARN-8506
URL: https://issues.apache.org/jira/browse/YARN-8506
Project: Hadoop YARN
Issue Type: Task
Reporter: Wangda Tan
Assignee: Wangda Tan

When GetApplicationsRequestPBImpl is used in a multi-threaded environment, exceptions like the one below occur because we don't protect write ops:
{code}
java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at java.util.ArrayList.addAll(ArrayList.java:613)
	at com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:132)
	at com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:123)
	at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:327)
	at org.apache.hadoop.yarn.proto.YarnServiceProtos$GetApplicationsRequestProto$Builder.addAllApplicationTags(YarnServiceProtos.java:24450)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToBuilder(GetApplicationsRequestPBImpl.java:100)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToProto(GetApplicationsRequestPBImpl.java:78)
	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.getProto(GetApplicationsRequestPBImpl.java:69)
{code}
We need to make GetApplicationsRequestPBImpl thread safe. We saw this issue happen frequently when RequestHedgingRMFailoverProxyProvider is being used.
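The usual remedy for PBImpl classes is to serialize the merge-and-build path. A sketch following the standard Hadoop PBImpl pattern, shown only to illustrate where {{synchronized}} would go (abridged; not the actual patch):

{code:java}
public synchronized GetApplicationsRequestProto getProto() {
  mergeLocalToProto();
  proto = viaProto ? proto : builder.build();
  viaProto = true;
  return proto;
}

private synchronized void mergeLocalToProto() {
  if (viaProto) {
    maybeInitBuilder();
  }
  mergeLocalToBuilder(); // the write op that raced in the trace above
  proto = builder.build();
  viaProto = true;
}
{code}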
[jira] [Created] (YARN-8489) Need to support custom termination policy for native services
Wangda Tan created YARN-8489:
---------------------------------

Summary: Need to support custom termination policy for native services
Key: YARN-8489
URL: https://issues.apache.org/jira/browse/YARN-8489
Project: Hadoop YARN
Issue Type: Task
Components: yarn-native-services
Reporter: Wangda Tan

The existing YARN service supports termination policies tied to the different restart policies. For example, ALWAYS means the service will not be terminated, and NEVER means the service will be terminated once all components terminate.

Some jobs/services need a different policy. For example, if the Tensorflow master component terminates (regardless of whether it succeeded or just finished), we need to terminate the whole training job regardless of the states of the other components.
[jira] [Created] (YARN-8488) Need to add "SUCCEED" state to YARN service
Wangda Tan created YARN-8488:
---------------------------------

Summary: Need to add "SUCCEED" state to YARN service
Key: YARN-8488
URL: https://issues.apache.org/jira/browse/YARN-8488
Project: Hadoop YARN
Issue Type: Task
Reporter: Wangda Tan

The existing YARN service has the following states:
{code}
public enum ServiceState {
  ACCEPTED, STARTED, STABLE, STOPPED, FAILED, FLEX, UPGRADING,
  UPGRADING_AUTO_FINALIZE;
}
{code}
Ideally we should add a "SUCCEEDED" state in order to support long running applications like Tensorflow.
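With that addition, the enum would read (a sketch of the proposal, not a committed change):

{code}
public enum ServiceState {
  ACCEPTED, STARTED, STABLE, STOPPED, FAILED, FLEX, UPGRADING,
  UPGRADING_AUTO_FINALIZE,
  SUCCEEDED; // proposed: terminal state for services whose work finished successfully
}
{code}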
[jira] [Resolved] (YARN-8478) The capacity scheduler logs too frequently seriously affecting performance
[ https://issues.apache.org/jira/browse/YARN-8478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan resolved YARN-8478.
------------------------------
    Resolution: Duplicate

> The capacity scheduler logs too frequently seriously affecting performance
> ---------------------------------------------------------------------------
>
> Key: YARN-8478
> URL: https://issues.apache.org/jira/browse/YARN-8478
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, capacityscheduler
> Reporter: YunFan Zhou
> Assignee: YunFan Zhou
> Priority: Critical
> Attachments: image-2018-06-29-14-08-50-981.png
>
> The capacity scheduler logs too frequently, seriously affecting performance. In our tests, the scheduling speed of the capacity scheduler struggles to reach 5000/s in a production scenario, and it quickly hits the logging bottleneck.
> My current work is to change many log levels from INFO to DEBUG.
> [~wangda] [~leftnoteasy] Any suggestions?
> !image-2018-06-29-14-08-50-981.png!
[jira] [Created] (YARN-8466) Add Chaos Monkey unit test framework for validation in scale
Wangda Tan created YARN-8466:
---------------------------------

Summary: Add Chaos Monkey unit test framework for validation in scale
Key: YARN-8466
URL: https://issues.apache.org/jira/browse/YARN-8466
Project: Hadoop YARN
Issue Type: Task
Reporter: Wangda Tan

Currently we don't have such a framework for testing. We need a framework to do this.
[jira] [Created] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler
Wangda Tan created YARN-8459:
---------------------------------

Summary: Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler
Key: YARN-8459
URL: https://issues.apache.org/jira/browse/YARN-8459
Project: Hadoop YARN
Issue Type: Bug
Reporter: Wangda Tan
Assignee: Wangda Tan

Thanks [~gopalv] for reporting this issue.

In async mode, the capacity scheduler can allocate/reserve containers on a node/app while that node/app is being removed ({{doneApplicationAttempt}}/{{removeNode}}). This causes issues, for example:

a. A container for app_1 is reserved on node_x.
b. At the same time, app_1 is being removed.
c. The reserve-on-node operation finishes after app_1 is removed ({{doneApplicationAttempt}}).

For all future runs, node_x is completely blocked by the invalid reservation. It keeps reporting "Trying to schedule for a finished app, please double check" for node_x. We need a fix to make sure this won't happen.
[jira] [Created] (YARN-8417) Should skip passing HDFS_HOME, HADOOP_CONF_DIR, JAVA_HOME, etc. to Docker container.
Wangda Tan created YARN-8417:
---------------------------------

Summary: Should skip passing HDFS_HOME, HADOOP_CONF_DIR, JAVA_HOME, etc. to Docker container.
Key: YARN-8417
URL: https://issues.apache.org/jira/browse/YARN-8417
Project: Hadoop YARN
Issue Type: Bug
Reporter: Wangda Tan

Currently, the YARN NM passes the JAVA_HOME, HDFS_HOME, and CLASSPATH environments before launching a Docker container, no matter whether ENTRY_POINT is used or not. This overwrites environments defined inside the Dockerfile (via {{ENV}}).

For a Docker container, it actually doesn't make sense to pass JAVA_HOME, HDFS_HOME, etc., because inside the docker image there is a separate Java/Hadoop installation, or one mounted to exactly the same directory as the host machine.
[jira] [Resolved] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan resolved YARN-8220.
------------------------------
    Resolution: Later

> Running Tensorflow on YARN with GPU and Docker - Examples
> ----------------------------------------------------------
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn-native-services
> Reporter: Sunil Govindan
> Assignee: Sunil Govindan
> Priority: Critical
> Attachments: YARN-8220.001.patch
>
> Tensorflow can run on YARN and leverage YARN's distributed features. This spec file will help to run Tensorflow on YARN with GPU/docker.
[jira] [Created] (YARN-8379) Add an option to allow Capacity Scheduler preemption to balance satisfied queues
Wangda Tan created YARN-8379:
---------------------------------

Summary: Add an option to allow Capacity Scheduler preemption to balance satisfied queues
Key: YARN-8379
URL: https://issues.apache.org/jira/browse/YARN-8379
Project: Hadoop YARN
Issue Type: Bug
Reporter: Wangda Tan
Assignee: Wangda Tan

The existing capacity scheduler only supports preemption for an underutilized queue to reach its guaranteed resource. In addition to that, there's a requirement to get a better balance between queues when all of them have reached their guaranteed resource but hold different shares beyond it.

An example: 3 queues with capacities queue_a = 30%, queue_b = 30%, queue_c = 40%. At time T, queue_a is using 30% and queue_b is using 70%. Existing scheduler preemption won't happen, but this is unfair to queue_a since it has the same guaranteed resources as queue_b.

Before YARN-5864, the capacity scheduler did additional preemption to balance queues. We changed the logic since it could preempt too many containers between queues when all queues are satisfied.
[jira] [Created] (YARN-8343) YARN should have ability to run images only from a whitelist of docker registries
Wangda Tan created YARN-8343:
---------------------------------

Summary: YARN should have ability to run images only from a whitelist of docker registries
Key: YARN-8343
URL: https://issues.apache.org/jira/browse/YARN-8343
Project: Hadoop YARN
Issue Type: Bug
Reporter: Wangda Tan

This is a superset of docker.privileged-containers.registries: the admin can specify a whitelist of registries, and all images from registries outside the whitelist will be rejected.
[jira] [Created] (YARN-8342) Using docker image from a non-privileged registry, the launch_command is not honored
Wangda Tan created YARN-8342:
---------------------------------

Summary: Using docker image from a non-privileged registry, the launch_command is not honored
Key: YARN-8342
URL: https://issues.apache.org/jira/browse/YARN-8342
Project: Hadoop YARN
Issue Type: Bug
Reporter: Wangda Tan

During testing of the Docker feature, I found that if a container comes from a non-privileged docker registry, the specified launch command is ignored. The container succeeds without any log, which is very confusing to end users. This behavior is also inconsistent with containers from privileged docker registries.

cc: [~eyang], [~shaneku...@gmail.com], [~ebadger], [~jlowe]
[jira] [Created] (YARN-8340) Capacity Scheduler Intra Queue Preemption Should Work When 3rd or more resources enabled.
Wangda Tan created YARN-8340:
---------------------------------

Summary: Capacity Scheduler Intra Queue Preemption Should Work When 3rd or more resources enabled.
Key: YARN-8340
URL: https://issues.apache.org/jira/browse/YARN-8340
Project: Hadoop YARN
Issue Type: Bug
Reporter: Wangda Tan

Refer to the comment from [~eepayne] and the discussion below it for details:
https://issues.apache.org/jira/browse/YARN-8292?focusedCommentId=16482689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16482689
[jira] [Resolved] (YARN-8272) Several items are missing from Hadoop 3.1.0 documentation
[ https://issues.apache.org/jira/browse/YARN-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan resolved YARN-8272.
------------------------------
    Resolution: Duplicate

Closing as dup of HADOOP-15374.

> Several items are missing from Hadoop 3.1.0 documentation
> ----------------------------------------------------------
>
> Key: YARN-8272
> URL: https://issues.apache.org/jira/browse/YARN-8272
> Project: Hadoop YARN
> Issue Type: Bug
> Components: documentation
> Reporter: Wangda Tan
> Priority: Blocker
>
> From what I can see there are several missing items like GPU / FPGA:
> http://hadoop.apache.org/docs/current/
> We should add them to hadoop-project/src/site/site.xml in the next release.
[jira] [Created] (YARN-8272) Several items are missing from Hadoop 3.1.0 documentation
Wangda Tan created YARN-8272:
---------------------------------

Summary: Several items are missing from Hadoop 3.1.0 documentation
Key: YARN-8272
URL: https://issues.apache.org/jira/browse/YARN-8272
Project: Hadoop YARN
Issue Type: Bug
Components: documentation
Reporter: Wangda Tan

From what I can see there are several missing items like GPU / FPGA:
http://hadoop.apache.org/docs/current/

We should add them to hadoop-project/src/site/site.xml in the next release.
[jira] [Created] (YARN-8257) Native service should automatically adding escapes for environment/launch cmd before sending to YARN
Wangda Tan created YARN-8257:
---------------------------------

Summary: Native service should automatically adding escapes for environment/launch cmd before sending to YARN
Key: YARN-8257
URL: https://issues.apache.org/jira/browse/YARN-8257
Project: Hadoop YARN
Issue Type: Bug
Components: yarn-native-services
Reporter: Wangda Tan
Assignee: Gour Saha

Noticed this issue while using native service: when a string for an environment value or launch command contains chars like ", \, `, it needs to be escaped twice.

The first escape is required by the JSON spec, because JSON strings are delimited by double quotes. The second comes from container launch; what we do for the command line is (ContainerLaunch.java):
{code:java}
line("exec /bin/bash -c \"", StringUtils.join(" ", command), "\"");
{code}
And for environment:
{code:java}
line("export ", key, "=\"", value, "\"");
{code}

An example of launch_command:
{code:java}
"launch_command": "export CLASSPATH=\\`\\$HADOOP_HDFS_HOME/bin/hadoop classpath --glob\\`"
{code}
And an example of environment:
{code:java}
"TF_CONFIG" : "{\\\"cluster\\\": {\\\"master\\\": [\\\"master-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"ps\\\": [\\\"ps-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"worker\\\": [\\\"worker-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"]}, \\\"task\\\": {\\\"type\\\":\\\"${COMPONENT_NAME}\\\", \\\"index\\\":${COMPONENT_ID}}, \\\"environment\\\":\\\"cloud\\\"}",
{code}

To improve usability, I think we should auto-escape the input string once. For example, if the user specified:
{code}
"TF_CONFIG": "\"key\""
{code}
We will automatically escape it to:
{code}
"TF_CONFIG": \\\"key\\\"
{code}
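A minimal sketch of what the automatic escaping could look like, assuming the target is the double-quoted bash string that ContainerLaunch emits ({{export K="V"}}); the helper name and the exact character set are illustrative, not the actual fix:

{code:java}
// Escape characters bash still interprets inside a double-quoted string:
// quote, backtick, backslash and dollar.
static String escapeForDoubleQuotedBash(String value) {
  StringBuilder sb = new StringBuilder(value.length());
  for (char c : value.toCharArray()) {
    if (c == '"' || c == '`' || c == '\\' || c == '$') {
      sb.append('\\');
    }
    sb.append(c);
  }
  return sb.toString();
}
{code}

Applying one pass of this when building the launch script would leave users responsible only for the JSON-level escaping in the service spec.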
[jira] [Created] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler
Wangda Tan created YARN-8149:
---------------------------------

Summary: Revisit behavior of Re-Reservation in Capacity Scheduler
Key: YARN-8149
URL: https://issues.apache.org/jira/browse/YARN-8149
Project: Hadoop YARN
Issue Type: Bug
Reporter: Wangda Tan

Frankly speaking, I'm not sure why we need the re-reservation. The formula is not that easy to understand.

Inside {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}}:
{code:java}
starvation = re-reservation / (#reserved-container *
    (1 - min(requested-resource / max-alloc,
             (max-alloc - min-alloc) / max-alloc)))
should_allocate = starvation + requiredContainers - reservedContainers > 0
{code}
I think we should be able to remove the starvation computation; just checking requiredContainers > reservedContainers should be enough (see the sketch below).

In a large cluster, we can easily overflow re-reservation to MAX_INT; see YARN-7636.
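The simplified check suggested above would reduce to (a sketch, not the committed change):

{code:java}
// Allocate or reserve a new container only while there is unmet demand;
// no starvation term, so re-reservation counts can no longer overflow it.
boolean shouldAllocOrReserveNewContainer =
    requiredContainers > reservedContainers;
{code}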
[jira] [Created] (YARN-8141) YARN Native Service: Respect YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec
Wangda Tan created YARN-8141:
---------------------------------

Summary: YARN Native Service: Respect YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec
Key: YARN-8141
URL: https://issues.apache.org/jira/browse/YARN-8141
Project: Hadoop YARN
Issue Type: Bug
Components: yarn-native-services
Reporter: Wangda Tan

The existing YARN native service overwrites YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS regardless of whether the user specified it in the service spec. It is important to allow users to mount local folders like /etc/passwd, etc.

The following logic (inside AbstractLauncher.java) overwrites the YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS environment:
{code:java}
StringBuilder sb = new StringBuilder();
for (Entry<String, String> mount : mountPaths.entrySet()) {
  if (sb.length() > 0) {
    sb.append(",");
  }
  sb.append(mount.getKey());
  sb.append(":");
  sb.append(mount.getValue());
}
env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS", sb.toString());
{code}
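A sketch of a possible fix, reusing {{env}} and {{sb}} from the snippet above: read any user-supplied value first and keep it ahead of the generated mounts instead of overwriting it.

{code:java}
String key = "YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS";
String userMounts = env.get(key);
if (userMounts != null && !userMounts.isEmpty()) {
  // Keep the mounts from the service spec ahead of the generated ones.
  sb.insert(0, userMounts + ",");
}
env.put(key, sb.toString());
{code}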
[jira] [Created] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop
Wangda Tan created YARN-8135:
---------------------------------

Summary: Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop
Key: YARN-8135
URL: https://issues.apache.org/jira/browse/YARN-8135
Project: Hadoop YARN
Issue Type: New Feature
Reporter: Wangda Tan
Assignee: Wangda Tan
Attachments: image-2018-04-09-14-35-16-778.png

*Goals:*
- Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs on YARN.
- Allow jobs easy access to data/models in HDFS and other storages.
- Can launch services to serve Tensorflow/MXNet models.
- Support running distributed Tensorflow jobs with simple configs.
- Support running user-specified Docker images.
- Support specifying GPU and other resources.
- Support launching tensorboard if the user specified it.
- Support customized DNS names for roles (like tensorboard.$user.$domain:6006).

*Why this name?*
- Because Submarine is the only vehicle that can take humans to deep places. B-)

Compared to other projects: !image-2018-04-09-14-35-16-778.png!

*Notes:*
* GPU isolation of the XLearning project is achieved by a patched YARN, which is different from the community's GPU isolation solution.
** XLearning needs a few modifications to read ClusterSpec from env.

*References:*
- TensorflowOnSpark (Yahoo): https://github.com/yahoo/TensorFlowOnSpark
- TensorFlowOnYARN (Intel): https://github.com/Intel-bigdata/TensorFlowOnYARN
- Spark Deep Learning (Databricks): https://github.com/databricks/spark-deep-learning
- XLearning (Qihoo360): https://github.com/Qihoo360/XLearning
- Kubeflow (Google): https://github.com/kubeflow/kubeflow
[jira] [Created] (YARN-8109) Resource Manager WebApps fails to start due to ConcurrentModificationException
Wangda Tan created YARN-8109: Summary: Resource Manager WebApps fails to start due to ConcurrentModificationException Key: YARN-8109 URL: https://issues.apache.org/jira/browse/YARN-8109 Project: Hadoop YARN Issue Type: Task Reporter: Wangda Tan
{code}
2018-03-22 04:57:39,289 INFO resourcemanager.ResourceTrackerService (ResourceTrackerService.java:nodeHeartbeat(497)) - Node not found resyncing ctr-e138-1518143905142-129550-01-36.hwx.site:25454
2018-03-22 04:57:39,294 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state STARTED; cause: java.util.ConcurrentModificationException
java.util.ConcurrentModificationException
  at java.util.Hashtable$Enumerator.next(Hashtable.java:1378)
  at org.apache.hadoop.conf.Configuration.iterator(Configuration.java:2564)
  at org.apache.hadoop.conf.Configuration.getPropsWithPrefix(Configuration.java:2583)
  at org.apache.hadoop.yarn.webapp.WebApps$Builder.getConfigParameters(WebApps.java:386)
  at org.apache.hadoop.yarn.webapp.WebApps$Builder.build(WebApps.java:334)
  at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:395)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:1049)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1152)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1293)
2018-03-22 04:57:39,296 INFO ipc.Server (Server.java:stop(2752)) - Stopping server on 8050
2018-03-22 04:57:39,300 INFO ipc.Server (Server.java:run(932)) - Stopping IPC Server listener on 8050
2018-03-22 04:57:39,301 INFO ipc.Server (Server.java:run(1069)) - Stopping IPC Server Responder
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
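The stack trace points at Configuration.getPropsWithPrefix iterating the underlying Hashtable-backed properties while another thread mutates them. A minimal standalone reproduction of that failure mode (this is just the java.util behavior, not RM code; whether it triggers on a given run is timing-dependent):
{code:java}
import java.util.Hashtable;
import java.util.Map;

public final class CmeDemo {
  public static void main(String[] args) throws Exception {
    Map<String, String> props = new Hashtable<>();
    for (int i = 0; i < 100_000; i++) {
      props.put("key." + i, "v");
    }
    Thread writer = new Thread(() -> {
      for (int i = 0; i < 100_000; i++) {
        props.put("extra." + i, "v"); // concurrent modification
      }
    });
    writer.start();
    try {
      // Hashtable's fail-fast iterator may throw mid-iteration, exactly as
      // Hashtable$Enumerator.next does in the stack trace above.
      for (Map.Entry<String, String> e : props.entrySet()) {
        e.getKey();
      }
      System.out.println("No conflict this run; timing dependent.");
    } catch (java.util.ConcurrentModificationException ex) {
      System.out.println("Reproduced: " + ex);
    }
    writer.join();
  }
}
{code}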
[jira] [Created] (YARN-8091) Revisit checkUserAccessToQueue RM REST API
Wangda Tan created YARN-8091: Summary: Revisit checkUserAccessToQueue RM REST API Key: YARN-8091 URL: https://issues.apache.org/jira/browse/YARN-8091 Project: Hadoop YARN Issue Type: Task Reporter: Wangda Tan Assignee: Wangda Tan As suggested offline by [~sershe], the current design of checkUserAccessToQueue mixes config-related issues (like the caller not having access to the URL) and user-facing output (like the requested user not being permitted to access the queue) in the same code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-5881) Enable configuration of queue capacity in terms of absolute resources
[ https://issues.apache.org/jira/browse/YARN-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-5881. -- Resolution: Done > Enable configuration of queue capacity in terms of absolute resources > - > > Key: YARN-5881 > URL: https://issues.apache.org/jira/browse/YARN-5881 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Sean Po >Assignee: Sunil G >Priority: Major > Fix For: 3.1.0 > > Attachments: > YARN-5881.Support.Absolute.Min.Max.Resource.In.Capacity.Scheduler.design-doc.v1.pdf, > YARN-5881.v0.patch, YARN-5881.v1.patch > > > Currently, Yarn RM supports the configuration of queue capacity in terms of a > proportion to cluster capacity. In the context of Yarn being used as a public > cloud service, it makes more sense if queues can be configured absolutely. > This will allow administrators to set usage limits more concretely and > simplify customer expectations for cluster allocation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8084) Yarn native service rename for easier development?
Wangda Tan created YARN-8084: Summary: Yarn native service rename for easier development? Key: YARN-8084 URL: https://issues.apache.org/jira/browse/YARN-8084 Project: Hadoop YARN Issue Type: Task Environment: There are a couple of classes with the same name in YARN native service, such as: 1) ...service.component.Component and api.records.Component. This makes development in an IDE harder, since the class-name clash forces the use of fully qualified class names. Similarly in the API definition: ...service.api.records: Container/ContainerState/Resource/ResourceInformation. How about renaming them to: ServiceContainer/ServiceContainerState/ServiceResource/ServiceResourceInformation? Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8080) YARN native service should support component restart policy
Wangda Tan created YARN-8080: Summary: YARN native service should support component restart policy Key: YARN-8080 URL: https://issues.apache.org/jira/browse/YARN-8080 Project: Hadoop YARN Issue Type: Task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-8080.001.patch The existing native service assumes the service is long-running and never finishes; containers are restarted even if exit code == 0. To support broader use cases, we need to allow users to specify a restart policy per component. Proposed policies (see the sketch below): 1) Always: containers are always restarted by the framework regardless of container exit status. This is the existing/default behavior. 2) Never: do not restart containers in any case after a container finishes. This supports job-like workloads (for example a Tensorflow training job): if a task exits with code == 0, we should not restart it. This can be used by services which are not restartable/recoverable. 3) On-failure: similar to the above, but only restart tasks with exit code != 0. Behaviors after a component *instance* finalizes (Succeeded or Failed when restart_policy != ALWAYS): 1) For a single component with a single instance: complete the service. 2) For a single component with multiple instances: other running instances from the same component won't be affected by the finalized component instance. The service will be terminated once all instances finalize. 3) For multiple components: the service will be terminated once all components finalize. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
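A minimal sketch of the proposed restart-policy decision; the enum and method names are illustrative, not the final YARN native service API:
{code:java}
public final class RestartPolicyDemo {

  enum RestartPolicy { ALWAYS, NEVER, ON_FAILURE }

  static boolean shouldRestart(RestartPolicy policy, int exitCode) {
    switch (policy) {
      case ALWAYS:
        return true;              // existing/default behavior
      case NEVER:
        return false;             // job-like workloads: never restart
      case ON_FAILURE:
        return exitCode != 0;     // restart only failed instances
      default:
        throw new IllegalStateException("Unknown policy: " + policy);
    }
  }

  public static void main(String[] args) {
    System.out.println(shouldRestart(RestartPolicy.ON_FAILURE, 0)); // false
    System.out.println(shouldRestart(RestartPolicy.ON_FAILURE, 1)); // true
  }
}
{code}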
[jira] [Created] (YARN-8079) YARN native service should respect source file of ConfigFile inside Service/Component spec
Wangda Tan created YARN-8079: Summary: YARN native service should respect source file of ConfigFile inside Service/Component spec Key: YARN-8079 URL: https://issues.apache.org/jira/browse/YARN-8079 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Assignee: Wangda Tan Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly read srcFile; instead it always constructs {{remoteFile}} using componentDir and the fileName of {{destFile}}: {code} Path remoteFile = new Path(compInstanceDir, fileName); {code} To me this is a common use case: services have files in HDFS that need to be localized when components get launched. (For example, if we want to serve a Tensorflow model, we need to localize the model (typically not huge, less than a GB) to local disk. Otherwise the launched docker container has to access HDFS.) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
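A minimal sketch of the intended resolution order, assuming srcFile may be unset in the spec; the helper name is illustrative, not the actual ProviderUtils code:
{code:java}
import org.apache.hadoop.fs.Path;

final class RemoteFileResolver {
  // Prefer the user-provided srcFile; fall back to the per-instance
  // directory only when the spec does not set one.
  static Path resolveRemoteFile(String srcFile, Path compInstanceDir,
      String fileName) {
    if (srcFile != null && !srcFile.isEmpty()) {
      return new Path(srcFile);                   // respect the spec
    }
    return new Path(compInstanceDir, fileName);   // current behavior
  }
}
{code}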
[jira] [Resolved] (YARN-5983) [Umbrella] Support for FPGA as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-5983. -- Resolution: Done Fix Version/s: 3.1.0 Since this feature works end to end and landed in 3.1.0, closing the umbrella as done. > [Umbrella] Support for FPGA as a Resource in YARN > - > > Key: YARN-5983 > URL: https://issues.apache.org/jira/browse/YARN-5983 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Fix For: 3.1.0 > > Attachments: YARN-5983-Support-FPGA-resource-on-NM-side_v1.pdf, > YARN-5983-implementation-notes.pdf, YARN-5983_end-to-end_test_report.pdf > > > As various big data workloads run on YARN, CPU will eventually no longer scale and heterogeneous systems will become more important. ML/DL is a rising star in recent years; applications focused on these areas have to utilize GPUs or FPGAs to boost performance. Hardware vendors such as Intel are also investing in such hardware. It is most likely that FPGAs will become as popular in data centers as CPUs in the near future. > So it would be great for YARN, as a resource managing and scheduling system, to evolve to support this. This JIRA proposes making FPGA a first-class citizen. The changes roughly include: > 1. FPGA resource detection and heartbeat > 2. Scheduler changes (YARN-3926 involved) > 3. FPGA-related preparation and isolation before launching containers > We know that YARN-3926 is trying to extend the current resource model, but we can still leave some FPGA-related discussion here -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN
[ https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-6223. -- Resolution: Done Fix Version/s: 3.1.0 Closing as done since all sub tasks are done. > [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation > on YARN > > > Key: YARN-6223 > URL: https://issues.apache.org/jira/browse/YARN-6223 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Fix For: 3.1.0 > > Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, > YARN-6223.wip.1.patch, YARN-6223.wip.2.patch, YARN-6223.wip.3.patch > > > A variety of workloads are moving to YARN, including machine learning / deep learning, which can be sped up by leveraging GPU computation power. Workloads should be able to request GPUs from YARN as simply as CPU and memory. > *To make a complete GPU story, we should support the following pieces:* > 1) GPU discovery/configuration: admins can either configure GPU resources and architectures on each node, or, more advanced, the NodeManager can automatically discover GPU resources and architectures and report them to the ResourceManager. > 2) GPU scheduling: the YARN scheduler should account for GPU as a resource type just like CPU and memory. > 3) GPU isolation/monitoring: once a task is launched with GPU resources, the NodeManager should properly isolate and monitor the task's resource usage. > For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced an extensible framework to support isolation for different resource types and different runtimes. > *Related JIRAs:* > There are a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but different solutions: > For scheduling: > - YARN-4122/YARN-5517 both add a new GPU resource type to the Resource protocol instead of leveraging YARN-3926. > For isolation: > - YARN-4122 proposed to use CGroups to do isolation, which cannot solve the problems listed at > https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as minor device number mapping, loading the nvidia_uvm module, and mismatched CUDA/driver versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-5326) Support for recurring reservations in the YARN ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-5326. -- Resolution: Done > Support for recurring reservations in the YARN ReservationSystem > > > Key: YARN-5326 > URL: https://issues.apache.org/jira/browse/YARN-5326 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Subru Krishnan >Assignee: Carlo Curino >Priority: Major > Attachments: SupportRecurringReservationsInRayon.pdf > > > YARN-1051 introduced a ReservationSytem that enables the YARN RM to handle > time explicitly, i.e. users can now "reserve" capacity ahead of time which is > predictably allocated to them. Most SLA jobs/workflows are recurring so they > need the same resources periodically. With the current implementation, users > will have to make individual reservations for each run. This is an umbrella > JIRA to enhance the reservation system by adding native support for recurring > reservations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-7303) Merge YARN-5734 branch to trunk branch
[ https://issues.apache.org/jira/browse/YARN-7303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-7303. -- Resolution: Done Closing as "done" since there's no patch committed with the Jira. > Merge YARN-5734 branch to trunk branch > -- > > Key: YARN-7303 > URL: https://issues.apache.org/jira/browse/YARN-7303 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xuan Gong >Assignee: Xuan Gong >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-7873) Revert YARN-6078
[ https://issues.apache.org/jira/browse/YARN-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-7873. -- Resolution: Invalid > Revert YARN-6078 > > > Key: YARN-7873 > URL: https://issues.apache.org/jira/browse/YARN-7873 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Billie Rinaldi >Assignee: Billie Rinaldi >Priority: Blocker > Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.1 > > > I think we should revert YARN-6078, since it is not working as intended. The > NM does not have permission to destroy the process of the ContainerLocalizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8046) Revisit RMWebServiceProtocol implementations
Wangda Tan created YARN-8046: Summary: Revisit RMWebServiceProtocol implementations Key: YARN-8046 URL: https://issues.apache.org/jira/browse/YARN-8046 Project: Hadoop YARN Issue Type: Improvement Reporter: Wangda Tan I recently found that new changes to RMWebServiceProtocol make adding any new REST API pretty hard. There are at least 6 classes that need to be implemented: 1. {{MockRESTRequestInterceptor}} 2. {{FederationInterceptorREST}} 3. {{DefaultRequestInterceptorREST}} 4. {{PassThroughRESTRequestInterceptor}} 5. {{RouterWebServices}} 6. {{RMWebServices}} Different classes' implementations have different styles; simple copy-paste is not enough. For example, {{DefaultRequestInterceptorREST}} uses {{RouterWebServiceUtil.genericForward}} to pass all parameters, which requires understanding how each REST API works and reconstructing a URL, which can easily cause issues. I think we should revisit these APIs and make sure a new API can be added to the REST interface as easily as before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8028) Support authorizeUserAccessToQueue in RMWebServices
Wangda Tan created YARN-8028: Summary: Support authorizeUserAccessToQueue in RMWebServices Key: YARN-8028 URL: https://issues.apache.org/jira/browse/YARN-8028 Project: Hadoop YARN Issue Type: Improvement Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7920) Cleanup configuration of PlacementConstraints
Wangda Tan created YARN-7920: Summary: Cleanup configuration of PlacementConstraints Key: YARN-7920 URL: https://issues.apache.org/jira/browse/YARN-7920 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Currently it is very confusing to have the two configs in two different files (yarn-site.xml and capacity-scheduler.xml). Maybe a better approach is: delete scheduling-request.allowed in CS, and update the placement-constraints configs in yarn-site.xml a bit: - Remove placement-constraints.enabled, and add a new placement-constraints.handler, whose default is none; other acceptable values are a. external-processor (since "algorithm" is too generic to me), b. scheduler. - And add a new PlacementProcessor that just passes SchedulingRequest to the scheduler without any modifications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-7854) Attach prefixes to different type of node attributes
[ https://issues.apache.org/jira/browse/YARN-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-7854. -- Resolution: Later > Attach prefixes to different type of node attributes > > > Key: YARN-7854 > URL: https://issues.apache.org/jira/browse/YARN-7854 > Project: Hadoop YARN > Issue Type: Sub-task > Components: RM >Reporter: Weiwei Yang >Assignee: LiangYe >Priority: Major > > There are multiple types of node attributes depending on which source they come from, including: > # Centralized: attributes set by users (admin or normal users) > # Distributed: attributes collected by a certain attribute provider on each > NM > # System: some built-in attributes in yarn, set by yarn internal components, > e.g. scheduler > To better manage these attributes, we introduce the prefix (namespace) concept to an attribute. This Jira is opened to figure out how to attach prefixes (automatically/implicitly or explicitly) to different types of attributes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-7759) [UI2]GPU chart shows as "Available: 0" even though GPU is available
[ https://issues.apache.org/jira/browse/YARN-7759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-7759. -- Resolution: Duplicate Duplicated by YARN-7817 > [UI2]GPU chart shows as "Available: 0" even though GPU is available > --- > > Key: YARN-7759 > URL: https://issues.apache.org/jira/browse/YARN-7759 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sumana Sathish >Assignee: Vasudevan Skm >Priority: Major > > The GPU chart on the Node Manager page shows zero GPUs available even though GPUs are present. Only when we click the 'GPU Information' chart does it show correct GPU information -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7817) Add Resource reference to RM's NodeInfo object so REST API can get non memory/vcore resource usages.
Wangda Tan created YARN-7817: Summary: Add Resource reference to RM's NodeInfo object so REST API can get non memory/vcore resource usages. Key: YARN-7817 URL: https://issues.apache.org/jira/browse/YARN-7817 Project: Hadoop YARN Issue Type: Sub-task Reporter: Sumana Sathish Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7807) By default do intra-app anti-affinity for scheduling request inside app placement allocator
Wangda Tan created YARN-7807: Summary: By default do intra-app anti-affinity for scheduling request inside app placement allocator Key: YARN-7807 URL: https://issues.apache.org/jira/browse/YARN-7807 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan See discussion on: https://issues.apache.org/jira/browse/YARN-7791?focusedCommentId=16336857&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16336857 We need to make changes to AppPlacementAllocator to treat default target allocation tags as intra-app. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7801) AmFilterInitializer should addFilter after filling all parameters
Wangda Tan created YARN-7801: Summary: AmFilterInitializer should addFilter after filling all parameters Key: YARN-7801 URL: https://issues.apache.org/jira/browse/YARN-7801 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Assignee: Wangda Tan The existing AmFilterInitializer cannot successfully pass the RM_HA_URLS parameter to AmIpFilter because of this issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures
Wangda Tan created YARN-7790: Summary: Improve Capacity Scheduler Async Scheduling to better handle node failures Key: YARN-7790 URL: https://issues.apache.org/jira/browse/YARN-7790 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Assignee: Wangda Tan This is not a new issue, but async scheduling makes it worse: in sync scheduling, if an AM container is allocated to a node, the node has just heartbeated to the RM, and the allocation is sent back to the NM in the same heartbeat response. It is still possible that the NM crashes right after the heartbeat, which causes the AM to hang for 10 minutes, but that is relatively rare. In the async scheduling world, multiple AM containers can be placed on a problematic NM, which could cause applications to hang for a long time. As discussed with [~sunilg], we need at least two fixes when async scheduling is enabled (see the sketch below): 1) Skip nodes which missed X node heartbeats. 2) Kill AM containers in ALLOCATED state on a node which missed Y node heartbeats. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
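A minimal sketch of the two staleness checks, assuming the thresholds X and Y are expressed as numbers of missed heartbeat intervals; all names here are hypothetical:
{code:java}
final class NodeHealthChecks {

  private static boolean missedHeartbeats(long lastHeartbeatMs, long nowMs,
      long heartbeatIntervalMs, int missedThreshold) {
    return nowMs - lastHeartbeatMs > (long) missedThreshold * heartbeatIntervalMs;
  }

  // 1) Skip a node for async allocation once it missed X heartbeats.
  static boolean shouldSkipNode(long lastHeartbeatMs, long nowMs,
      long heartbeatIntervalMs, int x) {
    return missedHeartbeats(lastHeartbeatMs, nowMs, heartbeatIntervalMs, x);
  }

  // 2) Kill an AM container still in ALLOCATED state on a node that missed
  //    Y heartbeats (Y > X), so the application does not hang.
  static boolean shouldKillAllocatedAm(long lastHeartbeatMs, long nowMs,
      long heartbeatIntervalMs, int y) {
    return missedHeartbeats(lastHeartbeatMs, nowMs, heartbeatIntervalMs, y);
  }
}
{code}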
[jira] [Created] (YARN-7789) Should fail RM if 3rd resource type is configured but RM uses DefaultResourceCalculator
Wangda Tan created YARN-7789: Summary: Should fail RM if 3rd resource type is configured but RM uses DefaultResourceCalculator Key: YARN-7789 URL: https://issues.apache.org/jira/browse/YARN-7789 Project: Hadoop YARN Issue Type: Sub-task Environment: We may need to revisit this behavior: currently, the RM doesn't fail if a 3rd resource type is configured; allocated containers will automatically be assigned the minimum allocation for all resource types except memory, which makes troubleshooting really hard. I would prefer to fail the RM if a 3rd or additional resource type is configured inside resource-types.xml. Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7763) Refactoring PlacementConstraintUtils APIs so PlacementProcessor/Scheduler can use the same API and implementation
Wangda Tan created YARN-7763: Summary: Refactoring PlacementConstraintUtils APIs so PlacementProcessor/Scheduler can use the same API and implementation Key: YARN-7763 URL: https://issues.apache.org/jira/browse/YARN-7763 Project: Hadoop YARN Issue Type: Sub-task Environment: As I mentioned on YARN-6599, we will add SchedulingRequest as part of the PlacementConstraintUtil method, and both the processor and scheduler implementations will use the same logic. The logic looks like:
{code:java}
PlacementConstraint pc = schedulingRequest.getPlacementConstraint();
if (pc == null) {
  pc = PlacementConstraintMgr.getPlacementConstraint(
      schedulingRequest.getAllocationTags());
}
// Do placement constraint match ...
{code}
Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7739) Revisit scheduler resource normalization behavior for max allocation
Wangda Tan created YARN-7739: Summary: Revisit scheduler resource normalization behavior for max allocation Key: YARN-7739 URL: https://issues.apache.org/jira/browse/YARN-7739 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Priority: Critical Currently, the YARN scheduler normalizes requested resources based on the maximum allocation, which is derived from the configured maximum allocation and the maximum registered node resources. Basically, the scheduler silently caps the asked resource at the maximum allocation. This can cause issues for applications. For example, a Spark job needs 12 GB of memory to run, but the registered NMs in the cluster have at most 8 GB of memory per node, so the scheduler allocates an 8 GB memory container to the requesting application. Once the app receives containers from the RM, if it doesn't double-check the allocated resources, this leads to OOMs that are hard to debug, because the scheduler silently caps the maximum allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
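A minimal sketch of the silent-cap behavior described above, using the numbers from the Spark example; method and parameter names are illustrative:
{code:java}
final class NormalizationDemo {
  static long normalizeMemoryMb(long requestedMb, long configuredMaxMb,
      long maxRegisteredNodeMb) {
    long effectiveMax = Math.min(configuredMaxMb, maxRegisteredNodeMb);
    return Math.min(requestedMb, effectiveMax); // the silent cap
  }

  public static void main(String[] args) {
    // The app asks for 12 GB, but the largest registered NM has 8 GB:
    long allocated = normalizeMemoryMb(12 * 1024, 64 * 1024, 8 * 1024);
    System.out.println(allocated + " MB"); // 8192 MB, with no warning
  }
}
{code}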
[jira] [Created] (YARN-7723) Avoid using docker volume --format option to be compatible with older docker releases
Wangda Tan created YARN-7723: Summary: Avoid using docker volume --format option to be compatible with older docker releases Key: YARN-7723 URL: https://issues.apache.org/jira/browse/YARN-7723 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7718) DistributedShell failed to specify resource other than memory/vcores from container_resources
Wangda Tan created YARN-7718: Summary: DistributedShell failed to specify resource other than memory/vcores from container_resources Key: YARN-7718 URL: https://issues.apache.org/jira/browse/YARN-7718 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Priority: Critical After YARN-7242, there is a bug in reading resource values other than memory/vcores. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7709) Remove SELF from TargetExpression type.
Wangda Tan created YARN-7709: Summary: Remove SELF from TargetExpression type. Key: YARN-7709 URL: https://issues.apache.org/jira/browse/YARN-7709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Priority: Blocker As mentioned by [~asuresh], SELF means the target allocation tag is the same as the allocation tag of the scheduling request itself. So this is not a new type for sure; it is still the ALLOCATION_TAG type. If we really want this functionality, we can build it in PlacementConstraints, but I'm doubtful about this since copying allocation tags from the source is trivial work. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-7416) Use "docker volume inspect" to make sure that volumes for GPU drivers/libs are properly mounted.
[ https://issues.apache.org/jira/browse/YARN-7416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-7416. -- Resolution: Duplicate Duplicated by YARN-7487. > Use "docker volume inspect" to make sure that volumes for GPU drivers/libs > are properly mounted. > - > > Key: YARN-7416 > URL: https://issues.apache.org/jira/browse/YARN-7416 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-7509) AsyncScheduleThread and ResourceCommitterService are still running after RM is transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-7509. -- Resolution: Fixed Fix Version/s: (was: 3.0.1) 3.0.0 > AsyncScheduleThread and ResourceCommitterService are still running after RM > is transitioned to standby > -- > > Key: YARN-7509 > URL: https://issues.apache.org/jira/browse/YARN-7509 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Critical > Fix For: 3.0.0, 3.1.0, 2.9.1 > > Attachments: YARN-7509.001.patch > > > After the RM is transitioned to standby, AsyncScheduleThread and > ResourceCommitterService will receive an interrupt signal. When a thread is sleeping, it will ignore the interrupt signal, since InterruptedException is caught inside and the interrupt flag is cleared. > For AsyncScheduleThread, InterruptedException is caught and ignored in > CapacityScheduler#schedule. > For ResourceCommitterService, InterruptedException is caught and ignored in ResourceCommitterService#run. > We should let the interrupt signal out and make these threads exit. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7555) Support multiple resource types in YARN native services
Wangda Tan created YARN-7555: Summary: Support multiple resource types in YARN native services Key: YARN-7555 URL: https://issues.apache.org/jira/browse/YARN-7555 Project: Hadoop YARN Issue Type: Bug Components: yarn-native-services Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical We need to support specifying multiple resource types, in addition to memory/cpu, in YARN native services -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7522) Add application tags manager implementation
Wangda Tan created YARN-7522: Summary: Add application tags manager implementation Key: YARN-7522 URL: https://issues.apache.org/jira/browse/YARN-7522 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan This is different from YARN-6596: YARN-6596 targets adding a constraint manager to store intra/inter application placement constraints, while this JIRA targets storing maps between container-tags/applications and nodes. This will be required by the affinity/anti-affinity implementation and cardinality. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7487) Make sure volume including GPU base libraries exists after being created by plugin
Wangda Tan created YARN-7487: Summary: Make sure volume including GPU base libraries exists after being created by plugin Key: YARN-7487 URL: https://issues.apache.org/jira/browse/YARN-7487 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan YARN-7224 creates a docker volume that includes GPU base libraries when launching a docker container which needs GPU. This JIRA will add necessary checks to make sure the docker volume exists before launching the container, to reduce debugging effort if the container fails. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7457) Delay scheduling should be an individual policy instead of part of scheduler implementation
Wangda Tan created YARN-7457: Summary: Delay scheduling should be an individual policy instead of part of scheduler implementation Key: YARN-7457 URL: https://issues.apache.org/jira/browse/YARN-7457 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Currently, different schedulers have slightly different delay scheduling implementations. Ideally we should make delay scheduling independent from the scheduler implementation. Benefits of doing this: 1) Applications can choose which delay scheduling policy to use; it could be time-based, missed-opportunity-based, or whatever new delay scheduling policy the cluster supports. Today it is a global scheduler config. 2) It makes scheduler implementations simpler and reusable. A sketch of such a pluggable policy follows. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
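A minimal sketch of delay scheduling as a pluggable policy, with the two variants mentioned above; the interface and method names are hypothetical, not an existing YARN API:
{code:java}
interface DelaySchedulingPolicy {
  // Decide whether a request may relax locality (e.g. node -> rack) now.
  boolean canRelaxLocality(long requestCreationTimeMs, long nowMs,
      int missedOpportunities);
}

// Time-based variant: relax after a configured wait.
final class TimeBasedDelayPolicy implements DelaySchedulingPolicy {
  private final long maxWaitMs;
  TimeBasedDelayPolicy(long maxWaitMs) { this.maxWaitMs = maxWaitMs; }
  @Override
  public boolean canRelaxLocality(long createdMs, long nowMs, int missed) {
    return nowMs - createdMs >= maxWaitMs;
  }
}

// Missed-opportunity variant: relax after N skipped scheduling chances.
final class MissedOpportunityDelayPolicy implements DelaySchedulingPolicy {
  private final int maxMissed;
  MissedOpportunityDelayPolicy(int maxMissed) { this.maxMissed = maxMissed; }
  @Override
  public boolean canRelaxLocality(long createdMs, long nowMs, int missed) {
    return missed >= maxMissed;
  }
}
{code}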
[jira] [Created] (YARN-7442) [YARN-7069] Limit format of resource type name
Wangda Tan created YARN-7442: Summary: [YARN-7069] Limit format of resource type name Key: YARN-7442 URL: https://issues.apache.org/jira/browse/YARN-7442 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Priority: Blocker I think we should limit the format of resource type names; otherwise it could be very hard to change after release. I propose the format: {code} [a-zA-Z0-9][a-zA-Z0-9_.-/]* {code} Adding this check to setResourceInformation might affect performance a lot. Probably we can add it to {{ResourceUtils#initializeResourcesMap}} when resource types are loaded from the config file. [~templedf]/[~sunilg]. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
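A minimal sketch of validating names against the proposed format at config-load time. Note the hyphen is escaped here so the character class treats it literally; in the raw pattern above, ".-/" would otherwise parse as a character range:
{code:java}
import java.util.regex.Pattern;

final class ResourceNameValidator {
  private static final Pattern VALID_NAME =
      Pattern.compile("[a-zA-Z0-9][a-zA-Z0-9_.\\-/]*");

  // Intended to run once when resource types are loaded (e.g. from
  // ResourceUtils#initializeResourcesMap), not on every request.
  static void validate(String resourceTypeName) {
    if (!VALID_NAME.matcher(resourceTypeName).matches()) {
      throw new IllegalArgumentException(
          "Invalid resource type name: " + resourceTypeName);
    }
  }

  public static void main(String[] args) {
    validate("yarn.io/gpu"); // ok
    try {
      validate("-bad-name"); // first char must be alphanumeric
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
{code}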
[jira] [Created] (YARN-7438) Additional changes to make SchedulingPlacementSet agnostic to ResourceRequest / placement algorithm
Wangda Tan created YARN-7438: Summary: Additional changes to make SchedulingPlacementSet agnostic to ResourceRequest / placement algorithm Key: YARN-7438 URL: https://issues.apache.org/jira/browse/YARN-7438 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Priority: Major In addition to YARN-6040, we need to make changes to SchedulingPlacementSet to make it: 1) Agnostic to ResourceRequest (so once we have YARN-6592 merged, we can add a new SchedulingPlacementSet implementation in parallel with LocalitySchedulingPlacementSet to use/manage the new requests API). 2) Agnostic to the placement algorithm (now it is bound to delay scheduling; we should update APIs to make sure new placement algorithms, such as complex placement algorithms, can be implemented using SchedulingPlacementSet). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7437) Give SchedulingPlacementSet a better name.
Wangda Tan created YARN-7437: Summary: Give SchedulingPlacementSet a better name. Key: YARN-7437 URL: https://issues.apache.org/jira/browse/YARN-7437 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Priority: Major Currently, the name SchedulingPlacementSet is very confusing. Here are its responsibilities: 1) Store ResourceRequests (or SchedulingRequests after YARN-6592). 2) Decide the order of nodes to allocate on when there are multiple node candidates. 3) Decide whether to reject a node for given requests. 4) Store any state/cache that can help make decisions for #2/#3 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-5908) Add affinity/anti-affinity field to ResourceRequest API
[ https://issues.apache.org/jira/browse/YARN-5908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-5908. -- Resolution: Duplicate Duplicated to YARN-6952 > Add affinity/anti-affinity field to ResourceRequest API > --- > > Key: YARN-5908 > URL: https://issues.apache.org/jira/browse/YARN-5908 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7318) Fix shell check warnings of SLS.
Wangda Tan created YARN-7318: Summary: Fix shell check warnings of SLS. Key: YARN-7318 URL: https://issues.apache.org/jira/browse/YARN-7318 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Warnings like: {code} hadoop-tools/hadoop-sls/src/main/bin/rumen2sls.sh:75:77: warning: args is referenced but not assigned. [SC2154] hadoop-tools/hadoop-sls/src/main/bin/slsrun.sh:113:61: warning: args is referenced but not assigned. [SC2154] {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-4122) Add support for GPU as a resource
[ https://issues.apache.org/jira/browse/YARN-4122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-4122. -- Resolution: Duplicate This is duplicated by YARN-6620, closing as dup. > Add support for GPU as a resource > - > > Key: YARN-4122 > URL: https://issues.apache.org/jira/browse/YARN-4122 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: GPUAsAResourceDesign.pdf > > > Use [cgroups devices|https://www.kernel.org/doc/Documentation/cgroups/devices.txt] to > isolate GPUs for containers. For docker containers, we could use 'docker run > --device=...'. > Reference: [SLURM Resources isolation through > cgroups|http://slurm.schedmd.com/slurm_ug_2011/SLURM_UserGroup2011_cgroups.pdf]. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7307) Revisit resource-types.xml loading behaviors
Wangda Tan created YARN-7307: Summary: Revisit resource-types.xml loading behaviors Key: YARN-7307 URL: https://issues.apache.org/jira/browse/YARN-7307 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Sunil G The existing feature requires every client to have a resource-types.xml in order to use multiple resource types; should we allow the client/AM to update supported resource types via YARN APIs? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7292) Revisit Resource Profile Behavior
Wangda Tan created YARN-7292: Summary: Revisit Resource Profile Behavior Key: YARN-7292 URL: https://issues.apache.org/jira/browse/YARN-7292 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Priority: Blocker Had discussions with [~templedf], [~vvasudev], [~sunilg] offline. There are a couple of resource-profile-related behaviors to revisit -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-7249) Fix CapacityScheduler NPE issue when a container preempted while the node is being removed
[ https://issues.apache.org/jira/browse/YARN-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-7249. -- Resolution: Invalid Sorry for the noise; it is not an issue for 2.8 either. Closing as invalid. > Fix CapacityScheduler NPE issue when a container preempted while the node is > being removed > -- > > Key: YARN-7249 > URL: https://issues.apache.org/jira/browse/YARN-7249 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.1 >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Blocker > > This issue can happen when 3 conditions are satisfied: > 1) A node is being removed from the scheduler. > 2) A container running on the node is being preempted. > 3) A rare race condition causes the scheduler to pass a null node to the leaf queue. > The fix is to add a null-node check inside CapacityScheduler. > Stack trace: > {code} > 2017-08-31 02:51:24,748 FATAL resourcemanager.ResourceManager > (ResourceManager.java:run(714)) - Error in handling event type > KILL_RESERVED_CONTAINER to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1308) > > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.killReservedContainer(CapacityScheduler.java:1505) > > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1341) > > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:705) > > {code} > This issue only exists in 2.8.x -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7249) Fix CapacityScheduler NPE issue when a container preempted while the node is being removed
Wangda Tan created YARN-7249: Summary: Fix CapacityScheduler NPE issue when a container preempted while the node is being removed Key: YARN-7249 URL: https://issues.apache.org/jira/browse/YARN-7249 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.1 Reporter: Wangda Tan Assignee: Wangda Tan Priority: Blocker This issue can happen when 3 conditions are satisfied: 1) A node is being removed from the scheduler. 2) A container running on the node is being preempted. 3) A rare race condition causes the scheduler to pass a null node to the leaf queue. The fix is to add a null-node check inside CapacityScheduler. Stack trace:
{code}
2017-08-31 02:51:24,748 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(714)) - Error in handling event type KILL_RESERVED_CONTAINER to the scheduler
java.lang.NullPointerException
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1308)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.killReservedContainer(CapacityScheduler.java:1505)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1341)
  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127)
  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:705)
{code}
This issue only exists in 2.8.x -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7242) Support specifying values of different resource types in DistributedShell for easier testing
Wangda Tan created YARN-7242: Summary: Support specifying values of different resource types in DistributedShell for easier testing Key: YARN-7242 URL: https://issues.apache.org/jira/browse/YARN-7242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Currently, DS supports specifying a resource profile; it's better to also allow users to directly specify resource keys/values from the command line. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7237) Cleanup usages of ResourceProfiles
Wangda Tan created YARN-7237: Summary: Cleanup usages of ResourceProfiles Key: YARN-7237 URL: https://issues.apache.org/jira/browse/YARN-7237 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical While doing tests, I found a couple of issues: 1) {{ProfileCapability#getProfileCapabilityOverride}} overwrites whatever is specified in resource-profiles.json when a value is >= 0, which is different from the javadoc of {{ProfileCapability}}: bq. For example, if you have a resource profile "small" that maps to <4096M, 2 cores, 1 gpu> and you set the capability override to <8192M, 0 cores, 0 gpu>, then the actual resource allocation on the ResourceManager will be <8192M, 2 cores, 1 gpu> To me, the correct behavior should be to overwrite only when the value is > 0. The reason is that resource values default to 0. For example, assume we have a profile {{"a" = (mem=3, vcore=5, res_1=7)}} and create a capability override (capability = new Resource(8)). The final result should be (mem=8, vcore=5, res_1=7), instead of (mem=8, vcore=0, res_1=0). 2) ResourceProfileManager now loads the minimum/maximum profile from the config file (resource-profiles.json); to me this is not correct, because the minimum/maximum allocation for each resource type is already specified inside {{resource-types.xml}}. We should always use {{ResourceUtils#getResourceTypesMinimum/MaximumAllocation}} to get them from resource-types.xml and yarn-site.xml. These values will be added to profiles so clients can get these configs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
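A minimal sketch of the override semantics argued for in item 1), using plain maps instead of the Resource/ProfileCapability types: only values strictly greater than 0 replace the profile's values.
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

final class ProfileOverrideDemo {
  static Map<String, Long> merge(Map<String, Long> profile,
      Map<String, Long> override) {
    Map<String, Long> result = new LinkedHashMap<>(profile);
    override.forEach((name, value) -> {
      if (value > 0) {          // proposed: overwrite only when value > 0
        result.put(name, value);
      }
    });
    return result;
  }

  public static void main(String[] args) {
    Map<String, Long> profile = new LinkedHashMap<>();
    profile.put("mem", 3L);
    profile.put("vcore", 5L);
    profile.put("res_1", 7L);
    Map<String, Long> override = new LinkedHashMap<>();
    override.put("mem", 8L);
    override.put("vcore", 0L);  // unset values default to 0: keep profile
    override.put("res_1", 0L);
    // Prints {mem=8, vcore=5, res_1=7}, not (mem=8, vcore=0, res_1=0)
    System.out.println(merge(profile, override));
  }
}
{code}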
[jira] [Created] (YARN-7223) Document GPU isolation feature
Wangda Tan created YARN-7223: Summary: Document GPU isolation feature Key: YARN-7223 URL: https://issues.apache.org/jira/browse/YARN-7223 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-7208) CMAKE_C_STANDARD doesn't take effect in NodeManager package.
[ https://issues.apache.org/jira/browse/YARN-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-7208. -- Resolution: Duplicate > CMAKE_C_STANDARD doesn't take effect in NodeManager package. > > > Key: YARN-7208 > URL: https://issues.apache.org/jira/browse/YARN-7208 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Priority: Blocker > > I just checked: the changes of this JIRA don't relate to the issues I saw; I tried to revert this patch but the issue is still the same. > It seems set (CMAKE_C_STANDARD) doesn't work for the nodemanager project. > I hardcoded set (CMAKE_C_STANDARD 99) to set (CMAKE_C_STANDARD 90) in the nodemanager project. (Since we have code that uses C99-only syntax, changing to 90 should fail the build.) > I tried on two different environments: > 1) Centos 6, cmake version 3.1.0, gcc 4.4.7 > For both the 99/90 standards, all fail. > 2) OSX v10.12.4, cmake version 3.5.2, cc = "Apple LLVM version 8.1.0 > (clang-802.0.42)". > For both the 99/90 standards, all succeed. > At least the for loop in gpu-module.c is C99-only: > {code} > for (int i = 0; i < n_minor_devices_to_block; i++) { >// ... > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7208) CMAKE_C_STANDARD doesn't take effect in NodeManager package.
Wangda Tan created YARN-7208: Summary: CMAKE_C_STANDARD doesn't take effect in NodeManager package. Key: YARN-7208 URL: https://issues.apache.org/jira/browse/YARN-7208 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Priority: Blocker I just checked: the changes of this JIRA don't relate to the issues I saw; I tried to revert this patch but the issue is still the same. It seems set (CMAKE_C_STANDARD) doesn't work for the nodemanager project. I hardcoded set (CMAKE_C_STANDARD 99) to set (CMAKE_C_STANDARD 90) in the nodemanager project. (Since we have code that uses C99-only syntax, changing to 90 should fail the build.) I tried on two different environments: 1) Centos 6, cmake version 3.1.0, gcc 4.4.7 For both the 99/90 standards, all fail. 2) OSX v10.12.4, cmake version 3.5.2, cc = "Apple LLVM version 8.1.0 (clang-802.0.42)". For both the 99/90 standards, all succeed. At least the for loop in gpu-module.c is C99-only:
{code}
for (int i = 0; i < n_minor_devices_to_block; i++) {
  // ...
}
{code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-3926) Extend the YARN resource model for easier resource-type management and profiles
[ https://issues.apache.org/jira/browse/YARN-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-3926. -- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.1.0 This feature is merged to trunk (3.1.0). Thanks everybody for helping with this feature, especially [~vvasudev] for leading and driving the feature development from the beginning. I just moved all pending items to YARN-7069 and marked this one as resolved. > Extend the YARN resource model for easier resource-type management and > profiles > --- > > Key: YARN-3926 > URL: https://issues.apache.org/jira/browse/YARN-3926 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 3.1.0 > > Attachments: Proposal for modifying resource model and profiles.pdf > > > Currently, there are efforts to add support for various resource-types such > as disk (YARN-2139), network (YARN-2140), and HDFS bandwidth (YARN-2681). These > efforts all aim to add support for a new resource type and are fairly > involved efforts. In addition, once support is added, it becomes harder for > users to specify the resources they need. All existing jobs have to be > modified, or have to use the minimum allocation. > This ticket is a proposal to extend the YARN resource model to a more > flexible model which makes it easier to support additional resource-types. It > also considers the related aspect of “resource profiles” which allow users to > easily specify the various resources they need for any given container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org