[jira] [Created] (YARN-10535) Make changes in queue placement policy to use auto-queue-placement API in CapacityScheduler

2020-12-15 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10535:
-

 Summary: Make changes in queue placement policy to use 
auto-queue-placement API in CapacityScheduler
 Key: YARN-10535
 URL: https://issues.apache.org/jira/browse/YARN-10535
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacity scheduler
Reporter: Wangda Tan


Once YARN-10506 is done, we need to call the API from the queue placement 
policy to create queues. 
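
A rough sketch of the intended call path (the method and variable names here are hypothetical; the actual API is whatever YARN-10506 introduces):
{code:java}
// Hypothetical sketch: when a mapping rule resolves to a queue that does not
// exist yet, ask the scheduler to auto-create it before placing the app.
CSQueue queue = queueManager.getQueue(mappedQueuePath);
if (queue == null) {
  queue = scheduler.autoCreateQueue(mappedQueuePath); // method name is an assumption
}
{code}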



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2020-12-11 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10532:
-

 Summary: Capacity Scheduler Auto Queue Creation: Allow auto delete 
queue when queue is not being used
 Key: YARN-10532
 URL: https://issues.apache.org/jira/browse/YARN-10532
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan


It's better if we can delete auto-created queues when they are not in use for a 
period of time (like 5 mins). It will be helpful when we have a large number of 
auto-created queues (e.g. from 500 users), but only a small subset of queues 
are actively used.
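
A minimal sketch of what such a removal policy could look like (purely illustrative; the lastUsedTime map, the accessors, and the removal API are assumptions, not existing YARN methods):
{code:java}
// Illustrative only: periodically remove dynamic queues that have been idle
// longer than a configured expiry time (e.g. 5 minutes).
for (CSQueue queue : autoCreatedQueues) {
  long idleMillis = clock.getTime() - lastUsedTime.get(queue.getQueuePath());
  if (queue.getNumApplications() == 0            // assumed accessor
      && idleMillis > expiredTimeMillis) {
    queueManager.removeQueue(queue.getQueuePath()); // assumed removal API
  }
}
{code}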



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10531) Be able to disable user limit factor for CapacityScheduler Leaf Queue

2020-12-11 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10531:
-

 Summary: Be able to disable user limit factor for 
CapacityScheduler Leaf Queue
 Key: YARN-10531
 URL: https://issues.apache.org/jira/browse/YARN-10531
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan


User limit factor is used to define the maximum cap on how much resource can be 
consumed by a single user. 

In the Auto Queue Creation context, it doesn't make much sense to set a user 
limit factor: initially every queue has a weight of 1.0, and we want a user to 
be able to consume more resources when possible. It is hard to pre-determine 
how to set up the user limit factor, so it makes more sense to add a new value 
(like -1) to indicate that the user limit factor is disabled.

The logic that needs to be changed is below: 

(Inside LeafQueue.java)

{code}
Resource maxUserLimit = Resources.none();
if (schedulingMode == SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY) {
  maxUserLimit = Resources.multiplyAndRoundDown(queueCapacity,
      getUserLimitFactor());
} else if (schedulingMode == SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY) {
  maxUserLimit = partitionResource;
}
{code}
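
One possible shape of that change, as a minimal sketch only (the -1 handling and the queueMaxResource variable are assumptions, not the committed patch):
{code:java}
Resource maxUserLimit = Resources.none();
if (schedulingMode == SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY) {
  if (getUserLimitFactor() == -1.0f) {
    // User limit factor disabled: a single user may grow up to the queue's
    // maximum resource for this partition.
    maxUserLimit = queueMaxResource; // assumed: the queue's max resource
  } else {
    maxUserLimit = Resources.multiplyAndRoundDown(queueCapacity,
        getUserLimitFactor());
  }
} else if (schedulingMode == SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY) {
  maxUserLimit = partitionResource;
}
{code}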




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10530) CapacityScheduler ResourceLimits doesn't handle node partition well

2020-12-11 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10530:
-

 Summary: CapacityScheduler ResourceLimits doesn't handle node 
partition well
 Key: YARN-10530
 URL: https://issues.apache.org/jira/browse/YARN-10530
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, capacityscheduler
Reporter: Wangda Tan


This is a serious bug that may impact all releases. I need to do further 
checks, but I want to log the JIRA so we will not forget:  

ResourceLimits objects are used for two purposes: 

1) When there's a cluster resource change, for example adding a new node or 
reinitializing the scheduler config, we pass ResourceLimits down to the queues 
via updateClusterResource. 

2) When allocating a container, we pass the parent's available resource to the 
child to make sure the child's allocation won't violate the parent's max 
resource. For example: 

{code}
queue      used  max
---------------------
root       10    20
root.a     8     10
root.a.a1  2     10
root.a.a2  6     10
{code}

Even though a.a1 has 8 resources of headroom (a1.max - a1.used), we can 
allocate at most 2 resources to a1, because root.a's limit will be hit first. 
This information is passed down from parent queue to child queue during the 
assignContainers call via ResourceLimits. 

However, we only pass one ResourceLimits from the top. For queue 
initialization, we pass in: 

{code}
root.updateClusterResource(clusterResource, new ResourceLimits(
clusterResource));
{code}

And when we update the cluster resource, we only consider the default partition:

{code}
// Update all children
for (CSQueue childQueue : childQueues) {
  // Get ResourceLimits of child queue before assign containers
  ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
      clusterResource, resourceLimits,
      RMNodeLabelsManager.NO_LABEL, false);
  childQueue.updateClusterResource(clusterResource, childLimits);
}
{code}

The same goes for the allocation logic, where we pass in the following (I 
actually found a TODO item I added 5 years ago):

{code}
// Try to use NON_EXCLUSIVE
assignment = getRootQueue().assignContainers(getClusterResource(),
    candidates,
    // TODO, now we only consider limits for parent for non-labeled
    // resources, should consider labeled resources as well.
    new ResourceLimits(labelManager
        .getResourceByLabel(RMNodeLabelsManager.NO_LABEL,
            getClusterResource())),
    SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
{code} 
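
A minimal sketch of the direction the TODO points at (an assumption, not a committed fix): seed the root ResourceLimits with the candidate partition's resource instead of always using NO_LABEL.
{code:java}
// Sketch only: use the partition of the current candidate node set rather than
// hard-coding RMNodeLabelsManager.NO_LABEL when building the top-level limits.
String partition = candidates.getPartition();
assignment = getRootQueue().assignContainers(getClusterResource(),
    candidates,
    new ResourceLimits(labelManager
        .getResourceByLabel(partition, getClusterResource())),
    SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
{code}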

The good thing is that in the assignContainers call, we calculate the child 
limit based on the partition:
{code} 
ResourceLimits childLimits =
  getResourceLimitsOfChild(childQueue, cluster, limits,
  candidates.getPartition(), true);
{code} 

So I think the problem now is: when a named partition has more resources than 
the default partition, the effective min/max resources of each queue could be wrong.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2020-11-20 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10497:
-

 Summary: Fix an issue in CapacityScheduler which fails to delete 
queues
 Key: YARN-10497
 URL: https://issues.apache.org/jira/browse/YARN-10497
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wangda Tan


We saw an exception when using queue mutation APIs:
{code:java}
2020-11-13 16:47:46,327 WARN 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
CapacityScheduler configuration validation failed:java.io.IOException: Queue 
root.am2cmQueueSecond not found
{code}
Which comes from this code:
{code:java}
List<String> siblingQueues = getSiblingQueues(queueToRemove,
    proposedConf);
if (!siblingQueues.contains(queueName)) {
  throw new IOException("Queue " + queueToRemove + " not found");
}
{code}
(Inside MutableCSConfigurationProvider)

If you look at the method:
{code:java}
 
private List<String> getSiblingQueues(String queuePath, Configuration conf) {
  String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
  String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
      parentQueue + CapacitySchedulerConfiguration.DOT +
      CapacitySchedulerConfiguration.QUEUES;
  return new ArrayList<>(conf.getStringCollection(childQueuesKey));
}
{code}
And here's the capacity-scheduler.xml I got:
{code:xml}
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default, q1, q2</value>
</property>
{code}
You can notice there are spaces between default, q1, q2.

So conf.getStringCollection returns:
{code:java}
default
q1
...
{code}
Because getStringCollection does not trim the values, the returned entries keep 
their leading spaces (e.g. " q1"), which causes the exact-match check to fail 
when we try to delete the queue.
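
A minimal sketch of one possible fix (an assumption, not the committed patch): trim the values when reading the sibling queues, e.g. via Configuration#getTrimmedStringCollection.
{code:java}
private List<String> getSiblingQueues(String queuePath, Configuration conf) {
  String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
  String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
      parentQueue + CapacitySchedulerConfiguration.DOT +
      CapacitySchedulerConfiguration.QUEUES;
  // getTrimmedStringCollection strips surrounding whitespace, so
  // "default, q1, q2" yields ["default", "q1", "q2"] and the later
  // contains(queueName) check matches as expected.
  return new ArrayList<>(conf.getTrimmedStringCollection(childQueuesKey));
}
{code}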



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10496) [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler

2020-11-19 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10496:
-

 Summary: [Umbrella] Support Flexible Auto Queue Creation in 
Capacity Scheduler
 Key: YARN-10496
 URL: https://issues.apache.org/jira/browse/YARN-10496
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: capacity scheduler
Reporter: Wangda Tan


CapacityScheduler today doesn’t support auto queue creation that is flexible 
enough. The current constraints: 
 * Only leaf queues can be auto-created.
 * A parent can only have either static queues or dynamic ones. This causes 
multiple constraints. For example:
 ** It isn’t possible to have a VIP user like Alice with a static queue 
root.user.alice with 50% capacity while the other user queues (under root.user) 
are created dynamically and share the remaining 50% of resources.
 ** This implies that there is no possibility to have both dynamically created 
and static queues at the same time under root.
 * In comparison, FairScheduler allows the following scenarios, which Capacity 
Scheduler doesn’t:
 ** A new queue needs to be created under an existing parent, while the parent 
already has static queues.
 ** A nested queue mapping policy, where two levels of queues may need to be 
created, as in the example below.

If an application belongs to user _alice_ (who has the primary_group of 
_engineering_), the scheduler checks whether _root.engineering_ exists; if it 
doesn’t, it’ll be created. Then the scheduler checks whether 
_root.engineering.alice_ exists, and creates it if it doesn't.

When we try to move users from FairScheduler to CapacityScheduler, we face 
feature gaps which block users from migrating from FS to CS.
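
For illustration only (the exact value is an assumption; it follows the CapacityScheduler queue-mapping syntax), a nested mapping of the kind described above could look like this in capacity-scheduler.xml:
{code:xml}
<property>
  <name>yarn.scheduler.capacity.queue-mappings</name>
  <!-- Place each app under root.<primary_group>.<user>; with flexible auto
       queue creation both levels could be created on demand. -->
  <value>u:%user:%primary_group.%user</value>
</property>
{code}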



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-07-30 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10380:
-

 Summary: Import logic of multi-node allocation in CapacityScheduler
 Key: YARN-10380
 URL: https://issues.apache.org/jira/browse/YARN-10380
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wangda Tan


*1) Entry point:* 
When we do multi-node allocation, we're using the same logic of async 
scheduling:
{code:java}
// Allocate containers of node [start, end)
for (FiCaSchedulerNode node : nodes) {
  if (current++ >= start) {
    if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
      continue;
    }
    cs.allocateContainersToNode(node.getNodeID(), false);
  }
} {code}
Is this the most effective way to do multi-node scheduling? Should we allocate 
based on partitions? In the above logic, if we have thousands of nodes in one 
partition, we will repeatedly access all nodes of the partition thousands of 
times.

I would suggest making the entry points for node-heartbeat, async-scheduling 
(single node), and async-scheduling (multi-node) different.

Node-heartbeat and async-scheduling (single node) can still be similar and 
share most of the code. 

async-scheduling (multi-node): should iterate over partitions first, using 
pseudo code like: 
{code:java}
for (partition : all partitions) {
  allocateContainersOnMultiNodes(getCandidate(partition))
} {code}
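
A slightly more concrete sketch of that pseudo code (getCandidateNodeSet and allocateContainersOnMultiNodes are assumed names, not an existing entry point):
{code:java}
// Sketch only: visit each partition once and hand the whole candidate node set
// of that partition to the multi-node allocator, instead of re-walking every
// node on every scheduling iteration.
for (String partition : labelManager.getClusterNodeLabelNames()) {
  CandidateNodeSet<FiCaSchedulerNode> candidates =
      getCandidateNodeSet(partition);            // assumed helper
  cs.allocateContainersOnMultiNodes(candidates); // assumed entry point
}
{code}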
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10151) Disable Capacity Scheduler's move app between queue functionality

2020-04-06 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-10151.
---
Resolution: Won't Fix

Thanks folks for commenting about YARN-9838. I think we don't need this change 
now, given we already have a fix for the reported issue.

> Disable Capacity Scheduler's move app between queue functionality
> -
>
> Key: YARN-10151
> URL: https://issues.apache.org/jira/browse/YARN-10151
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Saw this happen in many clusters: Capacity Scheduler cannot work correctly 
> with the move-app-between-queues feature. It will cause weird JMX issues, 
> resource accounting issues, etc. In a lot of cases it will leave the RM 
> completely hung with available resources gone negative, and nothing can be 
> allocated after that. We should turn off CapacityScheduler's move app between 
> queue feature. (see: 
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#moveApplication}}
>  )



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10170) Should revisit mix-usage of percentage-based and absolute-value-based min/max resource in CapacityScheduler

2020-02-26 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10170:
-

 Summary: Should revisit mix-usage of percentage-based and 
absolute-value-based min/max resource in CapacityScheduler
 Key: YARN-10170
 URL: https://issues.apache.org/jira/browse/YARN-10170
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


This should be finished after YARN-10169. (If we can get this one done easily, 
we should do this one instead of YARN-10169.)

Absolute resource means mem=x, vcore=y.

Percentage resource means x%.

We should not allow a percentage-based child under an absolute-value-based 
parent (root is considered percentage-based).
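
For illustration (queue names and values are made up), the mix in question looks like this in capacity-scheduler.xml: an absolute-value-based parent with a percentage-based child.
{code:xml}
<!-- Parent configured with absolute resources -->
<property>
  <name>yarn.scheduler.capacity.root.a.capacity</name>
  <value>[memory=10240,vcores=10]</value>
</property>

<!-- Child configured as a percentage: this is the mix that should be rejected -->
<property>
  <name>yarn.scheduler.capacity.root.a.a1.capacity</name>
  <value>50</value>
</property>
{code}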



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10167) Need validate c-s.xml after converting

2020-02-26 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10167:
-

 Summary: Need validate c-s.xml after converting
 Key: YARN-10167
 URL: https://issues.apache.org/jira/browse/YARN-10167
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan


Currently we just generate c-s.xml, but we don't validate it. To make sure the 
c-s.xml is correct after conversion, it's better to initialize the CS scheduler 
using the generated configs.

Also, in the tests, we should try to leverage MockRM to validate the generated 
configs as much as we can.
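
A minimal sketch of that kind of validation (the file path is illustrative; the rest uses existing test utilities):
{code:java}
// Sketch only: load the converted capacity-scheduler.xml and let a MockRM
// backed by CapacityScheduler initialize with it; an invalid config fails fast.
YarnConfiguration conf = new YarnConfiguration();
conf.addResource(new Path("target/converted/capacity-scheduler.xml")); // illustrative path
conf.set(YarnConfiguration.RM_SCHEDULER, CapacityScheduler.class.getName());

MockRM rm = new MockRM(conf);
rm.start();  // throws if the converted configuration cannot be initialized
rm.stop();
{code}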



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10151) Disable Capacity Scheduler's move app between queue functionality

2020-02-18 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10151:
-

 Summary: Disable Capacity Scheduler's move app between queue 
functionality
 Key: YARN-10151
 URL: https://issues.apache.org/jira/browse/YARN-10151
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


Saw this happen in many clusters: Capacity Scheduler cannot work correctly with 
the move-app-between-queues feature. It will cause weird JMX issues, resource 
accounting issues, etc. In a lot of cases it will leave the RM completely hung 
with available resources gone negative, and nothing can be allocated after 
that. We should turn off CapacityScheduler's move app between queue feature. 
(see: 
{{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#moveApplication}}
 )



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8975) [Submarine] Use predefined Charset object StandardCharsets.UTF_8 instead of String "UTF-8"

2018-11-28 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8975.
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.3.0

Committed to trunk, thanks [~tangzhankun] and reviews from [~ajisakaa].

> [Submarine] Use predefined Charset object StandardCharsets.UTF_8 instead of 
> String "UTF-8"
> --
>
> Key: YARN-8975
> URL: https://issues.apache.org/jira/browse/YARN-8975
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Trivial
> Fix For: 3.3.0
>
> Attachments: YARN-8975-trunk.001.patch, YARN-8975-trunk.002.patch
>
>
> {code:java}
> Writer w = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");{code}
> Could be refactored to this to improve performance a little bit by avoiding 
> the string lookup:
> {code:java}
> Writer w = new OutputStreamWriter(new FileOutputStream(file), 
> StandardCharsets.UTF_8);{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9020) set a wrong AbsoluteCapacity when call ParentQueue#setAbsoluteCapacity

2018-11-14 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-9020.
--
Resolution: Duplicate

Thanks [~jutia] for reporting this. It is a valid issue.

This is dup of YARN-8917, [~Tao Yang] has put a patch already. Closing this as 
dup.

> set a wrong AbsoluteCapacity when call  ParentQueue#setAbsoluteCapacity
> ---
>
> Key: YARN-9020
> URL: https://issues.apache.org/jira/browse/YARN-9020
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: tianjuan
>Assignee: tianjuan
>Priority: Major
>
> set a wrong AbsoluteCapacity when call  ParentQueue#setAbsoluteCapacity
> private void deriveCapacityFromAbsoluteConfigurations(String label,
>     Resource clusterResource, ResourceCalculator rc, CSQueue childQueue) {
>   // 3. Update absolute capacity as a float based on parent's minResource and
>   // cluster resource.
>   childQueue.getQueueCapacities().setAbsoluteCapacity(label,
>       (float) childQueue.getQueueCapacities().getCapacity()
>       / getQueueCapacities().getAbsoluteCapacity(label));
>  
> should be 
>
>   childQueue.getQueueCapacities().setAbsoluteCapacity(label,
>       (float) childQueue.getQueueCapacities().getCapacity(label)
>       / getQueueCapacities().getAbsoluteCapacity(label));



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8993) [Submarine] Add support to run deep learning workload in non-Docker containers

2018-11-08 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8993:


 Summary: [Submarine] Add support to run deep learning workload in 
non-Docker containers
 Key: YARN-8993
 URL: https://issues.apache.org/jira/browse/YARN-8993
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


Now that Submarine supports Docker containers well, there are needs to run TF 
without Docker containers. This JIRA targets supporting deep-learning workload 
orchestration in non-Docker containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8237) mxnet yarn spec file to add to native service examples

2018-11-06 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8237.
--
Resolution: Duplicate

> mxnet yarn spec file to add to native service examples
> --
>
> Key: YARN-8237
> URL: https://issues.apache.org/jira/browse/YARN-8237
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
>
> Mxnet could be run on YARN. This jira will help to add examples, yarnfile, 
> docker files which are needed to run Mxnet on YARN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8238) [Umbrella] YARN deep learning framework examples to run on native service

2018-11-06 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8238.
--
Resolution: Fixed

Closing as dup of YARN-8135. 

> [Umbrella] YARN deep learning framework examples to run on native service
> -
>
> Key: YARN-8238
> URL: https://issues.apache.org/jira/browse/YARN-8238
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
>
> Umbrella jira to track various deep learning frameworks which can run on yarn 
> native services.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized

2018-10-24 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8513.
--
   Resolution: Duplicate
Fix Version/s: (was: 3.2.1)
   (was: 3.1.2)

Reopen and closing as dup of YARN-8896

> CapacityScheduler infinite loop when queue is near fully utilized
> -
>
> Key: YARN-8513
> URL: https://issues.apache.org/jira/browse/YARN-8513
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 3.1.0, 2.9.1
> Environment: Ubuntu 14.04.5 and 16.04.4
> YARN is configured with one label and 5 queues.
>Reporter: Chen Yufei
>Priority: Major
> Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, 
> jstack-5.log, top-during-lock.log, top-when-normal.log, yarn3-jstack1.log, 
> yarn3-jstack2.log, yarn3-jstack3.log, yarn3-jstack4.log, yarn3-jstack5.log, 
> yarn3-resourcemanager.log, yarn3-top
>
>
> ResourceManager does not respond to any request when queue is near fully 
> utilized sometimes. Sending SIGTERM won't stop RM, only SIGKILL can. After RM 
> restart, it can recover running jobs and start accepting new ones.
>  
> Seems like CapacityScheduler is in an infinite loop printing out the 
> following log messages (more than 25,000 lines in a second):
>  
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.99816763 
> absoluteUsedCapacity=0.99816763 used= 
> cluster=}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1530619767030_1652_01 
> container=null 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
>  clusterResource= type=NODE_LOCAL 
> requestedPartition=}}
>  
> I encounter this problem several times after upgrading to YARN 2.9.1, while 
> the same configuration works fine under version 2.7.3.
>  
> YARN-4477 is an infinite loop bug in FairScheduler, not sure if this is a 
> similar problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8896) Limit the maximum number of container assignments per heartbeat

2018-10-18 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8896.
--
Resolution: Fixed

> Limit the maximum number of container assignments per heartbeat
> ---
>
> Key: YARN-8896
> URL: https://issues.apache.org/jira/browse/YARN-8896
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Weiwei Yang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.1.2, 3.2.1
>
> Attachments: YARN-8896-trunk.001.patch
>
>
> YARN-4161 adds a configuration 
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} 
> to control the max number of container assignments per heartbeat; however, the 
> default value is -1. This could potentially cause the CS to get stuck in the 
> while loop, causing issues like YARN-8513. We should change this to a finite 
> number, e.g. 100.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8858) CapacityScheduler should respect maximum node resource when per-queue maximum-allocation is being used.

2018-10-08 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8858:


 Summary: CapacityScheduler should respect maximum node resource 
when per-queue maximum-allocation is being used.
 Key: YARN-8858
 URL: https://issues.apache.org/jira/browse/YARN-8858
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Sumana Sathish
Assignee: Wangda Tan


This issue happens after YARN-8720.

Before that, the AMS used scheduler.getMaximumAllocation to do the 
normalization. After that, the AMS uses LeafQueue.getMaximumAllocation. The 
scheduler one is backed by nodeTracker.getMaximumAllocation, but the LeafQueue 
one isn't. 

We should use scheduler.getMaximumAllocation to cap the per-queue 
maximum-allocation every time.
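
A minimal sketch of the intended capping (an assumption, not the committed patch):
{code:java}
// Sketch only: never let the per-queue maximum-allocation exceed what the
// scheduler (backed by nodeTracker) says the nodes can actually offer.
Resource queueMax = leafQueue.getMaximumAllocation();            // per-queue setting
Resource clusterMax = scheduler.getMaximumResourceCapability();  // nodeTracker-backed
Resource effectiveMax = Resources.componentwiseMin(queueMax, clusterMax);
{code}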



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8817) [Submarine] In some cases HDFS is not asked by user when submit job but framework requires user to set HDFS related environments

2018-09-24 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8817:


 Summary: [Submarine] In some cases HDFS is not asked by user when 
submit job but framework requires user to set HDFS related environments
 Key: YARN-8817
 URL: https://issues.apache.org/jira/browse/YARN-8817
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: submarine
Reporter: Wangda Tan


A user who submits the job can see an error message like: 

18/09/24 23:12:58 ERROR yarnservice.YarnServiceJobSubmitter: When hdfs is being 
used to read/write models/data. Following envs are required: 1) 
DOCKER_HADOOP_HDFS_HOME= 2) 
DOCKER_JAVA_HOME=. You can use --env to pass 
these envars.
Exception in thread "main" java.io.IOException: Failed to detect HDFS-related 
environments

This happens even if HDFS is not requested at all.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8799) [Submarine] Correct the default directory path in HDFS for "checkout_path"

2018-09-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8799.
--
Resolution: Duplicate

This is a duplicate of YARN-8757. 

> [Submarine] Correct the default directory path in HDFS for "checkout_path"
> --
>
> Key: YARN-8799
> URL: https://issues.apache.org/jira/browse/YARN-8799
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.2.0
>
>
>  
> {code:java}
> yarn jar 
> $HADOOP_BASE_DIR/home/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  -verbose \
>  -wait_job_finish \
>  -keep_staging_dir \
>  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-oracle \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.2.0-SNAPSHOT \
>  --name tf-job-001 \
>  --docker_image tangzhankun/tensorflow \
>  --input_path hdfs://default/user/yarn/cifar-10-data \
>  --worker_resources memory=4G,vcores=2 \
>  --worker_launch_cmd "cd /cifar10_estimator && python cifar10_main.py 
> --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0 
> --train-steps=5"{code}
>  
> The above script should work, but the job failed due to an invalid path passed 
> to "--job-dir" in my testing. It should be a URI starting with "hdfs://".
> {code:java}
> 2018-09-19 23:19:34,729 INFO yarnservice.YarnServiceJobSubmitter: Worker 
> command =[cd /cifar10_estimator && python cifar10_main.py 
> --data-dir=hdfs://default/user/yarn/cifar-10-data 
> --job-dir=submarine/jobs/tf-job-001/staging/checkpoint_path --num-gpus=0 
> --train-steps=2]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8800) Updated documentation of Submarine with latest examples.

2018-09-19 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8800:


 Summary: Updated documentation of Submarine with latest examples.
 Key: YARN-8800
 URL: https://issues.apache.org/jira/browse/YARN-8800
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8770) [Submarine] Support using Submarine to submit Pytorch job

2018-09-12 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8770:


 Summary: [Submarine] Support using Submarine to submit Pytorch job
 Key: YARN-8770
 URL: https://issues.apache.org/jira/browse/YARN-8770
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8769) [Submarine] Allow user to specify customized quicklink(s) when submit Submarine job

2018-09-12 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8769:


 Summary: [Submarine] Allow user to specify customized quicklink(s) 
when submit Submarine job
 Key: YARN-8769
 URL: https://issues.apache.org/jira/browse/YARN-8769
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan


This will be helpful when a user submits a job and some links need to be shown 
on YARN UI2 (service page). For example, a user can specify a quick link to the 
Zeppelin notebook UI when a Zeppelin notebook gets launched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8757) [Submarine] Add Tensorboard component when --tensorboard is specified

2018-09-09 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8757:


 Summary: [Submarine] Add Tensorboard component when --tensorboard 
is specified
 Key: YARN-8757
 URL: https://issues.apache.org/jira/browse/YARN-8757
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8756) [Submarine] Properly handle relative path for staging area

2018-09-09 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8756:


 Summary: [Submarine] Properly handle relative path for staging area
 Key: YARN-8756
 URL: https://issues.apache.org/jira/browse/YARN-8756
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: submarine
Reporter: Wangda Tan
Assignee: Wangda Tan


While doing tests, I found that when a relative path is specified for the 
checkpoint, the path passed to Tensorflow is wrong. A trick is to get a 
FileStatus before returning. Will attach a fix soon.
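
A minimal sketch of that trick (the paths are illustrative; this is not the committed patch):
{code:java}
// Sketch only: resolve a possibly-relative staging/checkpoint dir to a fully
// qualified URI by asking the FileSystem for its FileStatus first.
FileSystem fs = FileSystem.get(conf);
Path stagingDir = new Path("submarine/jobs/tf-job-001/staging"); // relative, illustrative
Path qualified = fs.getFileStatus(stagingDir).getPath();         // e.g. hdfs://ns1/user/...
{code}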



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8716) [Submarine] Support passing Kerberos principle tokens when launch training jobs.

2018-08-26 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8716:


 Summary: [Submarine] Support passing Kerberos principle tokens 
when launch training jobs.
 Key: YARN-8716
 URL: https://issues.apache.org/jira/browse/YARN-8716
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: submarine
Reporter: Wangda Tan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8713) [Submarine] Support deploy model serving for existing models

2018-08-24 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8713:


 Summary: [Submarine] Support deploy model serving for existing 
models
 Key: YARN-8713
 URL: https://issues.apache.org/jira/browse/YARN-8713
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: submarine
Reporter: Wangda Tan


See 
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7
 {{model deploy}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-08-24 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8714:


 Summary: [Submarine] Support files/tarballs to be localized for a 
training job.
 Key: YARN-8714
 URL: https://issues.apache.org/jira/browse/YARN-8714
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan


See 
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
 {{job run --localizations ...}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8712) [Submarine] Support create models / versions for training result.

2018-08-24 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8712:


 Summary: [Submarine] Support create models / versions for training 
result. 
 Key: YARN-8712
 URL: https://issues.apache.org/jira/browse/YARN-8712
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: submarine
Reporter: Wangda Tan


As mentioned in 
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
 we should be able to have models/versions for models created by training 
algorithm. See design doc for syntax, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8657) User limit calculation should be read-lock-protected within LeafQueue

2018-08-13 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8657:


 Summary: User limit calculation should be read-lock-protected 
within LeafQueue
 Key: YARN-8657
 URL: https://issues.apache.org/jira/browse/YARN-8657
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Reporter: Sumana Sathish
Assignee: Wangda Tan


When async scheduling is enabled, the user limit calculation could be wrong: 

it is possible that the scheduler calculated a user_limit, but by the time 
{{canAssignToUser}} runs it has become stale. 

We need to protect the user limit calculation.
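
A minimal sketch of the intended protection (the helper name and the exact check are assumptions, not the committed patch):
{code:java}
// Sketch only: compute the user limit and evaluate it under the same LeafQueue
// read lock, so another async-scheduling thread cannot make the value stale
// between the two steps.
readLock.lock();
try {
  Resource userLimit = computeUserLimit(user, clusterResource, partition); // assumed helper
  canAssign = Resources.fitsIn(
      Resources.add(user.getUsed(partition), required), userLimit);
} finally {
  readLock.unlock();
}
{code}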



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8563) Support users to specify Python/TF package/version/dependencies for training job.

2018-07-21 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8563:


 Summary: Support users to specify Python/TF 
package/version/dependencies for training job.
 Key: YARN-8563
 URL: https://issues.apache.org/jira/browse/YARN-8563
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan


YARN-8561 assumes all Python / Tensorflow dependencies will be packed into the 
docker image. In practice, a user may not want to build a docker image. 
Instead, the user can provide Python packages / dependencies (like .whl files) 
plus the Python and TF versions, and Submarine can localize the specified 
dependencies into prebuilt base Docker images.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8561) Add submarine initial implementation: training job submission and job history retrieve.

2018-07-20 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8561:


 Summary: Add submarine initial implementation: training job 
submission and job history retrieve.
 Key: YARN-8561
 URL: https://issues.apache.org/jira/browse/YARN-8561
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan


Added the following parts:
1) A new subcomponent of YARN, under the applications/ project. 

2) Tensorflow training job submission, including training (single node and 
distributed): 
- Supports Docker containers. 
- Supports GPU isolation. 
- Supports YARN registry DNS.

3) Retrieving job history.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8545) YARN native service should return container if launch failed

2018-07-17 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8545:


 Summary: YARN native service should return container if launch 
failed
 Key: YARN-8545
 URL: https://issues.apache.org/jira/browse/YARN-8545
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Wangda Tan


In some cases, container launch may fail but the container will not be properly 
returned to the RM. 

This could happen when the AM tries to prepare the container launch context but 
fails without sending it to the NM (once the container launch context is sent 
to the NM, the NM will report the failed container to the RM).

An exception like: 
{code:java}
java.io.FileNotFoundException: File does not exist: 
hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at 
org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
at 
org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
at 
org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
at 
org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8506) Make GetApplicationsRequestPBImpl thread safe

2018-07-09 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8506:


 Summary: Make GetApplicationsRequestPBImpl thread safe
 Key: YARN-8506
 URL: https://issues.apache.org/jira/browse/YARN-8506
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Wangda Tan
Assignee: Wangda Tan


When GetApplicationsRequestPBImpl is used in a multi-threaded environment, 
exceptions like the one below will occur because we don't protect the write ops.

{code}
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at java.util.ArrayList.addAll(ArrayList.java:613)
at 
com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:132)
at 
com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:123)
at 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:327)
at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$GetApplicationsRequestProto$Builder.addAllApplicationTags(YarnServiceProtos.java:24450)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToBuilder(GetApplicationsRequestPBImpl.java:100)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToProto(GetApplicationsRequestPBImpl.java:78)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.getProto(GetApplicationsRequestPBImpl.java:69)
{code}

We need to make GetApplicationsRequestPBImpl thread safe. We saw the issue 
happen frequently when RequestHedgingRMFailoverProxyProvider is being used.
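
A sketch of the kind of change implied above (an assumption, not the committed patch): serialize the write path so concurrent callers cannot corrupt the protobuf builder's repeated fields.
{code:java}
// Sketch only: making getProto() (and the merge helpers it calls) synchronized
// prevents two threads from mutating the underlying builder at the same time.
public synchronized GetApplicationsRequestProto getProto() {
  mergeLocalToProto();                          // mutates the builder
  proto = viaProto ? proto : builder.build();
  viaProto = true;
  return proto;
}
{code}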



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8489) Need to support customer termination policy for native services

2018-07-02 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8489:


 Summary: Need to support customer termination policy for native 
services
 Key: YARN-8489
 URL: https://issues.apache.org/jira/browse/YARN-8489
 Project: Hadoop YARN
  Issue Type: Task
  Components: yarn-native-services
Reporter: Wangda Tan


The existing YARN service framework ties its termination policy to the restart 
policy. For example, ALWAYS means the service will not be terminated, and NEVER 
means that once all components have terminated, the service will be terminated. 

Some jobs/services need a different policy. For example, if the Tensorflow 
master component terminates (regardless of whether it succeeded or failed), we 
need to terminate the whole training job regardless of the states of the other 
components.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8488) Need to add "SUCCEED" state to YARN service

2018-07-02 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8488:


 Summary: Need to add "SUCCEED" state to YARN service
 Key: YARN-8488
 URL: https://issues.apache.org/jira/browse/YARN-8488
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Wangda Tan


Existing YARN service has following states:

{code} 
public enum ServiceState {
  ACCEPTED, STARTED, STABLE, STOPPED, FAILED, FLEX, UPGRADING,
  UPGRADING_AUTO_FINALIZE;
}
{code} 

Ideally we should add "SUCCEEDED" state in order to support long running 
applications like Tensorflow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8478) The capacity scheduler logs too frequently seriously affecting performance

2018-06-29 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8478.
--
Resolution: Duplicate

> The capacity scheduler logs too frequently seriously affecting performance
> --
>
> Key: YARN-8478
> URL: https://issues.apache.org/jira/browse/YARN-8478
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Reporter: YunFan Zhou
>Assignee: YunFan Zhou
>Priority: Critical
> Attachments: image-2018-06-29-14-08-50-981.png
>
>
> The capacity scheduler logs too frequently, seriously affecting performance.
> As a result of our tests, the scheduling speed of the capacity scheduler can 
> hardly reach 5000/s in the production scenario.
> And it will soon hit the log bottleneck.
> My current work is to change many log levels from INFO to DEBUG level.
> [~wangda] [~leftnoteasy] Any suggestions?
> !image-2018-06-29-14-08-50-981.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8466) Add Chaos Monkey unit test framework for validation in scale

2018-06-26 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8466:


 Summary: Add Chaos Monkey unit test framework for validation in 
scale
 Key: YARN-8466
 URL: https://issues.apache.org/jira/browse/YARN-8466
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Wangda Tan


Currently we don't have such a framework for testing. 

We need a framework to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-25 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8459:


 Summary: Capacity Scheduler should properly handle container 
allocation on app/node when app/node being removed by scheduler
 Key: YARN-8459
 URL: https://issues.apache.org/jira/browse/YARN-8459
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan
Assignee: Wangda Tan


Thanks [~gopalv] for reporting this issue. 

In async mode, the capacity scheduler can allocate/reserve containers on a 
node/app while the node/app is being removed 
({{doneApplicationAttempt}}/{{removeNode}}).

This will cause some issues, for example:

a. A container for app_1 is reserved on node_x.
b. At the same time, app_1 is being removed.
c. The reserve-on-node operation finishes after app_1 is removed 
({{doneApplicationAttempt}}). 

For all future runs, node_x is completely blocked by the invalid reservation. 
It keeps reporting "Trying to schedule for a finished app, please double check" 
for node_x.

We need a fix to make sure this won't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8417) Should skip passing HDFS_HOME, HADOOP_CONF_DIR, JAVA_HOME, etc. to Docker container.

2018-06-11 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8417:


 Summary: Should skip passing HDFS_HOME, HADOOP_CONF_DIR, 
JAVA_HOME, etc. to Docker container.
 Key: YARN-8417
 URL: https://issues.apache.org/jira/browse/YARN-8417
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


Currently, the YARN NM passes the JAVA_HOME, HDFS_HOME, CLASSPATH environments 
before launching a Docker container, no matter whether ENTRY_POINT is used or 
not. This will overwrite environments defined inside the Dockerfile (by using 
{{ENV}}). For Docker containers, it actually doesn't make sense to pass 
JAVA_HOME, HDFS_HOME, etc., because inside the docker image we would need a 
separate Java/Hadoop installation, or one mounted at exactly the same directory 
as on the host machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-02 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8220.
--
Resolution: Later

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec file will help to run Tensorflow on YARN with GPU/docker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8379) Add an option to allow Capacity Scheduler preemption to balance satisfied queues

2018-05-30 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8379:


 Summary: Add an option to allow Capacity Scheduler preemption to 
balance satisfied queues
 Key: YARN-8379
 URL: https://issues.apache.org/jira/browse/YARN-8379
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan
Assignee: Wangda Tan


The existing capacity scheduler only supports preemption that brings an 
underutilized queue up to its guaranteed resource. In addition to that, there's 
a requirement to get a better balance between queues when all of them have 
reached their guaranteed resource but hold different shares beyond it.

An example: 3 queues with capacities queue_a = 30%, queue_b = 30%, queue_c = 
40%. At time T, queue_a is using 30% and queue_b is using 70%. Existing 
scheduler preemption won't happen. But this is unfair to queue_a, since queue_a 
and queue_b have the same guaranteed resources.

Before YARN-5864, the capacity scheduler did additional preemption to balance 
queues. We changed the logic since it could preempt too many containers between 
queues when all queues were satisfied.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8343) YARN should have ability to run images only from a whitelist docker registries

2018-05-22 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8343:


 Summary: YARN should have ability to run images only from a 
whitelist docker registries
 Key: YARN-8343
 URL: https://issues.apache.org/jira/browse/YARN-8343
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


This is a superset of docker.privileged-containers.registries: the admin can 
specify a whitelist, and all images from registries outside the whitelist will 
be rejected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8342) Using docker image from a non-privileged registry, the launch_command is not honored

2018-05-22 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8342:


 Summary: Using docker image from a non-privileged registry, the 
launch_command is not honored
 Key: YARN-8342
 URL: https://issues.apache.org/jira/browse/YARN-8342
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


During testing of the Docker feature, I found that if a container image comes 
from a non-privileged docker registry, the specified launch command is ignored. 
The container succeeds without any log, which is very confusing to end users. 
And this behavior is inconsistent with containers from privileged docker 
registries.

cc: [~eyang], [~shaneku...@gmail.com], [~ebadger], [~jlowe]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8340) Capacity Scheduler Intra Queue Preemption Should Work When 3rd or more resources enabled.

2018-05-22 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8340:


 Summary: Capacity Scheduler Intra Queue Preemption Should Work 
When 3rd or more resources enabled.
 Key: YARN-8340
 URL: https://issues.apache.org/jira/browse/YARN-8340
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


Refer to comment from [~eepayne] and discussion below that: 
https://issues.apache.org/jira/browse/YARN-8292?focusedCommentId=16482689=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16482689
 for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8272) Several items are missing from Hadoop 3.1.0 documentation

2018-05-11 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8272.
--
Resolution: Duplicate

Closing as dup of HADOOP-15374

> Several items are missing from Hadoop 3.1.0 documentation
> -
>
> Key: YARN-8272
> URL: https://issues.apache.org/jira/browse/YARN-8272
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Wangda Tan
>Priority: Blocker
>
> From what I can see there're several missing items like GPU / FPGA: 
> http://hadoop.apache.org/docs/current/
> We should add them to hadoop-project/src/site/site.xml in the next release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8272) Several items are missing from Hadoop 3.1.0 documentation

2018-05-09 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8272:


 Summary: Several items are missing from Hadoop 3.1.0 documentation
 Key: YARN-8272
 URL: https://issues.apache.org/jira/browse/YARN-8272
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Reporter: Wangda Tan


From what I can see there're several missing items like GPU / FPGA: 
http://hadoop.apache.org/docs/current/

We should add them to hadoop-project/src/site/site.xml in the next release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8257) Native service should automatically adding escapes for environment/launch cmd before sending to YARN

2018-05-07 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8257:


 Summary: Native service should automatically adding escapes for 
environment/launch cmd before sending to YARN
 Key: YARN-8257
 URL: https://issues.apache.org/jira/browse/YARN-8257
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: Wangda Tan
Assignee: Gour Saha


Noticed this issue while using native service: 

Basically, when a string for an environment variable / launch command contains 
chars like ", /, `, it needs to be escaped twice.

The first escape is for the JSON spec: because JSON only accepts double quotes, 
the string needs to be escaped there.

The second escape is for container launch; what we do for the command line is 
(in ContainerLaunch.java):
{code:java}
line("exec /bin/bash -c \"", StringUtils.join(" ", command), "\"");{code}
And for environment:
{code:java}
line("export ", key, "=\"", value, "\"");{code}
An example of launch_command: 
{code:java}
"launch_command": "export CLASSPATH=\\`\\$HADOOP_HDFS_HOME/bin/hadoop classpath 
--glob\\`"{code}
And example of environment:
{code:java}
"TF_CONFIG" : "{\\\"cluster\\\": {\\\"master\\\": 
[\\\"master-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"ps\\\": 
[\\\"ps-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"worker\\\": 
[\\\"worker-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"]}, 
\\\"task\\\": {\\\"type\\\":\\\"${COMPONENT_NAME}\\\", 
\\\"index\\\":${COMPONENT_ID}}, \\\"environment\\\":\\\"cloud\\\"}",{code}

To improve usability, I think we should auto-escape the input string once. For 
example, if the user specified:
{code}
"TF_CONFIG": "\"key\""
{code}
We will automatically escape it to:
{code}
"TF_CONFIG": \\\"key\\\"
{code}
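
A minimal sketch of the one-pass auto-escaping idea, under the assumption that 
we escape values before they are wrapped in double quotes by ContainerLaunch 
(the helper name is illustrative, not an existing API):
{code:java}
// Hypothetical helper: escape a value once so it survives being embedded in
// export KEY="value" / exec /bin/bash -c "command" by ContainerLaunch.
public final class LaunchValueEscaper {
  private LaunchValueEscaper() {
  }

  public static String escapeOnce(String value) {
    StringBuilder sb = new StringBuilder(value.length());
    for (char c : value.toCharArray()) {
      // Characters that are special inside a double-quoted bash string.
      if (c == '"' || c == '\\' || c == '`' || c == '$') {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }
}
{code}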



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler

2018-04-11 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8149:


 Summary: Revisit behavior of Re-Reservation in Capacity Scheduler
 Key: YARN-8149
 URL: https://issues.apache.org/jira/browse/YARN-8149
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


Frankly speaking, I'm not sure why we need the re-reservation. The formula is 
not that easy to understand:

Inside: 
{{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}}
{code:java}
starvation = re-reservation / (#reserved-container *
    (1 - min(requested-resource / max-alloc,
             (max-alloc - min-alloc) / max-alloc)))
should_allocate = starvation + requiredContainers - reservedContainers > 0{code}
I think we should be able to remove the starvation computation; simply checking 
requiredContainers > reservedContainers should be enough, as sketched below.
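
A minimal sketch of the simplified check proposed above (illustrative; not the 
current RegularContainerAllocator code):
{code:java}
// Drop the starvation term entirely and only compare container counts.
static boolean shouldAllocOrReserveNewContainer(int requiredContainers,
    int reservedContainers) {
  return requiredContainers > reservedContainers;
}
{code}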

In a large cluster, the re-reservation count can easily overflow to MAX_INT; see 
YARN-7636. 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8141) YARN Native Service: Respect YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec

2018-04-10 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8141:


 Summary: YARN Native Service: Respect 
YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec
 Key: YARN-8141
 URL: https://issues.apache.org/jira/browse/YARN-8141
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: Wangda Tan


The existing YARN native service overwrites 
YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS regardless of whether the 
user specified it in the service spec. It is important to allow users to mount 
local files/folders like /etc/passwd.

The following logic overwrites the 
YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS environment variable:
{code:java}
StringBuilder sb = new StringBuilder();
for (Entry<String, String> mount : mountPaths.entrySet()) {
  if (sb.length() > 0) {
    sb.append(",");
  }
  sb.append(mount.getKey());
  sb.append(":");
  sb.append(mount.getValue());
}
env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS",
    sb.toString());{code}
(This code is inside AbstractLauncher.java.)
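
A minimal sketch of a possible fix (assumption: merge rather than overwrite — 
keep any mounts the user already specified in the service spec and append the 
framework-managed ones):
{code:java}
String userMounts =
    env.get("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS");
StringBuilder sb = new StringBuilder(userMounts == null ? "" : userMounts);
for (Entry<String, String> mount : mountPaths.entrySet()) {
  if (sb.length() > 0) {
    sb.append(",");
  }
  sb.append(mount.getKey()).append(":").append(mount.getValue());
}
env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS", sb.toString());
{code}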



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-09 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8135:


 Summary: Hadoop {Submarine} Project: Simple and scalable 
deployment of deep learning training / serving jobs on Hadoop
 Key: YARN-8135
 URL: https://issues.apache.org/jira/browse/YARN-8135
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: image-2018-04-09-14-35-16-778.png

Description:

*Goals:*
 - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs on 
YARN.
 - Allow jobs to easily access data/models in HDFS and other storage.
 - Can launch services to serve Tensorflow/MXNet models.
 - Support running distributed Tensorflow jobs with simple configs.
 - Support running user-specified Docker images.
 - Support specifying GPU and other resources.
 - Support launching TensorBoard if the user requests it.
 - Support customized DNS names for roles (like tensorboard.$user.$domain:6006).

*Why this name?*
 - Because a submarine is the only vehicle that can take humans to deep places. B-)

Comparison to other projects:

!image-2018-04-09-14-35-16-778.png!

*Notes:*

* GPU isolation in the XLearning project is achieved by a patched YARN, which is 
different from the community’s GPU isolation solution.

** XLearning needs a few modifications to read ClusterSpec from env.

*References:*

- TensorflowOnSpark (Yahoo): https://github.com/yahoo/TensorFlowOnSpark
- TensorFlowOnYARN (Intel): https://github.com/Intel-bigdata/TensorFlowOnYARN
- Spark Deep Learning (Databricks): 
https://github.com/databricks/spark-deep-learning
- XLearning (Qihoo360): https://github.com/Qihoo360/XLearning
- Kubeflow (Google): https://github.com/kubeflow/kubeflow



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8109) Resource Manager WebApps fails to start due to ConcurrentModificationException

2018-04-02 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8109:


 Summary: Resource Manager WebApps  fails to start due to 
ConcurrentModificationException
 Key: YARN-8109
 URL: https://issues.apache.org/jira/browse/YARN-8109
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Wangda Tan


{code}
2018-03-22 04:57:39,289 INFO  resourcemanager.ResourceTrackerService 
(ResourceTrackerService.java:nodeHeartbeat(497)) - Node not found resyncing 
ctr-e138-1518143905142-129550-01-36.hwx.site:25454
2018-03-22 04:57:39,294 INFO  service.AbstractService 
(AbstractService.java:noteFailure(272)) - Service ResourceManager failed in 
state STARTED; cause: java.util.ConcurrentModificationException
java.util.ConcurrentModificationException
at java.util.Hashtable$Enumerator.next(Hashtable.java:1378)
at 
org.apache.hadoop.conf.Configuration.iterator(Configuration.java:2564)
at 
org.apache.hadoop.conf.Configuration.getPropsWithPrefix(Configuration.java:2583)
at 
org.apache.hadoop.yarn.webapp.WebApps$Builder.getConfigParameters(WebApps.java:386)
at org.apache.hadoop.yarn.webapp.WebApps$Builder.build(WebApps.java:334)
at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:395)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:1049)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1152)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1293)
2018-03-22 04:57:39,296 INFO  ipc.Server (Server.java:stop(2752)) - Stopping 
server on 8050
2018-03-22 04:57:39,300 INFO  ipc.Server (Server.java:run(932)) - Stopping IPC 
Server listener on 8050
2018-03-22 04:57:39,301 INFO  ipc.Server (Server.java:run(1069)) - Stopping IPC 
Server Responder
{code} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8091) Revisit checkUserAccessToQueue RM REST API

2018-03-29 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8091:


 Summary: Revisit checkUserAccessToQueue RM REST API
 Key: YARN-8091
 URL: https://issues.apache.org/jira/browse/YARN-8091
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Wangda Tan
Assignee: Wangda Tan


As suggested offline by [~sershe]: the current design of checkUserAccessToQueue 
mixes config-related issues (like the user not having access to the URL) with 
user-facing output (like the requested user not being permitted to access the 
queue) in the same code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-5881) Enable configuration of queue capacity in terms of absolute resources

2018-03-28 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-5881.
--
Resolution: Done

> Enable configuration of queue capacity in terms of absolute resources
> -
>
> Key: YARN-5881
> URL: https://issues.apache.org/jira/browse/YARN-5881
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Sean Po
>Assignee: Sunil G
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: 
> YARN-5881.Support.Absolute.Min.Max.Resource.In.Capacity.Scheduler.design-doc.v1.pdf,
>  YARN-5881.v0.patch, YARN-5881.v1.patch
>
>
> Currently, Yarn RM supports the configuration of queue capacity in terms of a 
> proportion to cluster capacity. In the context of Yarn being used as a public 
> cloud service, it makes more sense if queues can be configured absolutely. 
> This will allow administrators to set usage limits more concretely and 
> simplify customer expectations for cluster allocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8084) Yarn native service rename for easier development?

2018-03-28 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8084:


 Summary: Yarn native service rename for easier development?
 Key: YARN-8084
 URL: https://issues.apache.org/jira/browse/YARN-8084
 Project: Hadoop YARN
  Issue Type: Task
 Environment: There are a couple of classes with the same name in 
YARN native service, such as: 
1) ...service.component.Component and api.records.Component.
This makes development in an IDE harder, since the class-name clash forces the 
use of fully qualified class names.

Similarly in the API definition:
...service.api.records:
Container/ContainerState/Resource/ResourceInformation. How about renaming them to:
ServiceContainer/ServiceContainerState/ServiceResource/ServiceResourceInformation?
Reporter: Wangda Tan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8080) YARN native service should support component restart policy

2018-03-27 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8080:


 Summary: YARN native service should support component restart 
policy
 Key: YARN-8080
 URL: https://issues.apache.org/jira/browse/YARN-8080
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-8080.001.patch

The existing native service assumes the service is long running and never 
finishes. Containers will be restarted even if the exit code == 0. 

To support broader use cases, we need to allow the restart policy of a component 
to be specified by users. I propose the following policies (see the sketch after 
the lists below):
1) Always: containers are always restarted by the framework regardless of 
container exit status. This is the existing/default behavior.
2) Never: do not restart containers in any case after a container finishes, to 
support job-like workloads (for example a Tensorflow training job). If a task 
exits with code == 0, we should not restart it. This can also be used by 
services which are not restartable/recoverable.
3) On-failure: similar to the above, but restart the task only when the exit 
code != 0. 

Behaviors after a component *instance* finalizes (Succeeded or Failed when 
restart_policy != ALWAYS): 
1) For a single component with a single instance: complete the service.
2) For a single component with multiple instances: other running instances from 
the same component won't be affected by the finalized instance. The service 
will be terminated once all instances have finalized. 
3) For multiple components: the service will be terminated once all components 
have finalized.
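
A minimal sketch of the proposed restart policies (names and method are 
illustrative, not a final API):
{code:java}
enum RestartPolicy { ALWAYS, NEVER, ON_FAILURE }

// Decide whether a finished container should be relaunched.
static boolean shouldRelaunch(RestartPolicy policy, int exitStatus) {
  switch (policy) {
  case ALWAYS:
    return true;               // existing/default behavior
  case NEVER:
    return false;              // job-like workloads: never restart
  case ON_FAILURE:
    return exitStatus != 0;    // restart only failed tasks
  default:
    return true;
  }
}
{code}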



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8079) YARN native service should respect source file of ConfigFile inside Service/Component spec

2018-03-27 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8079:


 Summary: YARN native service should respect source file of 
ConfigFile inside Service/Component spec
 Key: YARN-8079
 URL: https://issues.apache.org/jira/browse/YARN-8079
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan
Assignee: Wangda Tan


Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly 
read srcFile; instead it always constructs {{remoteFile}} by using the component 
dir and the file name of {{destFile}}:

{code}
Path remoteFile = new Path(compInstanceDir, fileName);
{code} 

To me it is a common use case that services have some files in HDFS which need 
to be localized when components get launched. For example, if we want to serve a 
Tensorflow model, we need to localize the model (typically not huge, less than a 
GB) to local disk; otherwise the launched docker container has to access HDFS.
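
A minimal sketch of the proposed behavior (assuming the ConfigFile record 
exposes the configured source file): prefer the user-provided srcFile and fall 
back to the per-instance directory only when no source is given.
{code:java}
Path remoteFile;
String srcFile = configFile.getSrcFile();
if (srcFile != null && !srcFile.isEmpty()) {
  // Localize the file the user already has in HDFS.
  remoteFile = new Path(srcFile);
} else {
  // Existing behavior: look under the component instance dir.
  remoteFile = new Path(compInstanceDir, fileName);
}
{code}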



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-5983) [Umbrella] Support for FPGA as a Resource in YARN

2018-03-22 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-5983.
--
   Resolution: Done
Fix Version/s: 3.1.0

Since this feature works end to end and landed in 3.1.0, closing the umbrella 
as done.

> [Umbrella] Support for FPGA as a Resource in YARN
> -
>
> Key: YARN-5983
> URL: https://issues.apache.org/jira/browse/YARN-5983
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-5983-Support-FPGA-resource-on-NM-side_v1.pdf, 
> YARN-5983-implementation-notes.pdf, YARN-5983_end-to-end_test_report.pdf
>
>
> As various big data workloads run on YARN, CPU alone will eventually no longer 
> scale and heterogeneous systems will become more important. ML/DL is a rising 
> star in recent years; applications focused on these areas have to utilize GPU 
> or FPGA to boost performance. Also, hardware vendors such as Intel invest in 
> such hardware. It is most likely that FPGA will become as popular in data 
> centers as CPU in the near future.
> So it would be great for YARN, as a resource managing and scheduling system, 
> to evolve to support this. This JIRA proposes making FPGA a first-class 
> citizen. The changes roughly include:
> 1. FPGA resource detection and heartbeat
> 2. Scheduler changes (YARN-3926 invlolved)
> 3. FPGA related preparation and isolation before launch container
> We know that YARN-3926 is trying to extend the current resource model, but we 
> can still leave some FPGA-related discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2018-03-22 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-6223.
--
   Resolution: Done
Fix Version/s: 3.1.0

Closing as done since all sub tasks are done.

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch, YARN-6223.wip.2.patch, YARN-6223.wip.3.patch
>
>
> As a variety of workloads move to YARN, including machine learning / deep 
> learning which can be sped up by leveraging GPU computation power, workloads 
> should be able to request GPU from YARN as simply as CPU and memory.
> *To make a complete GPU story, we should support the following pieces:*
> 1) GPU discovery/configuration: the admin can either configure GPU resources 
> and architectures on each node, or, more advanced, the NodeManager can 
> automatically discover GPU resources and architectures and report them to the 
> ResourceManager.
> 2) GPU scheduling: the YARN scheduler should account for GPU as a resource 
> type just like CPU and memory.
> 3) GPU isolation/monitoring: once a task is launched with GPU resources, the 
> NodeManager should properly isolate and monitor the task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There are a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-5326) Support for recurring reservations in the YARN ReservationSystem

2018-03-21 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-5326.
--
Resolution: Done

> Support for recurring reservations in the YARN ReservationSystem
> 
>
> Key: YARN-5326
> URL: https://issues.apache.org/jira/browse/YARN-5326
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Subru Krishnan
>Assignee: Carlo Curino
>Priority: Major
> Attachments: SupportRecurringReservationsInRayon.pdf
>
>
> YARN-1051 introduced a ReservationSytem that enables the YARN RM to handle 
> time explicitly, i.e. users can now "reserve" capacity ahead of time which is 
> predictably allocated to them. Most SLA jobs/workflows are recurring so they 
> need the same resources periodically. With the current implementation, users 
> will have to make individual reservations for each run. This is an umbrella 
> JIRA to enhance the reservation system by adding native support for recurring 
> reservations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7303) Merge YARN-5734 branch to trunk branch

2018-03-21 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-7303.
--
Resolution: Done

Closing as "done" since there's no patch committed with the Jira.

> Merge YARN-5734 branch to trunk branch
> --
>
> Key: YARN-7303
> URL: https://issues.apache.org/jira/browse/YARN-7303
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7873) Revert YARN-6078

2018-03-21 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-7873.
--
Resolution: Invalid

> Revert YARN-6078
> 
>
> Key: YARN-7873
> URL: https://issues.apache.org/jira/browse/YARN-7873
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
>Priority: Blocker
> Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.1
>
>
> I think we should revert YARN-6078, since it is not working as intended. The 
> NM does not have permission to destroy the process of the ContainerLocalizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8046) Revisit RMWebServiceProtocol implementations

2018-03-18 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8046:


 Summary: Revisit RMWebServiceProtocol implementations
 Key: YARN-8046
 URL: https://issues.apache.org/jira/browse/YARN-8046
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wangda Tan


I recently found that the new changes to RMWebServiceProtocol make adding any new 
REST API pretty hard. There are at least 6 classes that need to be implemented:

1. {{MockRESTRequestInterceptor}}
2. {{FederationInterceptorREST}}
3. {{DefaultRequestInterceptorREST}}
4. {{PassThroughRESTRequestInterceptor}}
5. {{RouterWebServices}}
6. {{RMWebServices}}

Different classes' implementations have different styles, so simple copy-paste is 
not enough. For example:

{{DefaultRequestInterceptorREST}} uses {{RouterWebServiceUtil.genericForward}} 
to pass all parameters, which requires understanding how each REST API works and 
reconstructing a URL, which can easily cause issues.

I think we should revisit these APIs and make sure a new API can be added to the 
REST interface as easily as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8028) Support authorizeUserAccessToQueue in RMWebServices

2018-03-13 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8028:


 Summary: Support authorizeUserAccessToQueue in RMWebServices
 Key: YARN-8028
 URL: https://issues.apache.org/jira/browse/YARN-8028
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wangda Tan
Assignee: Wangda Tan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7920) Cleanup configuration of PlacementConstraints

2018-02-10 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7920:


 Summary: Cleanup configuration of PlacementConstraints
 Key: YARN-7920
 URL: https://issues.apache.org/jira/browse/YARN-7920
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan


Currently it is very confusing to have the two configs in two different files 
(yarn-site.xml and capacity-scheduler.xml). 
 
Maybe a better approach is: we can delete scheduling-request.allowed in CS, 
and update the placement-constraints configs in yarn-site.xml a bit: 
 
- Remove placement-constraints.enabled and add a new 
placement-constraints.handler which defaults to none; other acceptable values 
are a) external-processor (since "algorithm" is too generic to me), b) scheduler. 
- And add a new PlacementProcessor just to pass the SchedulingRequest to the 
scheduler without any modifications.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7854) Attach prefixes to different type of node attributes

2018-02-08 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-7854.
--
Resolution: Later

> Attach prefixes to different type of node attributes
> 
>
> Key: YARN-7854
> URL: https://issues.apache.org/jira/browse/YARN-7854
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: RM
>Reporter: Weiwei Yang
>Assignee: LiangYe
>Priority: Major
>
> There are multiple types of node attributes depending on which source it 
> comes from, includes
>  # Centralized: attributes set by users (admin or normal users)
>  # Distributed: attributes collected by a certain attribute provider on each 
> NM
>  # System: some built-in attributes in yarn, set by yarn internal components, 
> e.g scheduler
> To better manage these attributes, we introduce the prefix (namespace) 
> concept to an attribute. This Jira is opened to figure out how to attach 
> prefixes (automatically/implicitly or explicitly) to different types of 
> attributes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7759) [UI2]GPU chart shows as "Available: 0" even though GPU is available

2018-01-24 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-7759.
--
Resolution: Duplicate

Duplicated by YARN-7817

> [UI2]GPU chart shows as "Available: 0" even though GPU is available
> ---
>
> Key: YARN-7759
> URL: https://issues.apache.org/jira/browse/YARN-7759
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Vasudevan Skm
>Priority: Major
>
> The GPU chart under the Node Manager page shows zero GPUs available even though 
> GPUs are present. Only when we click the 'GPU Information' chart does it show 
> correct GPU information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7817) Add Resource reference to RM's NodeInfo object so REST API can get non memory/vcore resource usages.

2018-01-24 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7817:


 Summary: Add Resource reference to RM's NodeInfo object so REST 
API can get non memory/vcore resource usages.
 Key: YARN-7817
 URL: https://issues.apache.org/jira/browse/YARN-7817
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Sumana Sathish
Assignee: Wangda Tan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7807) By default do intra-app anti-affinity for scheduling request inside app placement allocator

2018-01-24 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7807:


 Summary: By default do intra-app anti-affinity for scheduling 
request inside app placement allocator
 Key: YARN-7807
 URL: https://issues.apache.org/jira/browse/YARN-7807
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan


See discussion on: 
https://issues.apache.org/jira/browse/YARN-7791?focusedCommentId=16336857=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16336857

We need to make changes to AppPlacementAllocator so that default target 
allocation tags are treated as intra-app.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7801) AmFilterInitializer should addFilter after fill all parameters

2018-01-23 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7801:


 Summary: AmFilterInitializer should addFilter after fill all 
parameters
 Key: YARN-7801
 URL: https://issues.apache.org/jira/browse/YARN-7801
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan
Assignee: Wangda Tan


The existing AmFilterInitializer cannot successfully pass the RM_HA_URLS 
parameter to AmIpFilter because of this issue.
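
A minimal sketch of the intended ordering (parameter and filter names are 
illustrative of the AM proxy filter setup, not a verbatim patch): fill the whole 
parameter map first, then call addFilter, so late entries such as RM_HA_URLS are 
not dropped.
{code:java}
Map<String, String> params = new HashMap<>();
params.put("PROXY_HOSTS", proxyHosts);
params.put("PROXY_URI_BASES", proxyUriBases);
params.put("RM_HA_URLS", rmHaUrls);  // must be present before addFilter runs
// Register the filter only after every parameter has been filled in.
container.addFilter("AM_PROXY_FILTER", amIpFilterClassName, params);
{code}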



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures

2018-01-22 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7790:


 Summary: Improve Capacity Scheduler Async Scheduling to better 
handle node failures
 Key: YARN-7790
 URL: https://issues.apache.org/jira/browse/YARN-7790
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan
Assignee: Wangda Tan


This is not a new issue, but async scheduling makes it worse:

In sync scheduling, if an AM container is allocated to a node, we know the node 
just heartbeated to the RM, and the container will be sent back to the NM in the 
same response. It is still possible that the NM crashes right after the 
heartbeat, which causes the AM to hang for 10 mins, but that is relatively rare.

In the async scheduling world, multiple AM containers can be placed on a 
problematic NM, which could cause applications to hang for a long time. After 
discussing with [~sunilg], we need at least two fixes (sketched below):

When async scheduling is enabled:
1) Skip nodes which have missed X node heartbeats.
2) Kill AM containers in ALLOCATED state on a node which has missed Y node 
heartbeats.
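
A minimal, self-contained sketch of the two checks (thresholds, callbacks and 
naming are illustrative, not the actual CapacityScheduler API):
{code:java}
static void checkNodeForAsyncScheduling(long nowMs, long lastHeartbeatMs,
    long heartbeatIntervalMs, int skipThresholdX, int killThresholdY,
    Runnable skipNodeForAllocation, Runnable killAllocatedAmContainers) {
  long missed = (nowMs - lastHeartbeatMs) / heartbeatIntervalMs;
  if (missed >= skipThresholdX) {
    // 1) Do not place new containers on this node.
    skipNodeForAllocation.run();
  }
  if (missed >= killThresholdY) {
    // 2) Kill AM containers still in ALLOCATED state on this node.
    killAllocatedAmContainers.run();
  }
}
{code}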



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7789) Should fail RM if 3rd resource type is configured but RM uses DefaultResourceCalculator

2018-01-22 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7789:


 Summary: Should fail RM if 3rd resource type is configured but RM 
uses DefaultResourceCalculator
 Key: YARN-7789
 URL: https://issues.apache.org/jira/browse/YARN-7789
 Project: Hadoop YARN
  Issue Type: Sub-task
 Environment: We may need to revisit this behavior: currently, the RM 
doesn't fail if a 3rd resource type is configured; allocated containers will be 
automatically assigned the minimum allocation for all resource types except 
memory, which makes troubleshooting really hard. I prefer to fail the RM if a 
3rd or more resource type is configured inside resource-types.xml while 
DefaultResourceCalculator is in use. 
Reporter: Wangda Tan
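
A minimal sketch of the proposed fail-fast validation at RM/scheduler startup 
(placement and exception type are illustrative; the utility methods referenced 
are assumptions about where such a check could live):
{code:java}
// Fail fast: DefaultResourceCalculator only understands memory, so refuse to
// start when extra resource types are configured in resource-types.xml.
if (calculator instanceof DefaultResourceCalculator
    && ResourceUtils.getNumberOfKnownResourceTypes() > 2) {
  throw new YarnRuntimeException("resource-types.xml configures more than "
      + "memory and vcores, but DefaultResourceCalculator is in use; "
      + "switch to DominantResourceCalculator.");
}
{code}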






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7763) Refactoring PlacementConstraintUtils APIs so PlacementProcessor/Scheduler can use the same API and implementation

2018-01-16 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7763:


 Summary: Refactoring PlacementConstraintUtils APIs so 
PlacementProcessor/Scheduler can use the same API and implementation
 Key: YARN-7763
 URL: https://issues.apache.org/jira/browse/YARN-7763
 Project: Hadoop YARN
  Issue Type: Sub-task
 Environment: As I mentioned on YARN-6599, we will add 
SchedulingRequest as part of the PlacementConstraintUtil method, and both the 
processor and scheduler implementations will use the same logic. The logic looks 
like:
{code:java}
PlacementConstraint pc = schedulingRequest.getPlacementConstraint();
if (pc == null) {
  // Fall back to the constraints registered with the PlacementConstraintMgr
  // for the request's allocation tags.
  pc = PlacementConstraintMgr.getPlacementConstraint(
      schedulingRequest.getAllocationTags());
}

// Do placement constraint match ...{code}
Reporter: Wangda Tan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7739) Revisit scheduler resource normalization behavior for max allocation

2018-01-11 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7739:


 Summary: Revisit scheduler resource normalization behavior for max 
allocation
 Key: YARN-7739
 URL: https://issues.apache.org/jira/browse/YARN-7739
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan
Priority: Critical


Currently, the YARN scheduler normalizes the requested resource based on the 
maximum allocation derived from the configured maximum allocation and the 
maximum registered node resources. Basically, the scheduler silently caps the 
requested resource at the maximum allocation.

This could cause issues for applications. For example, a Spark job needs 12 GB 
of memory to run, but the registered NMs in the cluster have at most 8 GB of 
memory on each node, so the scheduler allocates an 8 GB memory container to the 
requesting application.

Once the app receives containers from the RM, if it doesn't double-check the 
allocated resources, it ends up with OOM errors that are hard to debug, because 
the scheduler silently capped the allocation.
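
A minimal sketch of the AM-side double check mentioned above (illustrative only; 
this is a client-side guard, not the scheduler-side behavior change under 
discussion):
{code:java}
// Fail fast when the granted container is smaller than what was requested,
// instead of running the task and hitting a hard-to-debug OOM later.
static void verifyAllocation(long requestedMemoryMb, long allocatedMemoryMb) {
  if (allocatedMemoryMb < requestedMemoryMb) {
    throw new IllegalStateException("Requested " + requestedMemoryMb
        + " MB but the scheduler normalized the allocation down to "
        + allocatedMemoryMb + " MB.");
  }
}
{code}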




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7723) Avoid using docker volume --format option to compatible to older docker releases

2018-01-09 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7723:


 Summary: Avoid using docker volume --format option to compatible 
to older docker releases
 Key: YARN-7723
 URL: https://issues.apache.org/jira/browse/YARN-7723
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan
Assignee: Wangda Tan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7718) DistributedShell failed to specify resource other than memory/vcores from container_resources

2018-01-08 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7718:


 Summary: DistributedShell failed to specify resource other than 
memory/vcores from container_resources
 Key: YARN-7718
 URL: https://issues.apache.org/jira/browse/YARN-7718
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan
Priority: Critical


After YARN-7242, there is a bug reading resource values other than memory/vcores.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7709) Remove SELF from TargetExpression type .

2018-01-06 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7709:


 Summary: Remove SELF from TargetExpression type .
 Key: YARN-7709
 URL: https://issues.apache.org/jira/browse/YARN-7709
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Priority: Blocker


As mentioned by [~asuresh], SELF means the target allocation tag is the same as 
the allocation tag of the scheduling request itself. So this is not a new type 
for sure; it is still the ALLOCATION_TAG type.

If we really want this functionality, we can build it into PlacementConstraints, 
but I'm doubtful about this since copying allocation tags from the source is 
trivial work.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7416) Use "docker volume inspect" to make sure that volumes for GPU drivers/libs are properly mounted.

2017-12-06 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-7416.
--
Resolution: Duplicate

Duplicated by YARN-7487.

> Use "docker volume inspect" to make sure that volumes for GPU drivers/libs 
> are properly mounted. 
> -
>
> Key: YARN-7416
> URL: https://issues.apache.org/jira/browse/YARN-7416
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7509) AsyncScheduleThread and ResourceCommitterService are still running after RM is transitioned to standby

2017-11-29 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-7509.
--
   Resolution: Fixed
Fix Version/s: (was: 3.0.1)
   3.0.0

> AsyncScheduleThread and ResourceCommitterService are still running after RM 
> is transitioned to standby
> --
>
> Key: YARN-7509
> URL: https://issues.apache.org/jira/browse/YARN-7509
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha4, 2.9.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.0.0, 3.1.0, 2.9.1
>
> Attachments: YARN-7509.001.patch
>
>
> After RM is transitioned to standby, AsyncScheduleThread and 
> ResourceCommitterService will receive an interrupt signal. When the thread is 
> sleeping, it will ignore the interrupt signal since InterruptedException is 
> caught inside and the interrupt signal is cleared.
> For AsyncScheduleThread, InterruptedException was caught and ignored in 
> CapacityScheduler#schedule.
> For ResourceCommitterService, InterruptedException was caught inside and 
> ignored in ResourceCommitterService#run. 
> We should let the interrupt signal out and make these threads exit.
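
A minimal sketch of the fix pattern described above (illustrative, not the exact 
patch): re-assert the interrupt flag so the loop condition can observe it and 
exit instead of swallowing the InterruptedException.
{code:java}
while (!Thread.currentThread().isInterrupted()) {
  try {
    Thread.sleep(intervalMs);
    schedule();
  } catch (InterruptedException e) {
    // Restore the flag so the while-condition sees the interrupt and exits.
    Thread.currentThread().interrupt();
  }
}
{code}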



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7555) Support multiple resource types in YARN native services

2017-11-22 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7555:


 Summary: Support multiple resource types in YARN native services
 Key: YARN-7555
 URL: https://issues.apache.org/jira/browse/YARN-7555
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: Wangda Tan
Assignee: Wangda Tan
Priority: Critical


We need to support specifying multiple resource types in addition to memory/cpu 
in YARN native services.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7522) Add application tags manager implementation

2017-11-16 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7522:


 Summary: Add application tags manager implementation
 Key: YARN-7522
 URL: https://issues.apache.org/jira/browse/YARN-7522
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan


This is different from YARN-6596: YARN-6596 is targeted at adding a constraint 
manager to store intra/inter-application placement constraints, while this JIRA 
is targeted at supporting storing maps between container-tags/applications and 
nodes. This will be required by the affinity/anti-affinity implementation and by 
cardinality constraints.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7487) Make sure volume includes GPU base libraries exists after created by plugin

2017-11-13 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7487:


 Summary: Make sure volume includes GPU base libraries exists after 
created by plugin
 Key: YARN-7487
 URL: https://issues.apache.org/jira/browse/YARN-7487
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan


YARN-7224 will create a docker volume that includes the GPU base libraries when 
launching a docker container which needs GPU. 

This JIRA will add the necessary checks to make sure the docker volume exists 
before launching the container, to reduce debugging effort if the container 
fails.
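
A minimal sketch of such a check (illustrative; the real change would go through 
the container runtime, and error handling is omitted): run "docker volume 
inspect" and treat a non-zero exit code as "volume missing".
{code:java}
static boolean dockerVolumeExists(String volumeName)
    throws IOException, InterruptedException {
  Process p = new ProcessBuilder("docker", "volume", "inspect", volumeName)
      .redirectErrorStream(true)
      .start();
  // "docker volume inspect" exits non-zero when the volume does not exist.
  return p.waitFor() == 0;
}
{code}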



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7457) Delay scheduling should be an individual policy instead of part of scheduler implementation

2017-11-07 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7457:


 Summary: Delay scheduling should be an individual policy instead 
of part of scheduler implementation
 Key: YARN-7457
 URL: https://issues.apache.org/jira/browse/YARN-7457
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan


Currently, different schedulers have slightly different delay scheduling 
implementations. Ideally we should make delay scheduling independent from the 
scheduler implementation. Benefits of doing this (a rough interface sketch 
follows below):

1) Applications can choose which delay scheduling policy to use; it could be 
time-based, missed-opportunity-based, or whatever new delay scheduling policy is 
supported by the cluster. Today it is a global config of the scheduler.

2) Scheduler implementations become simpler and reusable.
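
A minimal sketch of what an extracted policy interface could look like (names 
are illustrative, not an agreed-upon API): the scheduler asks the policy whether 
an allocation may relax from node-local to rack-local or off-switch, independent 
of how the policy counts delay.
{code:java}
interface DelaySchedulingPolicy {
  // Called on every allocation attempt; implementations may be time-based or
  // missed-opportunity-based.
  boolean canRelaxLocality(String queueName, long waitedMs,
      int missedOpportunities);
}
{code}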



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7442) [YARN-7069] Limit format of resource type name

2017-11-03 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7442:


 Summary: [YARN-7069] Limit format of resource type name
 Key: YARN-7442
 URL: https://issues.apache.org/jira/browse/YARN-7442
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Priority: Blocker


I think we should limit the format of resource type names; otherwise it could be 
very hard to change after release. 

I propose the following format:

{code}
[a-zA-Z0-9][a-zA-Z0-9_.-/]*
{code}

Adding this check to setResourceInformation might hurt performance a lot. 
Probably we can add it to {{ResourceUtils#initializeResourcesMap}} when resource 
types are loaded from the config file, as sketched below.
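
A minimal sketch of the validation (illustrative placement; the regex is the one 
proposed above, with '-' escaped so it stays a literal inside the character 
class):
{code:java}
private static final Pattern RESOURCE_NAME_PATTERN =
    Pattern.compile("[a-zA-Z0-9][a-zA-Z0-9_.\\-/]*");

static void validateResourceTypeName(String name) {
  if (!RESOURCE_NAME_PATTERN.matcher(name).matches()) {
    throw new IllegalArgumentException("Invalid resource type name: " + name);
  }
}
{code}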

[~templedf]/[~sunilg].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7438) Additional changes to make SchedulingPlacementSet agnostic to ResourceRequest / placement algorithm

2017-11-03 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7438:


 Summary: Additional changes to make SchedulingPlacementSet 
agnostic to ResourceRequest / placement algorithm
 Key: YARN-7438
 URL: https://issues.apache.org/jira/browse/YARN-7438
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan
Priority: Major


In addition to YARN-6040, we need to make changes to SchedulingPlacementSet 
to make it: 

1) Agnostic to ResourceRequest (so once we have YARN-6592 merged, we can add a 
new SchedulingPlacementSet implementation in parallel with 
LocalitySchedulingPlacementSet to use/manage the new requests API).

2) Agnostic to the placement algorithm (now it is bound to delay scheduling; we 
should update the APIs to make sure new placement algorithms, such as complex 
placement algorithms, can be implemented by using SchedulingPlacementSet).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7437) Give SchedulingPlacementSet to a better name.

2017-11-03 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7437:


 Summary: Give SchedulingPlacementSet to a better name.
 Key: YARN-7437
 URL: https://issues.apache.org/jira/browse/YARN-7437
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan
Priority: Major


Currently, SchedulingPlacementSet is very confusing. Here are its 
responsibilities:

1) Store ResourceRequests (or SchedulingRequests after YARN-6592).
2) Decide the order of nodes to allocate when there are multiple node candidates.
3) Decide if we should reject a node for given requests.
4) Store any state/cache that can help make decisions for #2/#3.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-5908) Add affinity/anti-affinity field to ResourceRequest API

2017-11-03 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-5908.
--
Resolution: Duplicate

Duplicated to YARN-6952

> Add affinity/anti-affinity field to ResourceRequest API
> ---
>
> Key: YARN-5908
> URL: https://issues.apache.org/jira/browse/YARN-5908
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7318) Fix shell check warnings of SLS.

2017-10-11 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7318:


 Summary: Fix shell check warnings of SLS.
 Key: YARN-7318
 URL: https://issues.apache.org/jira/browse/YARN-7318
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


Warnings like: 
{code}
hadoop-tools/hadoop-sls/src/main/bin/rumen2sls.sh:75:77: warning: args is 
referenced but not assigned. [SC2154]
hadoop-tools/hadoop-sls/src/main/bin/slsrun.sh:113:61: warning: args is 
referenced but not assigned. [SC2154]
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-4122) Add support for GPU as a resource

2017-10-11 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-4122.
--
Resolution: Duplicate

This is duplicated by YARN-6620, closing as dup.

> Add support for GPU as a resource
> -
>
> Key: YARN-4122
> URL: https://issues.apache.org/jira/browse/YARN-4122
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: GPUAsAResourceDesign.pdf
>
>
> Use [cgroups 
> devices|https://www.kernel.org/doc/Documentation/cgroups/devices.txt] to 
> isolate GPUs for containers. For docker containers, we could use 'docker run 
> --device=...'.
> Reference: [SLURM Resources isolation through 
> cgroups|http://slurm.schedmd.com/slurm_ug_2011/SLURM_UserGroup2011_cgroups.pdf].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7307) Revisit resource-types.xml loading behaviors

2017-10-09 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7307:


 Summary: Revisit resource-types.xml loading behaviors
 Key: YARN-7307
 URL: https://issues.apache.org/jira/browse/YARN-7307
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Sunil G


The existing feature requires every client to have a resource-types.xml in order 
to use multiple resource types; should we allow clients/AMs to update supported 
resource types via YARN APIs?




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7292) Revisit Resource Profile Behavior

2017-10-05 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7292:


 Summary: Revisit Resource Profile Behavior
 Key: YARN-7292
 URL: https://issues.apache.org/jira/browse/YARN-7292
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan
Priority: Blocker


Had discussions with [~templedf], [~vvasudev], [~sunilg] offline. There are a 
couple of resource-profile-related behaviors to revisit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7249) Fix CapacityScheduler NPE issue when a container preempted while the node is being removed

2017-09-25 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-7249.
--
Resolution: Invalid

Sorry for the noise; it is not an issue for 2.8 either. Closing as invalid.

> Fix CapacityScheduler NPE issue when a container preempted while the node is 
> being removed
> --
>
> Key: YARN-7249
> URL: https://issues.apache.org/jira/browse/YARN-7249
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.1
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
>
> This issue could happen when 3 conditions are satisfied:
> 1) A node is being removed from the scheduler.
> 2) A container running on the node is being preempted. 
> 3) A rare race condition causes the scheduler to pass a null node to the leaf queue.
> The fix for the problem is to add a null-node check inside CapacityScheduler.
> Stack trace:
> {code}
> 2017-08-31 02:51:24,748 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:run(714)) - Error in handling event type 
> KILL_RESERVED_CONTAINER to the scheduler 
> java.lang.NullPointerException 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1308)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.killReservedContainer(CapacityScheduler.java:1505)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1341)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:705)
>  
> {code}
> This is an issue that only exists in 2.8.x.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7249) Fix CapacityScheduler NPE issue when a container preempted while the node is being removed

2017-09-25 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7249:


 Summary: Fix CapacityScheduler NPE issue when a container 
preempted while the node is being removed
 Key: YARN-7249
 URL: https://issues.apache.org/jira/browse/YARN-7249
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.1
Reporter: Wangda Tan
Assignee: Wangda Tan
Priority: Blocker


This issue could happen when 3 conditions are satisfied:

1) A node is being removed from the scheduler.
2) A container running on the node is being preempted. 
3) A rare race condition causes the scheduler to pass a null node to the leaf 
queue.

The fix for the problem is to add a null-node check inside CapacityScheduler.

Stack trace:
{code}
2017-08-31 02:51:24,748 FATAL resourcemanager.ResourceManager 
(ResourceManager.java:run(714)) - Error in handling event type 
KILL_RESERVED_CONTAINER to the scheduler 
java.lang.NullPointerException 
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1308)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.killReservedContainer(CapacityScheduler.java:1505)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1341)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:705)
 
{code}

This is an issue that only exists in 2.8.x.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7242) Support support specify values of different resource types in DistributedShell for easier testing

2017-09-21 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7242:


 Summary: Support support specify values of different resource 
types in DistributedShell for easier testing
 Key: YARN-7242
 URL: https://issues.apache.org/jira/browse/YARN-7242
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan


Currently, DS supports specifying a resource profile; it's better to also allow 
users to directly specify resource keys/values from the command line.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7237) Cleanup usages of ResourceProfiles

2017-09-21 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7237:


 Summary: Cleanup usages of ResourceProfiles
 Key: YARN-7237
 URL: https://issues.apache.org/jira/browse/YARN-7237
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan
Priority: Critical


While doing tests, I found a couple of issues:
1) When using {{ProfileCapability#getProfileCapabilityOverride}}, it overwrites 
whatever is specified in resource-profiles.json when the value is >= 0, which is 
different from the javadocs of {{ProfileCapability}}: 

bq. For example, if you have a resource profile "small" that maps to <4096M, 2 
cores, 1 gpu> and you set the capability override to <8192M, 0 cores, 0 gpu>, 
then the actual resource allocation on the ResourceManager will be <8192M, 2 
cores, 1 gpu>

To me, the correct behavior should overwrite only when the value is > 0. The 
reason is that by default a resource value will be set to 0. For example, assume 
we have a profile {{"a" = (mem=3, vcore=5, res_1=7)}} and create a capability 
override (capability = new resource(8)). The final result should be (mem=8, 
vcore=5, res_1=7) instead of (mem=8, vcore=0, res_1=0).

2) ResourceProfileManager now loads the minimum/maximum profile from a config 
file (resource-profiles.json). To me this is not correct, because the 
minimum/maximum allocation for each resource type is already specified inside 
{{resource-types.xml}}. We should always use 
{{ResourceUtils#getResourceTypesMinimum/MaximumAllocation}} to read them from 
resource-types.xml and yarn-site.xml. These values will be added to the profiles 
so clients can get these configs.
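
A minimal sketch of the overwrite-when-positive merge described in point 1 
(illustrative; it borrows the Resource / ResourceInformation types but is not 
the actual ProfileCapability code):
{code:java}
static Resource mergeProfileAndOverride(Resource profile, Resource override) {
  Resource result = Resources.clone(profile);
  for (ResourceInformation ri : override.getResources()) {
    // Only values explicitly set to something positive win over the profile;
    // the default value 0 means "keep the profile's value".
    if (ri.getValue() > 0) {
      result.setResourceInformation(ri.getName(), ri);
    }
  }
  return result;
}
{code}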



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7223) Document GPU isolation feature

2017-09-19 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7223:


 Summary: Document GPU isolation feature
 Key: YARN-7223
 URL: https://issues.apache.org/jira/browse/YARN-7223
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7208) CMAKE_C_STANDARD take effect in NodeManager package.

2017-09-15 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-7208.
--
Resolution: Duplicate

> CMAKE_C_STANDARD take effect in NodeManager package.
> 
>
> Key: YARN-7208
> URL: https://issues.apache.org/jira/browse/YARN-7208
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Blocker
>
> I just checked: the changes of this JIRA don't relate to the issues I saw; I 
> tried to revert this patch but the issue is still the same.
> It seems set(CMAKE_C_STANDARD) doesn't work for the nodemanager project. I 
> hardcoded the change from set(CMAKE_C_STANDARD 99) to set(CMAKE_C_STANDARD 90) 
> in the nodemanager project. (Since we have code that uses C99-only syntax, 
> changing to 90 should fail the build.)
> I tried on two different environments:
> 1) CentOS 6, cmake version 3.1.0, gcc 4.4.7
> For both the 99 and 90 standards, the build failed.
> 2) OSX v10.12.4, cmake version 3.5.2, cc = "Apple LLVM version 8.1.0 
> (clang-802.0.42)".
> For both the 99 and 90 standards, the build succeeded.
> At least the for loop in gpu-module.c is C99-only:
> {code}
> for (int i = 0; i < n_minor_devices_to_block; i++) {
>// ...
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7208) CMAKE_C_STANDARD doesn't take effect in NodeManager package.

2017-09-15 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-7208:


 Summary: CMAKE_C_STANDARD doesn't take effect in NodeManager package.
 Key: YARN-7208
 URL: https://issues.apache.org/jira/browse/YARN-7208
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan
Priority: Blocker


I just checked: the changes of this JIRA don't relate to the issues I saw; I 
tried to revert this patch but the issue is still the same.

It seems set(CMAKE_C_STANDARD) doesn't work for the nodemanager project. I 
hardcoded the change from set(CMAKE_C_STANDARD 99) to set(CMAKE_C_STANDARD 90) 
in the nodemanager project. (Since we have code that uses C99-only syntax, 
changing to 90 should fail the build.)

I tried on two different environments:
1) CentOS 6, cmake version 3.1.0, gcc 4.4.7
For both the 99 and 90 standards, the build failed.
2) OSX v10.12.4, cmake version 3.5.2, cc = "Apple LLVM version 8.1.0 
(clang-802.0.42)".
For both the 99 and 90 standards, the build succeeded.
At least the for loop in gpu-module.c is C99-only:

{code}
for (int i = 0; i < n_minor_devices_to_block; i++) {
   // ...
}
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-3926) Extend the YARN resource model for easier resource-type management and profiles

2017-09-12 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-3926.
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.1.0

This feature has been merged to trunk (3.1.0). Thanks everybody for helping with 
this feature, and special thanks to [~vvasudev] for leading and driving the 
feature development from the beginning.

Just moved all pending items to YARN-7069 and marked this one as resolved.

> Extend the YARN resource model for easier resource-type management and 
> profiles
> ---
>
> Key: YARN-3926
> URL: https://issues.apache.org/jira/browse/YARN-3926
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager, resourcemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Fix For: 3.1.0
>
> Attachments: Proposal for modifying resource model and profiles.pdf
>
>
> Currently, there are efforts to add support for various resource types such 
> as disk (YARN-2139), network (YARN-2140), and HDFS bandwidth (YARN-2681). 
> Each of these efforts aims to add support for a single new resource type and 
> is a fairly involved effort. In addition, once support is added, it becomes 
> harder for users to specify the resources they need: all existing jobs have 
> to be modified, or have to use the minimum allocation.
> This ticket is a proposal to extend the YARN resource model into a more 
> flexible one that makes it easier to support additional resource types. It 
> also considers the related aspect of “resource profiles”, which allow users 
> to easily specify the various resources they need for any given container.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org


