[jira] [Resolved] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator

2021-11-11 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-10848.
---
Resolution: Not A Problem

I am closing this JIRA based on the above discussion.

> Vcore allocation problem with DefaultResourceCalculator
> ---
>
> Key: YARN-10848
> URL: https://issues.apache.org/jira/browse/YARN-10848
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Reporter: Peter Bacsko
>Assignee: Minni Mittal
>Priority: Major
>  Labels: pull-request-available
> Attachments: TestTooManyContainers.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating 
> containers even if we run out of vcores.
> CS checks the available resources in two places. The first check is in 
> {{CapacityScheduler.allocateContainerOnSingleNode()}}:
> {noformat}
> if (calculator.computeAvailableContainers(Resources
>     .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
>     minimumAllocation) <= 0) {
>   LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient "
>       + "available or preemptible resource for minimum allocation");
> {noformat}
> The second, which is more important, is located in 
> {{RegularContainerAllocator.assignContainer()}}:
> {noformat}
> if (!Resources.fitsIn(rc, capability, totalResource)) {
>   LOG.warn("Node : " + node.getNodeID()
>   + " does not have sufficient resource for ask : " + pendingAsk
>   + " node total capability : " + node.getTotalResource());
>   // Skip this locality request
>   ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
>   activitiesManager, node, application, schedulerKey,
>   ActivityDiagnosticConstant.
>   NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST
>   + getResourceDiagnostics(capability, totalResource),
>   ActivityLevel.NODE);
>   return ContainerAllocation.LOCALITY_SKIPPED;
> }
> {noformat}
> Here, {{rc}} is the resource calculator instance; the other two values are:
> {noformat}
> Resource capability = pendingAsk.getPerAllocationResource();
> Resource available = node.getUnallocatedResource();
> {noformat}
> There is a repro unit test attached to this case, which demonstrates the 
> problem. The root cause is that we pass the resource calculator to 
> {{Resources.fitsIn()}}. Instead, we should use the overload that takes no 
> calculator, just like in {{FSAppAttempt.assignContainer()}}:
> {noformat}
> // Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
>   // Inform the application of the new container for this request
>   RMContainer allocatedContainer =
>       allocate(type, node, schedulerKey, pendingAsk,
>           reservedContainer);
> {noformat}
> In CS, if we switch to DominantResourceCalculator OR use 
> {{Resources.fitsIn()}} without the calculator in 
> {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit 
> test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}).
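> For illustration, a minimal sketch of the second option above (not the 
> committed fix): call {{Resources.fitsIn()}} without the calculator in 
> {{RegularContainerAllocator.assignContainer()}} so that every resource type, 
> including vcores, is checked even under DefaultResourceCalculator:
> {code}
> // Sketch only: compare the ask against the node's total resource without the
> // ResourceCalculator, so the check is not reduced to memory alone.
> if (!Resources.fitsIn(capability, totalResource)) {
>   // Skip this locality request, as in the existing code path.
>   return ContainerAllocation.LOCALITY_SKIPPED;
> }
> {code}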



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9975) Support proxy ACL user for CapacityScheduler

2021-10-07 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-9975.
--
Resolution: Duplicate

I'm closing this as a dup of YARN-1115. Please reopen if you disagree.

> Support proxy ACL user for CapacityScheduler
> 
>
> Key: YARN-9975
> URL: https://issues.apache.org/jira/browse/YARN-9975
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
>
> As commented in YARN-9698, I will open a new JIRA for the proxy user feature. 
> The background is that we have a long-running SQL thriftserver for many users:
> {quote}{{user->sql proxy-> sql thriftserver}}{quote}
> But we do not have keytabs for all users on the 'sql proxy'. We just use a 
> super user like 'sql_prc' to submit the 'sql thriftserver' application. To 
> support this, we should change the scheduler to support a proxy user ACL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-07 Thread Eric Payne (Jira)
Eric Payne created YARN-10935:
-

 Summary: AM Total Queue Limit goes below per-user AM Limit if 
parent is full.
 Key: YARN-10935
 URL: https://issues.apache.org/jira/browse/YARN-10935
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, capacityscheduler
Reporter: Eric Payne


This happens when DRF is enabled and all of one resource is consumed but the 
second resource still has plenty available.

This is reproducible by setting up a parent queue where the capacity and max 
capacity are the same, with 2 or more sub-queues whose max capacity is 100% 
(see the config sketch below).

In one of the sub-queues, start a long-running app that consumes all resources 
in the parent queue's hierarchy. This app will consume all of the memory but 
not very many vcores (for example).

In a second queue, submit an app. At that point, the *{{Max Application Master 
Resources Per User}}* limit is much larger than the *{{Max Application Master 
Resources}}* limit.
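
A hypothetical capacity-scheduler.xml sketch of the repro setup (queue names and 
percentages are made up for illustration):
{code:xml}
<!-- Parent queue whose capacity equals its max capacity. -->
<property>
  <name>yarn.scheduler.capacity.root.parent.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.maximum-capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.queues</name>
  <value>sub1,sub2</value>
</property>
<!-- Two sub-queues whose max capacity is 100%. -->
<property>
  <name>yarn.scheduler.capacity.root.parent.sub1.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.sub1.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.sub2.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.sub2.maximum-capacity</name>
  <value>100</value>
</property>
{code}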





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10834) Intra-queue preemption: apps that don't use defined custom resource won't be preempted.

2021-06-25 Thread Eric Payne (Jira)
Eric Payne created YARN-10834:
-

 Summary: Intra-queue preemption: apps that don't use defined 
custom resource won't be preempted.
 Key: YARN-10834
 URL: https://issues.apache.org/jira/browse/YARN-10834
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Eric Payne
Assignee: Eric Payne


YARN-8292 added handling of negative resources during the preemption 
calculation phase. That JIRA hard-coded it so that for inter-(cross-)queue 
preemption, a single resource in the vector could go negative while 
calculating ideal assignments and preemptions. It also hard-coded it so that 
during intra-(in-)queue preemption calculations, no resource could go 
negative. YARN-10613 made these options configurable.

However, in clusters where custom resources are defined, apps that don't use 
the extended resource won't be preempted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10613) Config to allow Intra-queue preemption to enable/disable conservativeDRF

2021-02-03 Thread Eric Payne (Jira)
Eric Payne created YARN-10613:
-

 Summary: Config to allow Intra-queue preemption to  enable/disable 
conservativeDRF
 Key: YARN-10613
 URL: https://issues.apache.org/jira/browse/YARN-10613
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, scheduler preemption
Affects Versions: 2.10.1, 3.1.4, 3.2.2, 3.3.0
Reporter: Eric Payne
Assignee: Eric Payne


YARN-8292 added code that prevents CS intra-queue preemption from preempting 
containers from an app unless all of the major resources used by the app are 
greater than the user limit for that user.

Ex:
| Used | User Limit |
| <58GB, 58> | <30GB, 300> |

In this example, only used memory is above the user limit, not used vcores. So, 
intra-queue preemption will not occur.

YARN-8292 added the {{conservativeDRF}} flag to 
{{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. 
If {{conservativeDRF}} is false, containers will be preempted from apps in the 
example state. If true, containers will not be preempted.

This flag is hard-coded to false for Inter-queue (cross-queue) preemption and 
true for intra-queue (in-queue) preemption.

I propose that in some cases, we want intra-queue preemption to be more 
aggressive and preempt in the example case. To accommodate that, I propose the 
addition of the following config property:
{code:xml}
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.conservative-drf</name>
  <value>true</value>
</property>
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10164) Allow NM to start even when custom resource type not defined

2021-01-25 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-10164.
---
Resolution: Won't Do

> Allow NM to start even when custom resource type not defined
> 
>
> Key: YARN-10164
> URL: https://issues.apache.org/jira/browse/YARN-10164
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> In the [custom resource 
> documentation|https://hadoop.apache.org/docs/r3.2.1/hadoop-yarn/hadoop-yarn-site/ResourceModel.html],
>  it tells you to add the number of custom resources to a property called 
> {{yarn.nodemanager.resource-type.<resource>}} in a file called 
> {{node-resources.xml}}.
> For GPU resources, this would look something like
> {code:xml}
> <property>
>   <name>yarn.nodemanager.resource-type.gpu</name>
>   <value>16</value>
> </property>
> {code}
> A corresponding config property must also be in {{resource-types.xml}} called 
> yarn.resource-types:
> {code:xml}
> <property>
>   <name>yarn.resource-types</name>
>   <value>gpu</value>
>   <description>Custom resources to be used for scheduling.</description>
> </property>
> {code}
> If the yarn.nodemanager.resource-type.gpu property exists without the 
> corresponding yarn.resource-types property, the nodemanager fails to start.
> I would like the option to automatically create the node-resources.xml on all 
> new nodes regardless of whether or not the cluster supports GPU resources so 
> that if I deploy a GPU node into an existing cluster that does not (yet) 
> support GPU resources, the nodemanager will at least start. Even though it 
> doesn't support the GPU resource, the other supported resources will still be 
> available to be used by the apps in the cluster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-23 Thread Eric Payne (Jira)
Eric Payne created YARN-10471:
-

 Summary: Prevent logs for any container from becoming larger than 
a configurable size.
 Key: YARN-10471
 URL: https://issues.apache.org/jira/browse/YARN-10471
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 3.1.4, 3.2.1
Reporter: Eric Payne
Assignee: Eric Payne


Configure a cluster such that a task attempt will be killed if any container 
log exceeds a configured size. This would help prevent logs from filling disks 
and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10456) RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics registry

2020-10-09 Thread Eric Payne (Jira)
Eric Payne created YARN-10456:
-

 Summary: RM PartitionQueueMetrics records are named QueueMetrics 
in Simon metrics registry
 Key: YARN-10456
 URL: https://issues.apache.org/jira/browse/YARN-10456
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.10.1, 3.1.4, 3.2.1, 3.3.0
Reporter: Eric Payne
Assignee: Eric Payne


Several queue metrics (such as AppsRunning, PendingContainers, etc.) stopped 
working after we upgraded to 2.10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-02 Thread Eric Payne (Jira)
Eric Payne created YARN-10451:
-

 Summary: RM (v1) UI NodesPage can NPE when yarn.io/gpu resource 
type is defined.
 Key: YARN-10451
 URL: https://issues.apache.org/jira/browse/YARN-10451
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Eric Payne


The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-1741) XInclude support broken for YARN ResourceManager

2020-07-21 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-1741.
--
Resolution: Won't Fix

bq. Since branch-2.8 is EOL, I propose that we close this as Won't Fix.
+1

> XInclude support broken for YARN ResourceManager
> 
>
> Key: YARN-1741
> URL: https://issues.apache.org/jira/browse/YARN-1741
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Eric Sirianni
>Assignee: Xuan Gong
>Priority: Critical
>  Labels: regression
>
> The XInclude support in Hadoop configuration files (introduced via 
> HADOOP-4944) was broken by the recent {{ConfigurationProvider}} changes to 
> YARN ResourceManager.  Specifically, YARN-1459 and, more generally, the 
> YARN-1611 family of JIRAs for ResourceManager HA.
> The issue is that {{ConfigurationProvider}} provides a raw {{InputStream}} as 
> a {{Configuration}} resource for what was previously a {{Path}}-based 
> resource.  
> For {{Path}} resources, the absolute file path is used as the {{systemId}} 
> for the {{DocumentBuilder.parse()}} call:
> {code}
>   } else if (resource instanceof Path) {  // a file resource
> ...
>   doc = parse(builder, new BufferedInputStream(
>   new FileInputStream(file)), ((Path)resource).toString());
> }
> {code}
> The {{systemId}} is used to resolve XIncludes (among other things):
> {code}
> /**
>  * Parse the content of the given InputStream as an
>  * XML document and return a new DOM Document object.
> ...
>  * @param systemId Provide a base for resolving relative URIs.
> ...
>  */
> public Document parse(InputStream is, String systemId)
> {code}
> However, for loading raw {{InputStream}} resources, the {{systemId}} is set 
> to {{null}}:
> {code}
>   } else if (resource instanceof InputStream) {
> doc = parse(builder, (InputStream) resource, null);
> {code}
> causing XInclude resolution to fail.
> In our particular environment, we make extensive use of XIncludes to 
> standardize common configuration parameters across multiple Hadoop clusters.
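> For reference, a minimal illustrative sketch (not the actual patch; the class, 
> stream, and base-URI names are hypothetical) of an XInclude-aware parse that 
> supplies a non-null {{systemId}}:
> {code}
> import java.io.InputStream;
> import javax.xml.parsers.DocumentBuilder;
> import javax.xml.parsers.DocumentBuilderFactory;
> import org.w3c.dom.Document;
> 
> // Hypothetical helper, for illustration only.
> class XIncludeAwareParse {
>   static Document parseWithSystemId(InputStream in, String baseUri) throws Exception {
>     DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
>     dbf.setNamespaceAware(true);
>     // XInclude processing must be enabled on the factory...
>     dbf.setXIncludeAware(true);
>     DocumentBuilder builder = dbf.newDocumentBuilder();
>     // ...and a non-null systemId gives relative xi:include hrefs a base URI to
>     // resolve against; passing null (as for raw InputStream resources) breaks it.
>     return builder.parse(in, baseUri);
>   }
> }
> {code}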



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10343) Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.

2020-07-07 Thread Eric Payne (Jira)
Eric Payne created YARN-10343:
-

 Summary: Legacy RM UI should include labeled metrics for 
allocated, total, and reserved resources.
 Key: YARN-10343
 URL: https://issues.apache.org/jira/browse/YARN-10343
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 3.1.3, 3.2.1, 2.10.0
Reporter: Eric Payne
Assignee: Eric Payne






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9767) PartitionQueueMetrics Issues

2020-06-04 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-9767.
--
Resolution: Duplicate

> PartitionQueueMetrics Issues
> 
>
> Key: YARN-9767
> URL: https://issues.apache.org/jira/browse/YARN-9767
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-9767.001.patch
>
>
> The intent of the Jira is to capture the issues/observations encountered as 
> part of YARN-6492 development separately for ease of tracking.
> Observations:
> Please refer to this comment: 
> https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027
> 1. Since partition info is extracted from both the request and the node, there 
> is a problem. For example:
>  
> Node N has been mapped to Label X (non-exclusive). Queue A has been 
> configured with the ANY node label. App A requested resources from Queue A and 
> its containers ran on Node N for some reason. During the 
> AbstractCSQueue#allocateResource call, the node partition (from SchedulerNode) 
> is used for the calculation. Let's say the allocate call is fired for 3 
> containers of 1 GB each; then
> a. PartitionDefault * queue A -> pending mb is 3 GB
> b. PartitionX * queue A -> pending mb is -3 GB
>  
> is the outcome. The app request was fired without any label specification, 
> which is how metric #a was derived. After allocation is over, pending 
> resources are decreased, and that path uses the node partition info, which is 
> how metric #b was derived.
>  
> Given this kind of situation, we will need to put some thought into getting 
> these metrics right.
>  
> 2. Though the intent of this JIRA is to do Partition Queue Metrics, we would 
> like to retain the existing Queue Metrics for backward compatibility (as you 
> can see from the JIRA's discussion).
> With this patch and the YARN-9596 patch, QueueMetrics (for queues) would be 
> overridden either with partition-specific values or with default-partition 
> values. It could be vice versa as well. For example, after a queue (say 
> queue A) has been initialised with min and max capacities and also with node 
> label min and max capacities, QueueMetrics (availableMB) for queue A returns 
> values based on the node label's capacity config.
> I've been working on these observations to provide a fix and attached 
> .005.WIP.patch. The focus of .005.WIP.patch is to ensure availableMB and 
> availableVcores are correct (please refer to observation #2 above). Added more 
> asserts in {{testQueueMetricsWithLabelsOnDefaultLabelNode}} to ensure the fix 
> for #2 works properly.
> One more thing to note: user metrics for availableMB and availableVcores at 
> the root queue were not there before either; that behaviour is retained. User 
> metrics for availableMB and availableVcores are available only at the child 
> queue level, and also per partition.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10251) Show extended resources on legacy RM UI.

2020-04-28 Thread Eric Payne (Jira)
Eric Payne created YARN-10251:
-

 Summary: Show extended resources on legacy RM UI.
 Key: YARN-10251
 URL: https://issues.apache.org/jira/browse/YARN-10251
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Eric Payne
Assignee: Eric Payne
 Attachments: Legacy RM UI With Not All Resources Shown.png, Updated 
Legacy RM UI With All Resources Shown.png





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10164) Allow NM to start even when custom resource type not defined

2020-02-25 Thread Eric Payne (Jira)
Eric Payne created YARN-10164:
-

 Summary: Allow NM to start even when custom resource type not 
defined
 Key: YARN-10164
 URL: https://issues.apache.org/jira/browse/YARN-10164
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Eric Payne
Assignee: Eric Payne


In the [custom resource 
documentation|https://hadoop.apache.org/docs/r3.2.1/hadoop-yarn/hadoop-yarn-site/ResourceModel.html],
 it tells you to add the number of custom resources to a property called 
{{yarn.nodemanager.resource-type.<resource>}} in a file called 
{{node-resources.xml}}.

For GPU resources, this would look something like
{code:xml}
<property>
  <name>yarn.nodemanager.resource-type.gpu</name>
  <value>16</value>
</property>
{code}

A corresponding config property must also be in {{resource-types.xml}} called 
yarn.resource-types:
{code:xml}
<property>
  <name>yarn.resource-types</name>
  <value>gpu</value>
  <description>Custom resources to be used for scheduling.</description>
</property>
{code}

If the yarn.nodemanager.resource-type.gpu property exists without the 
corresponding yarn.resource-types property, the nodemanager fails to start.

I would like the option to automatically create the node-resources.xml on all 
new nodes regardless of whether or not the cluster supports GPU resources so 
that if I deploy a GPU node into an existing cluster that does not (yet) 
support GPU resources, the nodemanager will at least start. Even though it 
doesn't support the GPU resource, the other supported resources will still be 
available to be used by the apps in the cluster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero

2020-01-23 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-9790.
--
Fix Version/s: 2.10.1
   3.1.4
   3.2.2
   Resolution: Fixed

> Failed to set default-application-lifetime if maximum-application-lifetime is 
> less than or equal to zero
> 
>
> Key: YARN-9790
> URL: https://issues.apache.org/jira/browse/YARN-9790
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>
> Attachments: YARN-9790.001.patch, YARN-9790.002.patch, 
> YARN-9790.003.patch, YARN-9790.004.patch
>
>
> capacity-scheduler
> {code}
> ...
> yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1
> yarn.scheduler.capacity.root.dev.default-application-lifetime=604800
> {code}
> refreshQueue was failed as follows
> {code}
> 2019-08-28 15:21:57,423 WARN  resourcemanager.AdminService 
> (AdminService.java:logAndWrapException(910)) - Exception refresh queues.
> java.io.IOException: Failed to re-init queues : Default lifetime604800 can't 
> exceed maximum lifetime -1
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114)
> at 
> org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default 
> lifetime604800 can't exceed maximum lifetime -1
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:162)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:141)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472)
> ... 12 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10084) Allow inheritance of max app lifetime / default app lifetime

2020-01-13 Thread Eric Payne (Jira)
Eric Payne created YARN-10084:
-

 Summary: Allow inheritance of max app lifetime / default app 
lifetime
 Key: YARN-10084
 URL: https://issues.apache.org/jira/browse/YARN-10084
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler
Affects Versions: 3.1.3, 3.2.1, 2.10.0
Reporter: Eric Payne
Assignee: Eric Payne


Currently, {{maximum-application-lifetime}} and 
{{default-application-lifetime}} must be set for each leaf queue. If they are 
not set for a particular leaf queue, then there will be no time limit on apps 
running in that queue. It should be possible to set 
{{yarn.scheduler.capacity.root.maximum-application-lifetime}} for the root 
queue and allow child queues to override that value if desired.
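
For illustration, a sketch of what the proposed inheritance could look like (the 
child queue name and lifetime values are examples only):
{code:xml}
<!-- Set once at the root; leaf queues would inherit these values. -->
<property>
  <name>yarn.scheduler.capacity.root.maximum-application-lifetime</name>
  <value>604800</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default-application-lifetime</name>
  <value>86400</value>
</property>
<!-- A particular leaf queue could still override the inherited value. -->
<property>
  <name>yarn.scheduler.capacity.root.dev.maximum-application-lifetime</name>
  <value>259200</value>
</property>
{code}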



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10033) TestProportionalCapacityPreemptionPolicy not initializing vcores for effective max resources

2019-12-13 Thread Eric Payne (Jira)
Eric Payne created YARN-10033:
-

 Summary: TestProportionalCapacityPreemptionPolicy not initializing 
vcores for effective max resources
 Key: YARN-10033
 URL: https://issues.apache.org/jira/browse/YARN-10033
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, test
Affects Versions: 3.3.0
Reporter: Eric Payne


TestProportionalCapacityPreemptionPolicy#testPreemptionWithVCoreResource is 
preempting more containers than would happen on a real cluster.
This is because the process for mocking CS queues in 
{{TestProportionalCapacityPreemptionPolicy}} fails to take into consideration 
vcores when mocking effective max resources.
This causes miscalculations for how many vcores to preempt when the DRF is 
being used in the test:
{code:title=TempQueuePerPartition#offer}
Resource absMaxCapIdealAssignedDelta = Resources.componentwiseMax(
Resources.subtract(getMax(), idealAssigned),
Resource.newInstance(0, 0));
{code}
In the above code, the preemption policy is offering resources to an 
underserved queue. {{getMax()}} will use the effective max resource if it 
exists. Since this test is mocking effective max resources, it will return that 
value. However, since the mock doesn't include vcores, the test treats memory 
as the dominant resource and awards too many preempted containers to the 
underserved queue.
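
For illustration, a minimal sketch (memory and vcore values are hypothetical) of 
the difference between mocking effective max resources with and without vcores:
{code}
// Memory-only mock: vcores default to 0, so the DRF comparison in the test
// treats memory as the dominant resource and over-preempts.
Resource memoryOnlyMax = Resource.newInstance(100 * 1024, 0);

// Mock that also initializes vcores, so the DRF comparison sees both dimensions.
Resource effectiveMax = Resource.newInstance(100 * 1024, 100);
{code}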




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-10009) DRF can treat minimum user limit percent as a max when custom resource is defined

2019-12-02 Thread Eric Payne (Jira)
Eric Payne created YARN-10009:
-

 Summary: DRF can treat minimum user limit percent as a max when 
custom resource is defined
 Key: YARN-10009
 URL: https://issues.apache.org/jira/browse/YARN-10009
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Eric Payne


| | Memory | Vcores | res_1 |
| Queue1 Totals | 20GB | 100 | 80 |
| Resources requested by App1 in Queue1 | 8GB (40% of total) | 8 (8% of total) | 80 (100% of total) |

In the above use case:
- Queue1 has a value of 25 for {{minimum-user-limit-percent}}
- User1 has requested 8 containers with {{<memory: 1GB, vcores: 1, res_1: 10>}} 
each
- {{res_1}} will be the dominant resource in this case.

All 8 containers should be assigned by the capacity scheduler, but with min 
user limit pct set to 25, only 3 containers are assigned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9773) Add QueueMetrics for Custom Resources

2019-10-17 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-9773.
--
Fix Version/s: 3.1.4
   3.2.2
   3.3.0
   Resolution: Fixed

Thanks [~maniraj...@gmail.com] . I have committed this to trunk, branch-3.2 and 
branch-3.1

> Add QueueMetrics for Custom Resources
> -
>
> Key: YARN-9773
> URL: https://issues.apache.org/jira/browse/YARN-9773
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9773.001.patch, YARN-9773.002.patch, 
> YARN-9773.003.patch
>
>
> Although the custom resource metrics are calculated and saved as a 
> QueueMetricsForCustomResources object within the QueueMetrics class, the JMX 
> and Simon QueueMetrics do not report that information for custom resources. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9911) Backport YARN-9773 (Add QueueMetrics for Custom Resources) to branch-2 and branch-2.10

2019-10-17 Thread Eric Payne (Jira)
Eric Payne created YARN-9911:


 Summary: Backport YARN-9773 (Add QueueMetrics for Custom 
Resources) to branch-2 and branch-2.10
 Key: YARN-9911
 URL: https://issues.apache.org/jira/browse/YARN-9911
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, yarn
Affects Versions: 2.10.1, 2.11.0
Reporter: Eric Payne


The feature for tracking queue metrics for custom resources was added in 
YARN-9773. We would like to utilize this same feature in branch-2.

If the same design is to be backported to branch-2, several prerequisites must 
also be backported. Some (but perhaps not all) are listed below. An alternative 
design may be preferable.

{panel:title=Prerequisites for YARN-9773}
YARN-7541
YARN-5707
YARN-7739
YARN-8202
YARN-8750 (backported to branch-2 and branch-2.10)
YARN-8842 (backported to 3.2, 3.1--still needs to go into branch-2)
{panel}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9894) CapacitySchedulerPerf test for measuring hundreds of apps in a large number of queues.

2019-10-11 Thread Eric Payne (Jira)
Eric Payne created YARN-9894:


 Summary: CapacitySchedulerPerf test for measuring hundreds of apps 
in a large number of queues.
 Key: YARN-9894
 URL: https://issues.apache.org/jira/browse/YARN-9894
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, test
Affects Versions: 3.1.3, 3.2.1, 2.8.5, 2.9.2
Reporter: Eric Payne


I have developed a unit test based on the existing TestCapacitySchedulerPerf 
tests that will measure the performance of a configurable number of apps in a 
configurable number of queues. It will also test the performance of a cluster 
that has many queues but only a portion of them are active.

{code:title=For example:}
$ mvn test \
  -Dtest=TestCapacitySchedulerPerf#testUserLimitThroughputWithManyQueues \
  -DRunCapacitySchedulerPerfTests=true \
  -DNumberOfQueues=100 \
  -DNumberOfApplications=200 \
  -DPercentActiveQueues=100
{code}

- Parameters:
-- RunCapacitySchedulerPerfTests=true:
Needed in order to trigger the test
-- NumberOfQueues
Configurable number of queues
-- NumberOfApplications
Total number of apps to run in the whole cluster, distributed evenly across all 
queues
-- PercentActiveQueues
Percentage of the queues that contain active applications



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-9756) Create metric that sums total memory/vcores preempted per round

2019-08-16 Thread Eric Payne (JIRA)
Eric Payne created YARN-9756:


 Summary: Create metric that sums total memory/vcores preempted per 
round
 Key: YARN-9756
 URL: https://issues.apache.org/jira/browse/YARN-9756
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler
Affects Versions: 3.1.2, 2.8.5, 3.0.3, 2.9.2, 3.2.0
Reporter: Eric Payne






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9685) NPE when rendering the info table of leaf queue in non-accessible partitions

2019-08-08 Thread Eric Payne (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-9685.
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.1.3
   3.2.1
   3.3.0

Thanks again, [~Tao Yang]. I have committed to trunk, branch-3.2, and 
branch-3.1. Prior releases did not have the issue.

> NPE when rendering the info table of leaf queue in non-accessible partitions
> 
>
> Key: YARN-9685
> URL: https://issues.apache.org/jira/browse/YARN-9685
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9685.001.patch
>
>
> I found incomplete queue info shown on scheduler page and NPE in RM log when 
> rendering the info table of leaf queue in non-accessible partitions.
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:108)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:97)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> {noformat}
> The direct cause is that the PartitionQueueCapacitiesInfo of leaf queues in 
> non-accessible partitions is incomplete (some fields are null, such as 
> configuredMinResource/configuredMaxResource/effectiveMinResource/effectiveMaxResource),
>  but some places in CapacitySchedulerPage don't account for that.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8425) Yarn container getting killed due to running beyond physical memory limits

2018-06-14 Thread Eric Payne (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-8425.
--
Resolution: Not A Bug

> Yarn container getting killed due to running beyond physical memory limits
> --
>
> Key: YARN-8425
> URL: https://issues.apache.org/jira/browse/YARN-8425
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: applications, container-queuing, yarn
>Affects Versions: 2.7.6
>Reporter: Tapas Sen
>Priority: Major
> Attachments: yarn_configuration_1.PNG, yarn_configuration_2.PNG, 
> yarn_configuration_3.PNG
>
>
> Hi,
> Getting this error:
>  
> 2018-06-12 17:59:07,193 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics 
> report from attempt_1527758146858_45040_m_08_3: Container 
> [pid=15498,containerID=container_e60_1527758146858_45040_01_41] is 
> running beyond physical memory limits. Current usage: 8.1 GB of 8 GB physical 
> memory used; 12.2 GB of 16.8 GB virtual memory used. Killing container.
>  
> The YARN resource configuration is in the attachments. 
>  
>  Any lead would be appreciated.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7947) Capacity Scheduler intra-queue preemption can NPE for non-schedulable apps

2018-02-19 Thread Eric Payne (JIRA)
Eric Payne created YARN-7947:


 Summary: Capacity Scheduler intra-queue preemption can NPE for 
non-schedulable apps
 Key: YARN-7947
 URL: https://issues.apache.org/jira/browse/YARN-7947
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, scheduler preemption
Reporter: Eric Payne


Intra-queue preemption policy can cause NPE for pending users with no 
schedulable apps.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7927) YARN-7813 caused test failure in TestRMWebServicesSchedulerActivities

2018-02-13 Thread Eric Payne (JIRA)
Eric Payne created YARN-7927:


 Summary: YARN-7813 caused test failure in 
TestRMWebServicesSchedulerActivities 
 Key: YARN-7927
 URL: https://issues.apache.org/jira/browse/YARN-7927
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Eric Payne






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7813) Capacity Scheduler Intra-queue Preemption should be configurable for each queue

2018-01-24 Thread Eric Payne (JIRA)
Eric Payne created YARN-7813:


 Summary: Capacity Scheduler Intra-queue Preemption should be 
configurable for each queue
 Key: YARN-7813
 URL: https://issues.apache.org/jira/browse/YARN-7813
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, scheduler preemption
Affects Versions: 3.0.0, 2.8.3, 2.9.0
Reporter: Eric Payne
Assignee: Eric Payne


Just as inter-queue (a.k.a. cross-queue) preemption is configurable per queue, 
intra-queue (a.k.a. in-queue) preemption should be configurable per queue. If a 
queue does not have a setting for intra-queue preemption, it should inherit its 
parent's value.
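
For illustration only, a sketch of what a per-queue setting could look like (the 
property name and the "default" queue are hypothetical; see the committed patch 
for the actual name):
{code:xml}
<!-- Hypothetical per-queue knob for intra-queue preemption. -->
<property>
  <name>yarn.scheduler.capacity.root.default.intra-queue-preemption.disable_preemption</name>
  <value>true</value>
</property>
{code}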



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP

2018-01-10 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-7424.
--
Resolution: Invalid

bq. In order to create the "desired" behavior, we would have to fundamentally 
change the way the capacity scheduler works,
Closing

> Capacity Scheduler Intra-queue preemption: add property to only preempt up to 
> configured MULP
> -
>
> Key: YARN-7424
> URL: https://issues.apache.org/jira/browse/YARN-7424
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.0.0-beta1, 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
>
> If the queue's configured minimum user limit percent (MULP) is something 
> small like 1%, all users will max out well over their MULP until 100 users 
> have apps in the queue. Since the intra-queue preemption monitor tries to 
> balance the resource among the users, most of the time in this use case it 
> will be preempting containers on behalf of users that are already over their 
> MULP guarantee.
> This JIRA proposes that a property should be provided so that a queue can be 
> configured to only preempt on behalf of a user until that user has reached 
> its MULP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7728) Expose and expand container preemptions in Capacity Scheduler queue metrics

2018-01-10 Thread Eric Payne (JIRA)
Eric Payne created YARN-7728:


 Summary: Expose and expand container preemptions in Capacity 
Scheduler queue metrics
 Key: YARN-7728
 URL: https://issues.apache.org/jira/browse/YARN-7728
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 3.0.0, 2.8.3, 2.9.0
Reporter: Eric Payne
Assignee: Eric Payne


YARN-1047 exposed queue metrics for the number of preempted containers to the 
fair scheduler. I would like to also expose these to the capacity scheduler and 
add metrics for the amount of lost memory seconds and vcore seconds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7658) Capacity scheduler UI hangs when rendering if labels are present

2017-12-14 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-7658.
--
Resolution: Duplicate

> Capacity scheduler UI hangs when rendering if labels are present
> 
>
> Key: YARN-7658
> URL: https://issues.apache.org/jira/browse/YARN-7658
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Eric Payne
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7658) Capacity scheduler UI hangs when rendering if labels are present

2017-12-14 Thread Eric Payne (JIRA)
Eric Payne created YARN-7658:


 Summary: Capacity scheduler UI hangs when rendering if labels are 
present
 Key: YARN-7658
 URL: https://issues.apache.org/jira/browse/YARN-7658
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Reporter: Eric Payne






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7619) Max AM Resource value in CS UI is different for every user

2017-12-06 Thread Eric Payne (JIRA)
Eric Payne created YARN-7619:


 Summary: Max AM Resource value in CS UI is different for every user
 Key: YARN-7619
 URL: https://issues.apache.org/jira/browse/YARN-7619
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, yarn
Affects Versions: 3.0.0-beta1, 2.9.0, 2.8.2, 3.1.0
Reporter: Eric Payne
Assignee: Eric Payne


YARN-7245 addressed the problem that the {{Max AM Resource}} in the capacity 
scheduler UI used to contain the queue-level AM limit instead of the user-level 
AM limit. It fixed this by using the user-specific AM limit that is calculated 
in {{LeafQueue#activateApplications}}, stored in each user's {{LeafQueue#User}} 
object, and retrieved via {{UserInfo#getResourceUsageInfo}}.

The problem is that this user-specific AM limit depends on the activity of 
other users and other applications in a queue, and it is only calculated and 
updated when a user's application is activated. So, when 
{{CapacitySchedulerPage}} retrieves the user-specific AM limit, it is a stale 
value unless an application was recently activated for a particular user.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7575) When using absolute capacity configuration with no max capacity, scheduler UI NPEs and can't grow queue

2017-11-28 Thread Eric Payne (JIRA)
Eric Payne created YARN-7575:


 Summary: When using absolute capacity configuration with no max 
capacity, scheduler UI NPEs and can't grow queue
 Key: YARN-7575
 URL: https://issues.apache.org/jira/browse/YARN-7575
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Reporter: Eric Payne


I encountered the following while reviewing and testing branch YARN-5881.

The design document from YARN-5881 says that for max-capacity:
{quote}
3)  For each queue, we require:
a) if max-resource not set, it automatically set to parent.max-resource
{quote}

When I try leaving {{yarn.scheduler.capacity.<queue-path>.maximum-capacity}} 
blank, the RM UI scheduler page refuses to render. It looks like it's in 
{{CapacitySchedulerPage$LeafQueueInfoBlock}}:
{noformat}
2017-11-28 11:29:16,974 [qtp43473566-220] ERROR webapp.Dispatcher: error 
handling URI: /cluster/scheduler
java.lang.reflect.InvocationTargetException
...
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:164)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithoutParition(CapacitySchedulerPage.java:129)
{noformat}

Also... A job will run in the leaf queue with no max capacity set and it will 
grow to the max capacity of the cluster, but if I add resources to the node, 
the job won't grow any more even though it has pending resources.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7501) Capacity Scheduler Intra-queue preemption should have a "dead zone" around user limit

2017-11-15 Thread Eric Payne (JIRA)
Eric Payne created YARN-7501:


 Summary: Capacity Scheduler Intra-queue preemption should have a 
"dead zone" around user limit
 Key: YARN-7501
 URL: https://issues.apache.org/jira/browse/YARN-7501
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, scheduler preemption
Affects Versions: 3.0.0-beta1, 2.9.0, 2.8.2, 3.1.0
Reporter: Eric Payne






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations

2017-11-14 Thread Eric Payne (JIRA)
Eric Payne created YARN-7496:


 Summary: CS Intra-queue preemption user-limit calculations are not 
in line with LeafQueue user-limit calculations
 Key: YARN-7496
 URL: https://issues.apache.org/jira/browse/YARN-7496
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.2
Reporter: Eric Payne
Assignee: Eric Payne


Only a problem in 2.8.

Preemption could oscillate due to the difference in how user limit is 
calculated between 2.8 and later releases.

Basically (ignoring ULF, MULP, and maybe others), the calculation for user 
limit on the Capacity Scheduler side in 2.8 is {{total used resources / number 
of active users}} while the calculation in later releases is {{total active 
resources / number of active users}}. When intra-queue preemption was 
backported to 2.8, its calculations for user limit were more aligned with the 
latter algorithm, which is in 2.9 and later releases.
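
For illustration (the numbers are hypothetical), suppose a queue has 20 GB used 
in total, 10 GB of which counts as active resources, with 2 active users:
{noformat}
2.8 scheduler:          user limit = total used resources   / active users = 20 GB / 2 = 10 GB
2.9+ (and the monitor): user limit = total active resources / active users = 10 GB / 2 =  5 GB
{noformat}
The preemption monitor then preempts toward the lower figure while the 2.8 
scheduler reassigns toward the higher one, which would produce the oscillation 
described above.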



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit

2017-11-09 Thread Eric Payne (JIRA)
Eric Payne created YARN-7469:


 Summary: Capacity Scheduler Intra-queue preemption: User can 
starve if newest app is exactly at user limit
 Key: YARN-7469
 URL: https://issues.apache.org/jira/browse/YARN-7469
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, yarn
Affects Versions: 3.0.0-beta1, 2.9.0, 2.8.2
Reporter: Eric Payne
Assignee: Eric Payne






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP

2017-10-31 Thread Eric Payne (JIRA)
Eric Payne created YARN-7424:


 Summary: Capacity Scheduler Intra-queue preemption: add property 
to only preempt up to configured MULP
 Key: YARN-7424
 URL: https://issues.apache.org/jira/browse/YARN-7424
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, scheduler preemption
Affects Versions: 3.0.0-beta1, 2.8.2
Reporter: Eric Payne







--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7370) Intra-queue preemption properties should be refreshable

2017-10-19 Thread Eric Payne (JIRA)
Eric Payne created YARN-7370:


 Summary: Intra-queue preemption properties should be refreshable
 Key: YARN-7370
 URL: https://issues.apache.org/jira/browse/YARN-7370
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, scheduler preemption
Affects Versions: 3.0.0-alpha3, 2.8.0
Reporter: Eric Payne


At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} 
should be refreshable. It would also be nice to make 
{{intra-queue-preemption.enabled}} and {{preemption-order-policy}} refreshable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user

2017-09-22 Thread Eric Payne (JIRA)
Eric Payne created YARN-7245:


 Summary: In Cap Sched UI, Max AM Resource column in Active Users 
Info section should be per-user
 Key: YARN-7245
 URL: https://issues.apache.org/jira/browse/YARN-7245
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, yarn
Affects Versions: 3.0.0-alpha4, 2.8.1, 2.9.0
Reporter: Eric Payne






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-01 Thread Eric Payne (JIRA)
Eric Payne created YARN-7149:


 Summary: Cross-queue preemption sometimes starves an underserved 
queue
 Key: YARN-7149
 URL: https://issues.apache.org/jira/browse/YARN-7149
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Affects Versions: 3.0.0-alpha3, 2.9.0
Reporter: Eric Payne
Assignee: Eric Payne


In branch 2 and trunk, I am consistently seeing some use cases where 
cross-queue preemption does not happen when it should. I do not see this in 
branch-2.8.

Use Case:
| | *Size* | *Minimum Container Size* |
|MyCluster | 20 GB | 0.5 GB |

| *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit Percent (MULP)* | *User Limit Factor (ULF)* |
|Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
|Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |

- {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
- {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
- _Note: containers are 0.5 GB._
- Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
- Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
- _No more containers are ever preempted, even though {{Q2}} is far underserved_




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7120) CapacitySchedulerPage NPE in "Aggregate scheduler counts" section

2017-08-29 Thread Eric Payne (JIRA)
Eric Payne created YARN-7120:


 Summary: CapacitySchedulerPage NPE in "Aggregate scheduler counts" 
section
 Key: YARN-7120
 URL: https://issues.apache.org/jira/browse/YARN-7120
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-alpha3, 2.8.1, 2.9.0
Reporter: Eric Payne
Assignee: Eric Payne
Priority: Minor


The problem manifests itself by having the bottom part of the "Aggregated 
scheduler counts" section cut off on the GUI and an NPE in the RM log.
{noformat}
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$HealthBlock.render(CapacitySchedulerPage.java:558)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43)
at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet.__(Hamlet.java:30354)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:478)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
at 
org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
at 
org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
at 
org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:86)
... 58 more
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7052) RM SchedulingMonitor should use HadoopExecutors when creating ScheduledExecutorService

2017-08-18 Thread Eric Payne (JIRA)
Eric Payne created YARN-7052:


 Summary: RM SchedulingMonitor should use HadoopExecutors when 
creating ScheduledExecutorService
 Key: YARN-7052
 URL: https://issues.apache.org/jira/browse/YARN-7052
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Eric Payne


In YARN-7051, we ran into a case where the preemption monitor thread hung with 
no indication of why. This was because the preemption monitor is started via a 
{{ScheduledExecutorService}} from {{SchedulingMonitor#serviceStart}}, and then 
nothing ever gets the result of the future or allows it to throw an exception 
if needed.

At least with {{HadoopExecutors}}, it will provide a 
{{HadoopScheduledThreadPoolExecutor}} that logs the exception if one happens.
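
For illustration, here is a minimal, self-contained sketch of the hazard using 
plain JDK executors (not the actual SchedulingMonitor code): the task's 
exception is captured inside the returned future, so nothing is ever logged 
unless some caller retrieves it.
{code}
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class SwallowedExceptionDemo {
  public static void main(String[] args) throws Exception {
    ScheduledExecutorService ses = Executors.newScheduledThreadPool(1);

    // The task throws on its first run. The exception is captured in the
    // future, further executions are suppressed, and nothing is logged.
    ScheduledFuture<?> future = ses.scheduleAtFixedRate(
        () -> { throw new RuntimeException("monitor died"); },
        0, 1, TimeUnit.SECONDS);

    Thread.sleep(2000);
    System.out.println("no stack trace so far; the periodic task is silently dead");

    // The failure only becomes visible if someone asks the future for it.
    try {
      future.get();
    } catch (ExecutionException e) {
      e.getCause().printStackTrace();   // "monitor died"
    }
    ses.shutdownNow();
  }
}
{code}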



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7051) FifoIntraQueuePreemptionPlugin can get concurrent modification exception

2017-08-18 Thread Eric Payne (JIRA)
Eric Payne created YARN-7051:


 Summary: FifoIntraQueuePreemptionPlugin can get concurrent 
modification exception
 Key: YARN-7051
 URL: https://issues.apache.org/jira/browse/YARN-7051
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-alpha3, 2.8.1, 2.9.0
Reporter: Eric Payne
Priority: Critical


{{FifoIntraQueuePreemptionPlugin#calculateUsedAMResourcesPerQueue}} has the 
following code:
{code}
Collection<FiCaSchedulerApp> runningApps = leafQueue.getApplications();
Resource amUsed = Resources.createResource(0, 0);

for (FiCaSchedulerApp app : runningApps) {
{code}
{{runningApps}} is unmodifiable but not concurrent. This caused the preemption 
monitor thread to crash in the RM in one of our clusters.
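
As a minimal illustration with plain JDK collections (not the actual scheduler 
objects): an unmodifiable view only prevents the *reader* from mutating the 
collection; it does not make iteration safe while another thread modifies the 
backing collection, so a snapshot copy or a concurrent collection is needed.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.List;

public class UnmodifiableIsNotConcurrent {
  public static void main(String[] args) {
    List<String> apps = new ArrayList<>(Arrays.asList("app_1", "app_2"));
    // Unmodifiable *view*: callers cannot mutate it, but it still reflects
    // (and is broken by) changes to the backing list.
    Collection<String> runningApps = Collections.unmodifiableCollection(apps);

    try {
      for (String app : runningApps) {
        apps.add("app_3");   // stands in for the scheduler thread adding an app
      }
    } catch (ConcurrentModificationException e) {
      System.out.println("iteration failed: " + e);
    }

    // One defensive option: iterate over a snapshot (taken under proper
    // synchronization in real code).
    for (String app : new ArrayList<>(runningApps)) {
      System.out.println("snapshot sees " + app);
    }
  }
}
{code}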



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6585) RM fails to start when upgrading from 2.7 to 2.8 for clusters with node labels.

2017-05-11 Thread Eric Payne (JIRA)
Eric Payne created YARN-6585:


 Summary: RM fails to start when upgrading from 2.7 to 2.8 for 
clusters with node labels.
 Key: YARN-6585
 URL: https://issues.apache.org/jira/browse/YARN-6585
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Eric Payne


{noformat}
Caused by: java.io.IOException: Not all labels being replaced contained by 
known label collections, please check, new labels=[abc]
at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.checkReplaceLabelsOnNode(CommonNodeLabelsManager.java:718)
at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.replaceLabelsOnNode(CommonNodeLabelsManager.java:737)
at 
org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.replaceLabelsOnNode(RMNodeLabelsManager.java:189)
at 
org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.loadFromMirror(FileSystemNodeLabelsStore.java:181)
at 
org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:208)
at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:251)
at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:265)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
... 13 more
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6248) Killing an app with pending container requests leaves the user in UsersManager

2017-02-27 Thread Eric Payne (JIRA)
Eric Payne created YARN-6248:


 Summary: Killing an app with pending container requests leaves the 
user in UsersManager
 Key: YARN-6248
 URL: https://issues.apache.org/jira/browse/YARN-6248
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-alpha3
Reporter: Eric Payne
Assignee: Eric Payne


If an app is still asking for resources when it is killed, the user is left in 
the UsersManager structure and shows up on the GUI.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6165) Intra-queue preemption occurs even when preemption is turned off for a specific queue.

2017-02-09 Thread Eric Payne (JIRA)
Eric Payne created YARN-6165:


 Summary: Intra-queue preemption occurs even when preemption is 
turned off for a specific queue.
 Key: YARN-6165
 URL: https://issues.apache.org/jira/browse/YARN-6165
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, scheduler preemption
Affects Versions: 3.0.0-alpha2
Reporter: Eric Payne


Intra-queue preemption occurs even when preemption is turned on for the whole 
cluster ({{yarn.resourcemanager.scheduler.monitor.enable == true}}) but turned 
off for a specific queue 
({{yarn.scheduler.capacity.root.queue1.disable_preemption == true}}).
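
A minimal sketch of the configuration combination described above (plain 
{{Configuration}} properties, names exactly as quoted):
{code}
import org.apache.hadoop.conf.Configuration;

public class PreemptionToggleRepro {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // preemption monitor enabled cluster-wide
    conf.setBoolean("yarn.resourcemanager.scheduler.monitor.enable", true);
    // ...but preemption explicitly disabled for root.queue1
    conf.setBoolean("yarn.scheduler.capacity.root.queue1.disable_preemption", true);

    // Expected: no containers in root.queue1 are preempted, cross-queue or
    // intra-queue. Observed: intra-queue preemption still occurs there.
    System.out.println(
        conf.get("yarn.scheduler.capacity.root.queue1.disable_preemption"));
  }
}
{code}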



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5973) TestCapacitySchedulerSurgicalPreemption sometimes fails

2016-12-06 Thread Eric Payne (JIRA)
Eric Payne created YARN-5973:


 Summary: TestCapacitySchedulerSurgicalPreemption sometimes fails
 Key: YARN-5973
 URL: https://issues.apache.org/jira/browse/YARN-5973
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, scheduler preemption
Affects Versions: 2.8.0
Reporter: Eric Payne
Priority: Minor


The tests in {{TestCapacitySchedulerSurgicalPreemption}} appear to be racy. 
They often pass, but the following errors sometimes occur:
{noformat}
testSimpleSurgicalPreemption(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption)
  Time elapsed: 14.671 sec  <<< FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.fail(Assert.java:95)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerPreemptionTestBase.waitNumberOfLiveContainersFromApp(CapacitySchedulerPreemptionTestBase.java:110)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption.testSimpleSurgicalPreemption(TestCapacitySchedulerSurgicalPreemption.java:143)
{noformat}
{noformat}
testSurgicalPreemptionWithAvailableResource(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption)
  Time elapsed: 9.503 sec  <<< FAILURE!
java.lang.AssertionError: expected:<3> but was:<2>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption.testSurgicalPreemptionWithAvailableResource(TestCapacitySchedulerSurgicalPreemption.java:220)
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-4751) In 2.7, Labeled queue usage not shown properly in capacity scheduler UI

2016-03-01 Thread Eric Payne (JIRA)
Eric Payne created YARN-4751:


 Summary: In 2.7, Labeled queue usage not shown properly in 
capacity scheduler UI
 Key: YARN-4751
 URL: https://issues.apache.org/jira/browse/YARN-4751
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, yarn
Affects Versions: 2.7.3
Reporter: Eric Payne
Assignee: Eric Payne


In 2.6 and 2.7, the capacity scheduler UI does not have the queue graphs 
separated by partition. When applications are running on a labeled queue, no 
color is shown in the bar graph, and several of the "Used" metrics are zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-4390) Consider container request size during CS preemption

2015-12-14 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-4390.
--
Resolution: Duplicate

Closing this ticket in favor of YARN-4108

> Consider container request size during CS preemption
> 
>
> Key: YARN-4390
> URL: https://issues.apache.org/jira/browse/YARN-4390
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.0.0, 2.8.0, 2.7.3
>Reporter: Eric Payne
>Assignee: Eric Payne
>
> There are multiple reasons why preemption could unnecessarily preempt 
> containers. One is that an app could be requesting a large container (say 
> 8-GB), and the preemption monitor could conceivably preempt multiple 
> containers (say 8, 1-GB containers) in order to fill the large container 
> request. These smaller containers would then be rejected by the requesting AM 
> and potentially given right back to the preempted app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-4226) Make capacity scheduler queue's preemption status REST API consistent with GUI

2015-12-14 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-4226.
--
Resolution: Won't Fix

Since the code works and is only slightly confusing, I am closing this ticket 
as WontFix.

> Make capacity scheduler queue's preemption status REST API consistent with GUI
> --
>
> Key: YARN-4226
> URL: https://issues.apache.org/jira/browse/YARN-4226
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.7.1
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
>
> In the capacity scheduler GUI, the preemption status has the following form:
> {code}
> Preemption:   disabled
> {code}
> However, the REST API shows the following for the same status:
> {code}
> "preemptionDisabled":true
> {code}
> The latter is confusing and should be consistent with the format in the GUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4422) Generic AHS sometimes doesn't show started, node, or logs on App page

2015-12-04 Thread Eric Payne (JIRA)
Eric Payne created YARN-4422:


 Summary: Generic AHS sometimes doesn't show started, node, or logs 
on App page
 Key: YARN-4422
 URL: https://issues.apache.org/jira/browse/YARN-4422
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Eric Payne
Assignee: Eric Payne


Sometimes the AM container for an app isn't able to start the JVM. This can 
happen if bogus JVM options are given to the AM container 
({{-Dyarn.app.mapreduce.am.command-opts=-InvalidJvmOption}}) or when the AM 
container's environment variables are misconfigured 
({{-Dyarn.app.mapreduce.am.env="JAVA_HOME=/foo/bar/baz"}}).

When the AM container for an app isn't able to start the JVM, the Application 
page for that application shows {{N/A}} for the {{Started}}, {{Node}}, and 
{{Logs}} columns. It _does_ have links for each app attempt, and if you click 
on one of them, you go to the Application Attempt page, where you can see all 
containers with links to their logs and nodes, including the AM container. But 
none of that shows up for the app attempts on the Application page.

Also, on the Application Attempt page, in the {{Application Attempt Overview}} 
section, the {{AM Container}} value is {{null}} and the {{Node}} value is 
{{N/A}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4390) Consider container request size during CS preemption

2015-11-24 Thread Eric Payne (JIRA)
Eric Payne created YARN-4390:


 Summary: Consider container request size during CS preemption
 Key: YARN-4390
 URL: https://issues.apache.org/jira/browse/YARN-4390
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Affects Versions: 3.0.0, 2.8.0, 2.7.3
Reporter: Eric Payne
Assignee: Eric Payne


There are multiple reasons why preemption could unnecessarily preempt 
containers. One is that an app could be requesting a large container (say 
8-GB), and the preemption monitor could conceivably preempt multiple containers 
(say 8, 1-GB containers) in order to fill the large container request. These 
smaller containers would then be rejected by the requesting AM and potentially 
given right back to the preempted app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4225) Add preemption status to {{yarn queue -status}}

2015-10-05 Thread Eric Payne (JIRA)
Eric Payne created YARN-4225:


 Summary: Add preemption status to {{yarn queue -status}}
 Key: YARN-4225
 URL: https://issues.apache.org/jira/browse/YARN-4225
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.7.1
Reporter: Eric Payne
Assignee: Eric Payne
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4226) Make capacity scheduler queue's preemption status REST API consistent with GUI

2015-10-05 Thread Eric Payne (JIRA)
Eric Payne created YARN-4226:


 Summary: Make capacity scheduler queue's preemption status REST 
API consistent with GUI
 Key: YARN-4226
 URL: https://issues.apache.org/jira/browse/YARN-4226
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, yarn
Affects Versions: 2.7.1
Reporter: Eric Payne
Assignee: Eric Payne
Priority: Minor


In the capacity scheduler GUI, the preemption status has the following form:
{code}
Preemption: disabled
{code}
However, the REST API shows the following for the same status:
{code}
"preemptionDisabled":true
{code}
The latter is confusing and should be consistent with the format in the GUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3978) Configurably turn off the saving of container info in Generic AHS

2015-07-25 Thread Eric Payne (JIRA)
Eric Payne created YARN-3978:


 Summary: Configurably turn off the saving of container info in 
Generic AHS
 Key: YARN-3978
 URL: https://issues.apache.org/jira/browse/YARN-3978
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver, yarn
Reporter: Eric Payne
Assignee: Eric Payne


Depending on how each application's metadata is stored, one week's worth of 
data stored in the Generic Application History Server's database can grow to 
almost a terabyte of local disk space. To alleviate this, I suggest adding a 
configuration option to turn off the saving of non-AM container metadata in 
the GAHS data store.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart

2015-07-09 Thread Eric Payne (JIRA)
Eric Payne created YARN-3905:


 Summary: Application History Server UI NPEs when accessing apps 
run after RM restart
 Key: YARN-3905
 URL: https://issues.apache.org/jira/browse/YARN-3905
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.7.1, 2.7.0, 2.8.0
Reporter: Eric Payne
Assignee: Eric Payne


From the Application History URL (http://RmHostName:8188/applicationhistory), 
clicking on the application ID of an app that was run after the RM daemon has 
been restarted results in a 500 error:
{noformat}
Sorry, got error 500
Please consult RFC 2616 for meanings of the error code.
{noformat}

The stack trace is as follows:
{code}
2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO 
applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading 
history information of all application attempts of application 
application_1436472584878_0001
2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: 
Failed to read the AM container of the application attempt 
appattempt_1436472584878_0001_01.
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206)
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199)
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205)
at 
org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272)
at 
org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
at 
org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266)
...
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit

2015-06-04 Thread Eric Payne (JIRA)
Eric Payne created YARN-3769:


 Summary: Preemption occurring unnecessarily because preemption 
doesn't consider user limit
 Key: YARN-3769
 URL: https://issues.apache.org/jira/browse/YARN-3769
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.7.0, 2.6.0, 2.8.0
Reporter: Eric Payne
Assignee: Eric Payne


We are seeing the preemption monitor preempt containers from queue A and then 
the capacity scheduler give them immediately back to queue A. This happens 
quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3540) Fetcher#copyMapOutput is leaking usedMemory upon IOException during InMemoryMapOutput shuffle handler

2015-04-23 Thread Eric Payne (JIRA)
Eric Payne created YARN-3540:


 Summary: Fetcher#copyMapOutput is leaking usedMemory upon 
IOException during InMemoryMapOutput shuffle handler
 Key: YARN-3540
 URL: https://issues.apache.org/jira/browse/YARN-3540
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Eric Payne
Assignee: Eric Payne
Priority: Blocker


We are seeing this happen when
- an NM's disk goes bad during the creation of map output(s)
- the reducer's fetcher can read the shuffle header and reserve the memory
- but gets an IOException when trying to shuffle the data into an InMemoryMapOutput
- shuffle fetch retry is enabled




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3275) Preemption happening on non-preemptable queues

2015-02-27 Thread Eric Payne (JIRA)
Eric Payne created YARN-3275:


 Summary: Preemption happening on non-preemptable queues
 Key: YARN-3275
 URL: https://issues.apache.org/jira/browse/YARN-3275
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Eric Payne
Assignee: Eric Payne


YARN-2056 introduced the ability to turn preemption on and off at the queue 
level. In cases where a queue goes over its absolute max capacity (YARN-3243, 
for example), containers can be preempted from that queue, even though the 
queue is marked as non-preemptable.

We are using this feature in large, busy clusters and seeing this behavior.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2592) Preemption can kill containers to fulfil need of already over-capacity queue.

2015-02-27 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved YARN-2592.
--
Resolution: Invalid

 Preemption can kill containers to fulfil need of already over-capacity queue.
 -

 Key: YARN-2592
 URL: https://issues.apache.org/jira/browse/YARN-2592
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.5.1
Reporter: Eric Payne

 There are scenarios in which one over-capacity queue can cause preemption of 
 another over-capacity queue. However, since killing containers may lose work, 
 it doesn't make sense to me to kill containers to feed an already 
 over-capacity queue.
 Consider the following:
 {code}
 root has A,B,C, total capacity = 90
 A.guaranteed = 30, A.pending = 5, A.current = 40
 B.guaranteed = 30, B.pending = 0, B.current = 50
 C.guaranteed = 30, C.pending = 0, C.current = 0
 {code}
 In this case, the queue preemption monitor will kill 5 resources from queue B 
 so that queue A can pick them up, even though queue A is already over its 
 capacity. This could lose any work that those containers in B had already 
 done.
 Is there a use case for this behavior? It seems to me that if a queue is 
 already over its capacity, it shouldn't destroy the work of other queues. If 
 the over-capacity queue needs more resources, that seems to be a problem that 
 should be solved by increasing its guarantee.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)



[jira] [Created] (YARN-2592) Preemption can kill containers to fulfil need of already over-capacity queue.

2014-09-23 Thread Eric Payne (JIRA)
Eric Payne created YARN-2592:


 Summary: Preemption can kill containers to fulfil need of already 
over-capacity queue.
 Key: YARN-2592
 URL: https://issues.apache.org/jira/browse/YARN-2592
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.5.1, 3.0.0
Reporter: Eric Payne


There are scenarios in which one over-capacity queue can cause preemption of 
another over-capacity queue. However, since killing containers may lose work, 
it doesn't make sense to me to kill containers to feed an already over-capacity 
queue.

Consider the following:

{code}
root has A,B,C, total capacity = 90
A.guaranteed = 30, A.pending = 5, A.current = 40
B.guaranteed = 30, B.pending = 0, B.current = 50
C.guaranteed = 30, C.pending = 0, C.current = 0
{code}

In this case, the queue preemption monitor will kill 5 resources from queue B 
so that queue A can pick them up, even though queue A is already over its 
capacity. This could lose any work that those containers in B had already done.

Is there a use case for this behavior? It seems to me that if a queue is 
already over its capacity, it shouldn't destroy the work of other queues. If 
the over-capacity queue needs more resources, that seems to be a problem that 
should be solved by increasing its guarantee.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2024) IOException in AppLogAggregatorImpl does not give stacktrace and leaves aggregated TFile in a bad state.

2014-05-06 Thread Eric Payne (JIRA)
Eric Payne created YARN-2024:


 Summary: IOException in AppLogAggregatorImpl does not give 
stacktrace and leaves aggregated TFile in a bad state.
 Key: YARN-2024
 URL: https://issues.apache.org/jira/browse/YARN-2024
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0, 0.23.10
Reporter: Eric Payne


Multiple issues were encountered when AppLogAggregatorImpl#uploadLogsForContainer 
hit an IOException while aggregating yarn-logs for an application that had very 
large (150G each) error logs.
- An IOException was encountered during the LogWriter#append call, and a 
message was printed, but no stacktrace was provided. Message: ERROR: Couldn't 
upload logs for container_n_nnn_nn_nn. Skipping this 
container.
- After the IOException, the TFile is in a bad state, so subsequent calls to 
LogWriter#append fail with the following stacktrace:
2014-04-16 13:29:09,772 [LogAggregationService #17907] ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[LogAggregationService #17907,5,main] threw an Exception.
java.lang.IllegalStateException: Incorrect state to start a new key: IN_VALUE
at 
org.apache.hadoop.io.file.tfile.TFile$Writer.prepareAppendKey(TFile.java:528)
at 
org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogWriter.append(AggregatedLogFormat.java:262)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainer(AppLogAggregatorImpl.java:128)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:164)
...
- At this point, the yarn-logs cleaner still thinks the thread is aggregating, 
so the huge yarn-logs never get cleaned up for that application.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1115) Provide optional means for a scheduler to check real user ACLs

2013-08-28 Thread Eric Payne (JIRA)
Eric Payne created YARN-1115:


 Summary: Provide optional means for a scheduler to check real user 
ACLs
 Key: YARN-1115
 URL: https://issues.apache.org/jira/browse/YARN-1115
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 0.23.9, 2.1.0-beta
Reporter: Eric Payne


In the framework for secure implementation using UserGroupInformation.doAs 
(http://hadoop.apache.org/docs/stable/Secure_Impersonation.html), a trusted 
superuser can submit jobs on behalf of another user in a secure way. In this 
framework, the superuser is referred to as the real user and the proxied user 
is referred to as the effective user.

Currently, when a job is submitted as an effective user, only the effective 
user's ACLs are checked against the queue on which the job is to be run. The 
scheduler should also check the ACLs of the real user when an optional 
configuration setting to do so is enabled.

For example, suppose my superuser name is super, and super is configured to 
securely proxy as joe. Also suppose there is a Hadoop queue named ops whose 
ACLs allow super but not joe.

When super proxies to joe in order to submit a job to the ops queue, the 
submission will fail because joe, as the effective user, does not have ACLs on 
the ops queue.
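
A hedged sketch of the proxy pattern being described, using the standard 
{{UserGroupInformation}} API (the actual job-submission call is elided since it 
depends on the client):
{code}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxySubmitSketch {
  public static void main(String[] args) throws Exception {
    // "super" is the logged-in (real) user and is allowed to impersonate "joe"
    UserGroupInformation realUser = UserGroupInformation.getLoginUser();
    UserGroupInformation effectiveUser =
        UserGroupInformation.createProxyUser("joe", realUser);

    effectiveUser.doAs((PrivilegedExceptionAction<Void>) () -> {
      // A job submitted here runs as the effective user "joe". Today the
      // ops queue's submit ACL is checked against "joe" only, so the
      // submission is rejected even though the real user "super" is in
      // the queue's ACL.
      return null;
    });
  }
}
{code}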

In many cases this is what you want, in order to protect queues that joe should 
not be using.

However, there are times when super may need to proxy to many users, and the 
client running as super just wants to use the ops queue because that queue is 
already dedicated to the client's purpose. To keep the ops queue dedicated to 
that purpose, super doesn't want to open up its ACLs to joe in general. 
Without this feature, the client running as super needs to figure out which 
queue each user has ACLs opened up for, and then coordinate with other tasks 
using those queues.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira