[jira] [Resolved] (YARN-10848) Vcore allocation problem with DefaultResourceCalculator
[ https://issues.apache.org/jira/browse/YARN-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne resolved YARN-10848.
-------------------------------
Resolution: Not A Problem

I am closing this JIRA based on the above discussion.

> Vcore allocation problem with DefaultResourceCalculator
> -------------------------------------------------------
>
> Key: YARN-10848
> URL: https://issues.apache.org/jira/browse/YARN-10848
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, capacityscheduler
> Reporter: Peter Bacsko
> Assignee: Minni Mittal
> Priority: Major
> Labels: pull-request-available
> Attachments: TestTooManyContainers.java
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> If we use DefaultResourceCalculator, then Capacity Scheduler keeps allocating
> containers even if we run out of vcores.
> CS checks the available resources at two places. The first check is in
> {{CapacityScheduler.allocateContainerOnSingleNode()}}:
> {noformat}
> if (calculator.computeAvailableContainers(Resources
>     .add(node.getUnallocatedResource(), node.getTotalKillableResources()),
>     minimumAllocation) <= 0) {
>   LOG.debug("This node " + node.getNodeID() + " doesn't have sufficient "
>       + "available or preemptible resource for minimum allocation");
> {noformat}
> The second, which is more important, is located in
> {{RegularContainerAllocator.assignContainer()}}:
> {noformat}
> if (!Resources.fitsIn(rc, capability, totalResource)) {
>   LOG.warn("Node : " + node.getNodeID()
>       + " does not have sufficient resource for ask : " + pendingAsk
>       + " node total capability : " + node.getTotalResource());
>   // Skip this locality request
>   ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
>       activitiesManager, node, application, schedulerKey,
>       ActivityDiagnosticConstant.NODE_TOTAL_RESOURCE_INSUFFICIENT_FOR_REQUEST
>           + getResourceDiagnostics(capability, totalResource),
>       ActivityLevel.NODE);
>   return ContainerAllocation.LOCALITY_SKIPPED;
> }
> {noformat}
> Here, {{rc}} is the resource calculator instance; the other two values are:
> {noformat}
> Resource capability = pendingAsk.getPerAllocationResource();
> Resource available = node.getUnallocatedResource();
> {noformat}
> There is a repro unit test attached to this case, which demonstrates the
> problem. The root cause is that we pass the resource calculator to
> {{Resources.fitsIn()}}. Instead, we should use the overloaded version that
> takes no calculator, just like in {{FSAppAttempt.assignContainer()}}:
> {noformat}
> // Can we allocate a container on this node?
> if (Resources.fitsIn(capability, available)) {
>   // Inform the application of the new container for this request
>   RMContainer allocatedContainer =
>       allocate(type, node, schedulerKey, pendingAsk, reservedContainer);
> {noformat}
> In CS, if we switch to DominantResourceCalculator OR use
> {{Resources.fitsIn()}} without the calculator in
> {{RegularContainerAllocator.assignContainer()}}, that fixes the failing unit
> test (see {{testTooManyContainers()}} in {{TestTooManyContainers.java}}).

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
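The failure mode above can be illustrated outside of Hadoop. The sketch below uses hypothetical helper names (not Hadoop APIs) to contrast a memory-only fit check, which is how a DefaultResourceCalculator-based comparison effectively behaves, with a check that considers every resource dimension, as {{Resources.fitsIn()}} does when called without a calculator:

```java
// Hypothetical sketch, not Hadoop code: contrast a memory-only fit check
// with an all-dimensions fit check on a node that is out of vcores.
public class FitCheckSketch {
    // Considers only memory, like a DefaultResourceCalculator comparison.
    static boolean fitsMemoryOnly(long askMem, long availMem) {
        return askMem <= availMem;
    }

    // Considers every resource dimension, like the calculator-less
    // Resources.fitsIn() overload.
    static boolean fitsAllDimensions(long askMem, int askVcores,
                                     long availMem, int availVcores) {
        return askMem <= availMem && askVcores <= availVcores;
    }

    public static void main(String[] args) {
        // A node that still has memory free but no vcores left.
        long availMem = 4096;
        int availVcores = 0;

        // Memory-only check says the ask fits, so CS would keep allocating.
        System.out.println(fitsMemoryOnly(1024, availMem));                    // true
        // The all-dimensions check correctly rejects the ask.
        System.out.println(fitsAllDimensions(1024, 1, availMem, availVcores)); // false
    }
}
```

With the memory-only comparison, the node with zero free vcores still "fits" the ask, which is exactly why containers keep being allocated past the vcore limit.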
[jira] [Resolved] (YARN-9975) Support proxy ACL user for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne resolved YARN-9975.
------------------------------
Resolution: Duplicate

I'm closing this as a dup of YARN-1115. Please reopen if you disagree.

> Support proxy ACL user for CapacityScheduler
> --------------------------------------------
>
> Key: YARN-9975
> URL: https://issues.apache.org/jira/browse/YARN-9975
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: zhoukang
> Assignee: zhoukang
> Priority: Major
>
> As commented in YARN-9698, I will open a new jira for the proxy user feature.
> The background is that we have a long-running sql thriftserver for many users:
> {quote}{{user -> sql proxy -> sql thriftserver}}{quote}
> But we do not have keytabs for all users on the 'sql proxy'. We just use a
> super user like 'sql_prc' to submit the 'sql thriftserver' application. To
> support this, we should change the scheduler to support a proxy user ACL.
[jira] [Created] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.
Eric Payne created YARN-10935:
---------------------------------
Summary: AM Total Queue Limit goes below per-user AM Limit if parent is full.
Key: YARN-10935
URL: https://issues.apache.org/jira/browse/YARN-10935
Project: Hadoop YARN
Issue Type: Improvement
Components: capacity scheduler, capacityscheduler
Reporter: Eric Payne

This happens when DRF is enabled and all of one resource is consumed but the second resource still has plenty available.

This is reproducible by setting up a parent queue where the capacity and max capacity are the same, with 2 or more sub-queues whose max capacity is 100%. In one of the sub-queues, start a long-running app that consumes all resources in the parent queue's hierarchy. This app will consume all of the memory but not very many vcores (for example). In a second queue, submit an app. The *{{Max Application Master Resources Per User}}* limit is much more than the *{{Max Application Master Resources}}* limit.
[jira] [Created] (YARN-10834) Intra-queue preemption: apps that don't use defined custom resource won't be preempted.
Eric Payne created YARN-10834:
---------------------------------
Summary: Intra-queue preemption: apps that don't use defined custom resource won't be preempted.
Key: YARN-10834
URL: https://issues.apache.org/jira/browse/YARN-10834
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Eric Payne
Assignee: Eric Payne

YARN-8292 added handling of negative resources during the preemption calculation phase. That JIRA hard-coded it so that for inter-(cross-)queue preemption, a single resource in the vector could go negative while calculating ideal assignments and preemptions. It also hard-coded it so that during intra-(in-)queue preemption calculations, no resource could go negative. YARN-10613 made these options configurable.

However, in clusters where custom resources are defined, apps that don't use the extended resource won't be preempted.
[jira] [Created] (YARN-10613) Config to allow Intra-queue preemption to enable/disable conservativeDRF
Eric Payne created YARN-10613:
---------------------------------
Summary: Config to allow Intra-queue preemption to enable/disable conservativeDRF
Key: YARN-10613
URL: https://issues.apache.org/jira/browse/YARN-10613
Project: Hadoop YARN
Issue Type: Improvement
Components: capacity scheduler, scheduler preemption
Affects Versions: 2.10.1, 3.1.4, 3.2.2, 3.3.0
Reporter: Eric Payne
Assignee: Eric Payne

YARN-8292 added code that prevents CS intra-queue preemption from preempting containers from an app unless all of the major resources used by the app are greater than the user limit for that user.

Ex:
| Used | User Limit |
| <58GB, 58> | <30GB, 300> |

In this example, only used memory is above the user limit, not used vcores. So, intra-queue preemption will not occur.

YARN-8292 added the {{conservativeDRF}} flag to {{CapacitySchedulerPreemptionUtils#tryPreemptContainerAndDeductResToObtain}}. If {{conservativeDRF}} is false, containers will be preempted from apps in the example state. If true, containers will not be preempted. This flag is hard-coded to false for inter-queue (cross-queue) preemption and true for intra-queue (in-queue) preemption.

I propose that in some cases, we want intra-queue preemption to be more aggressive and preempt in the example case. To accommodate that, I propose the addition of the following config property:

{code:xml}
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.conservative-drf</name>
  <value>true</value>
</property>
{code}
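The conservative vs. aggressive decision described above can be sketched as follows (hypothetical helper names, not the actual CapacitySchedulerPreemptionUtils code): conservative DRF preempts only when every major resource is above the user limit, while the aggressive variant preempts when any resource is above it.

```java
// Hypothetical sketch of the preemption eligibility decision; resource
// vectors are <memoryGB, vcores>, mirroring the example in the JIRA.
public class ConservativeDrfSketch {
    // Conservative: preempt only if ALL resources exceed the user limit.
    static boolean conservative(long usedMem, int usedVcores,
                                long limitMem, int limitVcores) {
        return usedMem > limitMem && usedVcores > limitVcores;
    }

    // Aggressive: preempt if ANY resource exceeds the user limit.
    static boolean aggressive(long usedMem, int usedVcores,
                              long limitMem, int limitVcores) {
        return usedMem > limitMem || usedVcores > limitVcores;
    }

    public static void main(String[] args) {
        // Used = <58GB, 58 vcores>, User Limit = <30GB, 300 vcores>.
        System.out.println(conservative(58, 58, 30, 300)); // false: no preemption
        System.out.println(aggressive(58, 58, 30, 300));   // true: would preempt
    }
}
```

In the table's example, only memory (58GB > 30GB) exceeds the limit, so the conservative check declines to preempt, which is the behavior the proposed config property would make switchable.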
[jira] [Resolved] (YARN-10164) Allow NM to start even when custom resource type not defined
[ https://issues.apache.org/jira/browse/YARN-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne resolved YARN-10164.
-------------------------------
Resolution: Won't Do

> Allow NM to start even when custom resource type not defined
> ------------------------------------------------------------
>
> Key: YARN-10164
> URL: https://issues.apache.org/jira/browse/YARN-10164
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Reporter: Eric Payne
> Assignee: Eric Payne
> Priority: Major
>
> In the [custom resource
> documentation|https://hadoop.apache.org/docs/r3.2.1/hadoop-yarn/hadoop-yarn-site/ResourceModel.html],
> it tells you to add the number of custom resources to a property called
> {{yarn.nodemanager.resource-type.<resource>}} in a file called
> {{node-resources.xml}}.
> For GPU resources, this would look something like
> {code:xml}
> <property>
>   <name>yarn.nodemanager.resource-type.gpu</name>
>   <value>16</value>
> </property>
> {code}
> A corresponding config property must also be in {{resource-types.xml}} called
> yarn.resource-types:
> {code:xml}
> <property>
>   <name>yarn.resource-types</name>
>   <value>gpu</value>
>   <description>Custom resources to be used for scheduling.</description>
> </property>
> {code}
> If the yarn.nodemanager.resource-type.gpu property exists without the
> corresponding yarn.resource-types property, the nodemanager fails to start.
> I would like the option to automatically create the node-resources.xml on all
> new nodes regardless of whether or not the cluster supports GPU resources so
> that if I deploy a GPU node into an existing cluster that does not (yet)
> support GPU resources, the nodemanager will at least start. Even though it
> doesn't support the GPU resource, the other supported resources will still be
> available to be used by the apps in the cluster.
[jira] [Created] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.
Eric Payne created YARN-10471: - Summary: Prevent logs for any container from becoming larger than a configurable size. Key: YARN-10471 URL: https://issues.apache.org/jira/browse/YARN-10471 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.1.4, 3.2.1 Reporter: Eric Payne Assignee: Eric Payne Configure a cluster such that a task attempt will be killed if any container log exceeds a configured size. This would help prevent logs from filling disks and also prevent the need to aggregate enormous logs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10456) RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics registry
Eric Payne created YARN-10456: - Summary: RM PartitionQueueMetrics records are named QueueMetrics in Simon metrics registry Key: YARN-10456 URL: https://issues.apache.org/jira/browse/YARN-10456 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.10.1, 3.1.4, 3.2.1, 3.3.0 Reporter: Eric Payne Assignee: Eric Payne Several queue metrics (such as AppsRunning, PendingContainers, etc.) stopped working after we upgraded to 2.10. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
Eric Payne created YARN-10451: - Summary: RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined. Key: YARN-10451 URL: https://issues.apache.org/jira/browse/YARN-10451 Project: Hadoop YARN Issue Type: Improvement Reporter: Eric Payne The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} property defines {{yarn.io}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-1741) XInclude support broken for YARN ResourceManager
[ https://issues.apache.org/jira/browse/YARN-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne resolved YARN-1741.
------------------------------
Resolution: Won't Fix

bq. Since branch-2.8 is EOL, I propose that we close this as Won't Fix.
+1

> XInclude support broken for YARN ResourceManager
> ------------------------------------------------
>
> Key: YARN-1741
> URL: https://issues.apache.org/jira/browse/YARN-1741
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.4.0
> Reporter: Eric Sirianni
> Assignee: Xuan Gong
> Priority: Critical
> Labels: regression
>
> The XInclude support in Hadoop configuration files (introduced via
> HADOOP-4944) was broken by the recent {{ConfigurationProvider}} changes to
> YARN ResourceManager. Specifically, YARN-1459 and, more generally, the
> YARN-1611 family of JIRAs for ResourceManager HA.
> The issue is that {{ConfigurationProvider}} provides a raw {{InputStream}} as
> a {{Configuration}} resource for what was previously a {{Path}}-based
> resource.
> For {{Path}} resources, the absolute file path is used as the {{systemId}}
> for the {{DocumentBuilder.parse()}} call:
> {code}
> } else if (resource instanceof Path) {          // a file resource
>   ...
>   doc = parse(builder, new BufferedInputStream(
>       new FileInputStream(file)), ((Path)resource).toString());
> }
> {code}
> The {{systemId}} is used to resolve XIncludes (among other things):
> {code}
> /**
>  * Parse the content of the given InputStream as an
>  * XML document and return a new DOM Document object.
>  ...
>  * @param systemId Provide a base for resolving relative URIs.
>  ...
>  */
> public Document parse(InputStream is, String systemId)
> {code}
> However, for loading raw {{InputStream}} resources, the {{systemId}} is set
> to {{null}}:
> {code}
> } else if (resource instanceof InputStream) {
>   doc = parse(builder, (InputStream) resource, null);
> {code}
> causing XInclude resolution to fail.
> In our particular environment, we make extensive use of XIncludes to
> standardize common configuration parameters across multiple Hadoop clusters.
[jira] [Created] (YARN-10343) Legacy RM UI should include labeled metrics for allocated, total, and reserved resources.
Eric Payne created YARN-10343: - Summary: Legacy RM UI should include labeled metrics for allocated, total, and reserved resources. Key: YARN-10343 URL: https://issues.apache.org/jira/browse/YARN-10343 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.1.3, 3.2.1, 2.10.0 Reporter: Eric Payne Assignee: Eric Payne -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-9767) PartitionQueueMetrics Issues
[ https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne resolved YARN-9767.
------------------------------
Resolution: Duplicate

> PartitionQueueMetrics Issues
> ----------------------------
>
> Key: YARN-9767
> URL: https://issues.apache.org/jira/browse/YARN-9767
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Manikandan R
> Assignee: Manikandan R
> Priority: Major
> Attachments: YARN-9767.001.patch
>
> The intent of this Jira is to capture the issues/observations encountered as
> part of YARN-6492 development separately, for ease of tracking.
> Observations:
> Please refer to
> https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027
> 1. Since partition info is extracted from the request and the node, there is
> a problem. For example:
> Node N has been mapped to Label X (non-exclusive). Queue A has been
> configured with the ANY node label. App A requested resources from Queue A
> and its containers ran on Node N for some reason. During the
> {{AbstractCSQueue#allocateResource}} call, the node partition (using
> SchedulerNode) gets used for the calculation. Let's say an allocate call has
> been fired for 3 containers of 1 GB each; then
> a. PartitionDefault * queue A -> pending mb is 3 GB
> b. PartitionX * queue A -> pending mb is -3 GB
> is the outcome. Because the app request was fired without any label
> specification, metric #a was derived. After allocation is over, pending
> resources usually get decreased. When this happens, the node partition info
> is used; hence metric #b is derived.
> Given this kind of situation, we will need to put some thought into getting
> the metrics right.
> 2. Though the intent of this jira is to do Partition Queue Metrics, we would
> like to retain the existing Queue Metrics for backward compatibility (as you
> can see from the jira's discussion).
> With this patch and the YARN-9596 patch, queue metrics (for queues) would be
> overridden either with specific partition values or default partition
> values. It could be vice versa as well. For example, after a queue (say
> queue A) has been initialised with some min and max cap and also with a node
> label's min and max cap, QueueMetrics (availableMB) for queue A returns
> values based on the node label's cap config.
> I've been working on these observations to provide a fix and attached
> .005.WIP.patch. The focus of .005.WIP.patch is to ensure availableMB and
> availableVcores are correct (please refer to observation #2 above). Added
> more asserts in {{testQueueMetricsWithLabelsOnDefaultLabelNode}} to ensure
> the fix for #2 is working properly.
> Also, one more thing to note: user metrics for availableMB and
> availableVcores at the root queue were not there even before; the same
> behaviour is retained. User metrics for availableMB and availableVcores are
> available only at the child queue level and also with partitions.
[jira] [Created] (YARN-10251) Show extended resources on legacy RM UI.
Eric Payne created YARN-10251: - Summary: Show extended resources on legacy RM UI. Key: YARN-10251 URL: https://issues.apache.org/jira/browse/YARN-10251 Project: Hadoop YARN Issue Type: Improvement Reporter: Eric Payne Assignee: Eric Payne Attachments: Legacy RM UI With Not All Resources Shown.png, Updated Legacy RM UI With All Resources Shown.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10164) Allow NM to start even when custom resource type not defined
Eric Payne created YARN-10164:
---------------------------------
Summary: Allow NM to start even when custom resource type not defined
Key: YARN-10164
URL: https://issues.apache.org/jira/browse/YARN-10164
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Reporter: Eric Payne
Assignee: Eric Payne

In the [custom resource documentation|https://hadoop.apache.org/docs/r3.2.1/hadoop-yarn/hadoop-yarn-site/ResourceModel.html], it tells you to add the number of custom resources to a property called {{yarn.nodemanager.resource-type.<resource>}} in a file called {{node-resources.xml}}. For GPU resources, this would look something like

{code:xml}
<property>
  <name>yarn.nodemanager.resource-type.gpu</name>
  <value>16</value>
</property>
{code}

A corresponding config property must also be in {{resource-types.xml}} called yarn.resource-types:

{code:xml}
<property>
  <name>yarn.resource-types</name>
  <value>gpu</value>
  <description>Custom resources to be used for scheduling.</description>
</property>
{code}

If the yarn.nodemanager.resource-type.gpu property exists without the corresponding yarn.resource-types property, the nodemanager fails to start.

I would like the option to automatically create the node-resources.xml on all new nodes regardless of whether or not the cluster supports GPU resources, so that if I deploy a GPU node into an existing cluster that does not (yet) support GPU resources, the nodemanager will at least start. Even though it doesn't support the GPU resource, the other supported resources will still be available to be used by the apps in the cluster.
[jira] [Resolved] (YARN-9790) Failed to set default-application-lifetime if maximum-application-lifetime is less than or equal to zero
[ https://issues.apache.org/jira/browse/YARN-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne resolved YARN-9790.
------------------------------
Fix Version/s: 2.10.1
               3.1.4
               3.2.2
Resolution: Fixed

> Failed to set default-application-lifetime if maximum-application-lifetime is
> less than or equal to zero
> -----------------------------------------------------------------------------
>
> Key: YARN-9790
> URL: https://issues.apache.org/jira/browse/YARN-9790
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4, 2.10.1
>
> Attachments: YARN-9790.001.patch, YARN-9790.002.patch,
> YARN-9790.003.patch, YARN-9790.004.patch
>
> capacity-scheduler:
> {code}
> ...
> yarn.scheduler.capacity.root.dev.maximum-application-lifetime=-1
> yarn.scheduler.capacity.root.dev.default-application-lifetime=604800
> {code}
> refreshQueue failed as follows:
> {code}
> 2019-08-28 15:21:57,423 WARN resourcemanager.AdminService
> (AdminService.java:logAndWrapException(910)) - Exception refresh queues.
> java.io.IOException: Failed to re-init queues : Default lifetime604800 can't
> exceed maximum lifetime -1
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:477)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:394)
>     at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114)
>     at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Default
> lifetime604800 can't exceed maximum lifetime -1
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:268)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.<init>(LeafQueue.java:162)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.<init>(LeafQueue.java:141)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:259)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:283)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:171)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:726)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:472)
>     ... 12 more
> {code}
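The validation at fault can be sketched as follows (hypothetical methods, not the actual LeafQueue#setupQueueConfigs code; the assumption, based on the JIRA title, is that a non-positive maximum lifetime means "unlimited" and should skip the comparison):

```java
// Hypothetical sketch of the lifetime validation described in YARN-9790.
public class LifetimeValidationSketch {
    // Buggy check: treats a non-positive max lifetime ("unlimited") as a
    // real upper bound, so default=604800 vs max=-1 is rejected.
    static boolean buggyIsValid(long maxLifetime, long defaultLifetime) {
        return defaultLifetime <= maxLifetime;
    }

    // Assumed fix: a max lifetime <= 0 means unlimited, so any default
    // lifetime is acceptable.
    static boolean fixedIsValid(long maxLifetime, long defaultLifetime) {
        return maxLifetime <= 0 || defaultLifetime <= maxLifetime;
    }

    public static void main(String[] args) {
        // The configuration from the JIRA: max = -1, default = 604800.
        System.out.println(buggyIsValid(-1, 604800)); // false: refreshQueues fails
        System.out.println(fixedIsValid(-1, 604800)); // true:  config accepted
    }
}
```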
[jira] [Created] (YARN-10084) Allow inheritance of max app lifetime / default app lifetime
Eric Payne created YARN-10084: - Summary: Allow inheritance of max app lifetime / default app lifetime Key: YARN-10084 URL: https://issues.apache.org/jira/browse/YARN-10084 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler Affects Versions: 3.1.3, 3.2.1, 2.10.0 Reporter: Eric Payne Assignee: Eric Payne Currently, {{maximum-application-lifetime}} and {{default-application-lifetime}} must be set for each leaf queue. If it is not set for a particular leaf queue, then there will be no time limit on apps running in that queue. It should be possible to set {{yarn.scheduler.capacity.root.maximum-application-lifetime}} for the root queue and allow child queues to override that value if desired. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
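The proposed inheritance might look like the following capacity-scheduler fragment. Only the root-level property name comes from the proposal above; the values and the leaf-queue override are illustrative assumptions:

```xml
<!-- Illustrative sketch; values and the leaf-queue override are assumptions. -->
<property>
  <name>yarn.scheduler.capacity.root.maximum-application-lifetime</name>
  <value>604800</value> <!-- inherited by child queues unless overridden -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.maximum-application-lifetime</name>
  <value>86400</value>  <!-- a leaf queue overriding the inherited value -->
</property>
```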
[jira] [Created] (YARN-10033) TestProportionalCapacityPreemptionPolicy not initializing vcores for effective max resources
Eric Payne created YARN-10033:
---------------------------------
Summary: TestProportionalCapacityPreemptionPolicy not initializing vcores for effective max resources
Key: YARN-10033
URL: https://issues.apache.org/jira/browse/YARN-10033
Project: Hadoop YARN
Issue Type: Improvement
Components: capacity scheduler, test
Affects Versions: 3.3.0
Reporter: Eric Payne

TestProportionalCapacityPreemptionPolicy#testPreemptionWithVCoreResource is preempting more containers than would happen on a real cluster. This is because the process for mocking CS queues in {{TestProportionalCapacityPreemptionPolicy}} fails to take vcores into consideration when mocking effective max resources. This causes miscalculations of how many vcores to preempt when DRF is being used in the test:

{code:title=TempQueuePerPartition#offer}
Resource absMaxCapIdealAssignedDelta = Resources.componentwiseMax(
    Resources.subtract(getMax(), idealAssigned),
    Resource.newInstance(0, 0));
{code}

In the above code, the preemption policy is offering resources to an underserved queue. {{getMax()}} will use the effective max resource if it exists. Since this test is mocking effective max resources, it will return that value. However, since the mock doesn't include vcores, the test treats memory as the dominant resource and awards too many preempted containers to the underserved queue.
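The offer calculation quoted above can be sketched with simplified two-element vectors standing in for {{Resource}} (the helper names below are assumptions, not Hadoop code), showing how a mocked effective max whose vcores were never initialized makes the vcore component of the delta vanish:

```java
// Hypothetical sketch: Resource reduced to {memoryMB, vcores} arrays, with
// componentwise subtract/max mimicking the Resources utility calls.
public class OfferSketch {
    static long[] subtract(long[] a, long[] b) {
        return new long[]{a[0] - b[0], a[1] - b[1]};
    }

    static long[] componentwiseMax(long[] a, long[] b) {
        return new long[]{Math.max(a[0], b[0]), Math.max(a[1], b[1])};
    }

    public static void main(String[] args) {
        // Mocked effective max with memory set but vcores left at 0
        // (illustrative numbers).
        long[] effectiveMax  = {100_000, 0};
        long[] idealAssigned = {40_000, 10};
        long[] zero          = {0, 0};

        long[] delta = componentwiseMax(subtract(effectiveMax, idealAssigned), zero);
        // The vcore component clamps to 0, so only the memory dimension
        // limits what is offered to the underserved queue.
        System.out.println(delta[0] + " " + delta[1]); // 60000 0
    }
}
```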
[jira] [Created] (YARN-10009) DRF can treat minimum user limit percent as a max when custom resource is defined
Eric Payne created YARN-10009:
---------------------------------
Summary: DRF can treat minimum user limit percent as a max when custom resource is defined
Key: YARN-10009
URL: https://issues.apache.org/jira/browse/YARN-10009
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Eric Payne

| | Memory | Vcores | res_1 |
| Queue1 Totals | 20GB | 100 | 80 |
| Resources requested by App1 in Queue1 | 8GB (40% of total) | 8 (8% of total) | 80 (100% of total) |

In the above use case:
- Queue1 has a value of 25 for {{minimum-user-limit-percent}}
- User1 has requested 8 containers with {{<1GB, 1, 10>}} each
- {{res_1}} will be the dominant resource in this case

All 8 containers should be assigned by the capacity scheduler, but with minimum user limit percent set to 25, only 3 containers are assigned.
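The dominant-resource claim above can be checked with simple arithmetic. This is a sketch of how DRF picks the dominant resource (the share with the largest fraction of the queue total), not the actual scheduler code:

```java
// Each share is the app's request divided by the queue total for that
// resource; DRF's dominant resource is the one with the largest share.
public class DominantShareSketch {
    public static void main(String[] args) {
        double memShare   = 8.0 / 20.0;   // 8GB of 20GB = 0.40
        double vcoreShare = 8.0 / 100.0;  // 8 of 100    = 0.08
        double res1Share  = 80.0 / 80.0;  // 80 of 80    = 1.00

        double dominant = Math.max(memShare, Math.max(vcoreShare, res1Share));
        System.out.println(dominant);     // 1.0 -> res_1 is the dominant resource
    }
}
```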
[jira] [Resolved] (YARN-9773) Add QueueMetrics for Custom Resources
[ https://issues.apache.org/jira/browse/YARN-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-9773. -- Fix Version/s: 3.1.4 3.2.2 3.3.0 Resolution: Fixed Thanks [~maniraj...@gmail.com] . I have committed this to trunk, branch-3.2 and branch-3.1 > Add QueueMetrics for Custom Resources > - > > Key: YARN-9773 > URL: https://issues.apache.org/jira/browse/YARN-9773 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Manikandan R >Assignee: Manikandan R >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.1.4 > > Attachments: YARN-9773.001.patch, YARN-9773.002.patch, > YARN-9773.003.patch > > > Although the custom resource metrics are calculated and saved as a > QueueMetricsForCustomResources object within the QueueMetrics class, the JMX > and Simon QueueMetrics do not report that information for custom resources. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-9911) Backport YARN-9773 (Add QueueMetrics for Custom Resources) to branch-2 and branch-2.10
Eric Payne created YARN-9911: Summary: Backport YARN-9773 (Add QueueMetrics for Custom Resources) to branch-2 and branch-2.10 Key: YARN-9911 URL: https://issues.apache.org/jira/browse/YARN-9911 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler, yarn Affects Versions: 2.10.1, 2.11.0 Reporter: Eric Payne The feature for tracking queue metrics for custom resources was added in YARN-9773. We would like to utilize this same feature in branch-2. If the same design is to be backported to branch-2, several prerequisites must also be backported. Some (but perhaps not all) are listed below. An alternative design may be preferable. {panel:title=Prerequisites for YARN-9773} YARN-7541 YARN-5707 YARN-7739 YARN-8202 YARN-8750 (backported to branch-2 and branch-2.10) YARN-8842 (backported to 3.2, 3.1--still needs to go into branch-2) {panel} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-9894) CapacitySchedulerPerf test for measuring hundreds of apps in a large number of queues.
Eric Payne created YARN-9894:
--------------------------------
Summary: CapacitySchedulerPerf test for measuring hundreds of apps in a large number of queues.
Key: YARN-9894
URL: https://issues.apache.org/jira/browse/YARN-9894
Project: Hadoop YARN
Issue Type: Improvement
Components: capacity scheduler, test
Affects Versions: 3.1.3, 3.2.1, 2.8.5, 2.9.2
Reporter: Eric Payne

I have developed a unit test based on the existing TestCapacitySchedulerPerf tests that will measure the performance of a configurable number of apps in a configurable number of queues. It will also test the performance of a cluster that has many queues but only a portion of them are active.

{code:title=For example:}
$ mvn test -Dtest=TestCapacitySchedulerPerf#testUserLimitThroughputWithManyQueues \
    -DRunCapacitySchedulerPerfTests=true \
    -DNumberOfQueues=100 \
    -DNumberOfApplications=200 \
    -DPercentActiveQueues=100
{code}

- Parameters:
-- RunCapacitySchedulerPerfTests=true: needed in order to trigger the test
-- NumberOfQueues: configurable number of queues
-- NumberOfApplications: total number of apps to run in the whole cluster, distributed evenly across all queues
-- PercentActiveQueues: percentage of the queues that contain active applications
[jira] [Created] (YARN-9756) Create metric that sums total memory/vcores preempted per round
Eric Payne created YARN-9756: Summary: Create metric that sums total memory/vcores preempted per round Key: YARN-9756 URL: https://issues.apache.org/jira/browse/YARN-9756 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler Affects Versions: 3.1.2, 2.8.5, 3.0.3, 2.9.2, 3.2.0 Reporter: Eric Payne
[jira] [Resolved] (YARN-9685) NPE when rendering the info table of leaf queue in non-accessible partitions
[ https://issues.apache.org/jira/browse/YARN-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-9685. -- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.1.3 3.2.1 3.3.0 Thanks again, [~Tao Yang]. I have committed to trunk, branch-3.2, and branch-3.1. Prior releases did not have the issue. > NPE when rendering the info table of leaf queue in non-accessible partitions > > > Key: YARN-9685 > URL: https://issues.apache.org/jira/browse/YARN-9685 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.3.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-9685.001.patch > > > I found incomplete queue info shown on scheduler page and NPE in RM log when > rendering the info table of leaf queue in non-accessible partitions. > {noformat} > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:97) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:243) > {noformat} > The direct cause is that PartitionQueueCapacitiesInfo of leaf queues in > non-accessible partitions are incomplete (some fields are null, such as > configuredMinResource/configuredMaxResource/effectiveMinResource/effectiveMaxResource) > but some places in CapacitySchedulerPage don't consider that.
[jira] [Resolved] (YARN-8425) Yarn container getting killed due to running beyond physical memory limits
[ https://issues.apache.org/jira/browse/YARN-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-8425. -- Resolution: Not A Bug > Yarn container getting killed due to running beyond physical memory limits > -- > > Key: YARN-8425 > URL: https://issues.apache.org/jira/browse/YARN-8425 > Project: Hadoop YARN > Issue Type: Task > Components: applications, container-queuing, yarn >Affects Versions: 2.7.6 >Reporter: Tapas Sen >Priority: Major > Attachments: yarn_configuration_1.PNG, yarn_configuration_2.PNG, > yarn_configuration_3.PNG > > > Hi, > Getting this error. > > 2018-06-12 17:59:07,193 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics > report from attempt_1527758146858_45040_m_08_3: Container > [pid=15498,containerID=container_e60_1527758146858_45040_01_41] is > running beyond physical memory limits. Current usage: 8.1 GB of 8 GB physical > memory used; 12.2 GB of 16.8 GB virtual memory used. Killing container. > > The Yarn resource configuration is in the attachments. > > Any lead would be appreciated.
[jira] [Created] (YARN-7947) Capacity Scheduler intra-queue preemption can NPE for non-schedulable apps
Eric Payne created YARN-7947: Summary: Capacity Scheduler intra-queue preemption can NPE for non-schedulable apps Key: YARN-7947 URL: https://issues.apache.org/jira/browse/YARN-7947 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, scheduler preemption Reporter: Eric Payne Intra-queue preemption policy can cause NPE for pending users with no schedulable apps.
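The defensive pattern the fix calls for can be sketched in isolation (purely illustrative; the names below are hypothetical and the actual FifoIntraQueuePreemptionPlugin fix may differ):

```java
import java.util.*;

public class NullSafeIteration {
    // Hypothetical sketch: skip users whose app collection is null or empty
    // instead of dereferencing it unconditionally during preemption math.
    static int countSchedulable(Map<String, List<String>> appsByUser) {
        int n = 0;
        for (Map.Entry<String, List<String>> e : appsByUser.entrySet()) {
            List<String> apps = e.getValue();
            if (apps == null || apps.isEmpty()) {
                continue; // pending user with no schedulable apps
            }
            n += apps.size();
        }
        return n;
    }

    public static void main(String[] args) {
        Map<String, List<String>> m = new HashMap<>();
        m.put("alice", Arrays.asList("app1", "app2"));
        m.put("bob", null); // user with pending resources but no apps
        System.out.println(countSchedulable(m)); // 2
    }
}
```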
[jira] [Created] (YARN-7927) YARN-7813 caused test failure in TestRMWebServicesSchedulerActivities
Eric Payne created YARN-7927: Summary: YARN-7813 caused test failure in TestRMWebServicesSchedulerActivities Key: YARN-7927 URL: https://issues.apache.org/jira/browse/YARN-7927 Project: Hadoop YARN Issue Type: Bug Reporter: Eric Payne
[jira] [Created] (YARN-7813) Capacity Scheduler Intra-queue Preemption should be configurable for each queue
Eric Payne created YARN-7813: Summary: Capacity Scheduler Intra-queue Preemption should be configurable for each queue Key: YARN-7813 URL: https://issues.apache.org/jira/browse/YARN-7813 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler, scheduler preemption Affects Versions: 3.0.0, 2.8.3, 2.9.0 Reporter: Eric Payne Assignee: Eric Payne Just as inter-queue (a.k.a. cross-queue) preemption is configurable per queue, intra-queue (a.k.a. in-queue) preemption should be configurable per queue. If a queue does not have a setting for intra-queue preemption, it should inherit its parent's value.
[jira] [Resolved] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP
[ https://issues.apache.org/jira/browse/YARN-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-7424. -- Resolution: Invalid bq. In order to create the "desired" behavior, we would have to fundamentally change the way the capacity scheduler works. Closing. > Capacity Scheduler Intra-queue preemption: add property to only preempt up to > configured MULP > - > > Key: YARN-7424 > URL: https://issues.apache.org/jira/browse/YARN-7424 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, scheduler preemption >Affects Versions: 3.0.0-beta1, 2.8.2 >Reporter: Eric Payne >Assignee: Eric Payne > > If the queue's configured minimum user limit percent (MULP) is something > small like 1%, all users will max out well over their MULP until 100 users > have apps in the queue. Since the intra-queue preemption monitor tries to > balance the resource among the users, most of the time in this use case it > will be preempting containers on behalf of users that are already over their > MULP guarantee. > This JIRA proposes that a property should be provided so that a queue can be > configured to only preempt on behalf of a user until that user has reached > its MULP.
[jira] [Created] (YARN-7728) Expose and expand container preemptions in Capacity Scheduler queue metrics
Eric Payne created YARN-7728: Summary: Expose and expand container preemptions in Capacity Scheduler queue metrics Key: YARN-7728 URL: https://issues.apache.org/jira/browse/YARN-7728 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.0.0, 2.8.3, 2.9.0 Reporter: Eric Payne Assignee: Eric Payne YARN-1047 exposed queue metrics for the number of preempted containers to the fair scheduler. I would like to also expose these to the capacity scheduler and add metrics for the amount of lost memory seconds and vcore seconds.
[jira] [Resolved] (YARN-7658) Capacity scheduler UI hangs when rendering if labels are present
[ https://issues.apache.org/jira/browse/YARN-7658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-7658. -- Resolution: Duplicate > Capacity scheduler UI hangs when rendering if labels are present > > > Key: YARN-7658 > URL: https://issues.apache.org/jira/browse/YARN-7658 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Eric Payne >
[jira] [Created] (YARN-7658) Capacity scheduler UI hangs when rendering if labels are present
Eric Payne created YARN-7658: Summary: Capacity scheduler UI hangs when rendering if labels are present Key: YARN-7658 URL: https://issues.apache.org/jira/browse/YARN-7658 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Reporter: Eric Payne
[jira] [Created] (YARN-7619) Max AM Resource value in CS UI is different for every user
Eric Payne created YARN-7619: Summary: Max AM Resource value in CS UI is different for every user Key: YARN-7619 URL: https://issues.apache.org/jira/browse/YARN-7619 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, yarn Affects Versions: 3.0.0-beta1, 2.9.0, 2.8.2, 3.1.0 Reporter: Eric Payne Assignee: Eric Payne YARN-7245 addressed the problem that the {{Max AM Resource}} in the capacity scheduler UI used to contain the queue-level AM limit instead of the user-level AM limit. It fixed this by using the user-specific AM limit that is calculated in {{LeafQueue#activateApplications}}, stored in each user's {{LeafQueue#User}} object, and retrieved via {{UserInfo#getResourceUsageInfo}}. The problem is that this user-specific AM limit depends on the activity of other users and other applications in a queue, and it is only calculated and updated when a user's application is activated. So, when {{CapacitySchedulerPage}} retrieves the user-specific AM limit, it is a stale value unless an application was recently activated for a particular user.
[jira] [Created] (YARN-7575) When using absolute capacity configuration with no max capacity, scheduler UI NPEs and can't grow queue
Eric Payne created YARN-7575: Summary: When using absolute capacity configuration with no max capacity, scheduler UI NPEs and can't grow queue Key: YARN-7575 URL: https://issues.apache.org/jira/browse/YARN-7575 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Reporter: Eric Payne I encountered the following while reviewing and testing branch YARN-5881. The design document from YARN-5881 says that for max-capacity: {quote} 3) For each queue, we require: a) if max-resource not set, it automatically set to parent.max-resource {quote} When I try leaving blank {{yarn.scheduler.capacity.<queue-path>.maximum-capacity}}, the RMUI scheduler page refuses to render. It looks like it's in {{CapacitySchedulerPage$LeafQueueInfoBlock}}: {noformat} 2017-11-28 11:29:16,974 [qtp43473566-220] ERROR webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException ... at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:164) at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithoutParition(CapacitySchedulerPage.java:129) {noformat} Also... A job will run in the leaf queue with no max capacity set and it will grow to the max capacity of the cluster, but if I add resources to the node, the job won't grow any more even though it has pending resources.
[jira] [Created] (YARN-7501) Capacity Scheduler Intra-queue preemption should have a "dead zone" around user limit
Eric Payne created YARN-7501: Summary: Capacity Scheduler Intra-queue preemption should have a "dead zone" around user limit Key: YARN-7501 URL: https://issues.apache.org/jira/browse/YARN-7501 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler, scheduler preemption Affects Versions: 3.0.0-beta1, 2.9.0, 2.8.2, 3.1.0 Reporter: Eric Payne
[jira] [Created] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations
Eric Payne created YARN-7496: Summary: CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations Key: YARN-7496 URL: https://issues.apache.org/jira/browse/YARN-7496 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.2 Reporter: Eric Payne Assignee: Eric Payne Only a problem in 2.8. Preemption could oscillate due to the difference in how user limit is calculated between 2.8 and later releases. Basically (ignoring ULF, MULP, and maybe others), the calculation for user limit on the Capacity Scheduler side in 2.8 is {{total used resources / number of active users}} while the calculation in later releases is {{total active resources / number of active users}}. When intra-queue preemption was backported to 2.8, its calculations for user limit were more aligned with the latter algorithm, which is in 2.9 and later releases.
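The divergence between the two formulas can be sketched with made-up numbers (a simplification that, like the description above, ignores ULF and MULP; the 100 GB / 60 GB figures are hypothetical):

```java
public class UserLimitDemo {
    // 2.8 formula (simplified): total *used* resources / active users
    static int userLimit28(int totalUsedGb, int activeUsers) {
        return totalUsedGb / activeUsers;
    }

    // 2.9+ formula (simplified): total *active* resources / active users
    static int userLimit29(int totalActiveGb, int activeUsers) {
        return totalActiveGb / activeUsers;
    }

    public static void main(String[] args) {
        // Hypothetical queue: 100 GB active, 60 GB currently used, 2 users.
        System.out.println(userLimit28(60, 2));  // 30 GB per user
        System.out.println(userLimit29(100, 2)); // 50 GB per user
    }
}
```

Because the 2.8 scheduler enforces the smaller figure while the backported preemption monitor targets the larger one, containers can be preempted and then never reassigned, which is the oscillation described above.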
[jira] [Created] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit
Eric Payne created YARN-7469: Summary: Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit Key: YARN-7469 URL: https://issues.apache.org/jira/browse/YARN-7469 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, yarn Affects Versions: 3.0.0-beta1, 2.9.0, 2.8.2 Reporter: Eric Payne Assignee: Eric Payne
[jira] [Created] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP
Eric Payne created YARN-7424: Summary: Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP Key: YARN-7424 URL: https://issues.apache.org/jira/browse/YARN-7424 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler, scheduler preemption Affects Versions: 3.0.0-beta1, 2.8.2 Reporter: Eric Payne
[jira] [Created] (YARN-7370) Intra-queue preemption properties should be refreshable
Eric Payne created YARN-7370: Summary: Intra-queue preemption properties should be refreshable Key: YARN-7370 URL: https://issues.apache.org/jira/browse/YARN-7370 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, scheduler preemption Affects Versions: 3.0.0-alpha3, 2.8.0 Reporter: Eric Payne At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} should be refreshable. It would also be nice to make {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} refreshable.
[jira] [Created] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user
Eric Payne created YARN-7245: Summary: In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user Key: YARN-7245 URL: https://issues.apache.org/jira/browse/YARN-7245 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, yarn Affects Versions: 3.0.0-alpha4, 2.8.1, 2.9.0 Reporter: Eric Payne
[jira] [Created] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
Eric Payne created YARN-7149: Summary: Cross-queue preemption sometimes starves an underserved queue Key: YARN-7149 URL: https://issues.apache.org/jira/browse/YARN-7149 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Affects Versions: 3.0.0-alpha3, 2.9.0 Reporter: Eric Payne Assignee: Eric Payne In branch-2 and trunk, I am consistently seeing some use cases where cross-queue preemption does not happen when it should. I do not see this in branch-2.8. Use Case:
| | *Size* | *Minimum Container Size* |
| MyCluster | 20 GB | 0.5 GB |

| *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit Percent (MULP)* | *User Limit Factor (ULF)* |
| Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
| Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
- {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
- {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
- _Note: containers are 0.5 GB._
- Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
- Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
- _No more containers are ever preempted, even though {{Q2}} is far underserved_
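Plugging the reported numbers into a simplified guaranteed-capacity model shows how far short the observed preemption falls (a sketch only; the real preemption monitor's ideal-assignment computation is considerably more involved):

```java
public class PreemptionMathDemo {
    // Simplified model: a queue's guaranteed share is capacity * cluster
    // size, and an underserved queue should be able to reclaim up to
    // min(pending, guaranteed) from over-capacity queues.
    static double reclaimableGb(double clusterGb, double capacity,
                                double pendingGb) {
        double guaranteedGb = clusterGb * capacity;
        return Math.min(pendingGb, guaranteedGb);
    }

    public static void main(String[] args) {
        // Q2: 50% of a 20 GB cluster, 10 GB pending.
        System.out.println(reclaimableGb(20.0, 0.5, 10.0)); // 10.0
        // ...yet only 1 GB (2 x 0.5 GB containers) was ever preempted.
    }
}
```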
[jira] [Created] (YARN-7120) CapacitySchedulerPage NPE in "Aggregate scheduler counts" section
Eric Payne created YARN-7120: Summary: CapacitySchedulerPage NPE in "Aggregate scheduler counts" section Key: YARN-7120 URL: https://issues.apache.org/jira/browse/YARN-7120 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha3, 2.8.1, 2.9.0 Reporter: Eric Payne Assignee: Eric Payne Priority: Minor The problem manifests itself by having the bottom part of the "Aggregated scheduler counts" section cut off on the GUI and an NPE in the RM log. {noformat} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$HealthBlock.render(CapacitySchedulerPage.java:558) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet.__(Hamlet.java:30354) at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:478) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:86) ... 
58 more {noformat}
[jira] [Created] (YARN-7052) RM SchedulingMonitor should use HadoopExecutors when creating ScheduledExecutorService
Eric Payne created YARN-7052: Summary: RM SchedulingMonitor should use HadoopExecutors when creating ScheduledExecutorService Key: YARN-7052 URL: https://issues.apache.org/jira/browse/YARN-7052 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Eric Payne In YARN-7051, we ran into a case where the preemption monitor thread hung with no indication of why. This was because the preemption monitor is started by the {{ScheduledExecutorService}} from {{SchedulingMonitor#serviceStart}}, and then nothing ever gets the result of the future or allows it to throw an exception if needed. At least with {{HadoopExecutors}}, it will provide a {{HadoopScheduledThreadPoolExecutor}} that logs the exception if one happens.
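The failure mode — an exception thrown inside a scheduled task that nobody ever observes because the Future is never consulted — can be reproduced in isolation (a minimal sketch using only the JDK, unrelated to the actual SchedulingMonitor code):

```java
import java.util.concurrent.*;

public class SilentFailureDemo {
    // Returns true if the task's exception surfaced only via Future.get().
    static boolean exceptionHiddenUntilGet() {
        ScheduledExecutorService ses =
            Executors.newSingleThreadScheduledExecutor();
        try {
            ScheduledFuture<?> f = ses.schedule(
                (Runnable) () -> {
                    throw new IllegalStateException("monitor crashed");
                },
                0, TimeUnit.MILLISECONDS);
            try {
                f.get(); // without this call, the failure is never observed
                return false;
            } catch (ExecutionException e) {
                return e.getCause() instanceof IllegalStateException;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        } finally {
            ses.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(exceptionHiddenUntilGet()); // true
    }
}
```

If no caller ever invokes {{get()}} — as described above for the preemption monitor — the task simply stops running and the exception is swallowed, which is why a wrapper that logs uncaught task exceptions helps.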
[jira] [Created] (YARN-7051) FifoIntraQueuePreemptionPlugin can get concurrent modification exception/
Eric Payne created YARN-7051: Summary: FifoIntraQueuePreemptionPlugin can get concurrent modification exception/ Key: YARN-7051 URL: https://issues.apache.org/jira/browse/YARN-7051 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha3, 2.8.1, 2.9.0 Reporter: Eric Payne Priority: Critical {{FifoIntraQueuePreemptionPlugin#calculateUsedAMResourcesPerQueue}} has the following code: {code} Collection<FiCaSchedulerApp> runningApps = leafQueue.getApplications(); Resource amUsed = Resources.createResource(0, 0); for (FiCaSchedulerApp app : runningApps) { {code} {{runningApps}} is unmodifiable but not concurrent. This caused the preemption monitor thread to crash in the RM in one of our clusters.
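The hazard here is that an unmodifiable view is neither a snapshot nor thread-safe; it can be shown with a single-threaded JDK-only sketch (the iteration itself mutates the backing list, whereas in the RM the mutation comes from the scheduler thread):

```java
import java.util.*;

public class CmeDemo {
    // Returns true if iterating an unmodifiable *view* still throws when
    // the backing collection is structurally modified mid-iteration.
    static boolean triggersCme() {
        List<String> backing =
            new ArrayList<>(Arrays.asList("app1", "app2", "app3"));
        List<String> view = Collections.unmodifiableList(backing);
        try {
            for (String app : view) {
                backing.remove(app); // modification of the backing list
            }
        } catch (ConcurrentModificationException e) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(triggersCme()); // true
    }
}
```

A common remedy is to copy the collection (or use a concurrent collection) before iterating, so the preemption monitor never races the scheduler's updates.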
[jira] [Created] (YARN-6585) RM fails to start when upgrading from 2.7 to 2.8 for clusters with node labels.
Eric Payne created YARN-6585: Summary: RM fails to start when upgrading from 2.7 to 2.8 for clusters with node labels. Key: YARN-6585 URL: https://issues.apache.org/jira/browse/YARN-6585 Project: Hadoop YARN Issue Type: Bug Reporter: Eric Payne {noformat} Caused by: java.io.IOException: Not all labels being replaced contained by known label collections, please check, new labels=[abc] at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.checkReplaceLabelsOnNode(CommonNodeLabelsManager.java:718) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.replaceLabelsOnNode(CommonNodeLabelsManager.java:737) at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.replaceLabelsOnNode(RMNodeLabelsManager.java:189) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.loadFromMirror(FileSystemNodeLabelsStore.java:181) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:208) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:251) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:265) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 13 more {noformat}
[jira] [Created] (YARN-6248) Killing an app with pending container requests leaves the user in UsersManager
Eric Payne created YARN-6248: Summary: Killing an app with pending container requests leaves the user in UsersManager Key: YARN-6248 URL: https://issues.apache.org/jira/browse/YARN-6248 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha3 Reporter: Eric Payne Assignee: Eric Payne If an app is still asking for resources when it is killed, the user is left in the UsersManager structure and shows up on the GUI.
[jira] [Created] (YARN-6165) Intra-queue preemption occurs even when preemption is turned off for a specific queue.
Eric Payne created YARN-6165: Summary: Intra-queue preemption occurs even when preemption is turned off for a specific queue. Key: YARN-6165 URL: https://issues.apache.org/jira/browse/YARN-6165 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, scheduler preemption Affects Versions: 3.0.0-alpha2 Reporter: Eric Payne Intra-queue preemption occurs even when preemption is turned on for the whole cluster ({{yarn.resourcemanager.scheduler.monitor.enable == true}}) but turned off for a specific queue ({{yarn.scheduler.capacity.root.queue1.disable_preemption == true}}).
[jira] [Created] (YARN-5973) TestCapacitySchedulerSurgicalPreemption sometimes fails
Eric Payne created YARN-5973: Summary: TestCapacitySchedulerSurgicalPreemption sometimes fails Key: YARN-5973 URL: https://issues.apache.org/jira/browse/YARN-5973 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, scheduler preemption Affects Versions: 2.8.0 Reporter: Eric Payne Priority: Minor The tests in {{TestCapacitySchedulerSurgicalPreemption}} appear to be racy. They often pass, but the following errors sometimes occur: {noformat} testSimpleSurgicalPreemption(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption) Time elapsed: 14.671 sec <<< FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.fail(Assert.java:95) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerPreemptionTestBase.waitNumberOfLiveContainersFromApp(CapacitySchedulerPreemptionTestBase.java:110) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption.testSimpleSurgicalPreemption(TestCapacitySchedulerSurgicalPreemption.java:143) {noformat} {noformat} testSurgicalPreemptionWithAvailableResource(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption) Time elapsed: 9.503 sec <<< FAILURE! 
java.lang.AssertionError: expected:<3> but was:<2> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption.testSurgicalPreemptionWithAvailableResource(TestCapacitySchedulerSurgicalPreemption.java:220) {noformat}
[jira] [Created] (YARN-4751) In 2.7, Labeled queue usage not shown properly in capacity scheduler UI
Eric Payne created YARN-4751: Summary: In 2.7, Labeled queue usage not shown properly in capacity scheduler UI Key: YARN-4751 URL: https://issues.apache.org/jira/browse/YARN-4751 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, yarn Affects Versions: 2.7.3 Reporter: Eric Payne Assignee: Eric Payne In 2.6 and 2.7, the capacity scheduler UI does not have the queue graphs separated by partition. When applications are running on a labeled queue, no color is shown in the bar graph, and several of the "Used" metrics are zero.
[jira] [Resolved] (YARN-4390) Consider container request size during CS preemption
[ https://issues.apache.org/jira/browse/YARN-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-4390. -- Resolution: Duplicate Closing this ticket in favor of YARN-4108 > Consider container request size during CS preemption > > > Key: YARN-4390 > URL: https://issues.apache.org/jira/browse/YARN-4390 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.0.0, 2.8.0, 2.7.3 >Reporter: Eric Payne >Assignee: Eric Payne > > There are multiple reasons why preemption could unnecessarily preempt > containers. One is that an app could be requesting a large container (say > 8-GB), and the preemption monitor could conceivably preempt multiple > containers (say 8, 1-GB containers) in order to fill the large container > request. These smaller containers would then be rejected by the requesting AM > and potentially given right back to the preempted app.
[jira] [Resolved] (YARN-4226) Make capacity scheduler queue's preemption status REST API consistent with GUI
[ https://issues.apache.org/jira/browse/YARN-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-4226. -- Resolution: Won't Fix Since the code works and is only slightly confusing, I am closing this ticket as WontFix. > Make capacity scheduler queue's preemption status REST API consistent with GUI > -- > > Key: YARN-4226 > URL: https://issues.apache.org/jira/browse/YARN-4226 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > > In the capacity scheduler GUI, the preemption status has the following form: > {code} > Preemption: disabled > {code} > However, the REST API shows the following for the same status: > {code} > preemptionDisabled":true > {code} > The latter is confusing and should be consistent with the format in the GUI.
[jira] [Created] (YARN-4422) Generic AHS sometimes doesn't show started, node, or logs on App page
Eric Payne created YARN-4422: Summary: Generic AHS sometimes doesn't show started, node, or logs on App page Key: YARN-4422 URL: https://issues.apache.org/jira/browse/YARN-4422 Project: Hadoop YARN Issue Type: Bug Reporter: Eric Payne Assignee: Eric Payne Sometimes the AM container for an app isn't able to start the JVM. This can happen if bogus JVM options are given to the AM container ({{-Dyarn.app.mapreduce.am.command-opts=-InvalidJvmOption}}) or when misconfiguring the AM container's environment variables ({{-Dyarn.app.mapreduce.am.env="JAVA_HOME=/foo/bar/baz"}}) When the AM container for an app isn't able to start the JVM, the Application page for that application shows {{N/A}} for the {{Started}}, {{Node}}, and {{Logs}} columns. It _does_ have links for each app attempt, and if you click on one of them, you go to the Application Attempt page, where you can see all containers with links to their logs and nodes, including the AM container. But none of that shows up for the app attempts on the Application page. Also, on the Application Attempt page, in the {{Application Attempt Overview}} section, the {{AM Container}} value is {{null}} and the {{Node}} value is {{N/A}}.
[jira] [Created] (YARN-4390) Consider container request size during CS preemption
Eric Payne created YARN-4390: Summary: Consider container request size during CS preemption Key: YARN-4390 URL: https://issues.apache.org/jira/browse/YARN-4390 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Affects Versions: 3.0.0, 2.8.0, 2.7.3 Reporter: Eric Payne Assignee: Eric Payne There are multiple reasons why preemption could unnecessarily preempt containers. One is that an app could be requesting a large container (say 8 GB), and the preemption monitor could conceivably preempt multiple smaller containers (say eight 1-GB containers) in order to fill the large container request. These smaller containers would then be rejected by the requesting AM and potentially given right back to the preempted app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
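The mismatch described in YARN-4390 can be sketched in a few lines. This is an illustrative standalone sketch, not Hadoop code; the node layout and sizes are hypothetical. The point is that a container must be placed on a single node, so freeing the right *aggregate* amount of memory across many nodes still may not satisfy one large ask:

```python
# Illustrative sketch (not Hadoop code) of why preempting many small
# containers may not satisfy one large container request: the freed
# resources are scattered across nodes, and a container must fit on
# a single node.

def largest_single_node_free(nodes_free_mb):
    """Only the largest per-node free block matters for placing one container."""
    return max(nodes_free_mb)

# Hypothetical cluster: eight nodes, each hosting one 1 GB container
# that the preemption monitor could kill.
nodes_free_mb = [0] * 8
for i in range(8):              # preempt one 1 GB container on each node
    nodes_free_mb[i] += 1024

total_freed = sum(nodes_free_mb)                        # 8192 MB in aggregate
fits = largest_single_node_free(nodes_free_mb) >= 8192  # can the 8 GB ask land?
print(total_freed, fits)        # 8192 False: work was lost, ask still unplaced
```

Eight containers' worth of work is destroyed, yet the 8 GB request still cannot be scheduled, which matches the churn the ticket describes.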
[jira] [Created] (YARN-4225) Add preemption status to {{yarn queue -status}}
Eric Payne created YARN-4225: Summary: Add preemption status to {{yarn queue -status}} Key: YARN-4225 URL: https://issues.apache.org/jira/browse/YARN-4225 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4226) Make capacity scheduler queue's preemption status REST API consistent with GUI
Eric Payne created YARN-4226: Summary: Make capacity scheduler queue's preemption status REST API consistent with GUI Key: YARN-4226 URL: https://issues.apache.org/jira/browse/YARN-4226 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, yarn Affects Versions: 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Priority: Minor In the capacity scheduler GUI, the preemption status has the following form: {code} Preemption: disabled {code} However, the REST API shows the following for the same status: {code} "preemptionDisabled": true {code} The latter is confusing and should be consistent with the format in the GUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
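The inconsistency in YARN-4226 is easy to see side by side. A minimal sketch (hypothetical helper names, not the actual RM webapp code) rendering the same queue attribute both ways; the REST field reads as the logical negation of the GUI line, which is what invites misreading:

```python
# Hypothetical sketch: one queue attribute rendered in the two forms the
# ticket compares. The GUI says "disabled"; the REST API exposes a negated
# boolean named preemptionDisabled.

def gui_preemption_line(preemption_disabled: bool) -> str:
    # GUI form: "Preemption: disabled" / "Preemption: enabled"
    return "Preemption: " + ("disabled" if preemption_disabled else "enabled")

def rest_field(preemption_disabled: bool) -> dict:
    # REST form: a boolean whose name is the negation of the GUI wording
    return {"preemptionDisabled": preemption_disabled}

print(gui_preemption_line(True))   # Preemption: disabled
print(rest_field(True))            # {'preemptionDisabled': True}
```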
[jira] [Created] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
Eric Payne created YARN-3978: Summary: Configurably turn off the saving of container info in Generic AHS Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Reporter: Eric Payne Assignee: Eric Payne Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3905) Application History Server UI NPEs when accessing apps run after RM restart
Eric Payne created YARN-3905: Summary: Application History Server UI NPEs when accessing apps run after RM restart Key: YARN-3905 URL: https://issues.apache.org/jira/browse/YARN-3905 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.7.1, 2.7.0, 2.8.0 Reporter: Eric Payne Assignee: Eric Payne From the Application History URL (http://RmHostName:8188/applicationhistory), clicking on the application ID of an app that was run after the RM daemon has been restarted results in a 500 error: {noformat} Sorry, got error 500 Please consult RFC 2616 for meanings of the error code. {noformat} The stack trace is as follows: {code} 2015-07-09 20:13:15,584 [2068024519@qtp-769046918-3] INFO applicationhistoryservice.FileSystemApplicationHistoryStore: Completed reading history information of all application attempts of application application_1436472584878_0001 2015-07-09 20:13:15,591 [2068024519@qtp-769046918-3] ERROR webapp.AppBlock: Failed to read the AM container of the application attempt appattempt_1436472584878_0001_01. java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.convertToContainerReport(ApplicationHistoryManagerImpl.java:206) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getContainer(ApplicationHistoryManagerImpl.java:199) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainerReport(ApplicationHistoryClientService.java:205) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:272) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:267) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.webapp.AppBlock.generateApplicationTable(AppBlock.java:266) ... 
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3769) Preemption occurring unnecessarily because preemption doesn't consider user limit
Eric Payne created YARN-3769: Summary: Preemption occurring unnecessarily because preemption doesn't consider user limit Key: YARN-3769 URL: https://issues.apache.org/jira/browse/YARN-3769 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0, 2.6.0, 2.8.0 Reporter: Eric Payne Assignee: Eric Payne We are seeing the preemption monitor preempting containers from queue A and then seeing the capacity scheduler giving them immediately back to queue A. This happens quite often and causes a lot of churn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3540) Fetcher#copyMapOutput is leaking usedMemory upon IOException during InMemoryMapOutput shuffle handler
Eric Payne created YARN-3540: Summary: Fetcher#copyMapOutput is leaking usedMemory upon IOException during InMemoryMapOutput shuffle handler Key: YARN-3540 URL: https://issues.apache.org/jira/browse/YARN-3540 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Priority: Blocker We are seeing this happen when - an NM's disk goes bad during the creation of map output(s) - the reducer's fetcher can read the shuffle header and reserve the memory - but gets an IOException when trying to shuffle for InMemoryMapOutput - shuffle fetch retry is enabled -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3275) Preemption happening on non-preemptable queues
Eric Payne created YARN-3275: Summary: Preemption happening on non-preemptable queues Key: YARN-3275 URL: https://issues.apache.org/jira/browse/YARN-3275 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne YARN-2056 introduced the ability to turn preemption on and off at the queue level. In cases where a queue goes over its absolute max capacity (YARN-3243, for example), containers can be preempted from that queue, even though the queue is marked as non-preemptable. We are using this feature in large, busy clusters and seeing this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2592) Preemption can kill containers to fulfil need of already over-capacity queue.
[ https://issues.apache.org/jira/browse/YARN-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne resolved YARN-2592. -- Resolution: Invalid Preemption can kill containers to fulfil need of already over-capacity queue. - Key: YARN-2592 URL: https://issues.apache.org/jira/browse/YARN-2592 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.1 Reporter: Eric Payne There are scenarios in which one over-capacity queue can cause preemption of another over-capacity queue. However, since killing containers may lose work, it doesn't make sense to me to kill containers to feed an already over-capacity queue. Consider the following: {code} root has A,B,C, total capacity = 90 A.guaranteed = 30, A.pending = 5, A.current = 40 B.guaranteed = 30, B.pending = 0, B.current = 50 C.guaranteed = 30, C.pending = 0, C.current = 0 {code} In this case, the queue preemption monitor will kill 5 resources from queue B so that queue A can pick them up, even though queue A is already over its capacity. This could lose any work that those containers in B had already done. Is there a use case for this behavior? It seems to me that if a queue is already over its capacity, it shouldn't destroy the work of other queues. If the over-capacity queue needs more resources, that seems to be a problem that should be solved by increasing its guarantee. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2592) Preemption can kill containers to fulfil need of already over-capacity queue.
Eric Payne created YARN-2592: Summary: Preemption can kill containers to fulfil need of already over-capacity queue. Key: YARN-2592 URL: https://issues.apache.org/jira/browse/YARN-2592 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.1, 3.0.0 Reporter: Eric Payne There are scenarios in which one over-capacity queue can cause preemption of another over-capacity queue. However, since killing containers may lose work, it doesn't make sense to me to kill containers to feed an already over-capacity queue. Consider the following: {code} root has A,B,C, total capacity = 90 A.guaranteed = 30, A.pending = 5, A.current = 40 B.guaranteed = 30, B.pending = 0, B.current = 50 C.guaranteed = 30, C.pending = 0, C.current = 0 {code} In this case, the queue preemption monitor will kill 5 resources from queue B so that queue A can pick them up, even though queue A is already over its capacity. This could lose any work that those containers in B had already done. Is there a use case for this behavior? It seems to me that if a queue is already over its capacity, it shouldn't destroy the work of other queues. If the over-capacity queue needs more resources, that seems to be a problem that should be solved by increasing its guarantee. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
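The A/B/C scenario in YARN-2592 can be worked through numerically. This is a hypothetical sketch of the naive balancing math the report describes, not the actual ProportionalCapacityPreemptionPolicy: the monitor sees pending demand in A and an over-guarantee queue B, and preempts from B without checking that A is itself already over its guarantee:

```python
# Hypothetical sketch of the preemption math described in YARN-2592
# (not the real ProportionalCapacityPreemptionPolicy). Numbers are taken
# from the ticket: total capacity 90, each queue guaranteed 30.

queues = {
    "A": {"guaranteed": 30, "pending": 5, "current": 40},
    "B": {"guaranteed": 30, "pending": 0, "current": 50},
    "C": {"guaranteed": 30, "pending": 0, "current": 0},
}

demand = queues["A"]["pending"]                              # 5 pending in A
over_b = queues["B"]["current"] - queues["B"]["guaranteed"]  # B is 20 over
preempt_from_b = min(demand, over_b)                         # monitor kills 5 from B
a_after = queues["A"]["current"] + preempt_from_b            # A grows to 45

# A was already over its 30 guarantee before preemption, and is more so after:
print(preempt_from_b, a_after)   # 5 45
```

The sketch shows the ticket's complaint directly: B loses five containers' worth of completed work so that A, already at 40 against a guarantee of 30, can grow to 45.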
[jira] [Created] (YARN-2024) IOException in AppLogAggregatorImpl does not give stacktrace and leaves aggregated TFile in a bad state.
Eric Payne created YARN-2024: Summary: IOException in AppLogAggregatorImpl does not give stacktrace and leaves aggregated TFile in a bad state. Key: YARN-2024 URL: https://issues.apache.org/jira/browse/YARN-2024 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 0.23.10 Reporter: Eric Payne Multiple issues were encountered when AppLogAggregatorImpl#uploadLogsForContainer encountered an IOException while aggregating yarn-logs for an application that had very large (150G each) error logs. - An IOException was encountered during the LogWriter#append call, and a message was printed, but no stacktrace was provided. Message: ERROR: Couldn't upload logs for container_n_nnn_nn_nn. Skipping this container. - After the IOException, the TFile is in a bad state, so subsequent calls to LogWriter#append fail with the following stacktrace: 2014-04-16 13:29:09,772 [LogAggregationService #17907] ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[LogAggregationService #17907,5,main] threw an Exception. java.lang.IllegalStateException: Incorrect state to start a new key: IN_VALUE at org.apache.hadoop.io.file.tfile.TFile$Writer.prepareAppendKey(TFile.java:528) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogWriter.append(AggregatedLogFormat.java:262) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainer(AppLogAggregatorImpl.java:128) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:164) ... 
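The failure mode in YARN-2024 is a writer left mid-entry: once one append fails, the TFile is in state IN_VALUE and every later append throws. A minimal sketch of the defensive behavior the report implies, using a hypothetical wrapper rather than the real AggregatedLogFormat.LogWriter API: remember the first failure (and log its stack trace) instead of letting later appends explode from deep inside the writer:

```python
# Hypothetical wrapper (not the real AggregatedLogFormat.LogWriter) showing
# the behavior YARN-2024 asks for: after one failed append leaves the
# underlying file mid-entry, record the failure and refuse further appends
# rather than raising IllegalStateException-style errors later.

class GuardedLogWriter:
    def __init__(self, append_fn):
        self._append = append_fn
        self.corrupt = False

    def append(self, key, value):
        if self.corrupt:
            return False            # writer is unusable; skip quietly
        try:
            self._append(key, value)
            return True
        except IOError:
            self.corrupt = True     # this is where the stacktrace should be logged
            return False

def flaky_append(key, value):
    # Simulates the disk error hit while appending a huge container log.
    if key == "huge-container-log":
        raise IOError("disk error during append")

w = GuardedLogWriter(flaky_append)
results = [w.append("c1", "ok"),
           w.append("huge-container-log", "x"),
           w.append("c2", "ok")]
print(results)   # [True, False, False]
```

Marking the writer corrupt also gives the cleanup path a signal that aggregation is done, addressing the third bullet (the cleaner otherwise believes the thread is still aggregating and never removes the huge yarn-logs).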
[jira] [Created] (YARN-1115) Provide optional means for a scheduler to check real user ACLs
Eric Payne created YARN-1115: Summary: Provide optional means for a scheduler to check real user ACLs Key: YARN-1115 URL: https://issues.apache.org/jira/browse/YARN-1115 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 0.23.9, 2.1.0-beta Reporter: Eric Payne In the framework for secure implementation using UserGroupInformation.doAs (http://hadoop.apache.org/docs/stable/Secure_Impersonation.html), a trusted superuser can submit jobs on behalf of another user in a secure way. In this framework, the superuser is referred to as the real user and the proxied user is referred to as the effective user. Currently when a job is submitted as an effective user, the ACLs for the effective user are checked against the queue on which the job is to be run. Depending on an optional configuration, the scheduler should also check the ACLs of the real user if the configuration to do so is set. For example, suppose my superuser name is super, and super is configured to securely proxy as joe. Also suppose there is a Hadoop queue named ops which only allows ACLs for super, not for joe. When super proxies to joe in order to submit a job to the ops queue, it will fail because joe, as the effective user, does not have ACLs on the ops queue. In many cases this is what you want, in order to protect queues that joe should not be using. However, there are times when super may need to proxy to many users, and the client running as super just wants to use the ops queue because the ops queue is already dedicated to the client's purpose, and, to keep the ops queue dedicated to that purpose, super doesn't want to open up ACLs to joe in general on the ops queue. Without this functionality, in this case, the client running as super needs to figure out which queue each user has ACLs opened up for, and then coordinate with other tasks using those queues. -- This message is automatically generated by JIRA. 
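The optional check YARN-1115 proposes can be sketched briefly. The config flag and helper names here are hypothetical, not an existing Hadoop API: if the effective (proxied) user lacks ACLs on the queue and the option is enabled, the scheduler also consults the real (super) user's ACLs:

```python
# Hypothetical sketch of the ACL fallback proposed in YARN-1115
# (names invented for illustration; not actual scheduler code).

def has_queue_access(queue_acls, effective_user, real_user, check_real_user_acls):
    if effective_user in queue_acls:
        return True
    # Proposed optional behavior: fall back to the real user's ACLs.
    if check_real_user_acls and real_user is not None:
        return real_user in queue_acls
    return False

ops_acls = {"super"}   # the ops queue only grants ACLs to super, not joe

print(has_queue_access(ops_acls, "joe", None, False))      # False: joe alone
print(has_queue_access(ops_acls, "joe", "super", False))   # False: today's behavior
print(has_queue_access(ops_acls, "joe", "super", True))    # True: proposed option
```

With the option off, behavior is unchanged; with it on, super can proxy as joe into the ops queue without opening the queue's ACLs to joe in general.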
If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira