[ https://issues.apache.org/jira/browse/YARN-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745591#comment-16745591 ]
Wangda Tan commented on YARN-9205:
----------------------------------

[~tangzhankun],

From the log, this looks like it is by design. There's a method:

org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker#getMaxAllowedAllocation

It was added to reject resource requests when no node has that amount of resource. For example, if an app asks for 100 GB of memory but the largest node has only 50 GB, the request is rejected with:
{code:java}
please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=<memory:8192, vCores:4, cmp.com/hdw: 9223372036854775807>{code}
This behavior takes effect once the following condition holds:
{code:java}
if (forceConfiguredMaxAllocation &&
    System.currentTimeMillis() - ResourceManager.getClusterTimeStamp()
        > configuredMaxAllocationWaitTime) {
  forceConfiguredMaxAllocation = false;
}{code}
The configuredMaxAllocationWaitTime is set from:
{code:java}
long configuredMaximumAllocationWaitTime =
    conf.getLong(YarnConfiguration.RM_WORK_PRESERVING_RECOVERY_SCHEDULING_WAIT_MS,
        YarnConfiguration.DEFAULT_RM_WORK_PRESERVING_RECOVERY_SCHEDULING_WAIT_MS);{code}
This is 10 seconds by default, which means that if no node with the custom resource registers within 10 seconds after the RM starts, apps will be rejected until an NM with the custom resource registers.

Let me know if this makes sense to you, or if there is any other issue I missed. We can make the error message more specific if needed.
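To make the timing concrete, here is a minimal sketch of the behavior described above. It is not the Hadoop implementation: the class MaxAllocationSketch, its methods, and the map-based resource representation are all hypothetical, and it compares timestamps directly instead of flipping a forceConfiguredMaxAllocation flag. It also ignores details such as node removal.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for the max-allowed-allocation logic. */
final class MaxAllocationSketch {
    private final Map<String, Long> configuredMax;
    private final Map<String, Long> maxRegistered = new HashMap<>();
    private final long startTimeMs;
    private final long waitTimeMs;

    MaxAllocationSketch(Map<String, Long> configuredMax, long startTimeMs, long waitTimeMs) {
        this.configuredMax = new HashMap<>(configuredMax);
        this.startTimeMs = startTimeMs;
        this.waitTimeMs = waitTimeMs;
    }

    /** Tracks, per resource name, the largest value seen on any registered node. */
    void registerNode(Map<String, Long> nodeResources) {
        nodeResources.forEach((name, value) -> maxRegistered.merge(name, value, Long::max));
    }

    /** Configured maximum during the wait window; afterwards capped by registered nodes. */
    Map<String, Long> maxAllowedAllocation(long nowMs) {
        if (nowMs - startTimeMs <= waitTimeMs) {
            return new HashMap<>(configuredMax);
        }
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> e : configuredMax.entrySet()) {
            long registered = maxRegistered.getOrDefault(e.getKey(), 0L);
            result.put(e.getKey(), Math.min(e.getValue(), registered));
        }
        return result;
    }

    /** A request is accepted only if every requested resource fits the current maximum. */
    boolean accepts(Map<String, Long> request, long nowMs) {
        Map<String, Long> max = maxAllowedAllocation(nowMs);
        for (Map.Entry<String, Long> e : request.entrySet()) {
            if (e.getValue() > max.getOrDefault(e.getKey(), 0L)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        MaxAllocationSketch tracker = new MaxAllocationSketch(
            Map.of("memory-mb", 8192L, "vcores", 4L, "cmp.com/hdw", Long.MAX_VALUE),
            0L, 10_000L);
        Map<String, Long> ask = Map.of("memory-mb", 2048L, "vcores", 2L, "cmp.com/hdw", 2L);

        // A node without the custom resource registers first.
        tracker.registerNode(Map.of("memory-mb", 8192L, "vcores", 4L));
        System.out.println("inside window:      " + tracker.accepts(ask, 5_000L));
        System.out.println("after window:       " + tracker.accepts(ask, 15_000L));

        // Once a node carrying the custom resource registers, the ask fits again.
        tracker.registerNode(Map.of("cmp.com/hdw", 10L));
        System.out.println("after registration: " + tracker.accepts(ask, 15_000L));
    }
}
```

In this sketch the same ask is accepted inside the wait window, rejected after it while no registered node carries cmp.com/hdw, and accepted again once such a node registers.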
> When using custom resource type, application will fail to run due to the CapacityScheduler throws InvalidResourceRequestException(GREATER_THEN_MAX_ALLOCATION)
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9205
>                 URL: https://issues.apache.org/jira/browse/YARN-9205
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.3.0
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Critical
>         Attachments: YARN-9205-trunk.001.patch
>
>
> In a non-secure cluster, reproduce it as follows:
> # Set capacity scheduler in yarn-site.xml
> # Use default capacity-scheduler.xml
> # Set custom resource type "cmp.com/hdw" in resource-types.xml
> # Set a value, say 10, in node-resources.xml
> # Start cluster
> # Submit a distributed shell application which requests some "cmp.com/hdw"
> The AM gets an exception from CapacityScheduler and then fails. This bug doesn't exist in FairScheduler.
> {code:java}
> 2019-01-17 22:12:11,286 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[<memory:2048, vCores:2, cmp.com/hdw: 2>]Priority[0]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[]
> 2019-01-17 22:12:12,326 ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[cmp.com/hdw], Requested resource=<memory:2048, vCores:2, cmp.com/hdw: 2>, maximum allowed allocation=<memory:8192, vCores:4>, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=<memory:8192, vCores:4, cmp.com/hdw: 9223372036854775807>
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.throwInvalidResourceException(SchedulerUtils.java:492)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.checkResourceRequestAgainstAvailableResource(SchedulerUtils.java:388)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:315)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:293)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:301)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:250)
>     at org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:240)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
>     at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>     at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
>     at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>     ...{code}
> After some rough debugging, the method below returns the wrong maximum capacity.
> DefaultAMSProcessor.java, Line 234:
> {code:java}
> Resource maximumCapacity =
>     getScheduler().getMaximumResourceCapability(app.getQueue());{code}
> It seems the code above should return "<memory:8192, vCores:4, cmp.com/hdw:10>" but it returns "<memory:8192, vCores:4>".
> This might be caused by the queue maximum allocation calculation in AbstractCSQueue.java, Line 364:
> {code:java}
> this.maximumAllocation =
>     configuration.getMaximumAllocationPerQueue(
>         getQueuePath());{code}
> This invokes CapacitySchedulerConfiguration.java, Line 895:
> {code:java}
> Resource clusterMax = ResourceUtils.fetchMaximumAllocationFromConfig(this);
> {code}
> Passing a "this" that is not a YarnConfiguration instance causes the code below to return null for the resource names, so the result contains only the mandatory resources. This might be the root cause.
> {code:java}
> private static Map<String, ResourceInformation> getResourceInformationMapFromConfig(
>     ...
>   // NULL value here!
>   String[] resourceNames = conf.getStrings(YarnConfiguration.RESOURCE_TYPES);
> {code}
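The null path suspected in the quoted description can be illustrated with a small stand-alone sketch. It does not use Hadoop classes: ResourceTypesLookupSketch, its getStrings helper, and the plain Map used as a configuration are hypothetical, and the mandatory maximums are hard-coded to the values from the log. The only detail taken from YARN is the property key yarn.resource-types behind YarnConfiguration.RESOURCE_TYPES.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of a missing resource-types entry; not Hadoop code. */
final class ResourceTypesLookupSketch {
    static final String RESOURCE_TYPES = "yarn.resource-types";

    /** Returns the comma-separated values for a key, or null when the key is unset. */
    static String[] getStrings(Map<String, String> conf, String key) {
        String value = conf.get(key);
        if (value == null || value.trim().isEmpty()) {
            return null;
        }
        return value.trim().split("\\s*,\\s*");
    }

    /** Builds a maximum allocation: mandatory resources plus any configured custom types. */
    static Map<String, Long> maximumAllocation(Map<String, String> conf) {
        Map<String, Long> max = new LinkedHashMap<>();
        max.put("memory-mb", 8192L);
        max.put("vcores", 4L);
        String[] resourceNames = getStrings(conf, RESOURCE_TYPES);
        if (resourceNames != null) {
            for (String name : resourceNames) {
                // Assumption for this sketch: custom types default to an unbounded maximum.
                max.put(name, Long.MAX_VALUE);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // A configuration that saw resource-types.xml versus one that did not.
        Map<String, String> withTypes = Map.of(RESOURCE_TYPES, "cmp.com/hdw");
        Map<String, String> withoutTypes = Map.of();
        System.out.println(maximumAllocation(withTypes));
        System.out.println(maximumAllocation(withoutTypes));
    }
}
```

When the key is absent the custom resource never enters the maximum, so any request for it would exceed the allowed allocation, which matches the symptom in the log above.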