[jira] [Updated] (YARN-9173) FairShare calculation broken for large values after YARN-8833
[ https://issues.apache.org/jira/browse/YARN-9173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-9173:
-----------------------------
    Fix Version/s:     (was: 3.1.3)
                   3.1.2

> FairShare calculation broken for large values after YARN-8833
> --------------------------------------------------------------
>
>                 Key: YARN-9173
>                 URL: https://issues.apache.org/jira/browse/YARN-9173
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 3.3.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Major
>             Fix For: 3.0.4, 3.1.2, 3.3.0, 3.2.1
>
>         Attachments: YARN-9137-branch-3.1.001.patch, YARN-9137-branch3.1.001.patch, YARN-9173.001.patch, YARN-9173.002.patch
>
> After the fix for the infinite loop in YARN-8833 we now get the wrong values back for fair-share calculations under certain circumstances. The current implementation works when the total resource is smaller than Integer.MAX_VALUE; when the total resource goes above that value, the number of iterations is not enough to converge to the correct value.
> The new test {{testResourceUsedWithWeightToResourceRatio()}} only checks that the calculation does not hang, but does not check the outcome of the calculation.
[jira] [Commented] (YARN-9173) FairShare calculation broken for large values after YARN-8833
[ https://issues.apache.org/jira/browse/YARN-9173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748107#comment-16748107 ]

Wangda Tan commented on YARN-9173:
----------------------------------
Cherry-picked to branch-3.1.2 as well. Updated the fix version.
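For context, here is a minimal, self-contained sketch of the convergence problem described in the YARN-9173 issue above. This is illustrative pseudocode of a weight-to-resource-ratio binary search, not the actual ComputeFairShares implementation: with a hard-coded small iteration count the search cannot narrow the ratio once the total resource exceeds Integer.MAX_VALUE, while bounding the loop by the width of the search interval lets it converge.

{code:java}
// Illustrative sketch only -- names and structure are simplified assumptions,
// not the FairScheduler's ComputeFairShares code.
public final class FairShareConvergenceSketch {

  // Stand-in for "resource handed out at this weight-to-resource ratio".
  static long resourceUsedWithRatio(double ratio, long[] weights) {
    long used = 0;
    for (long w : weights) {
      used += (long) (ratio * w);
    }
    return used;
  }

  static double findRatio(long totalResource, long[] weights) {
    // Grow the upper bound until it covers the total resource.
    double lo = 0.0;
    double hi = 1.0;
    while (resourceUsedWithRatio(hi, weights) < totalResource) {
      lo = hi;
      hi *= 2.0;
    }
    // A hard-coded small iteration count (the behavior described above) stops
    // too early for totals above Integer.MAX_VALUE; instead, iterate until the
    // search interval is tight.
    while (hi - lo > 1.0) {
      double mid = (lo + hi) / 2.0;
      if (resourceUsedWithRatio(mid, weights) < totalResource) {
        lo = mid;
      } else {
        hi = mid;
      }
    }
    return hi;
  }

  public static void main(String[] args) {
    long total = 4L * Integer.MAX_VALUE;   // deliberately larger than any int
    System.out.println(findRatio(total, new long[] {1, 2, 5}));
  }
}
{code}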
[jira] [Updated] (YARN-9205) When using custom resource type, application will fail to run due to the CapacityScheduler throws InvalidResourceRequestException(GREATER_THEN_MAX_ALLOCATION)
[ https://issues.apache.org/jira/browse/YARN-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-9205:
-----------------------------
    Target Version/s: 3.1.2, 3.2.1  (was: 3.1.2)

> When using custom resource type, application will fail to run due to the CapacityScheduler throws InvalidResourceRequestException(GREATER_THEN_MAX_ALLOCATION)
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9205
>                 URL: https://issues.apache.org/jira/browse/YARN-9205
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.3.0
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Critical
>         Attachments: YARN-9205-trunk.001.patch, YARN-9205-trunk.002.patch
>
> In a non-secure cluster, reproduce it as follows:
> # Set the capacity scheduler in yarn-site.xml
> # Use the default capacity-scheduler.xml
> # Set the custom resource type "cmp.com/hdw" in resource-types.xml
> # Set a value, say 10, in node-resources.xml
> # Start the cluster
> # Submit a distributed shell application which requests some "cmp.com/hdw"
> The AM will get an exception from the CapacityScheduler and then fail. This bug doesn't exist in the FairScheduler.
> {code:java}
> 2019-01-17 22:12:11,286 INFO distributedshell.ApplicationMaster: Requested container ask: Capability[...]Priority[0]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[]
> 2019-01-17 22:12:12,326 ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[cmp.com/hdw], Requested resource=..., maximum allowed allocation=..., please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=...
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.throwInvalidResourceException(SchedulerUtils.java:492)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.checkResourceRequestAgainstAvailableResource(SchedulerUtils.java:388)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:315)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:293)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:301)
> at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:250)
> at org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:240)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
> at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> ...{code}
> Did some rough debugging; the method below returns the wrong maximum capacity (DefaultAMSProcessor.java, line 234).
> {code:java}
> Resource maximumCapacity = getScheduler().getMaximumResourceCapability(app.getQueue());{code}
> The above code should return a maximum capability that includes the custom resource type, but it returns one containing only the mandatory resources.
> This incorrect value might be caused by the queue maximum allocation calculation introduced in YARN-8720, AbstractCSQueue.java line 364:
> {code:java}
> this.maximumAllocation = configuration.getMaximumAllocationPerQueue(getQueuePath());{code}
> And this invokes CapacitySchedulerConfiguration.java line 895:
> {code:java}
> Resource clusterMax = ResourceUtils.fetchMaximumAllocationFromConfig(this);
> {code}
> Passing a "this" which is not a YarnConfiguration instance causes the code below to return null for the resource names, so the resulting map contains only the mandatory resources. This might be the root cause.
> {code:java}
> private static Map<String, ResourceInformation> getResourceInformationMapFromConfig(
> ...
> // NULL value here!
> String[] resourceNames = conf.getStrings(YarnConfiguration.RESOURCE_TYPES);
> {code}
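To make the suspected root cause above concrete, here is a small, hypothetical reproduction (the literal key "yarn.resource-types" corresponds to YarnConfiguration.RESOURCE_TYPES; the class below is illustrative and not part of Hadoop): a plain Configuration that has not loaded resource-types.xml returns null for the custom resource type list, which is why only memory and vcores survive into the maximum allocation.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Hypothetical repro sketch of the null lookup described above; not Hadoop code.
public class ResourceTypesLookupSketch {
  public static void main(String[] args) {
    // A bare Configuration, as opposed to a YarnConfiguration, does not have
    // resource-types.xml on its resource list.
    Configuration conf = new Configuration(false);
    String[] names = conf.getStrings("yarn.resource-types");
    System.out.println(names == null
        ? "null -> only mandatory resources (memory-mb, vcores) are kept"
        : String.join(",", names));
  }
}
{code}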
[jira] [Commented] (YARN-9205) When using custom resource type, application will fail to run due to the CapacityScheduler throws InvalidResourceRequestException(GREATER_THEN_MAX_ALLOCATION)
[ https://issues.apache.org/jira/browse/YARN-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748090#comment-16748090 ]

Wangda Tan commented on YARN-9205:
----------------------------------
Gotcha, thanks [~tangzhankun]. The ver.2 patch looks good then; could you also provide tests to avoid future regressions? Since 3.1.2 is delayed, I want to include this in 3.1.2. It would be great if we can get the patch committed by tomorrow.
[jira] [Updated] (YARN-9205) When using custom resource type, application will fail to run due to the CapacityScheduler throws InvalidResourceRequestException(GREATER_THEN_MAX_ALLOCATION)
[ https://issues.apache.org/jira/browse/YARN-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-9205:
-----------------------------
    Target Version/s: 3.1.2
[jira] [Commented] (YARN-9204) yarn.scheduler.capacity..accessible-node-labels..capacity can not support absolute resource value
[ https://issues.apache.org/jira/browse/YARN-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16747644#comment-16747644 ]

Wangda Tan commented on YARN-9204:
----------------------------------
[~cheersyang], this sounds like a critical rather than a blocker. I can still get it into 3.1.2 since I haven't sent RC0 out yet; I can redo the RC if you get the patch committed soon. +1 to the latest patch; please let me know if you think differently.

> yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.capacity can not support absolute resource value
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9204
>                 URL: https://issues.apache.org/jira/browse/YARN-9204
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.1.3
>            Reporter: Jiandan Yang
>            Assignee: Jiandan Yang
>            Priority: Blocker
>         Attachments: YARN-9204.001.patch, YARN-9204.002.patch, YARN-9204.003.patch, YARN-9204.004.patch, YARN-9204.005.patch
>
> When I set *yarn.scheduler.capacity.<queue-path>.capacity* and *yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.capacity* to an absolute resource value, starting the RM fails and throws the following exception. After diving into the related code, I found that the logic checking for an absolute resource value may be wrong.
> {code:java}
> 2019-01-17 20:25:45,716 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
> java.lang.NumberFormatException: For input string: "[memory=40960,vcore=48]"
> at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
> at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
> at java.lang.Float.parseFloat(Float.java:451)
> at org.apache.hadoop.conf.Configuration.getFloat(Configuration.java:1606)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.internalGetLabeledQueueCapacity(CapacitySchedulerConfiguration.java:655)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.getLabeledQueueCapacity(CapacitySchedulerConfiguration.java:670)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.loadCapacitiesByLabelsFromConf(CSQueueUtils.java:135)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.loadUpdateAndCheckCapacities(CSQueueUtils.java:110)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.setupConfigurableCapacities(AbstractCSQueue.java:179)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.setupQueueConfigs(AbstractCSQueue.java:356)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.setupQueueConfigs(AbstractCSQueue.java:323)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.setupQueueConfigs(ParentQueue.java:130)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.<init>(ParentQueue.java:112)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:275)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.initializeQueues(CapacitySchedulerQueueManager.java:158)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:715)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:360)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:425)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:817)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1218)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:317)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1500)
> 2019-01-17 20:25:45,719 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: SHUTDOWN_MSG:
> {code}
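A hedged sketch of the kind of guard the stack trace above suggests is missing (the method name and return convention are illustrative, not the committed patch): a capacity value written in the bracketed absolute-resource form should be detected and routed to the absolute-resource parsing path rather than handed to Float.parseFloat.

{code:java}
// Illustrative only: detect the "[memory=...,vcore=...]" form before parsing a float.
static float parseConfiguredCapacity(String value, float defaultPercentage) {
  if (value == null) {
    return defaultPercentage;
  }
  String trimmed = value.trim();
  if (trimmed.startsWith("[") && trimmed.endsWith("]")) {
    // Absolute resource mode: percentages do not apply; the caller should take
    // the absolute-resource code path instead of throwing NumberFormatException.
    return defaultPercentage;
  }
  return Float.parseFloat(trimmed);
}
{code}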
[jira] [Commented] (YARN-9210) YARN UI can not display node info
[ https://issues.apache.org/jira/browse/YARN-9210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746513#comment-16746513 ]

Wangda Tan commented on YARN-9210:
----------------------------------
LGTM, +1 to the patch.

> YARN UI can not display node info
> ----------------------------------
>
>                 Key: YARN-9210
>                 URL: https://issues.apache.org/jira/browse/YARN-9210
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Jiandan Yang
>            Assignee: Jiandan Yang
>            Priority: Major
>         Attachments: YARN-9210.001.patch, screenshot-1.png
>
> When visiting http://rm_hostname:8088/cluster/nodes, one "Active Node" is shown in the "Cluster Nodes Metrics" area, but the detailed node info is not displayed, as shown in [screenshot-1.png|https://issues.apache.org/jira/secure/attachment/12955358/screenshot-1.png].
[jira] [Commented] (YARN-9205) When using custom resource type, application will fail to run due to the CapacityScheduler throws InvalidResourceRequestException(GREATER_THEN_MAX_ALLOCATION)
[ https://issues.apache.org/jira/browse/YARN-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746510#comment-16746510 ]

Wangda Tan commented on YARN-9205:
----------------------------------
[~tangzhankun], thanks for troubleshooting and finding the root cause. It looks reasonable to me; we have had a max allocation at the queue level for a while, but it was never enforced by the scheduler until very recently.

Could you add the code block (and turn it into a small method) in the same place as org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#reinitialize? We also need to add two tests to make sure it is honored: one for the init-scheduler case and one for the refresh-scheduler case.

Also, which versions are impacted by the issue, and what is the workaround?
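A hedged sketch of what the suggestion above could look like. The helper name is made up, and nodeTracker / setConfiguredMaxAllocation are assumptions about the scheduler internals, so treat this as illustrative rather than the committed patch: recompute the cluster maximum allocation from a configuration that actually has the custom resource types loaded, and call the helper from both initScheduler() and reinitialize().

{code:java}
// Illustrative sketch, not the committed YARN-9205 patch.
private void refreshMaximumAllocation(Configuration conf) {
  // Wrapping in YarnConfiguration ensures resource-types.xml / yarn.resource-types
  // is loaded, so custom types such as cmp.com/hdw are visible to the calculation.
  Resource clusterMax =
      ResourceUtils.fetchMaximumAllocationFromConfig(new YarnConfiguration(conf));
  nodeTracker.setConfiguredMaxAllocation(clusterMax);  // assumed setter on the node tracker
}
{code}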
[jira] [Updated] (YARN-9194) Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and NullPointerException happens in RM while shutdown a NM
[ https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-9194:
-----------------------------
    Fix Version/s: 3.1.3
                   3.2.1
                   3.3.0

> Invalid event: REGISTERED and LAUNCH_FAILED at FAILED, and NullPointerException happens in RM while shutdown a NM
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9194
>                 URL: https://issues.apache.org/jira/browse/YARN-9194
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Critical
>             Fix For: 3.3.0, 3.2.1, 3.1.3
>
>         Attachments: YARN-9194_1.patch, YARN-9194_2.patch, YARN-9194_3.patch, YARN-9194_4.patch, YARN-9194_5.patch, YARN-9194_6.patch, hadoop-hires-resourcemanager-hadoop11.log
>
> While the attempt is failing, a REGISTERED event arrives, hence the InvalidStateTransitionException happens.
> {code:java}
> 2019-01-13 00:41:57,127 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: App attempt: appattempt_1547311267249_0001_02 can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: REGISTERED at FAILED
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> {code}
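The usual remedy for this class of error is to teach the attempt state machine to ignore late events once the attempt is already FAILED. Below is a hedged sketch of what such transitions look like; it is a fragment of a StateMachineFactory builder chain, the event and state names are taken from the log and issue title above, and the exact transitions in the committed patch may differ.

{code:java}
// Illustrative fragment for RMAppAttemptImpl's transition table, not the actual patch:
// stay in FAILED and ignore events that can legitimately arrive after the failure.
.addTransition(RMAppAttemptState.FAILED, RMAppAttemptState.FAILED,
    RMAppAttemptEventType.REGISTERED)
.addTransition(RMAppAttemptState.FAILED, RMAppAttemptState.FAILED,
    RMAppAttemptEventType.LAUNCH_FAILED)
{code}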
[jira] [Commented] (YARN-9206) RMServerUtils does not count SHUTDOWN as an accepted state
[ https://issues.apache.org/jira/browse/YARN-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745628#comment-16745628 ]

Wangda Tan commented on YARN-9206:
----------------------------------
[~kshukla], could you please add a method to NodeState such as "isInactiveState" for easier maintenance?

+ [~suma.shivaprasad], could you please help to get this patch committed?

> RMServerUtils does not count SHUTDOWN as an accepted state
> ------------------------------------------------------------
>
>                 Key: YARN-9206
>                 URL: https://issues.apache.org/jira/browse/YARN-9206
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.0.3
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>            Priority: Major
>         Attachments: YARN-9206.001.patch
>
> {code}
>     if (acceptedStates.contains(NodeState.DECOMMISSIONED) ||
>         acceptedStates.contains(NodeState.LOST) ||
>         acceptedStates.contains(NodeState.REBOOTED)) {
>       for (RMNode rmNode : context.getInactiveRMNodes().values()) {
>         if ((rmNode != null) && acceptedStates.contains(rmNode.getState())) {
>           results.add(rmNode);
>         }
>       }
>     }
>     return results;
>   }
> {code}
> This should include the SHUTDOWN state, as such nodes are inactive too. This method is used for node reports and similar, so it might be useful to account for them as well.
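A hedged sketch of the reviewer's suggestion above (the method name comes from the comment; the body is an assumption about what it would contain, not necessarily the committed patch): centralizing the inactive check on NodeState keeps callers such as the snippet above from forgetting SHUTDOWN.

{code:java}
// Illustrative addition to the NodeState enum; not necessarily the committed change.
public boolean isInactiveState() {
  return this == DECOMMISSIONED || this == LOST
      || this == REBOOTED || this == SHUTDOWN;
}
{code}

The caller in RMServerUtils could then guard the inactive-node loop with a single check such as acceptedStates.stream().anyMatch(NodeState::isInactiveState) instead of enumerating the states by hand.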
[jira] [Commented] (YARN-9204) yarn.scheduler.capacity..accessible-node-labels..capacity can not support absolute resource value
[ https://issues.apache.org/jira/browse/YARN-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745625#comment-16745625 ]

Wangda Tan commented on YARN-9204:
----------------------------------
[~yangjiandan], thanks. Could you please provide a UT to prevent this issue from happening again in the future?
[jira] [Commented] (YARN-9195) RM Queue's pending container number might get decreased unexpectedly or even become negative once RM failover
[ https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745611#comment-16745611 ]

Wangda Tan commented on YARN-9195:
----------------------------------
[~ssy], thanks for filing the issue and providing the analysis. We definitely want to fix both sides: the server side should reject negative requests, and the client side should avoid sending such requests. If you can provide a patch, we can get it committed.

cc: [~cheersyang], [~sunil.gov...@gmail.com]

> RM Queue's pending container number might get decreased unexpectedly or even become negative once RM failover
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9195
>                 URL: https://issues.apache.org/jira/browse/YARN-9195
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 3.1.0
>            Reporter: Shengyang Sha
>            Priority: Critical
>         Attachments: cases_to_recreate_negative_pending_requests_scenario.diff
>
> Hi, all:
> Previously we encountered a serious problem in the ResourceManager: we found that the pending container number of one RM queue became negative after RM failed over. Since queues in the RM are managed in a hierarchical structure, the root queue's pending containers became negative in the end, and the scheduling of the whole cluster was affected.
> Both our RM server and the AMRM client in our application are based on YARN 3.1, and we use the AMRMClientAsync#addSchedulingRequests() method in our application to request resources from the RM.
> After investigation, we found that the direct cause was that the numAllocations of some AMs' requests became negative after RM failed over. There are at least three necessary conditions:
> (1) Use schedulingRequests in the AMRM client, and the application sets numAllocations to zero for a schedulingRequest. In our batch-job scenario, the numAllocations of a schedulingRequest can drop to zero because, theoretically, we can run a full batch job using only one container.
> (2) RM fails over.
> (3) Before the AM re-registers itself to the RM after the RM restarts, the RM has already recovered some of the application's previously assigned containers.
> Here are some more details about the implementation:
> (1) After the RM recovers, it sends all alive containers to the AM once the AM re-registers itself, through RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
> (2) During registerApplicationMaster, AMRMClientImpl calls removeFromOutstandingSchedulingRequests once the AM gets the containers from previous attempts, without checking whether these containers were already assigned before. As a consequence, its outstanding requests might be decreased unexpectedly, even if they do not become negative.
> (3) There is no sanity check in the RM to validate requests from AMs.
> To better illustrate this case, I've written a test case based on the latest hadoop trunk, posted in the attachment. You may try the cases testAMRMClientWithNegativePendingRequestsOnRMRestart and testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart.
> To solve this issue, I propose to filter allocated containers before removeFromOutstandingSchedulingRequests in AMRMClientImpl during registerApplicationMaster; some sanity checks are also needed to prevent things from getting worse.
> More comments and suggestions are welcome.
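A hedged sketch of the two-sided fix proposed above (variable names such as previouslyKnownContainers are placeholders, and this is a fragment rather than the real AMRMClientImpl internals): on the client, only containers that were not already accounted for should decrement the outstanding scheduling requests; on the server, an incoming numAllocations should never be allowed to push pending counts below zero.

{code:java}
// Illustrative only -- placeholder names, not the real AMRMClientImpl code.

// Client side (registerApplicationMaster): skip containers the client already knew about.
List<Container> newlyReported = new ArrayList<>();
for (Container c : response.getContainersFromPreviousAttempts()) {
  if (!previouslyKnownContainers.contains(c.getId())) {
    newlyReported.add(c);
  }
}
removeFromOutstandingSchedulingRequests(newlyReported);

// Server side: sanity-check scheduling requests so a misbehaving client cannot
// drive a queue's pending container count negative.
int numAllocations = Math.max(0, schedulingRequest.getResourceSizing().getNumAllocations());
{code}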
[jira] [Commented] (YARN-9074) Docker container rm command should be executed after stop
[ https://issues.apache.org/jira/browse/YARN-9074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745607#comment-16745607 ]

Wangda Tan commented on YARN-9074:
----------------------------------
[~shaneku...@gmail.com], could you get the patch committed if you're fine with it?

> Docker container rm command should be executed after stop
> ------------------------------------------------------------
>
>                 Key: YARN-9074
>                 URL: https://issues.apache.org/jira/browse/YARN-9074
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhaohui Xin
>            Assignee: Zhaohui Xin
>            Priority: Major
>         Attachments: YARN-9074.001.patch, image-2018-12-01-11-36-12-448.png, image-2018-12-01-11-38-18-191.png
>
> {code:java}
> @Override
> public void transition(ContainerImpl container, ContainerEvent event) {
>   container.setIsReInitializing(false);
>   // Set exit code to 0 on success
>   container.exitCode = 0;
>   // TODO: Add containerWorkDir to the deletion service.
>   if (DockerLinuxContainerRuntime.isDockerContainerRequested(
>       container.daemonConf,
>       container.getLaunchContext().getEnvironment())) {
>     removeDockerContainer(container);
>   }
>   if (clCleanupRequired) {
>     container.dispatcher.getEventHandler().handle(
>         new ContainersLauncherEvent(container,
>             ContainersLauncherEventType.CLEANUP_CONTAINER));
>   }
>   container.cleanup();
> }{code}
> Currently, when a container finishes, the NM first executes "_docker rm xxx_" to remove it, and this work is placed in the DeletionService (see YARN-5366).
> Next, the NM executes the "_docker stop_" and "_docker kill_" commands. These two commands are wrapped up in the ContainerCleanup thread and executed by ContainersLauncher (see YARN-7644).
> The above causes the container's cleanup to be split across two threads. I think we should refactor this code so that the whole docker container killing process is placed in the ContainerCleanup thread, and "_docker rm_" is executed last.
[jira] [Commented] (YARN-9197) NPE in service AM when failed to launch container
[ https://issues.apache.org/jira/browse/YARN-9197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745606#comment-16745606 ]

Wangda Tan commented on YARN-9197:
----------------------------------
Thanks [~kyungwan nam] for filing this and working on the patch.

+ [~billie.rinaldi], [~eyang], could you help review the patch?

I haven't dug into the details of the patch yet: when will the state of the ComponentInstanceEvent be null and trigger the issue? Should we make the field name more specific / add more comments for easier maintenance?

> NPE in service AM when failed to launch container
> ----------------------------------------------------
>
>                 Key: YARN-9197
>                 URL: https://issues.apache.org/jira/browse/YARN-9197
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn-native-services
>            Reporter: kyungwan nam
>            Assignee: kyungwan nam
>            Priority: Major
>         Attachments: YARN-9197.001.patch
>
> I've met an NPE in the service AM as follows.
> {code}
> 2019-01-02 22:35:47,582 [Component dispatcher] INFO component.Component - [COMPONENT regionserver]: Assigned container_e15_1542704944343_0001_01_01 to component instance regionserver-1 and launch on host test2.com:45454
> 2019-01-02 22:35:47,588 [pool-6-thread-5] WARN ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (token for yarn-ats: HDFS_DELEGATION_TOKEN owner=yarn-ats, renewer=yarn, realUser=rm/test1.nfra...@example.com, issueDate=1542704946397, maxDate=1543309746397, sequenceNumber=97, masterKeyId=90) can't be found in cache
> 2019-01-02 22:35:47,592 [pool-6-thread-5] ERROR containerlaunch.ContainerLaunchService - [COMPINSTANCE regionserver-1 : container_e15_1542704944343_0001_01_01]: Failed to launch container.
> java.io.IOException: Package doesn't exist as a resource: /hdp/apps/3.0.0.0-1634/hbase/hbase.tar.gz
> at org.apache.hadoop.yarn.service.provider.tarball.TarballProviderService.processArtifact(TarballProviderService.java:41)
> at org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:144)
> at org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:107)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2019-01-02 22:35:47,592 [Component dispatcher] INFO component.Component - [COMPONENT regionserver] Requesting for 1 container(s)
> 2019-01-02 22:35:47,592 [Component dispatcher] INFO component.Component - [COMPONENT regionserver] Submitting scheduling request: SchedulingRequestPBImpl{priority=1, allocationReqId=1, executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, allocationTags=[regionserver], resourceSizing=ResourceSizingPBImpl{numAllocations=1, resources= vCores:1>}, placementConstraint=notin,node,regionserver}
> 2019-01-02 22:35:47,593 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE regionserver-1 : container_e15_1542704944343_0001_01_01]: container_e15_1542704944343_0001_01_01 completed. Reinsert back to pending list and requested a new container.
> exitStatus=null, diagnostics=failed before launch > 2019-01-02 22:35:47,593 [Component dispatcher] INFO > instance.ComponentInstance - Publishing component instance status > container_e15_1542704944343_0001_01_01 FAILED > 2019-01-02 22:35:47,593 [Component dispatcher] ERROR > service.ServiceScheduler - [COMPINSTANCE regionserver-1 : > container_e15_1542704944343_0001_01_01]: Error in handling event type STOP > java.lang.NullPointerException > at > org.apache.hadoop.yarn.service.component.instance.ComponentInstance.handleComponentInstanceRelaunch(ComponentInstance.java:342) > at > org.apache.hadoop.yarn.service.component.instance.ComponentInstance$ContainerStoppedTransition.transition(ComponentInstance.java:482) > at > org.apache.hadoop.yarn.service.component.instance.ComponentInstance$ContainerStoppedTransition.transition(ComponentInstance.java:375) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) > at > org.apache.hadoop.yarn.s
[jira] [Commented] (YARN-9205) When using custom resource type, application will fail to run due to the CapacityScheduler throws InvalidResourceRequestException(GREATER_THEN_MAX_ALLOCATION)
[ https://issues.apache.org/jira/browse/YARN-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745591#comment-16745591 ]

Wangda Tan commented on YARN-9205:
----------------------------------
[~tangzhankun],

From the log it looks like this is by design. There's a method, org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker#getMaxAllowedAllocation, which was added to reject resource requests when no node has that amount of resource (for example, if an app asks for 100G of memory but the largest node's resource is 50GB, such a request will be rejected):
{code:java}
please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation={code}
And this behavior is triggered once:
{code:java}
if (forceConfiguredMaxAllocation
    && System.currentTimeMillis() - ResourceManager.getClusterTimeStamp()
        > configuredMaxAllocationWaitTime) {
  forceConfiguredMaxAllocation = false;
}{code}
And the configuredMaxAllocationWaitTime is decided by:
{code:java}
long configuredMaximumAllocationWaitTime =
    conf.getLong(YarnConfiguration.RM_WORK_PRESERVING_RECOVERY_SCHEDULING_WAIT_MS,
        YarnConfiguration.DEFAULT_RM_WORK_PRESERVING_RECOVERY_SCHEDULING_WAIT_MS);{code}
which is 10 seconds by default. That means that if a node with the custom resource is not registered within 10 seconds after RM start, apps will be rejected until an NM with the custom resource registers.

Let me know if this makes sense to you, or whether there is another issue I missed; we can make the error message more specific if needed.
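Putting the two snippets above together, the timing behaviour can be summarised in a short hedged sketch (the second half is an illustrative restatement, not the literal ClusterNodeTracker code): during the configured wait window (yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms, 10 seconds by default) the configured maximum allocation is used, and afterwards the maximum shrinks to what registered NodeManagers actually advertise, so a custom-resource request fails until a node providing that resource has registered.

{code:java}
// Simplified, illustrative restatement of getMaxAllowedAllocation's behaviour.
if (forceConfiguredMaxAllocation
    && System.currentTimeMillis() - ResourceManager.getClusterTimeStamp()
        > configuredMaxAllocationWaitTime) {
  forceConfiguredMaxAllocation = false;
}
Resource maxAllowed = forceConfiguredMaxAllocation
    ? configuredMaxAllocation           // what yarn-site.xml permits
    : largestRegisteredNodeResource;    // what registered NMs actually provide
{code}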
[jira] [Commented] (YARN-9200) Enable resource configuration of queue capacity for different resources independently
[ https://issues.apache.org/jira/browse/YARN-9200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743248#comment-16743248 ]

Wangda Tan commented on YARN-9200:
----------------------------------
[~aihuaxu], adding percentages for different resource types is something we planned to do but didn't start; it makes a lot of sense for heterogeneous cluster and queue requirements. [~rohithsharma] mentioned this before. [~rohithsharma], is there a Jira filed for it already, and what is its current state?

> Enable resource configuration of queue capacity for different resources independently
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-9200
>                 URL: https://issues.apache.org/jira/browse/YARN-9200
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler
>    Affects Versions: 3.1.0
>            Reporter: Aihua Xu
>            Assignee: Aihua Xu
>            Priority: Major
>
> In the capacity scheduler, two kinds of resource allocation are currently supported: 1. percentage allocation for child queues - a child queue gets a defined percentage of the resources for all resource types; 2. absolute values (YARN-5881) - each resource is configured with an absolute value.
> Right now we can't mix these cases together, and it would also be very confusing to mix them in one cluster. The second case is actually targeted more toward cloud environments.
> In a non-cloud environment, the ability to configure each resource independently is also useful, but a percentage is preferable over an absolute value. One thought here is to add a percentage configuration for each resource type on the queue. That would allow us to configure memory-bounded queues or CPU-bounded queues. We can also stay backward compatible: each resource type just gets the same percentage if no percentage is configured for the individual resource type.
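A hedged sketch of the backward-compatibility rule described in the last sentence above (the per-resource-type property key is hypothetical, since this feature is only being proposed here): a per-type percentage is looked up first and the queue-wide percentage is used as the fallback.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Illustrative resolution logic for a proposed, hypothetical per-resource-type key
// such as yarn.scheduler.capacity.root.a.capacity.memory-mb; not existing YARN config.
final class PerResourceCapacitySketch {
  static float resolveCapacity(Configuration conf, String queuePrefix, String resourceType) {
    float queueWide = conf.getFloat(queuePrefix + ".capacity", 0f);
    // Fall back to the queue-wide percentage when no per-type value is configured.
    return conf.getFloat(queuePrefix + ".capacity." + resourceType, queueWide);
  }
}
{code}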
[jira] [Commented] (YARN-9194) Invalid event: REGISTERED at FAILED
[ https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743240#comment-16743240 ]

Wangda Tan commented on YARN-9194:
----------------------------------
The fix LGTM, thanks [~xiaoheipangzi]. Will commit the patch tomorrow if there are no objections.
[jira] [Updated] (YARN-9194) Invalid event: REGISTERED at FAILED, and NullPointerException happens in RM while shutdown a NM
[ https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-9194:
-----------------------------
    Target Version/s: 3.2.1, 3.1.3
            Priority: Critical  (was: Major)
             Summary: Invalid event: REGISTERED at FAILED, and NullPointerException happens in RM while shutdown a NM  (was: Invalid event: REGISTERED at FAILED)
[jira] [Commented] (YARN-9199) Compatible issue: AM throws NoSuchMethodError when 2.7.2 client submits mr job to 3.1.3 RM
[ https://issues.apache.org/jira/browse/YARN-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743226#comment-16743226 ] Wangda Tan commented on YARN-9199: -- [~yangjiandan], Thanks for reporting the issue, but I'm not sure how this exception could happen: if an old client talks to the RM using the old protocol (which doesn't have the trackingUrl field inside AllocateRequest), the RM should just assume it is null. There's one possibility that could cause the issue: do you run the RM with old yarn-commons/yarn-api jars? When doing a rolling upgrade, you should make sure all jars on a node are upgraded; using new RM jars together with old client/api jars will cause issues. And even if you upgraded all jars, you should make sure no old yarn commons/api jars exist in the RM CLASSPATH. (A sketch of the expected null-tolerant handling follows at the end of this message.) > Compatible issue: AM throws NoSuchMethodError when 2.7.2 client submits mr > job to 3.1.3 RM > -- > > Key: YARN-9199 > URL: https://issues.apache.org/jira/browse/YARN-9199 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.2.0, 3.1.2, 3.1.3 >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-9199.001.patch > > > *Background* > Rolling upgrade Yarn from 2.7.2 to 3.1.3-SNAPSHOT; the version of RM is > 3.1.3-SNAPSHOT, but the version of NM is 2.7.2 > AM throws NoSuchMethodError when 2.7.2 client submits mr job to 3.1.3 RM > {code:java} > 2019-01-14 17:20:36,131 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. > org.apache.hadoop.ipc.RemoteException(java.lang.NoSuchMethodError): > org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest.getTrackingUrl()Ljava/lang/String; > at > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:386) > at > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:201) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75) > at > org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:433) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678) > at org.apache.hadoop.ipc.Client.call(Client.java:1503) > at org.apache.hadoop.ipc.Client.call(Client.java:1441) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy80.allocate(Unknown Source) > at > 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:253) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) > at com.sun.proxy.$Proxy81.allocate(Unknown Source) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:204) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:684) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:257) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$A
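To make the compatibility expectation in the comment above concrete: with protobuf-based records, a field that an old 2.7.2 AM never sets is simply absent on the wire, so a 3.x RM reading the request should see a null tracking URL and carry on. The snippet below is only a hedged sketch of that null-tolerant pattern; AllocateRequestView and resolveTrackingUrl are made-up stand-ins, not YARN classes, and the NoSuchMethodError in the stack trace itself points at mixed old and new YARN jars on the RM classpath rather than at this logic.
{code:java}
public class TrackingUrlCompatSketch {
  // Stand-in for the request record; old clients never populate the field, so it reads back as null.
  interface AllocateRequestView {
    String getTrackingUrl();
  }

  // Only override the previously stored URL when a (newer) client actually sent one.
  static String resolveTrackingUrl(AllocateRequestView request, String previousUrl) {
    String url = request.getTrackingUrl();
    return url != null ? url : previousUrl;
  }

  public static void main(String[] args) {
    AllocateRequestView oldAm = () -> null;                    // 2.7.2 AM: field absent
    AllocateRequestView newAm = () -> "http://am-host:8080";   // 3.x AM: field populated
    System.out.println(resolveTrackingUrl(oldAm, "N/A"));      // N/A
    System.out.println(resolveTrackingUrl(newAm, "N/A"));      // http://am-host:8080
  }
}
{code}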
[jira] [Commented] (YARN-9116) Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues
[ https://issues.apache.org/jira/browse/YARN-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737856#comment-16737856 ] Wangda Tan commented on YARN-9116: -- Maybe we can support maximum-allocation-mb/vcores for parent queues first, and once YARN-9161 is done, we can file a separate Jira to support custom resource types in maximum-allocation. > Capacity Scheduler: add the default maximum-allocation-mb and > maximum-allocation-vcores for the queues > -- > > Key: YARN-9116 > URL: https://issues.apache.org/jira/browse/YARN-9116 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 2.7.0 >Reporter: Aihua Xu >Assignee: Aihua Xu >Priority: Major > Attachments: YARN-9116.1.patch > > > YARN-1582 adds the support of maximum-allocation-mb configuration per queue > which is targeted at supporting larger container features on dedicated queues > (larger maximum-allocation-mb/maximum-allocation-vcores for such a queue). > While to achieve larger container configuration, we need to increase the > global maximum-allocation-mb/maximum-allocation-vcores (e.g. 120G/256) and > then override those configurations with desired values on the queues since > queue configuration can't be larger than cluster configuration. There are > many queues in the system and if we forget to configure such values when > adding a new queue, then such a queue gets the default 120G/256 which typically is > not what we want. > We can come up with a queue-default configuration (set to normal queue > configuration like 16G/8), so the leaf queues get such values by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9116) Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues
[ https://issues.apache.org/jira/browse/YARN-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737855#comment-16737855 ] Wangda Tan commented on YARN-9116: -- [~cheersyang], [~aihuaxu], The proposal from Weiwei sounds good to me. We have several other fields in CS that have similar inheritable behavior. To me it is not an incompatible change since we don't expect users to set maximum-allocation on parent queues. And also, the old config only supports setting maximum allocation for memory and vcores; can we add a field to support customized resources like GPU/FPGA? I think we can refer to the Absolute Resource Specification in Capacity Scheduler. See: [https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html], search for "Resource Allocation using Absolute Resources configuration". I think we should either deprecate the old maximum-allocation-mb/vcores or at least make maximum-allocation able to override maximum-allocation-mb/vcores. There's on-going work in YARN-9161 to add support for Absolute Resources for customized resource types to Capacity Scheduler. We should make sure a consistent format is used. Thoughts? (A small sketch of the inheritance idea follows this message.) > Capacity Scheduler: add the default maximum-allocation-mb and > maximum-allocation-vcores for the queues > -- > > Key: YARN-9116 > URL: https://issues.apache.org/jira/browse/YARN-9116 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 2.7.0 >Reporter: Aihua Xu >Assignee: Aihua Xu >Priority: Major > Attachments: YARN-9116.1.patch > > > YARN-1582 adds the support of maximum-allocation-mb configuration per queue > which is targeted at supporting larger container features on dedicated queues > (larger maximum-allocation-mb/maximum-allocation-vcores for such a queue). > While to achieve larger container configuration, we need to increase the > global maximum-allocation-mb/maximum-allocation-vcores (e.g. 120G/256) and > then override those configurations with desired values on the queues since > queue configuration can't be larger than cluster configuration. There are > many queues in the system and if we forget to configure such values when > adding a new queue, then such a queue gets the default 120G/256 which typically is > not what we want. > We can come up with a queue-default configuration (set to normal queue > configuration like 16G/8), so the leaf queues get such values by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
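Since both comments above circle around the same idea, here is a tiny self-contained sketch of the fallback order being proposed: the leaf queue's own setting, then the nearest ancestor that sets one, then the cluster-wide maximum. The property-key shape mirrors capacity-scheduler.xml naming, but the parent-queue inheritance walk is only the proposal under discussion, not existing CapacityScheduler behavior, and all values are invented.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class QueueMaxAllocationSketch {
  static final String PREFIX = "yarn.scheduler.capacity.";

  // Resolve maximum-allocation-mb for a queue by walking up the queue path.
  static long maxAllocationMb(Map<String, Long> conf, String queuePath, long clusterMax) {
    String path = queuePath;
    while (true) {
      Long v = conf.get(PREFIX + path + ".maximum-allocation-mb");
      if (v != null) {
        return v;                        // nearest queue or ancestor that sets it wins
      }
      int dot = path.lastIndexOf('.');
      if (dot < 0) {
        return clusterMax;               // nothing configured anywhere: cluster-wide maximum
      }
      path = path.substring(0, dot);     // walk up: root.a.b -> root.a -> root
    }
  }

  public static void main(String[] args) {
    Map<String, Long> conf = new HashMap<>();
    conf.put(PREFIX + "root.maximum-allocation-mb", 16384L);         // queue default (16G)
    conf.put(PREFIX + "root.bigmem.maximum-allocation-mb", 122880L); // dedicated large-container queue
    System.out.println(maxAllocationMb(conf, "root.etl.daily", 245760L)); // 16384 (inherited from root)
    System.out.println(maxAllocationMb(conf, "root.bigmem", 245760L));    // 122880 (its own setting)
  }
}
{code}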
[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA
[ https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-9161: - Component/s: capacity scheduler > Absolute resources of capacity scheduler doesn't support GPU and FPGA > - > > Key: YARN-9161 > URL: https://issues.apache.org/jira/browse/YARN-9161 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Major > Attachments: YARN-9161.001.patch, YARN-9161.002.patch > > > As the enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two > elements: memory and vcores, which would filter out absolute resources > configuration of gpu and fpga in > AbstractCSQueue.updateConfigurableResourceRequirement. > This issue would cause gpu and fpga can't be allocated correctly -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736803#comment-16736803 ] Wangda Tan commented on YARN-6695: -- Thanks [~eyang]/[~rohithsharma], I'm going to update target version to next release and unblock 3.1.2 and 3.2.0. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-6695: - Target Version/s: 3.2.1, 3.1.3 (was: 3.2.0, 3.1.2) > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-6695: - Target Version/s: 3.2.0, 3.1.2 > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8822) Nvidia-docker v2 support for YARN GPU feature
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736322#comment-16736322 ] Wangda Tan commented on YARN-8822: -- Thanks [~Charo Zhang] for the patch and [~tangzhankun] for reviews/tests. I just committed the patch to trunk, branch-3.2, branch-3.1, branch-3.1.2; There're some conflicts in branch-3.2.0, I uploaded patch. [~sunilg] could u help to get it committed if the Jenkins report green? > Nvidia-docker v2 support for YARN GPU feature > - > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Critical > Labels: Docker > Fix For: 3.1.2, 3.3.0, 3.2.1 > > Attachments: Nv2-1.png, Nv2-2.png, Nv2-3.png, > YARN-8822-branch-3.1.001.patch, YARN-8822-branch-3.1.1.001.patch, > YARN-8822-branch-3.2.0-001.patch, YARN-8822-branch-3.2.001.patch, > YARN-8822.001.patch, YARN-8822.002.patch, YARN-8822.003.patch, > YARN-8822.004.patch, YARN-8822.005.patch > > > To run a GPU container with Docker, we have nvdia-docker v1 support already > but is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvdia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8822) Nvidia-docker v2 support for YARN GPU feature
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8822: - Target Version/s: 3.2.0 (was: 3.1.2, 3.3.0, 3.2.1) > Nvidia-docker v2 support for YARN GPU feature > - > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Critical > Labels: Docker > Fix For: 3.1.2, 3.3.0, 3.2.1 > > Attachments: Nv2-1.png, Nv2-2.png, Nv2-3.png, > YARN-8822-branch-3.1.001.patch, YARN-8822-branch-3.1.1.001.patch, > YARN-8822-branch-3.2.0-001.patch, YARN-8822-branch-3.2.001.patch, > YARN-8822.001.patch, YARN-8822.002.patch, YARN-8822.003.patch, > YARN-8822.004.patch, YARN-8822.005.patch > > > To run a GPU container with Docker, we have nvdia-docker v1 support already > but is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvdia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8822) Nvidia-docker v2 support for YARN GPU feature
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8822: - Fix Version/s: 3.2.1 3.3.0 3.1.2 > Nvidia-docker v2 support for YARN GPU feature > - > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Critical > Labels: Docker > Fix For: 3.1.2, 3.3.0, 3.2.1 > > Attachments: Nv2-1.png, Nv2-2.png, Nv2-3.png, > YARN-8822-branch-3.1.001.patch, YARN-8822-branch-3.1.1.001.patch, > YARN-8822-branch-3.2.0-001.patch, YARN-8822-branch-3.2.001.patch, > YARN-8822.001.patch, YARN-8822.002.patch, YARN-8822.003.patch, > YARN-8822.004.patch, YARN-8822.005.patch > > > To run a GPU container with Docker, we have nvdia-docker v1 support already > but is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvdia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8822) Nvidia-docker v2 support for YARN GPU feature
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8822: - Attachment: YARN-8822-branch-3.2.0-001.patch > Nvidia-docker v2 support for YARN GPU feature > - > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Critical > Labels: Docker > Attachments: Nv2-1.png, Nv2-2.png, Nv2-3.png, > YARN-8822-branch-3.1.001.patch, YARN-8822-branch-3.1.1.001.patch, > YARN-8822-branch-3.2.0-001.patch, YARN-8822-branch-3.2.001.patch, > YARN-8822.001.patch, YARN-8822.002.patch, YARN-8822.003.patch, > YARN-8822.004.patch, YARN-8822.005.patch > > > To run a GPU container with Docker, we have nvdia-docker v1 support already > but is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvdia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8822) Nvidia-docker v2 support for YARN GPU feature
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8822: - Summary: Nvidia-docker v2 support for YARN GPU feature (was: Nvidia-docker v2 support) > Nvidia-docker v2 support for YARN GPU feature > - > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Critical > Labels: Docker > Attachments: Nv2-1.png, Nv2-2.png, Nv2-3.png, > YARN-8822-branch-3.1.001.patch, YARN-8822-branch-3.1.1.001.patch, > YARN-8822-branch-3.2.001.patch, YARN-8822.001.patch, YARN-8822.002.patch, > YARN-8822.003.patch, YARN-8822.004.patch, YARN-8822.005.patch > > > To run a GPU container with Docker, we have nvdia-docker v1 support already > but is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvdia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8822) Nvidia-docker v2 support
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736107#comment-16736107 ] Wangda Tan commented on YARN-8822: -- Thanks [~Charo Zhang], Latest patches LGTM, will get them committed. [~sunil.gov...@gmail.com], is it possible to pick it up in the 3.2.0 release if you haven't started RC1? I plan to include it in 3.1.2. > Nvidia-docker v2 support > > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Critical > Labels: Docker > Attachments: Nv2-1.png, Nv2-2.png, Nv2-3.png, > YARN-8822-branch-3.1.001.patch, YARN-8822-branch-3.1.1.001.patch, > YARN-8822-branch-3.2.001.patch, YARN-8822.001.patch, YARN-8822.002.patch, > YARN-8822.003.patch, YARN-8822.004.patch, YARN-8822.005.patch > > > To run a GPU container with Docker, we have nvdia-docker v1 support already > but is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvdia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9053) Support set environment variables for Docker Containers In nonEntryPoint mode
[ https://issues.apache.org/jira/browse/YARN-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-9053: - Target Version/s: 3.1.3 (was: 3.1.2) > Support set environment variables for Docker Containers In nonEntryPoint mode > - > > Key: YARN-9053 > URL: https://issues.apache.org/jira/browse/YARN-9053 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 3.1.1 >Reporter: Charo Zhang >Priority: Major > Labels: Docker > Attachments: YARN-9053.patch > > > In yarn 3.1.1, users can only set environment variables with "-shell_env" in > ENTRYPOINT mode, and variables must be registered in > yarn.nodemanager.env-whitelist. > But in nonEntryPoint mode, we should allow users to set environment variables > like "-e KEY=VAULE" in docker run command, too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8822) Nvidia-docker v2 support
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8822: - Target Version/s: 3.1.2, 3.3.0, 3.2.1 (was: 3.3.0, 3.2.1, 3.1.3) > Nvidia-docker v2 support > > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Critical > Labels: Docker > Attachments: Nv2-1.png, Nv2-2.png, Nv2-3.png, > YARN-8822-branch-3.1.001.patch, YARN-8822-branch-3.1.1.001.patch, > YARN-8822-branch-3.2.001.patch, YARN-8822.001.patch, YARN-8822.002.patch, > YARN-8822.003.patch, YARN-8822.004.patch, YARN-8822.005.patch > > > To run a GPU container with Docker, we have nvdia-docker v1 support already > but is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvdia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9160) [Submarine] Document "PYTHONPATH" environment variable setting when using -localization options
[ https://issues.apache.org/jira/browse/YARN-9160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735285#comment-16735285 ] Wangda Tan commented on YARN-9160: -- Committed to trunk, thanks [~tangzhankun]. > [Submarine] Document "PYTHONPATH" environment variable setting when using > -localization options > --- > > Key: YARN-9160 > URL: https://issues.apache.org/jira/browse/YARN-9160 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9160-trunk.001.patch > > > An infra platform might want to provide the user a Zepplin notebook and > execute user's job with user's command input like "python entry_point.py > ...". This is better for the end user because he/she feels that the > "entry_point.py" seems in the local workbench. > This may translate to below submarine command in the platform when submitting > the job: > > {code:java} > ... job run > --localization entry_script.py:./ > --localization depedency_script1.py:./ > --localization depedency_script2.py:./ > --worker_launch_cmd "python entry_point.py .." > {code} > Or > > {code:java} > ... job run > --localization entry_script.py:./ > --localization depedency_scripts_dir:./ > --worker_launch_cmd "python entry_script.py .." > {code} > > When running with the above command, both will fail due to module import > error from the entry_point.py. This is because YARN only creates symbol links > in the container's work dir (the real scripts files are in different cache > folders) and python module import won't know that. > One possible solution is set localization with a directory containing all > scripts and change the worker_launch_cmd to "cd scripts_dir && python > entry_script.py". But this solution makes the user experience bad which feels > not in a local workbench. > And another solution is using "PYTHONPATH" environment variable. This > solution can keep the user experience good and won't need YARN localization > internal changes. > {code:java} > ... job run > # the entry point > --localization entry_script.py:/entry_script.py > # the dependency Python scripts of the entry point > --localization depedency_scripts_dir:/dependency_scripts_dir > # the PYTHONPATH env to make dependency available to entry script > --env PYTHONPATH="/dependency_scripts_dir" > --worker_launch_cmd "python /entry_script.py ..."{code} > And we should document this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9141) [submarine] JobStatus outputs with system UTC clock, not local clock
[ https://issues.apache.org/jira/browse/YARN-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-9141: - Fix Version/s: 3.3.0 > [submarine] JobStatus outputs with system UTC clock, not local clock > > > Key: YARN-9141 > URL: https://issues.apache.org/jira/browse/YARN-9141 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9141.001.patch > > > The current time is Mon Dec 17 12:26:31 CST 2018. > But submarine job status output is like this: > Job Name=distributed-tf-gpu-ml4, status=RUNNING time=2018-12-17T04:29:20.873Z > Components: > -- > The time is not local time. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9141) [submarine] JobStatus outputs with system UTC clock, not local clock
[ https://issues.apache.org/jira/browse/YARN-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-9141: - Fix Version/s: 3.2.1 > [submarine] JobStatus outputs with system UTC clock, not local clock > > > Key: YARN-9141 > URL: https://issues.apache.org/jira/browse/YARN-9141 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9141.001.patch > > > The current time is Mon Dec 17 12:26:31 CST 2018. > But submarine job status output is like this: > Job Name=distributed-tf-gpu-ml4, status=RUNNING time=2018-12-17T04:29:20.873Z > Components: > -- > The time is not local time. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9160) [Submarine] Document "PYTHONPATH" environment variable setting when using -localization options
[ https://issues.apache.org/jira/browse/YARN-9160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735285#comment-16735285 ] Wangda Tan edited comment on YARN-9160 at 1/6/19 7:17 PM: -- Committed to trunk and branch-3.2, thanks [~tangzhankun]. was (Author: leftnoteasy): Committed to trunk, thanks [~tangzhankun]. > [Submarine] Document "PYTHONPATH" environment variable setting when using > -localization options > --- > > Key: YARN-9160 > URL: https://issues.apache.org/jira/browse/YARN-9160 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9160-trunk.001.patch > > > An infra platform might want to provide the user a Zepplin notebook and > execute user's job with user's command input like "python entry_point.py > ...". This is better for the end user because he/she feels that the > "entry_point.py" seems in the local workbench. > This may translate to below submarine command in the platform when submitting > the job: > > {code:java} > ... job run > --localization entry_script.py:./ > --localization depedency_script1.py:./ > --localization depedency_script2.py:./ > --worker_launch_cmd "python entry_point.py .." > {code} > Or > > {code:java} > ... job run > --localization entry_script.py:./ > --localization depedency_scripts_dir:./ > --worker_launch_cmd "python entry_script.py .." > {code} > > When running with the above command, both will fail due to module import > error from the entry_point.py. This is because YARN only creates symbol links > in the container's work dir (the real scripts files are in different cache > folders) and python module import won't know that. > One possible solution is set localization with a directory containing all > scripts and change the worker_launch_cmd to "cd scripts_dir && python > entry_script.py". But this solution makes the user experience bad which feels > not in a local workbench. > And another solution is using "PYTHONPATH" environment variable. This > solution can keep the user experience good and won't need YARN localization > internal changes. > {code:java} > ... job run > # the entry point > --localization entry_script.py:/entry_script.py > # the dependency Python scripts of the entry point > --localization depedency_scripts_dir:/dependency_scripts_dir > # the PYTHONPATH env to make dependency available to entry script > --env PYTHONPATH="/dependency_scripts_dir" > --worker_launch_cmd "python /entry_script.py ..."{code} > And we should document this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9160) [Submarine] Document "PYTHONPATH" environment variable setting when using -localization options
[ https://issues.apache.org/jira/browse/YARN-9160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-9160: - Fix Version/s: 3.2.1 > [Submarine] Document "PYTHONPATH" environment variable setting when using > -localization options > --- > > Key: YARN-9160 > URL: https://issues.apache.org/jira/browse/YARN-9160 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9160-trunk.001.patch > > > An infra platform might want to provide the user a Zepplin notebook and > execute user's job with user's command input like "python entry_point.py > ...". This is better for the end user because he/she feels that the > "entry_point.py" seems in the local workbench. > This may translate to below submarine command in the platform when submitting > the job: > > {code:java} > ... job run > --localization entry_script.py:./ > --localization depedency_script1.py:./ > --localization depedency_script2.py:./ > --worker_launch_cmd "python entry_point.py .." > {code} > Or > > {code:java} > ... job run > --localization entry_script.py:./ > --localization depedency_scripts_dir:./ > --worker_launch_cmd "python entry_script.py .." > {code} > > When running with the above command, both will fail due to module import > error from the entry_point.py. This is because YARN only creates symbol links > in the container's work dir (the real scripts files are in different cache > folders) and python module import won't know that. > One possible solution is set localization with a directory containing all > scripts and change the worker_launch_cmd to "cd scripts_dir && python > entry_script.py". But this solution makes the user experience bad which feels > not in a local workbench. > And another solution is using "PYTHONPATH" environment variable. This > solution can keep the user experience good and won't need YARN localization > internal changes. > {code:java} > ... job run > # the entry point > --localization entry_script.py:/entry_script.py > # the dependency Python scripts of the entry point > --localization depedency_scripts_dir:/dependency_scripts_dir > # the PYTHONPATH env to make dependency available to entry script > --env PYTHONPATH="/dependency_scripts_dir" > --worker_launch_cmd "python /entry_script.py ..."{code} > And we should document this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9141) [submarine] JobStatus outputs with system UTC clock, not local clock
[ https://issues.apache.org/jira/browse/YARN-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735286#comment-16735286 ] Wangda Tan edited comment on YARN-9141 at 1/6/19 7:17 PM: -- Thanks [~yuan_zac], committed to trunk and branch-3.2 was (Author: leftnoteasy): Thanks [~yuan_zac], committed to trunk. > [submarine] JobStatus outputs with system UTC clock, not local clock > > > Key: YARN-9141 > URL: https://issues.apache.org/jira/browse/YARN-9141 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9141.001.patch > > > The current time is Mon Dec 17 12:26:31 CST 2018. > But submarine job status output is like this: > Job Name=distributed-tf-gpu-ml4, status=RUNNING time=2018-12-17T04:29:20.873Z > Components: > -- > The time is not local time. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9160) [Submarine] Document "PYTHONPATH" environment variable setting when using -localization options
[ https://issues.apache.org/jira/browse/YARN-9160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-9160: - Fix Version/s: 3.3.0 > [Submarine] Document "PYTHONPATH" environment variable setting when using > -localization options > --- > > Key: YARN-9160 > URL: https://issues.apache.org/jira/browse/YARN-9160 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9160-trunk.001.patch > > > An infra platform might want to provide the user a Zepplin notebook and > execute user's job with user's command input like "python entry_point.py > ...". This is better for the end user because he/she feels that the > "entry_point.py" seems in the local workbench. > This may translate to below submarine command in the platform when submitting > the job: > > {code:java} > ... job run > --localization entry_script.py:./ > --localization depedency_script1.py:./ > --localization depedency_script2.py:./ > --worker_launch_cmd "python entry_point.py .." > {code} > Or > > {code:java} > ... job run > --localization entry_script.py:./ > --localization depedency_scripts_dir:./ > --worker_launch_cmd "python entry_script.py .." > {code} > > When running with the above command, both will fail due to module import > error from the entry_point.py. This is because YARN only creates symbol links > in the container's work dir (the real scripts files are in different cache > folders) and python module import won't know that. > One possible solution is set localization with a directory containing all > scripts and change the worker_launch_cmd to "cd scripts_dir && python > entry_script.py". But this solution makes the user experience bad which feels > not in a local workbench. > And another solution is using "PYTHONPATH" environment variable. This > solution can keep the user experience good and won't need YARN localization > internal changes. > {code:java} > ... job run > # the entry point > --localization entry_script.py:/entry_script.py > # the dependency Python scripts of the entry point > --localization depedency_scripts_dir:/dependency_scripts_dir > # the PYTHONPATH env to make dependency available to entry script > --env PYTHONPATH="/dependency_scripts_dir" > --worker_launch_cmd "python /entry_script.py ..."{code} > And we should document this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9141) [submarine] JobStatus outputs with system UTC clock, not local clock
[ https://issues.apache.org/jira/browse/YARN-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735286#comment-16735286 ] Wangda Tan commented on YARN-9141: -- Thanks [~yuan_zac], committed to trunk. > [submarine] JobStatus outputs with system UTC clock, not local clock > > > Key: YARN-9141 > URL: https://issues.apache.org/jira/browse/YARN-9141 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9141.001.patch > > > The current time is Mon Dec 17 12:26:31 CST 2018. > But submarine job status output is like this: > Job Name=distributed-tf-gpu-ml4, status=RUNNING time=2018-12-17T04:29:20.873Z > Components: > -- > The time is not local time. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8822) Nvidia-docker v2 support
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8822: - Target Version/s: 3.2.1, 3.1.3 (was: 3.1.3) > Nvidia-docker v2 support > > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Critical > Labels: Docker > Attachments: Nv2-1.png, Nv2-2.png, Nv2-3.png, > YARN-8822-branch-3.1.1.001.patch, YARN-8822.001.patch, YARN-8822.002.patch, > YARN-8822.003.patch, YARN-8822.004.patch > > > To run a GPU container with Docker, we have nvdia-docker v1 support already > but is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvdia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8822) Nvidia-docker v2 support
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735273#comment-16735273 ] Wangda Tan commented on YARN-8822: -- I moved this Jira out of 3.1.2, I will work on RC for 3.1.2 tomorrow. If it can get in branch sooner, we can still get it, otherwise we will move it to next release. > Nvidia-docker v2 support > > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Critical > Labels: Docker > Attachments: Nv2-1.png, Nv2-2.png, Nv2-3.png, > YARN-8822-branch-3.1.1.001.patch, YARN-8822.001.patch, YARN-8822.002.patch, > YARN-8822.003.patch, YARN-8822.004.patch > > > To run a GPU container with Docker, we have nvdia-docker v1 support already > but is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvdia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8822) Nvidia-docker v2 support
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8822: - Target Version/s: 3.3.0, 3.2.1, 3.1.3 (was: 3.2.1, 3.1.3) > Nvidia-docker v2 support > > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Critical > Labels: Docker > Attachments: Nv2-1.png, Nv2-2.png, Nv2-3.png, > YARN-8822-branch-3.1.1.001.patch, YARN-8822.001.patch, YARN-8822.002.patch, > YARN-8822.003.patch, YARN-8822.004.patch > > > To run a GPU container with Docker, we have nvdia-docker v1 support already > but is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvdia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8822) Nvidia-docker v2 support
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8822: - Target Version/s: 3.1.3 (was: 3.1.2) > Nvidia-docker v2 support > > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Critical > Labels: Docker > Attachments: Nv2-1.png, Nv2-2.png, Nv2-3.png, > YARN-8822-branch-3.1.1.001.patch, YARN-8822.001.patch, YARN-8822.002.patch, > YARN-8822.003.patch, YARN-8822.004.patch > > > To run a GPU container with Docker, we have nvdia-docker v1 support already > but is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvdia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8822) Nvidia-docker v2 support
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735272#comment-16735272 ] Wangda Tan commented on YARN-8822: -- [~Charo Zhang], the latest patch looks good; apologies for my late responses. Mind checking the latest comment from [~tangzhankun] and making sure the patch applies against trunk? And could you please update the branch-3.1/branch-3.2 (if different) patches to allow us to backport to older releases? > Nvidia-docker v2 support > > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Critical > Labels: Docker > Attachments: Nv2-1.png, Nv2-2.png, Nv2-3.png, > YARN-8822-branch-3.1.1.001.patch, YARN-8822.001.patch, YARN-8822.002.patch, > YARN-8822.003.patch, YARN-8822.004.patch > > > To run a GPU container with Docker, we have nvdia-docker v1 support already > but is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvdia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9141) [submarine] JobStatus outputs with system UTC clock, not local clock
[ https://issues.apache.org/jira/browse/YARN-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735241#comment-16735241 ] Wangda Tan commented on YARN-9141: -- straightforward fix, +1, will commit later. > [submarine] JobStatus outputs with system UTC clock, not local clock > > > Key: YARN-9141 > URL: https://issues.apache.org/jira/browse/YARN-9141 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Major > Attachments: YARN-9141.001.patch > > > The current time is Mon Dec 17 12:26:31 CST 2018. > But submarine job status output is like this: > Job Name=distributed-tf-gpu-ml4, status=RUNNING time=2018-12-17T04:29:20.873Z > Components: > -- > The time is not local time. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
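For context on the fix being reviewed above: the status line prints a raw UTC instant, and the usual remedy is to render the same moment in the JVM's default time zone before printing. The snippet below is only a generic java.time sketch of that conversion, not the actual submarine patch.
{code:java}
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class LocalClockSketch {
  public static void main(String[] args) {
    Instant now = Instant.now();
    // Before: the raw ISO instant in UTC, e.g. 2018-12-17T04:29:20.873Z
    System.out.println("UTC instant: " + now);
    // After: the same moment rendered with the local offset, e.g. 2018-12-17T12:29:20.873+08:00
    System.out.println("local time:  "
        + now.atZone(ZoneId.systemDefault()).format(DateTimeFormatter.ISO_OFFSET_DATE_TIME));
  }
}
{code}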
[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735239#comment-16735239 ] Wangda Tan commented on YARN-8489: -- [~yuan_zac], Thanks for working on this ticket. 1) terminateServiceIfServiceStateComponentsFinished => terminateServiceIfDominantComponentFinished 2) Both terminateServiceIfServiceStateComponentsFinished and terminateServiceIfAllComponentsFinished can have private visibility. 3) Regarding the changes to TimelineServiceV2Publisher: is this a specific issue related to this change? If it is a corner case we need to take care of, I suggest filing a separate JIRA and adding a unit test. > Need to support "dominant" component concept inside YARN service > > > Key: YARN-8489 > URL: https://issues.apache.org/jira/browse/YARN-8489 > Project: Hadoop YARN > Issue Type: Task > Components: yarn-native-services >Reporter: Wangda Tan >Assignee: Zac Zhou >Priority: Major > Attachments: YARN-8489.001.patch, YARN-8489.002.patch, > YARN-8489.003.patch > > > Existing YARN service supports a termination policy per restart > policy. For example, ALWAYS means the service will not be terminated, and NEVER > means that once all components have terminated, the service will be terminated. > The name "dominant" might not be the most appropriate, we can figure out better > names. But simply put, it means a dominant component whose final state will > determine the job's final state regardless of other components. > Use cases: > 1) A Tensorflow job has master/worker/services/tensorboard. Once the master goes to a > final state, no matter whether it succeeded or failed, we should terminate > ps/tensorboard/workers and mark the job succeeded/failed. > 2) Not sure if it is a real-world use case: a service which has multiple > components, some of which are not restartable. For such services, if such a > component fails, we should mark the whole service as failed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9160) [Submarine] Document "PYTHONPATH" environment variable setting when using -localization options
[ https://issues.apache.org/jira/browse/YARN-9160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735246#comment-16735246 ] Wangda Tan commented on YARN-9160: -- Straightforward fix, +1. Thanks [~tangzhankun]. > [Submarine] Document "PYTHONPATH" environment variable setting when using > -localization options > --- > > Key: YARN-9160 > URL: https://issues.apache.org/jira/browse/YARN-9160 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-9160-trunk.001.patch > > > An infra platform might want to provide the user a Zepplin notebook and > execute user's job with user's command input like "python entry_point.py > ...". This is better for the end user because he/she feels that the > "entry_point.py" seems in the local workbench. > This may translate to below submarine command in the platform when submitting > the job: > > {code:java} > ... job run > --localization entry_script.py:./ > --localization depedency_script1.py:./ > --localization depedency_script2.py:./ > --worker_launch_cmd "python entry_point.py .." > {code} > Or > > {code:java} > ... job run > --localization entry_script.py:./ > --localization depedency_scripts_dir:./ > --worker_launch_cmd "python entry_script.py .." > {code} > > When running with the above command, both will fail due to module import > error from the entry_point.py. This is because YARN only creates symbol links > in the container's work dir (the real scripts files are in different cache > folders) and python module import won't know that. > One possible solution is set localization with a directory containing all > scripts and change the worker_launch_cmd to "cd scripts_dir && python > entry_script.py". But this solution makes the user experience bad which feels > not in a local workbench. > And another solution is using "PYTHONPATH" environment variable. This > solution can keep the user experience good and won't need YARN localization > internal changes. > {code:java} > ... job run > # the entry point > --localization entry_script.py:/entry_script.py > # the dependency Python scripts of the entry point > --localization depedency_scripts_dir:/dependency_scripts_dir > # the PYTHONPATH env to make dependency available to entry script > --env PYTHONPATH="/dependency_scripts_dir" > --worker_launch_cmd "python /entry_script.py ..."{code} > And we should document this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9155) Can't re-run a submarine job, if the previous job with the same service name has finished
[ https://issues.apache.org/jira/browse/YARN-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735244#comment-16735244 ] Wangda Tan commented on YARN-9155: -- [~yuan_zac], can we just add a Submarine CLI option to remove the old job folder if it exists? By default we can turn it off, and print a log to the Submarine CLI output to hint the user about the option if the job dir exists. > Can't re-run a submarine job, if the previous job with the same service name > has finished > - > > Key: YARN-9155 > URL: https://issues.apache.org/jira/browse/YARN-9155 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Major > > Yarn native service doesn't clean up its HDFS service path when it is > finished. > So if we don't execute the "yarn app -destroy " command before the next run of a > submarine job, we would get the following exception: > 2018-12-24 11:38:02,493 ERROR > org.apache.hadoop.yarn.service.utils.CoreFileSystem: Dir > /user/hadoop//services/distributed-tf-gpu-ml4/${service_name}.json > exists: hdfs://mldev/user/hadoop/** > /services/distributed-tf-gpu-ml4/${service_name}.json 8472 > 2018-12-24 11:38:02,494 ERROR > org.apache.hadoop.yarn.service.webapp.ApiServer: Failed to create service > ${service_name}: {} > java.lang.reflect.UndeclaredThrowableException > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748) > at > org.apache.hadoop.yarn.service.webapp.ApiServer.createService(ApiServer.java:131) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409) > at > com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:179) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119) > at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133) > at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130) > at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130) > at > org.ecl
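A minimal sketch of the option suggested in the comment above, assuming a hypothetical removeExistingJobDir flag and jobDir path rather than the actual Submarine code; it only illustrates the check-then-delete flow against the leftover HDFS service path:
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JobDirCleanupSketch {
  // Hypothetical helper: fail fast or clean up the leftover service dir before re-submitting.
  static void prepareJobDir(Configuration conf, Path jobDir,
      boolean removeExistingJobDir) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    if (fs.exists(jobDir)) {
      if (removeExistingJobDir) {
        // Rough equivalent of destroying the old service's leftover HDFS path.
        fs.delete(jobDir, true);
      } else {
        throw new IOException("Job dir " + jobDir + " already exists; "
            + "pass the (hypothetical) --remove-existing-job-dir option "
            + "or destroy the old service first.");
      }
    }
  }
}
{code}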
[jira] [Commented] (YARN-9144) WebAppProxyServlet can't redirect to ATS V1.5 when a yarn native service app is finished
[ https://issues.apache.org/jira/browse/YARN-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735240#comment-16735240 ] Wangda Tan commented on YARN-9144: -- Thanks [~yuan_zac] for working on this Jira. [~sunil.gov...@gmail.com], [~rohithsharma], could you take a look at this fix? I'm not sure if it causes other issues for non-yarn-service apps. > WebAppProxyServlet can't redirect to ATS V1.5 when a yarn native service app > is finished > > > Key: YARN-9144 > URL: https://issues.apache.org/jira/browse/YARN-9144 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Major > Attachments: YARN-9144.001.patch, YARN-9144.002.patch > > > When a yarn native service app is finished, the RM web UI v1 should redirect the > tracking URL to ATS V1.5 if it's enabled, so that users can check the app > logs like MR jobs. But the tracking URL points to the RM app page. > The root cause is that WebAppProxyServlet may get the app report from the RM, as the RM > caches a small amount of app status. WebAppProxyServlet then redirects > to the RM, not ATS -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8967) Change FairScheduler to use PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733580#comment-16733580 ] Wangda Tan commented on YARN-8967: -- Thanks [~wilfreds], I'm very glad to see that the original YARN-3635 work is finally being consumed by FS, 3 years after that JIRA was committed :). One quick question / comment: the returned ApplicationPlacementContext should be enough for the scheduler to make decisions like dynamically creating queues, etc. Could you please double check and explain why new methods were added to PlacementRule? (cc: [~suma.shivaprasad]) I will leave detailed reviews to others :), and it would be good to get FS folks, who understand the FS changes better, to review as well. (cc: [~haibochen]) > Change FairScheduler to use PlacementRule interface > --- > > Key: YARN-8967 > URL: https://issues.apache.org/jira/browse/YARN-8967 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, fairscheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-8967.001.patch, YARN-8967.002.patch, > YARN-8967.003.patch > > > The PlacementRule interface was introduced to be used by all schedulers as > per YARN-3635. The CapacityScheduler is using it but the FairScheduler is not > and is using its own rule definition. > YARN-8948 cleans up the implementation and removes the CS references which > should allow this change to go through. > This would be the first step in using one placement rule engine for both > schedulers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA
[ https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732270#comment-16732270 ] Wangda Tan commented on YARN-9161: -- [~yuan_zac], thanks for reporting this issue. [~sunilg], do you remember this limit? What I remember is that the absolute resource config supports multiple dimensions, but from the code it seems the reported issue is valid. And [~yuan_zac], we need to make sure the configuration stays backward-compatible after this change. > Absolute resources of capacity scheduler doesn't support GPU and FPGA > - > > Key: YARN-9161 > URL: https://issues.apache.org/jira/browse/YARN-9161 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Major > Attachments: YARN-9161.001.patch > > > The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two > elements, memory and vcores, which filters out the absolute resource > configuration of gpu and fpga in > AbstractCSQueue.updateConfigurableResourceRequirement. > As a result, gpu and fpga cannot be allocated correctly -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9163) Deadlock when use yarn rmadmin -refreshQueues
[ https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732255#comment-16732255 ] Wangda Tan commented on YARN-9163: -- [~ziqian hu], could you upload a jstack dump, or at least the 3 full stack traces of the threads you mentioned? I couldn't locate the issue you described. > Deadlock when use yarn rmadmin -refreshQueues > - > > Key: YARN-9163 > URL: https://issues.apache.org/jira/browse/YARN-9163 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: Hu Ziqian >Assignee: Hu Ziqian >Priority: Blocker > Attachments: YARN-9163.001.patch > > > We have a cluster with 4000+ nodes and 100k+ apps per day in our production > environment. When we use the CLI: yarn rmadmin -refreshQueues, the active RM's > process gets stuck and HA failover doesn't happen, which means the whole cluster stops > serving and we can only fix it by rebooting the active RM. We can reproduce this on our > production cluster every time but can't reproduce it in our test environment, > which only has 100+ nodes and few apps. Both our production and test > environments use CapacityScheduler with the asyncSchedule function and > preemption enabled. > Analyzing the jstack of the active RM, we found a deadlock in it: > thread one (refreshQueues thread): > * takes the write lock of the capacity scheduler > * takes the write lock of the PreemptionManager > * waits for the read lock of the root queue > thread two (asyncScheduleThread): > * takes the read lock of the root queue > * waits for the write lock of the PreemptionManager > thread three (IPC handler on 8030 which handles allocate): > * waits for the write lock of the root queue > These three threads form a deadlock. > > The deadlock happens because of a "bug" of ReadWriteLock: a pending writeLock request > blocks future readLock acquisitions even with the unfair policy > ([https://bugs.openjdk.java.net/browse/JDK-6893626]). To solve > this problem, we changed the logic of the refreshQueues thread: get a copy of the queue info > first, and avoid having the thread hold the write lock of the PreemptionManager while taking the > read lock of the root queue at the same time. > > We tested our new code in our production environment and the refresh queue > command works well. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
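To make the cycle described above easier to see, here is a minimal, self-contained sketch of the reported lock ordering (illustrative names only, not the actual ResourceManager code). Thread one's read request on the root queue is never granted because thread three already has a write request queued on that lock, while thread two holds the root-queue read lock and waits for the PreemptionManager write lock that thread one holds:
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RefreshQueuesDeadlockSketch {
  private final ReentrantReadWriteLock schedulerLock = new ReentrantReadWriteLock();
  private final ReentrantReadWriteLock preemptionLock = new ReentrantReadWriteLock();
  private final ReentrantReadWriteLock rootQueueLock = new ReentrantReadWriteLock();

  // Thread one: the refreshQueues path.
  void refreshQueues() {
    schedulerLock.writeLock().lock();
    preemptionLock.writeLock().lock();
    try {
      // Blocks here: a queued writer (thread three) prevents this read lock from
      // being granted even with the default non-fair policy (JDK-6893626).
      rootQueueLock.readLock().lock();
      try { /* re-read queue configuration */ } finally { rootQueueLock.readLock().unlock(); }
    } finally {
      preemptionLock.writeLock().unlock();
      schedulerLock.writeLock().unlock();
    }
  }

  // Thread two: the async scheduling path.
  void asyncSchedule() {
    rootQueueLock.readLock().lock();
    try {
      // Blocks here: thread one already holds the PreemptionManager write lock.
      preemptionLock.writeLock().lock();
      try { /* consult preemption state */ } finally { preemptionLock.writeLock().unlock(); }
    } finally {
      rootQueueLock.readLock().unlock();
    }
  }

  // Thread three: the allocate() handler on port 8030; its queued write request is
  // what keeps thread one's read request from being granted.
  void allocate() {
    rootQueueLock.writeLock().lock();
    try { /* update the root queue */ } finally { rootQueueLock.writeLock().unlock(); }
  }
}
{code}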
[jira] [Commented] (YARN-9090) [Submarine] Adjust the submarine installation script document
[ https://issues.apache.org/jira/browse/YARN-9090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728538#comment-16728538 ] Wangda Tan commented on YARN-9090: -- +1, thanks [~liuxun323]. > [Submarine] Adjust the submarine installation script document > - > > Key: YARN-9090 > URL: https://issues.apache.org/jira/browse/YARN-9090 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xun Liu >Assignee: Xun Liu >Priority: Blocker > Attachments: YARN-9090.001.patch > > > Migrate the submarine installation script document from the hadoop-yarn > project. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9120) Need to have a way to turn off GPU auto-discovery in GpuDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721650#comment-16721650 ] Wangda Tan commented on YARN-9120: -- [~snemeth] / [~tangzhankun], I prefer to make the GPU plugin something that can be disabled/enabled as a whole. To me, adding the new option makes troubleshooting harder. And I'm not sure there's any solid requirement to allow enabling/disabling GPU while the node is running; there is some logic in the NM that may prevent this as well. Just my $0.02. > Need to have a way to turn off GPU auto-discovery in GpuDiscoverer > -- > > Key: YARN-9120 > URL: https://issues.apache.org/jira/browse/YARN-9120 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > > GpuDiscoverer.getGpusUsableByYarn either parses the user-defined GPU devices > or should have the value 'auto' (from the property > yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices). > In some circumstances, users would want to exclude a node from scheduling, so > they should have an option to turn off auto-discovery. > It's straightforward that this is possible by removing the GPU > resource-plugin from YARN's config along with GPU-related config in > container-executor.cfg, but doing that with a dedicated value for > yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is a more > lightweight approach. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9116) Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues
[ https://issues.apache.org/jira/browse/YARN-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719676#comment-16719676 ] Wangda Tan commented on YARN-9116: -- [~aihuaxu], this sounds like a plan, but the existing maximum memory, etc. is defined inside yarn-site.xml, and we will have support for multiple resource types like GPU. I suggest making the default maximum capacity definition use absolute values for all types. The name of the config could be: {code:java} yarn.scheduler.capacity.default-queue-maximum-capacity memory=20G,vcores=20,gpu=3 {code} You can reuse the implementation of org.apache.hadoop.yarn.submarine.client.cli.CliUtils#createResourceFromString from trunk (and move it to a common place if you think it is required). Thoughts? + [~sunilg], [~cheersyang] > Capacity Scheduler: add the default maximum-allocation-mb and > maximum-allocation-vcores for the queues > -- > > Key: YARN-9116 > URL: https://issues.apache.org/jira/browse/YARN-9116 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 2.7.0 >Reporter: Aihua Xu >Assignee: Aihua Xu >Priority: Major > > YARN-1582 adds support for a maximum-allocation-mb configuration per queue, > which targets supporting larger-container features on dedicated queues > (larger maximum-allocation-mb/maximum-allocation-vcores for such queues). > To achieve a larger container configuration, we need to increase the > global maximum-allocation-mb/maximum-allocation-vcores (e.g. 120G/256) and > then override those configurations with the desired values on the queues, since > the queue configuration can't be larger than the cluster configuration. There are > many queues in the system, and if we forget to configure such values when > adding a new queue, then that queue gets the default 120G/256, which typically is > not what we want. > We can come up with a queue-default configuration (set to a normal queue > configuration like 16G/8), so the leaf queues get such values by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
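If it helps to see how such a value would be consumed, below is a rough sketch of parsing a string like memory=20G,vcores=20,gpu=3 into per-resource values. It is only an illustration under the assumption that memory is tracked in MB; the helper actually suggested above is CliUtils#createResourceFromString, not this code:
{code:java}
import java.util.HashMap;
import java.util.Map;

public class ResourceStringSketch {
  // Hypothetical parser for a value like "memory=20G,vcores=20,gpu=3".
  static Map<String, Long> parse(String spec) {
    Map<String, Long> resources = new HashMap<>();
    for (String pair : spec.split(",")) {
      String[] kv = pair.trim().split("=");
      String name = kv[0].trim();
      String value = kv[1].trim().toUpperCase();
      long multiplier = 1L;
      if (value.endsWith("G")) {
        // Assumption: a "G" suffix means GB while the resource is tracked in MB.
        multiplier = 1024L;
        value = value.substring(0, value.length() - 1);
      }
      resources.put(name, Long.parseLong(value) * multiplier);
    }
    return resources;
  }

  public static void main(String[] args) {
    // Prints the parsed map, e.g. memory=20480 (MB), vcores=20, gpu=3 (order may vary).
    System.out.println(parse("memory=20G,vcores=20,gpu=3"));
  }
}
{code}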
[jira] [Commented] (YARN-9055) Capacity Scheduler: allow larger queue level maximum-allocation-mb to override the cluster configuration
[ https://issues.apache.org/jira/browse/YARN-9055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719658#comment-16719658 ] Wangda Tan commented on YARN-9055: -- [~aihuaxu], I agree with Thomas, this looks like a change of behavior. {quote}bq. What I can think of is: you have to increase the cluster configuration and override on the queue level which doesn't require larger containers. {quote} This makes more sense to me. > Capacity Scheduler: allow larger queue level maximum-allocation-mb to > override the cluster configuration > > > Key: YARN-9055 > URL: https://issues.apache.org/jira/browse/YARN-9055 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.7.0 >Reporter: Aihua Xu >Assignee: Aihua Xu >Priority: Major > Attachments: YARN-9055.1.patch > > > YARN-1582 adds the support of maximum-allocation-mb configuration per queue. > That feature gives the flexibility to give different memory requirements for > different queues. Such patch adds the limitation that the queue level > configuration can't exceed the cluster level default configuration, but I > feel it may make more sense to remove such limitation to allow any overrides > since > # Such configuration is controlled by the admin so it shouldn't get abused; > # It's common that typical queues require standard size containers while some > job (queues) have requirements for larger containers. With current > limitation, we have to set larger configuration on the cluster setting which > will cause resource abuse unless we override them on all the queues. > We can remove such limitation in CapacitySchedulerConfiguration.java so the > cluster setting provides the default value and queue setting can override it. > {noformat} >if (maxAllocationMbPerQueue > clusterMax.getMemorySize() > || maxAllocationVcoresPerQueue > clusterMax.getVirtualCores()) { > throw new IllegalArgumentException( > "Queue maximum allocation cannot be larger than the cluster setting" > + " for queue " + queue > + " max allocation per queue: " + result > + " cluster setting: " + clusterMax); > } > {noformat} > Let me know if it makes sense. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9015) [DevicePlugin] Add an interface for device plugin to provide customized scheduler
[ https://issues.apache.org/jira/browse/YARN-9015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719376#comment-16719376 ] Wangda Tan commented on YARN-9015: -- Committed to trunk, thanks [~tangzhankun]! > [DevicePlugin] Add an interface for device plugin to provide customized > scheduler > - > > Key: YARN-9015 > URL: https://issues.apache.org/jira/browse/YARN-9015 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9015-trunk.001.patch, YARN-9015-trunk.002.patch, > YARN-9015-trunk.003.patch, YARN-9015-trunk.004.patch > > > A vendor might need a customized scheduling policy for their devices. It > could be scheduled based on topology, resource utilization, virtualization, > device attribute and so on. > We'll provide another optional interface "DevicePluginScheduler" for the > vendor device plugin to implement. Once it's implemented, the framework will > prefer it to the default scheduler. > This would bring more flexibility to the framework's scheduling mechanism. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8885) [DevicePlugin] Support NM APIs to query device resource allocation
[ https://issues.apache.org/jira/browse/YARN-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8885: - Summary: [DevicePlugin] Support NM APIs to query device resource allocation (was: Phase 1 - Support NM APIs to query device resource allocation) > [DevicePlugin] Support NM APIs to query device resource allocation > -- > > Key: YARN-8885 > URL: https://issues.apache.org/jira/browse/YARN-8885 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8885-trunk.001.patch, YARN-8885-trunk.002.patch, > YARN-8885-trunk.003.patch > > > Support a REST API in the NM for users to query allocation > *_nodemanager_address:port/ws/v1/node/resources/\{resource_name}_* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9015) [DevicePlugin] Add an interface for device plugin to provide customized scheduler
[ https://issues.apache.org/jira/browse/YARN-9015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-9015: - Summary: [DevicePlugin] Add an interface for device plugin to provide customized scheduler (was: Phase 1 - Add an interface for device plugin to provide customized scheduler) > [DevicePlugin] Add an interface for device plugin to provide customized > scheduler > - > > Key: YARN-9015 > URL: https://issues.apache.org/jira/browse/YARN-9015 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-9015-trunk.001.patch, YARN-9015-trunk.002.patch, > YARN-9015-trunk.003.patch, YARN-9015-trunk.004.patch > > > A vendor might need a customized scheduling policy for their devices. It > could be scheduled based on topology, resource utilization, virtualization, > device attribute and so on. > We'll provide another optional interface "DevicePluginScheduler" for the > vendor device plugin to implement. Once it's implemented, the framework will > prefer it to the default scheduler. > This would bring more flexibility to the framework's scheduling mechanism. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9112) [Submarine] Support polling applicationId when it's not ready in cluster
[ https://issues.apache.org/jira/browse/YARN-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719349#comment-16719349 ] Wangda Tan commented on YARN-9112: -- LGTM, +1. Thanks [~tangzhankun]. > [Submarine] Support polling applicationId when it's not ready in cluster > > > Key: YARN-9112 > URL: https://issues.apache.org/jira/browse/YARN-9112 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-9112-trunk.001.patch > > > This could happen when an application is not ready in the cluster. A polling > for the application Id is needed. > {code:java} > 18/12/11 22:39:58 INFO client.ApiServiceClient: Application ID: > application_1532131617202_0063 > Exception in thread "main" org.apache.hadoop.yarn.exceptions.YarnException: > Can't get application id for Service tensorboard-service > at > org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:618) > at > org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:241) > at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:91) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:318) > at org.apache.hadoop.util.RunJar.main(RunJar.java:232){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8885) Phase 1 - Support NM APIs to query device resource allocation
[ https://issues.apache.org/jira/browse/YARN-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719347#comment-16719347 ] Wangda Tan commented on YARN-8885: -- Thanks [~tangzhankun], the patch LGTM, I will commit it today. > Phase 1 - Support NM APIs to query device resource allocation > - > > Key: YARN-8885 > URL: https://issues.apache.org/jira/browse/YARN-8885 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8885-trunk.001.patch, YARN-8885-trunk.002.patch, > YARN-8885-trunk.003.patch > > > Support a REST API in the NM for users to query allocation > *_nodemanager_address:port/ws/v1/node/resources/\{resource_name}_* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9078) [Submarine] Clean up the code of CliUtils#parseResourcesString
[ https://issues.apache.org/jira/browse/YARN-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719354#comment-16719354 ] Wangda Tan commented on YARN-9078: -- Change looks good. Thanks [~tangzhankun]. > [Submarine] Clean up the code of CliUtils#parseResourcesString > -- > > Key: YARN-9078 > URL: https://issues.apache.org/jira/browse/YARN-9078 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Minor > Attachments: YARN-9078-trunk.001.patch > > > Some minor changes to clean up the CliUtils#parseResourcesString for better > readability. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9015) Phase 1 - Add an interface for device plugin to provide customized scheduler
[ https://issues.apache.org/jira/browse/YARN-9015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719351#comment-16719351 ] Wangda Tan commented on YARN-9015: -- Thanks [~tangzhankun], latest patch LGTM, +1. > Phase 1 - Add an interface for device plugin to provide customized scheduler > > > Key: YARN-9015 > URL: https://issues.apache.org/jira/browse/YARN-9015 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-9015-trunk.001.patch, YARN-9015-trunk.002.patch, > YARN-9015-trunk.003.patch, YARN-9015-trunk.004.patch > > > A vendor might need a customized scheduling policy for their devices. It > could be scheduled based on topology, resource utilization, virtualization, > device attribute and so on. > We'll provide another optional interface "DevicePluginScheduler" for the > vendor device plugin to implement. Once it's implemented, the framework will > prefer it to the default scheduler. > This would bring more flexibility to the framework's scheduling mechanism. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9075) Dynamically add or remove auxiliary services
[ https://issues.apache.org/jira/browse/YARN-9075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719307#comment-16719307 ] Wangda Tan commented on YARN-9075: -- Thanks [~billie.rinaldi], the overall code flow looks good to me. Some comments on the implementation: 1) org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices#loadManifest: First, I suggest making all methods that change internal data structures synchronized. Unless you feel there will be some performance issues, I would prefer to make it simpler. And inside the method, I found that a service could be started before it is added to serviceMap. Is it possible that a service could "leak" when something goes wrong for another service (like a RuntimeException thrown in AuxServices#initAuxService)? To make it simpler, I suggest changing the code flow to: {code:java} foreach (service : configured-service): if (serviceMap.contains(newService)): stop old service Service newService = load-and-start-service serviceMap.add(newService){code} This can avoid leaking services. And it seems a service will be stopped twice in the existing logic, inside initAuxService and again after all service initializations: {code:java} if (!loadedAuxServices.contains(entry.getKey())) { foundChanges = true; stopAuxService(entry.getValue()); it.remove(); }{code} 2) Why change ShuffleHandler? 3) Could you provide an example shuffle service manifest JSON file? We need to file a separate Jira for documentation changes. > Dynamically add or remove auxiliary services > > > Key: YARN-9075 > URL: https://issues.apache.org/jira/browse/YARN-9075 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Billie Rinaldi >Assignee: Billie Rinaldi >Priority: Major > Attachments: YARN-9075.001.patch, YARN-9075.002.patch, > YARN-9075.003.patch, YARN-9075_Dynamic_Aux_Services_V1.pdf > > > It would be useful to support adding, removing, or updating auxiliary > services without requiring a restart of NMs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
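The flow suggested in the comment above could look roughly like the following sketch. The AuxService and AuxServiceSpec types here are hypothetical stand-ins, not the NM classes; the point is only the ordering: stop the old instance first, and register the new instance only after it has started, so a failure cannot leak a half-registered service or stop the same service twice:
{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the NM aux-service types; only the reload flow matters here.
interface AuxService { void stop(); }
interface AuxServiceSpec { String getName(); AuxService loadAndStart() throws Exception; }

public class AuxServicesReloadSketch {
  private final Map<String, AuxService> serviceMap = new HashMap<>();

  // Every method touching shared state is synchronized, as suggested above.
  public synchronized void loadManifest(List<AuxServiceSpec> configured) throws Exception {
    for (AuxServiceSpec spec : configured) {
      AuxService old = serviceMap.remove(spec.getName());
      if (old != null) {
        old.stop();                        // stop exactly once, before the replacement starts
      }
      AuxService started = spec.loadAndStart();
      serviceMap.put(spec.getName(), started);  // register only after a successful start
    }
    // Removing services that are no longer configured would follow the same pattern.
  }
}
{code}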
[jira] [Commented] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.
[ https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718234#comment-16718234 ] Wangda Tan commented on YARN-8714: -- [~tangzhankun], sounds like a plan, but let's try to solve the issue on the service side early if possible. Could you file a new Jira and work on that when you get a chance? +1 to the latest patch. > [Submarine] Support files/tarballs to be localized for a training job. > -- > > Key: YARN-8714 > URL: https://issues.apache.org/jira/browse/YARN-8714 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8714-WIP1-trunk-001.patch, > YARN-8714-WIP1-trunk-002.patch, YARN-8714-trunk.001.patch, > YARN-8714-trunk.002.patch, YARN-8714-trunk.003.patch, > YARN-8714-trunk.004.patch, YARN-8714-trunk.005.patch, > YARN-8714-trunk.006.patch, YARN-8714-trunk.007.patch, > YARN-8714-trunk.008.patch, YARN-8714-trunk.009.patch, > YARN-8714-trunk.010.patch > > > See > [https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7], > {{job run --localization ...}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9087) Better logging for initialization of Resource plugins
[ https://issues.apache.org/jira/browse/YARN-9087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711837#comment-16711837 ] Wangda Tan commented on YARN-9087: -- [~snemeth], the device plugin framework is for future plugins. We won't "deprecate" the existing GPU implementation, but I expect that once the device plugin framework becomes ready, we can refactor to get a better and more maintainable code structure. The nvidia-docker-plugin has two versions, v1 and v2. As of now, we're using the v1 nvidia-docker-plugin, which is deprecated by nvidia. There's a patch to support the v2 plugin (YARN-8822). Same as the existing GPU implementation: once the device plugin framework becomes ready, we will refactor the code to use it if the effort is reasonable. > Better logging for initialization of Resource plugins > - > > Key: YARN-9087 > URL: https://issues.apache.org/jira/browse/YARN-9087 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9087.001.patch > > > The patch includes the following enhancements for logging: > - Logging initializer code of resource handlers in > {{LinuxContainerExecutor#init}} > - Logging initializer code of resource plugins in > {{ResourcePluginManager#initialize}} > - Added toString to {{ResourceHandlerChain}} > - Added toString to all subclasses of {{ResourcePlugin}} > as they are printed in {{ResourcePluginManager#initialize}} > - Added toString to all subclasses of {{ResourceHandler}} > as they are printed as a field of the {{LinuxContainerExecutor#init}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8822) Nvidia-docker v2 support
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8822: - Priority: Critical (was: Major) > Nvidia-docker v2 support > > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Critical > Labels: Docker > Attachments: YARN-8822-branch-3.1.1.001.patch, YARN-8822.001.patch, > YARN-8822.002.patch > > > To run a GPU container with Docker, we already have nvidia-docker v1 support, > but it is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvidia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8822) Nvidia-docker v2 support
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710646#comment-16710646 ] Wangda Tan commented on YARN-8822: -- [~Charo Zhang], thanks for the patch, and apologies for missing this Jira. I took a quick look; in general the patch looks good. My only concern is whether we should add an additional check on the runtime used by the container-executor binary. You can use the method add_param_to_command_if_allowed to check if a runtime is allowed. The reason is that c-e runs as root, and we have seen security-related issues before. By default the allowed runtime should be empty. And if you could share 1) documentation about the new configs (added to the YARN docs) and 2) a test report, we can be more confident about getting this patch committed. [~tangzhankun], regarding using the new device plugin framework vs. the old framework, personally I think we can do that slowly. Given the device plugin framework is not ready yet, we can migrate plugins to it once it is ready. Regarding the target version, we should always get the patch committed to trunk first, and then backport it to older release lines. [~Charo Zhang], let's try to get the trunk patch done and backport it to branch-3.1 and branch-3.2. I expect we have about one week before the 3.1.2 release; it's best if we can finish the patch before then. > Nvidia-docker v2 support > > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Major > Labels: Docker > Attachments: YARN-8822-branch-3.1.1.001.patch, YARN-8822.001.patch, > YARN-8822.002.patch > > > To run a GPU container with Docker, we already have nvidia-docker v1 support, > but it is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvidia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8822) Nvidia-docker v2 support
[ https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8822: - Fix Version/s: (was: 3.1.2) > Nvidia-docker v2 support > > > Key: YARN-8822 > URL: https://issues.apache.org/jira/browse/YARN-8822 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.1 >Reporter: Zhankun Tang >Assignee: Charo Zhang >Priority: Major > Labels: Docker > Attachments: YARN-8822-branch-3.1.1.001.patch, YARN-8822.001.patch, > YARN-8822.002.patch > > > To run a GPU container with Docker, we already have nvidia-docker v1 support, > but it is deprecated per > [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We > should support nvidia-docker v2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8870) [Submarine] Add submarine installation scripts
[ https://issues.apache.org/jira/browse/YARN-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709341#comment-16709341 ] Wangda Tan commented on YARN-8870: -- As we discussed offline, reverted the patch from branches. It's better to move such scripts outside of Hadoop core. > [Submarine] Add submarine installation scripts > -- > > Key: YARN-8870 > URL: https://issues.apache.org/jira/browse/YARN-8870 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xun Liu >Assignee: Xun Liu >Priority: Critical > Attachments: YARN-8870-addendum.008.patch, YARN-8870.001.patch, > YARN-8870.004.patch, YARN-8870.005.patch, YARN-8870.006.patch, > YARN-8870.007.patch, YARN-8870.009.patch, YARN-8870.010.patch, > YARN-8870.011.patch, YARN-8870.012.patch > > > In order to reduce the deployment difficulty of Hadoop > {Submarine} DNS, Docker, GPU, Network, graphics card, operating system kernel > modification and other components, I specially developed this installation > script to deploy Hadoop \{Submarine} > runtime environment, providing one-click installation Scripts, which can also > be used to install, uninstall, start, and stop individual components step by > step. > > {color:#ff}design d{color}{color:#FF}ocument:{color} > [https://docs.google.com/document/d/1muCTGFuUXUvM4JaDYjKqX5liQEg-AsNgkxfLMIFxYHU/edit?usp=sharing] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8870) [Submarine] Add submarine installation scripts
[ https://issues.apache.org/jira/browse/YARN-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8870: - Target Version/s: (was: 3.2.0) > [Submarine] Add submarine installation scripts > -- > > Key: YARN-8870 > URL: https://issues.apache.org/jira/browse/YARN-8870 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xun Liu >Assignee: Xun Liu >Priority: Critical > Attachments: YARN-8870-addendum.008.patch, YARN-8870.001.patch, > YARN-8870.004.patch, YARN-8870.005.patch, YARN-8870.006.patch, > YARN-8870.007.patch, YARN-8870.009.patch, YARN-8870.010.patch, > YARN-8870.011.patch, YARN-8870.012.patch > > > In order to reduce the deployment difficulty of Hadoop > {Submarine} DNS, Docker, GPU, Network, graphics card, operating system kernel > modification and other components, I specially developed this installation > script to deploy Hadoop \{Submarine} > runtime environment, providing one-click installation Scripts, which can also > be used to install, uninstall, start, and stop individual components step by > step. > > {color:#ff}design d{color}{color:#FF}ocument:{color} > [https://docs.google.com/document/d/1muCTGFuUXUvM4JaDYjKqX5liQEg-AsNgkxfLMIFxYHU/edit?usp=sharing] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8870) [Submarine] Add submarine installation scripts
[ https://issues.apache.org/jira/browse/YARN-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8870: - Fix Version/s: (was: 3.2.0) > [Submarine] Add submarine installation scripts > -- > > Key: YARN-8870 > URL: https://issues.apache.org/jira/browse/YARN-8870 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xun Liu >Assignee: Xun Liu >Priority: Critical > Attachments: YARN-8870-addendum.008.patch, YARN-8870.001.patch, > YARN-8870.004.patch, YARN-8870.005.patch, YARN-8870.006.patch, > YARN-8870.007.patch, YARN-8870.009.patch, YARN-8870.010.patch, > YARN-8870.011.patch, YARN-8870.012.patch > > > In order to reduce the deployment difficulty of Hadoop > {Submarine} DNS, Docker, GPU, Network, graphics card, operating system kernel > modification and other components, I specially developed this installation > script to deploy Hadoop \{Submarine} > runtime environment, providing one-click installation Scripts, which can also > be used to install, uninstall, start, and stop individual components step by > step. > > {color:#ff}design d{color}{color:#FF}ocument:{color} > [https://docs.google.com/document/d/1muCTGFuUXUvM4JaDYjKqX5liQEg-AsNgkxfLMIFxYHU/edit?usp=sharing] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.
[ https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709178#comment-16709178 ] Wangda Tan commented on YARN-8714: -- Thanks [~tangzhankun], what I remember is that YARN doesn't support localizing a directory as a LocalResource, but I could be wrong as well. Hope you're correct :). Please keep us posted on your testing. > [Submarine] Support files/tarballs to be localized for a training job. > -- > > Key: YARN-8714 > URL: https://issues.apache.org/jira/browse/YARN-8714 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8714-WIP1-trunk-001.patch, > YARN-8714-WIP1-trunk-002.patch, YARN-8714-trunk.001.patch, > YARN-8714-trunk.002.patch > > > See > [https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7], > {{job run --localization ...}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.
[ https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707602#comment-16707602 ] Wangda Tan commented on YARN-8714: -- [~liuxun323], fair enough. [~tangzhankun], I think we can add a setting to the submarine config; by default set it to 2GB or so. And print logs when we download files, tar them locally and upload to HDFS, to make troubleshooting easier. Also please remove the local tmp file once the upload is done. Another concern: if this operation needs to be done repeatedly for every submitted job, it is going to be a big issue. If we could append the directory's modification time and size to the tar file for now, later we can optimize it to share the same uploaded files across jobs. > [Submarine] Support files/tarballs to be localized for a training job. > -- > > Key: YARN-8714 > URL: https://issues.apache.org/jira/browse/YARN-8714 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8714-WIP1-trunk-001.patch, > YARN-8714-WIP1-trunk-002.patch, YARN-8714-trunk.001.patch, > YARN-8714-trunk.002.patch > > > See > [https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7], > {{job run --localization ...}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
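A rough sketch of the tar-and-upload step discussed above, under stated assumptions: a hypothetical 2GB cap, a local tar binary on the PATH, and a temp-file name tagged with the directory's modification time and size. This is not the Submarine implementation, only an illustration of the flow (size check, tar, upload, cleanup):
{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalDirUploadSketch {
  // Hypothetical cap; the comment above suggests a configurable default around 2GB.
  static final long MAX_BYTES = 2L * 1024 * 1024 * 1024;

  static void tarAndUpload(Configuration conf, File localDir, Path hdfsDest)
      throws IOException, InterruptedException {
    long size = Files.walk(localDir.toPath())
        .filter(Files::isRegularFile).mapToLong(p -> p.toFile().length()).sum();
    if (size > MAX_BYTES) {
      throw new IOException("Directory " + localDir + " is larger than the configured cap");
    }
    // Tag the temp tarball with the directory's modification time and size, as suggested
    // above, so identical uploads could later be shared across jobs.
    File tmpTar = File.createTempFile(
        localDir.getName() + "_" + localDir.lastModified() + "_" + size, ".tar.gz");
    Process tar = new ProcessBuilder("tar", "-czf", tmpTar.getAbsolutePath(),
        "-C", localDir.getParent(), localDir.getName()).inheritIO().start();
    if (tar.waitFor() != 0) {
      throw new IOException("tar failed for " + localDir);
    }
    try {
      FileSystem fs = hdfsDest.getFileSystem(conf);
      fs.copyFromLocalFile(new Path(tmpTar.getAbsolutePath()), hdfsDest);
    } finally {
      tmpTar.delete();                  // remove the local tmp file once the upload is done
    }
  }
}
{code}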
[jira] [Commented] (YARN-9015) Phase 1 - Add an interface for device plugin to provide customized scheduler
[ https://issues.apache.org/jira/browse/YARN-9015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707587#comment-16707587 ] Wangda Tan commented on YARN-9015: -- [~tangzhankun], 1) DevicePluginScheduler: Why use Integer instead of int? 2) DeviceMappingManager: - devicePluginSchedulers: I suggest making access to it synchronized, given assignDevices is accessed under a synchronized lock. DeviceMappingManager is not used in a highly concurrent environment, so I suggest making all access synchronized to simplify the logic. - {code} 312 // TODO: should check if customized scheduler return values are valid 313 if (dpsAllocated.size() != count) { {code} Is there any check needed? - {code} // TODO: fall back to default schedule logic? {code} I think your existing throw-exception logic is good enough. > Phase 1 - Add an interface for device plugin to provide customized scheduler > > > Key: YARN-9015 > URL: https://issues.apache.org/jira/browse/YARN-9015 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-9015-trunk.001.patch, YARN-9015-trunk.002.patch, > YARN-9015-trunk.003.patch > > > A vendor might need a customized scheduling policy for their devices. It > could be scheduled based on topology, resource utilization, virtualization, > device attribute and so on. > We'll provide another optional interface "DevicePluginScheduler" for the > vendor device plugin to implement. Once it's implemented, the framework will > prefer it to the default scheduler. > This would bring more flexibility to the framework's scheduling mechanism. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
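A minimal sketch of the synchronized access and result validation suggested above. The Device and DevicePluginScheduler interfaces here are simplified stand-ins declared locally for illustration; they are not the real YARN device-framework types or signatures:
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for the device framework types discussed above.
interface Device { }
interface DevicePluginScheduler {
  Set<Device> allocateDevices(Set<Device> available, int count);
}

public class DeviceMappingSketch {
  private final Map<String, DevicePluginScheduler> devicePluginSchedulers = new HashMap<>();

  // All access to the shared map goes through synchronized methods, as suggested above.
  public synchronized void addScheduler(String resource, DevicePluginScheduler s) {
    devicePluginSchedulers.put(resource, s);
  }

  public synchronized Set<Device> assignDevices(String resource, Set<Device> available, int count) {
    Set<Device> allocated = devicePluginSchedulers.get(resource).allocateDevices(available, count);
    // Sanity-check the customized scheduler's answer instead of trusting it blindly.
    if (allocated == null || allocated.size() != count || !available.containsAll(allocated)) {
      throw new IllegalStateException("Device plugin scheduler returned an invalid allocation");
    }
    return allocated;
  }
}
{code}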
[jira] [Commented] (YARN-8885) Phase 1 - Support NM APIs to query device resource allocation
[ https://issues.apache.org/jira/browse/YARN-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707577#comment-16707577 ] Wangda Tan commented on YARN-8885: -- [~tangzhankun], could you provide example output of the API? Thanks. > Phase 1 - Support NM APIs to query device resource allocation > - > > Key: YARN-8885 > URL: https://issues.apache.org/jira/browse/YARN-8885 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8885-trunk.001.patch, YARN-8885-trunk.002.patch, > YARN-8885-trunk.003.patch > > > Support a REST API in the NM for users to query allocation > *_nodemanager_address:port/ws/v1/node/resources/\{resource_name}_* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9078) [Submarine] Clean up the code of CliUtils#parseResourcesString
[ https://issues.apache.org/jira/browse/YARN-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707574#comment-16707574 ] Wangda Tan commented on YARN-9078: -- [~tangzhankun], I'm wondering if {code} 82 if (resourcesStr.startsWith("[")) { 83 resourcesStr = resourcesStr.substring(1); 84 } 85 if (resourcesStr.endsWith("]")) { 86 resourcesStr = resourcesStr.substring(0, resourcesStr.length() - 1); 87 } {code} should be removed. > [Submarine] Clean up the code of CliUtils#parseResourcesString > -- > > Key: YARN-9078 > URL: https://issues.apache.org/jira/browse/YARN-9078 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Minor > Attachments: YARN-9078-trunk.001.patch > > > Some minor changes to clean up the CliUtils#parseResourcesString for better > readability. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9050) Usability improvements for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707553#comment-16707553 ] Wangda Tan commented on YARN-9050: -- [~Tao Yang], makes sense to me. Once you figure out the details, I can help with reviews, etc. > Usability improvements for scheduler activities > --- > > Key: YARN-9050 > URL: https://issues.apache.org/jira/browse/YARN-9050 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: image-2018-11-23-16-46-38-138.png > > > We have made some usability improvements for scheduler activities based on > YARN 3.1 in our cluster, as follows: > 1. Not usable with multi-threaded asynchronous scheduling. App and node > activities may be confused when multiple scheduling threads record activities of > different allocation processes in the same variables, like appsAllocation and > recordingNodesAllocation in ActivitiesManager. I think these variables should > be thread-local to keep activities clear among multiple threads. > 2. Incomplete activities for the multi-node lookup mechanism, since > ActivitiesLogger will skip recording through {{if (node == null || > activitiesManager == null) }} when node is null, which represents an allocation > over multiple nodes. We need to support recording activities for the multi-node lookup mechanism. > 3. Current app activities cannot meet the requirements of diagnostics; for > example, we can know that a node doesn't match a request but it is hard to know why, > especially when using placement constraints, and it's difficult to make a > detailed diagnosis manually. So I propose to improve the diagnoses of > activities: add a diagnosis for the placement constraints check, update the insufficient-resource > diagnosis with detailed info (like 'insufficient resource > names:[memory-mb]'), and so on. > 4. Add more useful fields for app activities. In some scenarios we need to > distinguish different requests but can't locate requests based on the app > activities info; some other fields, such as allocation tags, can help to filter what we want. > We have added containerPriority, allocationRequestId > and allocationTags fields in AppAllocation. > 5. Filter app activities by key fields. Sometimes the app activities results are > massive and it's hard to find what we want. We support filtering > by allocation-tags to meet requirements from some apps; moreover, we can > take container-priority and allocation-request-id as candidates if necessary. > 6. Aggregate app activities by diagnoses. For a single allocation process, > activities can still be massive in a large cluster; we frequently want to > know why a request can't be allocated in the cluster, and it's hard to check every node > manually in a large cluster, so aggregation of app activities by diagnoses is necessary to solve > this trouble. We have added a groupingType > parameter to the app-activities REST API for this, which supports grouping by > diagnostics, for example: > !image-2018-11-23-16-46-38-138.png! > I think we can have a discussion about these points; useful improvements that > are accepted will be added into the patch. Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8870) [Submarine] Add submarine installation scripts
[ https://issues.apache.org/jira/browse/YARN-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706070#comment-16706070 ] Wangda Tan commented on YARN-8870: -- [~liuxun323], I figured out how to do it manually. First you need to install shellcheck. Then you can use {code} shellcheck hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/* {code} and {code} shellcheck hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/* {code} to do the check. > [Submarine] Add submarine installation scripts > -- > > Key: YARN-8870 > URL: https://issues.apache.org/jira/browse/YARN-8870 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xun Liu >Assignee: Xun Liu >Priority: Critical > Fix For: 3.2.0 > > Attachments: YARN-8870-addendum.008.patch, YARN-8870.001.patch, > YARN-8870.004.patch, YARN-8870.005.patch, YARN-8870.006.patch, > YARN-8870.007.patch, YARN-8870.009.patch, YARN-8870.010.patch > > > In order to reduce the deployment difficulty of Hadoop > {Submarine} DNS, Docker, GPU, Network, graphics card, operating system kernel > modification and other components, I specially developed this installation > script to deploy Hadoop \{Submarine} > runtime environment, providing one-click installation Scripts, which can also > be used to install, uninstall, start, and stop individual components step by > step. > > {color:#ff}design d{color}{color:#FF}ocument:{color} > [https://docs.google.com/document/d/1muCTGFuUXUvM4JaDYjKqX5liQEg-AsNgkxfLMIFxYHU/edit?usp=sharing] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9010) Fix the incorrect trailing slash deletion in constructor method of CGroupsHandlerImpl
[ https://issues.apache.org/jira/browse/YARN-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703996#comment-16703996 ] Wangda Tan commented on YARN-9010: -- Committed to trunk, thanks [~tangzhankun]. > Fix the incorrect trailing slash deletion in constructor method of > CGroupsHandlerImpl > - > > Key: YARN-9010 > URL: https://issues.apache.org/jira/browse/YARN-9010 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9010-trunk-001.patch > > > In constructor method of CGroupsHandlerImpl: > {code:java} > this.cGroupPrefix = conf.get(YarnConfiguration. > NM_LINUX_CONTAINER_CGROUPS_HIERARCHY, "/hadoop-yarn") > .replaceAll("^/", "").replaceAll("$/", "");{code} > The "$/" regex expression is not working. And "^/" for leading slash handling > is also not good enough. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
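For reference, the behavior of the two expressions can be seen in a few lines; the "fixed" variant below is only one possible way to strip the slashes and is not necessarily what the committed patch does:
{code:java}
public class CGroupPrefixTrimSketch {
  public static void main(String[] args) {
    String hierarchy = "/hadoop-yarn/";

    // The original expression: "$/" matches nothing (the "$" anchor is only satisfied at the
    // very end of the input), so the trailing slash survives, and "^/" strips only one
    // leading slash.
    String broken = hierarchy.replaceAll("^/", "").replaceAll("$/", "");
    System.out.println(broken);   // hadoop-yarn/

    // Illustrative fix: strip any run of leading and trailing slashes.
    String fixed = hierarchy.replaceAll("^/+", "").replaceAll("/+$", "");
    System.out.println(fixed);    // hadoop-yarn
  }
}
{code}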
[jira] [Updated] (YARN-9010) Fix the incorrect trailing slash deletion in constructor method of CGroupsHandlerImpl
[ https://issues.apache.org/jira/browse/YARN-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-9010: - Priority: Major (was: Minor) > Fix the incorrect trailing slash deletion in constructor method of > CGroupsHandlerImpl > - > > Key: YARN-9010 > URL: https://issues.apache.org/jira/browse/YARN-9010 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-9010-trunk-001.patch > > > In constructor method of CGroupsHandlerImpl: > {code:java} > this.cGroupPrefix = conf.get(YarnConfiguration. > NM_LINUX_CONTAINER_CGROUPS_HIERARCHY, "/hadoop-yarn") > .replaceAll("^/", "").replaceAll("$/", "");{code} > The "$/" regex expression is not working. And "^/" for leading slash handling > is also not good enough. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8870) [Submarine] Add submarine installation scripts
[ https://issues.apache.org/jira/browse/YARN-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703895#comment-16703895 ] Wangda Tan commented on YARN-8870: -- That's my bad. [~liuxun323], could you work on an addendum patch to get the issue resolved? [~sunilg], it looks like you will work on RC1 soon; if the addendum patch doesn't get committed before you start the RC1 build, please revert this patch from branch-3.2.0. Thanks. > [Submarine] Add submarine installation scripts > -- > > Key: YARN-8870 > URL: https://issues.apache.org/jira/browse/YARN-8870 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xun Liu >Assignee: Xun Liu >Priority: Critical > Fix For: 3.2.0 > > Attachments: YARN-8870.001.patch, YARN-8870.004.patch, > YARN-8870.005.patch, YARN-8870.006.patch, YARN-8870.007.patch > > > In order to reduce the deployment difficulty of Hadoop > {Submarine} DNS, Docker, GPU, Network, graphics card, operating system kernel > modification and other components, I specially developed this installation > script to deploy Hadoop \{Submarine} > runtime environment, providing one-click installation Scripts, which can also > be used to install, uninstall, start, and stop individual components step by > step. > > {color:#ff}design d{color}{color:#FF}ocument:{color} > [https://docs.google.com/document/d/1muCTGFuUXUvM4JaDYjKqX5liQEg-AsNgkxfLMIFxYHU/edit?usp=sharing] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9050) Usability improvements for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703553#comment-16703553 ] Wangda Tan commented on YARN-9050: -- [~Tao Yang], thanks for filing the JIRA. The all issues you mentioned are valid to me, if you have any cycles to do such improvements, please convert this to umbrella and we can help with patch reviews. My bottomline is try to lower overhead of the activities recording as much as possible when it is not recording. And also if you have any ideas about make the result can be easier accessed by users, such as via web ui / cli, etc. it gonna be super helpful. > Usability improvements for scheduler activities > --- > > Key: YARN-9050 > URL: https://issues.apache.org/jira/browse/YARN-9050 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: image-2018-11-23-16-46-38-138.png > > > We have did some usability improvements for scheduler activities based on > YARN3.1 in our cluster as follows: > 1. Not available for multi-thread asynchronous scheduling. App and node > activites maybe confused when multiple scheduling threads record activites of > different allocation processes in the same variables like appsAllocation and > recordingNodesAllocation in ActivitiesManager. I think these variables should > be thread-local to make activities clear between multiple threads. > 2. Incomplete activites for multi-node lookup machanism, since > ActivitiesLogger will skip recording through {{if (node == null || > activitiesManager == null) return; }} when node is null which represents this > allocation is for multi-nodes. We need support recording activities for > multi-node lookup machanism. > 3. Current app activites can not meet requirements of diagnostics, for > example, we can know that node doesn't match request but hard to know why, > especially when using placement constraints, it's difficult to make a > detailed diagnosis manually. So I propose to improve the diagnoses of > activites, add diagnosis for placement constraints check, update insufficient > resource diagnosis with detailed info (like 'insufficient resource > names:[memory-mb]') and so on. > 4. Add more useful fields for app activities, in some scenarios we need to > distinguish different requests but can't locate requests based on app > activities info, there are some other fields can help to filter what we want > such as allocation tags. We have added containerPriority, allocationRequestId > and allocationTags fields in AppAllocation. > 5. Filter app activities by key fields, sometimes the results of app > activities is massive, it's hard to find what we want. We have support filter > by allocation-tags to meet requirements from some apps, more over, we can > take container-priority and allocation-request-id as candidates if necessary. > 6. Aggragate app activities by diagnoses. For a single allocation process, > activities still can be massive in a large cluster, we frequently want to > know why request can't be allocated in cluster, it's hard to check every node > manually in a large cluster, so that aggragation for app activities by > diagnoses is neccessary to solve this trouble. We have added groupingType > parameter for app-activities REST API for this, supports grouping by > diagnositics and example like this: > !image-2018-11-23-16-46-38-138.png! 
> I think we can have a discussion about these points; useful improvements that are accepted will be added to the patch. Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
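A minimal sketch of the thread-local recording idea from point 1 above (the class, field and method names are illustrative, not the actual ActivitiesManager code):
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each scheduling thread records its own allocation activities, so concurrent
// asynchronous scheduling threads no longer interleave entries in shared maps.
public class ThreadLocalActivitiesRecorder {
  private final ThreadLocal<Map<String, List<String>>> appsAllocation =
      ThreadLocal.withInitial(HashMap::new);

  public void recordAppActivity(String appId, String diagnostic) {
    appsAllocation.get()
        .computeIfAbsent(appId, k -> new ArrayList<>())
        .add(diagnostic);
  }

  // Returns and clears what the current thread recorded for the given app.
  public List<String> getAndClear(String appId) {
    List<String> recorded = appsAllocation.get().remove(appId);
    return recorded == null ? new ArrayList<>() : recorded;
  }
}
{code}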
[jira] [Commented] (YARN-9060) [YARN-8851] Phase 1 - Support device isolation in native container-executor
[ https://issues.apache.org/jira/browse/YARN-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703546#comment-16703546 ] Wangda Tan commented on YARN-9060: -- [~tangzhankun], the explanation makes sense, and the issue about GPU seems valid; once you have a JIRA and a patch, I can help with patch reviews. > [YARN-8851] Phase 1 - Support device isolation in native container-executor > --- > > Key: YARN-9060 > URL: https://issues.apache.org/jira/browse/YARN-9060 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-9060-trunk.001.patch > > > Due to the cgroups v1 implementation policy in the Linux kernel, we cannot update the values of the devices cgroup controller unless we have root permission ([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]). So we need to support this in container-executor for the Java layer to invoke. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
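For context, a rough sketch of the cgroups v1 devices-controller update the description refers to; the cgroup path and device numbers below are assumptions for illustration, and in practice this write has to be performed by the privileged container-executor because it requires root:
{code:java}
import java.io.FileWriter;
import java.io.IOException;

public class DeviceCgroupSketch {
  // The v1 devices controller is driven by writing entries such as
  // "c 195:0 rwm" into devices.deny / devices.allow under the container's
  // cgroup directory.
  public static void denyCharDevice(String cgroupPath, int major, int minor)
      throws IOException {
    try (FileWriter w = new FileWriter(cgroupPath + "/devices.deny")) {
      w.write("c " + major + ":" + minor + " rwm");
    }
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical example: hide GPU 195:0 from one container's cgroup.
    denyCharDevice("/sys/fs/cgroup/devices/hadoop-yarn/container_e01_0001", 195, 0);
  }
}
{code}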
[jira] [Resolved] (YARN-8975) [Submarine] Use predefined Charset object StandardCharsets.UTF_8 instead of String "UTF-8"
[ https://issues.apache.org/jira/browse/YARN-8975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-8975. -- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.3.0 Committed to trunk. Thanks [~tangzhankun] for the patch and [~ajisakaa] for the review. > [Submarine] Use predefined Charset object StandardCharsets.UTF_8 instead of > String "UTF-8" > -- > > Key: YARN-8975 > URL: https://issues.apache.org/jira/browse/YARN-8975 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Trivial > Fix For: 3.3.0 > > Attachments: YARN-8975-trunk.001.patch, YARN-8975-trunk.002.patch > > > {code:java} > Writer w = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");{code} > Could be refactored as follows to improve performance slightly by avoiding the charset name lookup: > {code:java} > Writer w = new OutputStreamWriter(new FileOutputStream(file), > StandardCharsets.UTF_8);{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8989) Move DockerCommandPlugin volume related APIs' invocation from DockerLinuxContainerRuntime#prepareContainer to #launchContainer
[ https://issues.apache.org/jira/browse/YARN-8989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702480#comment-16702480 ] Wangda Tan commented on YARN-8989: -- LGTM, thanks [~tangzhankun], committing. > Move DockerCommandPlugin volume related APIs' invocation from > DockerLinuxContainerRuntime#prepareContainer to #launchContainer > -- > > Key: YARN-8989 > URL: https://issues.apache.org/jira/browse/YARN-8989 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8989-trunk-001.patch > > > This seems required before we implement isolation in the pluggable device framework for the default container and the Docker container with LinuxContainerExecutor. > We need to find a place for the plugin's "onDevicesAllocated" in the current operation flow when running a container with LCE. > {code:java} > ContainerLaunch#call() -> > 1.ContainerLaunch#prepareContainer() - > > LCE#prepareContainer -> > DelegatingLinuxContainerRuntime#prepareContainer -> > DockerLinuxContainerRuntime#prepareContainer -> > DockerCommandPlugin#getCreateDockerVolumeCommand > -> > onDeviceAllocated(null,docker); create volume? > > 2.ContainerLaunch#launchContainer > LCE#launchContainer() -> > resourceHandlerChain#preStart() -> > DeviceResourceHandlerImpl#preStart() -> > onDeviceAllocated(alloc,docker) > allocate device and do isolation for default container with cgroup > {code} > > What I want to do here is to move the DockerCommandPlugin API invocations from DockerLinuxContainerRuntime#prepareContainer to #launchContainer. This won't introduce any incompatibility and benefits the pluggable device framework's interaction with the device plugin. > The "DeviceRuntimeSpec onDevicesAllocated(Set allocation, yarnRuntime)" method implemented by the device plugin lets the plugin do some preparation and return a spec describing how to run the container with the allocated devices. We designed a VolumeClaim field in the DeviceRuntimeSpec object for the plugin to declare what volume it needs created. > In the current code flow, calling "onDevicesAllocated" from the DockerCommandPlugin's methods seems awkward and can only pass a null value as the allocation. This complicates the vendor device plugin implementation, which then has to handle the null value. > Once we move the DockerCommandPlugin API invocation, it will look like this: > {code:java} > ContainerLaunch#call() -> > ContainerLaunch#launchContainer > LCE#launchContainer() -> > resourceHandlerChain#preStart() -> > DeviceResourceHandlerImpl#preStart() -> > onDeviceAllocated(alloc,docker) > allocate device and do isolation for default container with cgroup > DelegatingLinuxContainerRuntime#launchContainer -> > DockerLinuxContainerRuntime#launchContainer-> >DockerCommandPlugin#getCreateDockerVolumeCommand -> > get allocation;onDeviceAllocated(alloc,docker);create volume{code} > After this change, the flow is smoother and the plugin implementation of "onDevicesAllocated" is simpler -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8882) [YARN-8851] Add a shared device mapping manager (scheduler) for device plugins
[ https://issues.apache.org/jira/browse/YARN-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8882: - Summary: [YARN-8851] Add a shared device mapping manager (scheduler) for device plugins (was: Phase 1 - Add a shared device mapping manager for device plugin to use) > [YARN-8851] Add a shared device mapping manager (scheduler) for device plugins > -- > > Key: YARN-8882 > URL: https://issues.apache.org/jira/browse/YARN-8882 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8882-trunk.001.patch, YARN-8882-trunk.002.patch, YARN-8882-trunk.003.patch, YARN-8882-trunk.004.patch, YARN-8882-trunk.005.patch, YARN-8882-trunk.006.patch, YARN-8882-trunk.007.patch > > > Since a few device types use a FIFO policy to assign devices to the container, we use a shared device manager to handle all types of devices. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
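To illustrate the FIFO assignment idea mentioned in the description, here is a small sketch of how a shared device manager might hand out devices; the class and method names are hypothetical and not taken from the patch:
{code:java}
import java.util.LinkedHashSet;
import java.util.Set;

public class FifoDeviceAssigner {
  // Free devices identified by minor number; LinkedHashSet preserves insertion
  // order, so the devices registered (or released) earliest are assigned first.
  private final Set<Integer> availableDevices = new LinkedHashSet<>();

  public FifoDeviceAssigner(Set<Integer> allDevices) {
    availableDevices.addAll(allDevices);
  }

  // Assign the first N free devices, or an empty set if not enough are free.
  public synchronized Set<Integer> assign(int count) {
    if (availableDevices.size() < count) {
      return new LinkedHashSet<>();
    }
    Set<Integer> assigned = new LinkedHashSet<>();
    for (Integer device : availableDevices) {
      if (assigned.size() == count) {
        break;
      }
      assigned.add(device);
    }
    availableDevices.removeAll(assigned);
    return assigned;
  }

  public synchronized void release(Set<Integer> devices) {
    availableDevices.addAll(devices);
  }
}
{code}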
[jira] [Commented] (YARN-9061) Improve the GPU/FPGA module log message of container-executor
[ https://issues.apache.org/jira/browse/YARN-9061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702451#comment-16702451 ] Wangda Tan commented on YARN-9061: -- +1, thanks [~tangzhankun], committing. > Improve the GPU/FPGA module log message of container-executor > - > > Key: YARN-9061 > URL: https://issues.apache.org/jira/browse/YARN-9061 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Minor > Attachments: YARN-9061-trunk.001.patch, YARN-9061-trunk.002.patch > > > The log message is not clear when options value is missing. > {code:java} > fprintf(ERRORFILE, "is not specified, skip cgroups call.\n");{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7277) Container Launch expand environment needs to consider bracket matching
[ https://issues.apache.org/jira/browse/YARN-7277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702450#comment-16702450 ] Wangda Tan commented on YARN-7277: -- [~tangzhankun], typically what you can do is add a new line or an empty space to one of the NM Java files and trigger the tests to run. The reason for running only the changed project is to save unit test time; for example, the RM tests alone take more than 3 hours to finish. With limited Jenkins slaves, we cannot run all tests for the changed jars and their downstream jars. > Container Launch expand environment needs to consider bracket matching > -- > > Key: YARN-7277 > URL: https://issues.apache.org/jira/browse/YARN-7277 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: balloons >Assignee: Zhankun Tang >Priority: Critical > Attachments: YARN-7277-trunk.001.patch, YARN-7277-trunk.002.patch, YARN-7277-trunk.003.patch, YARN-7277-trunk.004.patch, YARN-7277-trunk.005.patch > > > The Spark application I submitted always failed, and I finally found that the commands I specified to launch the AM container were changed by the NM. > *The following is an excerpt of the command I submitted to the RM:* > {code:java} > *'{\"handler\":\"FILLER\",\"inputTable\":\"engine_arch.adult_train\",\"outputTable\":[\"ether_features_filler_\$experimentId_\$taskId_out0\"],\"params\":{\"age\":{\"param\":[\"0\"]}}}'* > {code} > *The following is an excerpt from the corresponding command observed when the NM launches the container:* > {code:java} > *'{\"handler\":\"FILLER\",\"inputTable\":\"engine_arch.adult_train\",\"outputTable\":[\"ether_features_filler_\$experimentId_\$taskId_out0\"],\"params\":{\"age\":{\"param\":[\"0\"]}* > {code} > Finally, I found that the NM performs the following transformation when launching the container, which leads to this situation: > {code:java} > @VisibleForTesting > public static String expandEnvironment(String var, > Path containerLogDir) { > var = var.replace(ApplicationConstants.LOG_DIR_EXPANSION_VAR, > containerLogDir.toString()); > var = var.replace(ApplicationConstants.CLASS_PATH_SEPARATOR, > File.pathSeparator); > // replace parameter expansion marker. e.g. {{VAR}} on Windows is replaced > // as %VAR% and on Linux replaced as "$VAR" > if (Shell.WINDOWS) { > var = var.replaceAll("(\\{\\{)|(\\}\\})", "%"); > } else { > var = var.replace(ApplicationConstants.PARAMETER_EXPANSION_LEFT, "$"); > *var = var.replace(ApplicationConstants.PARAMETER_EXPANSION_RIGHT, "");* > } > return var; > } > {code} > I think this is a bug: the substitution does not consider the pairing of "*PARAMETER_EXPANSION_LEFT*" and "*PARAMETER_EXPANSION_RIGHT*"; it simply substitutes blindly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
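One possible bracket-matching approach for the expansion (an illustrative sketch only, not necessarily the fix committed on this JIRA) is to rewrite only spans that form a complete {{...}} marker pair, leaving other brace sequences such as the JSON above untouched:
{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParameterExpansionSketch {
  // Matches {{NAME}} where NAME contains no braces, e.g. {{JAVA_HOME}}.
  private static final Pattern MARKER = Pattern.compile("\\{\\{([^{}]+)\\}\\}");

  public static String expand(String var) {
    Matcher m = MARKER.matcher(var);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
      // On Linux, {{VAR}} becomes $VAR; anything else is left untouched.
      m.appendReplacement(sb, Matcher.quoteReplacement("$" + m.group(1)));
    }
    m.appendTail(sb);
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(expand("{{JAVA_HOME}}/bin/java"));       // $JAVA_HOME/bin/java
    System.out.println(expand("'{\"params\":{\"age\":{}}}'"));  // JSON left unchanged
  }
}
{code}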
[jira] [Comment Edited] (YARN-7277) Container Launch expand environment needs to consider bracket matching
[ https://issues.apache.org/jira/browse/YARN-7277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702450#comment-16702450 ] Wangda Tan edited comment on YARN-7277 at 11/28/18 10:30 PM: - [~tangzhankun], typically what you can do is add a new line or empty space to one of NM java file and trigger test to run. The reason to run changed project only is to save time to run unit test. For example only RM tests will take more than 3 hours to finish. With limited Jenkins slaves, we cannot run all tests for changed jars and their downstream jars. And btw, is it an incompatible change? was (Author: leftnoteasy): [~tangzhankun], typically what you can do is add a new line or empty space to one of NM java file and trigger test to run. The reason to run changed project only is to save time to run unit test. For example only RM tests will take more than 3 hours to finish. With limited Jenkins slaves, we cannot run all tests for changed jars and their downstream jars. > Container Launch expand environment needs to consider bracket matching > -- > > Key: YARN-7277 > URL: https://issues.apache.org/jira/browse/YARN-7277 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: balloons >Assignee: Zhankun Tang >Priority: Critical > Attachments: YARN-7277-trunk.001.patch, YARN-7277-trunk.002.patch, > YARN-7277-trunk.003.patch, YARN-7277-trunk.004.patch, > YARN-7277-trunk.005.patch > > > The SPARK application I submitted always failed and I finally found that the > commands I specified to launch AM Container were changed by NM. > *The following is part of the excerpt I submitted to RM to see the command:* > {code:java} > *'{\"handler\":\"FILLER\",\"inputTable\":\"engine_arch.adult_train\",\"outputTable\":[\"ether_features_filler_\$experimentId_\$taskId_out0\"],\"params\":{\"age\":{\"param\":[\"0\"]}}}'* > {code} > *The following is an excerpt from the corresponding command used when I > observe the NM launch container:* > {code:java} > *'{\"handler\":\"FILLER\",\"inputTable\":\"engine_arch.adult_train\",\"outputTable\":[\"ether_features_filler_\$experimentId_\$taskId_out0\"],\"params\":{\"age\":{\"param\":[\"0\"]}* > {code} > Finally, I found that NM made the following transformation in launch > container which led to this situation: > {code:java} > @VisibleForTesting > public static String expandEnvironment(String var, > Path containerLogDir) { > var = var.replace(ApplicationConstants.LOG_DIR_EXPANSION_VAR, > containerLogDir.toString()); > var = var.replace(ApplicationConstants.CLASS_PATH_SEPARATOR, > File.pathSeparator); > // replace parameter expansion marker. e.g. {{VAR}} on Windows is replaced > // as %VAR% and on Linux replaced as "$VAR" > if (Shell.WINDOWS) { > var = var.replaceAll("(\\{\\{)|(\\}\\})", "%"); > } else { > var = var.replace(ApplicationConstants.PARAMETER_EXPANSION_LEFT, "$"); > *var = var.replace(ApplicationConstants.PARAMETER_EXPANSION_RIGHT, "");* > } > return var; > } > {code} > I think this is a Bug that doesn't even consider the pairing of > "*PARAMETER_EXPANSION_LEFT*" and "*PARAMETER_EXPANSION_RIGHT*" when > substituting. But simply substituting for simple violence. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9060) [YARN-8851] Phase 1 - Support device isolation in native container-executor
[ https://issues.apache.org/jira/browse/YARN-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702438#comment-16702438 ] Wangda Tan edited comment on YARN-9060 at 11/28/18 10:26 PM: - [~tangzhankun], just want to understand some high level considerations before looking into details: 1) Could u add some examples about parameters of {{--module-devices}} 2) Inside c-e.cfg, what field needs to be set, and please share some example of configs. And what if user doesn't set the configs, will c-e allow all devices controlled by c-e or none devices. (the previous one sounds super dangerous). 3) Could u add changes to c-e.cfg (hadoop-yarn-project/hadoop-yarn/conf/container-executor.cfg) for the new config and example options like other options. was (Author: leftnoteasy): [~tangzhankun], just want to understand some high level considerations before looking into details: 1) Could u add some examples about parameters of {{--module-devices}} 2) Inside c-e.cfg, what field needs to be set, and please share some example of configs. And what if user doesn't set the configs, will c-e allow all devices controlled by c-e or none devices. (the previous one sounds super dangerous). 3) Could u add changes to c-e.cfg? > [YARN-8851] Phase 1 - Support device isolation in native container-executor > --- > > Key: YARN-9060 > URL: https://issues.apache.org/jira/browse/YARN-9060 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-9060-trunk.001.patch > > > Due to the cgroups v1 implementation policy in linux kernel, we cannot update > the value of the device cgroups controller unless we have the root permission > ([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]). > So we need to support this in container-executor for Java layer to invoke. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9060) [YARN-8851] Phase 1 - Support device isolation in native container-executor
[ https://issues.apache.org/jira/browse/YARN-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702438#comment-16702438 ] Wangda Tan commented on YARN-9060: -- [~tangzhankun], just want to understand some high level considerations before looking into details: 1) Could u add some examples about parameters of {{--module-devices}} 2) Inside c-e.cfg, what field needs to be set, and please share some example of configs. And what if user doesn't set the configs, will c-e allow all devices controlled by c-e or none devices. (the previous one sounds super dangerous). 3) Could u add changes to c-e.cfg? > [YARN-8851] Phase 1 - Support device isolation in native container-executor > --- > > Key: YARN-9060 > URL: https://issues.apache.org/jira/browse/YARN-9060 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-9060-trunk.001.patch > > > Due to the cgroups v1 implementation policy in linux kernel, we cannot update > the value of the device cgroups controller unless we have the root permission > ([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]). > So we need to support this in container-executor for Java layer to invoke. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8882) Phase 1 - Add a shared device mapping manager for device plugin to use
[ https://issues.apache.org/jira/browse/YARN-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702429#comment-16702429 ] Wangda Tan commented on YARN-8882: -- Thanks [~tangzhankun], existing code looks good, committing .. > Phase 1 - Add a shared device mapping manager for device plugin to use > -- > > Key: YARN-8882 > URL: https://issues.apache.org/jira/browse/YARN-8882 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8882-trunk.001.patch, YARN-8882-trunk.002.patch, > YARN-8882-trunk.003.patch, YARN-8882-trunk.004.patch, > YARN-8882-trunk.005.patch, YARN-8882-trunk.006.patch, > YARN-8882-trunk.007.patch > > > Since a few devices uses FIFO policy to assign devices to the container, we > use a shared device manager to handle all types of devices. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.
[ https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702424#comment-16702424 ] Wangda Tan commented on YARN-8714: -- Thanks [~tangzhankun] for working on the patch. Several comments: 1) Why is this hardcoded to handle {{hdfs://}}? It could be s3, abfs, gs, etc. I'd definitely prefer to make this more general, in both the code and the comments. 2) For the behavior: {code} 651 /** 652* Localize dependencies for all containers. 653* If remoteUri is a local directory, 654* we'll compress it and upload to staging dir in HDFS 655* If remoteUri is a HDFS directory, we'll download, compress it 656* and upload to staging dir in HDFS 657* If localFilePath is ".", we'll use remote file/dir name 658* */ {code} I would prefer to remove support for: {code} 655* If remoteUri is a HDFS directory, we'll download, compress it 656* and upload to staging dir in HDFS {code} because downloading files from a remote fs could be risky. What if a user accidentally specifies "/"? If a user needs to localize a directory of files, they should tar or zip it before uploading to HDFS. 3) Could you add the above localization behavior to the CLI description as well? > [Submarine] Support files/tarballs to be localized for a training job. > -- > > Key: YARN-8714 > URL: https://issues.apache.org/jira/browse/YARN-8714 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8714-WIP1-trunk-001.patch, YARN-8714-WIP1-trunk-002.patch, YARN-8714-trunk.001.patch, YARN-8714-trunk.002.patch > > > See > [https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7], > {{job run --localization ...}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
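Regarding comment 1), a scheme-agnostic way to resolve the remote URI is to let the Hadoop FileSystem API pick the implementation from the URI itself instead of special-casing {{hdfs://}}; the helper below is only a sketch with an assumed method name:
{code:java}
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteUriHelper {
  // Copies a remote file to the local filesystem regardless of scheme:
  // hdfs://, s3a://, abfs://, gs://, ... are all resolved from the URI.
  public static void copyRemoteToLocal(String remoteUri, String localPath,
      Configuration conf) throws IOException {
    URI uri = URI.create(remoteUri);
    FileSystem fs = FileSystem.get(uri, conf);
    fs.copyToLocalFile(new Path(uri), new Path(localPath));
  }
}
{code}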
[jira] [Updated] (YARN-9030) Log aggregation changes to handle filesystems which do not support setting permissions
[ https://issues.apache.org/jira/browse/YARN-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-9030: - Summary: Log aggregation changes to handle filesystems which do not support setting permissions (was: Log aggregation changes to handle filesystems which do not support permissions) > Log aggregation changes to handle filesystems which do not support setting > permissions > -- > > Key: YARN-9030 > URL: https://issues.apache.org/jira/browse/YARN-9030 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Suma Shivaprasad >Assignee: Suma Shivaprasad >Priority: Major > Attachments: YARN-9030.1.patch, YARN-9030.2.patch > > > Some cloud storages like ADLS do not support permissions in which case they > throw an UnsupportedOperationException. Log aggregation code should > log/ignore these exceptions and not set permissions henceforth for log > aggregation base dir/sub dirs > {noformat} > 2018-11-12 15:37:28,726 WARN logaggregation.LogAggregationService > (LogAggregationService.java:initApp(209)) - Application failed to init > aggregation > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to check > permissions for dir [abfs://testc...@test.blob.core.windows.net/app-logs] > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.verifyAndCreateRemoteLogDir(LogAggregationFileController.java:277) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:238) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:204) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:347) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:69) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126) > at java.lang.Thread.run(Thread.java:748) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
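A minimal sketch of the behaviour described above, assuming the change simply remembers that the filesystem rejected setPermission and skips further permission calls (class and field names are illustrative, not the actual patch):
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionAwareDirCreator {
  private volatile boolean fsSupportsPermissions = true;

  // Create the directory and try to set permissions once; if the filesystem
  // (e.g. ADLS/ABFS) throws UnsupportedOperationException, remember that and
  // skip permission calls for the remaining log aggregation directories.
  public void mkdirWithPermission(FileSystem fs, Path dir, short mode)
      throws IOException {
    fs.mkdirs(dir);
    if (!fsSupportsPermissions) {
      return;
    }
    try {
      fs.setPermission(dir, new FsPermission(mode));
    } catch (UnsupportedOperationException e) {
      fsSupportsPermissions = false;
    }
  }
}
{code}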
[jira] [Commented] (YARN-8882) Phase 1 - Add a shared device mapping manager for device plugin to use
[ https://issues.apache.org/jira/browse/YARN-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695450#comment-16695450 ] Wangda Tan commented on YARN-8882: -- [~tangzhankun], why rename "device-scheduler" to "device-mapping-manager"? :) I understand that what it does is map devices to containers, but I would prefer to call it DeviceScheduler instead of DeviceMappingManager, given that the name "scheduler" conveys booking, ordering, etc. > Phase 1 - Add a shared device mapping manager for device plugin to use > -- > > Key: YARN-8882 > URL: https://issues.apache.org/jira/browse/YARN-8882 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8882-trunk.001.patch, YARN-8882-trunk.002.patch, YARN-8882-trunk.003.patch, YARN-8882-trunk.004.patch, YARN-8882-trunk.005.patch, YARN-8882-trunk.006.patch, YARN-8882-trunk.007.patch > > > Since a few device types use a FIFO policy to assign devices to the container, we use a shared device manager to handle all types of devices. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9030) Log aggregation changes to handle filesystems which do not support permissions
[ https://issues.apache.org/jira/browse/YARN-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693852#comment-16693852 ] Wangda Tan commented on YARN-9030: -- Thanks [~suma.shivaprasad], +1, will get it committed later today. > Log aggregation changes to handle filesystems which do not support permissions > -- > > Key: YARN-9030 > URL: https://issues.apache.org/jira/browse/YARN-9030 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Suma Shivaprasad >Assignee: Suma Shivaprasad >Priority: Major > Attachments: YARN-9030.1.patch, YARN-9030.2.patch > > > Some cloud storages like ADLS do not support permissions in which case they > throw an UnsupportedOperationException. Log aggregation code should > log/ignore these exceptions and not set permissions henceforth for log > aggregation base dir/sub dirs > {noformat} > 2018-11-12 15:37:28,726 WARN logaggregation.LogAggregationService > (LogAggregationService.java:initApp(209)) - Application failed to init > aggregation > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to check > permissions for dir [abfs://testc...@test.blob.core.windows.net/app-logs] > at > org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.verifyAndCreateRemoteLogDir(LogAggregationFileController.java:277) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:238) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:204) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:347) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:69) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126) > at java.lang.Thread.run(Thread.java:748) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8881) [YARN-8851] Add basic pluggable device plugin framework
[ https://issues.apache.org/jira/browse/YARN-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692007#comment-16692007 ] Wangda Tan commented on YARN-8881: -- [~tangzhankun], patch committed to trunk. Thanks for the reviews from [~sunilg], [~csingh], [~cheersyang]. > [YARN-8851] Add basic pluggable device plugin framework > --- > > Key: YARN-8881 > URL: https://issues.apache.org/jira/browse/YARN-8881 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-8881-trunk.001.patch, YARN-8881-trunk.002.patch, YARN-8881-trunk.003.patch, YARN-8881-trunk.004.patch, YARN-8881-trunk.005.patch, YARN-8881-trunk.006.patch, YARN-8881-trunk.007.patch, YARN-8881-trunk.008.patch, YARN-8881-trunk.009.patch, YARN-8881-trunk.010.patch, YARN-8881-trunk.011.patch, YARN-8881-trunk.012.patch > > > It includes adding support in "ResourcePluginManager" to load plugin classes based on configuration, an interface for the vendor to implement, and an adapter to decouple the plugin from YARN internals. Vendor device resource discovery will be ready once this support is in place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
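The configuration-driven plugin loading mentioned in the description can be illustrated with a generic reflection-based loader; the configuration format and class names below are assumptions, not the actual YARN-8881 code:
{code:java}
import java.util.ArrayList;
import java.util.List;

public class PluginLoader {
  // Loads and instantiates plugin classes named in a comma-separated config
  // value, verifying each one implements the expected plugin interface.
  public static <T> List<T> loadPlugins(String commaSeparatedClassNames,
      Class<T> pluginInterface) {
    List<T> plugins = new ArrayList<>();
    for (String className : commaSeparatedClassNames.split(",")) {
      try {
        Class<?> clazz = Class.forName(className.trim());
        if (!pluginInterface.isAssignableFrom(clazz)) {
          throw new IllegalArgumentException(
              className + " does not implement " + pluginInterface.getName());
        }
        plugins.add(pluginInterface.cast(
            clazz.getDeclaredConstructor().newInstance()));
      } catch (ReflectiveOperationException e) {
        throw new IllegalArgumentException("Failed to load plugin " + className, e);
      }
    }
    return plugins;
  }
}
{code}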