[jira] [Commented] (YARN-11018) RM rest api show error resources in capacity scheduler with nodelabels
[ https://issues.apache.org/jira/browse/YARN-11018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456719#comment-17456719 ] Eric Badger commented on YARN-11018: I think [~epayne] is probably more qualified to review this, given that he worked on YARN-10343. > RM rest api show error resources in capacity scheduler with nodelabels > -- > > Key: YARN-11018 > URL: https://issues.apache.org/jira/browse/YARN-11018 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Major > Attachments: YARN-11018.001.patch > > > Because resource metrics are updated only for the "default" partition, allocatedMB, > allocatedVCores, totalMB, and totalVirtualCores are wrong in the capacity scheduler > with node labels. > When we get cluster metrics using 'curl > http://rm:8088/ws/v1/cluster/metrics', we get wrong totalMB and > totalVirtualCores. > It should use resources across partitions instead. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
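The description of YARN-11018 above says the totals should be computed across partitions rather than from the default partition alone. A minimal Java sketch of that idea follows. `PartitionTotals`, `PartitionResource`, the label names, and the numbers are all hypothetical; they are not taken from the attached patch or from YARN's own classes, and keying the default partition by the empty string is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the YARN-11018 patch or any real YARN class.
final class PartitionTotals {

    // Memory and vcores available in one node-label partition (hypothetical type).
    static final class PartitionResource {
        final long memoryMb;
        final int vcores;

        PartitionResource(long memoryMb, int vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }
    }

    // Sums memory over every partition instead of reading only the default one.
    static long totalMb(Map<String, PartitionResource> byPartition) {
        long sum = 0L;
        for (PartitionResource r : byPartition.values()) {
            sum += r.memoryMb;
        }
        return sum;
    }

    // Sums vcores over every partition.
    static int totalVirtualCores(Map<String, PartitionResource> byPartition) {
        int sum = 0;
        for (PartitionResource r : byPartition.values()) {
            sum += r.vcores;
        }
        return sum;
    }

    // Builds a made-up two-partition cluster for the demonstration.
    static Map<String, PartitionResource> sampleCluster() {
        Map<String, PartitionResource> cluster = new LinkedHashMap<>();
        cluster.put("", new PartitionResource(8192L, 8));      // assumed key for the default partition
        cluster.put("gpu", new PartitionResource(16384L, 16)); // hypothetical labeled partition
        return cluster;
    }

    public static void main(String[] args) {
        Map<String, PartitionResource> cluster = sampleCluster();
        System.out.println("default-only totalMB = " + cluster.get("").memoryMb);
        System.out.println("all-partition totalMB = " + totalMb(cluster));
        System.out.println("all-partition totalVirtualCores = " + totalVirtualCores(cluster));
    }
}
```

With the sample data, reading only the default partition reports a third of the memory that the cross-partition sum reports, which is the kind of discrepancy the issue describes.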
[jira] [Commented] (YARN-9818) test_docker_util.cc:test_add_mounts doesn't correctly test for parent dir of container-executor.cfg
[ https://issues.apache.org/jira/browse/YARN-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427319#comment-17427319 ] Eric Badger commented on YARN-9818: --- I believe when you do a native build there is a file created called {{cetest}}. You just need to execute that binary. There is also a {{test-container-executor}}, but that is a different piece of code (that tests other parts of the container-executor for reasons?) > test_docker_util.cc:test_add_mounts doesn't correctly test for parent dir of > container-executor.cfg > --- > > Key: YARN-9818 > URL: https://issues.apache.org/jira/browse/YARN-9818 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9818.001.patch > > > The code attempts to mount a directory that is a parent of > container-executor.cfg. However, the docker.allowed.[ro,rw]-mounts settings > in the container-executor.cfg don't allow that directory. So the test isn't > ever getting to the code where we disallow the mount because it is a parent > of container-executor.cfg. The test is disallowing it because the mount isn't > in the allowed mounts list. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9818) test_docker_util.cc:test_add_mounts doesn't correctly test for parent dir of container-executor.cfg
[ https://issues.apache.org/jira/browse/YARN-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427309#comment-17427309 ] Eric Badger commented on YARN-9818: --- This is from a few years ago so I don't quite remember the details, but from what I remember, the test was passing back then. The problem was that the test wasn't testing what it said it was. This patch was to fix the test so that it would accurately test what it was looking to test. But the patch was never reviewed/committed > test_docker_util.cc:test_add_mounts doesn't correctly test for parent dir of > container-executor.cfg > --- > > Key: YARN-9818 > URL: https://issues.apache.org/jira/browse/YARN-9818 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9818.001.patch > > > The code attempts to mount a directory that is a parent of > container-executor.cfg. However, the docker.allowed.[ro,rw]-mounts settings > in the container-executor.cfg don't allow that directory. So the test isn't > ever getting to the code where we disallow the mount because it is a parent > of container-executor.cfg. The test is disallowing it because the mount isn't > in the allowed mounts list.
[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.
[ https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419340#comment-17419340 ] Eric Badger commented on YARN-10935: Also thanks to [~ahussein] for the additional review! > AM Total Queue Limit goes below per-user AM Limit if parent is full. > > > Key: YARN-10935 > URL: https://issues.apache.org/jira/browse/YARN-10935 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, capacityscheduler >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Major > Fix For: 3.4.0, 2.10.2, 3.3.2, 3.2.4, 3.1.5 > > Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot > 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, > YARN-10935.003.patch, YARN-10935.branch-2.10.003.patch, > YARN-10935.branch-3.2.003.patch > > > This happens when DRF is enabled and all of one resource is consumed but the > second resource still has plenty available. > This is reproducible by setting up a parent queue where the capacity and max > capacity are the same, with 2 or more sub-queues whose max capacity is 100%. > In one of the sub-queues, start a long-running app that consumes all > resources in the parent queue's hierarchy. This app will consume all of the > memory but not very many vcores (for example). > In a second queue, submit an app. The *{{Max Application Master Resources Per > User}}* limit is much more than the *{{Max Application Master Resources}}* > limit.
[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.
[ https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10935: --- Fix Version/s: 3.1.5 3.2.4 2.10.2 Thanks for the additional patches, [~epayne]! +1 on them and I've committed them. The patches have now been committed to trunk (3.4), branch-3.3, branch-3.2, branch-3.1 (apparently unnecessary, but I did it anyway. Oops), and branch-2.10 > AM Total Queue Limit goes below per-user AM Limit if parent is full. > > > Key: YARN-10935 > URL: https://issues.apache.org/jira/browse/YARN-10935 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, capacityscheduler >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Major > Fix For: 3.4.0, 2.10.2, 3.3.2, 3.2.4, 3.1.5 > > Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot > 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, > YARN-10935.003.patch, YARN-10935.branch-2.10.003.patch, > YARN-10935.branch-3.2.003.patch > > > This happens when DRF is enabled and all of one resource is consumed but the > second resource still has plenty available. > This is reproducible by setting up a parent queue where the capacity and max > capacity are the same, with 2 or more sub-queues whose max capacity is 100%. > In one of the sub-queues, start a long-running app that consumes all > resources in the parent queue's hierarchy. This app will consume all of the > memory but not very many vcores (for example). > In a second queue, submit an app. The *{{Max Application Master Resources Per > User}}* limit is much more than the *{{Max Application Master Resources}}* > limit.
[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.
[ https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10935: --- Fix Version/s: 3.3.2 3.4.0 [~epayne], looks like it's clean back to branch-3.3. So I committed it to trunk (3.4) and branch-3.3. But I'll need patches for branch-3.2 onwards if you'd like it backported > AM Total Queue Limit goes below per-user AM Limit if parent is full. > > > Key: YARN-10935 > URL: https://issues.apache.org/jira/browse/YARN-10935 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, capacityscheduler >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Major > Fix For: 3.4.0, 3.3.2 > > Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot > 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, > YARN-10935.003.patch > > > This happens when DRF is enabled and all of one resource is consumed but the > second resource still has plenty available. > This is reproducible by setting up a parent queue where the capacity and max > capacity are the same, with 2 or more sub-queues whose max capacity is 100%. > In one of the sub-queues, start a long-running app that consumes all > resources in the parent queue's hierarchy. This app will consume all of the > memory but not very many vcores (for example). > In a second queue, submit an app. The *{{Max Application Master Resources Per > User}}* limit is much more than the *{{Max Application Master Resources}}* > limit.
[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.
[ https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415243#comment-17415243 ] Eric Badger commented on YARN-10935: [~epayne], +1 the patch looks good to me. However, trunk compilation is currently failing for me due to HADOOP-17891 and I'd like to get that cleared up before committing your patch (I don't like committing things when I can't compile) > AM Total Queue Limit goes below per-user AM Limit if parent is full. > > > Key: YARN-10935 > URL: https://issues.apache.org/jira/browse/YARN-10935 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, capacityscheduler >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Major > Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot > 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, > YARN-10935.003.patch > > > This happens when DRF is enabled and all of one resource is consumed but the > second resource still has plenty available. > This is reproducible by setting up a parent queue where the capacity and max > capacity are the same, with 2 or more sub-queues whose max capacity is 100%. > In one of the sub-queues, start a long-running app that consumes all > resources in the parent queue's hierarchy. This app will consume all of the > memory but not very many vcores (for example). > In a second queue, submit an app. The *{{Max Application Master Resources Per > User}}* limit is much more than the *{{Max Application Master Resources}}* > limit.
[jira] [Commented] (YARN-10860) Make max container per heartbeat configs refreshable
[ https://issues.apache.org/jira/browse/YARN-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385635#comment-17385635 ] Eric Badger commented on YARN-10860: Thanks, [~zhuqi]! > Make max container per heartbeat configs refreshable > > > Key: YARN-10860 > URL: https://issues.apache.org/jira/browse/YARN-10860 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0, 2.10.2, 3.2.3, 3.3.2 > > Attachments: YARN-10860.001.patch, YARN-10860.branch-2.10.001.patch > > > {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} > and > {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} > are currently *not* refreshable configs, but I believe they should be. This > JIRA is to turn these into refreshable configs, just like > {{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} > is.
[jira] [Commented] (YARN-10860) Make max container per heartbeat configs refreshable
[ https://issues.apache.org/jira/browse/YARN-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385024#comment-17385024 ] Eric Badger commented on YARN-10860: [~zhuqi], thanks for the review and commit! And thanks [~gandras] for the additional review. Just a reminder that when committing patches you should attempt to cherry-pick them as far back as you can unless they are risky and/or unstable. In this case you committed patches to trunk (3.4) and branch-2.10. However, the trunk patch should also be cherry-picked back to the other active 3.x branches, which are branch-3.3 and branch-3.2. Could you please cherry-pick the trunk patch to those 2 branches? Thanks! > Make max container per heartbeat configs refreshable > > > Key: YARN-10860 > URL: https://issues.apache.org/jira/browse/YARN-10860 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0, 2.10.2 > > Attachments: YARN-10860.001.patch, YARN-10860.branch-2.10.001.patch > > > {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} > and > {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} > are currently *not* refreshable configs, but I believe they should be. This > JIRA is to turn these into refreshable configs, just like > {{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} > is.
[jira] [Commented] (YARN-10867) YARN should expose a ENV used to map a custom device into docker container
[ https://issues.apache.org/jira/browse/YARN-10867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384513#comment-17384513 ] Eric Badger commented on YARN-10867: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/DockerContainers.html I believe you can just use {{docker.allowed.devices}} in your container-executor.cfg file if you need to mount an actual device. However, you'll need to be a privileged container to do that, so you'll also need to set {{docker.privileged-containers.enabled=true}}. Note that running privileged containers is very risky and raises a lot of security concerns, so proceed with caution. After setting those, I believe you can use {{YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS}} to specify the mounts that you want, including a device such as {{/dev/fuse}} > YARN should expose a ENV used to map a custom device into docker container > -- > > Key: YARN-10867 > URL: https://issues.apache.org/jira/browse/YARN-10867 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chi Heng >Priority: Major > > In some scenarios, like mounting a FUSE in docker, the user needs to map a custom > device (e.g. /dev/fuse) into the docker container. I noticed that an adddevice > method is defined in [hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/docker/DockerRunCommand.java|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/docker/DockerRunCommand.java], > so I suppose that an ENV or config property should be exposed to let the user > call this method
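The suggestion in the YARN-10867 comment above can be pictured as a small set of container launch environment variables. The Java sketch below only assembles those variables as strings. The variable names come from the Docker containers documentation linked in the comment; the image name is made up, and whether a device mount passed this way is actually accepted still depends on the {{docker.allowed.devices}} and {{docker.privileged-containers.enabled}} settings in container-executor.cfg, which the comment itself states as a belief rather than a tested result.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: assembles container launch environment variables as plain strings.
final class DockerDeviceMountEnv {

    // Builds the environment for a Docker container that bind-mounts one device path.
    static Map<String, String> buildEnv(String image, String devicePath) {
        Map<String, String> env = new LinkedHashMap<>();
        env.put("YARN_CONTAINER_RUNTIME_TYPE", "docker");
        env.put("YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", image);
        // The comment says a privileged container is required for device access.
        env.put("YARN_CONTAINER_RUNTIME_DOCKER_RUN_PRIVILEGED_CONTAINER", "true");
        // Mounts are written as source:dest:mode; the same path is used on both sides here.
        env.put("YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS", devicePath + ":" + devicePath + ":rw");
        return env;
    }

    public static void main(String[] args) {
        // "example/fuse-image:latest" is a made-up image name.
        buildEnv("example/fuse-image:latest", "/dev/fuse")
            .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

As the comment warns, privileged containers carry real security risk, so a map like this is something to enable deliberately rather than by default.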
[jira] [Updated] (YARN-10860) Make max container per heartbeat configs refreshable
[ https://issues.apache.org/jira/browse/YARN-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10860: --- Attachment: (was: YARN-10860.001.patch) > Make max container per heartbeat configs refreshable > > > Key: YARN-10860 > URL: https://issues.apache.org/jira/browse/YARN-10860 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10860.001.patch, YARN-10860.branch-2.10.001.patch > > > {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} > and > {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} > are currently *not* refreshable configs, but I believe they should be. This > JIRA is to turn these into refreshable configs, just like > {{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} > is.
[jira] [Updated] (YARN-10860) Make max container per heartbeat configs refreshable
[ https://issues.apache.org/jira/browse/YARN-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10860: --- Attachment: YARN-10860.001.patch > Make max container per heartbeat configs refreshable > > > Key: YARN-10860 > URL: https://issues.apache.org/jira/browse/YARN-10860 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10860.001.patch, YARN-10860.branch-2.10.001.patch > > > {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} > and > {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} > are currently *not* refreshable configs, but I believe they should be. This > JIRA is to turn these into refreshable configs, just like > {{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} > is.
[jira] [Updated] (YARN-10860) Make max container per heartbeat configs refreshable
[ https://issues.apache.org/jira/browse/YARN-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10860: --- Attachment: YARN-10860.branch-2.10.001.patch > Make max container per heartbeat configs refreshable > > > Key: YARN-10860 > URL: https://issues.apache.org/jira/browse/YARN-10860 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10860.001.patch, YARN-10860.branch-2.10.001.patch > > > {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} > and > {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} > are currently *not* refreshable configs, but I believe they should be. This > JIRA is to turn these into refreshable configs, just like > {{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} > is.
[jira] [Updated] (YARN-10860) Make max container per heartbeat configs refreshable
[ https://issues.apache.org/jira/browse/YARN-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10860: --- Attachment: YARN-10860.001.patch > Make max container per heartbeat configs refreshable > > > Key: YARN-10860 > URL: https://issues.apache.org/jira/browse/YARN-10860 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10860.001.patch > > > {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} > and > {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} > are currently *not* refreshable configs, but I believe they should be. This > JIRA is to turn these into refreshable configs, just like > {{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} > is.
[jira] [Created] (YARN-10860) Make max container per heartbeat configs refreshable
Eric Badger created YARN-10860: -- Summary: Make max container per heartbeat configs refreshable Key: YARN-10860 URL: https://issues.apache.org/jira/browse/YARN-10860 Project: Hadoop YARN Issue Type: Improvement Reporter: Eric Badger Assignee: Eric Badger {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} and {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} are currently *not* refreshable configs, but I believe they should be. This JIRA is to turn these into refreshable configs, just like {{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} is.
[jira] [Commented] (YARN-10761) Add more event type to RM Dispatcher event metrics.
[ https://issues.apache.org/jira/browse/YARN-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340347#comment-17340347 ] Eric Badger commented on YARN-10761: Thanks for the patch, [~zhuqi]. Is there a reason we need to call {{create()}} twice for each metric? The code in the patch calls it once to create the {{GenericEventTypeMetricsManager}} and then again just so that it can call {{getEnumClass()}}. Seems better to save the first {{create()}} call off into a local variable and then call {{getEnumClass()}} on that so we don't have to call {{create()}} twice per metric > Add more event type to RM Dispatcher event metrics. > --- > > Key: YARN-10761 > URL: https://issues.apache.org/jira/browse/YARN-10761 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10761.001.patch, image-2021-05-06-16-38-51-406.png, > image-2021-05-06-16-39-28-362.png > > > Since YARN-9615 added NodesListManagerEventType to event metrics, > we had better add all 4 busy event types to the metrics according to > YARN-9927.
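The review point on YARN-10761 above (keep the result of the first {{create()}} call in a local variable instead of calling it a second time for {{getEnumClass()}}) can be sketched in Java as follows. Every type and method here is a hypothetical stand-in written for this illustration, including `EventMetrics`, `create`, and `addMetrics`; none of it reproduces the real `GenericEventTypeMetricsManager` API or the code in the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; the real GenericEventTypeMetricsManager API is not reproduced here.
final class CreateOnceSketch {

    enum NodeEventType { STARTED, STATUS_UPDATE }

    // Minimal metrics object that remembers which enum class it was created for.
    static final class EventMetrics<T extends Enum<T>> {
        private final Class<T> enumClass;

        EventMetrics(Class<T> enumClass) {
            this.enumClass = enumClass;
        }

        Class<T> getEnumClass() {
            return enumClass;
        }
    }

    static int createCalls = 0;
    static final List<String> registered = new ArrayList<>();

    // Counts invocations so the difference between the two patterns is visible.
    static <T extends Enum<T>> EventMetrics<T> create(Class<T> enumClass) {
        createCalls++;
        return new EventMetrics<>(enumClass);
    }

    // Hypothetical registration hook that takes the metrics object and its event type.
    static void addMetrics(EventMetrics<?> metrics, Class<?> eventType) {
        registered.add(eventType.getSimpleName());
    }

    // Pattern questioned in the review: create() runs twice per metric.
    static void registerCallingCreateTwice() {
        addMetrics(create(NodeEventType.class), create(NodeEventType.class).getEnumClass());
    }

    // Suggested pattern: keep the first result in a local variable and reuse it.
    static void registerCallingCreateOnce() {
        EventMetrics<NodeEventType> metrics = create(NodeEventType.class);
        addMetrics(metrics, metrics.getEnumClass());
    }

    public static void main(String[] args) {
        registerCallingCreateTwice();
        System.out.println("create() calls after first pattern: " + createCalls);
        createCalls = 0;
        registerCallingCreateOnce();
        System.out.println("create() calls after second pattern: " + createCalls);
    }
}
```

The second pattern also guarantees that the registered object and the enum class come from the same instance, which the first pattern only gets by coincidence.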
[jira] [Commented] (YARN-10745) Change Log level from info to debug for few logs and remove unnecessary debuglog checks
[ https://issues.apache.org/jira/browse/YARN-10745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339816#comment-17339816 ] Eric Badger commented on YARN-10745: Hi [~dmmkr], thanks for the patch. Overall I think it has changes that make sense, but I have a few comments/questions {noformat} - if (LOG.isDebugEnabled()) { -LOG.debug("Auth is SASL user=\"{}\" JAAS context=\"{}\"", -jaasClientIdentity, jaasClientEntry); - } +LOG.debug("Auth is SASL user=\"{}\" JAAS context=\"{}\"", + jaasClientIdentity, jaasClientEntry); {noformat} Looks like the wrong indentation here {noformat} switch (purgePolicy) { case SkipOnChildren: // don't do the deletion... continue to next record - if (LOG.isDebugEnabled()) { -LOG.debug("Skipping deletion"); - } +LOG.debug("Skipping deletion"); toDelete = false; break; case PurgeAll: // mark for deletion - if (LOG.isDebugEnabled()) { -LOG.debug("Scheduling for deletion with children"); - } +LOG.debug("Scheduling for deletion with children"); toDelete = true; entries = new ArrayList(0); break; case FailOnChildren: - if (LOG.isDebugEnabled()) { -LOG.debug("Failing deletion operation"); - } +LOG.debug("Failing deletion operation"); throw new PathIsNotEmptyDirectoryException(path); {noformat} Same here with the case statements {noformat} List clusterNodeReports = yarnClient.getNodeReports( NodeState.RUNNING); -LOG.info("Got Cluster node info from ASM"); +if (clusterNodeReports.isEmpty()) { + LOG.info("Got Empty Cluster node Report info from ASM"); +} {noformat} Is {{clusterNodeReports}} guaranteed to be non-null here? Otherwise we can NPE {noformat} -// NodeManager is the last service to start, so NodeId is available. +// NodeStatusUpdater is the last service to start, so NodeId is available. {noformat} I'm not sure what this change is for. The comment seems to imply that NodeStatusUpdater is the last service to start, so the service that populates NodeId will already be done. Nodemanager is probably the last service to start overall since it adds all of the other services, but I don't think the change in the comment makes the code any clearer {noformat} + LOG.info("Callback succeeded for initializing request processing " + + "pipeline for an AM "); {noformat} Can you comment on this log statement? Have you found this useful for debugging? Does it only get logged rarely? {noformat} -LOG.info("hostsReader include:{" + -StringUtils.join(",", hostsReader.getHosts()) + -"} exclude:{" + -StringUtils.join(",", hostsReader.getExcludedHosts()) + "}"); - +if (!hostsReader.getHosts().isEmpty() || +!hostsReader.getExcludedHosts().isEmpty()) { + LOG.info("hostsReader include:{" + + StringUtils.join(",", hostsReader.getHosts()) + + "} exclude:{" + + StringUtils.join(",", hostsReader.getExcludedHosts()) + "}"); +} {noformat} I feel like we're losing information here. Knowing that the hostsReader is empty is helpful. We could log it differently, but I don't think we want to lose that information > Change Log level from info to debug for few logs and remove unnecessary > debuglog checks > --- > > Key: YARN-10745 > URL: https://issues.apache.org/jira/browse/YARN-10745 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Minor > Attachments: YARN-10745.001.patch, YARN-10745.002.patch, > YARN-10745.003.patch, YARN-10745.004.patch > > > Change the info log level to debug for few logs so that the load on the > logger decreases in large cluster and improves the performance. > Remove the unnecessary isDebugEnabled() checks for printing strings without > any string concatenation
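Two of the review points on YARN-10745 above lend themselves to a small Java sketch: guarding against a null list before calling `isEmpty()`, and building the hostsReader message unconditionally so that empty lists stay visible. The class and method names below are hypothetical and no logging framework is involved; this is one possible reading of "we could log it differently", not what the patch ended up doing.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

// Sketch of two review points; no logging framework is used so it stays self-contained.
final class LogReviewSketch {

    // True when there is nothing to report, without risking a NullPointerException.
    static boolean isNullOrEmpty(Collection<?> reports) {
        return reports == null || reports.isEmpty();
    }

    // Always builds the message, so empty include/exclude lists remain visible as "{}".
    static String describeHosts(Collection<String> include, Collection<String> exclude) {
        return "hostsReader include:{" + String.join(",", include)
            + "} exclude:{" + String.join(",", exclude) + "}";
    }

    public static void main(String[] args) {
        System.out.println(isNullOrEmpty(null));
        System.out.println(describeHosts(
            Collections.<String>emptyList(), Collections.<String>emptyList()));
        System.out.println(describeHosts(
            Arrays.asList("host1", "host2"), Arrays.asList("host3")));
    }
}
```

If the volume of the host message is the concern, lowering its level while still emitting it would preserve the information the reviewer wants to keep.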
[jira] [Commented] (YARN-10648) NM local logs are not cleared after uploading to hdfs
[ https://issues.apache.org/jira/browse/YARN-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339350#comment-17339350 ] Eric Badger commented on YARN-10648: The patch looks good, but I'll wait for [~grepas], [~rkanter], and [~snemeth] to look at this as they were the ones that worked on the original code that created the issue. > NM local logs are not cleared after uploading to hdfs > - > > Key: YARN-10648 > URL: https://issues.apache.org/jira/browse/YARN-10648 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 3.2.0 >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Major > Attachments: YARN-10648.001.patch > > > YARN-8273 has induced the following issues. > # The {{delService.delete(deletionTask)}} call has been removed > from the for loop and added at the end, in the finally block. Inside the for loop > we are creating a FileDeletionTask for each container but not storing it; because of > this, only the last container's log files will be present in the > deletionTask and only those files will be removed. Ideally all the container > log files which are uploaded must be deleted. > # The LogAggregationDFSException is caught in closeWriter, but when we > configure LogAggregationTFileController as logAggregationFileController, > this.logAggregationFileController.closeWriter() itself calls closeWriter, > which throws LogAggregationDFSException if any, and the exception is not > saved. When we then try closeWriter again we don't get any exception, and we > are not throwing the LogAggregationDFSException in this scenario.
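The first issue listed for YARN-10648 above (a deletion task created per container inside the loop, but only one delete call after the loop) can be modeled with a short Java sketch. `FileDeletion` and both methods are hypothetical simplifications written for this illustration; they are not the NodeManager's `FileDeletionTask` or deletion service, and the second method is only one possible shape of a fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the first issue; these are not the NodeManager classes.
final class DeletionTaskSketch {

    // Stand-in for a per-container deletion task.
    static final class FileDeletion {
        final String containerLogDir;

        FileDeletion(String containerLogDir) {
            this.containerLogDir = containerLogDir;
        }
    }

    // Reported pattern: one variable is overwritten per container, so only the last survives.
    static List<String> deletedWhenOnlyLastTaskKept(List<String> uploadedLogDirs) {
        FileDeletion task = null;
        for (String dir : uploadedLogDirs) {
            task = new FileDeletion(dir); // the previous task is dropped here
        }
        List<String> deleted = new ArrayList<>();
        if (task != null) {
            deleted.add(task.containerLogDir); // single delete after the loop
        }
        return deleted;
    }

    // Possible fix: remember every task and run them all after the loop.
    static List<String> deletedWhenAllTasksKept(List<String> uploadedLogDirs) {
        List<FileDeletion> tasks = new ArrayList<>();
        for (String dir : uploadedLogDirs) {
            tasks.add(new FileDeletion(dir));
        }
        List<String> deleted = new ArrayList<>();
        for (FileDeletion t : tasks) {
            deleted.add(t.containerLogDir);
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList("container_1", "container_2", "container_3");
        System.out.println(deletedWhenOnlyLastTaskKept(dirs));
        System.out.println(deletedWhenAllTasksKept(dirs));
    }
}
```

In the first method the logs of every container except the last stay on local disk, which matches the symptom in the issue title.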
[jira] [Commented] (YARN-9927) RM multi-thread event processing mechanism
[ https://issues.apache.org/jira/browse/YARN-9927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335804#comment-17335804 ] Eric Badger commented on YARN-9927: --- {noformat} +// Test multi thread dispatcher +conf.setBoolean(YarnConfiguration. +MULTI_THREAD_DISPATCHER_ENABLED, true); {noformat} If this is a feature that is disabled by default, I don't think we should have it enabled by default in all of the RM tests. I would be happier running it as a parameterized test with both multi and single thread dispatchers. In general I think the patch looks reasonable, but I would like to see testing done to see if this makes the problem better or worse. I would think it would make things better, but until we run some real tests on it, we won't really know. So getting something similar to what [~hcarrot] provided originally would be good. That way we can merge this with confidence. > RM multi-thread event processing mechanism > -- > > Key: YARN-9927 > URL: https://issues.apache.org/jira/browse/YARN-9927 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.0.0, 2.9.2 >Reporter: hcarrot >Assignee: Qi Zhu >Priority: Major > Attachments: RM multi-thread event processing mechanism.pdf, > YARN-9927.001.patch, YARN-9927.002.patch, YARN-9927.003.patch, > YARN-9927.004.patch, YARN-9927.005.patch > > > Recently, we have observed serious event blocking in the RM event dispatcher > queue. After analysis of RM event monitoring data and RM event processing > logic, we found that > 1) environment: a cluster with thousands of nodes > 2) RMNodeStatusEvent accounts for 90% of the time consumption of the RM event scheduler > 3) Meanwhile, RM event processing is in a single-thread mode, and it results > in low headroom for the RM event scheduler, and thus poor RM performance. > So we proposed an RM multi-thread event processing mechanism to improve RM > performance.
[jira] [Commented] (YARN-10707) Support custom resources in ResourceUtilization, and update Node GPU Utilization to use.
[ https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335726#comment-17335726 ] Eric Badger commented on YARN-10707: Thanks for the updates, [~zhuqi]! +1 I've committed this to trunk (3.4) and branch-3.3. There are conflicts backporting back further than that > Support custom resources in ResourceUtilization, and update Node GPU > Utilization to use. > > > Key: YARN-10707 > URL: https://issues.apache.org/jira/browse/YARN-10707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10707.001.patch, YARN-10707.002.patch, > YARN-10707.003.patch, YARN-10707.004.patch, YARN-10707.005.patch, > YARN-10707.006.patch, YARN-10707.007.patch, YARN-10707.008.patch, > YARN-10707.009.patch, YARN-10707.010.patch, YARN-10707.011.patch > > > Support gpu in ResourceUtilization, and update Node GPU Utilization to use > first. > It will be very helpful for other use cases about GPU utilization. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10707) Support custom resources in ResourceUtilization, and update Node GPU Utilization to use.
[ https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10707: --- Fix Version/s: 3.3.1 3.4.0 > Support custom resources in ResourceUtilization, and update Node GPU > Utilization to use. > > > Key: YARN-10707 > URL: https://issues.apache.org/jira/browse/YARN-10707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10707.001.patch, YARN-10707.002.patch, > YARN-10707.003.patch, YARN-10707.004.patch, YARN-10707.005.patch, > YARN-10707.006.patch, YARN-10707.007.patch, YARN-10707.008.patch, > YARN-10707.009.patch, YARN-10707.010.patch, YARN-10707.011.patch > > > Support gpu in ResourceUtilization, and update Node GPU Utilization to use > first. > It will be very helpful for other use cases about GPU utilization. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10493) RunC container repository v2
[ https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333560#comment-17333560 ] Eric Badger commented on YARN-10493: bq. In theory we could change that if there is a benefit in your opinion, but my initial reaction is that adding sub-directories to that namespace may make it harder to track images (cleanup, governance, perhaps even quotas, etc.). I don't think it's a huge deal. A nice to have feature, but if it requires a major rework then I don't think it's necessary. The reason I think it would be nice is so that we can more cleanly segment our images. E.g. you could have {{hadoop/small-image/rhel7:7.9}} or something like that. But again, it's not a huge deal if it's difficult > RunC container repository v2 > > > Key: YARN-10493 > URL: https://issues.apache.org/jira/browse/YARN-10493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Matthew Sharp >Priority: Major > Labels: pull-request-available > Attachments: runc-container-repository-v2-design.pdf, > runc-container-repository-v2-design_updated.pdf > > Time Spent: 1h 20m > Remaining Estimate: 0h > > The current runc container repository design has scalability and usability > issues which will likely limit widespread adoption. We should address this > with a new, V2 layout. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10707) Support custom resources in ResourceUtilization, and update Node GPU Utilization to use.
[ https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333520#comment-17333520 ] Eric Badger commented on YARN-10707: Thanks for the update, [~zhuqi]! The content looks good, I just have a few nits on naming conventions. {noformat} + public float getNodePhysGpus() throws Exception{ {noformat} I think a better name for this method would be {{getTotalNodeGpuUtilization}} and {{getNodeGpuUtilization}} would be better off as {{getAvgNodeGpuUtilization}}. Then {{totalGpuUtilization}} would also be changed to {{avgGpuUtilization}} in {{getAvgNodeGpuUtilization}}. This way we have a clear distinction on what the methods are returning. Relevant Javadocs would also be nice for each of the methods. > Support custom resources in ResourceUtilization, and update Node GPU > Utilization to use. > > > Key: YARN-10707 > URL: https://issues.apache.org/jira/browse/YARN-10707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10707.001.patch, YARN-10707.002.patch, > YARN-10707.003.patch, YARN-10707.004.patch, YARN-10707.005.patch, > YARN-10707.006.patch, YARN-10707.007.patch, YARN-10707.008.patch, > YARN-10707.009.patch > > > Support gpu in ResourceUtilization, and update Node GPU Utilization to use > first. > It will be very helpful for other use cases about GPU utilization. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
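To make the naming suggestion above concrete, here is a rough sketch of how the two methods could be told apart. It is not the patch's code: the class {{GpuUtilizationSketch}} and the per-GPU list passed to its constructor are hypothetical stand-ins, and the assumption is that "total" means the sum over all GPUs on the node while "avg" means that sum divided by the GPU count. Only the method and variable names come from the comment.

```java
import java.util.List;

// Hypothetical stand-in for the GPU handler: the per-GPU readings are
// passed in directly instead of being collected from the node.
final class GpuUtilizationSketch {
  private final List<Float> perGpuUtilization;

  GpuUtilizationSketch(List<Float> perGpuUtilization) {
    this.perGpuUtilization = perGpuUtilization;
  }

  /**
   * Returns the sum of the utilization of every GPU on the node
   * (assumed meaning of "total").
   */
  float getTotalNodeGpuUtilization() {
    float totalGpuUtilization = 0F;
    for (float utilization : perGpuUtilization) {
      totalGpuUtilization += utilization;
    }
    return totalGpuUtilization;
  }

  /**
   * Returns the mean utilization across the GPUs on the node,
   * or 0 if the node reports no GPUs.
   */
  float getAvgNodeGpuUtilization() {
    if (perGpuUtilization.isEmpty()) {
      return 0F;
    }
    float avgGpuUtilization =
        getTotalNodeGpuUtilization() / perGpuUtilization.size();
    return avgGpuUtilization;
  }
}
```

With names like these, a caller can see from the call site which of the two values is being reported.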
[jira] [Commented] (YARN-7713) Add parallel copying of directories into FSDownload
[ https://issues.apache.org/jira/browse/YARN-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333483#comment-17333483 ] Eric Badger commented on YARN-7713: --- Thanks for taking this up, [~ChrisKarampeazis]. I noticed that you weren't a contributor in JIRA yet so I've added you as one. You may now assign JIRAs to yourself in all of the Hadoop projects (YARN, Common, HDFS, Mapreduce). In general I think the PR looks good, but I think it would be nice and not too awfully difficult to sort the list of files to be localized by file size and then split the list into chunks based on that. That way we don't end up with 1 thread downloading 4 files of 2 KB and another thread downloading 4 files of 4 GB. > Add parallel copying of directories into FSDownload > --- > > Key: YARN-7713 > URL: https://issues.apache.org/jira/browse/YARN-7713 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Miklos Szegedi >Assignee: Christos Karampeazis-Papadakis >Priority: Major > Labels: newbie, pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > YARN currently copies directories sequentially when localizing. This could be > improved to do in parallel, since the source blocks are normally on different > nodes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
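The size-based split suggested above could look roughly like the sketch below. This is not the code in the PR: {{SizeBalancedChunks}}, {{FileEntry}}, {{Chunk}}, and {{partition}} are hypothetical names, and the greedy strategy (sort by size descending, then always hand the next file to the chunk with the fewest bytes so far) is just one possible way to balance the per-thread load.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper for illustration; not part of FSDownload.
final class SizeBalancedChunks {

  /** A file to localize; only its path and length in bytes matter here. */
  static final class FileEntry {
    final String path;
    final long sizeBytes;

    FileEntry(String path, long sizeBytes) {
      this.path = path;
      this.sizeBytes = sizeBytes;
    }
  }

  /** The files assigned to one download thread. */
  static final class Chunk {
    final List<FileEntry> files = new ArrayList<>();
    long totalBytes = 0;
  }

  /** Splits the files into numChunks chunks of roughly equal total size. */
  static List<Chunk> partition(List<FileEntry> files, int numChunks) {
    if (numChunks <= 0) {
      throw new IllegalArgumentException("numChunks must be positive");
    }
    // Largest files first, so the big ones are spread out before the
    // small ones fill in the gaps.
    List<FileEntry> sorted = new ArrayList<>(files);
    sorted.sort(
        Comparator.comparingLong((FileEntry f) -> f.sizeBytes).reversed());

    List<Chunk> chunks = new ArrayList<>();
    PriorityQueue<Chunk> lightestFirst = new PriorityQueue<>(
        Comparator.comparingLong((Chunk c) -> c.totalBytes));
    for (int i = 0; i < numChunks; i++) {
      Chunk chunk = new Chunk();
      chunks.add(chunk);
      lightestFirst.add(chunk);
    }

    for (FileEntry file : sorted) {
      // Remove the lightest chunk, grow it, then re-insert it so the
      // queue ordering reflects the new total.
      Chunk chunk = lightestFirst.poll();
      chunk.files.add(file);
      chunk.totalBytes += file.sizeBytes;
      lightestFirst.add(chunk);
    }
    return chunks;
  }
}
```

For the case mentioned above (4 files of 2 KB and 4 files of 4 GB across 2 threads), this gives each thread two large and two small files rather than all of one kind.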
[jira] [Assigned] (YARN-7713) Add parallel copying of directories into FSDownload
[ https://issues.apache.org/jira/browse/YARN-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger reassigned YARN-7713: - Assignee: Christos Karampeazis-Papadakis > Add parallel copying of directories into FSDownload > --- > > Key: YARN-7713 > URL: https://issues.apache.org/jira/browse/YARN-7713 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Miklos Szegedi >Assignee: Christos Karampeazis-Papadakis >Priority: Major > Labels: newbie, pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > YARN currently copies directories sequentially when localizing. This could be > improved to do in parallel, since the source blocks are normally on different > nodes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10493) RunC container repository v2
[ https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1773#comment-1773 ] Eric Badger commented on YARN-10493: What I'm saying on the split thing is that in the current state {{hadoop/rhel7/myimage:current}} would throw an exception. But I don't see why that is necessary. In the above case, why not have {{hadoop}} as the namespace and {{rhel7/myimage:current}} as the image name? > RunC container repository v2 > > > Key: YARN-10493 > URL: https://issues.apache.org/jira/browse/YARN-10493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Matthew Sharp >Priority: Major > Labels: pull-request-available > Attachments: runc-container-repository-v2-design.pdf, > runc-container-repository-v2-design_updated.pdf > > Time Spent: 1h 20m > Remaining Estimate: 0h > > The current runc container repository design has scalability and usability > issues which will likely limit widespread adoption. We should address this > with a new, V2 layout. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10707) Support custom resources in ResourceUtilization, and update Node GPU Utilization to use.
[ https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332739#comment-17332739 ] Eric Badger commented on YARN-10707: Thanks for the updated patch, [~zhuqi]! It's much cleaner and much smaller now {noformat} float nodeGpuUtilization = 0F; +float nodeGpus = 0F; try { if (gpuNodeResourceUpdateHandler != null) { nodeGpuUtilization = gpuNodeResourceUpdateHandler.getNodeGpuUtilization(); +nodeGpus = +gpuNodeResourceUpdateHandler.getNodePhysGpus(); } } catch (Exception e) { LOG.error("Get Node GPU Utilization error: " + e); } {noformat} Ideally this wouldn't be GPU-specific and we could add all plugin utilizations to the nodeUtilization object. But that is beyond the scope of this JIRA, so I think this is fine. However, I think we can get a better name than {{nodeGpus}}. Maybe {{TotalNodeGpuUtilization}}? Additionally, why are we sending the average GPU utilization to the NM metrics, but the total GPU utilization to the RM? Memory and CPU are consistent across the two. I don't understand why GPU is different. > Support custom resources in ResourceUtilization, and update Node GPU > Utilization to use. > > > Key: YARN-10707 > URL: https://issues.apache.org/jira/browse/YARN-10707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10707.001.patch, YARN-10707.002.patch, > YARN-10707.003.patch, YARN-10707.004.patch, YARN-10707.005.patch, > YARN-10707.006.patch, YARN-10707.007.patch > > > Support gpu in ResourceUtilization, and update Node GPU Utilization to use > first. > It will be very helpful for other use cases about GPU utilization. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10749) Can't remove all node labels after add node label without nodemanager port, broken by YARN-10647
[ https://issues.apache.org/jira/browse/YARN-10749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10749: --- Fix Version/s: 3.2.3 2.10.2 3.1.5 3.3.1 3.4.0 Thanks for the patch, [~dmmkr] and [~zhuqi] for the review. +1 committed to trunk (3.4), branch-3.3, branch-3.2, branch-3.1, and branch-2.10 > Can't remove all node labels after add node label without nodemanager port, > broken by YARN-10647 > > > Key: YARN-10749 > URL: https://issues.apache.org/jira/browse/YARN-10749 > Project: Hadoop YARN > Issue Type: Bug >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3 > > Attachments: YARN-10749.001.patch, YARN-10749.002.patch > > > The fix done in YARN-10501, doesn't work after YARN-10647. > To reproduce follow the same steps in YARN-10501 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10493) RunC container repository v2
[ https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326150#comment-17326150 ] Eric Badger commented on YARN-10493: Thanks for the latest patch. I tested out the patch along with the CLI tool from YARN-10494 and everything seems to be working well. The addition of namespaces has fixed my issue from last time with the {{hadoop/rhel7:current}} image. But I do have a few comments. In addition to these comments, I would like to commit both this JIRA and YARN-10494 at the same time, because I don't particularly think either makes sense without the other. And YARN-10494 has a blocker on it because of the docker->overlayfs incompatibilities with whiteout/opaque files. Anyway, here are my comments on this JIRA {noformat} if (!tag.equals("")) { {noformat} nit: There are a few times where the patch uses this, but I think {{isEmpty()}} is more appropriate than {{equals("")}}. {noformat} String[] nameParts = imageCoordinates.split("/", -1); String imageTag; if (nameParts.length == 2) { metaNamespace = nameParts[0]; imageTag = nameParts[1]; } else if (nameParts.length == 1) { imageTag = nameParts[0]; } else { throw new IllegalArgumentException("Invalid image coordinates: " + imageCoordinates); } {noformat} According to the [documentation|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#split-java.lang.String-int-] for {{split}}, this code will create a String array with the number of elements equal to the number of {{/}} + 1. But then we only look at the first 2 parts of the array. Is there a reason not to take only the part before the first slash as the namespace and then take the rest as the imageTag? 
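To illustrate the question about {{split}}, here is a small sketch of the alternative: passing a limit of 2 splits only at the first slash, so everything after it stays in the image tag. The class {{ImageCoordinates}}, the {{parse}} method, and the {{defaultNamespace}} parameter are made up for this example and are not the patch's code; only {{String.split(String, int)}} is the real API.

```java
// Hypothetical illustration; names here are assumptions, not the patch.
final class ImageCoordinates {
  final String namespace;
  final String imageTag;

  private ImageCoordinates(String namespace, String imageTag) {
    this.namespace = namespace;
    this.imageTag = imageTag;
  }

  static ImageCoordinates parse(String imageCoordinates,
      String defaultNamespace) {
    // A limit of 2 yields at most two elements: the text before the
    // first slash, and the remainder with any later slashes intact.
    String[] nameParts = imageCoordinates.split("/", 2);
    if (nameParts.length == 2) {
      return new ImageCoordinates(nameParts[0], nameParts[1]);
    }
    // No slash at all: fall back to the caller-supplied namespace.
    return new ImageCoordinates(defaultNamespace, nameParts[0]);
  }

  public static void main(String[] args) {
    String coords = "hadoop/rhel7/myimage:current";
    // With a limit of -1 this has three parts, which the length check
    // quoted above would reject.
    System.out.println(coords.split("/", -1).length);
    ImageCoordinates parsed = parse(coords, "library");
    System.out.println(parsed.namespace + " | " + parsed.imageTag);
  }
}
```

Whether a slash inside the image tag is acceptable for the repository layout is a separate question; the sketch only shows that the parsing side is simple.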
> RunC container repository v2 > > > Key: YARN-10493 > URL: https://issues.apache.org/jira/browse/YARN-10493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Matthew Sharp >Priority: Major > Labels: pull-request-available > Attachments: runc-container-repository-v2-design.pdf, > runc-container-repository-v2-design_updated.pdf > > Time Spent: 1h 20m > Remaining Estimate: 0h > > The current runc container repository design has scalability and usability > issues which will likely limit widespread adoption. We should address this > with a new, V2 layout. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10707) Support custom resources in ResourceUtilization, and update Node GPU Utilization to use.
[ https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326097#comment-17326097 ] Eric Badger commented on YARN-10707: Thanks for the patch, [~zhuqi]. To decrease the size of the patch, I think it would be better to keep the ResourceUtilization.newInstance method signature the same (i.e. with pmem, vmem, and cpu). And then create a new method signature with those 3 parameters plus the new custom resources. The newInstance method with only 3 parameters can call the method with 4 parameters and just assume that the custom resources will be null. That way we won't have to modify as many files changing all of the newInstance calls to add null. The same logic can be used for {{addTo}} and {{subtractFrom}} {noformat} public void setCustomResource(String resourceName, float utilization) { if (customResources != null && resourceName != null && !resourceName.isEmpty()) { customResources.put(resourceName, utilization); } } {noformat} I don't think the {{customResources != null}} check is necessary. {{customResources}} is initialized to a new HashMap and the only place that it is assigned is in {{setCustomResources}}, but that method only sets it if the parameter is non-null. {noformat} +nodeUtilization = +ResourceUtilization.newInstance( +(int) (pmem >> 20), // B -> MB +(int) (vmem >> 20), // B -> MB +vcores, // Used Virtual Cores +customResources); // Used GPUs + +nodeUtilization. +setCustomResource(ResourceInformation.GPU_URI, nodeGpus); + + {noformat} Maybe it's just me, but I think it makes more sense to set the custom resources before passing them as a parameter to newInstance. Just like we're setting cpu and mem in newInstance instead of setting them to 0 and then setting them after > Support custom resources in ResourceUtilization, and update Node GPU > Utilization to use. 
> > > Key: YARN-10707 > URL: https://issues.apache.org/jira/browse/YARN-10707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10707.001.patch, YARN-10707.002.patch, > YARN-10707.003.patch, YARN-10707.004.patch, YARN-10707.005.patch, > YARN-10707.006.patch > > > Support gpu in ResourceUtilization, and update Node GPU Utilization to use > first. > It will be very helpful for other use cases about GPU utilization. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
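As a rough sketch of the overload suggestion in the comment above: the three-argument {{newInstance}} stays and simply delegates to the four-argument one with {{null}} custom resources, so existing callers are untouched. {{UtilizationSketch}} is a simplified, hypothetical stand-in and is not how the real ResourceUtilization class is structured; {{getCustomResource}} is added here only so the result can be inspected.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for illustration only.
final class UtilizationSketch {
  private final int pmem;
  private final int vmem;
  private final float cpu;
  // Always initialized, so setters do not need a null check on it.
  private final Map<String, Float> customResources = new HashMap<>();

  private UtilizationSketch(int pmem, int vmem, float cpu,
      Map<String, Float> customResources) {
    this.pmem = pmem;
    this.vmem = vmem;
    this.cpu = cpu;
    if (customResources != null) {
      this.customResources.putAll(customResources);
    }
  }

  // Existing signature: callers that do not care about custom
  // resources keep compiling without any change.
  static UtilizationSketch newInstance(int pmem, int vmem, float cpu) {
    return newInstance(pmem, vmem, cpu, null);
  }

  // New signature: the same three values plus the custom resources.
  static UtilizationSketch newInstance(int pmem, int vmem, float cpu,
      Map<String, Float> customResources) {
    return new UtilizationSketch(pmem, vmem, cpu, customResources);
  }

  void setCustomResource(String resourceName, float utilization) {
    if (resourceName != null && !resourceName.isEmpty()) {
      customResources.put(resourceName, utilization);
    }
  }

  float getCustomResource(String resourceName) {
    Float utilization = customResources.get(resourceName);
    return utilization == null ? 0F : utilization;
  }
}
```

A caller that has GPU data would fill the map first (for example with the {{yarn.io/gpu}} key) and pass it to the four-argument {{newInstance}}, in the same way cpu and memory are passed in rather than set afterwards.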
[jira] [Commented] (YARN-10743) Add a policy for not aggregating for containers which are killed because exceeding container log size limit.
[ https://issues.apache.org/jira/browse/YARN-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326048#comment-17326048 ] Eric Badger commented on YARN-10743: I don't really have a big issue with adding this as an option that is disabled by default. It's not something that I would ever want to enable in my clusters, but if there is use for it in other scenarios, then I don't have a big issue with it. [~Jim_Brennan], do you agree or do you still have concerns with adding this? > Add a policy for not aggregating for containers which are killed because > exceeding container log size limit. > > > Key: YARN-10743 > URL: https://issues.apache.org/jira/browse/YARN-10743 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10743.001.patch, image-2021-04-20-10-41-01-057.png > > > Since YARN-10471 supported container log size limited for kill. > We'd better to add a policy that can not aggregated for those containers, so > that to reduce the pressure of HDFS etc. > cc [~epayne] [~Jim_Brennan] [~ebadger] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10723) Change CS nodes page in UI to support custom resource.
[ https://issues.apache.org/jira/browse/YARN-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10723: --- Fix Version/s: 3.2.3 3.1.5 3.3.1 3.4.0 [~zhuqi], thanks for the patch! +1 I've committed this to trunk (3.4), branch-3.3, branch-3.2, and branch-3.1 > Change CS nodes page in UI to support custom resource. > -- > > Key: YARN-10723 > URL: https://issues.apache.org/jira/browse/YARN-10723 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3 > > Attachments: YARN-10723.001.patch, YARN-10723.002.patch, > YARN-10723.003.patch, YARN-10723.004.patch, YARN-10723.005.patch, > image-2021-04-06-17-22-32-733.png > > > Node page now only support gpu for custom resource. > We should make this supported for all custom resources. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10460: --- Fix Version/s: 2.10.2 3.1.5 Thanks for the review, [~Jim_Brennan]. The spotbugs warning is unrelated to this patch (it is in a different file, Server.java). I've committed the 2.10 patch to branch-2.10. This jira has now been committed to trunk (3.4), branch-3.3, branch-3.2, branch-3.1, and branch-2.10. > Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail > - > > Key: YARN-10460 > URL: https://issues.apache.org/jira/browse/YARN-10460 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3 > > Attachments: YARN-10460-001.patch, YARN-10460-002.patch, > YARN-10460-POC.patch, YARN-10460-branch-2.10.002.patch, > YARN-10460-branch-3.2.002.patch > > > In our downstream build environment, we're using JUnit 4.13. Recently, we > discovered a truly weird test failure in TestNodeStatusUpdater. > The problem is that timeout handling has changed in JUnit 4.13. 
See the > difference between these two snippets: > 4.12 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw throwable; > } > } > {noformat} > > 4.13 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > try { > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw throwable; > } > } finally { > try { > thread.join(1); > } catch (InterruptedException e) { > Thread.currentThread().interrupt(); > } > try { > threadGroup.destroy(); < This > } catch (IllegalThreadStateException e) { > // If a thread from the group is still alive, the ThreadGroup > cannot be destroyed. > // Swallow the exception to keep the same behavior prior to > this change. > } > } > } > {noformat} > The change comes from [https://github.com/junit-team/junit4/pull/1517]. > Unfortunately, destroying the thread group causes an issue because there are > all sorts of object caching in the IPC layer. 
The exception is: > {noformat} > java.lang.IllegalThreadStateException > at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867) > at java.lang.Thread.init(Thread.java:402) > at java.lang.Thread.init(Thread.java:349) > at java.lang.Thread.(Thread.java:675) > at > java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613) > at > com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163) > at > java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) > at > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) > at > org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136) > at org.apache.hadoop.ipc.Client.call(Client.java:1458) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy81.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) > at > org.apache.hadoop.yarn.serve
[jira] [Commented] (YARN-10715) Remove hardcoded resource values (e.g. GPU/FPGA) in code.
[ https://issues.apache.org/jira/browse/YARN-10715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325387#comment-17325387 ] Eric Badger commented on YARN-10715: Finally getting around to looking at this and I don't think removing the hardcoded values from ResourceUtils is necessary. It's not necessary for the resource translation to be there for GPUs and FPGAs, but it also doesn't hurt anything. I'm not quite sure why {{yarn.io/gpu}} was chosen anyway, since that seems like a pretty complex name for something as simple as a gpu. But I'm sure there was a good reason. Anyway, we should definitely strive to remove any code that is hardcoding calculations for GPUs/FPGAs and generalize them to any extended resource type. But in this case, this is just a simple translation from {{gpu}} to {{yarn.io/gpu}}, and similarly for FPGAs. I'm inclined to close this as won't do. > Remove hardcoded resource values (e.g. GPU/FPGA) in code. > - > > Key: YARN-10715 > URL: https://issues.apache.org/jira/browse/YARN-10715 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10715.001.patch > > > https://issues.apache.org/jira/browse/YARN-10503?focusedCommentId=17307772&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17307772 > As the above comment says, we should remove hardcoded resource values (e.g. GPU/FPGA) > in code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10743) Add a policy for not aggregating for containers which are killed because exceeding container log size limit.
[ https://issues.apache.org/jira/browse/YARN-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325343#comment-17325343 ] Eric Badger commented on YARN-10743: I have the same concern as [~Jim_Brennan]. If the flink logs are large enough that the container is getting killed, don't you want to check the logs to see what happened? I'm trying to understand the scenario where you wouldn't want the logs even though your container failed due to large log size. Is there a reason that you don't care about the logs in this instance? > Add a policy for not aggregating for containers which are killed because > exceeding container log size limit. > > > Key: YARN-10743 > URL: https://issues.apache.org/jira/browse/YARN-10743 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10743.001.patch > > > Since YARN-10471 supported container log size limited for kill. > We'd better to add a policy that can not aggregated for those containers, so > that to reduce the pressure of HDFS etc. > cc [~epayne] [~Jim_Brennan] [~ebadger] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325329#comment-17325329 ] Eric Badger commented on YARN-10460: Posting a branch-2.10 patch that doesn't use a lambda expression (not supported in Java 7). > Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail > - > > Key: YARN-10460 > URL: https://issues.apache.org/jira/browse/YARN-10460 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Attachments: YARN-10460-001.patch, YARN-10460-002.patch, > YARN-10460-POC.patch, YARN-10460-branch-2.10.002.patch, > YARN-10460-branch-3.2.002.patch > > > In our downstream build environment, we're using JUnit 4.13. Recently, we > discovered a truly weird test failure in TestNodeStatusUpdater. > The problem is that timeout handling has changed in JUnit 4.13. See the > difference between these two snippets: > 4.12 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw throwable; > } > } > {noformat} > > 4.13 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > try { > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw 
throwable; > } > } finally { > try { > thread.join(1); > } catch (InterruptedException e) { > Thread.currentThread().interrupt(); > } > try { > threadGroup.destroy(); < This > } catch (IllegalThreadStateException e) { > // If a thread from the group is still alive, the ThreadGroup > cannot be destroyed. > // Swallow the exception to keep the same behavior prior to > this change. > } > } > } > {noformat} > The change comes from [https://github.com/junit-team/junit4/pull/1517]. > Unfortunately, destroying the thread group causes an issue because there are > all sorts of object caching in the IPC layer. The exception is: > {noformat} > java.lang.IllegalThreadStateException > at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867) > at java.lang.Thread.init(Thread.java:402) > at java.lang.Thread.init(Thread.java:349) > at java.lang.Thread.(Thread.java:675) > at > java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613) > at > com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163) > at > java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) > at > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) > at > org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136) > at org.apache.hadoop.ipc.Client.call(Client.java:1458) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy81.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) 
> at > org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdat
[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10460: --- Attachment: YARN-10460-branch-2.10.002.patch > Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail > - > > Key: YARN-10460 > URL: https://issues.apache.org/jira/browse/YARN-10460 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Attachments: YARN-10460-001.patch, YARN-10460-002.patch, > YARN-10460-POC.patch, YARN-10460-branch-2.10.002.patch, > YARN-10460-branch-3.2.002.patch
[jira] [Comment Edited] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325291#comment-17325291 ] Eric Badger edited comment on YARN-10460 at 4/19/21, 8:26 PM: -- Thanks for the review, [~Jim_Brennan]! I've committed the 3.2 patch to branch-3.2 and cherry-picked it to branch-3.1. So now this has been committed to trunk (3.4), branch-3.3, branch-3.2, and branch-3.1 was (Author: ebadger): Thanks for the review, [~Jim_Brennan]! I've committed the 3.2 patch to branch-3.2. So now this has been committed to trunk (3.4), branch-3.3, and branch-3.2 > Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail > - > > Key: YARN-10460 > URL: https://issues.apache.org/jira/browse/YARN-10460 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Attachments: YARN-10460-001.patch, YARN-10460-002.patch, > YARN-10460-POC.patch, YARN-10460-branch-3.2.002.patch
[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10460: --- Fix Version/s: 3.2.3 Thanks for the review, [~Jim_Brennan]! I've committed the 3.2 patch to branch-3.2. So now this has been committed to trunk (3.4), branch-3.3, and branch-3.2 > Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail > - > > Key: YARN-10460 > URL: https://issues.apache.org/jira/browse/YARN-10460 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Attachments: YARN-10460-001.patch, YARN-10460-002.patch, > YARN-10460-POC.patch, YARN-10460-branch-3.2.002.patch
[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325231#comment-17325231 ] Eric Badger commented on YARN-10460: The unit test failures seem unrelated, and the tests don't fail for me locally. [~pbacsko], [~aajisaka], [~Jim_Brennan], could one of you review the branch-3.2 patch? > Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail > - > > Key: YARN-10460 > URL: https://issues.apache.org/jira/browse/YARN-10460 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10460-001.patch, YARN-10460-002.patch, > YARN-10460-POC.patch, YARN-10460-branch-3.2.002.patch
[jira] [Commented] (YARN-10723) Change CS nodes page in UI to support custom resource.
[ https://issues.apache.org/jira/browse/YARN-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325210#comment-17325210 ] Eric Badger commented on YARN-10723: Looks like it still never ran. [~zhuqi], can you re-upload the patch? > Change CS nodes page in UI to support custom resource. > -- > > Key: YARN-10723 > URL: https://issues.apache.org/jira/browse/YARN-10723 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10723.001.patch, YARN-10723.002.patch, > YARN-10723.003.patch, YARN-10723.004.patch, image-2021-04-06-17-22-32-733.png > > > The node page currently supports only GPU as a custom resource. > We should support all custom resources.
[jira] [Commented] (YARN-10723) Change CS nodes page in UI to support custom resource.
[ https://issues.apache.org/jira/browse/YARN-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324097#comment-17324097 ] Eric Badger commented on YARN-10723: Precommit never ran on the latest patch, so I cancelled the patch and resubmitted it. I also tested the patch in both my GPU environment and my non-GPU environment, and both look good. I'm +1 on the patch pending HadoopQA > Change CS nodes page in UI to support custom resource. > -- > > Key: YARN-10723 > URL: https://issues.apache.org/jira/browse/YARN-10723 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10723.001.patch, YARN-10723.002.patch, > YARN-10723.003.patch, YARN-10723.004.patch, image-2021-04-06-17-22-32-733.png > > > The node page currently supports only GPU as a custom resource. > We should support all custom resources.
[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324085#comment-17324085 ] Eric Badger commented on YARN-10460: Reopening and attaching a patch for branch-3.2 that puts {{clearClientCache}} in ProtobufRpcEngine instead of ProtobufRpcEngine2, since ProtobufRpcEngine2 doesn't exist in branch-3.2 > Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail > - > > Key: YARN-10460 > URL: https://issues.apache.org/jira/browse/YARN-10460 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10460-001.patch, YARN-10460-002.patch, > YARN-10460-POC.patch, YARN-10460-branch-3.2.002.patch
[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10460: --- Attachment: YARN-10460-branch-3.2.002.patch > Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail > - > > Key: YARN-10460 > URL: https://issues.apache.org/jira/browse/YARN-10460 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10460-001.patch, YARN-10460-002.patch, > YARN-10460-POC.patch, YARN-10460-branch-3.2.002.patch
[jira] [Reopened] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger reopened YARN-10460:

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> ---------------------------------------------------------------------
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, test
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Fix For: 3.4.0, 3.3.1
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, YARN-10460-POC.patch, YARN-10460-branch-3.2.002.patch
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
>     CallableStatement callable = new CallableStatement();
>     FutureTask task = new FutureTask(callable);
>     threadGroup = new ThreadGroup("FailOnTimeoutGroup");
>     Thread thread = new Thread(threadGroup, task, "Time-limited test");
>     thread.setDaemon(true);
>     thread.start();
>     callable.awaitStarted();
>     Throwable throwable = getResult(task, thread);
>     if (throwable != null) {
>         throw throwable;
>     }
> }
> {noformat}
>
> 4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
>     CallableStatement callable = new CallableStatement();
>     FutureTask task = new FutureTask(callable);
>     ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
>     Thread thread = new Thread(threadGroup, task, "Time-limited test");
>     try {
>         thread.setDaemon(true);
>         thread.start();
>         callable.awaitStarted();
>         Throwable throwable = getResult(task, thread);
>         if (throwable != null) {
>             throw throwable;
>         }
>     } finally {
>         try {
>             thread.join(1);
>         } catch (InterruptedException e) {
>             Thread.currentThread().interrupt();
>         }
>         try {
>             threadGroup.destroy();  <-- This
>         } catch (IllegalThreadStateException e) {
>             // If a thread from the group is still alive, the ThreadGroup cannot be destroyed.
>             // Swallow the exception to keep the same behavior prior to this change.
>         }
>     }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>     at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>     at java.lang.Thread.init(Thread.java:402)
>     at java.lang.Thread.init(Thread.java:349)
>     at java.lang.Thread.<init>(Thread.java:675)
>     at java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>     at com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>     at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>     at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>     at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>     at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>     at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>     at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the client object in {{ProtobufRpcEngine}}/{{Protob
[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger updated YARN-10460:
-------------------------------

I backported this to branch-3.3. There's a merge conflict with branch-3.2 that I'm looking into. HADOOP-17602 was merged fairly recently, which means this issue shows up in all active branches that don't have this fix.

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> ---------------------------------------------------------------------
>
>                 Key: YARN-10460
>                 URL: https://issues.apache.org/jira/browse/YARN-10460
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, test
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>             Fix For: 3.4.0
>         Attachments: YARN-10460-001.patch, YARN-10460-002.patch, YARN-10460-POC.patch
[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger updated YARN-10460:
-------------------------------
    Fix Version/s: 3.3.1

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> ---------------------------------------------------------------------
>
>                 Key: YARN-10460
>                 URL: https://issues.apache.org/jira/browse/YARN-10460
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, test
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>             Fix For: 3.4.0, 3.3.1
>         Attachments: YARN-10460-001.patch, YARN-10460-002.patch, YARN-10460-POC.patch
[jira] [Updated] (YARN-10503) Support queue capacity in terms of absolute resources with custom resourceType.
[ https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger updated YARN-10503:
-------------------------------
    Fix Version/s: 3.3.1

Thanks for the patch, [~zhuqi]. +1, committed to branch-3.3. This has now been committed to trunk (3.4) and branch-3.3.

> Support queue capacity in terms of absolute resources with custom resourceType.
> -------------------------------------------------------------------------------
>
>                 Key: YARN-10503
>                 URL: https://issues.apache.org/jira/browse/YARN-10503
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Qi Zhu
>            Assignee: Qi Zhu
>            Priority: Critical
>             Fix For: 3.4.0, 3.3.1
>         Attachments: YARN-10503-branch-3.3.010.patch, YARN-10503.001.patch, YARN-10503.002.patch, YARN-10503.003.patch, YARN-10503.004.patch, YARN-10503.005.patch, YARN-10503.006.patch, YARN-10503.007.patch, YARN-10503.008.patch, YARN-10503.009.patch, YARN-10503.010.patch
>
> Currently, the only absolute resources are memory and cores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }{code}
> But in our GPU production clusters, we need to support more resource types.
> This is very important for cluster scaling when different resource types have absolute demands.
>
> This Jira will handle GPU first.
[jira] [Updated] (YARN-10503) Support queue capacity in terms of absolute resources with custom resourceType.
[ https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger updated YARN-10503:
-------------------------------
    Fix Version/s: 3.4.0

Thanks for the updates, [~zhuqi]. +1 on patch 10. And thanks for the reviews, [~gandras] and [~pbacsko]. I've committed this to trunk (3.4).

[~zhuqi], there is a conflict on the cherry-pick back to branch-3.3. It looks like a fairly trivial fix. Could you make the necessary adjustments and put up a patch for branch-3.3?

> Support queue capacity in terms of absolute resources with custom resourceType.
> -------------------------------------------------------------------------------
>
>                 Key: YARN-10503
>                 URL: https://issues.apache.org/jira/browse/YARN-10503
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Qi Zhu
>            Assignee: Qi Zhu
>            Priority: Critical
>             Fix For: 3.4.0
>         Attachments: YARN-10503.001.patch, YARN-10503.002.patch, YARN-10503.003.patch, YARN-10503.004.patch, YARN-10503.005.patch, YARN-10503.006.patch, YARN-10503.007.patch, YARN-10503.008.patch, YARN-10503.009.patch, YARN-10503.010.patch
[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor
[ https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10702: --- Fix Version/s: 3.2.3 3.1.5 Thanks for the additional patches, [~Jim_Brennan]. I committed the 3.2 and 3.1 patches. This has now been committed to trunk (3.4), branch-3.3, branch-3.2, and branch-3.1. > Add cluster metric for amount of CPU used by RM Event Processor > --- > > Key: YARN-10702 > URL: https://issues.apache.org/jira/browse/YARN-10702 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 2.10.1, 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Minor > Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3 > > Attachments: Scheduler-Busy.png, YARN-10702-branch-3.1.006.patch, > YARN-10702-branch-3.2.006.patch, YARN-10702-branch-3.3.006.patch, > YARN-10702.001.patch, YARN-10702.002.patch, YARN-10702.003.patch, > YARN-10702.004.patch, YARN-10702.005.patch, YARN-10702.006.patch, > simon-scheduler-busy.png > > > Add a cluster metric to track the cpu usage of the ResourceManager Event > Processing thread. This lets us know when the critical path of the RM is > running out of headroom. > This feature was originally added for us internally by [~nroberts] and we've > been running with it on production clusters for nearly four years. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
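For readers unfamiliar with how the CPU usage of a single thread can be sampled inside a JVM, here is a minimal sketch based on the standard ThreadMXBean API. It is not taken from the YARN-10702 patch, which may work differently. The class name ThreadCpuSketch, the method cpuPercent, and the sampling window are assumptions made purely for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class ThreadCpuSketch {

  /**
   * Samples the CPU time of one thread twice, windowMillis apart, and
   * returns the share of wall-clock time the thread spent on a CPU,
   * as a percentage. Returns -1 when the JVM cannot report per-thread
   * CPU time, when the thread is no longer alive, or when the caller
   * is interrupted while waiting.
   */
  static double cpuPercent(Thread thread, long windowMillis) {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
      return -1;
    }
    long cpuStart = mx.getThreadCpuTime(thread.getId());
    long wallStart = System.nanoTime();
    try {
      Thread.sleep(windowMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return -1;
    }
    long cpuEnd = mx.getThreadCpuTime(thread.getId());
    long wallEnd = System.nanoTime();
    if (cpuStart < 0 || cpuEnd < 0) {
      return -1; // getThreadCpuTime reports -1 for threads that are not alive
    }
    return 100.0 * (cpuEnd - cpuStart) / (wallEnd - wallStart);
  }

  public static void main(String[] args) {
    // A stand-in for a busy event-processing thread: spins for about 300 ms.
    Thread busy = new Thread(() -> {
      long end = System.nanoTime() + 300_000_000L;
      while (System.nanoTime() < end) {
        // spin
      }
    });
    busy.setDaemon(true);
    busy.start();
    System.out.println("busy thread CPU %: " + cpuPercent(busy, 100));
  }
}
```

A value that stays close to 100 for a single-threaded processing loop would indicate that the thread has little headroom left, which is the situation the issue description wants to make visible.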
[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor
[ https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger updated YARN-10702:
-------------------------------
    Fix Version/s: 3.3.1
                   3.4.0

Thanks for the patch, [~Jim_Brennan]. I've committed it to branch-3.3, so it has now been committed to trunk (3.4) and branch-3.3. There's another conflict with branch-3.2. If you'd like it to go back there, please provide a patch for that branch as well.

Also, a belated thanks to [~gandras] and [~zhuqi] for the reviews on the original patch.

> Add cluster metric for amount of CPU used by RM Event Processor
> ---------------------------------------------------------------
>
>                 Key: YARN-10702
>                 URL: https://issues.apache.org/jira/browse/YARN-10702
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn
>    Affects Versions: 2.10.1, 3.4.0
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Minor
>             Fix For: 3.4.0, 3.3.1
>         Attachments: Scheduler-Busy.png, YARN-10702-branch-3.3.006.patch, YARN-10702.001.patch, YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, YARN-10702.005.patch, YARN-10702.006.patch, simon-scheduler-busy.png
[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor
[ https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315191#comment-17315191 ] Eric Badger commented on YARN-10702: [~Jim_Brennan], thanks for the patch. +1 I've committed this to trunk (3.4). There are a few small conflicts with the cherry-pick to branch-3.3. Would you mind putting up a patch for branch-3.3? > Add cluster metric for amount of CPU used by RM Event Processor > --- > > Key: YARN-10702 > URL: https://issues.apache.org/jira/browse/YARN-10702 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 2.10.1, 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Minor > Attachments: Scheduler-Busy.png, YARN-10702.001.patch, > YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, > YARN-10702.005.patch, YARN-10702.006.patch, simon-scheduler-busy.png > > > Add a cluster metric to track the cpu usage of the ResourceManager Event > Processing thread. This lets us know when the critical path of the RM is > running out of headroom. > This feature was originally added for us internally by [~nroberts] and we've > been running with it on production clusters for nearly four years. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger updated YARN-10501:
-------------------------------
    Fix Version/s: 2.10.2

Thanks for the patch and your patience, [~caozhiqiang]. HadoopQA is finally back to normal. I fixed up the small checkstyle issue in the patch and committed it to branch-2.10.

> Can't remove all node labels after add node label without nodemanager port
> --------------------------------------------------------------------------
>
>                 Key: YARN-10501
>                 URL: https://issues.apache.org/jira/browse/YARN-10501
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Assignee: caozhiqiang
>            Priority: Critical
>             Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>         Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch, YARN-10502-branch-2.10.003.patch
>
> When a label is added to a node without specifying the nodemanager port, or with the WILDCARD_PORT (0) port, the label info cannot be completely removed from that node afterwards.
> Steps to reproduce:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
> {code}
> You can see that after step 4, which removes the nodemanager labels, the label info is still present in the node info.
> {code:java}
> 641 case REPLACE:
> 642   replaceNodeForLabels(nodeId, host.labels, labels);
> 643   replaceLabelsForNode(nodeId, host.labels, labels);
> 644   host.labels.clear();
> 645   host.labels.addAll(labels);
> 646   for (Node node : host.nms.values()) {
> 647     replaceNodeForLabels(node.nodeId, node.labels, labels);
> 649     node.labels = null;
> 650   }
> 651   break;{code}
> The cause is on line 647. When labels are added to a node without a port, both the 0 port and the real NM port are added to the node info. When the labels are later removed, the parameter node.labels on line 647 is null, so the old label is not removed.
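The failure mode described in the issue above can be reproduced in isolation with a small stand-alone model. The sketch below is a simplified illustration, not the actual label manager code: the class LabelRemovalSketch, the map layout, and the fallBackToHostLabels switch are all hypothetical, and the fallback shown is only one conceivable remedy, not necessarily what the committed patch does.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class LabelRemovalSketch {

  // Drops nodeId from every label in oldLabels, then adds it to every label
  // in newLabels. A null oldLabels means there is nothing to remove.
  static void replaceNodeForLabels(Map<String, Set<String>> labelToNodes,
      String nodeId, Set<String> oldLabels, Set<String> newLabels) {
    if (oldLabels != null) {
      for (String label : oldLabels) {
        Set<String> nodes = labelToNodes.get(label);
        if (nodes != null) {
          nodes.remove(nodeId);
        }
      }
    }
    for (String label : newLabels) {
      labelToNodes.computeIfAbsent(label, k -> new TreeSet<>()).add(nodeId);
    }
  }

  // Replays steps 2 and 4 of the reproduction and returns the node ids that
  // are still mapped to "cpunode" afterwards.
  static Set<String> nodesStillLabeled(boolean fallBackToHostLabels) {
    Map<String, Set<String>> labelToNodes = new HashMap<>();
    Set<String> cpunode = Collections.singleton("cpunode");
    Set<String> none = Collections.emptySet();
    // The label is set at host level, so the NM entry has no labels of its
    // own (this models "node.labels = null" in the quoted code).
    Set<String> nmLabels = null;

    // Step 2: label the host without a port. Both the wildcard entry and
    // the real NM entry end up mapped to the label.
    replaceNodeForLabels(labelToNodes, "server001:0", none, cpunode);
    replaceNodeForLabels(labelToNodes, "server001:45454", nmLabels, cpunode);
    Set<String> hostLabels = cpunode;

    // Step 4: replace the host's labels with the empty set.
    replaceNodeForLabels(labelToNodes, "server001:0", hostLabels, none);
    Set<String> oldNmLabels =
        (nmLabels == null && fallBackToHostLabels) ? hostLabels : nmLabels;
    replaceNodeForLabels(labelToNodes, "server001:45454", oldNmLabels, none);

    return labelToNodes.getOrDefault("cpunode", none);
  }

  public static void main(String[] args) {
    System.out.println("without fallback: " + nodesStillLabeled(false));
    System.out.println("with fallback:    " + nodesStillLabeled(true));
  }
}
```

Without the fallback, the real NM entry stays attached to the label because a null set of old labels gives the removal loop nothing to iterate over, which matches the output of step 5 in the reproduction.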
[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with custom resourceType.
[ https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309788#comment-17309788 ]

Eric Badger commented on YARN-10503:
------------------------------------

Thanks for the update, [~zhuqi]. This might be a little too picky, but I think it would be better if {{appendCustomResources}} just created the string instead of appending it to {{resourceString}}. That way we can keep the current structure of the StringBuilders at the caller level. It could look something like this, where {{getCustomResourcesString}} returns the string instead of appending it.
{noformat}
resourceString
    .append("[" + AbsoluteResourceType.MEMORY.toString().toLowerCase() + "="
        + resource.getMemorySize() + ","
        + AbsoluteResourceType.VCORES.toString().toLowerCase() + "="
        + resource.getVirtualCores()
        + getCustomResourcesString(resource) + "]");
{noformat}
The following snippet of code confuses me a bit.
{noformat}
      // Custom resource type defined by user.
      // Such as GPU FPGA etc.
      if (!resourceTypes.contains(resourceName)) {
        resource.setResourceInformation(resourceName, ResourceInformation
            .newInstance(resourceName, units, resourceValue));
        return;
      }

      // map it based on key.
      AbsoluteResourceType resType = AbsoluteResourceType
          .valueOf(StringUtils.toUpperCase(resourceName));
      switch (resType) {
      case MEMORY :
        resource.setMemorySize(resourceValue);
        break;
      case VCORES :
        resource.setVirtualCores(resourceValue.intValue());
        break;
      default :
        resource.setResourceInformation(resourceName, ResourceInformation
            .newInstance(resourceName, units, resourceValue));
        break;
      }
    }
{noformat}
What's the purpose of the initial if statement? If the resource doesn't already contain the resource in question, we add it and then return. But in the case that it does exist, we go to the switch statement, add it, and then return. It looks like the if statement is unnecessary. Am I missing something?

> Support queue capacity in terms of absolute resources with custom resourceType.
> -------------------------------------------------------------------------------
>
>                 Key: YARN-10503
>                 URL: https://issues.apache.org/jira/browse/YARN-10503
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Qi Zhu
>            Assignee: Qi Zhu
>            Priority: Critical
>         Attachments: YARN-10503.001.patch, YARN-10503.002.patch, YARN-10503.003.patch, YARN-10503.004.patch, YARN-10503.005.patch, YARN-10503.006.patch, YARN-10503.007.patch
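On the question about the initial if statement in the comment above, one thing worth checking is how the enum lookup behaves for names that are not enum constants. The sketch below is a stand-alone illustration, not the patch code: the class EnumGuardSketch and its methods are hypothetical, and it assumes that {{resourceTypes}} holds the lower-cased names of the enum constants, which the quoted snippet does not confirm. What is certain is that {{Enum.valueOf}} throws {{IllegalArgumentException}} for a name with no matching constant.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class EnumGuardSketch {

  enum AbsoluteResourceType { MEMORY, VCORES }

  // Assumption: the set of "known" names is derived from the enum constants.
  static Set<String> knownTypeNames() {
    Set<String> names = new HashSet<>();
    for (AbsoluteResourceType type : AbsoluteResourceType.values()) {
      names.add(type.name().toLowerCase(Locale.ROOT));
    }
    return names;
  }

  // Mirrors the control flow of the quoted snippet, but only reports which
  // branch would handle the given resource name.
  static String classify(String resourceName) {
    if (!knownTypeNames().contains(resourceName)) {
      return "custom"; // early return, valueOf is never reached
    }
    AbsoluteResourceType type =
        AbsoluteResourceType.valueOf(resourceName.toUpperCase(Locale.ROOT));
    switch (type) {
      case MEMORY:
        return "memory";
      case VCORES:
        return "vcores";
      default:
        return "custom";
    }
  }

  // Shows what happens when the lookup runs without the guard.
  static boolean valueOfThrows(String resourceName) {
    try {
      AbsoluteResourceType.valueOf(resourceName.toUpperCase(Locale.ROOT));
      return false;
    } catch (IllegalArgumentException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(classify("gpu"));
    System.out.println(valueOfThrows("gpu"));
  }
}
```

Under that assumption the guard would not be redundant, since a custom name such as "gpu" would otherwise fail in the lookup before the switch is reached, and the default branch of the switch would be the unreachable part.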
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309640#comment-17309640 ]

Eric Badger commented on YARN-10501:
------------------------------------

bq. Backporting HADOOP-16870 to branch-2.10 should mitigate this error. I'll check the patch there.

Gotcha. Thanks, [~aajisaka]. We'll resubmit the patch once HADOOP-16870 is backported.

> Can't remove all node labels after add node label without nodemanager port
> --------------------------------------------------------------------------
>
>                 Key: YARN-10501
>                 URL: https://issues.apache.org/jira/browse/YARN-10501
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Assignee: caozhiqiang
>            Priority: Critical
>             Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>         Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch
[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with custom resourceType.
[ https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309058#comment-17309058 ]

Eric Badger commented on YARN-10503:
------------------------------------

Thanks for the patch, [~zhuqi]! Here are a few comments.
{noformat}
+    if (ResourceUtils.getNumberOfKnownResourceTypes() > 2) {
+      ResourceInformation[] resources =
+          resource.getResources();
+      for (int i = 2; i < resources.length; i++) {
+        ResourceInformation resInfo = resources[i];
+        resourceString.append(","
+            + resInfo.getName() + "=" + resInfo.getValue());
+      }
+    }
{noformat}
This code snippet is repeated many times in this patch. I think it would make sense to turn it into a method so that we don't have so much code repetition.

{{splits[0]}} is used often enough in the code that I think it makes sense to assign it to a local variable for better readability.

> Support queue capacity in terms of absolute resources with custom resourceType.
> -------------------------------------------------------------------------------
>
>                 Key: YARN-10503
>                 URL: https://issues.apache.org/jira/browse/YARN-10503
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Qi Zhu
>            Assignee: Qi Zhu
>            Priority: Critical
>         Attachments: YARN-10503.001.patch, YARN-10503.002.patch, YARN-10503.003.patch, YARN-10503.004.patch, YARN-10503.005.patch, YARN-10503.006.patch
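The refactoring suggested in the review comment above could be sketched roughly as follows. This is a self-contained illustration rather than the patch itself: ResInfo is a hypothetical stand-in for the Hadoop ResourceInformation class, and the helper names getCustomResourcesString and format are assumptions. As in the quoted diff, the first two array slots are assumed to hold memory and vcores.

```java
class CustomResourceStringSketch {

  // Hypothetical stand-in for a name/value pair describing one resource.
  static final class ResInfo {
    final String name;
    final long value;

    ResInfo(String name, long value) {
      this.name = name;
      this.value = value;
    }
  }

  // Builds ",name=value" for every resource after the first two slots and
  // returns it, so callers can keep their own StringBuilder structure.
  static String getCustomResourcesString(ResInfo[] resources) {
    StringBuilder custom = new StringBuilder();
    for (int i = 2; i < resources.length; i++) {
      custom.append(',')
          .append(resources[i].name)
          .append('=')
          .append(resources[i].value);
    }
    return custom.toString();
  }

  // Example caller: formats all resources in the bracketed style used in
  // the review comment.
  static String format(ResInfo[] resources) {
    return "[" + resources[0].name + "=" + resources[0].value + ","
        + resources[1].name + "=" + resources[1].value
        + getCustomResourcesString(resources) + "]";
  }

  public static void main(String[] args) {
    ResInfo[] resources = {
        new ResInfo("memory", 4096),
        new ResInfo("vcores", 8),
        new ResInfo("yarn.io/gpu", 2)
    };
    System.out.println(format(resources));
  }
}
```

With only memory and vcores present, the loop body never runs and the helper returns an empty string, so a caller does not need a separate check on the number of known resource types.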
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309028#comment-17309028 ]

Eric Badger commented on YARN-10501:
------------------------------------

[~aajisaka], can you help out here? The Yetus bug is blocking this patch.

> Can't remove all node labels after add node label without nodemanager port
> --------------------------------------------------------------------------
>
>                 Key: YARN-10501
>                 URL: https://issues.apache.org/jira/browse/YARN-10501
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Assignee: caozhiqiang
>            Priority: Critical
>             Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>         Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch
[jira] [Updated] (YARN-10713) ClusterMetrics should support custom resource capacity related metrics.
[ https://issues.apache.org/jira/browse/YARN-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10713: --- Fix Version/s: 3.3.1 3.4.0 Thanks for the patch, [~zhuqi]. I tested this out on my local GPU environment and everything looks good. +1 I've committed this to trunk (3.4) and branch-3.3. The cherry-pick comes back clean to branch-3.2, but there is a compilation error that I believe is due to some other requisite patches not being pulled back there. If you'd like it to go back to branch-3.2, we'll need to do some additional work. Closing for now, though. > ClusterMetrics should support custom resource capacity related metrics. > --- > > Key: YARN-10713 > URL: https://issues.apache.org/jira/browse/YARN-10713 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10713.001.patch, YARN-10713.002.patch > > > YARN-10688 > Only add gpu resource capacity related metrics, i think we should improve it > to support custom resources as [~ebadger] suggested. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10713) ClusterMetrics should support custom resource capacity related metrics.
[ https://issues.apache.org/jira/browse/YARN-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308817#comment-17308817 ] Eric Badger commented on YARN-10713: [~zhuqi], I very much appreciate the patches and am trying to review as quickly as possible. But the number of different patches going on concurrently is quite overwhelming. I will do my best to review them in a timely manner. > ClusterMetrics should support custom resource capacity related metrics. > --- > > Key: YARN-10713 > URL: https://issues.apache.org/jira/browse/YARN-10713 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10713.001.patch, YARN-10713.002.patch > > > YARN-10688 > Only add gpu resource capacity related metrics, i think we should improve it > to support custom resources as [~ebadger] suggested. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with custom resourceType.
[ https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308187#comment-17308187 ] Eric Badger commented on YARN-10503: I'm fine with moving the effort of removing hardcoded resource values (e.g. GPU/FPGA) to a follow-up JIRA. But only if that JIRA is going to be worked on. Because right now we are adding code debt with every hardcoded value we add to the code. > Support queue capacity in terms of absolute resources with custom > resourceType. > --- > > Key: YARN-10503 > URL: https://issues.apache.org/jira/browse/YARN-10503 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10503.001.patch, YARN-10503.002.patch, > YARN-10503.003.patch, YARN-10503.004.patch > > > Now the absolute resources are memory and cores. > {code:java} > /** > * Different resource types supported. > */ > public enum AbsoluteResourceType { > MEMORY, VCORES; > }{code} > But in our GPU production clusters, we need to support more resourceTypes. > It's very important for cluster scaling when there are absolute demands for > different resourceTypes. > > This Jira will handle GPU first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10493) RunC container repository v2
[ https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308180#comment-17308180 ] Eric Badger commented on YARN-10493: I did have a weird umask set. Reverting back to the default umask fixed the localizer errors that I posted above. However, the tool should probably explicitly specify 755 and 644 perms for the directories and files respectively > RunC container repository v2 > > > Key: YARN-10493 > URL: https://issues.apache.org/jira/browse/YARN-10493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Matthew Sharp >Priority: Major > Labels: pull-request-available > Attachments: runc-container-repository-v2-design.pdf > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current runc container repository design has scalability and usability > issues which will likely limit widespread adoption. We should address this > with a new, V2 layout. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
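The link between an unusual umask and the 600 permissions seen on the imported files can be shown with plain bit arithmetic. The sketch below is illustrative only: the umask value 077 is an assumption (the comment only says the umask was "weird"), and `UmaskSketch` is a hypothetical helper, not part of the import tool.

```java
// Sketch of how a umask clears bits from the requested permission mode.
final class UmaskSketch {
    // Effective mode = requested mode with every umask bit cleared.
    static int effectiveMode(int requested, int umask) {
        return requested & ~umask & 0777;
    }

    // Render a mode as three octal digits, e.g. 0644 -> "644".
    static String octal(int mode) {
        return String.format("%03o", mode);
    }

    public static void main(String[] args) {
        int assumedUmask = 0077; // assumption: a restrictive umask such as 077
        int commonUmask = 0022;  // the widely used 022 default

        // Files are typically requested as 666, directories as 777.
        System.out.println("files, umask 077: " + octal(effectiveMode(0666, assumedUmask)));
        System.out.println("dirs,  umask 077: " + octal(effectiveMode(0777, assumedUmask)));
        System.out.println("files, umask 022: " + octal(effectiveMode(0666, commonUmask)));
        System.out.println("dirs,  umask 022: " + octal(effectiveMode(0777, commonUmask)));
    }
}
```

Under that assumption the files end up as 600, which is not world-readable and so cannot go into the public cache. Setting 755 and 644 explicitly after creation, as suggested in the comment, would make the result independent of whatever umask the caller happens to have.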
[jira] [Commented] (YARN-10493) RunC container repository v2
[ https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308173#comment-17308173 ] Eric Badger commented on YARN-10493: Hmm, must be a default umask issue or something on my testing environment. After fixing the perms and a small container-executor.cfg issue, I've been able to successfully run a sleep job using the V2 plugins! > RunC container repository v2 > > > Key: YARN-10493 > URL: https://issues.apache.org/jira/browse/YARN-10493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Matthew Sharp >Priority: Major > Labels: pull-request-available > Attachments: runc-container-repository-v2-design.pdf > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current runc container repository design has scalability and usability > issues which will likely limit widespread adoption. We should address this > with a new, V2 layout. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10493) RunC container repository v2
[ https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308169#comment-17308169 ] Eric Badger commented on YARN-10493: {noformat} [ebadger@foo hadoop]$ hadoop fs -ls /runc-root/*/*/* WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX. -rw--- 10 ebadger supergroup 11166 2021-03-24 00:06 /runc-root/config/a9/a9a241e617577cf0da93c89010d0026de8327c8220c732f2ede29d2ce15588cf -rw--- 10 ebadger supergroup 4096 2021-03-24 00:06 /runc-root/layer/30/30e0953848a62b1477a41e6c09d2f781ace3e6cfb215f01be9dacabdd9480d90.sqsh -rw--- 10 ebadger supergroup156 2021-03-24 00:06 /runc-root/layer/30/30e0953848a62b1477a41e6c09d2f781ace3e6cfb215f01be9dacabdd9480d90.tar.gz -rw--- 10 ebadger supergroup 4096 2021-03-24 00:06 /runc-root/layer/71/71a0a977369767160e5ed5f8ab279558240eb19efa880725049c4651411c2225.sqsh -rw--- 10 ebadger supergroup185 2021-03-24 00:06 /runc-root/layer/71/71a0a977369767160e5ed5f8ab279558240eb19efa880725049c4651411c2225.tar.gz -rw--- 10 ebadger supergroup 26095616 2021-03-24 00:06 /runc-root/layer/72/726141ff510fe8ee7d540faa490649332a561f79ce9b5d02045f7e0db5e4cfbc.sqsh -rw--- 10 ebadger supergroup 26687036 2021-03-24 00:06 /runc-root/layer/72/726141ff510fe8ee7d540faa490649332a561f79ce9b5d02045f7e0db5e4cfbc.tar.gz -rw--- 10 ebadger supergroup 12288 2021-03-24 00:06 /runc-root/layer/8c/8c4f37442a65ac28bb23c8a0c408f1f2c061b8928abfaf8a40050ebae6130974.sqsh -rw--- 10 ebadger supergroup 12057 2021-03-24 00:06 /runc-root/layer/8c/8c4f37442a65ac28bb23c8a0c408f1f2c061b8928abfaf8a40050ebae6130974.tar.gz -rw--- 10 ebadger supergroup 98697216 2021-03-24 00:06 /runc-root/layer/98/98dc2361422a32ef978770b879dd1d0079242cc55980cfd2205939a6796d309f.sqsh -rw--- 10 ebadger supergroup 99451719 2021-03-24 00:06 /runc-root/layer/98/98dc2361422a32ef978770b879dd1d0079242cc55980cfd2205939a6796d309f.tar.gz -rw--- 10 ebadger supergroup 121638912 2021-03-24 00:06 
/runc-root/layer/d5/d5eed6e7343938620cc22b9b2db0d6a77e8de76ed1387e7b7288d034fd86a92d.sqsh -rw--- 10 ebadger supergroup 123724262 2021-03-24 00:06 /runc-root/layer/d5/d5eed6e7343938620cc22b9b2db0d6a77e8de76ed1387e7b7288d034fd86a92d.tar.gz -rw--- 10 ebadger supergroup 205000704 2021-03-24 00:06 /runc-root/layer/f6/f64d0ed7e0de1a2c56f69eaae5cfe8351b23cc2d600cc226557881862a5bdf5f.sqsh -rw--- 10 ebadger supergroup 205322058 2021-03-24 00:06 /runc-root/layer/f6/f64d0ed7e0de1a2c56f69eaae5cfe8351b23cc2d600cc226557881862a5bdf5f.tar.gz -rw--- 10 ebadger supergroup 1791 2021-03-24 00:06 /runc-root/manifest/f8/f849453b22d5e6a2e2f1390dc021cd2a786bcd923fffa9e778f3be6c87a0d3fe -rw--- 10 ebadger supergroup236 2021-03-24 00:06 /runc-root/meta/hadoop/rhel7@current.properties -rw--- 10 ebadger supergroup236 2021-03-24 00:15 /runc-root/meta/hadoop/rhel7@latest.properties -rw--- 10 ebadger supergroup236 2021-03-24 20:21 /runc-root/meta/library/rhel7@current.properties {noformat} For reference, the perms on all of the files are 600. > RunC container repository v2 > > > Key: YARN-10493 > URL: https://issues.apache.org/jira/browse/YARN-10493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Matthew Sharp >Priority: Major > Labels: pull-request-available > Attachments: runc-container-repository-v2-design.pdf > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current runc container repository design has scalability and usability > issues which will likely limit widespread adoption. We should address this > with a new, V2 layout. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10493) RunC container repository v2
[ https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308167#comment-17308167 ] Eric Badger commented on YARN-10493: {noformat} 2021-03-24 20:21:56,225 WARN [Public Localizer] localizer.LocalResourcesTrackerImpl (LocalResourcesTrackerImpl.java:handle(184)) - Received LOCALIZATION_FAILED event for request { /runc-root/layer/f6/f64d0ed7e0de1a2c56f69eaae5cfe8351b23cc2d600cc226557881862a5bdf5f.sqsh, 1616544364679, FILE, null } but localized resource is missing 2021-03-24 20:21:56,226 ERROR [Public Localizer] localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(1004)) - Failed to download resource { { /runc-root/layer/f6/f64d0ed7e0de1a2c56f69eaae5cfe8351b23cc2d600cc226557881862a5bdf5f.sqsh, 1616544364679, FILE, null },pending,[],50806868248172187,DOWNLOADING} java.io.IOException: Resource /runc-root/layer/f6/f64d0ed7e0de1a2c56f69eaae5cfe8351b23cc2d600cc226557881862a5bdf5f.sqsh is not publicly accessible and as such cannot be part of the public cache. 2021-03-24 20:21:56,227 WARN [Public Localizer] localizer.LocalResourcesTrackerImpl (LocalResourcesTrackerImpl.java:handle(184)) - Received LOCALIZATION_FAILED event for request { /runc-root/layer/d5/d5eed6e7343938620cc22b9b2db0d6a77e8de76ed1387e7b7288d034fd86a92d.sqsh, 1616544363068, FILE, null } but localized resource is missing 2021-03-24 20:21:56,227 ERROR [Public Localizer] localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(1004)) - Failed to download resource { { /runc-root/layer/d5/d5eed6e7343938620cc22b9b2db0d6a77e8de76ed1387e7b7288d034fd86a92d.sqsh, 1616544363068, FILE, null },pending,[],50806868248209381,DOWNLOADING} java.io.IOException: Resource /runc-root/layer/d5/d5eed6e7343938620cc22b9b2db0d6a77e8de76ed1387e7b7288d034fd86a92d.sqsh is not publicly accessible and as such cannot be part of the public cache. 
2021-03-24 20:21:56,227 WARN [Public Localizer] localizer.LocalResourcesTrackerImpl (LocalResourcesTrackerImpl.java:handle(184)) - Received LOCALIZATION_FAILED event for request { /runc-root/layer/30/30e0953848a62b1477a41e6c09d2f781ace3e6cfb215f01be9dacabdd9480d90.sqsh, 1616544367971, FILE, null } but localized resource is missing 2021-03-24 20:21:56,227 ERROR [Public Localizer] localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(1004)) - Failed to download resource { { /runc-root/layer/30/30e0953848a62b1477a41e6c09d2f781ace3e6cfb215f01be9dacabdd9480d90.sqsh, 1616544367971, FILE, null },pending,[],50806868248366695,DOWNLOADING} java.io.IOException: Resource /runc-root/layer/30/30e0953848a62b1477a41e6c09d2f781ace3e6cfb215f01be9dacabdd9480d90.sqsh is not publicly accessible and as such cannot be part of the public cache. 2021-03-24 20:21:56,228 WARN [Public Localizer] localizer.LocalResourcesTrackerImpl (LocalResourcesTrackerImpl.java:handle(184)) - Received LOCALIZATION_FAILED event for request { /runc-root/layer/71/71a0a977369767160e5ed5f8ab279558240eb19efa880725049c4651411c2225.sqsh, 1616544367125, FILE, null } but localized resource is missing 2021-03-24 20:21:56,228 ERROR [Public Localizer] localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(1004)) - Failed to download resource { { /runc-root/layer/71/71a0a977369767160e5ed5f8ab279558240eb19efa880725049c4651411c2225.sqsh, 1616544367125, FILE, null },pending,[],50806868248376369,DOWNLOADING} java.io.IOException: Resource /runc-root/layer/71/71a0a977369767160e5ed5f8ab279558240eb19efa880725049c4651411c2225.sqsh is not publicly accessible and as such cannot be part of the public cache. 
2021-03-24 20:21:56,228 WARN [Public Localizer] localizer.LocalResourcesTrackerImpl (LocalResourcesTrackerImpl.java:handle(184)) - Received LOCALIZATION_FAILED event for request { /runc-root/config/a9/a9a241e617577cf0da93c89010d0026de8327c8220c732f2ede29d2ce15588cf, 1616544368404, FILE, null } but localized resource is missing {noformat} Building and running with an image under the default "library" namespace, I run into these permission errors. The errors are pretty clear, but is this something that you run into in your production environments? Is there a setup step that is necessary before or after running the CLI tool to fix the perms on the files? > RunC container repository v2 > > > Key: YARN-10493 > URL: https://issues.apache.org/jira/browse/YARN-10493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Matthew Sharp >Priority: Major > Labels: pull-request-available > Attachments: runc-container-repository-v2-design.pdf > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current runc
[jira] [Commented] (YARN-10493) RunC container repository v2
[ https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308165#comment-17308165 ] Eric Badger commented on YARN-10493: Yea, I think that would be a good improvement to the plugin implementation. > RunC container repository v2 > > > Key: YARN-10493 > URL: https://issues.apache.org/jira/browse/YARN-10493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Matthew Sharp >Priority: Major > Labels: pull-request-available > Attachments: runc-container-repository-v2-design.pdf > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current runc container repository design has scalability and usability > issues which will likely limit widespread adoption. We should address this > with a new, V2 layout. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10493) RunC container repository v2
[ https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307481#comment-17307481 ] Eric Badger commented on YARN-10493: Additionally, I've run into some issues while testing. {noformat:title=CLI Invocation} hadoop jar ./hadoop-tools/hadoop-runc/target/hadoop-runc-3.4.0-SNAPSHOT.jar org.apache.hadoop.runc.tools.ImportDockerImage -r docker.foobar.com: hadoop-images/hadoop/rhel7 hadoop/rhel7 {noformat} {noformat} [ebadger@foo hadoop]$ hadoop fs -ls /runc-root/meta/hadoop/rhel7@latest.properties -rw--- 10 ebadger supergroup236 2021-03-24 00:15 /runc-root/meta/hadoop/rhel7@latest.properties {noformat} Here's the properties file after the CLI tool completes. {noformat} yarn.nodemanager.runtime.linux.runc.image-tag-to-manifest-plugin org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.runc.ImageTagToManifestV2Plugin yarn.nodemanager.runtime.linux.runc.manifest-to-resources-plugin org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.runc.HdfsManifestToResourcesV2Plugin {noformat} Then I set these properties as well as adding {{runc}} to the allowed-runtimes config. {noformat} export vars="YARN_CONTAINER_RUNTIME_TYPE=runc,YARN_CONTAINER_RUNTIME_RUNC_IMAGE=hadoop/rhel7"; $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.*-tests.jar sleep -Dyarn.app.mapreduce.am.env="HADOOP_MAPRED_HOME=$HADOOP_HOME" -Dmapreduce.admin.user.env="HADOOP_MAPRED_HOME=$HADOOP_HOME" -Dyarn.app.mapreduce.am.env=$vars -Dmapreduce.map.env=$vars -Dmapreduce.reduce.env=$vars -mt 1 -rt 1 -m 1 -r 1 {noformat} I ran a sleep job using this command. 
{noformat} 2021-03-24 00:26:07,823 DEBUG [NM ContainerManager dispatcher] runc.ImageTagToManifestV2Plugin (ImageTagToManifestV2Plugin.java:getHdfsImageToHashReader(144)) - Checking HDFS for image file: /runc-root/meta/library/hadoop/rhel7@latest.properties 2021-03-24 00:26:07,825 WARN [NM ContainerManager dispatcher] runc.ImageTagToManifestV2Plugin (ImageTagToManifestV2Plugin.java:getHdfsImageToHashReader(148)) - Did not load the hdfs image to hash properties file, file doesn't exist 2021-03-24 00:26:07,828 WARN [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:transition(1261)) - Failed to parse resource-request java.io.FileNotFoundException: File does not exist: /runc-root/manifest/ha/hadoop/rhel7 {noformat} Then I got this error in the NM when it was trying to resolve the tag. It added the default {{metaNamespaceDir}} (which is library) into the path when looking for the properties file. But when the CLI tool ran, it didn't add the {{metaNamespaceDir}}. I didn't have the config set in my configs at all, so the NM was using the conf default. I'm not sure if I did anything wrong here or not, but it seems inconsistent to me. Let me know what you think > RunC container repository v2 > > > Key: YARN-10493 > URL: https://issues.apache.org/jira/browse/YARN-10493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Matthew Sharp >Priority: Major > Labels: pull-request-available > Attachments: runc-container-repository-v2-design.pdf > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current runc container repository design has scalability and usability > issues which will likely limit widespread adoption. We should address this > with a new, V2 layout. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
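The mismatch described above comes down to two different ways of building the properties path. The following sketch reproduces the two observed paths and shows one possible consistent rule. It is not the plugin or CLI code: `MetaPathSketch` and its methods are hypothetical, and the rule in `resolve` (prepend the default namespace only when the image name has no namespace of its own) is just an assumption about what a fix could look like.

```java
// Hypothetical path helpers; the real plugin and CLI code may differ.
final class MetaPathSketch {
    static final String META_ROOT = "/runc-root/meta";

    // What the import tool was observed to write: no namespace prefix.
    static String withoutNamespace(String image, String tag) {
        return META_ROOT + "/" + image + "@" + tag + ".properties";
    }

    // What the NM was observed to look up: default namespace always prepended.
    static String withDefaultNamespace(String namespace, String image, String tag) {
        return META_ROOT + "/" + namespace + "/" + image + "@" + tag + ".properties";
    }

    // One possible shared rule (an assumption): only add the default namespace
    // when the image name does not already carry one.
    static String resolve(String defaultNamespace, String image, String tag) {
        String qualified = image.contains("/") ? image : defaultNamespace + "/" + image;
        return META_ROOT + "/" + qualified + "@" + tag + ".properties";
    }

    public static void main(String[] args) {
        System.out.println(withoutNamespace("hadoop/rhel7", "latest"));
        System.out.println(withDefaultNamespace("library", "hadoop/rhel7", "latest"));
        System.out.println(resolve("library", "hadoop/rhel7", "latest"));
        System.out.println(resolve("library", "rhel7", "latest"));
    }
}
```

Whatever rule is chosen, the point of the sketch is that the writer and the reader should share a single resolution function so they cannot drift apart.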
[jira] [Commented] (YARN-10493) RunC container repository v2
[ https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307467#comment-17307467 ] Eric Badger commented on YARN-10493: [~MatthewSharp], thanks for the PR. Just starting to take a look at this now. I am wondering if the document is still up to date though. Is the PR you put up still a good reflection of what's in the document? Just want to make sure > RunC container repository v2 > > > Key: YARN-10493 > URL: https://issues.apache.org/jira/browse/YARN-10493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Matthew Sharp >Priority: Major > Labels: pull-request-available > Attachments: runc-container-repository-v2-design.pdf > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current runc container repository design has scalability and usability > issues which will likely limit widespread adoption. We should address this > with a new, V2 layout. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10517) QueueMetrics has incorrect Allocated Resource when labelled partitions updated
[ https://issues.apache.org/jira/browse/YARN-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307457#comment-17307457 ] Eric Badger commented on YARN-10517: [~epayne], this change looks reasonable to me, but I'd like to get an extra pair of eyes on it as it has to do with scheduler internals > QueueMetrics has incorrect Allocated Resource when labelled partitions updated > -- > > Key: YARN-10517 > URL: https://issues.apache.org/jira/browse/YARN-10517 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.0, 3.3.0 >Reporter: sibyl.lv >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10517-branch-3.2.001.patch, YARN-10517.001.patch, > wrong metrics.png > > > After https://issues.apache.org/jira/browse/YARN-9596, QueueMetrics still has > incorrect allocated jmx, such as allocatedMB, > allocatedVCores and > allocatedContainers, when the node partition is > updated from "DEFAULT" to other label and there are running applications. > Steps to reproduce > == > # Configure capacity-scheduler.xml with label configuration > # Submit one application to default partition and run > # Add label "tpcds" to cluster and replace label on node1 and node2 to be > "tpcds" when the above application is running > # Note down "VCores Used" at Web UI > # When the application is finished, the metrics get wrong (screenshots > attached). > == > > FiCaSchedulerApp doesn't update queue metrics when CapacityScheduler handles > this event NODE_LABELS_UPDATE. > So we should release container resource from old partition and add used > resource to new partition, just as updating queueUsage. 
> {code:java} > // code placeholder > public void nodePartitionUpdated(RMContainer rmContainer, String oldPartition, > String newPartition) { > Resource containerResource = rmContainer.getAllocatedResource(); > this.attemptResourceUsage.decUsed(oldPartition, containerResource); > this.attemptResourceUsage.incUsed(newPartition, containerResource); > getCSLeafQueue().decUsedResource(oldPartition, containerResource, this); > getCSLeafQueue().incUsedResource(newPartition, containerResource, this); > // Update new partition name if container is AM and also update AM resource > if (rmContainer.isAMContainer()) { > setAppAMNodePartitionName(newPartition); > this.attemptResourceUsage.decAMUsed(oldPartition, containerResource); > this.attemptResourceUsage.incAMUsed(newPartition, containerResource); > getCSLeafQueue().decAMUsedResource(oldPartition, containerResource, this); > getCSLeafQueue().incAMUsedResource(newPartition, containerResource, this); > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
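The accounting problem behind YARN-10517 can be illustrated without any scheduler classes. The sketch below is a simplified model, not the QueueMetrics implementation: `PartitionUsageSketch` and its per-partition map are hypothetical, and container resources are reduced to a single memory figure.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-partition allocation counter; not the real QueueMetrics.
final class PartitionUsageSketch {
    private final Map<String, Long> allocatedMb = new HashMap<>();

    void allocate(String partition, long mb) {
        allocatedMb.merge(partition, mb, Long::sum);
    }

    void release(String partition, long mb) {
        allocatedMb.merge(partition, -mb, Long::sum);
    }

    long allocated(String partition) {
        return allocatedMb.getOrDefault(partition, 0L);
    }

    // Move a running container's resources when its node changes partition.
    void nodePartitionUpdated(String oldPartition, String newPartition, long containerMb) {
        release(oldPartition, containerMb);
        allocate(newPartition, containerMb);
    }

    public static void main(String[] args) {
        // Without the move: the release lands on the new partition only.
        PartitionUsageSketch broken = new PartitionUsageSketch();
        broken.allocate("", 2048);      // container starts in the default partition
        broken.release("tpcds", 2048);  // node was relabelled before the container finished
        System.out.println(broken.allocated("") + " / " + broken.allocated("tpcds"));

        // With the move: both partitions return to zero.
        PartitionUsageSketch fixed = new PartitionUsageSketch();
        fixed.allocate("", 2048);
        fixed.nodePartitionUpdated("", "tpcds", 2048);
        fixed.release("tpcds", 2048);
        System.out.println(fixed.allocated("") + " / " + fixed.allocated("tpcds"));
    }
}
```

In the first run the default partition keeps 2048 MB allocated forever while the new partition goes negative, which is the kind of drift the issue reports once the application finishes.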
[jira] [Commented] (YARN-10707) Support gpu in ResourceUtilization, and update Node GPU Utilization to use.
[ https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307427#comment-17307427 ] Eric Badger commented on YARN-10707: Similar to my [comment|https://issues.apache.org/jira/browse/YARN-10503?focusedCommentId=17307421&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17307421] on YARN-10503, I believe that the approach we should take here should allow for arbitrary resources, not hardcoded for GPUs. It's a lot of work to make GPUs a first class resource, but should only be a little more work in addition to make arbitrary resources (which can include GPUs) a first class resource. > Support gpu in ResourceUtilization, and update Node GPU Utilization to use. > --- > > Key: YARN-10707 > URL: https://issues.apache.org/jira/browse/YARN-10707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10707.001.patch, YARN-10707.002.patch, > YARN-10707.003.patch > > > Support gpu in ResourceUtilization, and update Node GPU Utilization to use > first. > It will be very helpful for other use cases about GPU utilization. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with custom resourceType.
[ https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307421#comment-17307421 ] Eric Badger commented on YARN-10503: bq. Do we want to treat GPUs and FPGAs like that? In other parts of the code, we have mem/vcore as primary resources, then an array of other resources. I believe the correct approach is to leave memory and vcores as "first class" resources and then add on logic to add arbitrary extended resources, such as GPU or FPGA. The arbitrary extended resources should not be hardcoded values. The point is that we're doing the work right now to support GPUs. But in 2 years if some new resource needs to be tracked and used, we don't want to have to redo all of this work again. We should make sure that our work right here is extended to any future arbitrary resources > Support queue capacity in terms of absolute resources with custom > resourceType. > --- > > Key: YARN-10503 > URL: https://issues.apache.org/jira/browse/YARN-10503 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10503.001.patch, YARN-10503.002.patch, > YARN-10503.003.patch > > > Now the absolute resources are memory and cores. > {code:java} > /** > * Different resource types supported. > */ > public enum AbsoluteResourceType { > MEMORY, VCORES; > }{code} > But in our GPU production clusters, we need to support more resourceTypes. > It's very important for cluster scaling when there are absolute demands for > different resourceTypes. > > This Jira will handle GPU first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
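The design argued for in this comment (memory and vcores as first-class fields, everything else in an open-ended collection) might look roughly like the sketch below. It is not taken from any of the attached patches: `AbsoluteResourceSketch`, the bracketed `name=value` input format used here, and the resource name `yarn.io/gpu` in the example are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder: two first-class resources plus arbitrary named ones.
final class AbsoluteResourceSketch {
    final long memoryMb;
    final int vcores;
    final Map<String, Long> extended; // e.g. GPU, FPGA, or anything added later

    private AbsoluteResourceSketch(long memoryMb, int vcores, Map<String, Long> extended) {
        this.memoryMb = memoryMb;
        this.vcores = vcores;
        this.extended = extended;
    }

    // Parses a spec such as "[memory=10240,vcores=12,yarn.io/gpu=4]" (assumed format).
    static AbsoluteResourceSketch parse(String spec) {
        String body = spec.trim();
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        long memory = 0;
        int vcores = 0;
        Map<String, Long> extended = new LinkedHashMap<>();
        for (String part : body.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Bad entry: " + part);
            }
            String name = kv[0].trim();
            long value = Long.parseLong(kv[1].trim());
            switch (name) {
                case "memory":
                    memory = value;
                    break;
                case "vcores":
                    vcores = (int) value;
                    break;
                default:
                    // No per-resource enum constant needed: unknown names are kept as-is.
                    extended.put(name, value);
            }
        }
        return new AbsoluteResourceSketch(memory, vcores, extended);
    }

    public static void main(String[] args) {
        AbsoluteResourceSketch r = parse("[memory=10240,vcores=12,yarn.io/gpu=4]");
        System.out.println(r.memoryMb + " MB, " + r.vcores + " vcores, extended=" + r.extended);
    }
}
```

The relevant design choice is the `default` branch: a new resource type needs no code change in the parser, which is the property the comment asks for.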
[jira] [Comment Edited] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307413#comment-17307413 ] Eric Badger edited comment on YARN-9618 at 3/23/21, 8:52 PM: - bq. Actually, why we use an other async dispatcher here is try to make the rmDispatcher#eventQueue not boom to affect other event process. The boom will transformed to nodeListManagerDispatcher#eventQueue. I think [~gandras]'s point is that all of the events are going to go through {{rmDispatcher}} either way. Without the proposed change, {{rmDispatcher}} will get the event in the eventQueue and will also do the processing. With this proposed change, {{rmDispatcher}} will get the event and then it will copy it over to {{nodeListManagerDispatcher}}. Then {{nodeListManagerDispatcher}} will do the processing. But in both cases, {{rmDispatcher}} is dealing with {{RMAppNodeUpdateEvent}} in some way. So the question is whether copying the event or processing the event takes more time. If copying the event takes more time than processing the event, then this change only makes things worse. If processing the event takes more time than copying the event to the new async dispatcher, then this change makes sense and will remove some load on the {{rmDispatcher}}. [~gandras], is that right? was (Author: ebadger): bq. Actually, why we use an other async dispatcher here is try to make the rmDispatcher#eventQueue not boom to affect other event process. The boom will transformed to nodeListManagerDispatcher#eventQueue. I think [~gandras]'s point is that all of the events are going to go through {{rmDispatcher}} either way. Without the proposed change, {{rmDispatcher}} will get the event in the eventQueue and will also do the processing. With this proposed change, {{rmDispatcher}} will get the event and then it will copy it over to {{nodeListManagerDispatcher}}. Then {{nodeListManagerDispatcher}} will do the processing. 
But in both cases, {{rmDispatcher}} is dealing with {{RMAppNodeUpdateEvent}}s in some way. So the question is whether copying the event or processing the event takes more time. If copying the event takes more time than processing the event, then this change only makes things worse. If processing the event takes more time than copying the event to the new async dispatcher, then this change makes sense and will remove some load on the {{rmDispatcher}}. [~gandras], is that right? > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch > > > Current implementation nodelistmanager event blocks async dispacher and can > cause RM crash and slowing down event processing. > # Cluster restart with 1K running apps . Each usable event will create 1K > events over all events could be 5k*1k events for 5K cluster > # Event processing is blocked till new events are added to queue. > Solution : > # Add another async Event handler similar to scheduler. > # Instead of adding events to dispatcher directly call RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307413#comment-17307413 ] Eric Badger commented on YARN-9618: --- bq. Actually, why we use an other async dispatcher here is try to make the rmDispatcher#eventQueue not boom to affect other event process. The boom will transformed to nodeListManagerDispatcher#eventQueue. I think [~gandras]'s point is that all of the events are going to go through {{rmDispatcher}} either way. Without the proposed change, {{rmDispatcher}} will get the event in the eventQueue and will also do the processing. With this proposed change, {{rmDispatcher}} will get the event and then it will copy it over to {{nodeListManagerDispatcher}}. Then {{nodeListManagerDispatcher}} will do the processing. But in both cases, {{rmDispatcher}} is dealing with {{RMAppNodeUpdateEvent}}s in some way. So the question is whether copying the event or processing the event takes more time. If copying the event takes more time than processing the event, then this change only makes things worse. If processing the event takes more time than copying the event to the new async dispatcher, then this change makes sense and will remove some load on the {{rmDispatcher}}. [~gandras], is that right? > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch > > > Current implementation nodelistmanager event blocks async dispacher and can > cause RM crash and slowing down event processing. > # Cluster restart with 1K running apps . Each usable event will create 1K > events over all events could be 5k*1k events for 5K cluster > # Event processing is blocked till new events are added to queue. 
> Solution : > # Add another async Event handler similar to scheduler. > # Instead of adding events to dispatcher directly call RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
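The trade-off discussed in the comment above (the main dispatcher either processes an event itself or only hands it off) can be sketched as follows. This is a minimal illustration, not the patch attached to YARN-9618: the class {{HandOffDispatcher}} and its methods are hypothetical names, and a single-threaded {{ExecutorService}} stands in for the second async dispatcher.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical stand-in for a secondary async dispatcher. The caller's thread
// (playing the role of rmDispatcher) only pays the cost of enqueueing the
// event; the handler runs on the worker thread.
class HandOffDispatcher<E> {
  private final ExecutorService worker = Executors.newSingleThreadExecutor();
  private final Consumer<E> handler;

  HandOffDispatcher(Consumer<E> handler) {
    this.handler = handler;
  }

  // Cheap on the caller's thread: wraps the event in a task and queues it.
  void handle(E event) {
    worker.execute(() -> handler.accept(event));
  }

  // Stops accepting events and waits for the queued ones to be processed.
  // Returns false if the timeout elapsed or the wait was interrupted.
  boolean shutdownAndWait(long timeoutMillis) {
    worker.shutdown();
    try {
      return worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

Whether this helps depends on the question raised above: the hand-off only pays off if {{handle}} is cheaper than running the handler inline, which is something stress testing would have to show.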
[jira] [Commented] (YARN-10704) The CS effective capacity for absolute mode in UI should support GPU and other custom resources.
[ https://issues.apache.org/jira/browse/YARN-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307379#comment-17307379 ] Eric Badger commented on YARN-10704: I'm not very familiar with the new YARN UI v2. Will this change automatically apply to both UIs? Or do we need to add extra stuff for it to be supported in both? > The CS effective capacity for absolute mode in UI should support GPU and > other custom resources. > > > Key: YARN-10704 > URL: https://issues.apache.org/jira/browse/YARN-10704 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10704.001.patch, YARN-10704.002.patch, > YARN-10704.003.patch, image-2021-03-19-12-05-28-412.png, > image-2021-03-19-12-08-35-273.png > > > Currently there is no information about the effective capacity of GPUs in the > UI for absolute resource mode. > !image-2021-03-19-12-05-28-412.png|width=873,height=136! > But we have this information in QueueMetrics: > !image-2021-03-19-12-08-35-273.png|width=613,height=268! > > This is very important for our GPU users in absolute mode, yet there is still > no way to see absolute GPU information in the CS Queue UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10701) The yarn.resource-types should support multi types without trimmed.
[ https://issues.apache.org/jira/browse/YARN-10701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10701: --- Fix Version/s: 3.3.1 3.4.0 +1. Thanks for the patch, [~zhuqi]. I've committed this to trunk (3.4) and branch-3.3 > The yarn.resource-types should support multi types without trimmed. > --- > > Key: YARN-10701 > URL: https://issues.apache.org/jira/browse/YARN-10701 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10701.001.patch, YARN-10701.002.patch > > > {code:java} > <property> > <name>yarn.resource-types</name> > <value>yarn.io/gpu, yarn.io/fpga</value> > </property> > {code} > When I configured the resource types above with gpu and fpga, this error > occurred: > > {code:java} > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: ' yarn.io/fpga' is > not a valid resource name. A valid resource name must begin with a letter and > contain only letters, numbers, and any of: '.', '_', or '-'. A valid resource > name may also be optionally preceded by a name space followed by a slash. A > valid name space consists of period-separated groups of letters, numbers, and > dashes.{code} > > The resource type names should be trimmed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
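The error above shows the name being rejected with its leading space intact (' yarn.io/fpga'), which is what a plain split on commas produces. A minimal sketch of the trimming behavior the issue asks for is below. It is not the committed patch; {{ResourceTypeList}} and {{parse}} are hypothetical names used only for illustration, and the sketch does not depend on any Hadoop class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a comma-separated yarn.resource-types value
// and trims surrounding whitespace from each entry before validation.
class ResourceTypeList {
  static List<String> parse(String rawValue) {
    List<String> names = new ArrayList<>();
    if (rawValue == null) {
      return names;
    }
    for (String part : rawValue.split(",")) {
      String name = part.trim();
      // Skip empty entries such as those produced by a trailing comma.
      if (!name.isEmpty()) {
        names.add(name);
      }
    }
    return names;
  }
}
```

With this, the value "yarn.io/gpu, yarn.io/fpga" yields the two names without the leading space that triggered the exception.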
[jira] [Comment Edited] (YARN-10616) Nodemanagers cannot detect GPU failures
[ https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304456#comment-17304456 ] Eric Badger edited comment on YARN-10616 at 3/18/21, 9:22 PM: -- The issue with graceful decommissioning is that you have to edit a file on the RM. It would be nice to be able to run a {{yarn rmadmin}} command from a remote host to tell the RM to graceful decom a node. AFAIK that functionality doesn't exist. I still don't like the idea of completely undermining {{-updateNodeResource}}. I think I would be more on board with a feature that is disabled by default, but can be enabled. That way we won't break any existing ways of doing things, but will give more flexibility to those who want to detect these types of failures. They will just have to understand that it isn't compatible with {{-updateNodeResource}} was (Author: ebadger): The issue with graceful decommissioning is that you have to edit a file on the RM. It would be nice to be able to run a `yarn rmadmin` command from a remote host to tell the RM to graceful decom a node. AFAIK that functionality doesn't exist. I still don't like the idea of completely undermining {{-updateNodeResource}}. I think I would be more on board with a feature that is disabled by default, but can be enabled. That way we won't break any existing ways of doing things, but will give more flexibility to those who want to detect these types of failures. They will just have to understand that it isn't compatible with {{-updateNodeResource}} > Nodemanagers cannot detect GPU failures > --- > > Key: YARN-10616 > URL: https://issues.apache.org/jira/browse/YARN-10616 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > As stated above, the bug is that GPUs can fail, but the NM doesn't notice the > failure. 
The NM will continue to schedule tasks onto the failed GPU, but the > GPU won't actually work and so the container will likely fail or run very > slowly on the CPU. > My initial thought on solving this is to add NM resource capabilities to the > NM-RM heartbeat and have the RM update its view of the NM's resource > capabilities on each heartbeat. This would be a fairly trivial change, but > comes with the unfortunate side effect that it completely undermines {{yarn > rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}} the > assumption is that the node will retain these new resource capabilities until > either the NM or RM is restarted. But with a heartbeat interaction constantly > updating those resource capabilities from the NM perspective, the explicit > changes via {{-updateNodeResource}} would be lost on the next heartbeat. We > could potentially add a flag to ignore the heartbeat updates for any node that > has had {{-updateNodeResource}} called on it (until a re-registration). But > in this case, the node would no longer get resource capability updates until > the NM or RM restarted. If {{-updateNodeResource}} is used a decent amount, > then that would give potentially unexpected behavior in relation to nodes > properly auto-detecting failures. > Another idea is to add a GPU monitor thread on the NM to periodically run > {{nvidia-smi}} and detect changes in the number of healthy GPUs. If that > number decreased, the node would hook into the health check status and mark > itself as unhealthy. The downside of this approach is that a single failed > GPU would mean taking out an entire node (e.g. 8 GPUs). > I would really like to go with the NM-RM heartbeat approach, but the > {{-updateNodeResource}} issue bothers me. The second approach is ok I guess, > but I also don't like taking down whole GPU nodes when only a single GPU is > bad. 
Would like to hear thoughts of others on how best to approach this > [~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10616) Nodemanagers cannot detect GPU failures
[ https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304456#comment-17304456 ] Eric Badger commented on YARN-10616: The issue with graceful decommissioning is that you have to edit a file on the RM. It would be nice to be able to run a `yarn rmadmin` command from a remote host to tell the RM to graceful decom a node. AFAIK that functionality doesn't exist. I still don't like the idea of completely undermining {{-updateNodeResource}}. I think I would be more on board with a feature that is disabled by default, but can be enabled. That way we won't break any existing ways of doing things, but will give more flexibility to those who want to detect these types of failures. They will just have to understand that it isn't compatible with {{-updateNodeResource}} > Nodemanagers cannot detect GPU failures > --- > > Key: YARN-10616 > URL: https://issues.apache.org/jira/browse/YARN-10616 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > As stated above, the bug is that GPUs can fail, but the NM doesn't notice the > failure. The NM will continue to schedule tasks onto the failed GPU, but the > GPU won't actually work and so the container will likely fail or run very > slowly on the CPU. > My initial thought on solving this is to add NM resource capabilities to the > NM-RM heartbeat and have the RM update its view of the NM's resource > capabilities on each heartbeat. This would be a fairly trivial change, but > comes with the unfortunate side effect that it completely undermines {{yarn > rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}} the > assumption is that the node will retain these new resource capabilities until > either the NM or RM is restarted. 
But with a heartbeat interaction constantly > updating those resource capabilities from the NM perspective, the explicit > changes via {{-updateNodeResource}} would be lost on the next heartbeat. We > could potentially add a flag to ignore the heartbeat updates for any node who > has had {{-updateNodeResource}} called on it (until a re-registration). But > in this case, the node would no longer get resource capability updates until > the NM or RM restarted. If {{-updateNodeResource}} is used a decent amount, > then that would give potentially unexpected behavior in relation to nodes > properly auto-detecting failures. > Another idea is to add a GPU monitor thread on the NM to periodically run > {{nvidia-smi}} and detect changes in the number of healthy GPUs. If that > number decreased, the node would hook into the health check status and mark > itself as unhealthy. The downside of this approach is that a single failed > GPU would mean taking out an entire node (e.g. 8 GPUs). > I would really like to go with the NM-RM heartbeat approach, but the > {{-updateNodeResource}} issue bothers me. The second approach is ok I guess, > but I also don't like taking down whole GPU nodes when only a single GPU is > bad. Would like to hear thoughts of others on how best to approach this > [~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
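The interaction between heartbeat updates and {{-updateNodeResource}} described in the issue can be modeled in a few lines. This is a toy sketch only, not RM code: {{NodeCapabilityTracker}} and its methods are hypothetical, and GPU count is the only resource tracked. It implements the "ignore heartbeat updates after an admin override, until re-registration" flag mentioned above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the RM's view of per-node GPU capacity.
class NodeCapabilityTracker {
  private final Map<String, Integer> gpusByNode = new HashMap<>();
  private final Set<String> adminOverridden = new HashSet<>();

  // (Re-)registration clears any earlier admin override.
  void register(String node, int gpus) {
    adminOverridden.remove(node);
    gpusByNode.put(node, gpus);
  }

  // Models an explicit admin update: the value sticks until re-registration.
  void adminUpdate(String node, int gpus) {
    adminOverridden.add(node);
    gpusByNode.put(node, gpus);
  }

  // Heartbeat-reported capacity is ignored for nodes with an admin override.
  void heartbeat(String node, int reportedGpus) {
    if (!adminOverridden.contains(node)) {
      gpusByNode.put(node, reportedGpus);
    }
  }

  int gpus(String node) {
    return gpusByNode.getOrDefault(node, 0);
  }
}
```

The sketch also makes the stated downside visible: once {{adminUpdate}} has been called, a later heartbeat reporting a failed GPU has no effect until the node registers again.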
[jira] [Commented] (YARN-10495) make the rpath of container-executor configurable
[ https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304333#comment-17304333 ] Eric Badger commented on YARN-10495: I would suggest using a dockerfile with the same OS version as what you plan to run on > make the rpath of container-executor configurable > - > > Key: YARN-10495 > URL: https://issues.apache.org/jira/browse/YARN-10495 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10495.001.patch, YARN-10495.002.patch > > > In https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on > crypto to container-executor. We hit a case where our jenkins machine has > libcrypto.so.1.0.0 in its shared lib env, but our nodemanager machine > doesn't have libcrypto.so.1.0.0, only *libcrypto.so.1.1.* > We use an internal custom dynamic link library environment > /usr/lib/x86_64-linux-gnu > and we build hadoop with the parameters below > {code:java} > -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu > {code} > > On the jenkins machine, the shared library path /usr/lib/x86_64-linux-gnu (where > libcrypto is) contains: > {code:java} > -rw-r--r-- 1 root root 240136 Nov 28 2014 libcroco-0.6.so.3.0.1 > -rw-r--r-- 1 root root 54550 Jun 18 2017 libcrypt.a > -rw-r--r-- 1 root root 4306444 Sep 26 2019 libcrypto.a > lrwxrwxrwx 1 root root 18 Sep 26 2019 libcrypto.so -> > libcrypto.so.1.0.0 > -rw-r--r-- 1 root root 2070976 Sep 26 2019 libcrypto.so.1.0.0 > lrwxrwxrwx 1 root root 35 Jun 18 2017 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 Jun 18 2017 libc.so > {code} > > On the nodemanager machine, the shared library path /usr/lib/x86_64-linux-gnu (where > libcrypto is) contains: > {code:java} > -rw-r--r-- 1 root root 55852 2月 7 2019 libcrypt.a > -rw-r--r-- 1 root root 4864244 9月 28 2019 libcrypto.a > lrwxrwxrwx 1 root root 16 9月 28 2019 libcrypto.so 
-> > libcrypto.so.1.1 > -rw-r--r-- 1 root root 2504576 12月 24 2019 libcrypto.so.1.0.2 > -rw-r--r-- 1 root root 2715840 9月 28 2019 libcrypto.so.1.1 > lrwxrwxrwx 1 root root 35 2月 7 2019 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 2月 7 2019 libc.so > {code} > We build container-executor with > The libcrypto.so versions differ, which causes an error when we start the nodemanager > > {code:java} > .. 3 more Caused by: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: > error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared > object file: No such file or directory at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306) > ... 4 more Caused by: ExitCodeException exitCode=127: > /home/hadoop/hadoop/bin/container-executor: error while loading shared > libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file > or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at > org.apache.hadoop.util.Shell.run(Shell.java:901) at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154) > ... 
6 more > {code} > > We should make RPATH of container-executor configurable to solve this problem -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10703) Fix potential null pointer error of gpuNodeResourceUpdateHandler in NodeResourceMonitorImpl.
[ https://issues.apache.org/jira/browse/YARN-10703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10703: --- Fix Version/s: 3.3.1 I've also committed this to branch-3.3. This has now been committed to trunk (3.4) and branch-3.3 > Fix potential null pointer error of gpuNodeResourceUpdateHandler in > NodeResourceMonitorImpl. > > > Key: YARN-10703 > URL: https://issues.apache.org/jira/browse/YARN-10703 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10703.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.
[ https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10692: --- Fix Version/s: 3.3.1 I cherry-picked this to branch-3.3. I would like all of the GPU stuff to go back to 3.3 if the cherry-picks are clean. This has now been committed to trunk (3.4) and branch-3.3 > Add Node GPU Utilization and apply to NodeMetrics. > -- > > Key: YARN-10692 > URL: https://issues.apache.org/jira/browse/YARN-10692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10692.001.patch, YARN-10692.002.patch, > YARN-10692.003.patch > > > Currently there is no node-level GPU utilization metric. This issue will add it, starting > with NodeMetrics. > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10703) Fix potential null pointer error of gpuNodeResourceUpdateHandler in NodeResourceMonitorImpl.
[ https://issues.apache.org/jira/browse/YARN-10703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304313#comment-17304313 ] Eric Badger commented on YARN-10703: +1 I've committed this to trunk (3.4) > Fix potential null pointer error of gpuNodeResourceUpdateHandler in > NodeResourceMonitorImpl. > > > Key: YARN-10703 > URL: https://issues.apache.org/jira/browse/YARN-10703 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10703.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10703) Fix potential null pointer error of gpuNodeResourceUpdateHandler in NodeResourceMonitorImpl.
[ https://issues.apache.org/jira/browse/YARN-10703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10703: --- Fix Version/s: 3.4.0 > Fix potential null pointer error of gpuNodeResourceUpdateHandler in > NodeResourceMonitorImpl. > > > Key: YARN-10703 > URL: https://issues.apache.org/jira/browse/YARN-10703 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10703.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10688) ClusterMetrics should support GPU capacity related metrics.
[ https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10688: --- Fix Version/s: 3.2.3 3.3.1 3.4.0 Thanks for the updated patch, [~zhuqi]! +1 I've committed this to trunk (3.4), branch-3.3, and branch-3.2. There was a small import conflict that I took care of in the cherry-pick to branch-3.2 > ClusterMetrics should support GPU capacity related metrics. > --- > > Key: YARN-10688 > URL: https://issues.apache.org/jira/browse/YARN-10688 > Project: Hadoop YARN > Issue Type: Sub-task > Components: metrics, resourcemanager >Affects Versions: 3.2.2, 3.4.0 >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Attachments: YARN-10688.001.patch, YARN-10688.002.patch, > YARN-10688.003.patch, YARN-10688.004.patch, image-2021-03-11-15-35-49-625.png > > > Now the ClusterMetrics only support memory and Vcore related metrics. > > {code:java} > @Metric("Memory Utilization") MutableGaugeLong utilizedMB; > @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores; > @Metric("Memory Capability") MutableGaugeLong capabilityMB; > @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores; > {code} > > > !image-2021-03-11-15-35-49-625.png|width=593,height=253! > In our cluster, we added GPU supported, so i think the GPU related metrics > should also be supported by ClusterMetrics. > > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with gpu resourceType.
[ https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302956#comment-17302956 ] Eric Badger commented on YARN-10503: One initial question I have is whether we should generalize this to any resource type (e.g. GPU, FPGA, etc). GPU already isn't a first-class resource in YARN. If we aren't going to make it one, then I think it would be prudent to make these additions generalized to all arbitrary resources instead of just GPUs > Support queue capacity in terms of absolute resources with gpu resourceType. > > > Key: YARN-10503 > URL: https://issues.apache.org/jira/browse/YARN-10503 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10503.001.patch, YARN-10503.002.patch > > > Currently the only absolute resources are memory and vcores. > {code:java} > /** > * Different resource types supported. > */ > public enum AbsoluteResourceType { > MEMORY, VCORES; > }{code} > But in our GPU production clusters, we need to support more resource types. > This is very important for cluster scaling when there are absolute demands for > different resource types. > > This Jira will handle GPU first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
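The generalization suggested in the comment above (arbitrary resource names rather than a fixed MEMORY/VCORES enum) could look roughly like the sketch below. Everything here is an assumption made for illustration: the class {{AbsoluteResourceSpec}}, the method {{parse}}, and the bracketed name=value input format are not taken from the attached patches.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser that keys absolute capacities by resource name instead
// of a fixed enum, so GPU, FPGA, or any other custom resource is handled the
// same way as memory and vcores.
class AbsoluteResourceSpec {
  static Map<String, Long> parse(String spec) {
    String body = spec.trim();
    // Assumed input shape: "[name=value,name=value,...]"; brackets optional.
    if (body.startsWith("[") && body.endsWith("]")) {
      body = body.substring(1, body.length() - 1);
    }
    Map<String, Long> amounts = new LinkedHashMap<>();
    for (String pair : body.split(",")) {
      String trimmed = pair.trim();
      if (trimmed.isEmpty()) {
        continue;
      }
      int eq = trimmed.indexOf('=');
      if (eq <= 0) {
        throw new IllegalArgumentException("Expected name=value, got: " + trimmed);
      }
      String name = trimmed.substring(0, eq).trim();
      long value = Long.parseLong(trimmed.substring(eq + 1).trim());
      amounts.put(name, value);
    }
    return amounts;
  }
}
```

Keying by name means no code change is needed when a new resource type is configured, which is the point of generalizing beyond GPUs.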
[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.
[ https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302931#comment-17302931 ] Eric Badger commented on YARN-10692: [~zhuqi], it looks like the unit test failure from Hadoop QA is related to the patch. Additionally, there are no unit tests added for the patch. I think it would be good to add to TestNodeManagerMetrics > Add Node GPU Utilization and apply to NodeMetrics. > -- > > Key: YARN-10692 > URL: https://issues.apache.org/jira/browse/YARN-10692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10692.001.patch > > > Currently there is no node-level GPU utilization metric. This issue will add it, starting > with NodeMetrics. > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10688) ClusterMetrics should support GPU capacity related metrics.
[ https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302893#comment-17302893 ] Eric Badger commented on YARN-10688: {noformat} @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores; @Metric("Memory Capability") MutableGaugeLong capabilityMB; @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores; + @Metric("GPU Capability") + private MutableGaugeLong capabilityGPUs; {noformat} To maintain consistency, I would actually remove the private here and let the checkstyle warning exist. I would prefer to update the checkstyle for them all in a separate JIRA. But I think consistency is most important. Other than that, the patch looks good to me > ClusterMetrics should support GPU capacity related metrics. > --- > > Key: YARN-10688 > URL: https://issues.apache.org/jira/browse/YARN-10688 > Project: Hadoop YARN > Issue Type: Sub-task > Components: metrics, resourcemanager >Affects Versions: 3.2.2, 3.4.0 >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10688.001.patch, YARN-10688.002.patch, > YARN-10688.003.patch, image-2021-03-11-15-35-49-625.png > > > Now the ClusterMetrics only support memory and Vcore related metrics. > > {code:java} > @Metric("Memory Utilization") MutableGaugeLong utilizedMB; > @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores; > @Metric("Memory Capability") MutableGaugeLong capabilityMB; > @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores; > {code} > > > !image-2021-03-11-15-35-49-625.png|width=593,height=253! > In our cluster, we added GPU supported, so i think the GPU related metrics > should also be supported by ClusterMetrics. > > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10616) Nodemanagers cannot detect GPU failures
[ https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302864#comment-17302864 ] Eric Badger commented on YARN-10616: bq. For the "updateNodeResource" issue, one question is that is it a frequently used operation? I'm not aware of the scenario that we use this often. [~ztang], we use this feature internally. Maybe once or twice a day across all of our clusters. Usually to quickly remove a node from a cluster while we investigate why it's running slow or causing errors. We will use {{updateNodeResource}} to set the node resources to 0, meaning that nothing will get scheduled on the node. But the NM will still be running so that we can jstack or grab a heap dump. For us at least, the only time we ever use this operation is to remove a node from the cluster. So maybe there's a different way that we could do that such that it doesn't mess with the node resources. Because this really is just a simple hack to get the node to not schedule anything else > Nodemanagers cannot detect GPU failures > --- > > Key: YARN-10616 > URL: https://issues.apache.org/jira/browse/YARN-10616 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > As stated above, the bug is that GPUs can fail, but the NM doesn't notice the > failure. The NM will continue to schedule tasks onto the failed GPU, but the > GPU won't actually work and so the container will likely fail or run very > slowly on the CPU. > My initial thought on solving this is to add NM resource capabilities to the > NM-RM heartbeat and have the RM update its view of the NM's resource > capabilities on each heartbeat. This would be a fairly trivial change, but > comes with the unfortunate side effect that it completely undermines {{yarn > rmadmin -updateNodeResource}}. 
When you run {{-updateNodeResource}} the > assumption is that the node will retain these new resource capabilities until > either the NM or RM is restarted. But with a heartbeat interaction constantly > updating those resource capabilities from the NM perspective, the explicit > changes via {{-updateNodeResource}} would be lost on the next heartbeat. We > could potentially add a flag to ignore the heartbeat updates for any node who > has had {{-updateNodeResource}} called on it (until a re-registration). But > in this case, the node would no longer get resource capability updates until > the NM or RM restarted. If {{-updateNodeResource}} is used a decent amount, > then that would give potentially unexpected behavior in relation to nodes > properly auto-detecting failures. > Another idea is to add a GPU monitor thread on the NM to periodically run > {{nvidia-smi}} and detect changes in the number of healthy GPUs. If that > number decreased, the node would hook into the health check status and mark > itself as unhealthy. The downside of this approach is that a single failed > GPU would mean taking out an entire node (e.g. 8 GPUs). > I would really like to go with the NM-RM heartbeat approach, but the > {{-updateNodeResource}} issue bothers me. The second approach is ok I guess, > but I also don't like taking down whole GPU nodes when only a single GPU is > bad. Would like to hear thoughts of others on how best to approach this > [~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302860#comment-17302860 ] Eric Badger commented on YARN-9618: --- The patch looks reasonable to me. Agree with [~gandras] that some stress testing should be done before committing > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch > > > In the current implementation, NodeListManager events block the async dispatcher and can > cause an RM crash and slow down event processing. > # Cluster restart with 1K running apps. Each usable event will create 1K > events; overall there could be 5K*1K events for a 5K cluster > # Event processing is blocked till new events are added to queue. > Solution : > # Add another async Event handler similar to scheduler. > # Instead of adding events to dispatcher directly call RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302782#comment-17302782 ]

Eric Badger commented on YARN-10501:

[~aajisaka], [~ahussein], the most recent builds are failing due to some Yetus flag errors. Is this a recent change? Do you know how to mitigate it?

> Can't remove all node labels after add node label without nodemanager port
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch,
> YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch,
> YARN-10502-branch-2.10.003.patch
>
> When a label is added to a node without a nodemanager port, or with the
> WILDCARD_PORT (0) port, the label info cannot be fully removed from that node.
> Steps to reproduce:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
> {code}
> After step 4 removes the nodemanager labels, the label info is still
> present in the node info.
> {code:java}
> 641 case REPLACE:
> 642   replaceNodeForLabels(nodeId, host.labels, labels);
> 643   replaceLabelsForNode(nodeId, host.labels, labels);
> 644   host.labels.clear();
> 645   host.labels.addAll(labels);
> 646   for (Node node : host.nms.values()) {
> 647     replaceNodeForLabels(node.nodeId, node.labels, labels);
> 649     node.labels = null;
> 650   }
> 651   break;{code}
> The cause is at line 647. When labels are added to a node without a port,
> both port 0 and the real NM port are added to the node info. When the labels
> are removed, the node.labels parameter at line 647 is null, so the old label
> is not removed.
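The cause described in the quoted issue can be replayed with a small model. This is only a sketch under stated assumptions: `LabelIndexSketch`, its map-based label index and the `simulate` helper are hypothetical stand-ins, not the Hadoop classes, and the null check in `replaceNodeForLabels` is inferred from the description ("node.labels ... is null, so it will not remove the old label").

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class LabelIndexSketch {
    // Same shape as the call at line 647: old labels are dropped only when the
    // caller passes a non-null set (assumption taken from the issue description).
    static void replaceNodeForLabels(Map<String, Set<String>> labelToNodes,
            String nodeId, Set<String> oldLabels, Set<String> newLabels) {
        if (oldLabels != null) {
            for (String label : oldLabels) {
                Set<String> nodes = labelToNodes.get(label);
                if (nodes != null) {
                    nodes.remove(nodeId);
                }
            }
        }
        for (String label : newLabels) {
            labelToNodes.computeIfAbsent(label, k -> new HashSet<>()).add(nodeId);
        }
    }

    // Replays steps 2 and 4 and returns the nodes still indexed under "cpunode".
    static Set<String> simulate() {
        Map<String, Set<String>> labelToNodes = new HashMap<>();
        Set<String> cpunode = Collections.singleton("cpunode");
        Set<String> none = Collections.emptySet();

        // Step 2: label the host entry (port 0) and the running NM entry.
        replaceNodeForLabels(labelToNodes, "server001:0", none, cpunode);
        replaceNodeForLabels(labelToNodes, "server001:45454", null, cpunode);

        // Step 4: clear the labels; the NM entry's own label set is null,
        // so nothing is removed for it.
        replaceNodeForLabels(labelToNodes, "server001:0", cpunode, none);
        replaceNodeForLabels(labelToNodes, "server001:45454", null, none);

        return labelToNodes.get("cpunode");
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

In this model the set returned by `simulate()` still contains `server001:45454`, which is the same leftover node that step 5 of the reproduction reports.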
[jira] [Commented] (YARN-10495) make the rpath of container-executor configurable
[ https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302761#comment-17302761 ]

Eric Badger commented on YARN-10495:

[~angerszhu], I don't think it's a good idea to ship glibc with Hadoop. glibc is tied very closely to the kernel, and if the ABI has changed then it won't work.

> make the rpath of container-executor configurable
>
> Key: YARN-10495
> URL: https://issues.apache.org/jira/browse/YARN-10495
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Reporter: angerszhu
> Assignee: angerszhu
> Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10495.001.patch, YARN-10495.002.patch
>
> In https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on
> crypto to container-executor. We hit a case where our Jenkins machine has
> libcrypto.so.1.0.0 in its shared library environment, but our nodemanager
> machine does not have libcrypto.so.1.0.0 and has *libcrypto.so.1.1* instead.
> We use an internal custom dynamic link library environment,
> /usr/lib/x86_64-linux-gnu,
> and we build Hadoop with the parameters below:
> {code:java}
> -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu
> {code}
>
> On the Jenkins machine, the shared library path /usr/lib/x86_64-linux-gnu
> (where libcrypto lives) contains:
> {code:java}
> -rw-r--r-- 1 root root  240136 Nov 28 2014 libcroco-0.6.so.3.0.1
> -rw-r--r-- 1 root root   54550 Jun 18 2017 libcrypt.a
> -rw-r--r-- 1 root root 4306444 Sep 26 2019 libcrypto.a
> lrwxrwxrwx 1 root root      18 Sep 26 2019 libcrypto.so -> libcrypto.so.1.0.0
> -rw-r--r-- 1 root root 2070976 Sep 26 2019 libcrypto.so.1.0.0
> lrwxrwxrwx 1 root root      35 Jun 18 2017 libcrypt.so -> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r-- 1 root root     298 Jun 18 2017 libc.so
> {code}
>
> On the nodemanager machine, the shared library path /usr/lib/x86_64-linux-gnu
> (where libcrypto lives) contains:
> {code:java}
> -rw-r--r-- 1 root root   55852 2月 7 2019 libcrypt.a
> -rw-r--r-- 1 root root 4864244 9月 28 2019 libcrypto.a
> lrwxrwxrwx 1 root root      16 9月 28 2019 libcrypto.so -> libcrypto.so.1.1
> -rw-r--r-- 1 root root 2504576 12月 24 2019 libcrypto.so.1.0.2
> -rw-r--r-- 1 root root 2715840 9月 28 2019 libcrypto.so.1.1
> lrwxrwxrwx 1 root root      35 2月 7 2019 libcrypt.so -> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r-- 1 root root     298 2月 7 2019 libc.so
> {code}
> We build container-executor with
> The libcrypto.so version differs between the two machines, which causes an
> error when we start the nodemanager:
>
> {code:java}
> .. 3 more
> Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208)
>   at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306)
>   ... 4 more
> Caused by: ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)
>   at org.apache.hadoop.util.Shell.run(Shell.java:901)
>   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154)
>   ... 6 more
> {code}
>
> We should make the RPATH of container-executor configurable to solve this problem.
[jira] [Commented] (YARN-10690) ClusterMetrics should support GPU utilization related metrics.
[ https://issues.apache.org/jira/browse/YARN-10690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302009#comment-17302009 ]

Eric Badger commented on YARN-10690:

[~zhuqi], can we convert the related JIRAs to be subtasks of this JIRA? That will make it easier to track them.

> ClusterMetrics should support GPU utilization related metrics.
>
> Key: YARN-10690
> URL: https://issues.apache.org/jira/browse/YARN-10690
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Major
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301993#comment-17301993 ]

Eric Badger commented on YARN-10501:

[~caozhiqiang], it doesn't need to be merged to 2.10.1. It has successfully been merged to branch-2.10. Try uploading your patch one more time as YARN-10502-branch-2.10.002.patch

> Can't remove all node labels after add node label without nodemanager port
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.001.patch,
> YARN-10501-branch-2.10.1.001.patch, YARN-10501-branch-2.10.1.002.patch,
> YARN-10501.002.patch, YARN-10501.003.patch, YARN-10501.004.patch
[jira] [Commented] (YARN-10688) ClusterMetrics should support GPU capacity related metrics.
[ https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301987#comment-17301987 ]

Eric Badger commented on YARN-10688:

[~zhuqi], thanks for the updated patch. To make things a little cleaner, I think we can do something like this instead of having 2 separate methods.
{noformat}
  public long getCapabilityGPUs() {
    if (capabilityGPUs == null) {
      return 0;
    }
    return capabilityGPUs.value();
  }
{noformat}
This works in my non-GPU environment. I think it's cleaner, but I need you to test it out in your GPU environment to make sure it works OK. And then, of course, update the unit tests to use {{getCapabilityGPUs}}.

> ClusterMetrics should support GPU capacity related metrics.
>
> Key: YARN-10688
> URL: https://issues.apache.org/jira/browse/YARN-10688
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: metrics, resourcemanager
> Affects Versions: 3.2.2, 3.4.0
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Major
> Attachments: YARN-10688.001.patch, YARN-10688.002.patch,
> image-2021-03-11-15-35-49-625.png
>
> Currently, ClusterMetrics only supports memory and vcore related metrics.
>
> {code:java}
> @Metric("Memory Utilization") MutableGaugeLong utilizedMB;
> @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores;
> @Metric("Memory Capability") MutableGaugeLong capabilityMB;
> @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores;
> {code}
>
> !image-2021-03-11-15-35-49-625.png|width=593,height=253!
> In our cluster, we added GPU support, so I think GPU related metrics
> should also be supported by ClusterMetrics.
>
> cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras]
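The null-safe getter suggested in the comment can be tried in isolation. The following is a minimal sketch, not the patch itself: `GpuCapabilitySketch`, the nested `GaugeLong` stand-in (used here in place of Hadoop's `MutableGaugeLong`) and `registerGpuGauge` are hypothetical, and it assumes the gauge field is simply left null when no GPU resource type is configured.

```java
final class GpuCapabilitySketch {
    /** Hypothetical stand-in for a long gauge; only the two calls used below. */
    static final class GaugeLong {
        private long current;

        void incr(long delta) {
            current += delta;
        }

        long value() {
            return current;
        }
    }

    // Assumption for this sketch: stays null on clusters without a GPU resource type.
    private GaugeLong capabilityGPUs;

    /** Called only when the GPU resource type is configured. */
    void registerGpuGauge() {
        capabilityGPUs = new GaugeLong();
    }

    /** Ignores the update when the gauge was never registered. */
    void incrCapabilityGPUs(long gpus) {
        if (capabilityGPUs != null) {
            capabilityGPUs.incr(gpus);
        }
    }

    /** Single getter: reports 0 instead of failing on a non-GPU cluster. */
    long getCapabilityGPUs() {
        if (capabilityGPUs == null) {
            return 0;
        }
        return capabilityGPUs.value();
    }

    public static void main(String[] args) {
        GpuCapabilitySketch noGpu = new GpuCapabilitySketch();
        noGpu.incrCapabilityGPUs(4);
        System.out.println(noGpu.getCapabilityGPUs());

        GpuCapabilitySketch withGpu = new GpuCapabilitySketch();
        withGpu.registerGpuGauge();
        withGpu.incrCapabilityGPUs(4);
        System.out.println(withGpu.getCapabilityGPUs());
    }
}
```

With one getter that tolerates a missing gauge, callers and unit tests do not need a separate code path for GPU and non-GPU clusters, which is the simplification the comment asks for.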
[jira] [Updated] (YARN-10495) make the rpath of container-executor configurable
[ https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger updated YARN-10495:
    Fix Version/s: 3.3.1

[~angerszhu], I backported this to branch-3.3. There's a conflict past that. If you'd like it to go further, please provide a patch for branch-3.2. It's now been committed to trunk (3.4) and branch-3.3.

> make the rpath of container-executor configurable
>
> Key: YARN-10495
> URL: https://issues.apache.org/jira/browse/YARN-10495
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Reporter: angerszhu
> Assignee: angerszhu
> Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10495.001.patch, YARN-10495.002.patch
[jira] [Commented] (YARN-10688) ClusterMetrics should support GPU related metrics.
[ https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299824#comment-17299824 ]

Eric Badger commented on YARN-10688:

{noformat}
2021-03-11 19:25:11,183 ERROR [SchedulerEventDispatcher:Event Processor] event.EventDispatcher (MarkerIgnoringBase.java:error(159)) - Error in handling event type NODE_ADDED to the Event Dispatcher
org.apache.hadoop.yarn.exceptions.ResourceNotFoundException: The resource manager encountered a problem that should not occur under normal circumstances. Please report this error to the Hadoop community by opening a JIRA ticket at http://issues.apache.org/jira and including the following information:
* Resource type requested: yarn.io/gpu
* Resource object:
* The stack trace for this exception: java.lang.Exception
	at org.apache.hadoop.yarn.exceptions.ResourceNotFoundException.<init>(ResourceNotFoundException.java:47)
	at org.apache.hadoop.yarn.api.records.Resource.getResourceInformation(Resource.java:263)
	at org.apache.hadoop.yarn.server.resourcemanager.ClusterMetrics.incrCapability(ClusterMetrics.java:222)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.addNode(ClusterNodeTracker.java:110)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addNode(CapacityScheduler.java:2201)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1937)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
	at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
	at java.lang.Thread.run(Thread.java:748)
{noformat}
This is the error I get when I start up the RM in a cluster without any GPUs.

> ClusterMetrics should support GPU related metrics.
>
> Key: YARN-10688
> URL: https://issues.apache.org/jira/browse/YARN-10688
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: metrics, resourcemanager
> Affects Versions: 3.2.2, 3.4.0
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Major
> Attachments: YARN-10688.001.patch, image-2021-03-11-15-35-49-625.png
[jira] [Commented] (YARN-10688) ClusterMetrics should support GPU related metrics.
[ https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299814#comment-17299814 ]

Eric Badger commented on YARN-10688:

{noformat}
+      Integer gpuIndex = ResourceUtils.getResourceTypeIndex()
+          .get(ResourceInformation.GPU_URI);
+      res.getResourceInformation(ResourceInformation.GPU_URI);
+      if (gpuIndex != null) {
+        capabilityGPUs.incr(res.
+            getResourceValue(ResourceInformation.GPU_URI));
+      }
{noformat}
{noformat}
+      res.getResourceInformation(ResourceInformation.GPU_URI);
{noformat}
Looks like this line is unnecessary.

> ClusterMetrics should support GPU related metrics.
>
> Key: YARN-10688
> URL: https://issues.apache.org/jira/browse/YARN-10688
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: metrics, resourcemanager
> Affects Versions: 3.2.2, 3.4.0
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Major
> Attachments: YARN-10688.001.patch, image-2021-03-11-15-35-49-625.png
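The point of the review comment can be illustrated with a toy model in which the resource lookup only happens behind the index check. This is a sketch under assumptions, not the patch: `GuardedGpuLookupSketch`, its simplified `Resource` class and the `incrCapability` helper are hypothetical, and the toy lookup throws for an unregistered type only to mimic the failure reported on a cluster without GPUs.

```java
import java.util.HashMap;
import java.util.Map;

final class GuardedGpuLookupSketch {
    static final String GPU_URI = "yarn.io/gpu";

    /** Hypothetical, simplified resource: looking up an unregistered type throws. */
    static final class Resource {
        private final Map<String, Long> values = new HashMap<>();

        Resource with(String name, long value) {
            values.put(name, value);
            return this;
        }

        long getResourceValue(String name) {
            Long value = values.get(name);
            if (value == null) {
                throw new IllegalStateException("Unknown resource: " + name);
            }
            return value;
        }
    }

    /** Adds the node's GPU count only when the GPU type is present in the type index. */
    static long incrCapability(long currentGpus, Resource res,
            Map<String, Integer> typeIndex) {
        Integer gpuIndex = typeIndex.get(GPU_URI);
        if (gpuIndex != null) {
            // The only lookup of the GPU value sits behind the guard.
            return currentGpus + res.getResourceValue(GPU_URI);
        }
        return currentGpus;
    }

    public static void main(String[] args) {
        Map<String, Integer> noGpuIndex = new HashMap<>();
        noGpuIndex.put("memory-mb", 0);
        noGpuIndex.put("vcores", 1);
        Resource plainNode = new Resource().with("memory-mb", 4096).with("vcores", 8);
        System.out.println(incrCapability(0, plainNode, noGpuIndex));
    }
}
```

Because nothing touches the GPU entry before the null check, adding a node on a cluster without GPUs leaves the count unchanged instead of raising an exception.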
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299772#comment-17299772 ]

Eric Badger commented on YARN-10501:

[~aajisaka], looks like the precommit is still failing to install JDK 7.

> Can't remove all node labels after add node label without nodemanager port
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.1.001.patch,
> YARN-10501-branch-2.10.1.002.patch, YARN-10501.002.patch,
> YARN-10501.003.patch, YARN-10501.004.patch
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298253#comment-17298253 ]

Eric Badger commented on YARN-10501:

[~ahussein], [~aajisaka], is this due to any of the recent Yetus changes? New branch-2.10 patches are failing Hadoop QA because it can't find openjdk-7-jdk.

> Can't remove all node labels after add node label without nodemanager port
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.1.001.patch,
> YARN-10501-branch-2.10.1.002.patch, YARN-10501.002.patch,
> YARN-10501.003.patch, YARN-10501.004.patch
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297684#comment-17297684 ]

Eric Badger commented on YARN-10501:

Reopening and submitting patch so that Hadoop QA will run.

> Can't remove all node labels after add node label without nodemanager port
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.1.001.patch,
> YARN-10501-branch-2.10.1.002.patch, YARN-10501.002.patch,
> YARN-10501.003.patch, YARN-10501.004.patch, YARN-10501.005.patch
[jira] [Reopened] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger reopened YARN-10501:

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.1.001.patch,
> YARN-10501-branch-2.10.1.002.patch, YARN-10501.002.patch,
> YARN-10501.003.patch, YARN-10501.004.patch, YARN-10501.005.patch
>
>
> When a label is added to nodes without a nodemanager port, or with the
> WILDCARD_PORT (0) port, not all label info can be removed from these nodes.
> Steps to reproduce:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
> {code}
> After step 4, which removes the nodemanager labels, the label info is still
> present in the node info.
> {code:java}
> 641       case REPLACE:
> 642         replaceNodeForLabels(nodeId, host.labels, labels);
> 643         replaceLabelsForNode(nodeId, host.labels, labels);
> 644         host.labels.clear();
> 645         host.labels.addAll(labels);
> 646         for (Node node : host.nms.values()) {
> 647           replaceNodeForLabels(node.nodeId, node.labels, labels);
> 649           node.labels = null;
> 650         }
> 651         break;{code}
> The cause is on line 647: when labels are added to a node without a port,
> both the 0 port and the real NM port are added to the node info, and when
> labels are removed, the node.labels parameter on line 647 is null, so the
> old label is not removed.
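The failure mode reported in YARN-10501 can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the Hadoop implementation: the class `LabelReplaceSketch`, its fields, and the simplified `replaceNodeForLabels` are stand-ins written for this example, and the `fallBackToHostLabels` switch is a hypothetical remedy used to show the contrast, not necessarily what the committed patch does.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the label bookkeeping in the report; not the Hadoop implementation.
class LabelReplaceSketch {
    // label -> node ids ("host:port") that currently carry it
    final Map<String, Set<String>> labelToNodes = new HashMap<>();
    final Set<String> hostLabels = new HashSet<>();
    final String hostNodeId = "server001:0";   // host-level id with the wildcard port
    final String nmNodeId = "server001:45454"; // the single registered NM
    Set<String> nmLabels = null;               // the NM carries no labels of its own

    // Simplified stand-in: detach nodeId from oldLabels, attach it to newLabels.
    void replaceNodeForLabels(String nodeId, Set<String> oldLabels, Set<String> newLabels) {
        if (oldLabels != null) {
            for (String label : oldLabels) {
                Set<String> nodes = labelToNodes.get(label);
                if (nodes != null) {
                    nodes.remove(nodeId);
                }
            }
        }
        for (String label : newLabels) {
            labelToNodes.computeIfAbsent(label, k -> new HashSet<>()).add(nodeId);
        }
    }

    // Host-level REPLACE shaped like the quoted snippet. fallBackToHostLabels is a
    // hypothetical switch that exists only in this sketch.
    void replaceOnHost(Set<String> labels, boolean fallBackToHostLabels) {
        Set<String> previousHostLabels = new HashSet<>(hostLabels);
        replaceNodeForLabels(hostNodeId, hostLabels, labels);
        hostLabels.clear();
        hostLabels.addAll(labels);
        // With a null old-label set, nothing is detached for the NM id.
        Set<String> oldNmLabels =
            (nmLabels == null && fallBackToHostLabels) ? previousHostLabels : nmLabels;
        replaceNodeForLabels(nmNodeId, oldNmLabels, labels); // counterpart of line 647
        nmLabels = null;
    }

    Set<String> nodesFor(String label) {
        return labelToNodes.getOrDefault(label, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        for (boolean fallBack : new boolean[] {false, true}) {
            LabelReplaceSketch model = new LabelReplaceSketch();
            model.replaceOnHost(Collections.singleton("cpunode"), fallBack); // step 2
            model.replaceOnHost(Collections.<String>emptySet(), fallBack);   // step 4
            System.out.println("fallBack=" + fallBack + " -> " + model.nodesFor("cpunode"));
        }
    }
}
```

Without the fallback, the model leaves `server001:45454` attached to `cpunode` after step 4, which matches the REST output in the report. With the fallback, the label's node set ends up empty.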
[jira] [Updated] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV
[ https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger updated YARN-10664:
---
Fix Version/s: 3.2.3

Thanks for the patch, [~Jim_Brennan]! +1 from me. The checkstyle warning should be cleaned up in a different way than in this patch, and I don't think it is a big deal here.

I've committed this to branch-3.2. This has now been committed to trunk (3.4), branch-3.3, and branch-3.2.

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
> Affects Versions: 2.10.1, 3.4.0
> Reporter: Jim Brennan
> Assignee: Jim Brennan
> Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: YARN-10664-branch-3.2.004.patch, YARN-10664.001.patch,
> YARN-10664.002.patch, YARN-10664.003.patch, YARN-10664.004.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter
> expansion. That is, you cannot specify an environment variable such as
> {code}{{JAVA_HOME}}{code} and have it be expanded to {{$JAVA_HOME}} inside
> the container.
> We need this in order to specify different Java GC options for Java
> processes running inside YARN containers, based on which version of Java is
> being used.
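As a rough illustration of the parameter expansion requested in YARN-10664, the sketch below rewrites a `{{NAME}}` marker into the `$NAME` form that a POSIX shell would expand inside the container. It is a minimal sketch, not the patch itself: `ParamExpansionSketch`, `expandForUnix`, and the `MY_JAVA` and `PLAIN` entries are hypothetical names chosen for this example, and only the Unix-style form is covered.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: the class and method names are hypothetical, not Hadoop APIs.
class ParamExpansionSketch {
    // Rewrite each {{NAME}} marker as $NAME, the form a POSIX shell expands.
    static String expandForUnix(String value) {
        // "\\$" emits a literal dollar sign; "$1" is the captured variable name.
        return value.replaceAll("\\{\\{(\\w+)\\}\\}", "\\$$1");
    }

    public static void main(String[] args) {
        // Hypothetical admin environment entries, as they might be configured.
        Map<String, String> adminEnv = new LinkedHashMap<>();
        adminEnv.put("MY_JAVA", "{{JAVA_HOME}}/bin/java");
        adminEnv.put("PLAIN", "no markers here");
        for (Map.Entry<String, String> entry : adminEnv.entrySet()) {
            System.out.println(entry.getKey() + "=" + expandForUnix(entry.getValue()));
        }
    }
}
```

Values without markers pass through unchanged, so existing settings would behave as before under this scheme.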