[jira] [Commented] (YARN-10688) ClusterMetrics should support GPU capacity related metrics.
[ https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303115#comment-17303115 ] Hadoop QA commented on YARN-10688: -- (/) *+1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 37s | Docker mode activated. |
|| Prechecks ||
| +1 | dupname | 0m 0s | No case conflicting files found. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test file. |
|| trunk Compile Tests ||
| +1 | mvninstall | 23m 42s | trunk passed |
| +1 | compile | 1m 3s | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | compile | 0m 51s | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 | checkstyle | 0m 45s | trunk passed |
| +1 | mvnsite | 0m 54s | trunk passed |
| +1 | shadedclient | 17m 8s | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 41s | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | javadoc | 0m 37s | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| 0 | spotbugs | 20m 24s | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 58s | trunk passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 0m 53s | the patch passed |
| +1 | compile | 0m 59s | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | javac | 0m 59s | the patch passed |
| +1 | compile | 0m 46s | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 | javac | 0m 46s | the patch passed |
| -0 | checkstyle | 0m 39s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 15 unchanged - 0 fixed = 16 total (was 15). Logfile: https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/806/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
| +1 | mvnsite | 0m 48s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 14m 57s | patch has no errors when building and testing our client artifacts. |
| +1 |
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303097#comment-17303097 ] Hadoop QA commented on YARN-10674: -- (/) *+1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 39s | Docker mode activated. |
|| Prechecks ||
| +1 | dupname | 0m 0s | No case conflicting files found. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| trunk Compile Tests ||
| +1 | mvninstall | 22m 56s | trunk passed |
| +1 | compile | 1m 1s | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | compile | 0m 51s | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 | checkstyle | 0m 45s | trunk passed |
| +1 | mvnsite | 0m 53s | trunk passed |
| +1 | shadedclient | 18m 10s | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 41s | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | javadoc | 0m 36s | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| 0 | spotbugs | 21m 21s | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 54s | trunk passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 0m 51s | the patch passed |
| +1 | compile | 0m 57s | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | javac | 0m 57s | the patch passed |
| +1 | compile | 0m 46s | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 | javac | 0m 46s | the patch passed |
| +1 | checkstyle | 0m 39s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 13 unchanged - 7 fixed = 13 total (was 20) |
| +1 | mvnsite | 0m 50s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 15m 2s | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 39s | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| |
[jira] [Updated] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-9618: - Attachment: YARN-9618.004.patch > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch, YARN-9618.004.patch > > > In the current implementation, NodeListManager events block the async dispatcher and can > cause an RM crash and slow down event processing. > # On a cluster restart with 1K running apps, each node-usable event creates 1K > events; overall this could be 5K*1K events for a 5K-node cluster. > # Event processing is blocked until the new events are added to the queue. > Solution: > # Add another async event handler, similar to the scheduler's. > # Instead of adding events to the dispatcher directly, call the RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
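The solution proposed in YARN-9618 can be sketched as follows. This is a hypothetical, simplified illustration in plain Java, not the actual patch (the class name `NodeEventDispatcherSketch` is invented): a dedicated handler thread drains its own queue, so the central dispatcher hands off the O(running apps) fan-out and returns immediately instead of blocking.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Sketch of a dedicated async event handler (hypothetical, simplified):
 * the main dispatcher enqueues and returns; the fan-out to per-app
 * events happens on this worker thread instead.
 */
public class NodeEventDispatcherSketch {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    public NodeEventDispatcherSketch() {
        worker = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // The expensive O(apps) work runs here, off the
                    // central dispatcher thread.
                    queue.take().run();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "node-event-handler");
        worker.setDaemon(true);
        worker.start();
    }

    /** Called from the central dispatcher; returns immediately. */
    public void handle(Runnable event) {
        queue.add(event);
    }
}
```

The key property is that `handle` never blocks on the fan-out, which is what the current implementation does to the async dispatcher.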
[jira] [Commented] (YARN-10616) Nodemanagers cannot detect GPU failures
[ https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303044#comment-17303044 ] Qi Zhu commented on YARN-10616: --- [~ebadger] [~ztang] Actually, we could use the graceful-decommission approach to achieve that: "We will use {{updateNodeResource}} to set the node resources to 0, meaning that nothing will get scheduled on the node. But the NM will still be running so that we can jstack or grab a heap dump." I think we can implement the NM-RM heartbeat approach first, then handle updateNodeResource. What would you advise? > Nodemanagers cannot detect GPU failures > --- > > Key: YARN-10616 > URL: https://issues.apache.org/jira/browse/YARN-10616 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > As stated above, the bug is that GPUs can fail, but the NM doesn't notice the > failure. The NM will continue to schedule tasks onto the failed GPU, but the > GPU won't actually work and so the container will likely fail or run very > slowly on the CPU. > My initial thought on solving this is to add NM resource capabilities to the > NM-RM heartbeat and have the RM update its view of the NM's resource > capabilities on each heartbeat. This would be a fairly trivial change, but > comes with the unfortunate side effect that it completely undermines {{yarn > rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}} the > assumption is that the node will retain these new resource capabilities until > either the NM or RM is restarted. But with a heartbeat interaction constantly > updating those resource capabilities from the NM perspective, the explicit > changes via {{-updateNodeResource}} would be lost on the next heartbeat. We > could potentially add a flag to ignore the heartbeat updates for any node that > has had {{-updateNodeResource}} called on it (until a re-registration). 
> But in this case, the node would no longer get resource capability updates until > the NM or RM restarted. If {{-updateNodeResource}} is used a decent amount, > then that would give potentially unexpected behavior in relation to nodes > properly auto-detecting failures. > Another idea is to add a GPU monitor thread on the NM to periodically run > {{nvidia-smi}} and detect changes in the number of healthy GPUs. If that > number decreased, the node would hook into the health check status and mark > itself as unhealthy. The downside of this approach is that a single failed > GPU would mean taking out an entire node (e.g. 8 GPUs). > I would really like to go with the NM-RM heartbeat approach, but the > {{-updateNodeResource}} issue bothers me. The second approach is ok I guess, > but I also don't like taking down whole GPU nodes when only a single GPU is > bad. Would like to hear thoughts of others on how best to approach this > [~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
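The GPU-monitor idea above might look roughly like this. This is a hedged sketch, not NodeManager code: it assumes `nvidia-smi --list-gpus` output with one `GPU <n>: ...` line per visible device, and the class and method names are invented for illustration.

```java
/**
 * Sketch of the GPU-monitor idea discussed above (hypothetical helper,
 * not actual NM code): count healthy GPUs from the output of
 * `nvidia-smi --list-gpus` and flag the node when the count drops below
 * what the node registered with.
 */
public class GpuHealthSketch {
    /** Each visible device appears as a line starting with "GPU ". */
    public static int countGpus(String nvidiaSmiListOutput) {
        int count = 0;
        for (String line : nvidiaSmiListOutput.split("\n")) {
            if (line.startsWith("GPU ")) {
                count++;
            }
        }
        return count;
    }

    /** True when fewer GPUs are visible than the node registered with. */
    public static boolean becameUnhealthy(int registeredGpus, String output) {
        return countGpus(output) < registeredGpus;
    }
}
```

A monitor thread would invoke something like this periodically and feed the result into the NM health-check status, with the downside noted above: one bad GPU marks the whole node unhealthy.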
[jira] [Commented] (YARN-10688) ClusterMetrics should support GPU capacity related metrics.
[ https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303039#comment-17303039 ] Qi Zhu commented on YARN-10688: --- Thanks [~ebadger] for confirming. I also think it is more reasonable to remove private. Updated in the latest patch. > ClusterMetrics should support GPU capacity related metrics. > --- > > Key: YARN-10688 > URL: https://issues.apache.org/jira/browse/YARN-10688 > Project: Hadoop YARN > Issue Type: Sub-task > Components: metrics, resourcemanager >Affects Versions: 3.2.2, 3.4.0 >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10688.001.patch, YARN-10688.002.patch, > YARN-10688.003.patch, YARN-10688.004.patch, image-2021-03-11-15-35-49-625.png > > > Now ClusterMetrics only supports memory- and vcore-related metrics. > > {code:java} > @Metric("Memory Utilization") MutableGaugeLong utilizedMB; > @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores; > @Metric("Memory Capability") MutableGaugeLong capabilityMB; > @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores; > {code} > > > !image-2021-03-11-15-35-49-625.png|width=593,height=253! > In our cluster, we added GPU support, so I think the GPU-related metrics > should also be supported by ClusterMetrics. > > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10688) ClusterMetrics should support GPU capacity related metrics.
[ https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-10688: -- Attachment: YARN-10688.004.patch > ClusterMetrics should support GPU capacity related metrics. > --- > > Key: YARN-10688 > URL: https://issues.apache.org/jira/browse/YARN-10688 > Project: Hadoop YARN > Issue Type: Sub-task > Components: metrics, resourcemanager >Affects Versions: 3.2.2, 3.4.0 >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10688.001.patch, YARN-10688.002.patch, > YARN-10688.003.patch, YARN-10688.004.patch, image-2021-03-11-15-35-49-625.png > > > Now ClusterMetrics only supports memory- and vcore-related metrics. > > {code:java} > @Metric("Memory Utilization") MutableGaugeLong utilizedMB; > @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores; > @Metric("Memory Capability") MutableGaugeLong capabilityMB; > @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores; > {code} > > > !image-2021-03-11-15-35-49-625.png|width=593,height=253! > In our cluster, we added GPU support, so I think the GPU-related metrics > should also be supported by ClusterMetrics. > > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303024#comment-17303024 ] Qi Zhu commented on YARN-10674: --- [~pbacsko] Fixed the checkstyle issue in the latest patch. :D > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-10674: -- Attachment: YARN-10674.012.patch > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
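The FS behavior quoted in the description (an empty-dynamic-queue check every `ALLOC_RELOAD_INTERVAL_MS` = 10s) can be sketched with a scheduled executor. The class and method names below are invented for illustration; this is not fs2cs code.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Sketch (hypothetical) of the periodic check loop quoted above:
 * run an onCheck callback every reload interval.
 */
public class AutoQueueDeletionSketch {
    /** Time between checks, mirroring FS's ALLOC_RELOAD_INTERVAL_MS. */
    public static final long DEFAULT_INTERVAL_MS = 10 * 1000;

    public static ScheduledExecutorService startChecker(Runnable onCheck,
            long intervalMs) {
        ScheduledExecutorService checker =
            Executors.newSingleThreadScheduledExecutor();
        // In FS, onCheck would call removeEmptyDynamicQueues() and
        // removePendingIncompatibleQueues(), as in the snippet above.
        checker.scheduleWithFixedDelay(onCheck, 0, intervalMs,
            TimeUnit.MILLISECONDS);
        return checker;
    }
}
```

`scheduleWithFixedDelay` matches the quoted loop's shape (check, then sleep `reloadIntervalMs`) more closely than a fixed-rate schedule would.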
[jira] [Created] (YARN-10699) Document the fact that usage of usernames/groupnames with a "." (dot) is strictly not recommended
Siddharth Ahuja created YARN-10699: -- Summary: Document the fact that usage of usernames/groupnames with a "." (dot) is strictly not recommended Key: YARN-10699 URL: https://issues.apache.org/jira/browse/YARN-10699 Project: Hadoop YARN Issue Type: Improvement Components: docs, documentation Reporter: Siddharth Ahuja Based on discussions in YARN-10652, it is clear that usage of a "." (dot) in a username/groupname (e.g. users in AD/LDAP) can cause unexpected issues e.g. [placement rules involving username (%user placeholder) will definitely exhibit unexpected behavior|https://issues.apache.org/jira/browse/YARN-10652?focusedCommentId=17295964&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17295964]. As such, we need to document clearly for our customers that this use-case is strictly not recommended in CapacityScheduler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10699) Document the fact that usage of usernames/groupnames with a "." (dot) is strictly not recommended
[ https://issues.apache.org/jira/browse/YARN-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Ahuja reassigned YARN-10699: -- Assignee: Siddharth Ahuja > Document the fact that usage of usernames/groupnames with a "." (dot) is > strictly not recommended > - > > Key: YARN-10699 > URL: https://issues.apache.org/jira/browse/YARN-10699 > Project: Hadoop YARN > Issue Type: Improvement > Components: docs, documentation >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > > Based on discussions in YARN-10652, it is clear that usage of a "." (dot) in > a username/groupname (e.g. users in AD/LDAP) can cause unexpected issues e.g. > [placement rules involving username (%user placeholder) will definitely > exhibit unexpected > behavior|https://issues.apache.org/jira/browse/YARN-10652?focusedCommentId=17295964&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17295964]. > As such, we need to document clearly for our customers that this use-case is > strictly not recommended in CapacityScheduler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it
[ https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302978#comment-17302978 ] Wilfred Spiegelenburg commented on YARN-10652: -- Thank you to [~sahuja] for the fix, and to all ([~snemeth] , [~shuzirra] , [~gandras] & [~pbacsko]) for the discussion and resolution around this jira. I committed to trunk with a comment in the commit message: {quote}This only fixes the user name resolution for weights in the queues. It does not add generic support for user names with dots in all use cases in the capacity scheduler. {quote} > Capacity Scheduler fails to handle user weights for a user that has a "." > (dot) in it > - > > Key: YARN-10652 > URL: https://issues.apache.org/jira/browse/YARN-10652 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: Correct user weight of 0.76 picked up for the user with > a dot after the patch.png, Incorrect default user weight of 1.0 being picked > for the user with a dot before the patch.png, YARN-10652.001.patch > > > AD usernames can have a "." (dot) in them i.e. they can be of the format -> > {{firstname.lastname}}. However, if you specify a username with this format > against the Capacity Scheduler setting -> > {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}}, > it fails to be applied and is instead assigned the default of 1.0f weight. > This renders the user weight feature (being used as a means of setting user > priorities for a queue) unusable for such users. > This limitation comes from [1]. From [1], only word characters (A word > character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no > good for AD names that contain a "." (dot). > Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and > HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}. > We could go down similar path and unblock this feature for the AD usernames > with a "." (dot) in them. > [1] > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953 > [2] > https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
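The `\w+` vs `\S+` difference described above can be shown with a small regex sketch. The patterns here are simplified for illustration and are not the exact CapacitySchedulerConfiguration code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustrative sketch of the regex change discussed in YARN-10652
 * (simplified patterns, not the actual scheduler code).
 */
public class UserWeightRegexSketch {
    // Word characters only: cannot match usernames containing a dot.
    static final Pattern WORD_ONLY =
        Pattern.compile("user-settings\\.(\\w+)\\.weight");
    // Non-whitespace: accepts firstname.lastname style AD usernames.
    static final Pattern NON_WHITESPACE =
        Pattern.compile("user-settings\\.(\\S+)\\.weight");

    /** Returns the extracted username, or null when the key doesn't match. */
    static String extractUser(Pattern p, String key) {
        Matcher m = p.matcher(key);
        return m.find() ? m.group(1) : null;
    }
}
```

With the `\w+` pattern the `firstname.lastname` key fails to match at all, which is why the user silently falls back to the default 1.0f weight; the `\S+` pattern captures the full dotted username.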
[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with gpu resourceType.
[ https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302956#comment-17302956 ] Eric Badger commented on YARN-10503: One initial question I have is whether we should generalize this to any resource type (e.g. GPU, FPGA, etc). GPU already isn't a first-class resource in YARN. If we aren't going to make it one, then I think it would be prudent to make these additions generalized to all arbitrary resources instead of just GPUs > Support queue capacity in terms of absolute resources with gpu resourceType. > > > Key: YARN-10503 > URL: https://issues.apache.org/jira/browse/YARN-10503 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10503.001.patch, YARN-10503.002.patch > > > Now the absolute resources are memory and cores. > {code:java} > /** > * Different resource types supported. > */ > public enum AbsoluteResourceType { > MEMORY, VCORES; > }{code} > But in our GPU production clusters, we need to support more resourceTypes. > It's very important for cluster scaling when with different resourceType > absolute demands. > > This Jira will handle GPU first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
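The generalization Eric suggests (arbitrary resource types rather than a fixed enum) might be sketched like this; a hypothetical illustration, not YARN code (`AbsoluteResourceSketch` and its methods are invented): keying absolute capacities by resource type name means FPGA or any custom resource needs no enum change.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch (hypothetical, not the YARN API): generalize absolute-resource
 * capacities from a fixed enum { MEMORY, VCORES } to a map keyed by the
 * resource type name, so GPU, FPGA, etc. are handled uniformly.
 */
public class AbsoluteResourceSketch {
    private final Map<String, Long> absoluteCapacities = new LinkedHashMap<>();

    public void setCapacity(String resourceType, long value) {
        absoluteCapacities.put(resourceType, value);
    }

    /** Unknown resource types default to 0, i.e. no absolute capacity. */
    public long getCapacity(String resourceType) {
        return absoluteCapacities.getOrDefault(resourceType, 0L);
    }
}
```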
[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.
[ https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302931#comment-17302931 ] Eric Badger commented on YARN-10692: [~zhuqi], it looks like the unit test failure from Hadoop QA is related to the patch. Additionally, there are no unit tests added for the patch. I think it would be good to add some to TestNodeManagerMetrics > Add Node GPU Utilization and apply to NodeMetrics. > -- > > Key: YARN-10692 > URL: https://issues.apache.org/jira/browse/YARN-10692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10692.001.patch > > > There is currently no node-level GPU utilization metric; this issue will add one, applying it > to NodeMetrics first. > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10688) ClusterMetrics should support GPU capacity related metrics.
[ https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302893#comment-17302893 ] Eric Badger commented on YARN-10688: {noformat} @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores; @Metric("Memory Capability") MutableGaugeLong capabilityMB; @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores; + @Metric("GPU Capability") + private MutableGaugeLong capabilityGPUs; {noformat} To maintain consistency, I would actually remove the private here and let the checkstyle warning exist. I would prefer to update the checkstyle for them all in a separate JIRA. But I think consistency is most important. Other than that, the patch looks good to me > ClusterMetrics should support GPU capacity related metrics. > --- > > Key: YARN-10688 > URL: https://issues.apache.org/jira/browse/YARN-10688 > Project: Hadoop YARN > Issue Type: Sub-task > Components: metrics, resourcemanager >Affects Versions: 3.2.2, 3.4.0 >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10688.001.patch, YARN-10688.002.patch, > YARN-10688.003.patch, image-2021-03-11-15-35-49-625.png > > > Now ClusterMetrics only supports memory- and vcore-related metrics. > > {code:java} > @Metric("Memory Utilization") MutableGaugeLong utilizedMB; > @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores; > @Metric("Memory Capability") MutableGaugeLong capabilityMB; > @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores; > {code} > > > !image-2021-03-11-15-35-49-625.png|width=593,height=253! > In our cluster, we added GPU support, so I think the GPU-related metrics > should also be supported by ClusterMetrics. > > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10370) [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue Mapping rules
[ https://issues.apache.org/jira/browse/YARN-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302878#comment-17302878 ] Peter Bacsko edited comment on YARN-10370 at 3/16/21, 8:36 PM: --- [~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There are some open tasks left. I think it's safe to say that this feature is ready and the remaining tasks can be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we might need to keep this open for a long time. IMO we should move the open / patch available tasks under a new umbrella and resolve this, marked with a proper Fix version. Opinions? was (Author: pbacsko): [~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There are some open tasks left. I think it's safe to say that the umbrella is done and the remaining tasks can be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we might need to keep this open for a long time. IMO we should move the open / patch available tasks under a new umbrella and resolve this, marked with a proper Fix version. Opinions? > [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue > Mapping rules > --- > > Key: YARN-10370 > URL: https://issues.apache.org/jira/browse/YARN-10370 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Labels: capacity-scheduler, capacityscheduler > Attachments: MappingRuleEnhancements.pdf, Possible extensions of > mapping rule format in Capacity Scheduler.pdf > > > To continue closing the feature gaps between Fair Scheduler and Capacity > Scheduler to help users migrate between the schedulers more easily, we need to > add some of the Fair Scheduler placement rules to the capacity scheduler's > queue mapping functionality. > With [~snemeth] and [~pbacsko] we've created the following design docs about > the proposed changes. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10370) [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue Mapping rules
[ https://issues.apache.org/jira/browse/YARN-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302878#comment-17302878 ] Peter Bacsko commented on YARN-10370: - [~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There are some open tasks left. I think it's safe to say that the umbrella is done and the remaining tasks can be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we might need to keep this open for a long time. IMO we should move the open / patch available tasks under a new umbrella and resolve this, marked with a proper Fix version. Opinions? > [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue > Mapping rules > --- > > Key: YARN-10370 > URL: https://issues.apache.org/jira/browse/YARN-10370 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Labels: capacity-scheduler, capacityscheduler > Attachments: MappingRuleEnhancements.pdf, Possible extensions of > mapping rule format in Capacity Scheduler.pdf > > > To continue closing the feature gaps between Fair Scheduler and Capacity > Scheduler to help users migrate between the schedulers more easily, we need to > add some of the Fair Scheduler placement rules to the capacity scheduler's > queue mapping functionality. > With [~snemeth] and [~pbacsko] we've created the following design docs about > the proposed changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10493) RunC container repository v2
[ https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit reassigned YARN-10493: --- Assignee: Matt Sharp (was: Craig Condit) > RunC container repository v2 > > > Key: YARN-10493 > URL: https://issues.apache.org/jira/browse/YARN-10493 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Matt Sharp >Priority: Major > Attachments: runc-container-repository-v2-design.pdf > > > The current runc container repository design has scalability and usability > issues which will likely limit widespread adoption. We should address this > with a new, V2 layout. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10494) CLI tool for docker-to-squashfs conversion (pure Java)
[ https://issues.apache.org/jira/browse/YARN-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Condit reassigned YARN-10494: --- Assignee: Matt Sharp (was: Craig Condit) > CLI tool for docker-to-squashfs conversion (pure Java) > -- > > Key: YARN-10494 > URL: https://issues.apache.org/jira/browse/YARN-10494 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Matt Sharp >Priority: Major > Labels: pull-request-available > Attachments: YARN-10494.001.patch, > docker-to-squashfs-conversion-tool-design.pdf > > Time Spent: 3h 20m > Remaining Estimate: 0h > > *YARN-9564* defines a docker-to-squashfs image conversion tool that relies on > python2, multiple libraries, squashfs-tools and root access in order to > convert Docker images to squashfs images for use with the runc container > runtime in YARN. > *YARN-9943* was created to investigate alternatives, as the response to > merging YARN-9564 has not been very positive. This proposal outlines the > design for a CLI conversion tool in 100% pure Java that will work out of the > box. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10616) Nodemanagers cannot detect GPU failures
[ https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302864#comment-17302864 ] Eric Badger commented on YARN-10616: bq. For the "updateNodeResource" issue, one question is that is it a frequently used operation? I'm not aware of the scenario that we use this often. [~ztang], we use this feature internally. Maybe once or twice a day across all of our clusters. Usually to quickly remove a node from a cluster while we investigate why it's running slow or causing errors. We will use {{updateNodeResource}} to set the node resources to 0, meaning that nothing will get scheduled on the node. But the NM will still be running so that we can jstack or grab a heap dump. For us at least, the only time we ever use this operation is to remove a node from the cluster. So maybe there's a different way that we could do that such that it doesn't mess with the node resources, because this really is just a simple hack to get the node to not schedule anything else. > Nodemanagers cannot detect GPU failures > --- > > Key: YARN-10616 > URL: https://issues.apache.org/jira/browse/YARN-10616 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > > As stated above, the bug is that GPUs can fail, but the NM doesn't notice the > failure. The NM will continue to schedule tasks onto the failed GPU, but the > GPU won't actually work and so the container will likely fail or run very > slowly on the CPU. > My initial thought on solving this is to add NM resource capabilities to the > NM-RM heartbeat and have the RM update its view of the NM's resource > capabilities on each heartbeat. This would be a fairly trivial change, but > comes with the unfortunate side effect that it completely undermines {{yarn > rmadmin -updateNodeResource}}. 
When you run {{-updateNodeResource}} the > assumption is that the node will retain these new resource capabilities until > either the NM or RM is restarted. But with a heartbeat interaction constantly > updating those resource capabilities from the NM perspective, the explicit > changes via {{-updateNodeResource}} would be lost on the next heartbeat. We > could potentially add a flag to ignore the heartbeat updates for any node who > has had {{-updateNodeResource}} called on it (until a re-registration). But > in this case, the node would no longer get resource capability updates until > the NM or RM restarted. If {{-updateNodeResource}} is used a decent amount, > then that would give potentially unexpected behavior in relation to nodes > properly auto-detecting failures. > Another idea is to add a GPU monitor thread on the NM to periodically run > {{nvidia-smi}} and detect changes in the number of healthy GPUs. If that > number decreased, the node would hook into the health check status and mark > itself as unhealthy. The downside of this approach is that a single failed > GPU would mean taking out an entire node (e.g. 8 GPUs). > I would really like to go with the NM-RM heartbeat approach, but the > {{-updateNodeResource}} issue bothers me. The second approach is ok I guess, > but I also don't like taking down whole GPU nodes when only a single GPU is > bad. Would like to hear thoughts of others on how best to approach this > [~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
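To make the second approach above (a monitor thread that polls {{nvidia-smi}}) concrete, here is a minimal sketch of its parsing side. It assumes the output shape of {{nvidia-smi -L}} (one "GPU <n>: ..." line per visible device); the class and method names are illustrative and not part of the actual NodeManager code.

```java
// Sketch of the "GPU monitor thread" idea: periodically run `nvidia-smi -L`,
// count the visible GPUs, and mark the node unhealthy if the count drops.
// Names are hypothetical; this is not the actual NodeManager implementation.
public class GpuHealthCheckSketch {

    // `nvidia-smi -L` prints one "GPU <n>: <model> (UUID: ...)" line per device.
    static int countVisibleGpus(String smiOutput) {
        int gpus = 0;
        for (String line : smiOutput.split("\n")) {
            if (line.startsWith("GPU ")) {
                gpus++;
            }
        }
        return gpus;
    }

    // The node would report itself unhealthy through the existing health-check
    // mechanism when fewer GPUs are visible than the configured capacity.
    static boolean isHealthy(String smiOutput, int expectedGpus) {
        return countVisibleGpus(smiOutput) >= expectedGpus;
    }
}
```

As the comment notes, the drawback of this approach is granularity: one failed GPU out of eight would mark the entire node unhealthy.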
[jira] [Commented] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302860#comment-17302860 ] Eric Badger commented on YARN-9618: --- The patch looks reasonable to me. Agree with [~gandras] that some stress testing should be done before committing > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch > > > In the current implementation, NodeListManager events block the async dispatcher, which can > cause RM crashes and slow down event processing. > # On cluster restart with 1K running apps, each node-usable event will create 1K > events; overall this could be 5K*1K events for a 5K-node cluster. > # Event processing is blocked until new events are added to the queue. > Solution : > # Add another async event handler, similar to the scheduler's. > # Instead of adding events to the dispatcher directly, call the RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
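The proposed solution in YARN-9618 (a dedicated async event handler, similar to the scheduler's) can be pictured with the skeleton below. This is only an illustrative sketch, not the patch itself, and all names are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a dedicated dispatcher: node-usable/unusable events go onto their
// own queue and worker thread, so bursts (e.g. 1K apps x 5K nodes on restart)
// no longer block the RM's central async dispatcher. Names are hypothetical.
public class NodeListEventDispatcherSketch {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread worker = new Thread(() -> {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                queue.take().run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }, "node-list-event-dispatcher");

    public void start() {
        worker.setDaemon(true);
        worker.start();
    }

    // Enqueue without blocking the caller (the central dispatcher thread).
    public void handle(Runnable event) {
        queue.add(event);
    }

    public void stop() {
        worker.interrupt();
    }
}
```

The design choice mirrors what the issue describes: the expensive per-app fan-out happens on the dedicated thread, so the central dispatcher only pays the cost of an enqueue.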
[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it
[ https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302845#comment-17302845 ] Szilard Nemeth commented on YARN-10652: --- Hi [~sahuja], Answering your comment from [here|https://issues.apache.org/jira/browse/YARN-10652?focusedCommentId=17295634=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17295634]. 1. This might be tough to implement but [~pbacsko] and [~shuzirra] know the internals of the placement engine better than myself. 2. I think it's okay to have it documented, so I'd choose this from your suggestions. Could you please file a jira for this? 3. This is also a good idea. Furthermore, can you file a follow-up jira (you can file more if necessary) as suggested by [Peter's comment|https://issues.apache.org/jira/browse/YARN-10652?focusedCommentId=17295964=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17295964] to cover the problematic cases we already discovered during code inspection and while having our discussion here? All in all, if you file follow-up jiras to make this use-case more stable and consistent, I'm fine. So, I'm giving +1 (binding) for your patch. [~wilfreds] I get your point with the last comment. Based on my comment above: As you wanted to commit this in the first place, please go ahead with committing. Thanks. > Capacity Scheduler fails to handle user weights for a user that has a "." > (dot) in it > - > > Key: YARN-10652 > URL: https://issues.apache.org/jira/browse/YARN-10652 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Major > Attachments: Correct user weight of 0.76 picked up for the user with > a dot after the patch.png, Incorrect default user weight of 1.0 being picked > for the user with a dot before the patch.png, YARN-10652.001.patch > > > AD usernames can have a "." (dot) in them i.e. 
they can be of the format -> > {{firstname.lastname}}. However, if you specify a username with this format > against the Capacity Scheduler setting -> > {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}}, > it fails to be applied and is instead assigned the default of 1.0f weight. > This renders the user weight feature (being used as a means of setting user > priorities for a queue) unusable for such users. > This limitation comes from [1]. From [1], only word characters (A word > character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no > good for AD names that contain a "." (dot). > Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and > HADOOP-15395 and the outcome was to use non-whitespace characters i.e. > instead of {{\w+}}, use {{\S+}}. > We could go down similar path and unblock this feature for the AD usernames > with a "." (dot) in them. > [1] > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953 > [2] > https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
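The proposed change discussed above (moving from {{\w+}} to {{\S+}}) can be illustrated with a small, self-contained regex example. The patterns below only mimic the shape of the user-settings key matching; they are not the actual CapacitySchedulerConfiguration code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrates why \w+ fails for AD-style usernames containing a dot, while
// \S+ works. The patterns mimic the user-settings key shape; they are not
// the actual CapacitySchedulerConfiguration regex.
public class UserWeightRegexSketch {
    static final Pattern WORD_ONLY =
        Pattern.compile("user-settings\\.(\\w+)\\.weight");
    static final Pattern NON_WHITESPACE =
        Pattern.compile("user-settings\\.(\\S+)\\.weight");

    // Returns the captured username, or null if the key does not match.
    static String extractUser(Pattern p, String key) {
        Matcher m = p.matcher(key);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String key = "user-settings.firstname.lastname.weight";
        // \w+ cannot span the dot, so the whole key fails to match.
        System.out.println(extractUser(WORD_ONLY, key));      // null
        // \S+ accepts the dot, so the dotted username is captured.
        System.out.println(extractUser(NON_WHITESPACE, key)); // firstname.lastname
    }
}
```

Note that {{\S+}} captures greedily, which is why the trailing {{.weight}} suffix must be anchored in the pattern; the regex engine backtracks until the suffix matches, leaving {{firstname.lastname}} in the capture group.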
[jira] [Comment Edited] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302817#comment-17302817 ] Ahmed Hussein edited comment on YARN-10501 at 3/16/21, 6:55 PM: That's confusing. I am sure [~aajisaka] has a better clue. branch-2.10 dev-support/Jenkinsfile defines {{YETUS_ARGS+=("--findbugs-strict-precheck")}}. I do not know where {{--spotbugs-strict-precheck}} comes from on branch-2.10 builds. was (Author: ahussein): That's confusing. I am sure [~aajisaka] has a better clue. branch-2.10 -> dev-support/Jenkinsfile defines {{YETUS_ARGS+=("--findbugs-strict-precheck")}}. I do not know where {{--spotbugs-strict-precheck}} comes from on branch-2.10 builds. > Can't remove all node labels after add node label without nodemanager port > -- > > Key: YARN-10501 > URL: https://issues.apache.org/jira/browse/YARN-10501 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3 > > Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, > YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch, > YARN-10502-branch-2.10.003.patch > > > When adding a label to nodes without a nodemanager port, or using the WILDCARD_PORT (0) > port, not all label info can be removed from these nodes > Reproduce process: > {code:java} > 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)" > 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode" > 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > 
{"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}} > 4.yarn rmadmin -replaceLabelsOnNode "server001" > 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}} > {code} > You can see that after step 4, which removes the nodemanager labels, the label info > is still in the node info. > {code:java} > 641 case REPLACE: > 642 replaceNodeForLabels(nodeId, host.labels, labels); > 643 replaceLabelsForNode(nodeId, host.labels, labels); > 644 host.labels.clear(); > 645 host.labels.addAll(labels); > 646 for (Node node : host.nms.values()) { > 647 replaceNodeForLabels(node.nodeId, node.labels, labels); > 649 node.labels = null; > 650 } > 651 break;{code} > The cause is on line 647: when adding labels to a node without a port, the 0 port > and the real NM port will both be added to the node info, and when removing labels, > the node.labels parameter on line 647 is null, so it will not remove the old > label. 
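One way to picture the null-labels problem on line 647 is that label removal needs a fallback when the per-NM entry never had its own label set. The helper below only illustrates that fallback idea under hypothetical names; it is not the committed fix.

```java
import java.util.Collections;
import java.util.Set;

// Illustrates the fallback idea: a NM that registered under the host entry
// may have node.labels == null, so removal logic must fall back to the host's
// labels instead of treating null as "nothing to remove". Names hypothetical.
public class LabelRemovalSketch {
    static Set<String> effectiveOldLabels(Set<String> nodeLabels,
                                          Set<String> hostLabels) {
        if (nodeLabels != null) {
            // The NM entry tracks its own labels; use them directly.
            return nodeLabels;
        }
        // Inherit the host's labels so they are actually removed.
        return hostLabels != null ? hostLabels : Collections.emptySet();
    }
}
```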
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302817#comment-17302817 ] Ahmed Hussein commented on YARN-10501: -- That's confusing. I am sure [~aajisaka] has a better clue. branch-2.10 -> dev-support/Jenkinsfile defines {{YETUS_ARGS+=("--findbugs-strict-precheck")}}. I do not know where {{--spotbugs-strict-precheck}} comes from on branch-2.10 builds. > Can't remove all node labels after add node label without nodemanager port > -- > > Key: YARN-10501 > URL: https://issues.apache.org/jira/browse/YARN-10501 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3 > > Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, > YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch, > YARN-10502-branch-2.10.003.patch > > > When adding a label to nodes without a nodemanager port, or using the WILDCARD_PORT (0) > port, not all label info can be removed from these nodes > Reproduce process: > {code:java} > 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)" > 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode" > 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}} > 4.yarn rmadmin -replaceLabelsOnNode "server001" > 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > 
{"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}} > {code} > You can see that after step 4, which removes the nodemanager labels, the label info > is still in the node info. > {code:java} > 641 case REPLACE: > 642 replaceNodeForLabels(nodeId, host.labels, labels); > 643 replaceLabelsForNode(nodeId, host.labels, labels); > 644 host.labels.clear(); > 645 host.labels.addAll(labels); > 646 for (Node node : host.nms.values()) { > 647 replaceNodeForLabels(node.nodeId, node.labels, labels); > 649 node.labels = null; > 650 } > 651 break;{code} > The cause is on line 647: when adding labels to a node without a port, the 0 port > and the real NM port will both be added to the node info, and when removing labels, > the node.labels parameter on line 647 is null, so it will not remove the old > label. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302782#comment-17302782 ] Eric Badger commented on YARN-10501: [~aajisaka], [~ahussein], most recent builds are failing due to some yetus flag errors. Is this a recent change? Do you know how to mitigate it? > Can't remove all node labels after add node label without nodemanager port > -- > > Key: YARN-10501 > URL: https://issues.apache.org/jira/browse/YARN-10501 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3 > > Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, > YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch, > YARN-10502-branch-2.10.003.patch > > > When adding a label to nodes without a nodemanager port, or using the WILDCARD_PORT (0) > port, not all label info can be removed from these nodes > Reproduce process: > {code:java} > 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)" > 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode" > 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}} > 4.yarn rmadmin -replaceLabelsOnNode "server001" > 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > 
{"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}} > {code} > You can see that after step 4, which removes the nodemanager labels, the label info > is still in the node info. > {code:java} > 641 case REPLACE: > 642 replaceNodeForLabels(nodeId, host.labels, labels); > 643 replaceLabelsForNode(nodeId, host.labels, labels); > 644 host.labels.clear(); > 645 host.labels.addAll(labels); > 646 for (Node node : host.nms.values()) { > 647 replaceNodeForLabels(node.nodeId, node.labels, labels); > 649 node.labels = null; > 650 } > 651 break;{code} > The cause is on line 647: when adding labels to a node without a port, the 0 port > and the real NM port will both be added to the node info, and when removing labels, > the node.labels parameter on line 647 is null, so it will not remove the old > label. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10495) make the rpath of container-executor configurable
[ https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302761#comment-17302761 ] Eric Badger commented on YARN-10495: [~angerszhu], I don't think it's a good idea to ship glibc with Hadoop. glibc is tied very closely to the kernel and if the ABI has changed then it won't work. > make the rpath of container-executor configurable > - > > Key: YARN-10495 > URL: https://issues.apache.org/jira/browse/YARN-10495 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10495.001.patch, YARN-10495.002.patch > > > In https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on > crypto to container-executor. We met a case where on our jenkins machine we > have libcrypto.so.1.0.0 in the shared lib env, but on our nodemanager machine we > don't have libcrypto.so.1.0.0 but *libcrypto.so.1.1.* > We use an internal custom dynamic link library environment > /usr/lib/x86_64-linux-gnu > and we build hadoop with parameters as below > {code:java} > -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu > {code} > > Under the jenkins machine's shared lib library path /usr/lib/x86_64-linux-gnu (where > libcrypto is) > {code:java} > -rw-r--r-- 1 root root 240136 Nov 28 2014 libcroco-0.6.so.3.0.1 > -rw-r--r-- 1 root root54550 Jun 18 2017 libcrypt.a > -rw-r--r-- 1 root root 4306444 Sep 26 2019 libcrypto.a > lrwxrwxrwx 1 root root 18 Sep 26 2019 libcrypto.so -> > libcrypto.so.1.0.0 > -rw-r--r-- 1 root root 2070976 Sep 26 2019 libcrypto.so.1.0.0 > lrwxrwxrwx 1 root root 35 Jun 18 2017 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 Jun 18 2017 libc.so > {code} > > Under the nodemanager's shared lib library path /usr/lib/x86_64-linux-gnu (where > libcrypto is) > {code:java} > -rw-r--r-- 1 root root55852 2月 7 2019 libcrypt.a > -rw-r--r-- 1 root root 4864244 9月 28 2019 
libcrypto.a > lrwxrwxrwx 1 root root 16 9月 28 2019 libcrypto.so -> > libcrypto.so.1.1 > -rw-r--r-- 1 root root 2504576 12月 24 2019 libcrypto.so.1.0.2 > -rw-r--r-- 1 root root 2715840 9月 28 2019 libcrypto.so.1.1 > lrwxrwxrwx 1 root root 35 2月 7 2019 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 2月 7 2019 libc.so > {code} > We build container-executor on the jenkins machine. Because the libcrypto.so versions are not the same, we get an error when we start the nodemanager > > {code:java} > .. 3 more Caused by: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: > error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared > object file: No such file or directory at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306) > ... 4 more Caused by: ExitCodeException exitCode=127: > /home/hadoop/hadoop/bin/container-executor: error while loading shared > libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file > or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at > org.apache.hadoop.util.Shell.run(Shell.java:901) at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154) > ... 
6 more > {code} > > We should make RPATH of container-executor configurable to solve this problem -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-1151) Ability to configure auxiliary services from HDFS-based JAR files
[ https://issues.apache.org/jira/browse/YARN-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302759#comment-17302759 ] Haibo Chen commented on YARN-1151: -- cherry-picked to branch-2.10 > Ability to configure auxiliary services from HDFS-based JAR files > - > > Key: YARN-1151 > URL: https://issues.apache.org/jira/browse/YARN-1151 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.1.0-beta, 2.9.0 >Reporter: john lilley >Assignee: Xuan Gong >Priority: Major > Labels: auxiliary-service, yarn > Fix For: 3.2.0, 3.1.1, 2.10.2 > > Attachments: YARN-1151.1.patch, YARN-1151.2.patch, YARN-1151.3.patch, > YARN-1151.4.patch, YARN-1151.5.patch, YARN-1151.6.patch, > YARN-1151.branch-2.poc.2.patch, YARN-1151.branch-2.poc.3.patch, > YARN-1151.branch-2.poc.patch, [YARN-1151] [Design] Configure auxiliary > services from HDFS-based JAR files.pdf > > > I would like to install an auxiliary service in Hadoop YARN without actually > installing files/services on every node in the system. Discussions on the > user@ list indicate that this is not easily done. The reason we want an > auxiliary service is that our application has some persistent-data components > that are not appropriate for HDFS. In fact, they are somewhat analogous to > the mapper output of MapReduce's shuffle, which is what led me to > auxiliary-services in the first place. It would be much easier if we could > just place our service's JARs in HDFS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-1151) Ability to configure auxiliary services from HDFS-based JAR files
[ https://issues.apache.org/jira/browse/YARN-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-1151: - Fix Version/s: 2.10.2 > Ability to configure auxiliary services from HDFS-based JAR files > - > > Key: YARN-1151 > URL: https://issues.apache.org/jira/browse/YARN-1151 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.1.0-beta, 2.9.0 >Reporter: john lilley >Assignee: Xuan Gong >Priority: Major > Labels: auxiliary-service, yarn > Fix For: 3.2.0, 3.1.1, 2.10.2 > > Attachments: YARN-1151.1.patch, YARN-1151.2.patch, YARN-1151.3.patch, > YARN-1151.4.patch, YARN-1151.5.patch, YARN-1151.6.patch, > YARN-1151.branch-2.poc.2.patch, YARN-1151.branch-2.poc.3.patch, > YARN-1151.branch-2.poc.patch, [YARN-1151] [Design] Configure auxiliary > services from HDFS-based JAR files.pdf > > > I would like to install an auxiliary service in Hadoop YARN without actually > installing files/services on every node in the system. Discussions on the > user@ list indicate that this is not easily done. The reason we want an > auxiliary service is that our application has some persistent-data components > that are not appropriate for HDFS. In fact, they are somewhat analogous to > the mapper output of MapReduce's shuffle, which is what led me to > auxiliary-services in the first place. It would be much easier if we could > just place our service's JARs in HDFS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302754#comment-17302754 ] Hadoop QA commented on YARN-10674: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 30s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 2 new or modified test files. 
{color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 32s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 51s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 44s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 55s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 14s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 43s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 20m 35s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} spotbugs {color} | {color:green} 2m 1s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 50s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 59s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 49s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 49s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 41s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/803/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 13 unchanged - 7 fixed = 14 total (was 20) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 49s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 59s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color}
[jira] [Commented] (YARN-10698) Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10
[ https://issues.apache.org/jira/browse/YARN-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302681#comment-17302681 ] Hadoop QA commented on YARN-10698: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 36s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} yetus {color} | {color:red} 0m 14s{color} | {color:red}{color} | {color:red} Unprocessed flag(s): --spotbugs-strict-precheck {color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/804/artifact/out/Dockerfile | | JIRA Issue | YARN-10698 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13022399/YARN-10698.branch-2.10.00.patch | | Console output | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/804/console | | versions | git=2.7.4 | | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. > Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10 > - > > Key: YARN-10698 > URL: https://issues.apache.org/jira/browse/YARN-10698 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.10.1 >Reporter: Haibo Chen >Assignee: Haibo Chen >Priority: Major > Attachments: YARN-10698.branch-2.10.00.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10698) Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10
[ https://issues.apache.org/jira/browse/YARN-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-10698: -- Attachment: YARN-10698.branch-2.10.00.patch > Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10 > - > > Key: YARN-10698 > URL: https://issues.apache.org/jira/browse/YARN-10698 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.10.1 >Reporter: Haibo Chen >Assignee: Haibo Chen >Priority: Major > Attachments: YARN-10698.branch-2.10.00.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10698) Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10
[ https://issues.apache.org/jira/browse/YARN-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-10698: -- Target Version/s: 2.10.2 > Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10 > - > > Key: YARN-10698 > URL: https://issues.apache.org/jira/browse/YARN-10698 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.10.1 >Reporter: Haibo Chen >Assignee: Haibo Chen >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302670#comment-17302670 ] Hadoop QA commented on YARN-9618: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 30s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 2 new or modified test files. 
{color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 35s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 2s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 47s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 59s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 58s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 45s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 21m 18s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} spotbugs {color} | {color:green} 1m 55s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 54s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 58s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 46s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 42s{color} | {color:green}{color} | {color:green} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 58 unchanged - 1 fixed = 58 total (was 59) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 51s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 15s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | |
[jira] [Updated] (YARN-10698) Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10
[ https://issues.apache.org/jira/browse/YARN-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-10698: -- Affects Version/s: 2.10.1 > Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10 > - > > Key: YARN-10698 > URL: https://issues.apache.org/jira/browse/YARN-10698 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.10.1 >Reporter: Haibo Chen >Assignee: Haibo Chen >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10698) Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10
[ https://issues.apache.org/jira/browse/YARN-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-10698: -- Summary: Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10 (was: Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2) > Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2.10 > - > > Key: YARN-10698 > URL: https://issues.apache.org/jira/browse/YARN-10698 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Haibo Chen >Assignee: Haibo Chen >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302660#comment-17302660 ] Andras Gyori commented on YARN-9618: Thanks [~zhuqi] for the patch, it seems to be a good scalability improvement. I think it has relatively low risk, as dispatching events to its own handlers is a common idiom in the ResourceManager. This, however, affects a core part of YARN, so we need to be careful here. My addition to the issue:
* Please use full types everywhere, like EventDispatcher, or use a wildcard if the type is unknown.
> NodeListManager event improvement
> -
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
> Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-9618.001.patch, YARN-9618.002.patch, YARN-9618.003.patch
>
> The current NodeListManager implementation blocks the async dispatcher with node-list events, which can crash the RM and slow down event processing.
> # Cluster restart with 1K running apps: each node-usable event creates 1K events, so overall this could be 5K*1K events for a 5K-node cluster.
> # Event processing is blocked until the new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler's.
> # Instead of adding events to the dispatcher directly, call the RMApp event handler.
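The solution proposed in the issue — a dedicated async handler so that node-list fan-out cannot stall the central dispatcher — can be sketched roughly as follows. This is an illustration only, not the YARN-9618 patch; the class and method names are hypothetical:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a dedicated dispatcher thread: events are queued
// and handled on a separate worker, so a burst of per-app node events
// never blocks the caller (the RM's main async dispatcher).
public class NodeEventDispatcher<E> implements AutoCloseable {

    private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    public NodeEventDispatcher(java.util.function.Consumer<E> handler) {
        worker = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // Drain events one by one on this dedicated thread.
                    handler.accept(queue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "node-event-dispatcher");
        worker.setDaemon(true);
        worker.start();
    }

    // Enqueueing is O(1); the caller never waits on the downstream handler.
    public void handle(E event) {
        queue.add(event);
    }

    @Override
    public void close() {
        worker.interrupt();
    }
}
```

Under this shape, a cluster restart that fans one node event out into thousands of per-app events only fills a queue, instead of occupying the shared dispatcher until every handler has run.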
[jira] [Updated] (YARN-10698) Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2
[ https://issues.apache.org/jira/browse/YARN-10698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-10698: -- Target Version/s: (was: 2.10.2) > Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2 > -- > > Key: YARN-10698 > URL: https://issues.apache.org/jira/browse/YARN-10698 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Haibo Chen >Assignee: Haibo Chen >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10698) Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2
Haibo Chen created YARN-10698: - Summary: Backport YARN-1151 (load auxiliary service from HDFS archives) to branch-2 Key: YARN-10698 URL: https://issues.apache.org/jira/browse/YARN-10698 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Haibo Chen Assignee: Haibo Chen -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10686) Fix TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode
[ https://issues.apache.org/jira/browse/YARN-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302599#comment-17302599 ] Peter Bacsko commented on YARN-10686: - +1 Thanks [~zhuqi] for the patch and [~gandras] for the review. Committed to trunk. > Fix > TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode > - > > Key: YARN-10686 > URL: https://issues.apache.org/jira/browse/YARN-10686 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10686.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302595#comment-17302595 ] Qi Zhu commented on YARN-10674: --- Thanks [~pbacsko] for valid suggestion. Updated this in latest patch.:D > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10686) Fix TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode
[ https://issues.apache.org/jira/browse/YARN-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10686: Summary: Fix TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode (was: Fix testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode user error.) > Fix > TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode > - > > Key: YARN-10686 > URL: https://issues.apache.org/jira/browse/YARN-10686 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10686.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-10674: -- Attachment: YARN-10674.011.patch > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch
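The FS auto-deletion check described in this issue amounts to a fixed-interval background loop (onCheck() every ALLOC_RELOAD_INTERVAL_MS). A minimal equivalent sketch — not the YARN code, just an illustration of the pattern with a configurable interval:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a fixed-interval check loop: run onCheck() every
// intervalMs on a daemon background thread, the way FS re-runs
// removeEmptyDynamicQueues() every 10 seconds.
public class PeriodicQueueCheck {

    public static ScheduledExecutorService start(Runnable onCheck,
            long intervalMs) {
        ScheduledExecutorService ses =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "queue-check");
                t.setDaemon(true);
                return t;
            });
        // scheduleWithFixedDelay mirrors the sleep-between-checks loop:
        // the next check starts intervalMs after the previous one finishes.
        ses.scheduleWithFixedDelay(onCheck, intervalMs, intervalMs,
            TimeUnit.MILLISECONDS);
        return ses;
    }
}
```

A CS-side deletion policy would plug its own queue-pruning logic in as the onCheck runnable; the 10s value in FS is simply the intervalMs argument here.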
[jira] [Commented] (YARN-10659) Improve CS MappingRule %secondary_group evaluation
[ https://issues.apache.org/jira/browse/YARN-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302587#comment-17302587 ] Hadoop QA commented on YARN-10659: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 58s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 1 new or modified test files. 
{color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 46s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 1s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 50s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 44s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 53s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 29s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 42s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 36s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 20m 38s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} spotbugs {color} | {color:green} 1m 52s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 48s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 53s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 53s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 43s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 43s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 42s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/801/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 6 unchanged - 1 fixed = 7 total (was 7) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 4s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 1s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} |
[jira] [Commented] (YARN-10682) The scheduler monitor policies conf should trim values separated by comma
[ https://issues.apache.org/jira/browse/YARN-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302567#comment-17302567 ] Peter Bacsko commented on YARN-10682: - +1 Thanks for the patch [~zhuqi] and [~gandras] for the review, committed to trunk.
> The scheduler monitor policies conf should trim values separated by comma
> -
>
> Key: YARN-10682
> URL: https://issues.apache.org/jira/browse/YARN-10682
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10682.001.patch
>
> When I configured the scheduler monitor policies with spaces, the RM failed to start.
> The conf should trim the values around each ",": "a,b,c" is accepted today, but "a, b, c" is not; this jira just adds the trim.
> It happened when testing multiple policies:
> yarn.resourcemanager.scheduler.monitor.policies
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.QueueConfigurationAutoRefreshPolicy,
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AutoCreatedQueueDeletionPolicy
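The trimming described in this issue is straightforward to illustrate. The sketch below is not the YARN-10682 patch (Hadoop's Configuration also exposes getTrimmedStrings for exactly this purpose); it only shows the parsing behavior the fix targets:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split a comma-separated conf value and trim the
// whitespace around each entry, so "a, b, c" parses the same as "a,b,c".
public class PolicyListParser {

    public static List<String> parsePolicies(String raw) {
        List<String> result = new ArrayList<>();
        for (String part : raw.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }
}
```

With this parsing, the multi-line policies value quoted in the description (which carries spaces and line breaks between class names) loads cleanly instead of failing RM startup.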
[jira] [Updated] (YARN-10682) The scheduler monitor policies conf should trim values separated by comma
[ https://issues.apache.org/jira/browse/YARN-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10682: Summary: The scheduler monitor policies conf should trim values separated by comma (was: The scheduler monitor policies conf should support trim between ",".) > The scheduler monitor policies conf should trim values separated by comma > - > > Key: YARN-10682 > URL: https://issues.apache.org/jira/browse/YARN-10682 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10682.001.patch
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302548#comment-17302548 ] Peter Bacsko commented on YARN-10674: - Thanks [~zhuqi] this is definitely looks better. We're close to the final version. Some comments: 1. {noformat} Disable the preemption with nopolicy or observeonly mode, " + "default mode is nopolicy with no arg." + "When use nopolicy arg, it means to remove " + "ProportionalCapacityPreemptionPolicy for CS preemption, " + "When use observeonly arg, " + "it means to set " + "yarn.resourcemanager.monitor.capacity.preemption.observe_only " + "to true" {noformat} I'd to slightly modify this text: {noformat} Disable the preemption with \"nopolicy\" or \"observeonly\" mode. Default is \"nopolicy\". \"nopolicy\" removes ProportionalCapacityPreemptionPolicy from the list of monitor policies. \"observeronly\" sets \"yarn.resourcemanager.monitor.capacity.preemption.observe_only\" to true. {noformat} 2. This definition: {{private String disablePreemptionMode;}} This should be a simple enum like: {noformat} public enum DisablePreemptionMode { OBSERVE_ONLY { @Override String getCliOption() { return "observeonly"; } }, NO_POLICY { @Override String getCliOption() { return "nopolicy"; } }; abstract String getCliOption(); } {noformat} So you can also use them here: {noformat} private static void checkDisablePreemption(CliOption cliOption, String disablePreemptionMode) { if (disablePreemptionMode == null || disablePreemptionMode.trim().isEmpty()) { // The default mode is nopolicy. return; } try { DisablePreemptionMode.valueOf(disablePreemptionMode); } catch (IllegalArgumentException e) { throw new PreconditionException( String.format("Specified disable-preemption option %s is illegal, " + " use \"nopolicy\" or \"observeonly\"")); } {noformat} "disablePreemptionMode" should be an enum everywhere. 3. 
{noformat} public void convertSiteProperties(Configuration conf, Configuration yarnSiteConfig, boolean drfUsed, boolean enableAsyncScheduler) boolean enableAsyncScheduler, boolean userPercentage, boolean disablePreemption, String disablePreemptionMode) { {noformat} Here "disablePreemptionMode" should be an enum also and make sure that it always has a value. If it always has a value, this part becomes much simpler: {noformat} if (disablePreemption && disablePreemptionMode == DisablePreemptionMode.NO_POLICY) { yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, ""); } } {noformat} 4. {{AutoCreatedQueueDeletionPolicy.class.getCanonicalName())}} This string is referenced very often in the tests. Instead, use a final String: {noformat} private static final String DELETION_POLICY_CLASS = AutoCreatedQueueDeletionPolicy.class.getCanonicalName(); {noformat} So the readability becomes much better. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... 
> Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
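The enum suggested in the review above can be sketched as a standalone class. This is a hedged illustration, not the actual fs2cs patch: `fromCliOption` is a hypothetical helper added here to show how the raw CLI string would map onto the enum constants, since `Enum.valueOf` expects the constant name ("NO_POLICY") rather than the CLI spelling ("nopolicy").

```java
// Sketch of the enum-based disable-preemption mode from the review comments.
// fromCliOption is an assumed helper, not part of the proposed patch.
public class DisablePreemptionModeDemo {

    public enum DisablePreemptionMode {
        OBSERVE_ONLY("observeonly"),
        NO_POLICY("nopolicy");

        private final String cliOption;

        DisablePreemptionMode(String cliOption) {
            this.cliOption = cliOption;
        }

        String getCliOption() {
            return cliOption;
        }

        // Maps a raw CLI argument onto an enum constant; defaults to NO_POLICY
        // when the argument is absent, matching the "default is nopolicy" rule.
        static DisablePreemptionMode fromCliOption(String arg) {
            if (arg == null || arg.trim().isEmpty()) {
                return NO_POLICY;
            }
            for (DisablePreemptionMode mode : values()) {
                if (mode.cliOption.equals(arg.trim())) {
                    return mode;
                }
            }
            throw new IllegalArgumentException(
                "Specified disable-preemption option " + arg
                + " is illegal, use \"nopolicy\" or \"observeonly\"");
        }
    }

    public static void main(String[] args) {
        // Missing argument falls back to the default mode.
        System.out.println(DisablePreemptionMode.fromCliOption(null));
        System.out.println(DisablePreemptionMode.fromCliOption("observeonly"));
    }
}
```

With the enum carrying its own CLI spelling, the validation method in the review collapses into a single parse call, and illegal values fail in one place.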
[jira] [Updated] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-9618: - Attachment: YARN-9618.003.patch > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch > > > The current implementation of NodeListManager events blocks the async dispatcher and can > cause RM crashes and slow down event processing. > # Cluster restart with 1K running apps: each usable event will create 1K > events, so overall there could be 5K*1K events for a 5K-node cluster. > # Event processing is blocked till new events are added to the queue. > Solution: > # Add another async event handler, similar to the scheduler. > # Instead of adding events to the dispatcher, directly call the RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302517#comment-17302517 ] Qi Zhu commented on YARN-9618: -- Fixed the test and checkstyle issues in the latest patch. :D > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch > > > The current implementation of NodeListManager events blocks the async dispatcher and can > cause RM crashes and slow down event processing. > # Cluster restart with 1K running apps: each usable event will create 1K > events, so overall there could be 5K*1K events for a 5K-node cluster. > # Event processing is blocked till new events are added to the queue. > Solution: > # Add another async event handler, similar to the scheduler. > # Instead of adding events to the dispatcher, directly call the RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302478#comment-17302478 ] Hadoop QA commented on YARN-9618: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 18s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 2 new or modified test files. 
{color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 34s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 0s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 50s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 45s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 52s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 54s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 41s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 36s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 20m 2s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} spotbugs {color} | {color:green} 1m 52s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 48s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 54s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 45s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 45s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 40s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/800/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 11 new + 58 unchanged - 1 fixed = 69 total (was 59) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 48s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 55s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} |
[jira] [Created] (YARN-10697) Resources are displayed in bytes in UI for schedulers other than capacity
Bilwa S T created YARN-10697: Summary: Resources are displayed in bytes in UI for schedulers other than capacity Key: YARN-10697 URL: https://issues.apache.org/jira/browse/YARN-10697 Project: Hadoop YARN Issue Type: Bug Reporter: Bilwa S T Assignee: Bilwa S T Resources.newInstance expects memory in MB, whereas MetricsOverviewTable passes resources in bytes. Also, we should display memory in GB for better readability. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
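The unit mismatch described in YARN-10697 can be sketched with two small helpers. This is an illustration only, not the actual patch: the method names `bytesToMb` and `formatGb` are assumptions introduced here, and only `Resources.newInstance` expecting MB comes from the issue text.

```java
import java.util.Locale;

// Hedged sketch of the unit handling: the metrics table receives memory in
// bytes, Resources.newInstance expects MB, and GB reads better in the UI.
public class MemoryUnits {

    // Bytes -> MB, the unit Resources.newInstance expects for memory.
    static long bytesToMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Bytes -> a human-readable GB string for display.
    static String formatGb(long bytes) {
        return String.format(Locale.ROOT, "%.2f GB",
            bytes / (1024.0 * 1024.0 * 1024.0));
    }

    public static void main(String[] args) {
        long reportedBytes = 3L * 1024 * 1024 * 1024; // 3 GiB reported in bytes
        System.out.println(bytesToMb(reportedBytes) + " MB = " + formatGb(reportedBytes));
    }
}
```

Passing `reportedBytes` straight into an MB-typed field inflates the value by a factor of 1024*1024, which is exactly the display bug the issue reports.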
[jira] [Updated] (YARN-10659) Improve CS MappingRule %secondary_group evaluation
[ https://issues.apache.org/jira/browse/YARN-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gergely Pollak updated YARN-10659: -- Attachment: YARN-10659.002.patch > Improve CS MappingRule %secondary_group evaluation > -- > > Key: YARN-10659 > URL: https://issues.apache.org/jira/browse/YARN-10659 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Attachments: YARN-10659.001.patch, YARN-10659.002.patch > > > Since the leaf queue names are not unique, there are a lot of use cases where > %secondary_group evaluation fails or behaves inconsistently. > We should extend its behavior: when it's under a defined parent, > %secondary_group evaluation should only check for queue existence under that > queue. E.g. root.group.%secondary_group should only evaluate to groups which > exist under root.group, while the legacy %secondary_group.%user should still > look for groups by their leaf name globally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
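The parent-scoped %secondary_group resolution proposed above can be sketched with plain string logic. This is a simplified stand-in, not the CS mapping-rule code: `resolveUnderParent` and the flat queue-path set are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hedged sketch of parent-scoped %secondary_group evaluation: under a defined
// parent, only secondary groups that exist as queues *under that parent* match.
public class SecondaryGroupDemo {

    // Returns the full path of the first secondary group that exists as a
    // child queue of 'parent', or null when none matches.
    static String resolveUnderParent(String parent, List<String> secondaryGroups,
                                     Set<String> queuePaths) {
        for (String group : secondaryGroups) {
            String candidate = parent + "." + group;
            if (queuePaths.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> queues = new HashSet<>(Arrays.asList("root.group.devs", "root.other.ops"));
        List<String> userSecondaryGroups = Arrays.asList("ops", "devs");

        // "ops" only exists under root.other, so under root.group it is
        // skipped and "devs" matches instead.
        System.out.println(resolveUnderParent("root.group", userSecondaryGroups, queues));
    }
}
```

The legacy `%secondary_group.%user` form would instead match "ops" by its leaf name anywhere in the hierarchy, which is the inconsistency the issue describes.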
[jira] [Created] (YARN-10696) Add RMNodeEvent to single async dispatcher before YARN-9927.
Qi Zhu created YARN-10696: - Summary: Add RMNodeEvent to single async dispatcher before YARN-9927. Key: YARN-10696 URL: https://issues.apache.org/jira/browse/YARN-10696 Project: Hadoop YARN Issue Type: Sub-task Reporter: Qi Zhu Assignee: Qi Zhu According to the YARN-9927 analysis, RMNodeStatusEvent accounts for about 90% of the time consumed by the RM event scheduler. We should add RMNodeEvent to a separate async event handler, similar to YARN-9618. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
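The "separate async event handler" idea behind YARN-9618 and YARN-10696 can be sketched generically: a heavy event type gets its own queue and dispatch thread, so enqueuing it never stalls the central dispatcher. This is a toy simplification of Hadoop's AsyncDispatcher pattern, assumed for illustration; it is not the actual patch, and the class and method names are invented here.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy sketch: events are buffered in an unbounded queue and consumed by a
// dedicated daemon thread, keeping the producer (the central dispatcher)
// non-blocking.
public class SeparateDispatcherDemo {

    static class MiniAsyncDispatcher {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final AtomicInteger handled = new AtomicInteger();

        MiniAsyncDispatcher() {
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        String event = queue.poll(50, TimeUnit.MILLISECONDS);
                        if (event != null) {
                            handled.incrementAndGet(); // real code routes to an EventHandler
                        }
                    }
                } catch (InterruptedException ignored) {
                    // daemon thread exits with the JVM
                }
            });
            worker.setDaemon(true);
            worker.start();
        }

        // Non-blocking for the caller: the central dispatcher is never stalled.
        void handle(String event) {
            queue.add(event);
        }

        // Waits until 'n' events have been handled, or the timeout elapses.
        boolean awaitHandled(int n, long timeoutMs) {
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (handled.get() < n && System.currentTimeMillis() < deadline) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    return false;
                }
            }
            return handled.get() >= n;
        }
    }

    public static void main(String[] args) {
        MiniAsyncDispatcher nodeDispatcher = new MiniAsyncDispatcher();
        for (int i = 0; i < 100; i++) {
            nodeDispatcher.handle("node-status-" + i);
        }
        System.out.println("all handled: " + nodeDispatcher.awaitHandled(100, 2000));
    }
}
```

The payoff is the same one both issues describe: bursts of node-status events drain on their own thread instead of monopolizing the shared dispatcher queue.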
[jira] [Updated] (YARN-10690) GPU related improvement for better usage.
[ https://issues.apache.org/jira/browse/YARN-10690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-10690: -- Description: This Jira will improve GPU for better usage. cc [~bibinchundatt] [~pbacsko] [~ebadger] [~ztang] [~epayne] [~gandras] [~bteke] was: This Jira will improve GPU for better usage. > GPU related improvement for better usage. > - > > Key: YARN-10690 > URL: https://issues.apache.org/jira/browse/YARN-10690 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > > This Jira will improve GPU for better usage. > cc [~bibinchundatt] [~pbacsko] [~ebadger] [~ztang] [~epayne] [~gandras] > [~bteke] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10695) Event related improvement of YARN for better usage.
[ https://issues.apache.org/jira/browse/YARN-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-10695: -- Description: This jira, marked the event related improvement in yarn for better usage. cc [~bibinchundatt] [~pbacsko] [~ebadger] [~epayne] [~gandras] [~bteke] was: This jira, marked the event related improvement in yarn for better usage. cc > Event related improvement of YARN for better usage. > --- > > Key: YARN-10695 > URL: https://issues.apache.org/jira/browse/YARN-10695 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > > This jira, marked the event related improvement in yarn for better usage. > cc [~bibinchundatt] [~pbacsko] [~ebadger] [~epayne] [~gandras] [~bteke] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10695) Event related improvement of YARN for better usage.
[ https://issues.apache.org/jira/browse/YARN-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-10695: -- Description: This jira, marked the event related improvement in yarn for better usage. cc [~bibinchundatt] [~pbacsko] [~ebadger] [~ztang] [~epayne] [~gandras] [~bteke] was: This jira, marked the event related improvement in yarn for better usage. cc [~bibinchundatt] [~pbacsko] [~ebadger] [~epayne] [~gandras] [~bteke] > Event related improvement of YARN for better usage. > --- > > Key: YARN-10695 > URL: https://issues.apache.org/jira/browse/YARN-10695 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > > This jira, marked the event related improvement in yarn for better usage. > cc [~bibinchundatt] [~pbacsko] [~ebadger] [~ztang] [~epayne] [~gandras] > [~bteke] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10695) Event related improvement of YARN for better usage.
[ https://issues.apache.org/jira/browse/YARN-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-10695: -- Description: This jira, marked the event related improvement in yarn for better usage. cc was: This jira, marked the event related improvement in yarn for better usage. > Event related improvement of YARN for better usage. > --- > > Key: YARN-10695 > URL: https://issues.apache.org/jira/browse/YARN-10695 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > > This jira, marked the event related improvement in yarn for better usage. > cc -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302371#comment-17302371 ] Qi Zhu commented on YARN-9618: -- [~bibinchundatt] [~pbacsko] [~ebadger] [~epayne] [~gandras] [~bteke] Could you help review this? I updated the patch: # Added another async event handler, similar to the scheduler. # Instead of adding events to the dispatcher, directly called the RMApp event handler. Thanks. > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch > > > The current implementation of NodeListManager events blocks the async dispatcher and can > cause RM crashes and slow down event processing. > # Cluster restart with 1K running apps: each usable event will create 1K > events, so overall there could be 5K*1K events for a 5K-node cluster. > # Event processing is blocked till new events are added to the queue. > Solution: > # Add another async event handler, similar to the scheduler. > # Instead of adding events to the dispatcher, directly call the RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-9618: - Attachment: YARN-9618.002.patch > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch > > > The current implementation of NodeListManager events blocks the async dispatcher and can > cause RM crashes and slow down event processing. > # Cluster restart with 1K running apps: each usable event will create 1K > events, so overall there could be 5K*1K events for a 5K-node cluster. > # Event processing is blocked till new events are added to the queue. > Solution: > # Add another async event handler, similar to the scheduler. > # Instead of adding events to the dispatcher, directly call the RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302359#comment-17302359 ] caozhiqiang commented on YARN-10501: [~ebadger], the branch-2.10 build also failed. Do you know where I made a mistake? > Can't remove all node labels after add node label without nodemanager port > -- > > Key: YARN-10501 > URL: https://issues.apache.org/jira/browse/YARN-10501 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3 > > Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, > YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch, > YARN-10502-branch-2.10.003.patch > > > When adding a label to nodes without a nodemanager port, or using the WILDCARD_PORT (0) > port, not all label info can be removed from these nodes. > Reproduce process: > {code:java} > 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)" > 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode" > 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}} > 4.yarn rmadmin -replaceLabelsOnNode "server001" > 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > 
{"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}} > {code} > You can see that after step 4, which removes the nodemanager labels, the label info > is still in the node info. > {code:java} > 641 case REPLACE: > 642 replaceNodeForLabels(nodeId, host.labels, labels); > 643 replaceLabelsForNode(nodeId, host.labels, labels); > 644 host.labels.clear(); > 645 host.labels.addAll(labels); > 646 for (Node node : host.nms.values()) { > 647 replaceNodeForLabels(node.nodeId, node.labels, labels); > 649 node.labels = null; > 650 } > 651 break;{code} > The cause is at line 647: when adding labels to a node without a port, both the 0 port > and the real NM port are added to the node info, and when removing labels, > the parameter node.labels at line 647 is null, so it will not remove the old > label. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
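The null-labels pitfall quoted above can be shown in miniature. This is a deliberately simplified model, not the real CommonNodeLabelsManager code: `effectiveLabels` is a hypothetical helper illustrating the fallback that the removal path needs when a node's own label set is null.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Tiny illustration of the bug: when replacing labels at the host level, a
// node whose own label set is null must fall back to the host's labels,
// otherwise the removal pass sees nothing to remove.
public class LabelReplaceDemo {

    // node labels == null means "inherit from host" in this simplified model.
    static Set<String> effectiveLabels(Set<String> nodeLabels, Set<String> hostLabels) {
        return nodeLabels != null ? nodeLabels : hostLabels;
    }

    public static void main(String[] args) {
        Set<String> hostLabels = new HashSet<>(Collections.singleton("cpunode"));
        Set<String> nodeLabels = null; // NM registered under the real port, labels unset

        // Buggy removal: passing node.labels directly sees no old labels.
        Set<String> buggyOld = nodeLabels == null ? Collections.<String>emptySet() : nodeLabels;
        // Fixed removal: resolve against the host before replacing.
        Set<String> fixedOld = effectiveLabels(nodeLabels, hostLabels);

        System.out.println("buggy sees: " + buggyOld + ", fixed sees: " + fixedOld);
    }
}
```

This matches the reproduction: the "server001:45454" entry keeps "cpunode" after step 4 because its null label set short-circuits the removal.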
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302349#comment-17302349 ] Hadoop QA commented on YARN-10501: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 20s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} yetus {color} | {color:red} 0m 15s{color} | {color:red}{color} | {color:red} Unprocessed flag(s): --spotbugs-strict-precheck {color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/799/artifact/out/Dockerfile | | JIRA Issue | YARN-10501 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13022368/YARN-10502-branch-2.10.003.patch | | Console output | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/799/console | | versions | git=2.7.4 | | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. 
> Can't remove all node labels after add node label without nodemanager port > -- > > Key: YARN-10501 > URL: https://issues.apache.org/jira/browse/YARN-10501 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3 > > Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, > YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch, > YARN-10502-branch-2.10.003.patch > > > When add a label to nodes without nodemanager port or use WILDCARD_PORT (0) > port, it can't remove all label info in these nodes > Reproduce process: > {code:java} > 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)" > 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode" > 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}} > 4.yarn rmadmin -replaceLabelsOnNode "server001" > 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > 
{"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}} > {code} > You can see after the 4 process to remove nodemanager labels, the label info > is still in the node info. > {code:java} > 641 case REPLACE: > 642 replaceNodeForLabels(nodeId, host.labels, labels); > 643 replaceLabelsForNode(nodeId, host.labels, labels); > 644 host.labels.clear(); > 645 host.labels.addAll(labels); > 646 for (Node node : host.nms.values()) { > 647 replaceNodeForLabels(node.nodeId, node.labels, labels); > 649 node.labels = null; > 650 } > 651 break;{code} > The cause is in 647 line, when add labels to node without port, the 0 port > and the real nm port with be both add to node info, and when remove labels, > the parameter node.labels in 647 line is null, so it will not remove the old > label. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated YARN-10501: --- Attachment: YARN-10502-branch-2.10.003.patch > Can't remove all node labels after add node label without nodemanager port > -- > > Key: YARN-10501 > URL: https://issues.apache.org/jira/browse/YARN-10501 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3 > > Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, > YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch, > YARN-10502-branch-2.10.003.patch > > > When add a label to nodes without nodemanager port or use WILDCARD_PORT (0) > port, it can't remove all label info in these nodes > Reproduce process: > {code:java} > 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)" > 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode" > 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}} > 4.yarn rmadmin -replaceLabelsOnNode "server001" > 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > 
{"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}} > {code} > You can see after the 4 process to remove nodemanager labels, the label info > is still in the node info. > {code:java} > 641 case REPLACE: > 642 replaceNodeForLabels(nodeId, host.labels, labels); > 643 replaceLabelsForNode(nodeId, host.labels, labels); > 644 host.labels.clear(); > 645 host.labels.addAll(labels); > 646 for (Node node : host.nms.values()) { > 647 replaceNodeForLabels(node.nodeId, node.labels, labels); > 649 node.labels = null; > 650 } > 651 break;{code} > The cause is in 647 line, when add labels to node without port, the 0 port > and the real nm port with be both add to node info, and when remove labels, > the parameter node.labels in 647 line is null, so it will not remove the old > label. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302335#comment-17302335 ] Hadoop QA commented on YARN-10501: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 27s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} yetus {color} | {color:red} 0m 14s{color} | {color:red}{color} | {color:red} Unprocessed flag(s): --spotbugs-strict-precheck {color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/798/artifact/out/Dockerfile | | JIRA Issue | YARN-10501 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13022365/YARN-10502-branch-2.10.002.patch | | Console output | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/798/console | | versions | git=2.7.4 | | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. 
> Can't remove all node labels after add node label without nodemanager port > -- > > Key: YARN-10501 > URL: https://issues.apache.org/jira/browse/YARN-10501 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3 > > Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, > YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch > > > When add a label to nodes without nodemanager port or use WILDCARD_PORT (0) > port, it can't remove all label info in these nodes > Reproduce process: > {code:java} > 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)" > 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode" > 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}} > 4.yarn rmadmin -replaceLabelsOnNode "server001" > 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > 
{"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}} > {code} > You can see after the 4 process to remove nodemanager labels, the label info > is still in the node info. > {code:java} > 641 case REPLACE: > 642 replaceNodeForLabels(nodeId, host.labels, labels); > 643 replaceLabelsForNode(nodeId, host.labels, labels); > 644 host.labels.clear(); > 645 host.labels.addAll(labels); > 646 for (Node node : host.nms.values()) { > 647 replaceNodeForLabels(node.nodeId, node.labels, labels); > 649 node.labels = null; > 650 } > 651 break;{code} > The cause is in 647 line, when add labels to node without port, the 0 port > and the real nm port with be both add to node info, and when remove labels, > the parameter node.labels in 647 line is null, so it will not remove the old > label. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated YARN-10501: --- Attachment: YARN-10502-branch-2.10.002.patch
[jira] [Updated] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated YARN-10501: --- Attachment: (was: YARN-10501-branch-2.10.1.002.patch)
[jira] [Updated] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated YARN-10501: --- Attachment: (was: YARN-10501-branch-2.10.1.001.patch)
[jira] [Updated] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-9618: - Parent Issue: YARN-10695 (was: YARN-9871)
> NodeListManager event improvement
> -
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bibin Chundatt
> Assignee: Qi Zhu
> Priority: Critical
> Attachments: YARN-9618.001.patch
>
> In the current implementation, NodeListManager events block the async dispatcher, which can crash the RM and slow down event processing.
> # On a cluster restart with 1K running apps, each node-usable event creates 1K events; overall this could be 5K*1K events for a 5K-node cluster.
> # Event processing is blocked while the new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler's.
> # Instead of adding events to the dispatcher, call the RMApp event handler directly.
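The proposed solution above (a dedicated async handler so per-app fan-out does not block the main dispatcher) can be sketched roughly as below. All types here are hypothetical simplifications, not the real AsyncDispatcher or RMApp classes: node events go into their own queue, a private worker thread delivers them, and `handle()` returns immediately.

```java
import java.util.concurrent.*;

// Sketch of a dedicated event relay: its own queue plus worker thread, so the
// caller (the main dispatcher in the real system) never blocks on fan-out.
// Hypothetical simplified types; events are modeled as plain Runnables.
class NodeEventRelay implements AutoCloseable {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread worker;
    private volatile boolean running = true;

    NodeEventRelay() {
        worker = new Thread(() -> {
            // Keep draining until asked to stop AND the queue is empty.
            while (running || !queue.isEmpty()) {
                try {
                    Runnable ev = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (ev != null) ev.run();   // deliver to the app handler directly
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        worker.start();
    }

    // Enqueue and return immediately; the worker does the delivery.
    void handle(Runnable appEvent) { queue.add(appEvent); }

    @Override public void close() {
        running = false;                        // let the worker drain and exit
        try { worker.join(); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

A caller can enqueue thousands of per-app events without ever waiting on the handler, which is the essence of the improvement.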
[jira] [Updated] (YARN-9927) RM multi-thread event processing mechanism
[ https://issues.apache.org/jira/browse/YARN-9927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-9927: - Parent: YARN-10695 Issue Type: Sub-task (was: Improvement)
> RM multi-thread event processing mechanism
> --
>
> Key: YARN-9927
> URL: https://issues.apache.org/jira/browse/YARN-9927
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn
> Affects Versions: 3.0.0, 2.9.2
> Reporter: hcarrot
> Assignee: Qi Zhu
> Priority: Major
> Attachments: RM multi-thread event processing mechanism.pdf, YARN-9927.001.patch
>
> Recently, we have observed serious event blocking in the RM event dispatcher queue. After analyzing RM event monitoring data and RM event processing logic, we found that:
> 1) environment: a cluster with thousands of nodes
> 2) RMNodeStatusEvent accounts for 90% of the time consumed by the RM event scheduler
> 3) meanwhile, RM event processing is single-threaded, which leaves the RM event scheduler little headroom and limits RM performance.
> So we propose an RM multi-thread event processing mechanism to improve RM performance.
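One common way to realize such a multi-thread mechanism, sketched under the assumption that per-key (e.g. per-application or per-node) event ordering must be preserved, is to shard events across several single-threaded executors by key. This is an illustrative design sketch, not the actual YARN-9927 patch.

```java
import java.util.concurrent.*;

// Sketch of sharded event processing: the same key always lands on the same
// single-threaded shard, so per-key ordering is preserved while different
// keys (e.g. different nodes' RMNodeStatusEvents) run in parallel.
// Hypothetical simplified types; events are modeled as plain Runnables.
class ShardedDispatcher implements AutoCloseable {
    private final ExecutorService[] shards;

    ShardedDispatcher(int n) {
        shards = new ExecutorService[n];
        for (int i = 0; i < n; i++) {
            shards[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Route by key hash; floorMod keeps the index non-negative.
    void dispatch(String key, Runnable event) {
        shards[Math.floorMod(key.hashCode(), shards.length)].execute(event);
    }

    @Override public void close() {
        for (ExecutorService s : shards) s.shutdown();
        for (ExecutorService s : shards) {
            try { s.awaitTermination(10, TimeUnit.SECONDS); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }
}
```

The design choice to shard rather than use one shared pool is what keeps the ordering guarantees a single-threaded dispatcher provides today.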
[jira] [Updated] (YARN-10695) Event related improvement of YARN for better usage.
[ https://issues.apache.org/jira/browse/YARN-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-10695: -- Description: This jira tracks event-related improvements to YARN for better usability.
> Event related improvement of YARN for better usage.
> ---
>
> Key: YARN-10695
> URL: https://issues.apache.org/jira/browse/YARN-10695
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Major
>
> This jira tracks event-related improvements to YARN for better usability.
[jira] [Updated] (YARN-8995) Log events info in AsyncDispatcher when event queue size cumulatively reaches a certain number every time.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-8995: - Parent: YARN-10695 Issue Type: Sub-task (was: Improvement)
> Log events info in AsyncDispatcher when event queue size cumulatively reaches a certain number every time.
> --
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: metrics, nodemanager, resourcemanager
> Affects Versions: 3.2.0, 3.3.0
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: TestStreamPerf.java, YARN-8995-branch-3.1.001.patch.addendum, YARN-8995.001.patch, YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, YARN-8995.014.patch, YARN-8995.015.patch, YARN-8995.016.patch, image-2019-09-04-15-20-02-914.png
>
> In our growing cluster, there are unexpected situations in which some event queues grow large enough to hurt cluster performance, such as the bug in https://issues.apache.org/jira/browse/YARN-5262. I think it is necessary to log the event types in an oversized event queue, add that information to the metrics, and make the queue-size threshold a configurable parameter.
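The threshold idea above can be sketched as follows: each time the queue size crosses another multiple of a configurable threshold, emit one summary of pending events by type. These are hypothetical simplified types; the real AsyncDispatcher change differs in detail.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of cumulative threshold logging: a report fires once per threshold
// multiple crossed (e.g. at 5000, 10000, ...), never on every enqueue.
// Hypothetical simplified types, not the real AsyncDispatcher code.
class QueueSizeLogger {
    private final int threshold;   // configurable queue-size threshold
    private long lastReported = 0; // highest threshold multiple reported so far

    QueueSizeLogger(int threshold) { this.threshold = threshold; }

    // Record one enqueued event of the given type; returns true when a
    // summary report is emitted for this queue size.
    boolean onEnqueue(int queueSize, Map<String, Integer> countsByType, String type) {
        countsByType.merge(type, 1, Integer::sum);
        if (queueSize / threshold > lastReported) {
            lastReported = queueSize / threshold;
            System.out.println("Event queue size = " + queueSize
                    + ", pending by type: " + countsByType);
            return true;
        }
        return false;
    }
}
```

Reporting only on each crossing keeps the log volume bounded even when the queue grows by millions of events.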
[jira] [Updated] (YARN-9615) Add dispatcher metrics to RM
[ https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu updated YARN-9615: - Parent: YARN-10695 Issue Type: Sub-task (was: Task)
> Add dispatcher metrics to RM
>
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Jonathan Hung
> Assignee: Qi Zhu
> Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9615.001.patch, YARN-9615.002.patch, YARN-9615.003.patch, YARN-9615.004.patch, YARN-9615.005.patch, YARN-9615.006.patch, YARN-9615.007.patch, YARN-9615.008.patch, YARN-9615.009.patch, YARN-9615.010.patch, YARN-9615.011.patch, YARN-9615.011.patch, YARN-9615.poc.patch, image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png, screenshot-1.png
>
> It would be good to have counts and processing times for each event type in the RM async dispatcher and the scheduler async dispatcher.
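A minimal sketch of per-event-type counts and processing times, as the issue proposes. These are hypothetical simplified types, not the actual patch's metrics classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch of dispatcher metrics: count and cumulative processing time keyed
// by event type, safe for concurrent updates from dispatcher threads.
class DispatcherMetrics {
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> nanos = new ConcurrentHashMap<>();

    // Run the handler for one event, recording its type and elapsed time.
    void record(String eventType, Runnable handler) {
        long start = System.nanoTime();
        handler.run();
        counts.computeIfAbsent(eventType, k -> new LongAdder()).increment();
        nanos.computeIfAbsent(eventType, k -> new LongAdder())
             .add(System.nanoTime() - start);
    }

    long count(String eventType) {
        LongAdder a = counts.get(eventType);
        return a == null ? 0 : a.sum();
    }

    long totalNanos(String eventType) {
        LongAdder a = nanos.get(eventType);
        return a == null ? 0 : a.sum();
    }
}
```

Wrapping the handler call site is enough to expose both metrics the issue asks for without touching individual event handlers.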
[jira] [Created] (YARN-10695) Event related improvement of YARN for better usage.
Qi Zhu created YARN-10695: - Summary: Event related improvement of YARN for better usage. Key: YARN-10695 URL: https://issues.apache.org/jira/browse/YARN-10695 Project: Hadoop YARN Issue Type: Improvement Reporter: Qi Zhu Assignee: Qi Zhu
[jira] [Commented] (YARN-10495) make the rpath of container-executor configurable
[ https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302262#comment-17302262 ] angerszhu commented on YARN-10495: -- Hi [~ebadger], when we build hadoop-3.3.0 we hit a glibc error:
{code:java}
writev(2, [{iov_base="/usr/share/hadoop-yarn/bin/conta"..., iov_len=45}, {iov_base=": ", iov_len=2}, {iov_base="/lib64/libc.so.6", iov_len=16}, {iov_base=": ", iov_len=2}, {iov_base="version `GLIBC_2.25' not found ("..., iov_len=75}, {iov_base="\n", iov_len=1}], 6/usr/share/hadoop-yarn/bin/container-executor: /lib64/libc.so.6: version `GLIBC_2.25' not found (required by /lib64/x86_64/libcrypto.so.1.1)
{code}
Can we also bundle glibc into the native path, like `bundle.openssl`, to solve this problem?
> make the rpath of container-executor configurable
> -
>
> Key: YARN-10495
> URL: https://issues.apache.org/jira/browse/YARN-10495
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Reporter: angerszhu
> Assignee: angerszhu
> Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10495.001.patch, YARN-10495.002.patch
>
> In https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on crypto to container-executor. We hit a case where our jenkins machine has libcrypto.so.1.0.0 in its shared lib environment, but our nodemanager machines do not have libcrypto.so.1.0.0, only *libcrypto.so.1.1*.
> We use an internal custom dynamic link library path, /usr/lib/x86_64-linux-gnu, and we build hadoop with the parameters below:
> {code:java}
> -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu
> {code}
> On the jenkins machine, the shared lib path /usr/lib/x86_64-linux-gnu (where libcrypto lives) contains:
> {code:java}
> -rw-r--r-- 1 root root  240136 Nov 28 2014 libcroco-0.6.so.3.0.1
> -rw-r--r-- 1 root root   54550 Jun 18 2017 libcrypt.a
> -rw-r--r-- 1 root root 4306444 Sep 26 2019 libcrypto.a
> lrwxrwxrwx 1 root root      18 Sep 26 2019 libcrypto.so -> libcrypto.so.1.0.0
> -rw-r--r-- 1 root root 2070976 Sep 26 2019 libcrypto.so.1.0.0
> lrwxrwxrwx 1 root root      35 Jun 18 2017 libcrypt.so -> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r-- 1 root root     298 Jun 18 2017 libc.so
> {code}
> On the nodemanager machine, the same shared lib path contains:
> {code:java}
> -rw-r--r-- 1 root root   55852 Feb  7 2019 libcrypt.a
> -rw-r--r-- 1 root root 4864244 Sep 28 2019 libcrypto.a
> lrwxrwxrwx 1 root root      16 Sep 28 2019 libcrypto.so -> libcrypto.so.1.1
> -rw-r--r-- 1 root root 2504576 Dec 24 2019 libcrypto.so.1.0.2
> -rw-r--r-- 1 root root 2715840 Sep 28 2019 libcrypto.so.1.1
> lrwxrwxrwx 1 root root      35 Feb  7 2019 libcrypt.so -> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r-- 1 root root     298 Feb  7 2019 libc.so
> {code}
> The libcrypto.so versions do not match, which causes an error when we start the nodemanager:
> {code:java}
> ..
> 3 more Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306) ... 4 more Caused by: ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at org.apache.hadoop.util.Shell.run(Shell.java:901) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154) ... 6 more
> {code}
> We should make the RPATH of container-executor configurable to solve this problem.