[jira] [Comment Edited] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation
[ https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253943#comment-17253943 ] zhuqi edited comment on YARN-10506 at 12/23/20, 7:53 AM: - [~gandras] Thanks for your reply. [~wangda] [~gandras] 1. What I said is related to policy management; we should fix it later, and maybe we need a new sub-task. 2. I think we should disallow it by throwing an exception, but we could optionally add an admin API to destroy all its underlying children; that would be more reasonable. 3. Also, question 2 is related to auto deletion; we should discuss that as well. Thanks. was (Author: zhuqi): [~gandras] Thanks for your reply. 1. What I said is related to policy management; we should fix it later, and maybe we need a new sub-task. 2. I think we should disallow it by throwing an exception, but we could optionally add an admin API to destroy all its underlying children; that would be more reasonable. 3. Also, question 2 is related to auto deletion; we should discuss that as well. Thanks. > Update queue creation logic to use weight mode and allow the flexible > static/dynamic creation > - > > Key: YARN-10506 > URL: https://issues.apache.org/jira/browse/YARN-10506 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10506.001.patch, YARN-10506.002.patch > > > The queue creation logic should be updated to use weight mode and support the > flexible creation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation
[ https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253943#comment-17253943 ] zhuqi edited comment on YARN-10506 at 12/23/20, 7:52 AM: - [~gandras] Thanks for your reply. 1. What I said is related to policy management; we should fix it later, and maybe we need a new sub-task. 2. I think we should disallow it by throwing an exception, but we could optionally add an admin API to destroy all its underlying children; that would be more reasonable. 3. Also, question 2 is related to auto deletion; we should discuss that as well. Thanks. was (Author: zhuqi): [~gandras] Thanks for your reply. 1. What I said is related to policy management; we should fix it later, and maybe we need a new sub-task. 2. I think we should disallow it by throwing an exception, but we could optionally add an admin API to destroy all its underlying children; that would be more reasonable. Thanks. > Update queue creation logic to use weight mode and allow the flexible > static/dynamic creation > - > > Key: YARN-10506 > URL: https://issues.apache.org/jira/browse/YARN-10506 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10506.001.patch, YARN-10506.002.patch > > > The queue creation logic should be updated to use weight mode and support the > flexible creation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation
[ https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253943#comment-17253943 ] zhuqi commented on YARN-10506: -- [~gandras] Thanks for your reply. 1. What I said is related to policy management; we should fix it later, and maybe we need a new sub-task. 2. I think we should disallow it by throwing an exception, but we could optionally add an admin API to destroy all its underlying children; that would be more reasonable. Thanks. > Update queue creation logic to use weight mode and allow the flexible > static/dynamic creation > - > > Key: YARN-10506 > URL: https://issues.apache.org/jira/browse/YARN-10506 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10506.001.patch, YARN-10506.002.patch > > > The queue creation logic should be updated to use weight mode and support the > flexible creation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8529) Add timeout to RouterWebServiceUtil#invokeRMWebService
[ https://issues.apache.org/jira/browse/YARN-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minni Mittal updated YARN-8529: --- Attachment: YARN-8529.v11.patch > Add timeout to RouterWebServiceUtil#invokeRMWebService > -- > > Key: YARN-8529 > URL: https://issues.apache.org/jira/browse/YARN-8529 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Íñigo Goiri >Assignee: Minni Mittal >Priority: Major > Attachments: YARN-8529.v1.patch, YARN-8529.v10.patch, > YARN-8529.v11.patch, YARN-8529.v2.patch, YARN-8529.v3.patch, > YARN-8529.v4.patch, YARN-8529.v5.patch, YARN-8529.v6.patch, > YARN-8529.v7.patch, YARN-8529.v8.patch, YARN-8529.v9.patch > > > {{RouterWebServiceUtil#invokeRMWebService}} currently has a fixed timeout. > This should be configurable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
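The Router-side change tracked in this issue is mostly configuration plumbing: read connect/read timeouts from the YARN configuration and apply them to the outgoing HTTP call instead of hard-coding them. Below is a minimal sketch of that idea only — the property names are hypothetical and a plain HttpURLConnection stands in for the Jersey client that RouterWebServiceUtil actually uses; the real keys and wiring are defined by the attached patches.

{code:java}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;

public final class ConfigurableTimeoutSketch {

  // Hypothetical property names, used only for illustration.
  private static final String CONNECT_TIMEOUT_KEY =
      "yarn.router.webapp.connect-timeout.ms";
  private static final String READ_TIMEOUT_KEY =
      "yarn.router.webapp.read-timeout.ms";

  private ConfigurableTimeoutSketch() {
  }

  public static HttpURLConnection open(Configuration conf, String url)
      throws IOException {
    // Fall back to a default when the properties are not set.
    int connectTimeout = conf.getInt(CONNECT_TIMEOUT_KEY, 30000);
    int readTimeout = conf.getInt(READ_TIMEOUT_KEY, 30000);

    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setConnectTimeout(connectTimeout); // time allowed to establish the connection
    conn.setReadTimeout(readTimeout);       // time allowed to wait for response data
    return conn;
  }
}
{code}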
[jira] [Updated] (YARN-8557) Exclude lagged/unhealthy/decommissioned nodes in async allocating thread
[ https://issues.apache.org/jira/browse/YARN-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-8557: Attachment: (was: YARN-8557.001.patch) > Exclude lagged/unhealthy/decommissioned nodes in async allocating thread > > > Key: YARN-8557 > URL: https://issues.apache.org/jira/browse/YARN-8557 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.4.0 >Reporter: Weiwei Yang >Assignee: zhuqi >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently only HB-lagged is handled, with hard-coded 2 times of HB lag which > we should make it configurable. And more over, we need to exclude unhealthy > and decommissioned nodes too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8557) Exclude lagged/unhealthy/decommissioned nodes in async allocating thread
[ https://issues.apache.org/jira/browse/YARN-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-8557: - Labels: pull-request-available (was: ) > Exclude lagged/unhealthy/decommissioned nodes in async allocating thread > > > Key: YARN-8557 > URL: https://issues.apache.org/jira/browse/YARN-8557 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.4.0 >Reporter: Weiwei Yang >Assignee: zhuqi >Priority: Major > Labels: pull-request-available > Attachments: YARN-8557.001.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Currently only HB-lagged is handled, with hard-coded 2 times of HB lag which > we should make it configurable. And more over, we need to exclude unhealthy > and decommissioned nodes too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
[ https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253904#comment-17253904 ] Xiaoqiao He commented on YARN-10540: I am not sure if it is expected. if yes, I would like to backport to 3.2.2 and prepare another RC. [~ebadger],[~Jim_Brennan] would you like to give another check? Thanks. > Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes > > > Key: YARN-10540 > URL: https://issues.apache.org/jira/browse/YARN-10540 > Project: Hadoop YARN > Issue Type: Task > Components: webapp >Affects Versions: 3.2.2 >Reporter: Sunil G >Assignee: Jim Brennan >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3 > > Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 > PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, > Yarn-UI-Ubuntu.png, osx-yarn-ui2.png, yarnodes.png, yarnui2onubuntu.png > > > YARN-10450 added changes in NodeInfo class. > Various exceptions are showing while accessing UI2 and UI1 NODE pages. > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.(NodeInfo.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:243) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at > org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70) > {code} > {code:java} > 2020-12-19 22:55:54,846 WARN > org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.(NodeInfo.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
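The NPE above is raised because the NodeInfo DAO dereferences a value that YARN-10450 started reading but that a node may not have reported yet. A minimal sketch of the defensive pattern only, assuming the missing value is a utilization report; the actual field and fix are in the attached YARN-10540.001.patch.

{code:java}
import org.apache.hadoop.yarn.api.records.ResourceUtilization;

public final class UtilizationDefaults {

  private UtilizationDefaults() {
  }

  // Return a usable utilization object even when the node has not reported
  // one yet, instead of letting the DAO constructor dereference null.
  public static ResourceUtilization orEmpty(ResourceUtilization reported) {
    if (reported != null) {
      return reported;
    }
    // 0 MB pmem, 0 MB vmem, 0.0 vcores when nothing has been reported.
    return ResourceUtilization.newInstance(0, 0, 0f);
  }
}
{code}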
[jira] [Comment Edited] (YARN-8557) Exclude lagged/unhealthy/decommissioned nodes in async allocating thread
[ https://issues.apache.org/jira/browse/YARN-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239526#comment-17239526 ] zhuqi edited comment on YARN-8557 at 12/23/20, 5:50 AM: [~cheersyang] [~wangda] [~tangzhankun] [~sunilg] [~BilwaST] [~Tao Yang] I have added the patch; could you review it: 1. Add a configurable HB lag. 2. Support excluding nodes that are not running. Thanks. was (Author: zhuqi): [~cheersyang] [~wangda] [~tangzhankun] [~sunilg] [~BilwaST] [~Tao Yang] I have added the patch; could you review it: 1. Add a configurable HB lag. 2. Support excluding unhealthy, decommissioned, and decommissioning nodes. Thanks. > Exclude lagged/unhealthy/decommissioned nodes in async allocating thread > > > Key: YARN-8557 > URL: https://issues.apache.org/jira/browse/YARN-8557 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.4.0 >Reporter: Weiwei Yang >Assignee: zhuqi >Priority: Major > Attachments: YARN-8557.001.patch > > > Currently only HB-lagged is handled, with hard-coded 2 times of HB lag which > we should make it configurable. And more over, we need to exclude unhealthy > and decommissioned nodes too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
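The two items in the comment above boil down to one check in the async allocation loop: skip a node when it is not in the RUNNING state, or when its last heartbeat is older than a configurable number of heartbeat intervals (previously hard-coded to 2). A minimal sketch of that check with hypothetical parameters, not the actual CapacityScheduler internals from the patch.

{code:java}
import org.apache.hadoop.yarn.api.records.NodeState;

public final class AsyncAllocationNodeFilter {

  private AsyncAllocationNodeFilter() {
  }

  /**
   * @param state               state currently reported for the node
   * @param lastHeartbeatMs     timestamp of the node's last heartbeat
   * @param nowMs               current time
   * @param heartbeatIntervalMs configured NM heartbeat interval
   * @param maxMissedIntervals  configurable lag factor (was hard-coded to 2)
   */
  public static boolean shouldSkip(NodeState state, long lastHeartbeatMs,
      long nowMs, long heartbeatIntervalMs, long maxMissedIntervals) {
    // Skip anything that is not actively running: unhealthy, decommissioning,
    // decommissioned, lost, shutdown, ...
    if (state != NodeState.RUNNING) {
      return true;
    }
    // Skip nodes whose heartbeat lags behind by more than the configured
    // number of intervals.
    return nowMs - lastHeartbeatMs > maxMissedIntervals * heartbeatIntervalMs;
  }
}
{code}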
[jira] [Commented] (YARN-10427) Duplicate Job IDs in SLS output
[ https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253892#comment-17253892 ] Drew Merrill commented on YARN-10427: - Hi [~snemeth], wow, your response is amazing! I need to set aside a good chunk of time to digest it in its entirety and actually work through the debugging procedure you went through, step-by-step. But I just want to express my sincere gratitude for putting the time and energy into crafting such a detailed and instructive follow-up that both confirmed my findings, while also showing in great detail and clarity the steps you took to identify the source of the problem along with possible solutions. Thank you! > Duplicate Job IDs in SLS output > --- > > Key: YARN-10427 > URL: https://issues.apache.org/jira/browse/YARN-10427 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0 > Environment: I ran the attached inputs on my MacBook Pro, using > Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also > tested against 3.2.1 and 3.3.0 release branches. > >Reporter: Drew Merrill >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, > YARN-10427.001.patch, YARN-10427.002.patch, YARN-10427.003.patch, > fair-scheduler.xml, inputsls.json, jobruntime.csv, jobruntime.csv, > mapred-site.xml, sls-runner.xml, yarn-site.xml > > > Hello, I'm hoping someone can help me resolve or understand some issues I've > been having with the YARN Scheduler Load Simulator (SLS). I've been > experimenting with SLS for several months now at work as we're trying to > build a simulation model to characterize our enterprise Hadoop infrastructure > for purposes of future capacity planning. In the process of attempting to > verify and validate the SLS output, I've encountered a number of issues > including runtime exceptions and bad output. The focus of this issue is the > bad output. In all my simulation runs, the jobruntime.csv output seems to > have one or more of the following problems: no output, duplicate job ids, > and/or missing job ids. > > Because of where I work, I'm unable to provide the exact inputs I typically > use, but I'm able to reproduce the problem of the duplicate Job IDS using > some simplified inputs and configuration files, which I've attached, along > with the output I obtained. > > The command I used to run the simulation: > {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json > --output-dir=sls-run-1 --print-simulation > --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}} > > Can anyone help me understand what would cause the duplicate Job IDs in the > output? Is this a bug in Hadoop or a problem with my inputs? Thanks in > advance. > > PS: This is my first issue I've ever opened so please be kind if I've missed > something or am not understanding something obvious about the way Hadoop > works. I'll gladly follow-up with more info as requested. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
[ https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253889#comment-17253889 ] Xiaoqiao He commented on YARN-10540: Thanks [~Jim_Brennan],[~ebadger],[~sunilg] for your works. It seems works well mostly but one link could not jump correctly using ui2 on local mac and ubuntu now. reference the following chart. Thanks. !osx-yarn-ui2.png! > Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes > > > Key: YARN-10540 > URL: https://issues.apache.org/jira/browse/YARN-10540 > Project: Hadoop YARN > Issue Type: Task > Components: webapp >Affects Versions: 3.2.2 >Reporter: Sunil G >Assignee: Jim Brennan >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3 > > Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 > PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, > Yarn-UI-Ubuntu.png, osx-yarn-ui2.png, yarnodes.png, yarnui2onubuntu.png > > > YARN-10450 added changes in NodeInfo class. > Various exceptions are showing while accessing UI2 and UI1 NODE pages. > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.(NodeInfo.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:243) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at > org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70) > {code} > {code:java} > 2020-12-19 22:55:54,846 WARN > org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.(NodeInfo.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
[ https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoqiao He updated YARN-10540: --- Attachment: osx-yarn-ui2.png > Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes > > > Key: YARN-10540 > URL: https://issues.apache.org/jira/browse/YARN-10540 > Project: Hadoop YARN > Issue Type: Task > Components: webapp >Affects Versions: 3.2.2 >Reporter: Sunil G >Assignee: Jim Brennan >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3 > > Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 > PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, > Yarn-UI-Ubuntu.png, osx-yarn-ui2.png, yarnodes.png, yarnui2onubuntu.png > > > YARN-10450 added changes in NodeInfo class. > Various exceptions are showing while accessing UI2 and UI1 NODE pages. > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.(NodeInfo.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:243) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at > org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70) > {code} > {code:java} > 2020-12-19 22:55:54,846 WARN > org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.(NodeInfo.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253792#comment-17253792 ] Eric Badger commented on YARN-10501: I'm looking at this patch and I have some questions about other pieces of code that are in the same section. I'll admit, this code is a little bit confusing to me because we have Hosts->Labels maps as well as Labels->Nodes maps and then on top of that, each Host can have multiple Nodes. Before I can comment on your patch, I think I need to clear up some things that are going on in this area of code that are confusing to me. Below is what I _think_ is happening. Feel free to correct me where I'm wrong. (Assuming no port and/or wildcard port for this) 1. When we add a node label, we invoke this piece of code. {noformat} case ADD: addNodeToLabels(nodeId, labels); host.labels.addAll(labels); for (Node node : host.nms.values()) { if (node.labels != null) { node.labels.addAll(labels); } addNodeToLabels(node.nodeId, labels); } break; {noformat} 1a. This code adds the NodeId (without a port/with a wildcard port) to the Labels->Nodes map via addNodeToLabels. *Why do we do this? There is no port associated with this node. In 1d we add the nodes to the map with their associated port, so I don't understand why we're adding the node here when it doesn't have a port.* 1b. It adds all of the labels to the Host. This part doesn't make sense to me. *If we are giving Hosts the granularity to have multiple labels per host (due to multiple NMs), then why does the Host itself have labels?* 1c. We add all the labels to each Node in the host, but _only_ if they already have labels. *Why do we only add the labels if they already have labels? Don't we want to add the labels regardless? Should it be possible for us to be in the ADD method while node.labels == null? Maybe this should throw an exception* 1d. We add the Nodes (with their associated NM port) to the Labels->nodes map via addNodeToLabels. 2. When we replace the node label we invoke this piece of code {noformat} case REPLACE: replaceNodeForLabels(nodeId, host.labels, labels); host.labels.clear(); host.labels.addAll(labels); for (Node node : host.nms.values()) { replaceNodeForLabels(node.nodeId, node.labels, labels); node.labels = null; } {noformat} 2a. We remove the Node (without port or with wildcard port) from the specific label in the Labels->Nodes map via removeNoveFromLabels(). *Why do we have the node without a port in the first place?* This comes from 1a. 2b. We add the Node (without port or with wildcard port) to the new specific label in the Labels->Nodes map via addNodeToLabels(). *Why do we add the node without a port?* 2c. We clear the labels associated with the Host. *Why are there labels associated with a Host when each Host is actually a collection of Nodes?* 2d. We add the new labels to the Host. Same question as 2c. 2e. We iterate through the list of Nodes associated with each Host and perform 2a and 2b, except with Nodes that have their associated ports. 2f. We set the Labels to Null for each Node associated with the Host. I don't understand the purpose of this. I must be missing something here. Overall I have 2 main issues with the code that I need cleared up because I either don't understand the code or think things are broken/unnecessary. 
1) We have labels associated with Hosts (which are collections of Hosts) _and_ labels associated with just Nodes 2) We add Nodes that have no associated port or have the wildcard port _and_ add those same nodes with their associated ports. Probably need [~leftnoteasy], [~varunsaxena], or [~sunilg] to comment on this since they were involved with YARN-3075 that added much of this code > Can't remove all node labels after add node label without nodemanager port > -- > > Key: YARN-10501 > URL: https://issues.apache.org/jira/browse/YARN-10501 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Critical > Attachments: YARN-10501.002.patch, YARN-10501.003.patch > > > When add a label to nodes without nodemanager port or use WILDCARD_PORT (0) > port, it can't remove all label info in these nodes > Reproduce process: > {code:java} > 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)" > 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode" > 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclu
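One detail behind questions 1a/2b above: a NodeId built with the wildcard port 0 and a NodeId built with the NM's real port are two distinct map keys, which is why the Labels->Nodes map ends up holding both a host-level entry and a per-NM entry for the same host. A small standalone illustration (not taken from the patch); the port 8041 is just an example value.

{code:java}
import org.apache.hadoop.yarn.api.records.NodeId;

public final class WildcardPortExample {

  private WildcardPortExample() {
  }

  public static void main(String[] args) {
    NodeId hostLevel = NodeId.newInstance("server001", 0);    // wildcard port: host-level entry
    NodeId perNm = NodeId.newInstance("server001", 8041);     // real NM port: per-NM entry
    System.out.println(hostLevel.equals(perNm));              // false -> two separate map keys
  }
}
{code}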
[jira] [Updated] (YARN-10040) DistributedShell test failure on X86 and ARM
[ https://issues.apache.org/jira/browse/YARN-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated YARN-10040: - Priority: Major (was: Blocker) > DistributedShell test failure on X86 and ARM > > > Key: YARN-10040 > URL: https://issues.apache.org/jira/browse/YARN-10040 > Project: Hadoop YARN > Issue Type: Bug > Components: applications/distributed-shell > Environment: X86/ARM > OS: ubuntu1804 > Java 8 >Reporter: zhao bo >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-10040.001.patch > > > * > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers > * > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithEnforceExecutionType > Please see the Apache Jenkins Test result: > [https://builds.apache.org/job/hadoop-multibranch/job/PR-1767/1/testReport/] > > These 2 tests are failed on both X86 and ARM platform. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10040) DistributedShell test failure on X86 and ARM
[ https://issues.apache.org/jira/browse/YARN-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253737#comment-17253737 ] Ahmed Hussein edited comment on YARN-10040 at 12/22/20, 8:48 PM: - [~abmodi] can you suggest anyone familiar with the changes done in YARN-9697? was (Author: ahussein): I changed the status of this Jira to blocker. [~abmodi] can you suggest anyone familiar with the changes done in YARN-9697? > DistributedShell test failure on X86 and ARM > > > Key: YARN-10040 > URL: https://issues.apache.org/jira/browse/YARN-10040 > Project: Hadoop YARN > Issue Type: Bug > Components: applications/distributed-shell > Environment: X86/ARM > OS: ubuntu1804 > Java 8 >Reporter: zhao bo >Assignee: Abhishek Modi >Priority: Major > Attachments: YARN-10040.001.patch > > > * > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers > * > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithEnforceExecutionType > Please see the Apache Jenkins Test result: > [https://builds.apache.org/job/hadoop-multibranch/job/PR-1767/1/testReport/] > > These 2 tests are failed on both X86 and ARM platform. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10040) DistributedShell test failure on X86 and ARM
[ https://issues.apache.org/jira/browse/YARN-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253737#comment-17253737 ] Ahmed Hussein commented on YARN-10040: -- I changed the status of this Jira to blocker. [~abmodi] can you suggest anyone familiar with the changes done in YARN-9697? > DistributedShell test failure on X86 and ARM > > > Key: YARN-10040 > URL: https://issues.apache.org/jira/browse/YARN-10040 > Project: Hadoop YARN > Issue Type: Bug > Components: applications/distributed-shell > Environment: X86/ARM > OS: ubuntu1804 > Java 8 >Reporter: zhao bo >Assignee: Abhishek Modi >Priority: Blocker > Attachments: YARN-10040.001.patch > > > * > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers > * > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithEnforceExecutionType > Please see the Apache Jenkins Test result: > [https://builds.apache.org/job/hadoop-multibranch/job/PR-1767/1/testReport/] > > These 2 tests are failed on both X86 and ARM platform. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10040) DistributedShell test failure on X86 and ARM
[ https://issues.apache.org/jira/browse/YARN-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Hussein updated YARN-10040: - Priority: Blocker (was: Major) > DistributedShell test failure on X86 and ARM > > > Key: YARN-10040 > URL: https://issues.apache.org/jira/browse/YARN-10040 > Project: Hadoop YARN > Issue Type: Bug > Components: applications/distributed-shell > Environment: X86/ARM > OS: ubuntu1804 > Java 8 >Reporter: zhao bo >Assignee: Abhishek Modi >Priority: Blocker > Attachments: YARN-10040.001.patch > > > * > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers > * > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithEnforceExecutionType > Please see the Apache Jenkins Test result: > [https://builds.apache.org/job/hadoop-multibranch/job/PR-1767/1/testReport/] > > These 2 tests are failed on both X86 and ARM platform. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10427) Duplicate Job IDs in SLS output
[ https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253693#comment-17253693 ] Hadoop QA commented on YARN-10427: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 44s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 52s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 24s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 59s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 0m 49s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 46s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 26s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 22s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 16s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/411/artifact/out/diff-checkstyle-hadoop-tools_hadoop-sls.txt{color} | {color:orange} hadoop-tools/hadoop-sls: The patch generated 1 new + 22 unchanged - 1 fixed = 23 total (was 23) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 22s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 40s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {col
[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation
[ https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253687#comment-17253687 ] Hadoop QA commented on YARN-10506: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 9s{color} | {color:red}{color} | {color:red} YARN-10506 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-10506 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13017530/YARN-10506.002.patch | | Console output | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/412/console | | versions | git=2.17.1 | | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. > Update queue creation logic to use weight mode and allow the flexible > static/dynamic creation > - > > Key: YARN-10506 > URL: https://issues.apache.org/jira/browse/YARN-10506 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10506.001.patch, YARN-10506.002.patch > > > The queue creation logic should be updated to use weight mode and support the > flexible creation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation
[ https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Gyori updated YARN-10506: Attachment: YARN-10506.002.patch > Update queue creation logic to use weight mode and allow the flexible > static/dynamic creation > - > > Key: YARN-10506 > URL: https://issues.apache.org/jira/browse/YARN-10506 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10506.001.patch, YARN-10506.002.patch > > > The queue creation logic should be updated to use weight mode and support the > flexible creation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation
[ https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253682#comment-17253682 ] Andras Gyori commented on YARN-10506: - I have taken your advice into account; however, I need more details from you about the problem at hand. You mentioned ManagedParent, but I need to emphasise that we have taken an approach that does not involve ManagedParentQueue or AutoCreatedLeafQueue at all. We are reusing the original ParentQueue and LeafQueue classes. I have fixed the failing tests, but I have also identified a possible issue which needs to be addressed. If a dynamic queue is a ParentQueue and it has child queues as well, and we make it a static queue, it will be converted to a LeafQueue, basically destroying all its underlying children. Either we disallow this by throwing an exception, or we convert it successfully. What is your opinion on this? > Update queue creation logic to use weight mode and allow the flexible > static/dynamic creation > - > > Key: YARN-10506 > URL: https://issues.apache.org/jira/browse/YARN-10506 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10506.001.patch, YARN-10506.002.patch > > > The queue creation logic should be updated to use weight mode and support the > flexible creation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
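For the first option raised above, the validation itself is simple; the sketch below shows only the "disallow by throwing an exception" idea under assumed names — where the check would actually run (queue reinitialization) and how the hierarchy is inspected are not taken from the patch.

{code:java}
public final class DynamicQueueConversionCheck {

  private DynamicQueueConversionCheck() {
  }

  public static void validate(String queuePath, boolean wasDynamicParent,
      int existingChildQueues, boolean nowConfiguredAsLeaf) {
    if (wasDynamicParent && nowConfiguredAsLeaf && existingChildQueues > 0) {
      // Refuse to silently destroy the dynamically created children.
      throw new IllegalStateException("Cannot convert dynamic parent queue "
          + queuePath + " into a static leaf queue while it still has "
          + existingChildQueues + " child queue(s)");
    }
  }
}
{code}

The alternative (converting successfully) would instead require removing the children first, which is where the auto-deletion discussion in the comments becomes relevant.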
[jira] [Commented] (YARN-10516) In HA mode, when one Resource Manager has networking issue, getTokenService() should not throw runtime exception
[ https://issues.apache.org/jira/browse/YARN-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253675#comment-17253675 ] Xu Cang commented on YARN-10516: [~epayne] [~hexiaoqiao] [~Jim_Brennan] Hi, I would love to get some review on this, thank you > In HA mode, when one Resource Manager has networking issue, getTokenService() > should not throw runtime exception > > > Key: YARN-10516 > URL: https://issues.apache.org/jira/browse/YARN-10516 > Project: Hadoop YARN > Issue Type: Improvement > Components: client >Reporter: Xu Cang >Priority: Minor > Attachments: YARN-10516.001.patch, YARN-10516.002.patch, > YARN-10516.003.patch, YARN-10516.004.patch, YARN-10516.007.patch > > > We have observed one issue from the YARN client around this piece of code: > [https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ClientRMProxy.java#L145] > > While > {code:java} > services.add(SecurityUtil.buildTokenService( yarnConf.getSocketAddr(address, > defaultAddr, defaultPort)) .toString()); > > {code} > is being called, buildTokenService() fails and throws a runtime > exception, more specifically an UnknownHostException, from here: > [https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/SecurityUtil.java#L466] > *while one of the RM hosts was having a networking issue* and its IP could not be > resolved. > This runtime exception then floats all the way up to our application and > causes the MR job submission to fail. > In my opinion, since we have HA here, multiple RMs are still alive and > available. We should catch this exception in getTokenService() and handle it > properly, instead of failing the whole action. > > > I would like to hear your opinion on this; if agreed, I will provide a patch for > it. Thank you. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
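The proposal in the description is to make the token-service loop tolerant of a single unresolvable RM host in HA mode: skip that RM, keep the services built for the reachable ones, and fail only if none resolve. A minimal sketch of that behaviour, using a stand-in resolver interface instead of the real SecurityUtil.buildTokenService() call.

{code:java}
import java.util.ArrayList;
import java.util.List;

public final class TolerantTokenServiceBuilder {

  private TolerantTokenServiceBuilder() {
  }

  /** Stand-in for SecurityUtil.buildTokenService(); assumed for illustration. */
  public interface Resolver {
    String resolveTokenService(String rmAddress);
  }

  public static List<String> buildServices(List<String> rmAddresses,
      Resolver resolver) {
    List<String> services = new ArrayList<>();
    for (String address : rmAddresses) {
      try {
        services.add(resolver.resolveTokenService(address));
      } catch (RuntimeException e) {
        // One RM host could not be resolved (e.g. an UnknownHostException
        // wrapped by SecurityUtil); with HA the remaining RMs are still usable.
      }
    }
    if (services.isEmpty()) {
      // If no RM address resolves at all, failing loudly is still appropriate.
      throw new IllegalStateException("None of the RM addresses could be resolved");
    }
    return services;
  }
}
{code}

Whether an empty result should still fail loudly (as sketched) or be handled differently is the main design question a real patch would have to settle.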
[jira] [Created] (YARN-10545) Improve the readability of diagnostics log in yarn-ui2 web page.
akiyamaneko created YARN-10545: -- Summary: Improve the readability of diagnostics log in yarn-ui2 web page. Key: YARN-10545 URL: https://issues.apache.org/jira/browse/YARN-10545 Project: Hadoop YARN Issue Type: Improvement Components: yarn-ui-v2 Reporter: akiyamaneko Attachments: Diagnostics shows unreadble.png If the diagnostics log in yarn-ui2 has multiple lines, the line breaks and spaces are not displayed, which makes it hard to read. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10427) Duplicate Job IDs in SLS output
[ https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253654#comment-17253654 ] Hadoop QA commented on YARN-10427: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 35s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 5s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 23s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 9s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 0m 48s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 45s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 26s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 21s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 15s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/410/artifact/out/diff-checkstyle-hadoop-tools_hadoop-sls.txt{color} | {color:orange} hadoop-tools/hadoop-sls: The patch generated 1 new + 22 unchanged - 1 fixed = 23 total (was 23) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 23s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 47s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {col
[jira] [Updated] (YARN-10427) Duplicate Job IDs in SLS output
[ https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10427: -- Attachment: YARN-10427.003.patch > Duplicate Job IDs in SLS output > --- > > Key: YARN-10427 > URL: https://issues.apache.org/jira/browse/YARN-10427 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0 > Environment: I ran the attached inputs on my MacBook Pro, using > Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also > tested against 3.2.1 and 3.3.0 release branches. > >Reporter: Drew Merrill >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, > YARN-10427.001.patch, YARN-10427.002.patch, YARN-10427.003.patch, > fair-scheduler.xml, inputsls.json, jobruntime.csv, jobruntime.csv, > mapred-site.xml, sls-runner.xml, yarn-site.xml > > > Hello, I'm hoping someone can help me resolve or understand some issues I've > been having with the YARN Scheduler Load Simulator (SLS). I've been > experimenting with SLS for several months now at work as we're trying to > build a simulation model to characterize our enterprise Hadoop infrastructure > for purposes of future capacity planning. In the process of attempting to > verify and validate the SLS output, I've encountered a number of issues > including runtime exceptions and bad output. The focus of this issue is the > bad output. In all my simulation runs, the jobruntime.csv output seems to > have one or more of the following problems: no output, duplicate job ids, > and/or missing job ids. > > Because of where I work, I'm unable to provide the exact inputs I typically > use, but I'm able to reproduce the problem of the duplicate Job IDS using > some simplified inputs and configuration files, which I've attached, along > with the output I obtained. > > The command I used to run the simulation: > {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json > --output-dir=sls-run-1 --print-simulation > --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}} > > Can anyone help me understand what would cause the duplicate Job IDs in the > output? Is this a bug in Hadoop or a problem with my inputs? Thanks in > advance. > > PS: This is my first issue I've ever opened so please be kind if I've missed > something or am not understanding something obvious about the way Hadoop > works. I'll gladly follow-up with more info as requested. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10427) Duplicate Job IDs in SLS output
[ https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253589#comment-17253589 ] Szilard Nemeth edited comment on YARN-10427 at 12/22/20, 4:29 PM: -- Hi [~werd.up], Thanks for reporting this issue and congratulations for the first reported Hadoop YARN jira. {quote}In the process of attempting to verify and validate the SLS output, I've encountered a number of issues including runtime exceptions and bad output. {quote} I read through your observations and spent some time to play around with SLS. If you encountered other issues, please report other jiras if you have some time. As the process of running SLS involved some repetitive tasks like uploading configs to the remote machine, launch SLS, save the resulted logs..., I created some scripts into my public [Github repo here|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427] Let me summarize what the scripts are doing: 1. [config dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/config]: This is the exact same configuration file set that you attached to this jira, with one exception of the log4j.properties file, that turns on DEBUG logging for SLS. 2. [upstream-patches dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/upstream-patches]: This is the directory of the logging patch that helped me see the issues more clearly. My code changes are also pushed to my [Hadoop fork|https://github.com/szilard-nemeth/hadoop/tree/YARN-10427-investigation] 3. [scripts dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts]: This is the directory that contains all my scripts to build Hadoop + launch SLS and save produced logs to the local machine. As I have been working on a remote cluster, there's a script called [setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/setup-vars-upstream.sh] that contains some configuration values for the remote cluster + some local directories. If you want to use the scripts, all you need to do is to replace the configs in this file according to your environment. 3.1 [build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]: This is the script that builds Hadoop according to the environment variables and launches the SLS suite on the remote cluster. 3.2 [start-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/start-sls.sh]: This is the most important script as this will be executed on the remote machine. 
I think the script itself is straightforward enough, but let me briefly list what it does: - This script assumes that the Hadoop dist package is copied to the remote machine (this was done by [build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]) - Cleans up all Hadoop-related directories and extracts the Hadoop dist tar.gz - Copies the config to Hadoop's config dirs so SLS will use these particular configs - Launches SLS by starting slsrun.sh with the appropriate CLI swithces - Greps for some useful data in the resulted SLS log file. 3.3 [launch-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/launch-sls.sh]: This script is executed by [build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh] as its last step. Once the start-sls.sh is finished, the [save-latest-sls-logs.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/save-latest-sls-logs.sh] script is started. As the name implies it saves the latest SLS log dir and SCPs it to the local machine. The target directory of the local machine is determined by the config ([setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scrip
[jira] [Commented] (YARN-10427) Duplicate Job IDs in SLS output
[ https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253608#comment-17253608 ] Hadoop QA commented on YARN-10427: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 10s{color} | {color:red}{color} | {color:red} YARN-10427 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-10427 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13017522/YARN-10427.002.patch | | Console output | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/409/console | | versions | git=2.17.1 | | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. > Duplicate Job IDs in SLS output > --- > > Key: YARN-10427 > URL: https://issues.apache.org/jira/browse/YARN-10427 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0 > Environment: I ran the attached inputs on my MacBook Pro, using > Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also > tested against 3.2.1 and 3.3.0 release branches. > >Reporter: Drew Merrill >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, > YARN-10427.001.patch, YARN-10427.002.patch, fair-scheduler.xml, > inputsls.json, jobruntime.csv, jobruntime.csv, mapred-site.xml, > sls-runner.xml, yarn-site.xml > > > Hello, I'm hoping someone can help me resolve or understand some issues I've > been having with the YARN Scheduler Load Simulator (SLS). I've been > experimenting with SLS for several months now at work as we're trying to > build a simulation model to characterize our enterprise Hadoop infrastructure > for purposes of future capacity planning. In the process of attempting to > verify and validate the SLS output, I've encountered a number of issues > including runtime exceptions and bad output. The focus of this issue is the > bad output. In all my simulation runs, the jobruntime.csv output seems to > have one or more of the following problems: no output, duplicate job ids, > and/or missing job ids. > > Because of where I work, I'm unable to provide the exact inputs I typically > use, but I'm able to reproduce the problem of the duplicate Job IDS using > some simplified inputs and configuration files, which I've attached, along > with the output I obtained. > > The command I used to run the simulation: > {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json > --output-dir=sls-run-1 --print-simulation > --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}} > > Can anyone help me understand what would cause the duplicate Job IDs in the > output? Is this a bug in Hadoop or a problem with my inputs? Thanks in > advance. > > PS: This is my first issue I've ever opened so please be kind if I've missed > something or am not understanding something obvious about the way Hadoop > works. I'll gladly follow-up with more info as requested. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
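As a side note for anyone trying to reproduce the report, a quick way to spot the duplicate IDs is to look for repeated values in the job ID column of jobruntime.csv. This assumes the job ID is the first comma-separated field, as it appears to be in the attached file; adjust the column index if your output uses a different layout.
{code:bash}
# List job IDs that occur more than once in the SLS job runtime output.
# sls-run-1 is the --output-dir used in the reproduction command; a header
# line, if present, is harmless because it only appears once.
cut -d',' -f1 sls-run-1/jobruntime.csv | sort | uniq -d
{code}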
[jira] [Updated] (YARN-10427) Duplicate Job IDs in SLS output
[ https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10427: -- Attachment: YARN-10427.002.patch > Duplicate Job IDs in SLS output > --- > > Key: YARN-10427 > URL: https://issues.apache.org/jira/browse/YARN-10427 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0 > Environment: I ran the attached inputs on my MacBook Pro, using > Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also > tested against 3.2.1 and 3.3.0 release branches. > >Reporter: Drew Merrill >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, > YARN-10427.001.patch, YARN-10427.002.patch, fair-scheduler.xml, > inputsls.json, jobruntime.csv, jobruntime.csv, mapred-site.xml, > sls-runner.xml, yarn-site.xml > > > Hello, I'm hoping someone can help me resolve or understand some issues I've > been having with the YARN Scheduler Load Simulator (SLS). I've been > experimenting with SLS for several months now at work as we're trying to > build a simulation model to characterize our enterprise Hadoop infrastructure > for purposes of future capacity planning. In the process of attempting to > verify and validate the SLS output, I've encountered a number of issues > including runtime exceptions and bad output. The focus of this issue is the > bad output. In all my simulation runs, the jobruntime.csv output seems to > have one or more of the following problems: no output, duplicate job ids, > and/or missing job ids. > > Because of where I work, I'm unable to provide the exact inputs I typically > use, but I'm able to reproduce the problem of the duplicate Job IDS using > some simplified inputs and configuration files, which I've attached, along > with the output I obtained. > > The command I used to run the simulation: > {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json > --output-dir=sls-run-1 --print-simulation > --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}} > > Can anyone help me understand what would cause the duplicate Job IDs in the > output? Is this a bug in Hadoop or a problem with my inputs? Thanks in > advance. > > PS: This is my first issue I've ever opened so please be kind if I've missed > something or am not understanding something obvious about the way Hadoop > works. I'll gladly follow-up with more info as requested. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10427) Duplicate Job IDs in SLS output
[ https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253604#comment-17253604 ] Szilard Nemeth commented on YARN-10427: --- Accidentally attached a patch that also contains all the logging. Adding a second patch with just the fix. > Duplicate Job IDs in SLS output > --- > > Key: YARN-10427 > URL: https://issues.apache.org/jira/browse/YARN-10427 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0 > Environment: I ran the attached inputs on my MacBook Pro, using > Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also > tested against 3.2.1 and 3.3.0 release branches. > >Reporter: Drew Merrill >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, > YARN-10427.001.patch, fair-scheduler.xml, inputsls.json, jobruntime.csv, > jobruntime.csv, mapred-site.xml, sls-runner.xml, yarn-site.xml > > > Hello, I'm hoping someone can help me resolve or understand some issues I've > been having with the YARN Scheduler Load Simulator (SLS). I've been > experimenting with SLS for several months now at work as we're trying to > build a simulation model to characterize our enterprise Hadoop infrastructure > for purposes of future capacity planning. In the process of attempting to > verify and validate the SLS output, I've encountered a number of issues > including runtime exceptions and bad output. The focus of this issue is the > bad output. In all my simulation runs, the jobruntime.csv output seems to > have one or more of the following problems: no output, duplicate job ids, > and/or missing job ids. > > Because of where I work, I'm unable to provide the exact inputs I typically > use, but I'm able to reproduce the problem of the duplicate Job IDS using > some simplified inputs and configuration files, which I've attached, along > with the output I obtained. > > The command I used to run the simulation: > {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json > --output-dir=sls-run-1 --print-simulation > --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}} > > Can anyone help me understand what would cause the duplicate Job IDs in the > output? Is this a bug in Hadoop or a problem with my inputs? Thanks in > advance. > > PS: This is my first issue I've ever opened so please be kind if I've missed > something or am not understanding something obvious about the way Hadoop > works. I'll gladly follow-up with more info as requested. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10427) Duplicate Job IDs in SLS output
[ https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10427: -- Attachment: YARN-10427.001.patch > Duplicate Job IDs in SLS output > --- > > Key: YARN-10427 > URL: https://issues.apache.org/jira/browse/YARN-10427 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0 > Environment: I ran the attached inputs on my MacBook Pro, using > Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also > tested against 3.2.1 and 3.3.0 release branches. > >Reporter: Drew Merrill >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, > YARN-10427.001.patch, fair-scheduler.xml, inputsls.json, jobruntime.csv, > jobruntime.csv, mapred-site.xml, sls-runner.xml, yarn-site.xml > > > Hello, I'm hoping someone can help me resolve or understand some issues I've > been having with the YARN Scheduler Load Simulator (SLS). I've been > experimenting with SLS for several months now at work as we're trying to > build a simulation model to characterize our enterprise Hadoop infrastructure > for purposes of future capacity planning. In the process of attempting to > verify and validate the SLS output, I've encountered a number of issues > including runtime exceptions and bad output. The focus of this issue is the > bad output. In all my simulation runs, the jobruntime.csv output seems to > have one or more of the following problems: no output, duplicate job ids, > and/or missing job ids. > > Because of where I work, I'm unable to provide the exact inputs I typically > use, but I'm able to reproduce the problem of the duplicate Job IDS using > some simplified inputs and configuration files, which I've attached, along > with the output I obtained. > > The command I used to run the simulation: > {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json > --output-dir=sls-run-1 --print-simulation > --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}} > > Can anyone help me understand what would cause the duplicate Job IDs in the > output? Is this a bug in Hadoop or a problem with my inputs? Thanks in > advance. > > PS: This is my first issue I've ever opened so please be kind if I've missed > something or am not understanding something obvious about the way Hadoop > works. I'll gladly follow-up with more info as requested. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10427) Duplicate Job IDs in SLS output
[ https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10427: -- Attachment: YARN-10427-sls-scriptsandlogs.tar.gz > Duplicate Job IDs in SLS output > --- > > Key: YARN-10427 > URL: https://issues.apache.org/jira/browse/YARN-10427 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0 > Environment: I ran the attached inputs on my MacBook Pro, using > Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also > tested against 3.2.1 and 3.3.0 release branches. > >Reporter: Drew Merrill >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, > YARN-10427.001.patch, fair-scheduler.xml, inputsls.json, jobruntime.csv, > jobruntime.csv, mapred-site.xml, sls-runner.xml, yarn-site.xml > > > Hello, I'm hoping someone can help me resolve or understand some issues I've > been having with the YARN Scheduler Load Simulator (SLS). I've been > experimenting with SLS for several months now at work as we're trying to > build a simulation model to characterize our enterprise Hadoop infrastructure > for purposes of future capacity planning. In the process of attempting to > verify and validate the SLS output, I've encountered a number of issues > including runtime exceptions and bad output. The focus of this issue is the > bad output. In all my simulation runs, the jobruntime.csv output seems to > have one or more of the following problems: no output, duplicate job ids, > and/or missing job ids. > > Because of where I work, I'm unable to provide the exact inputs I typically > use, but I'm able to reproduce the problem of the duplicate Job IDS using > some simplified inputs and configuration files, which I've attached, along > with the output I obtained. > > The command I used to run the simulation: > {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json > --output-dir=sls-run-1 --print-simulation > --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}} > > Can anyone help me understand what would cause the duplicate Job IDs in the > output? Is this a bug in Hadoop or a problem with my inputs? Thanks in > advance. > > PS: This is my first issue I've ever opened so please be kind if I've missed > something or am not understanding something obvious about the way Hadoop > works. I'll gladly follow-up with more info as requested. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10427) Duplicate Job IDs in SLS output
[ https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-10427: - Assignee: Szilard Nemeth > Duplicate Job IDs in SLS output > --- > > Key: YARN-10427 > URL: https://issues.apache.org/jira/browse/YARN-10427 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0 > Environment: I ran the attached inputs on my MacBook Pro, using > Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also > tested against 3.2.1 and 3.3.0 release branches. > >Reporter: Drew Merrill >Assignee: Szilard Nemeth >Priority: Major > Attachments: fair-scheduler.xml, inputsls.json, jobruntime.csv, > jobruntime.csv, mapred-site.xml, sls-runner.xml, yarn-site.xml > > > Hello, I'm hoping someone can help me resolve or understand some issues I've > been having with the YARN Scheduler Load Simulator (SLS). I've been > experimenting with SLS for several months now at work as we're trying to > build a simulation model to characterize our enterprise Hadoop infrastructure > for purposes of future capacity planning. In the process of attempting to > verify and validate the SLS output, I've encountered a number of issues > including runtime exceptions and bad output. The focus of this issue is the > bad output. In all my simulation runs, the jobruntime.csv output seems to > have one or more of the following problems: no output, duplicate job ids, > and/or missing job ids. > > Because of where I work, I'm unable to provide the exact inputs I typically > use, but I'm able to reproduce the problem of the duplicate Job IDS using > some simplified inputs and configuration files, which I've attached, along > with the output I obtained. > > The command I used to run the simulation: > {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json > --output-dir=sls-run-1 --print-simulation > --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}} > > Can anyone help me understand what would cause the duplicate Job IDs in the > output? Is this a bug in Hadoop or a problem with my inputs? Thanks in > advance. > > PS: This is my first issue I've ever opened so please be kind if I've missed > something or am not understanding something obvious about the way Hadoop > works. I'll gladly follow-up with more info as requested. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
[ https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253560#comment-17253560 ] Jim Brennan commented on YARN-10540: Thanks [~ebadger]! And thanks [~hexiaoqiao], [~ayushtkn] and [~sunilg] for finding and investigating this bug. Much appreciated. > Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes > > > Key: YARN-10540 > URL: https://issues.apache.org/jira/browse/YARN-10540 > Project: Hadoop YARN > Issue Type: Task > Components: webapp >Affects Versions: 3.2.2 >Reporter: Sunil G >Assignee: Jim Brennan >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3 > > Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 > PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, > Yarn-UI-Ubuntu.png, yarnodes.png, yarnui2onubuntu.png > > > YARN-10450 added changes in NodeInfo class. > Various exceptions are showing while accessing UI2 and UI1 NODE pages. > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.(NodeInfo.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:243) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at > org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70) > {code} > {code:java} > 2020-12-19 22:55:54,846 WARN > org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.(NodeInfo.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Issue Comment Deleted] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caozhiqiang updated YARN-10501: --- Comment: was deleted (was: @[~bteke], [~snemeth], [~epayne], [~hexiaoqiao], [~wilfreds], [~jianhe], [~ebadger] could you help me to review this issue or recommend another suitable committer to review? Thank you.) > Can't remove all node labels after add node label without nodemanager port > -- > > Key: YARN-10501 > URL: https://issues.apache.org/jira/browse/YARN-10501 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Critical > Attachments: YARN-10501.002.patch, YARN-10501.003.patch > > > When add a label to nodes without nodemanager port or use WILDCARD_PORT (0) > port, it can't remove all label info in these nodes > Reproduce process: > {code:java} > 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)" > 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode" > 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}} > 4.yarn rmadmin -replaceLabelsOnNode "server001" > 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}} > {code} > You can see after the 4 process to remove nodemanager labels, the label info > is still in the node info. > {code:java} > 641 case REPLACE: > 642 replaceNodeForLabels(nodeId, host.labels, labels); > 643 replaceLabelsForNode(nodeId, host.labels, labels); > 644 host.labels.clear(); > 645 host.labels.addAll(labels); > 646 for (Node node : host.nms.values()) { > 647 replaceNodeForLabels(node.nodeId, node.labels, labels); > 649 node.labels = null; > 650 } > 651 break;{code} > The cause is in 647 line, when add labels to node without port, the 0 port > and the real nm port with be both add to node info, and when remove labels, > the parameter node.labels in 647 line is null, so it will not remove the old > label. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253330#comment-17253330 ] caozhiqiang commented on YARN-10501: could anyone review this issue? > Can't remove all node labels after add node label without nodemanager port > -- > > Key: YARN-10501 > URL: https://issues.apache.org/jira/browse/YARN-10501 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Critical > Attachments: YARN-10501.002.patch, YARN-10501.003.patch > > > When add a label to nodes without nodemanager port or use WILDCARD_PORT (0) > port, it can't remove all label info in these nodes > Reproduce process: > {code:java} > 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)" > 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode" > 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}} > 4.yarn rmadmin -replaceLabelsOnNode "server001" > 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}} > {code} > You can see after the 4 process to remove nodemanager labels, the label info > is still in the node info. > {code:java} > 641 case REPLACE: > 642 replaceNodeForLabels(nodeId, host.labels, labels); > 643 replaceLabelsForNode(nodeId, host.labels, labels); > 644 host.labels.clear(); > 645 host.labels.addAll(labels); > 646 for (Node node : host.nms.values()) { > 647 replaceNodeForLabels(node.nodeId, node.labels, labels); > 649 node.labels = null; > 650 } > 651 break;{code} > The cause is in 647 line, when add labels to node without port, the 0 port > and the real nm port with be both add to node info, and when remove labels, > the parameter node.labels in 647 line is null, so it will not remove the old > label. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
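For completeness, one small check that could accompany the reproduce steps above: after step 4 replaces the node's labels with an empty set, the label-mappings endpoint should no longer reference the host at all. This is only a suggested verification, not part of any patch; RM_IP and server001 are the same placeholders used in the description.
{code:bash}
# After 'yarn rmadmin -replaceLabelsOnNode "server001"' removes all labels,
# a fixed RM should return no mapping entries for that host (expect count 0;
# with the bug, the stale server001:45454 entry still shows up).
curl -s "http://RM_IP:8088/ws/v1/cluster/label-mappings" | grep -c "server001"
{code}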