[jira] [Comment Edited] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2020-12-22 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253943#comment-17253943
 ] 

zhuqi edited comment on YARN-10506 at 12/23/20, 7:53 AM:
-

[~gandras]

Thanks for your reply.

[~wangda] [~gandras]

1. What I said is related to policy management; we should fix it later, and 
maybe we need a new sub-task.

2. I think we should disallow it by throwing an exception, but we could also 
add an optional admin API to destroy all of its underlying children; that 
would be more reasonable (see the sketch below).

3. Also, question 2 is related to auto deletion, so we should discuss that as 
well.

Thanks.
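
For item 2, a minimal sketch of what I mean (a hypothetical helper, not the 
actual CapacityScheduler code; the admin flag is an assumption):

{code:java}
import java.io.IOException;
import java.util.List;

// Hypothetical guard, for illustration only: refuse to turn a dynamic parent
// queue that still has children into a static leaf queue, unless an optional
// admin-driven removal of the children was explicitly requested.
final class QueueConversionGuard {
  static void checkConversionToStaticLeaf(String queuePath,
      List<String> childQueues, boolean adminRequestedChildRemoval)
      throws IOException {
    if (childQueues.isEmpty()) {
      return; // no children underneath, the conversion is safe
    }
    if (!adminRequestedChildRemoval) {
      throw new IOException("Cannot convert dynamic parent queue " + queuePath
          + " to a static leaf queue: it still has child queues " + childQueues
          + "; remove them first or use the (optional) admin removal API");
    }
    // adminRequestedChildRemoval == true: the caller destroys the children
    // before the conversion proceeds.
  }
}
{code}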


was (Author: zhuqi):
[~gandras]

Thanks for your reply.

1. What I said is related to policy management; we should fix it later, and 
maybe we need a new sub-task.

2. I think we should disallow it by throwing an exception, but we could also 
add an optional admin API to destroy all of its underlying children; that 
would be more reasonable.

3. Also, question 2 is related to auto deletion, so we should discuss that as 
well.

Thanks.

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506.001.patch, YARN-10506.002.patch
>
>
> The queue creation logic should be updated to use weight mode and support 
> flexible static/dynamic queue creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2020-12-22 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253943#comment-17253943
 ] 

zhuqi edited comment on YARN-10506 at 12/23/20, 7:52 AM:
-

[~gandras]

Thanks for your reply.

1. What I said is related to policy management; we should fix it later, and 
maybe we need a new sub-task.

2. I think we should disallow it by throwing an exception, but we could also 
add an optional admin API to destroy all of its underlying children; that 
would be more reasonable.

3. Also, question 2 is related to auto deletion, so we should discuss that as 
well.

Thanks.


was (Author: zhuqi):
[~gandras]

Thanks for your reply.

1. What I said is related to policy management; we should fix it later, and 
maybe we need a new sub-task.

2. I think we should disallow it by throwing an exception, but we could also 
add an optional admin API to destroy all of its underlying children; that 
would be more reasonable.

Thanks.

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506.001.patch, YARN-10506.002.patch
>
>
> The queue creation logic should be updated to use weight mode and support 
> flexible static/dynamic queue creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2020-12-22 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253943#comment-17253943
 ] 

zhuqi commented on YARN-10506:
--

[~gandras]

Thanks for your reply.

1. What I said is related to policy management; we should fix it later, and 
maybe we need a new sub-task.

2. I think we should disallow it by throwing an exception, but we could also 
add an optional admin API to destroy all of its underlying children; that 
would be more reasonable.

Thanks.

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506.001.patch, YARN-10506.002.patch
>
>
> The queue creation logic should be updated to use weight mode and support 
> flexible static/dynamic queue creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8529) Add timeout to RouterWebServiceUtil#invokeRMWebService

2020-12-22 Thread Minni Mittal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Minni Mittal updated YARN-8529:
---
Attachment: YARN-8529.v11.patch

> Add timeout to RouterWebServiceUtil#invokeRMWebService
> --
>
> Key: YARN-8529
> URL: https://issues.apache.org/jira/browse/YARN-8529
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Minni Mittal
>Priority: Major
> Attachments: YARN-8529.v1.patch, YARN-8529.v10.patch, 
> YARN-8529.v11.patch, YARN-8529.v2.patch, YARN-8529.v3.patch, 
> YARN-8529.v4.patch, YARN-8529.v5.patch, YARN-8529.v6.patch, 
> YARN-8529.v7.patch, YARN-8529.v8.patch, YARN-8529.v9.patch
>
>
> {{RouterWebServiceUtil#invokeRMWebService}} currently has a fixed timeout. 
> This should be configurable.
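
As a rough illustration of the idea only (the property name below is made up 
and this is not the actual RouterWebServiceUtil code path), the timeout could 
be read from the configuration instead of being hard-coded:

{code:java}
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;

final class ConfigurableTimeoutSketch {
  // Hypothetical key and default, for illustration only.
  static final String ROUTER_WEBAPP_TIMEOUT_MS =
      "yarn.router.webapp.connect-timeout-ms";
  static final int DEFAULT_TIMEOUT_MS = 30_000;

  static HttpURLConnection open(Configuration conf, String url)
      throws Exception {
    int timeoutMs = conf.getInt(ROUTER_WEBAPP_TIMEOUT_MS, DEFAULT_TIMEOUT_MS);
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setConnectTimeout(timeoutMs); // fail fast if the RM is unreachable
    conn.setReadTimeout(timeoutMs);    // bound the wait for a response
    return conn;
  }
}
{code}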



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8557) Exclude lagged/unhealthy/decommissioned nodes in async allocating thread

2020-12-22 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-8557:

Attachment: (was: YARN-8557.001.patch)

> Exclude lagged/unhealthy/decommissioned nodes in async allocating thread
> 
>
> Key: YARN-8557
> URL: https://issues.apache.org/jira/browse/YARN-8557
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.4.0
>Reporter: Weiwei Yang
>Assignee: zhuqi
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently only HB-lagged nodes are handled, with a hard-coded lag of 2 
> heartbeat intervals, which we should make configurable. Moreover, we need to 
> exclude unhealthy and decommissioned nodes too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8557) Exclude lagged/unhealthy/decommissioned nodes in async allocating thread

2020-12-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-8557:
-
Labels: pull-request-available  (was: )

> Exclude lagged/unhealthy/decommissioned nodes in async allocating thread
> 
>
> Key: YARN-8557
> URL: https://issues.apache.org/jira/browse/YARN-8557
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.4.0
>Reporter: Weiwei Yang
>Assignee: zhuqi
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-8557.001.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently only HB-lagged nodes are handled, with a hard-coded lag of 2 
> heartbeat intervals, which we should make configurable. Moreover, we need to 
> exclude unhealthy and decommissioned nodes too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253904#comment-17253904
 ] 

Xiaoqiao He commented on YARN-10540:


I am not sure if this is expected. If so, I would like to backport it to 3.2.2 
and prepare another RC. [~ebadger], [~Jim_Brennan], would you like to take 
another look? Thanks.

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Assignee: Jim Brennan
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, 
> Yarn-UI-Ubuntu.png, osx-yarn-ui2.png, yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions show up while accessing the UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8557) Exclude lagged/unhealthy/decommissioned nodes in async allocating thread

2020-12-22 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239526#comment-17239526
 ] 

zhuqi edited comment on YARN-8557 at 12/23/20, 5:50 AM:


[~cheersyang] [~wangda] [~tangzhankun] [~sunilg] [~BilwaST] [~Tao Yang]

I have added the patch; could you review it?

1. Add a configurable HB lag.

2. Support excluding nodes that are not running (a rough sketch follows below).

Thanks.
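
A rough sketch of the exclusion check described in the two points above (the 
names and config handling are illustrative only, not the actual 
CapacityScheduler internals):

{code:java}
import java.util.EnumSet;

// Illustrative filter for the async allocating thread: skip nodes that are
// HB-lagged beyond a configurable multiplier, or that are not in a running
// (schedulable) state.
final class AsyncAllocationNodeFilter {
  enum NodeState { RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST }

  private static final EnumSet<NodeState> NOT_RUNNING = EnumSet.of(
      NodeState.UNHEALTHY, NodeState.DECOMMISSIONING,
      NodeState.DECOMMISSIONED, NodeState.LOST);

  private final long heartbeatIntervalMs;
  private final int lagMultiplier; // previously hard-coded to 2

  AsyncAllocationNodeFilter(long heartbeatIntervalMs, int lagMultiplier) {
    this.heartbeatIntervalMs = heartbeatIntervalMs;
    this.lagMultiplier = lagMultiplier;
  }

  /** Returns true if the async allocating thread should skip this node. */
  boolean shouldSkip(NodeState state, long lastHeartbeatMs, long nowMs) {
    if (NOT_RUNNING.contains(state)) {
      return true; // exclude unhealthy/decommissioning/decommissioned/lost
    }
    return (nowMs - lastHeartbeatMs) > (long) lagMultiplier * heartbeatIntervalMs;
  }
}
{code}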


was (Author: zhuqi):
[~cheersyang] [~wangda] [~tangzhankun] [~sunilg] [~BilwaST] [~Tao Yang]

I have added the patch; could you review it?

1. Add a configurable HB lag.

2. Support excluding unhealthy, decommissioned, and decommissioning nodes.

Thanks.

> Exclude lagged/unhealthy/decommissioned nodes in async allocating thread
> 
>
> Key: YARN-8557
> URL: https://issues.apache.org/jira/browse/YARN-8557
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.4.0
>Reporter: Weiwei Yang
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-8557.001.patch
>
>
> Currently only HB-lagged nodes are handled, with a hard-coded lag of 2 
> heartbeat intervals, which we should make configurable. Moreover, we need to 
> exclude unhealthy and decommissioned nodes too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Drew Merrill (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253892#comment-17253892
 ] 

Drew Merrill commented on YARN-10427:
-

Hi [~snemeth], wow, your response is amazing! I need to set aside a good chunk 
of time to digest it in its entirety and actually work through the debugging 
procedure you went through, step-by-step. But I just want to express my sincere 
gratitude for putting the time and energy into crafting such a detailed and 
instructive follow-up that both confirmed my findings and showed, in great 
detail and clarity, the steps you took to identify the source of the problem 
along with possible solutions. Thank you!

> Duplicate Job IDs in SLS output
> ---
>
> Key: YARN-10427
> URL: https://issues.apache.org/jira/browse/YARN-10427
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0
> Environment: I ran the attached inputs on my MacBook Pro, using 
> Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also 
> tested against 3.2.1 and 3.3.0 release branches.
>  
>Reporter: Drew Merrill
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, 
> YARN-10427.001.patch, YARN-10427.002.patch, YARN-10427.003.patch, 
> fair-scheduler.xml, inputsls.json, jobruntime.csv, jobruntime.csv, 
> mapred-site.xml, sls-runner.xml, yarn-site.xml
>
>
> Hello, I'm hoping someone can help me resolve or understand some issues I've 
> been having with the YARN Scheduler Load Simulator (SLS). I've been 
> experimenting with SLS for several months now at work as we're trying to 
> build a simulation model to characterize our enterprise Hadoop infrastructure 
> for purposes of future capacity planning. In the process of attempting to 
> verify and validate the SLS output, I've encountered a number of issues 
> including runtime exceptions and bad output. The focus of this issue is the 
> bad output. In all my simulation runs, the jobruntime.csv output seems to 
> have one or more of the following problems: no output, duplicate job ids, 
> and/or missing job ids.
>  
> Because of where I work, I'm unable to provide the exact inputs I typically 
> use, but I'm able to reproduce the problem of the duplicate Job IDs using 
> some simplified inputs and configuration files, which I've attached, along 
> with the output I obtained.
>  
> The command I used to run the simulation:
> {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json 
> --output-dir=sls-run-1 --print-simulation 
> --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}}
>  
> Can anyone help me understand what would cause the duplicate Job IDs in the 
> output? Is this a bug in Hadoop or a problem with my inputs? Thanks in 
> advance.
>  
> PS: This is my first issue I've ever opened so please be kind if I've missed 
> something or am not understanding something obvious about the way Hadoop 
> works. I'll gladly follow-up with more info as requested.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-22 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253889#comment-17253889
 ] 

Xiaoqiao He commented on YARN-10540:


Thanks [~Jim_Brennan], [~ebadger], [~sunilg] for your work. It mostly seems to 
work well, but one link no longer navigates correctly in UI2 on my local Mac 
and Ubuntu machines; see the following screenshot. Thanks.
 !osx-yarn-ui2.png! 

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Assignee: Jim Brennan
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, 
> Yarn-UI-Ubuntu.png, osx-yarn-ui2.png, yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions show up while accessing the UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated YARN-10540:
---
Attachment: osx-yarn-ui2.png

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Assignee: Jim Brennan
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, 
> Yarn-UI-Ubuntu.png, osx-yarn-ui2.png, yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions show up while accessing the UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2020-12-22 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253792#comment-17253792
 ] 

Eric Badger commented on YARN-10501:


I'm looking at this patch and I have some questions about other pieces of code 
that are in the same section. I'll admit, this code is a little bit confusing 
to me because we have Hosts->Labels maps as well as Labels->Nodes maps and then 
on top of that, each Host can have multiple Nodes. Before I can comment on your 
patch, I think I need to clear up some things that are going on in this area of 
code that are confusing to me. Below is what I _think_ is happening. Feel free 
to correct me where I'm wrong.

(Assuming no port and/or wildcard port for this)
1. When we add a node label, we invoke this piece of code. 
{noformat}
case ADD:
  addNodeToLabels(nodeId, labels);
  host.labels.addAll(labels);
  for (Node node : host.nms.values()) {
if (node.labels != null) {
  node.labels.addAll(labels);
}
addNodeToLabels(node.nodeId, labels);
  }
  break;
{noformat}
  1a. This code adds the NodeId (without a port/with a wildcard port) to the 
Labels->Nodes map via addNodeToLabels. 
*Why do we do this? There is no port associated with this node. In 1d we 
add the nodes to the map with their associated port, so I don't understand why 
we're adding the node here when it doesn't have a port.*
  1b. It adds all of the labels to the Host. This part doesn't make sense to 
me. *If we are giving Hosts the granularity to have multiple labels per host 
(due to multiple NMs), then why does the Host itself have labels?*
  1c. We add all the labels to each Node in the host, but _only_ if they 
already have labels. *Why do we only add the labels if they already have 
labels? Don't we want to add the labels regardless? Should it be possible for 
us to be in the ADD method while node.labels == null? Maybe this should throw 
an exception*
  1d. We add the Nodes (with their associated NM port) to the Labels->nodes map 
via addNodeToLabels.

2. When we replace the node label we invoke this piece of code
{noformat}
case REPLACE:
  replaceNodeForLabels(nodeId, host.labels, labels);
  host.labels.clear();
  host.labels.addAll(labels);
  for (Node node : host.nms.values()) {
replaceNodeForLabels(node.nodeId, node.labels, labels);
node.labels = null;
  }
{noformat}
  2a. We remove the Node (without port or with wildcard port) from the specific 
label in the Labels->Nodes map via removeNodeFromLabels(). *Why do we have the 
node without a port in the first place?* This comes from 1a.
  2b. We add the Node (without port or with wildcard port) to the new specific 
label in the Labels->Nodes map via addNodeToLabels(). *Why do we add the node 
without a port?*
  2c. We clear the labels associated with the Host. *Why are there labels 
associated with a Host when each Host is actually a collection of Nodes?*
  2d. We add the new labels to the Host. Same question as 2c.
  2e. We iterate through the list of Nodes associated with each Host and 
perform 2a and 2b, except with Nodes that have their associated ports.
  2f. We set the Labels to Null for each Node associated with the Host. I don't 
understand the purpose of this. I must be missing something here.

Overall I have 2 main issues with the code that I need cleared up, because I 
either don't understand the code or think things are broken/unnecessary (see 
the illustrative sketch after this list):
1) We have labels associated with Hosts (which are collections of Nodes) _and_ 
labels associated with just Nodes.
2) We add Nodes that have no associated port or have the wildcard port _and_ 
add those same nodes with their associated ports.
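
To make 1) and 2) concrete, here is an illustrative-only model of the two 
parallel structures as I understand them (the names are hypothetical, not the 
actual CommonNodeLabelsManager fields):

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class LabelStoreSketch {
  static class Node {            // one NM instance, identified by host:port
    final String nodeId;         // e.g. "server001:45454"
    Set<String> labels;          // may be null until explicitly set
    Node(String nodeId) { this.nodeId = nodeId; }
  }

  static class Host {            // one physical host, possibly several NMs
    final Set<String> labels = new HashSet<>();    // host-level labels (issue 1)
    final Map<String, Node> nms = new HashMap<>(); // port -> Node
  }

  final Map<String, Host> hosts = new HashMap<>();               // hostname -> Host
  final Map<String, Set<String>> labelToNodes = new HashMap<>(); // label -> nodeIds

  // ADD path: the label is keyed both by the port-less nodeId ("server001:0",
  // issue 2) and by every concrete "host:port" nodeId of that host.
  void addLabel(String hostname, String label) {
    Host host = hosts.computeIfAbsent(hostname, h -> new Host());
    host.labels.add(label);
    Set<String> nodes = labelToNodes.computeIfAbsent(label, l -> new HashSet<>());
    nodes.add(hostname + ":0");
    for (Node node : host.nms.values()) {
      if (node.labels != null) {
        node.labels.add(label);  // only if the node already has a label set (1c)
      }
      nodes.add(node.nodeId);
    }
  }
}
{code}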

We probably need [~leftnoteasy], [~varunsaxena], or [~sunilg] to comment on 
this, since they were involved with YARN-3075, which added much of this code.

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Attachments: YARN-10501.002.patch, YARN-10501.003.patch
>
>
> When add a label to nodes without nodemanager port or use WILDCARD_PORT (0) 
> port, it can't remove all label info in these nodes
> Reproduce process:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclu

[jira] [Updated] (YARN-10040) DistributedShell test failure on X86 and ARM

2020-12-22 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated YARN-10040:
-
Priority: Major  (was: Blocker)

> DistributedShell test failure on X86 and ARM
> 
>
> Key: YARN-10040
> URL: https://issues.apache.org/jira/browse/YARN-10040
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications/distributed-shell
> Environment: X86/ARM
> OS: ubuntu1804
> Java 8
>Reporter: zhao bo
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-10040.001.patch
>
>
> * 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers
>  * 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithEnforceExecutionType
> Please see the Apache Jenkins Test result:
> [https://builds.apache.org/job/hadoop-multibranch/job/PR-1767/1/testReport/]
>  
> These 2 tests fail on both the X86 and ARM platforms.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10040) DistributedShell test failure on X86 and ARM

2020-12-22 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253737#comment-17253737
 ] 

Ahmed Hussein edited comment on YARN-10040 at 12/22/20, 8:48 PM:
-

[~abmodi] can you suggest anyone familiar with the changes done in YARN-9697?


was (Author: ahussein):
I changed the status of this Jira to blocker.

[~abmodi] can you suggest anyone familiar with the changes done in YARN-9697?

> DistributedShell test failure on X86 and ARM
> 
>
> Key: YARN-10040
> URL: https://issues.apache.org/jira/browse/YARN-10040
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications/distributed-shell
> Environment: X86/ARM
> OS: ubuntu1804
> Java 8
>Reporter: zhao bo
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-10040.001.patch
>
>
> * 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers
>  * 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithEnforceExecutionType
> Please see the Apache Jenkins Test result:
> [https://builds.apache.org/job/hadoop-multibranch/job/PR-1767/1/testReport/]
>  
> These 2 tests fail on both the X86 and ARM platforms.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10040) DistributedShell test failure on X86 and ARM

2020-12-22 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253737#comment-17253737
 ] 

Ahmed Hussein commented on YARN-10040:
--

I changed the status of this Jira to blocker.

[~abmodi] can you suggest anyone familiar with the changes done in YARN-9697?

> DistributedShell test failure on X86 and ARM
> 
>
> Key: YARN-10040
> URL: https://issues.apache.org/jira/browse/YARN-10040
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications/distributed-shell
> Environment: X86/ARM
> OS: ubuntu1804
> Java 8
>Reporter: zhao bo
>Assignee: Abhishek Modi
>Priority: Blocker
> Attachments: YARN-10040.001.patch
>
>
> * 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers
>  * 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithEnforceExecutionType
> Please see the Apache Jenkins Test result:
> [https://builds.apache.org/job/hadoop-multibranch/job/PR-1767/1/testReport/]
>  
> These 2 tests fail on both the X86 and ARM platforms.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10040) DistributedShell test failure on X86 and ARM

2020-12-22 Thread Ahmed Hussein (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated YARN-10040:
-
Priority: Blocker  (was: Major)

> DistributedShell test failure on X86 and ARM
> 
>
> Key: YARN-10040
> URL: https://issues.apache.org/jira/browse/YARN-10040
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: applications/distributed-shell
> Environment: X86/ARM
> OS: ubuntu1804
> Java 8
>Reporter: zhao bo
>Assignee: Abhishek Modi
>Priority: Blocker
> Attachments: YARN-10040.001.patch
>
>
> * 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers
>  * 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithEnforceExecutionType
> Please see the Apache Jenkins Test result:
> [https://builds.apache.org/job/hadoop-multibranch/job/PR-1767/1/testReport/]
>  
> These 2 tests fail on both the X86 and ARM platforms.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253693#comment-17253693
 ] 

Hadoop QA commented on YARN-10427:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
44s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to 
include any new or modified tests. Please justify why no new tests are needed 
for this patch. Also please list what manual steps were performed to verify 
this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
52s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
29s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
28s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
33s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 59s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  0m 
49s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs 
config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
46s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
26s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
22s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
22s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
20s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
20s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 16s{color} | 
{color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/411/artifact/out/diff-checkstyle-hadoop-tools_hadoop-sls.txt{color}
 | {color:orange} hadoop-tools/hadoop-sls: The patch generated 1 new + 22 
unchanged - 1 fixed = 23 total (was 23) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
22s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 40s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {col

[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2020-12-22 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253687#comment-17253687
 ] 

Hadoop QA commented on YARN-10506:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  9s{color} 
| {color:red}{color} | {color:red} YARN-10506 does not apply to trunk. Rebase 
required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for 
help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-10506 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13017530/YARN-10506.002.patch |
| Console output | 
https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/412/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |


This message was automatically generated.



> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506.001.patch, YARN-10506.002.patch
>
>
> The queue creation logic should be updated to use weight mode and support 
> flexible static/dynamic queue creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2020-12-22 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10506:

Attachment: YARN-10506.002.patch

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506.001.patch, YARN-10506.002.patch
>
>
> The queue creation logic should be updated to use weight mode and support 
> flexible static/dynamic queue creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2020-12-22 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253682#comment-17253682
 ] 

Andras Gyori commented on YARN-10506:
-

I have taken your advice into account; however, I would need more details from 
you about the problem at hand. You mentioned ManagedParent, but I need to 
emphasise that we have taken an approach that does not involve 
ManagedParentQueue or AutoCreatedLeafQueue at all. We are reusing the original 
ParentQueue and LeafQueue classes. 
I have fixed the failing tests, but I have also identified a possible issue 
which needs to be addressed. If a dynamic queue is a ParentQueue and it has 
child queues as well, and we make it a static queue, it will be converted to a 
LeafQueue, basically destroying all of its underlying children. Either we 
disallow it by throwing an exception, or we convert it successfully. What is 
your opinion on this?

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506.001.patch, YARN-10506.002.patch
>
>
> The queue creation logic should be updated to use weight mode and support 
> flexible static/dynamic queue creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10516) In HA mode, when one Resource Manager has networking issue, getTokenService() should not throw runtime exception

2020-12-22 Thread Xu Cang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253675#comment-17253675
 ] 

Xu Cang commented on YARN-10516:


[~epayne] [~hexiaoqiao] [~Jim_Brennan]

Hi, I would love to get some review on this. Thank you.

> In HA mode, when one Resource Manager has networking issue, getTokenService() 
> should not throw runtime exception
> 
>
> Key: YARN-10516
> URL: https://issues.apache.org/jira/browse/YARN-10516
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Reporter: Xu Cang
>Priority: Minor
> Attachments: YARN-10516.001.patch, YARN-10516.002.patch, 
> YARN-10516.003.patch, YARN-10516.004.patch, YARN-10516.007.patch
>
>
> We have observed one issue from YARN client around this piece of code:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ClientRMProxy.java#L145]
>  
> While 
> {code:java}
> services.add(SecurityUtil.buildTokenService(yarnConf.getSocketAddr(address, 
> defaultAddr, defaultPort)).toString());
>  
> {code}
> is being called, buildTokenService() fails and throws a runtime exception, 
> more specifically an UnknownHostException, from here: 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/SecurityUtil.java#L466]
>  *while one of the RM hosts was having a networking issue* such that its IP 
> could not be resolved.
> This runtime exception then floats all the way up to our application and 
> causes the MR job submission to fail. 
> In my opinion, since we have HA here, multiple RMs are still alive and 
> available. We should catch this exception in getTokenService() and handle it 
> properly, instead of failing the whole action. 
>  
>  
> Would like to hear your opinion on this, if agreed, I will provide a patch on 
> this. Thank you.
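
A minimal sketch of the proposal above (not the actual ClientRMProxy code; it 
assumes the resolution failure surfaces as a RuntimeException, e.g. an 
IllegalArgumentException caused by the UnknownHostException): skip the RM 
whose address cannot be resolved and fail only if no RM address was usable.

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SecurityUtil;

final class HaTokenServiceSketch {
  static String buildTokenServices(Configuration conf, List<String> addressKeys,
      String defaultAddr, int defaultPort) {
    List<String> services = new ArrayList<>();
    for (String key : addressKeys) {
      try {
        services.add(SecurityUtil.buildTokenService(
            conf.getSocketAddr(key, defaultAddr, defaultPort)).toString());
      } catch (RuntimeException e) {
        // One unresolvable RM should not fail HA clients; skip it (and log).
      }
    }
    if (services.isEmpty()) {
      throw new IllegalStateException(
          "No resolvable ResourceManager address found");
    }
    return String.join(",", services);
  }
}
{code}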



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10545) Improve the readability of diagnostics log in yarn-ui2 web page.

2020-12-22 Thread akiyamaneko (Jira)
akiyamaneko created YARN-10545:
--

 Summary: Improve the readability of diagnostics log in yarn-ui2 
web page.
 Key: YARN-10545
 URL: https://issues.apache.org/jira/browse/YARN-10545
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn-ui-v2
Reporter: akiyamaneko
 Attachments: Diagnostics shows unreadble.png

If the diagnostic log in yarn-ui2 has multiple lines, line breaks and spaces 
are not displayed, which makes it hard to read.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253654#comment-17253654
 ] 

Hadoop QA commented on YARN-10427:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
35s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to 
include any new or modified tests. Please justify why no new tests are needed 
for this patch. Also please list what manual steps were performed to verify 
this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
 5s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
31s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
28s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
31s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m  9s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  0m 
48s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs 
config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
45s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
26s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
21s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
21s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
20s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
20s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 15s{color} | 
{color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/410/artifact/out/diff-checkstyle-hadoop-tools_hadoop-sls.txt{color}
 | {color:orange} hadoop-tools/hadoop-sls: The patch generated 1 new + 22 
unchanged - 1 fixed = 23 total (was 23) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
23s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 47s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {col

[jira] [Updated] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10427:
--
Attachment: YARN-10427.003.patch

> Duplicate Job IDs in SLS output
> ---
>
> Key: YARN-10427
> URL: https://issues.apache.org/jira/browse/YARN-10427
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0
> Environment: I ran the attached inputs on my MacBook Pro, using 
> Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also 
> tested against 3.2.1 and 3.3.0 release branches.
>  
>Reporter: Drew Merrill
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, 
> YARN-10427.001.patch, YARN-10427.002.patch, YARN-10427.003.patch, 
> fair-scheduler.xml, inputsls.json, jobruntime.csv, jobruntime.csv, 
> mapred-site.xml, sls-runner.xml, yarn-site.xml
>
>
> Hello, I'm hoping someone can help me resolve or understand some issues I've 
> been having with the YARN Scheduler Load Simulator (SLS). I've been 
> experimenting with SLS for several months now at work as we're trying to 
> build a simulation model to characterize our enterprise Hadoop infrastructure 
> for purposes of future capacity planning. In the process of attempting to 
> verify and validate the SLS output, I've encountered a number of issues 
> including runtime exceptions and bad output. The focus of this issue is the 
> bad output. In all my simulation runs, the jobruntime.csv output seems to 
> have one or more of the following problems: no output, duplicate job ids, 
> and/or missing job ids.
>  
> Because of where I work, I'm unable to provide the exact inputs I typically 
> use, but I'm able to reproduce the problem of the duplicate Job IDs using 
> some simplified inputs and configuration files, which I've attached, along 
> with the output I obtained.
>  
> The command I used to run the simulation:
> {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json 
> --output-dir=sls-run-1 --print-simulation 
> --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}}
>  
> Can anyone help me understand what would cause the duplicate Job IDs in the 
> output? Is this a bug in Hadoop or a problem with my inputs? Thanks in 
> advance.
>  
> PS: This is my first issue I've ever opened so please be kind if I've missed 
> something or am not understanding something obvious about the way Hadoop 
> works. I'll gladly follow-up with more info as requested.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253589#comment-17253589
 ] 

Szilard Nemeth edited comment on YARN-10427 at 12/22/20, 4:29 PM:
--

Hi [~werd.up],
 Thanks for reporting this issue, and congratulations on your first reported 
Hadoop YARN jira.
{quote}In the process of attempting to verify and validate the SLS output, I've 
encountered a number of issues including runtime exceptions and bad output.
{quote}
I read through your observations and spent some time playing around with SLS.

If you encountered other issues, please file separate jiras for them when you 
have some time.

As the process of running SLS involved some repetitive tasks (uploading 
configs to the remote machine, launching SLS, saving the resulting logs, ...), 
I created some scripts in my public [Github repo 
here|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427]

Let me summarize what the scripts are doing: 
 1. [config 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/config]:
 This is the exact same configuration file set that you attached to this jira, 
with the exception of the log4j.properties file, which turns on DEBUG logging 
for SLS.

2. [upstream-patches 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/upstream-patches]:
 This directory contains the logging patch that helped me see the issues more 
clearly.
 My code changes are also pushed to my [Hadoop 
fork|https://github.com/szilard-nemeth/hadoop/tree/YARN-10427-investigation]

3. [scripts 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts]:
 This is the directory that contains all my scripts to build Hadoop + launch 
SLS and save produced logs to the local machine.
 As I have been working on a remote cluster, there's a script called 
[setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/setup-vars-upstream.sh]
 that contains some configuration values for the remote cluster + some local 
directories. If you want to use the scripts, all you need to do is to replace 
the configs in this file according to your environment.

3.1 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]:
 This is the script that builds Hadoop according to the environment variables 
and launches the SLS suite on the remote cluster.

3.2 
[start-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/start-sls.sh]:
 This is the most important script as this will be executed on the remote 
machine. 
 I think the script itself is straightforward enough, but let me briefly list 
what it does:
 - This script assumes that the Hadoop dist package is copied to the remote 
machine (this was done by 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh])
 - Cleans up all Hadoop-related directories and extracts the Hadoop dist tar.gz
 - Copies the config to Hadoop's config dirs so SLS will use these particular 
configs
 - Launches SLS by starting slsrun.sh with the appropriate CLI switches
 - Greps for some useful data in the resulting SLS log file.

3.3 
[launch-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/launch-sls.sh]:
 This script is executed by 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]
 as its last step. Once the start-sls.sh is finished, the 
[save-latest-sls-logs.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/save-latest-sls-logs.sh]
 script is started. As the name implies it saves the latest SLS log dir and 
SCPs it to the local machine. The target directory of the local machine is 
determined by the config 
([setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scrip

[jira] [Commented] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253608#comment-17253608
 ] 

Hadoop QA commented on YARN-10427:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m 10s{color} 
| {color:red}{color} | {color:red} YARN-10427 does not apply to trunk. Rebase 
required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for 
help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-10427 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13017522/YARN-10427.002.patch |
| Console output | 
https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/409/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |


This message was automatically generated.



> Duplicate Job IDs in SLS output
> ---
>
> Key: YARN-10427
> URL: https://issues.apache.org/jira/browse/YARN-10427
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0
> Environment: I ran the attached inputs on my MacBook Pro, using 
> Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also 
> tested against 3.2.1 and 3.3.0 release branches.
>  
>Reporter: Drew Merrill
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, 
> YARN-10427.001.patch, YARN-10427.002.patch, fair-scheduler.xml, 
> inputsls.json, jobruntime.csv, jobruntime.csv, mapred-site.xml, 
> sls-runner.xml, yarn-site.xml
>
>
> Hello, I'm hoping someone can help me resolve or understand some issues I've 
> been having with the YARN Scheduler Load Simulator (SLS). I've been 
> experimenting with SLS for several months now at work as we're trying to 
> build a simulation model to characterize our enterprise Hadoop infrastructure 
> for purposes of future capacity planning. In the process of attempting to 
> verify and validate the SLS output, I've encountered a number of issues 
> including runtime exceptions and bad output. The focus of this issue is the 
> bad output. In all my simulation runs, the jobruntime.csv output seems to 
> have one or more of the following problems: no output, duplicate job ids, 
> and/or missing job ids.
>  
> Because of where I work, I'm unable to provide the exact inputs I typically 
> use, but I'm able to reproduce the problem of the duplicate Job IDs using 
> some simplified inputs and configuration files, which I've attached, along 
> with the output I obtained.
>  
> The command I used to run the simulation:
> {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json 
> --output-dir=sls-run-1 --print-simulation 
> --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}}
>  
> Can anyone help me understand what would cause the duplicate Job IDs in the 
> output? Is this a bug in Hadoop or a problem with my inputs? Thanks in 
> advance.
>  
> PS: This is my first issue I've ever opened so please be kind if I've missed 
> something or am not understanding something obvious about the way Hadoop 
> works. I'll gladly follow-up with more info as requested.
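Since jobruntime.csv is where the duplicates show up, a quick way to confirm them is to scan its rows for repeated ids. The following is a minimal, self-contained sketch of such a check (an illustration only, not part of SLS); it assumes the job id is the first comma-separated field of each data row and that an optional header row starts with "JobID", so the real column layout may differ.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Scans a jobruntime.csv-style file and prints every job id that occurs more
// than once. Assumes the job id is the first comma-separated field of each
// data row and that an optional header row starts with "JobID".
public class DuplicateJobIdCheck {
  public static void main(String[] args) throws IOException {
    String file = args.length > 0 ? args[0] : "jobruntime.csv";
    List<String> lines = Files.readAllLines(Paths.get(file));
    Map<String, Integer> counts = new HashMap<>();
    for (String line : lines) {
      if (line.trim().isEmpty() || line.startsWith("JobID")) {
        continue; // skip blank lines and the assumed header row
      }
      String jobId = line.split(",")[0].trim();
      counts.merge(jobId, 1, Integer::sum);
    }
    counts.forEach((id, n) -> {
      if (n > 1) {
        System.out.println(id + " appears " + n + " times");
      }
    });
  }
}
{code}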



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253589#comment-17253589
 ] 

Szilard Nemeth edited comment on YARN-10427 at 12/22/20, 4:26 PM:
--

Hi [~werd.up],
 Thanks for reporting this issue, and congratulations on your first reported 
Hadoop YARN jira.
{quote}In the process of attempting to verify and validate the SLS output, I've 
encountered a number of issues including runtime exceptions and bad output.
{quote}
I read through your observations and spent some time playing around with SLS.

If you encountered other issues, please file separate jiras when you have some 
time.

As the process of running SLS involved some repetitive tasks, like uploading 
configs to the remote machine, launching SLS, and saving the resulting logs, I created 
some scripts in my public [Github repo 
here|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427]

Let me summarize what these scripts are doing: 
 1. [config 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/config]:
 This is the exact same configuration file set that you attached to this jira, 
with the exception of the log4j.properties file, which turns on DEBUG logging 
for SLS.

2. [upstream-patches 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/upstream-patches]:
 This is the directory of the logging patch that helped me see the issues more 
clearly.
 My code changes are also pushed to my [Hadoop 
fork|https://github.com/szilard-nemeth/hadoop/tree/YARN-10427-investigation]

3. [scripts 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts]:
 This is the directory that contains all my scripts to build Hadoop + launch 
SLS and save produced logs to the local machine.
 As I have been working on a remote cluster, there's a script called 
[setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/setup-vars-upstream.sh]
 that contains some configuration values for the remote cluster + some local 
directories. If you want to use the scripts, all you need to do is to replace 
the configs in this file according to your environment.

3.1 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]:
 This is the script that builds Hadoop according to the environment variables 
and launches the SLS suite on the remote cluster.

3.2 
[start-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/start-sls.sh]:
 This is the most important script as this will be executed on the remote 
machine. 
 I think the script itself is straightforward enough, but let me briefly list 
what it does:
 - This script assumes that the Hadoop dist package is copied to the remote 
machine (this was done by 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh])
 - Cleans up all Hadoop-related directories and extracts the Hadoop dist tar.gz
 - Copies the config to Hadoop's config dirs so SLS will use these particular 
configs
 - Launches SLS by starting slsrun.sh with the appropriate CLI switches
 - Greps for some useful data in the resulting SLS log file.

3.3 
[launch-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/launch-sls.sh]:
 This script is executed by 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]
 as its last step. Once the start-sls.sh is finished, the 
[save-latest-sls-logs.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/save-latest-sls-logs.sh]
 script is started. As the name implies it saves the latest SLS log dir and 
SCPs it to the local machine. The target directory of the local machine is 
determined by the config 
([setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427

[jira] [Comment Edited] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253589#comment-17253589
 ] 

Szilard Nemeth edited comment on YARN-10427 at 12/22/20, 4:26 PM:
--

Hi [~werd.up],
 Thanks for reporting this issue, and congratulations on your first reported 
Hadoop YARN jira.
{quote}In the process of attempting to verify and validate the SLS output, I've 
encountered a number of issues including runtime exceptions and bad output.
{quote}
I read through your observations and spent some time playing around with SLS.

If you encountered other issues, please file separate jiras when you have some 
time.

As the process of running SLS involved some repetitive tasks, like uploading 
configs to the remote machine, launching SLS, and saving the resulting logs, I created 
some scripts in my public [Github repo 
here|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427]

Let me summarize what the scripts are doing: 
 1. [config 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/config]:
 This is the exact same configuration file set that you attached to this jira, 
with the exception of the log4j.properties file, which turns on DEBUG logging 
for SLS.

2. [upstream-patches 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/upstream-patches]:
 This is the directory of the logging patch that helped me see the issues more 
clearly.
 My code changes are also pushed to my [Hadoop 
fork|https://github.com/szilard-nemeth/hadoop/tree/YARN-10427-investigation]

3. [scripts 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts]:
 This is the directory that contains all my scripts to build Hadoop + launch 
SLS and save produced logs to the local machine.
 As I have been working on a remote cluster, there's a script called 
[setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/setup-vars-upstream.sh]
 that contains some configuration values for the remote cluster + some local 
directories. If you want to use the scripts, all you need to do is to replace 
the configs in this file according to your environment.

3.1 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]:
 This is the script that builds Hadoop according to the environment variables 
and launches the SLS suite on the remote cluster.

3.2 
[start-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/start-sls.sh]:
 This is the most important script as this will be executed on the remote 
machine. 
 I think the script itself is straightforward enough, but let me briefly list 
what it does:
 - This script assumes that the Hadoop dist package is copied to the remote 
machine (this was done by 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh])
 - Cleans up all Hadoop-related directories and extracts the Hadoop dist tar.gz
 - Copies the config to Hadoop's config dirs so SLS will use these particular 
configs
 - Launches SLS by starting slsrun.sh with the appropriate CLI switches
 - Greps for some useful data in the resulting SLS log file.

3.3 
[launch-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/launch-sls.sh]:
 This script is executed by 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]
 as its last step. Once the start-sls.sh is finished, the 
[save-latest-sls-logs.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/save-latest-sls-logs.sh]
 script is started. As the name implies it saves the latest SLS log dir and 
SCPs it to the local machine. The target directory of the local machine is 
determined by the config 
([setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scrip

[jira] [Updated] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10427:
--
Attachment: YARN-10427.002.patch

> Duplicate Job IDs in SLS output
> ---
>
> Key: YARN-10427
> URL: https://issues.apache.org/jira/browse/YARN-10427
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0
> Environment: I ran the attached inputs on my MacBook Pro, using 
> Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also 
> tested against 3.2.1 and 3.3.0 release branches.
>  
>Reporter: Drew Merrill
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, 
> YARN-10427.001.patch, YARN-10427.002.patch, fair-scheduler.xml, 
> inputsls.json, jobruntime.csv, jobruntime.csv, mapred-site.xml, 
> sls-runner.xml, yarn-site.xml
>
>
> Hello, I'm hoping someone can help me resolve or understand some issues I've 
> been having with the YARN Scheduler Load Simulator (SLS). I've been 
> experimenting with SLS for several months now at work as we're trying to 
> build a simulation model to characterize our enterprise Hadoop infrastructure 
> for purposes of future capacity planning. In the process of attempting to 
> verify and validate the SLS output, I've encountered a number of issues 
> including runtime exceptions and bad output. The focus of this issue is the 
> bad output. In all my simulation runs, the jobruntime.csv output seems to 
> have one or more of the following problems: no output, duplicate job ids, 
> and/or missing job ids.
>  
> Because of where I work, I'm unable to provide the exact inputs I typically 
> use, but I'm able to reproduce the problem of the duplicate Job IDs using 
> some simplified inputs and configuration files, which I've attached, along 
> with the output I obtained.
>  
> The command I used to run the simulation:
> {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json 
> --output-dir=sls-run-1 --print-simulation 
> --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}}
>  
> Can anyone help me understand what would cause the duplicate Job IDs in the 
> output? Is this a bug in Hadoop or a problem with my inputs? Thanks in 
> advance.
>  
> PS: This is my first issue I've ever opened so please be kind if I've missed 
> something or am not understanding something obvious about the way Hadoop 
> works. I'll gladly follow-up with more info as requested.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253604#comment-17253604
 ] 

Szilard Nemeth commented on YARN-10427:
---

Accidentally attached a patch that also contains all the logging.
Adding a second patch with just the fix.

> Duplicate Job IDs in SLS output
> ---
>
> Key: YARN-10427
> URL: https://issues.apache.org/jira/browse/YARN-10427
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0
> Environment: I ran the attached inputs on my MacBook Pro, using 
> Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also 
> tested against 3.2.1 and 3.3.0 release branches.
>  
>Reporter: Drew Merrill
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, 
> YARN-10427.001.patch, fair-scheduler.xml, inputsls.json, jobruntime.csv, 
> jobruntime.csv, mapred-site.xml, sls-runner.xml, yarn-site.xml
>
>
> Hello, I'm hoping someone can help me resolve or understand some issues I've 
> been having with the YARN Scheduler Load Simulator (SLS). I've been 
> experimenting with SLS for several months now at work as we're trying to 
> build a simulation model to characterize our enterprise Hadoop infrastructure 
> for purposes of future capacity planning. In the process of attempting to 
> verify and validate the SLS output, I've encountered a number of issues 
> including runtime exceptions and bad output. The focus of this issue is the 
> bad output. In all my simulation runs, the jobruntime.csv output seems to 
> have one or more of the following problems: no output, duplicate job ids, 
> and/or missing job ids.
>  
> Because of where I work, I'm unable to provide the exact inputs I typically 
> use, but I'm able to reproduce the problem of the duplicate Job IDs using 
> some simplified inputs and configuration files, which I've attached, along 
> with the output I obtained.
>  
> The command I used to run the simulation:
> {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json 
> --output-dir=sls-run-1 --print-simulation 
> --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}}
>  
> Can anyone help me understand what would cause the duplicate Job IDs in the 
> output? Is this a bug in Hadoop or a problem with my inputs? Thanks in 
> advance.
>  
> PS: This is my first issue I've ever opened so please be kind if I've missed 
> something or am not understanding something obvious about the way Hadoop 
> works. I'll gladly follow-up with more info as requested.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253589#comment-17253589
 ] 

Szilard Nemeth edited comment on YARN-10427 at 12/22/20, 3:57 PM:
--

Hi [~werd.up],
 Thanks for reporting this issue, and congratulations on your first reported 
Hadoop YARN jira.
{quote}In the process of attempting to verify and validate the SLS output, I've 
encountered a number of issues including runtime exceptions and bad output.
{quote}
I read through your observations and spent some time playing around with SLS.

If you encountered other issues, please file separate jiras when you have some 
time.

As the process of running SLS involved some repetitive tasks, like uploading 
configs to the remote machine, launching SLS, and saving the resulting logs, I created 
some scripts in my public [Github repo 
here|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427]

Let me summarize what these scripts are doing: 
 1. [config 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/config]:
 This is the exact same configuration file set that you attached to this jira, 
with the exception of the log4j.properties file, which turns on DEBUG logging 
for SLS.

2. [upstream-patches 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/upstream-patches]:
 This is the directory of the logging patch that helped me see the issues more 
clearly.
 My code changes are also pushed to my [Hadoop 
fork|https://github.com/szilard-nemeth/hadoop/tree/YARN-10427-investigation]

3. [scripts 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts]:
 This is the directory that contains all my scripts to build Hadoop + launch 
SLS and save produced logs to the local machine.
 As I have been working on a remote cluster, there's a script called 
[setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/setup-vars-upstream.sh]
 that contains some configuration values for the remote cluster + some local 
directories. If you want to use the scripts, all you need to do is to replace 
the configs in this file according to your environment.

3.1 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]:
 This is the script that builds Hadoop according to the environment variables 
and launches the SLS suite on the remote cluster.

3.2 
[start-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/start-sls.sh]:
 This is the most important script as this will be executed on the remote 
machine. 
 I think the script itself is straightforward enough, but let me briefly list 
what it does:
 - This script assumes that the Hadoop dist package is copied to the remote 
machine (this was done by 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh])
 - Cleans up all Hadoop-related directories and extracts the Hadoop dist tar.gz
 - Copies the config to Hadoop's config dirs so SLS will use these particular 
configs
 - Launches SLS by starting slsrun.sh with the appropriate CLI switches
 - Greps for some useful data in the resulting SLS log file.

3.3 
[launch-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/launch-sls.sh]:
 This script is executed by 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]
 as its last step. Once the start-sls.sh is finished, the 
[save-latest-sls-logs.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/save-latest-sls-logs.sh]
 script is started. As the name implies it saves the latest SLS log dir and 
SCPs it to the local machine. The target directory of the local machine is 
determined by the config 
([setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN

[jira] [Comment Edited] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253589#comment-17253589
 ] 

Szilard Nemeth edited comment on YARN-10427 at 12/22/20, 3:56 PM:
--

Hi [~werd.up],
 Thanks for reporting this issue, and congratulations on your first reported 
Hadoop YARN jira.
{quote}In the process of attempting to verify and validate the SLS output, I've 
encountered a number of issues including runtime exceptions and bad output.
{quote}
I read through your observations and spent some time playing around with SLS.

If you encountered other issues, please file separate jiras when you have some 
time.

As the process of running SLS involved some repetitive tasks, like uploading 
configs to the remote machine, launching SLS, and saving the resulting logs, I created 
some scripts in my public [Github repo 
here|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427]

Let me summarize what these scripts are doing: 
 1. [config 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/config]:
 This is the exact same configuration file set that you attached to this jira, 
with the exception of the log4j.properties file, which turns on DEBUG logging 
for SLS.

2. [upstream-patches 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/upstream-patches]:
 This is the directory of the logging patch that helped me see the issues more 
clearly.
 My code changes are also pushed to my [Hadoop 
fork|https://github.com/szilard-nemeth/hadoop/tree/YARN-10427-investigation]

3. [scripts 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts]:
 This is the directory that contains all my scripts to build Hadoop + launch 
SLS and save produced logs to the local machine.
 As I have been working on a remote cluster, there's a script called 
[setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/setup-vars-upstream.sh]
 that contains some configuration values for the remote cluster + some local 
directories. If you want to use the scripts, all you need to do is to replace 
the configs in this file according to your environment.

3.1 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]:
 This is the script that builds Hadoop according to the environment variables 
and launches the SLS suite on the remote cluster.

3.2 
[start-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/start-sls.sh]:
 This is the most important script as this will be executed on the remote 
machine. 
 I think the script itself is straightforward enough, but let me briefly list 
what it does:
 - This script assumes that the Hadoop dist package is copied to the remote 
machine (this was done by 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh])
 - Cleans up all Hadoop-related directories and extracts the Hadoop dist tar.gz
 - Copies the config to Hadoop's config dirs so SLS will use these particular 
configs
 - Launches SLS by starting slsrun.sh with the appropriate CLI switches
 - Greps for some useful data in the resulting SLS log file.

3.3 
[launch-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/launch-sls.sh]:
 This script is executed by 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]
 as its last step. Once the start-sls.sh is finished, the 
[save-latest-sls-logs.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/save-latest-sls-logs.sh]
 script is started. As the name implies it saves the latest SLS log dir and 
SCPs it to the local machine. The target directory of the local machine is 
determined by the config 
([setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN

[jira] [Updated] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10427:
--
Attachment: YARN-10427.001.patch

> Duplicate Job IDs in SLS output
> ---
>
> Key: YARN-10427
> URL: https://issues.apache.org/jira/browse/YARN-10427
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0
> Environment: I ran the attached inputs on my MacBook Pro, using 
> Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also 
> tested against 3.2.1 and 3.3.0 release branches.
>  
>Reporter: Drew Merrill
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, 
> YARN-10427.001.patch, fair-scheduler.xml, inputsls.json, jobruntime.csv, 
> jobruntime.csv, mapred-site.xml, sls-runner.xml, yarn-site.xml
>
>
> Hello, I'm hoping someone can help me resolve or understand some issues I've 
> been having with the YARN Scheduler Load Simulator (SLS). I've been 
> experimenting with SLS for several months now at work as we're trying to 
> build a simulation model to characterize our enterprise Hadoop infrastructure 
> for purposes of future capacity planning. In the process of attempting to 
> verify and validate the SLS output, I've encountered a number of issues 
> including runtime exceptions and bad output. The focus of this issue is the 
> bad output. In all my simulation runs, the jobruntime.csv output seems to 
> have one or more of the following problems: no output, duplicate job ids, 
> and/or missing job ids.
>  
> Because of where I work, I'm unable to provide the exact inputs I typically 
> use, but I'm able to reproduce the problem of the duplicate Job IDs using 
> some simplified inputs and configuration files, which I've attached, along 
> with the output I obtained.
>  
> The command I used to run the simulation:
> {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json 
> --output-dir=sls-run-1 --print-simulation 
> --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}}
>  
> Can anyone help me understand what would cause the duplicate Job IDs in the 
> output? Is this a bug in Hadoop or a problem with my inputs? Thanks in 
> advance.
>  
> PS: This is my first issue I've ever opened so please be kind if I've missed 
> something or am not understanding something obvious about the way Hadoop 
> works. I'll gladly follow-up with more info as requested.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10427:
--
Attachment: YARN-10427-sls-scriptsandlogs.tar.gz

> Duplicate Job IDs in SLS output
> ---
>
> Key: YARN-10427
> URL: https://issues.apache.org/jira/browse/YARN-10427
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0
> Environment: I ran the attached inputs on my MacBook Pro, using 
> Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also 
> tested against 3.2.1 and 3.3.0 release branches.
>  
>Reporter: Drew Merrill
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10427-sls-scriptsandlogs.tar.gz, 
> YARN-10427.001.patch, fair-scheduler.xml, inputsls.json, jobruntime.csv, 
> jobruntime.csv, mapred-site.xml, sls-runner.xml, yarn-site.xml
>
>
> Hello, I'm hoping someone can help me resolve or understand some issues I've 
> been having with the YARN Scheduler Load Simulator (SLS). I've been 
> experimenting with SLS for several months now at work as we're trying to 
> build a simulation model to characterize our enterprise Hadoop infrastructure 
> for purposes of future capacity planning. In the process of attempting to 
> verify and validate the SLS output, I've encountered a number of issues 
> including runtime exceptions and bad output. The focus of this issue is the 
> bad output. In all my simulation runs, the jobruntime.csv output seems to 
> have one or more of the following problems: no output, duplicate job ids, 
> and/or missing job ids.
>  
> Because of where I work, I'm unable to provide the exact inputs I typically 
> use, but I'm able to reproduce the problem of the duplicate Job IDs using 
> some simplified inputs and configuration files, which I've attached, along 
> with the output I obtained.
>  
> The command I used to run the simulation:
> {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json 
> --output-dir=sls-run-1 --print-simulation 
> --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}}
>  
> Can anyone help me understand what would cause the duplicate Job IDs in the 
> output? Is this a bug in Hadoop or a problem with my inputs? Thanks in 
> advance.
>  
> PS: This is my first issue I've ever opened so please be kind if I've missed 
> something or am not understanding something obvious about the way Hadoop 
> works. I'll gladly follow-up with more info as requested.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253589#comment-17253589
 ] 

Szilard Nemeth commented on YARN-10427:
---

Hi [~werd.up],
 Thanks for reporting this issue, and congratulations on your first reported 
Hadoop YARN jira.
{quote}In the process of attempting to verify and validate the SLS output, I've 
encountered a number of issues including runtime exceptions and bad output.
{quote}
I read through your observations and spent some time playing around with SLS.

If you encountered other issues, please file separate jiras when you have some 
time.

As the process of running SLS involved some repetitive tasks, like uploading 
configs to the remote machine, launching SLS, and saving the resulting logs, I created 
some scripts in my public Github repo here: 
[https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427]

Let me summarize what these scripts are doing: 
 1. [config 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/config]:
 This is the exact same configuration file set that you attached to this jira, 
with the exception of the log4j.properties file, which turns on DEBUG logging 
for SLS.

2. [upstream-patches 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/upstream-patches]:
 This is the directory of the logging patch that helped me see the issues more 
clearly.
 My code changes are also pushed to my Hadoop fork: 
[https://github.com/szilard-nemeth/hadoop/tree/YARN-10427-investigation]

3. [scripts 
dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts]:
 This is the directory that contains all my scripts to build Hadoop + launch 
SLS and save produced logs to the local machine.
 As I have been working on a remote cluster, there's a script called 
[setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/setup-vars-upstream.sh]
 that contains some configuration values for the remote cluster + some local 
directories. If you want to use the scripts, all you need to do is to replace 
the configs in this file according to your environment.

3.1 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]:
 This is the script that builds Hadoop according to the environment variables 
and launches the SLS suite on the remote cluster.

3.2 
[start-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/start-sls.sh]:
 This is the most important script as this will be executed on the remote 
machine. 
 I think the script itself is straightforward enough, but let me briefly list 
what it does:
 - This script assumes that the Hadoop dist package is copied to the remote 
machine (this was done by 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh])
 - Cleans up all Hadoop-related directories and extracts the Hadoop dist tar.gz
 - Copies the config to Hadoop's config dirs so SLS will use these particular 
configs
 - Launches SLS by starting slsrun.sh with the appropriate CLI switches
 - Greps for some useful data in the resulting SLS log file.

3.3 
[launch-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/launch-sls.sh]:
 This script is executed by 
[build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]
 as its last step. Once the start-sls.sh is finished, the 
[save-latest-sls-logs.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/save-latest-sls-logs.sh]
 script is started. As the name implies it saves the latest SLS log dir and 
SCPs it to the local machine. The target directory of the local machine is 
determined by the config 
([setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/setup-vars-upstream.sh]).

*The late

[jira] [Assigned] (YARN-10427) Duplicate Job IDs in SLS output

2020-12-22 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reassigned YARN-10427:
-

Assignee: Szilard Nemeth

> Duplicate Job IDs in SLS output
> ---
>
> Key: YARN-10427
> URL: https://issues.apache.org/jira/browse/YARN-10427
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0
> Environment: I ran the attached inputs on my MacBook Pro, using 
> Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also 
> tested against 3.2.1 and 3.3.0 release branches.
>  
>Reporter: Drew Merrill
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: fair-scheduler.xml, inputsls.json, jobruntime.csv, 
> jobruntime.csv, mapred-site.xml, sls-runner.xml, yarn-site.xml
>
>
> Hello, I'm hoping someone can help me resolve or understand some issues I've 
> been having with the YARN Scheduler Load Simulator (SLS). I've been 
> experimenting with SLS for several months now at work as we're trying to 
> build a simulation model to characterize our enterprise Hadoop infrastructure 
> for purposes of future capacity planning. In the process of attempting to 
> verify and validate the SLS output, I've encountered a number of issues 
> including runtime exceptions and bad output. The focus of this issue is the 
> bad output. In all my simulation runs, the jobruntime.csv output seems to 
> have one or more of the following problems: no output, duplicate job ids, 
> and/or missing job ids.
>  
> Because of where I work, I'm unable to provide the exact inputs I typically 
> use, but I'm able to reproduce the problem of the duplicate Job IDs using 
> some simplified inputs and configuration files, which I've attached, along 
> with the output I obtained.
>  
> The command I used to run the simulation:
> {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json 
> --output-dir=sls-run-1 --print-simulation 
> --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}}
>  
> Can anyone help me understand what would cause the duplicate Job IDs in the 
> output? Is this a bug in Hadoop or a problem with my inputs? Thanks in 
> advance.
>  
> PS: This is my first issue I've ever opened so please be kind if I've missed 
> something or am not understanding something obvious about the way Hadoop 
> works. I'll gladly follow-up with more info as requested.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-22 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253560#comment-17253560
 ] 

Jim Brennan commented on YARN-10540:


Thanks [~ebadger]!  And thanks [~hexiaoqiao],  [~ayushtkn] and [~sunilg] for 
finding and investigating this bug.  Much appreciated.

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Assignee: Jim Brennan
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, 
> Yarn-UI-Ubuntu.png, yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes to the NodeInfo class.
> Various exceptions are thrown while accessing the UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}
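Both traces fail in the NodeInfo constructor, which suggests it dereferences a per-node value that can legitimately be absent for some nodes. Below is a minimal, self-contained sketch of the defensive pattern that avoids this kind of NPE; the class and field names are illustrative only (assuming the missing value is something like a per-node utilization report), and this is not the real NodeInfo/RMNode API changed by YARN-10450.

{code:java}
// Illustrative only: shows the pattern of guarding a possibly-missing per-node
// report instead of dereferencing it in the constructor. This is NOT the real
// org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.
public class NodeInfoSketch {

  /** Stand-in for a report object that a node may not have sent yet. */
  public static final class Utilization {
    final int physicalMemoryMB;
    final float cpuUsagePercent;

    Utilization(int physicalMemoryMB, float cpuUsagePercent) {
      this.physicalMemoryMB = physicalMemoryMB;
      this.cpuUsagePercent = cpuUsagePercent;
    }
  }

  private final String nodeId;
  private final Utilization utilization; // never null after construction

  public NodeInfoSketch(String nodeId, Utilization reported) {
    this.nodeId = nodeId;
    // Guard instead of dereferencing: a node that has not reported yet yields
    // null here, which is the kind of condition that breaks the nodes page.
    this.utilization = (reported != null) ? reported : new Utilization(0, 0f);
  }

  public int getUsedMemoryMB() {
    return utilization.physicalMemoryMB;
  }

  public static void main(String[] args) {
    // A node without a report no longer throws when rendered.
    NodeInfoSketch info = new NodeInfoSketch("server001:45454", null);
    System.out.println(info.nodeId + ": " + info.getUsedMemoryMB()); // prints 0
  }
}
{code}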



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2020-12-22 Thread caozhiqiang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caozhiqiang updated YARN-10501:
---
Comment: was deleted

(was: @[~bteke],  [~snemeth],  [~epayne], [~hexiaoqiao], [~wilfreds], 
[~jianhe], [~ebadger]

could you help me review this issue or recommend another suitable committer 
to review it? Thank you.)

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Attachments: YARN-10501.002.patch, YARN-10501.003.patch
>
>
> When adding a label to nodes without a nodemanager port, or using the WILDCARD_PORT (0) 
> port, not all of the label info can be removed from these nodes.
> Reproduction steps:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
>  {code}
> You can see that after step 4, which removes the nodemanager labels, the label info 
> is still present in the node info.
> {code:java}
>  641 case REPLACE:
>  642 replaceNodeForLabels(nodeId, host.labels, labels);
>  643 replaceLabelsForNode(nodeId, host.labels, labels);
>  644 host.labels.clear();
>  645 host.labels.addAll(labels);
>  646 for (Node node : host.nms.values()) {
>  647 replaceNodeForLabels(node.nodeId, node.labels, labels);
>  649 node.labels = null;
>  650 }
>  651 break;{code}
> The cause is at line 647: when labels are added to a node without a port, both the 0 port 
> and the real NM port are added to the node info, and when labels are removed, 
> the node.labels parameter at line 647 is null, so the old 
> label is not removed. 
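To make the described cause easier to follow, here is a minimal, self-contained sketch that mimics the quoted REPLACE handling; it is not the real YARN node-labels manager code, and the fall-back to the host-level labels at the end is only an assumed fix, not the actual patch.

{code:java}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: mimics the quoted REPLACE handling, where a per-node
// label set (node.labels) that was never populated stays null, so the replace
// call removes nothing. The host-label fall-back at the end is an assumption.
public class ReplaceLabelsSketch {

  /** Simplified replace: drop oldLabels from the index, record newLabels. */
  static void replaceNodeForLabels(String nodeId, Set<String> oldLabels,
      Set<String> newLabels, Set<String> labelIndex) {
    if (oldLabels != null) {
      labelIndex.removeAll(oldLabels); // with null old labels nothing is removed
    }
    labelIndex.addAll(newLabels);
  }

  public static void main(String[] args) {
    Set<String> hostLabels = new HashSet<>(Collections.singleton("cpunode"));
    Set<String> nodeLabels = null;      // never set for server001:45454 directly
    Set<String> labelIndex = new HashSet<>(hostLabels);

    // Buggy call from the report: old labels are null, so "cpunode" survives.
    replaceNodeForLabels("server001:45454", nodeLabels, new HashSet<>(), labelIndex);
    System.out.println(labelIndex);     // [cpunode]

    // Assumed fix: use the host's labels when the per-node set is null.
    Set<String> effectiveOld = (nodeLabels != null) ? nodeLabels : hostLabels;
    replaceNodeForLabels("server001:45454", effectiveOld, new HashSet<>(), labelIndex);
    System.out.println(labelIndex);     // []
  }
}
{code}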



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2020-12-22 Thread caozhiqiang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253330#comment-17253330
 ] 

caozhiqiang commented on YARN-10501:


Could anyone review this issue?

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Attachments: YARN-10501.002.patch, YARN-10501.003.patch
>
>
> When adding a label to nodes without a nodemanager port, or using the WILDCARD_PORT (0) 
> port, not all of the label info can be removed from these nodes.
> Reproduction steps:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
>  {code}
> You can see that after step 4, which removes the nodemanager labels, the label info 
> is still present in the node info.
> {code:java}
>  641 case REPLACE:
>  642 replaceNodeForLabels(nodeId, host.labels, labels);
>  643 replaceLabelsForNode(nodeId, host.labels, labels);
>  644 host.labels.clear();
>  645 host.labels.addAll(labels);
>  646 for (Node node : host.nms.values()) {
>  647 replaceNodeForLabels(node.nodeId, node.labels, labels);
>  649 node.labels = null;
>  650 }
>  651 break;{code}
> The cause is at line 647: when labels are added to a node without a port, both the 0 port 
> and the real NM port are added to the node info, and when labels are removed, 
> the node.labels parameter at line 647 is null, so the old 
> label is not removed. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org