[jira] [Commented] (YARN-10442) RM should make sure node label file highly available

2020-10-14 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214316#comment-17214316
 ] 

Hadoop QA commented on YARN-10442:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
13s{color} |  | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} |  | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch does not contain any @author tags. 
{color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} |  | {color:green} The patch appears to include 1 new or modified 
test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
22s{color} |  | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
11s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 
23s{color} |  | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m 
22s{color} |  | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
48s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  3m 
21s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
23m 49s{color} |  | {color:green} branch has no errors when building and 
testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
38s{color} |  | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
35s{color} |  | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m 
12s{color} |  | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  6m 
57s{color} |  | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
32s{color} |  | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
47s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 11m  
8s{color} |  | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 11m  
8s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m  
0s{color} |  | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 10m  
0s{color} |  | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} blanks {color} | {color:red}  0m  
0s{color} | 
[/blanks-eol.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/235/artifact/out/blanks-eol.txt]
 | {color:red} The patch has 2 line(s) that end in blanks. Use git apply 
--blanks=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 45s{color} | 
[/diff-checkstyle-hadoop-yarn-project_hadoop-yarn.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/235/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn.txt]
 | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 3 new + 
210 unchanged - 0 fixed = 213 total (was 210) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  3m 
12s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
25m 10s{color} |  | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{colo

[jira] [Updated] (YARN-10442) RM should make sure node label file highly available

2020-10-14 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated YARN-10442:
--
Attachment: YARN-10442.002.patch

> RM should make sure node label file highly available
> 
>
> Key: YARN-10442
> URL: https://issues.apache.org/jira/browse/YARN-10442
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: YARN-10442.001.patch, YARN-10442.002.patch
>
>
> On one of my clusters, the RM failed to transition to Active because the node 
> label file's blocks are missing. I think the RM should make sure important files 
> are highly available. 
> {noformat}
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Could not 
> obtain block: BP-2121803626-10.0.0.22-1597301807397:blk_1073832522_91774 
> file=/yarn/node-labels/nodelabel.mirror
>   at 
> com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:238)
>   at 
> com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
>   at 
> com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
>   at 
> com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
>   at 
> org.apache.hadoop.yarn.proto.YarnServerResourceManagerServiceProtos$AddToClusterNodeLabelsRequestProto.parseDelimitedFrom(YarnServerResourceManagerServiceProtos.java:7493)
>   at 
> org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.loadFromMirror(FileSystemNodeLabelsStore.java:168)
>   at 
> org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:205)
>   at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:254)
>   at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:268)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194){noformat}
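
As an illustration only (this is not the attached YARN-10442 patch), one low-tech way to make the mirror harder to lose is to raise the HDFS replication factor of the file the RM reads during recovery. The path below is taken from the stack trace above; the replication factor of 10 is an arbitrary assumption.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NodeLabelMirrorReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Mirror file path from the stack trace; in practice it lives under
    // yarn.node-labels.fs-store.root-dir.
    Path mirror = new Path("/yarn/node-labels/nodelabel.mirror");
    FileSystem fs = mirror.getFileSystem(conf);
    if (fs.exists(mirror)) {
      // Ask HDFS to keep extra replicas of this small but critical file.
      fs.setReplication(mirror, (short) 10);
    }
  }
}
{code}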



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries

2020-10-14 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214128#comment-17214128
 ] 

Hadoop QA commented on YARN-10421:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
14s{color} |  | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} |  | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch does not contain any @author tags. 
{color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} |  | {color:green} The patch appears to include 4 new or modified 
test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
58s{color} |  | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
32s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
27s{color} |  | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
44s{color} |  | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 0s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
16s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 47s{color} |  | {color:green} branch has no errors when building and 
testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
59s{color} |  | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
54s{color} |  | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  0m 
40s{color} |  | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
21s{color} |  | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
22s{color} |  | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
11s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
20s{color} |  | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
20s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
40s{color} |  | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
40s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} blanks {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch has no blanks issues. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 54s{color} | 
[/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/234/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server.txt]
 | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server: The patch 
generated 1 new + 16 unchanged - 0 fixed = 17 total (was 16) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
8s{color} |  | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} pylint {color} | {color:orange}  0m  
2s{color} | 
[/diff-pylint.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/234/artifact/out/diff-pylint.txt]
 | {color:orange} The patch generated 18 new + 0 unchanged - 0 fixed = 18 total 
(was 0) {color} |
| {color:green}+1{color} | {color:green} 

[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2

2020-10-14 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214050#comment-17214050
 ] 

Eric Badger commented on YARN-10244:


I'm pretty confused by all of the JIRAs on this. In the future, I think we 
should revert a change under the JIRA where it was originally committed. Let me 
summarize what I think happened, and you all can let me know if I have it right.

YARN-4946 was committed to 3.2, so it is in 3.2, 3.3, and trunk.
YARN-9848 reverted YARN-4946 from 3.3, so YARN-4946 only remains in 3.2.
YARN-10244 reverted YARN-4946 from 3.2, so YARN-4946 has been completely 
reverted.

It's really confusing to me because YARN-4946 has the Fix Version set as 3.2. 
And then this JIRA says it is backporting YARN-9848, instead of saying it's 
reverting YARN-4946. Anyway, like I said above, if we're going to revert stuff, 
I think it is better to do it on the JIRA where it was committed so that we 
have a clear linear log of where it was committed to and reverted from. We can 
also then look at the Fix Version for that particular JIRA and know where it is 
actually committed

> backport YARN-9848 to branch-3.2
> 
>
> Key: YARN-10244
> URL: https://issues.apache.org/jira/browse/YARN-10244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-10244-branch-3.2.001.patch, 
> YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch
>
>
> Backporting YARN-9848 to branch-3.2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10461) Update the ResourceManagerRest.md with the introduced endpoints

2020-10-14 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke reassigned YARN-10461:


Assignee: Benjamin Teke

> Update the ResourceManagerRest.md with the introduced endpoints
> ---
>
> Key: YARN-10461
> URL: https://issues.apache.org/jira/browse/YARN-10461
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
>
> The new APIs introduced in YARN-10421 should be added to the file 
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10461) Update the ResourceManagerRest.md with the introduced endpoints

2020-10-14 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-10461:


 Summary: Update the ResourceManagerRest.md with the introduced 
endpoints
 Key: YARN-10461
 URL: https://issues.apache.org/jira/browse/YARN-10461
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


The new APIs introduced in YARN-10421 should be added to the file 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10433) Extend the DiagnosticService to initiate the diagnostic bundle collection

2020-10-14 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke resolved YARN-10433.
--
Resolution: Abandoned

Migrated to YARN-10421

 

> Extend the DiagnosticService to initiate the diagnostic bundle collection
> -
>
> Key: YARN-10433
> URL: https://issues.apache.org/jira/browse/YARN-10433
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
>
> YARN-10421 introduces the new DiagnosticService class, the two new endpoints 
> for listing the available actions and for starting the diagnostic script's 
> collect method, and a basic diagnostic collector script. After the script's 
> form is finalized (YARN-10422), the DiagnosticService should be extended to 
> spawn the requested collection method based on the input parameters and return 
> the collected bundle as a response.
> To ease the load on the RM, the servlet should allow only one HTTP request at 
> a time. If a new request comes in while another is being served, an appropriate 
> response code should be returned with the message "Diagnostics Collection in 
> Progress". 
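
A minimal sketch of the single-request guard described above, using assumed names (it is not the DiagnosticService API from any patch): one shared permit, and concurrent callers get an HTTP 409 with the quoted message. Whether 409 or another status is the "appropriate" code is an assumption here.
{code:java}
import java.util.concurrent.Semaphore;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.core.Response;

@Path("/common-issue")
public class DiagnosticCollectGuardSketch {
  // Single permit: only one diagnostics collection may run at a time.
  private static final Semaphore PERMIT = new Semaphore(1);

  @GET
  @Path("/collect")
  public Response collect() {
    if (!PERMIT.tryAcquire()) {
      // A collection is already running; reject instead of queueing work on the RM.
      return Response.status(Response.Status.CONFLICT)
          .entity("Diagnostics Collection in Progress").build();
    }
    try {
      return Response.ok(runCollectorAndBundle()).build();
    } finally {
      PERMIT.release();
    }
  }

  // Hypothetical helper standing in for forking the collector script (YARN-10422).
  private byte[] runCollectorAndBundle() {
    return new byte[0];
  }
}
{code}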



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries

2020-10-14 Thread Benjamin Teke (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214005#comment-17214005
 ] 

Benjamin Teke commented on YARN-10421:
--

Hi [~snemeth],

 

Thanks for the insights!

1-2-7-9-11-12-15: corrected them.

3. Other similar methods of this class use the same message, so I didn't want to 
introduce something different.

4. I added some TODO comments. This class is used for testing purposes, and the 
REST-testing-related tasks are part of YARN-10434.

5. Yes, I tried to follow the convention.

6. Thanks, I would have missed this document. As the overall implementation is 
subject to change, I'll create a new Jira for the documentation updates.

8. This item is to be changed in the tests, and in YARN-10432 it will be 
replaced by a configuration entry. Should I uppercase it regardless?

10. It is currently a placeholder; in YARN-10422 I'll bundle the script, and 
after that I'll update the location.

13-14. I refactored the tests.

> Create YarnDiagnosticsService to serve diagnostic queries 
> --
>
> Key: YARN-10421
> URL: https://issues.apache.org/jira/browse/YARN-10421
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10421.001.patch, YARN-10421.002.patch, 
> YARN-10421.003.patch, YARN-10421.004.patch
>
>
> YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet 
> forks a separate process, which executes a shell/Python/etc script. Based on 
> the use-cases listed below the script collects information, bundles it and 
> sends it to UI2. The diagnostic options are the following:
>  # Application hanging: 
>  ** Application logs
>  ** Find the hanging container and get multiple Jstacks
>  ** ResourceManager logs during job lifecycle
>  ** NodeManager logs from NodeManager where the hanging containers of the 
> jobs ran
>  ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
> History URL
>  # Application failed: 
>  ** Application logs
>  ** ResourceManager logs during job lifecycle.
>  ** NodeManager logs from NodeManager where the hanging containers of the 
> jobs ran
>  ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
> History URL.
>  ** Job related metrics like container, attempts.
>  # Scheduler related issue:
>  ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes.
>  ** Multiple Jstacks of ResourceManager
>  ** YARN and Scheduler Configuration
>  ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API 
> _/ws/v1/cluster/nodes response_
>  ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response 
> (YARN-10319)
>  # ResourceManager / NodeManager daemon fails to start:
>  ** ResourceManager and NodeManager out and log file
>  ** YARN and Scheduler Configuration
> Two new endpoints should be added to the RM web service: one for listing the 
> available diagnostic options (_/common-issue/list_), and one for calling a 
> selected option with the user provided parameters (_/common-issue/collect_). 
> The service should be transparent to the script changes to help with the 
> (on-the-fly) extensibility of the diagnostic tool. To split the changes to 
> smaller chunks the implementation behind _collect_ endpoint is to be provided 
> in YARN-10433.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries

2020-10-14 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke updated YARN-10421:
-
Description: 
YarnDiagnosticsServlet should run inside the ResourceManager daemon. The servlet 
forks a separate process, which executes a shell/Python/etc. script. Based on 
the use-cases listed below, the script collects information, bundles it, and 
sends it to UI2. The diagnostic options are the following:
 # Application hanging: 
 ** Application logs
 ** Find the hanging container and get multiple Jstacks
 ** ResourceManager logs during job lifecycle
 ** NodeManager logs from NodeManager where the hanging containers of the jobs 
ran
 ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
History URL
 # Application failed: 
 ** Application logs
 ** ResourceManager logs during job lifecycle.
 ** NodeManager logs from NodeManager where the hanging containers of the jobs 
ran
 ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
History URL.
 ** Job-related metrics like containers and attempts.
 # Scheduler related issue:
 ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes.
 ** Multiple Jstacks of ResourceManager
 ** YARN and Scheduler Configuration
 ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API 
_/ws/v1/cluster/nodes_ response
 ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response 
(YARN-10319)
 # ResourceManager / NodeManager daemon fails to start:
 ** ResourceManager and NodeManager out and log file
 ** YARN and Scheduler Configuration

Two new endpoints should be added to the RM web service: one for listing the 
available diagnostic options (_/common-issue/list_), and one for calling a 
selected option with the user-provided parameters (_/common-issue/collect_).
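
For illustration, a minimal sketch of the flow described above, with assumed names and option identifiers (it is not the attached patch): _/common-issue/list_ returns the available options, and _/common-issue/collect_ forks the collector script in a separate process. The script path is a placeholder until YARN-10422 bundles the real script.
{code:java}
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/common-issue")
public class YarnDiagnosticsResourceSketch {

  // Hypothetical option names mirroring the four use-cases listed above.
  private static final List<String> OPTIONS = Arrays.asList(
      "APP_HANGING", "APP_FAILED", "SCHEDULER_ISSUE", "DAEMON_START_FAILURE");

  @GET
  @Path("/list")
  public Response listOptions() {
    return Response.ok(OPTIONS).type(MediaType.APPLICATION_JSON).build();
  }

  @GET
  @Path("/collect")
  public Response collect(@QueryParam("option") String option,
                          @QueryParam("appId") String appId) throws IOException {
    if (!OPTIONS.contains(option)) {
      return Response.status(Response.Status.BAD_REQUEST).build();
    }
    // Fork the collector in a separate process, as the description says; the
    // servlet stays agnostic of what the script actually gathers.
    Process p = new ProcessBuilder("/path/to/diagnostic_collector.py", option,
        String.valueOf(appId)).redirectErrorStream(true).start();
    // Streaming the process output back as the bundle is a simplification.
    return Response.ok(p.getInputStream()).build();
  }
}
{code}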

  was:
YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet 
forks a separate process, which executes a shell/Python/etc script. Based on 
the use-cases listed below the script collects information, bundles it and 
sends it to UI2. The diagnostic options are the following:
 # Application hanging: 
 ** Application logs
 ** Find the hanging container and get multiple Jstacks
 ** ResourceManager logs during job lifecycle
 ** NodeManager logs from NodeManager where the hanging containers of the jobs 
ran
 ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
History URL
 # Application failed: 
 ** Application logs
 ** ResourceManager logs during job lifecycle.
 ** NodeManager logs from NodeManager where the hanging containers of the jobs 
ran
 ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
History URL.
 ** Job related metrics like container, attempts.
 # Scheduler related issue:
 ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes.
 ** Multiple Jstacks of ResourceManager
 ** YARN and Scheduler Configuration
 ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API 
_/ws/v1/cluster/nodes response_
 ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response 
(YARN-10319)
 # ResourceManager / NodeManager daemon fails to start:
 ** ResourceManager and NodeManager out and log file
 ** YARN and Scheduler Configuration

Two new endpoints should be added to the RM web service: one for listing the 
available diagnostic options (_/common-issue/list_), and one for calling a 
selected option with the user provided parameters (_/common-issue/collect_). 
The service should be transparent to the script changes to help with the 
(on-the-fly) extensibility of the diagnostic tool. To split the changes to 
smaller chunks the implementation behind _collect_ endpoint is to be provided 
in YARN-10433.


> Create YarnDiagnosticsService to serve diagnostic queries 
> --
>
> Key: YARN-10421
> URL: https://issues.apache.org/jira/browse/YARN-10421
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10421.001.patch, YARN-10421.002.patch, 
> YARN-10421.003.patch, YARN-10421.004.patch
>
>
> YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet 
> forks a separate process, which executes a shell/Python/etc script. Based on 
> the use-cases listed below the script collects information, bundles it and 
> sends it to UI2. The diagnostic options are the following:
>  # Application hanging: 
>  ** Application logs
>  ** Find the hanging container and get multiple Jstacks
>  ** ResourceManager logs during job lifecycle
>  ** NodeManager logs from NodeManager where the hanging containers of the 
> jobs ran
>  ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
> History URL
>  # Applicatio

[jira] [Updated] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries

2020-10-14 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke updated YARN-10421:
-
Attachment: YARN-10421.004.patch

> Create YarnDiagnosticsService to serve diagnostic queries 
> --
>
> Key: YARN-10421
> URL: https://issues.apache.org/jira/browse/YARN-10421
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10421.001.patch, YARN-10421.002.patch, 
> YARN-10421.003.patch, YARN-10421.004.patch
>
>
> YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet 
> forks a separate process, which executes a shell/Python/etc script. Based on 
> the use-cases listed below the script collects information, bundles it and 
> sends it to UI2. The diagnostic options are the following:
>  # Application hanging: 
>  ** Application logs
>  ** Find the hanging container and get multiple Jstacks
>  ** ResourceManager logs during job lifecycle
>  ** NodeManager logs from NodeManager where the hanging containers of the 
> jobs ran
>  ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
> History URL
>  # Application failed: 
>  ** Application logs
>  ** ResourceManager logs during job lifecycle.
>  ** NodeManager logs from NodeManager where the hanging containers of the 
> jobs ran
>  ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
> History URL.
>  ** Job related metrics like container, attempts.
>  # Scheduler related issue:
>  ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes.
>  ** Multiple Jstacks of ResourceManager
>  ** YARN and Scheduler Configuration
>  ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API 
> _/ws/v1/cluster/nodes response_
>  ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response 
> (YARN-10319)
>  # ResourceManager / NodeManager daemon fails to start:
>  ** ResourceManager and NodeManager out and log file
>  ** YARN and Scheduler Configuration
> Two new endpoints should be added to the RM web service: one for listing the 
> available diagnostic options (_/common-issue/list_), and one for calling a 
> selected option with the user provided parameters (_/common-issue/collect_). 
> The service should be transparent to the script changes to help with the 
> (on-the-fly) extensibility of the diagnostic tool. To split the changes to 
> smaller chunks the implementation behind _collect_ endpoint is to be provided 
> in YARN-10433.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-14 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10460:

Attachment: YARN-10460-POC.patch

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10460-POC.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the 
> client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} are stored as 
> long as they're needed. But since the backing thr

[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-14 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10460:

Attachment: YARN-10460-POC.patch

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the 
> client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} are stored as 
> long as they're needed. But since the backing thread group is destroyed in 
> the previous test

[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-14 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10460:

Attachment: (was: YARN-10460-POC.patch)

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the 
> client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} are stored as 
> long as they're needed. But since the backing thread group is destroyed in 
> the pr

[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-14 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213830#comment-17213830
 ] 

Peter Bacsko commented on YARN-10460:
-

cc [~Jim_Brennan] [~ebadger] [~weichiu] [~aajisaka] what do you guys think?

Sooner or later we'll bump the JUnit version and this will happen.

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the 
> client ob

[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-14 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10460:

Description: 
In our downstream build environment, we're using JUnit 4.13. Recently, we 
discovered a truly weird test failure in TestNodeStatusUpdater.

The problem is that timeout handling has changed in JUnit 4.13. See the 
difference between these two snippets:

4.12
{noformat}
@Override
public void evaluate() throws Throwable {
CallableStatement callable = new CallableStatement();
FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
threadGroup = new ThreadGroup("FailOnTimeoutGroup");
Thread thread = new Thread(threadGroup, task, "Time-limited test");
thread.setDaemon(true);
thread.start();
callable.awaitStarted();
Throwable throwable = getResult(task, thread);
if (throwable != null) {
throw throwable;
}
}
{noformat}
 
 4.13
{noformat}
@Override
public void evaluate() throws Throwable {
CallableStatement callable = new CallableStatement();
FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
Thread thread = new Thread(threadGroup, task, "Time-limited test");
try {
thread.setDaemon(true);
thread.start();
callable.awaitStarted();
Throwable throwable = getResult(task, thread);
if (throwable != null) {
throw throwable;
}
} finally {
try {
thread.join(1);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
try {
threadGroup.destroy();  <--- This
} catch (IllegalThreadStateException e) {
// If a thread from the group is still alive, the ThreadGroup 
cannot be destroyed.
// Swallow the exception to keep the same behavior prior to 
this change.
}
}
}
{noformat}
The change comes from [https://github.com/junit-team/junit4/pull/1517].

Unfortunately, destroying the thread group causes an issue because there are 
all sorts of object caching in the IPC layer. The exception is:
{noformat}
java.lang.IllegalThreadStateException
at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
at java.lang.Thread.init(Thread.java:402)
at java.lang.Thread.init(Thread.java:349)
at java.lang.Thread.<init>(Thread.java:675)
at 
java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
at 
com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
at 
java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
at 
org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
at org.apache.hadoop.ipc.Client.call(Client.java:1458)
at org.apache.hadoop.ipc.Client.call(Client.java:1405)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
at 
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
at 
org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
{noformat}
Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the client 
object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} are stored as long as 
they're needed. But since the backing thread group is destroyed in the previous 
test, it's no longer possible to create new threads.

A quick workaround is to stop the clients and completely clear the 
{{ClientCache}} in {{ProtobufRpcEngine}} before each testcase. I tried this and 
it solves the problem but it feels hacky. Not sure if there is a better 
approach.
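
A standalone sketch of the underlying mechanism (an illustration written for this digest, not code from Hadoop or JUnit): a ThreadFactory created on a thread inside "FailOnTimeoutGroup" keeps that group as the parent for every new worker, so once JUnit 4.13 destroys the group, later thread creation fails in ThreadGroup.addUnstarted with IllegalThreadStateException, which is the same frame seen in the stack trace above. Intended for JDK 8/11, where ThreadGroup.destroy() is still functional.
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

public class ThreadGroupDestroyDemo {
  private static volatile ThreadFactory cachedFactory;

  public static void main(String[] args) throws Exception {
    ThreadGroup group = new ThreadGroup("FailOnTimeoutGroup");
    Thread timeLimited = new Thread(group, () -> {
      // Stand-in for the IPC layer's cached executor: the factory remembers the
      // creating thread's group, just like Executors$DefaultThreadFactory does.
      cachedFactory = Executors.defaultThreadFactory();
    }, "Time-limited test");
    timeLimited.start();
    timeLimited.join();

    group.destroy(); // what the JUnit 4.13 finally block does after the test

    try {
      cachedFactory.newThread(() -> { }); // tries to add a thread to the dead group
    } catch (IllegalThreadStateException e) {
      System.out.println("New threads can no longer be created: " + e);
    }
  }
}
{code}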

  was:
In our downstream build environment, we're using JUnit 4.13. Recently, we 
discovered a truly weird test failure in TestNodeStatusUpdater.

The problem is that timeout handling has changed in Junit 4.13. See the 
difference between these two snippets:

4.12
{noformat}
@Override
public void evaluate() throws Throwable {
Ca

[jira] [Created] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-14 Thread Peter Bacsko (Jira)
Peter Bacsko created YARN-10460:
---

 Summary: Upgrading to JUnit 4.13 causes tests in 
TestNodeStatusUpdater to fail
 Key: YARN-10460
 URL: https://issues.apache.org/jira/browse/YARN-10460
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, test
Reporter: Peter Bacsko
Assignee: Peter Bacsko


In our downstream build environment, we're using JUnit 4.13. Recently, we 
discovered a truly weird test failure in TestNodeStatusUpdater.

The problem is that timeout handling has changed in JUnit 4.13. See the 
difference between these two snippets:

4.12
{noformat}
@Override
public void evaluate() throws Throwable {
CallableStatement callable = new CallableStatement();
FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
threadGroup = new ThreadGroup("FailOnTimeoutGroup");
Thread thread = new Thread(threadGroup, task, "Time-limited test");
thread.setDaemon(true);
thread.start();
callable.awaitStarted();
Throwable throwable = getResult(task, thread);
if (throwable != null) {
throw throwable;
}
}
{noformat}
 
 4.13
{noformat}
@Override
public void evaluate() throws Throwable {
CallableStatement callable = new CallableStatement();
FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
Thread thread = new Thread(threadGroup, task, "Time-limited test");
try {
thread.setDaemon(true);
thread.start();
callable.awaitStarted();
Throwable throwable = getResult(task, thread);
if (throwable != null) {
throw throwable;
}
} finally {
try {
thread.join(1);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
try {
threadGroup.destroy();  <--- This
} catch (IllegalThreadStateException e) {
// If a thread from the group is still alive, the ThreadGroup 
cannot be destroyed.
// Swallow the exception to keep the same behavior prior to 
this change.
}
}
}
{noformat}
The change comes from [https://github.com/junit-team/junit4/pull/1517].

Unfortunately, destroying the thread group causes an issue because there are 
all sorts of object caching in the IPC layer. The exception is:
{noformat}
java.lang.IllegalThreadStateException
at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
at java.lang.Thread.init(Thread.java:402)
at java.lang.Thread.init(Thread.java:349)
at java.lang.Thread.<init>(Thread.java:675)
at 
java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
at 
com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
at 
java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
at 
org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
at org.apache.hadoop.ipc.Client.call(Client.java:1458)
at org.apache.hadoop.ipc.Client.call(Client.java:1405)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
at 
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
at 
org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
{noformat}
Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the client 
object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} are stored as long as 
they're needed. But since the backing thread group is destroyed in the previous 
test, it's no longer possible to create new threads.

A quick workaround is to stop the clients and completely clear the 
{{ClientCache}} in {{ProtobufRpcEngine}} before each testcase. I tried this and 
it solves the problem but it feels hacky. Not sure if there is a better 
approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---

[jira] [Updated] (YARN-10459) containerLaunchedOnNode method does not need to hold the SchedulerApplicationAttempt lock

2020-10-14 Thread Ryan Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Wu updated YARN-10459:
---
Description: 
 

Now, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt 
write lock, but looking at the method, it does not change any field. More 
seriously, holding this lock can affect the scheduler.
{code:java}
public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
  // Inform the container
  writeLock.lock();
  try {
    RMContainer rmContainer = getRMContainer(containerId);
    if (rmContainer == null) {
      // Some unknown container sneaked into the system. Kill it.
      rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeCleanContainerEvent(nodeId, containerId));
      return;
    }
    rmContainer.handle(
        new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
  } finally {
    writeLock.unlock();
  }
}
{code}
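
A minimal sketch of what the change could look like, assuming {{readLock}} is 
the read side of the same lock in {{SchedulerApplicationAttempt}}; this is only 
an illustration of the idea, not the actual patch (dropping the lock entirely 
may also be an option if {{getRMContainer}} is already safe to call 
concurrently):
{code:java}
public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
  // The method only looks up the RMContainer and dispatches events; it does
  // not mutate attempt state, so the read lock is enough and the scheduler
  // is no longer blocked behind this call.
  readLock.lock();
  try {
    RMContainer rmContainer = getRMContainer(containerId);
    if (rmContainer == null) {
      // Some unknown container sneaked into the system. Kill it.
      rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeCleanContainerEvent(nodeId, containerId));
      return;
    }
    rmContainer.handle(
        new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
  } finally {
    readLock.unlock();
  }
}
{code}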
 

  was:
 

Currently, the containerLaunchedOnNode method holds the 
SchedulerApplicationAttempt write lock, but looking at the method, it does not 
change any field. More seriously, holding the write lock here can slow down the 
scheduler.
{code:java}
public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
  writeLock.lock();
  try {
    // Inform the container
    RMContainer rmContainer = getRMContainer(containerId);
    if (rmContainer == null) {
      // Some unknown container sneaked into the system. Kill it.
      rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeCleanContainerEvent(nodeId, containerId));
      return;
    }
    rmContainer.handle(
        new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
  } finally {
    writeLock.unlock();
  }
}
{code}


> containerLaunchedOnNode method not need to hold schedulerApptemt lock 
> --
>
> Key: YARN-10459
> URL: https://issues.apache.org/jira/browse/YARN-10459
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.0, 3.1.3
>Reporter: Ryan Wu
>Priority: Major
> Fix For: 3.2.1
>
>
>  
> Currently, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt 
> write lock, but looking at the method, it does not change any field. More 
> seriously, holding the write lock here can slow down the scheduler.
> {code:java}
> public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
>   writeLock.lock();
>   try {
>     // Inform the container
>     RMContainer rmContainer = getRMContainer(containerId);
>     if (rmContainer == null) {
>       // Some unknown container sneaked into the system. Kill it.
>       rmContext.getDispatcher().getEventHandler().handle(
>           new RMNodeCleanContainerEvent(nodeId, containerId));
>       return;
>     }
>     rmContainer.handle(
>         new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
>   } finally {
>     writeLock.unlock();
>   }
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10459) containerLaunchedOnNode method not need to hold schedulerApptemt lock

2020-10-14 Thread Ryan Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Wu updated YARN-10459:
---
Description: 
 

Currently, the containerLaunchedOnNode method holds the 
SchedulerApplicationAttempt write lock, but looking at the method, it does not 
change any field. More seriously, holding the write lock here can slow down the 
scheduler.
{code:java}
public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
  writeLock.lock();
  try {
    // Inform the container
    RMContainer rmContainer = getRMContainer(containerId);
    if (rmContainer == null) {
      // Some unknown container sneaked into the system. Kill it.
      rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeCleanContainerEvent(nodeId, containerId));
      return;
    }
    rmContainer.handle(
        new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
  } finally {
    writeLock.unlock();
  }
}
{code}
 

  was:
 

Currently, the containerLaunchedOnNode method holds the 
SchedulerApplicationAttempt write lock, but looking at the method, it does not 
change any field. More seriously, holding the write lock here can slow down the 
scheduler.
{code:java}
public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
  writeLock.lock();
  try {
    // Inform the container
    RMContainer rmContainer = getRMContainer(containerId);
    if (rmContainer == null) {
      // Some unknown container sneaked into the system. Kill it.
      rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeCleanContainerEvent(nodeId, containerId));
      return;
    }
    rmContainer.handle(
        new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
  } finally {
    writeLock.unlock();
  }
}
{code}
 


> containerLaunchedOnNode method not need to hold schedulerApptemt lock 
> --
>
> Key: YARN-10459
> URL: https://issues.apache.org/jira/browse/YARN-10459
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.0, 3.1.3
>Reporter: Ryan Wu
>Priority: Major
> Fix For: 3.2.1
>
>
>  
> Currently, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt 
> write lock, but looking at the method, it does not change any field. More 
> seriously, holding the write lock here can slow down the scheduler.
> {code:java}
> public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
>   writeLock.lock();
>   try {
>     // Inform the container
>     RMContainer rmContainer = getRMContainer(containerId);
>     if (rmContainer == null) {
>       // Some unknown container sneaked into the system. Kill it.
>       rmContext.getDispatcher().getEventHandler().handle(
>           new RMNodeCleanContainerEvent(nodeId, containerId));
>       return;
>     }
>     rmContainer.handle(
>         new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
>   } finally {
>     writeLock.unlock();
>   }
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10459) containerLaunchedOnNode method not need to hold schedulerApptemt lock

2020-10-14 Thread Ryan Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Wu updated YARN-10459:
---
Description: 
 

Currently, the containerLaunchedOnNode method holds the 
SchedulerApplicationAttempt write lock, but looking at the method, it does not 
change any field. More seriously, holding the write lock here can slow down the 
scheduler.
{code:java}
public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
  writeLock.lock();
  try {
    // Inform the container
    RMContainer rmContainer = getRMContainer(containerId);
    if (rmContainer == null) {
      // Some unknown container sneaked into the system. Kill it.
      rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeCleanContainerEvent(nodeId, containerId));
      return;
    }
    rmContainer.handle(
        new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
  } finally {
    writeLock.unlock();
  }
}
{code}

  was:
 

Currently, the containerLaunchedOnNode method holds the 
SchedulerApplicationAttempt write lock, but looking at the method, it does not 
change any field. More seriously, holding the write lock here can slow down the 
scheduler.
{code:java}
public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
  writeLock.lock();
  try {
    // Inform the container
    RMContainer rmContainer = getRMContainer(containerId);
    if (rmContainer == null) {
      // Some unknown container sneaked into the system. Kill it.
      rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeCleanContainerEvent(nodeId, containerId));
      return;
    }
    rmContainer.handle(
        new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
  } finally {
    writeLock.unlock();
  }
}
{code}
 


> containerLaunchedOnNode method not need to hold schedulerApptemt lock 
> --
>
> Key: YARN-10459
> URL: https://issues.apache.org/jira/browse/YARN-10459
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.0, 3.1.3
>Reporter: Ryan Wu
>Priority: Major
> Fix For: 3.2.1
>
>
>  
> Currently, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt 
> write lock, but looking at the method, it does not change any field. More 
> seriously, holding the write lock here can slow down the scheduler.
> {code:java}
> public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
>   writeLock.lock();
>   try {
>     // Inform the container
>     RMContainer rmContainer = getRMContainer(containerId);
>     if (rmContainer == null) {
>       // Some unknown container sneaked into the system. Kill it.
>       rmContext.getDispatcher().getEventHandler().handle(
>           new RMNodeCleanContainerEvent(nodeId, containerId));
>       return;
>     }
>     rmContainer.handle(
>         new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
>   } finally {
>     writeLock.unlock();
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10459) containerLaunchedOnNode method not need to hold schedulerApptemt lock

2020-10-14 Thread Ryan Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Wu updated YARN-10459:
---
Description: 
 

Currently, the containerLaunchedOnNode method holds the 
SchedulerApplicationAttempt write lock, but looking at the method, it does not 
change any field. More seriously, holding the write lock here can slow down the 
scheduler.
{code:java}
public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
  writeLock.lock();
  try {
    // Inform the container
    RMContainer rmContainer = getRMContainer(containerId);
    if (rmContainer == null) {
      // Some unknown container sneaked into the system. Kill it.
      rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeCleanContainerEvent(nodeId, containerId));
      return;
    }
    rmContainer.handle(
        new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
  } finally {
    writeLock.unlock();
  }
}
{code}
 

  was:
 

Currently, the containerLaunchedOnNode method holds the 
SchedulerApplicationAttempt write lock, but looking at the method, it does not 
change any field. More seriously, holding the write lock here can slow down the 
scheduler.
{code:java}
public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
  writeLock.lock();
  try {
    // Inform the container
    RMContainer rmContainer = getRMContainer(containerId);
    if (rmContainer == null) {
      // Some unknown container sneaked into the system. Kill it.
      rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeCleanContainerEvent(nodeId, containerId));
      return;
    }
    rmContainer.handle(
        new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
  } finally {
    writeLock.unlock();
  }
}
{code}

 


> containerLaunchedOnNode method not need to hold schedulerApptemt lock 
> --
>
> Key: YARN-10459
> URL: https://issues.apache.org/jira/browse/YARN-10459
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.0, 3.1.3
>Reporter: Ryan Wu
>Priority: Major
> Fix For: 3.2.1
>
>
>  
> Currently, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt 
> write lock, but looking at the method, it does not change any field. More 
> seriously, holding the write lock here can slow down the scheduler.
> {code:java}
> public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
>   writeLock.lock();
>   try {
>     // Inform the container
>     RMContainer rmContainer = getRMContainer(containerId);
>     if (rmContainer == null) {
>       // Some unknown container sneaked into the system. Kill it.
>       rmContext.getDispatcher().getEventHandler().handle(
>           new RMNodeCleanContainerEvent(nodeId, containerId));
>       return;
>     }
>     rmContainer.handle(
>         new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
>   } finally {
>     writeLock.unlock();
>   }
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10459) containerLaunchedOnNode method not need to hold schedulerApptemt lock

2020-10-14 Thread Ryan Wu (Jira)
Ryan Wu created YARN-10459:
--

 Summary: containerLaunchedOnNode method not need to hold 
schedulerApptemt lock 
 Key: YARN-10459
 URL: https://issues.apache.org/jira/browse/YARN-10459
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 3.1.3, 3.2.0
Reporter: Ryan Wu
 Fix For: 3.2.1


 

Currently, the containerLaunchedOnNode method holds the 
SchedulerApplicationAttempt write lock, but looking at the method, it does not 
change any field. More seriously, holding the write lock here can slow down the 
scheduler.

{code:java}
public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
  writeLock.lock();
  try {
    // Inform the container
    RMContainer rmContainer = getRMContainer(containerId);
    if (rmContainer == null) {
      // Some unknown container sneaked into the system. Kill it.
      rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeCleanContainerEvent(nodeId, containerId));
      return;
    }
    rmContainer.handle(
        new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
  } finally {
    writeLock.unlock();
  }
}
{code}

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2

2020-10-14 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213786#comment-17213786
 ] 

Hadoop QA commented on YARN-10244:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
15s{color} |  | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} |  | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch does not contain any @author tags. 
{color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} |  | {color:green} The patch appears to include 4 new or modified 
test files. {color} |
|| || || || {color:brown} branch-3.2 Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 29m 
 2s{color} |  | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
43s{color} |  | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
36s{color} |  | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
53s{color} |  | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 46s{color} |  | {color:green} branch has no errors when building and 
testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
34s{color} |  | {color:green} branch-3.2 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
39s{color} |  | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
38s{color} |  | {color:green} branch-3.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
49s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
40s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} blanks {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch has no blanks issues. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 33s{color} | 
[/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/233/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt]
 | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 2 new + 197 unchanged - 6 fixed = 199 total (was 203) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
44s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 18s{color} |  | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
41s{color} |  | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} || ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 78m 22s{color} 
| 
[/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/233/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt]
 | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
30s{color} |  | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}147m 26s{color} | 
 | {color:black}{color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisherForV2 |
|   | 
hadoop.y

[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2

2020-10-14 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213762#comment-17213762
 ] 

Hadoop QA commented on YARN-10244:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 17m  
1s{color} |  | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} |  | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch does not contain any @author tags. 
{color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} |  | {color:green} The patch appears to include 4 new or modified 
test files. {color} |
|| || || || {color:brown} branch-3.2 Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 
28s{color} |  | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} |  | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
35s{color} |  | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
45s{color} |  | {color:green} branch-3.2 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 58s{color} |  | {color:green} branch has no errors when building and 
testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} |  | {color:green} branch-3.2 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
50s{color} |  | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
48s{color} |  | {color:green} branch-3.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
50s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
41s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
41s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} blanks {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch has no blanks issues. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 32s{color} | 
[/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/232/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt]
 | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 2 new + 197 unchanged - 6 fixed = 199 total (was 203) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
48s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 30s{color} |  | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
49s{color} |  | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} || ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 78m 31s{color} 
| 
[/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/232/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt]
 | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} |  | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}158m 39s{color} | 
 | {color:black}{color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisherForV2 |
|   | 
hadoop.y

[jira] [Updated] (YARN-9351) user can't use total resources of one partition even when yarn.scheduler.capacity..minimum-user-limit-percent is set to 100

2020-10-14 Thread Juanjuan Tian (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juanjuan Tian  updated YARN-9351:
-
Summary: user can't use total resources of one partition even when 
yarn.scheduler.capacity..minimum-user-limit-percent is set to 100   
(was: user can't use total resources of one partition even 
yarn.scheduler.capacity..minimum-user-limit-percent is set to 100 )

> user can't use total resources of one partition even when 
> yarn.scheduler.capacity..minimum-user-limit-percent is set to 100 
> 
>
> Key: YARN-9351
> URL: https://issues.apache.org/jira/browse/YARN-9351
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.1.2
>Reporter: Juanjuan Tian 
>Assignee: Juanjuan Tian 
>Priority: Major
>
> If we configure queue capacity in absolute terms, users can't use the total 
> resources of one partition even when 
> yarn.scheduler.capacity..minimum-user-limit-percent is set to 100.
> For example, there are two partitions, A and B: partition A has (120G memory, 
> 30 vcores) and partition B has (180G memory, 60 vcores). Queue Prod is 
> configured with (75G memory, 25 vcores) of partition A's resources, i.e.
> yarn.scheduler.capacity.root.Prod.accessible-node-labels.A.capacity=[memory=75Gi,vcores=25],
> yarn.scheduler.capacity.root.Prod.accessible-node-labels.A.maximum-capacity=[memory=120Gi,vcores=30],
> and yarn.scheduler.capacity.root.Prod.minimum-user-limit-percent=100. At one 
> point the used resource of queue Prod is (90G memory, 10 vcores); at this 
> time, even though yarn.scheduler.capacity..minimum-user-limit-percent 
> is set to 100, users in queue Prod can't get any more resources on partition A.
>  
> The reason for this is that in {color:#d04437}*computeUserLimit*{color}, 
> partitionResource is used when comparing consumed and queueCapacity, so in 
> the example (75G memory, 25 vcores) is the user limit:
> {code:java}
> Resource currentCapacity = Resources.lessThan(resourceCalculator,
>     partitionResource, consumed, queueCapacity)
>     ? queueCapacity
>     : Resources.add(consumed, required);
> Resource userLimitResource = Resources.max(resourceCalculator,
>     partitionResource,
>     Resources.divideAndCeil(resourceCalculator, resourceUsed,
>         usersSummedByWeight),
>     Resources.divideAndCeil(resourceCalculator,
>         Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>         100));
> {code}
>  
> But in *{color:#d04437}canAssignToUser{color}*, 
> *{color:#d04437}clusterResource{color}* is used for comparing *used* and 
> *limit*, so canAssignToUser returns false:
> {code:java}
> Resources.greaterThan(resourceCalculator, clusterResource,
>     user.getUsed(nodePartition), limit)
> {code}
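
To make the mismatch concrete, here is a back-of-the-envelope check under the 
assumption that the cluster consists only of partitions A (120G, 30 vcores) and 
B (180G, 60 vcores), i.e. clusterResource = (300G, 90 vcores), and that the 
DominantResourceCalculator is in use; this is only an illustration of the 
reported numbers, not code from the scheduler:
{code:java}
public class UserLimitComparisonDemo {
  // Dominant share of (mem, vcores) relative to a total (totalMem, totalVcores).
  static double dominantShare(double mem, double vcores,
      double totalMem, double totalVcores) {
    return Math.max(mem / totalMem, vcores / totalVcores);
  }

  public static void main(String[] args) {
    double[] used = {90, 10};    // queue Prod usage on partition A
    double[] limit = {75, 25};   // computed user limit (queue capacity on A)

    // Against the full cluster (300G, 90v): used appears ABOVE the limit,
    // so canAssignToUser rejects further allocation.
    System.out.println(dominantShare(used[0], used[1], 300, 90) >
        dominantShare(limit[0], limit[1], 300, 90));   // true

    // Against partition A only (120G, 30v): used is still BELOW the limit,
    // which is the comparison computeUserLimit effectively assumed.
    System.out.println(dominantShare(used[0], used[1], 120, 30) >
        dominantShare(limit[0], limit[1], 120, 30));   // false
  }
}
{code}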



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2

2020-10-14 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213697#comment-17213697
 ] 

Akira Ajisaka commented on YARN-10244:
--

+1 for the 003 patch. Thank you [~Steven Rand], [~sunilg], and [~hexiaoqiao].

> backport YARN-9848 to branch-3.2
> 
>
> Key: YARN-10244
> URL: https://issues.apache.org/jira/browse/YARN-10244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-10244-branch-3.2.001.patch, 
> YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch
>
>
> Backporting YARN-9848 to branch-3.2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org