[jira] [Commented] (YARN-10442) RM should make sure node label file highly available
[ https://issues.apache.org/jira/browse/YARN-10442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214316#comment-17214316 ] Hadoop QA commented on YARN-10442: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 13s{color} | | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 22s{color} | | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 11s{color} | | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 23s{color} | | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 22s{color} | | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 48s{color} | | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 21s{color} | | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 23m 49s{color} | | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 38s{color} | | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 35s{color} | | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 12s{color} | | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 57s{color} | | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 32s{color} | | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 47s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 11m 8s{color} | | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 11m 8s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 0s{color} | | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 10m 0s{color} | | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} blanks {color} | {color:red} 0m 0s{color} | [/blanks-eol.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/235/artifact/out/blanks-eol.txt] | {color:red} The patch has 2 line(s) that end in blanks. Use git apply --blanks=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 45s{color} | [/diff-checkstyle-hadoop-yarn-project_hadoop-yarn.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/235/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn.txt] | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 3 new + 210 unchanged - 0 fixed = 213 total (was 210) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 12s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 25m 10s{color} | | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{colo
[jira] [Updated] (YARN-10442) RM should make sure node label file highly available
[ https://issues.apache.org/jira/browse/YARN-10442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated YARN-10442: -- Attachment: YARN-10442.002.patch > RM should make sure node label file highly available > > > Key: YARN-10442 > URL: https://issues.apache.org/jira/browse/YARN-10442 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: YARN-10442.001.patch, YARN-10442.002.patch > > > One of my cluster's RMs failed to transition to Active because node label file > blocks are missing. I think RM should make sure important files are highly > available. > {noformat} > Caused by: com.google.protobuf.InvalidProtocolBufferException: Could not > obtain block: BP-2121803626-10.0.0.22-1597301807397:blk_1073832522_91774 > file=/yarn/node-labels/nodelabel.mirror > at > com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:238) > at > com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253) > at > com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259) > at > com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49) > at > org.apache.hadoop.yarn.proto.YarnServerResourceManagerServiceProtos$AddToClusterNodeLabelsRequestProto.parseDelimitedFrom(YarnServerResourceManagerServiceProtos.java:7493) > at > org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.loadFromMirror(FileSystemNodeLabelsStore.java:168) > at > org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:205) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:254) > at > org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:268) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194){noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
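For context on the fix direction discussed in YARN-10442, the sketch below shows one way a file-system-backed store could raise the replication of the node label mirror after writing it, so that losing a few DataNodes cannot leave the RM unable to become Active. The configuration key and helper method are illustrative assumptions and may not match what the attached YARN-10442 patches actually do.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class NodeLabelMirrorReplication {

  // Hypothetical key used only for this sketch; the real patch may take a
  // different approach (e.g. always writing the mirror with high replication).
  private static final String MIRROR_REPLICATION_KEY =
      "yarn.node-labels.fs-store.file-replication";

  static void ensureHighlyAvailable(Configuration conf, Path mirror)
      throws IOException {
    FileSystem fs = mirror.getFileSystem(conf);
    short target = (short) conf.getInt(MIRROR_REPLICATION_KEY, 5);
    if (fs.exists(mirror)
        && fs.getFileStatus(mirror).getReplication() < target) {
      // On HDFS this asks the NameNode to replicate the existing blocks;
      // on file systems without replication it is effectively a no-op.
      fs.setReplication(mirror, target);
    }
  }
}
{code}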
[jira] [Commented] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries
[ https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214128#comment-17214128 ] Hadoop QA commented on YARN-10421: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 14s{color} | | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | | {color:green} The patch appears to include 4 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 58s{color} | | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 32s{color} | | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 27s{color} | | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 44s{color} | | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 0s{color} | | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 16s{color} | | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 47s{color} | | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 59s{color} | | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 54s{color} | | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 0m 40s{color} | | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 21s{color} | | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 22s{color} | | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 11s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 20s{color} | | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 20s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 40s{color} | | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 40s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} blanks {color} | {color:green} 0m 0s{color} | | {color:green} The patch has no blanks issues. {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 54s{color} | [/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/234/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server.txt] | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server: The patch generated 1 new + 16 unchanged - 0 fixed = 17 total (was 16) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 8s{color} | | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} pylint {color} | {color:orange} 0m 2s{color} | [/diff-pylint.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/234/artifact/out/diff-pylint.txt] | {color:orange} The patch generated 18 new + 0 unchanged - 0 fixed = 18 total (was 0) {color} | | {color:green}+1{color} | {color:green}
[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214050#comment-17214050 ] Eric Badger commented on YARN-10244: I'm pretty confused with all of the JIRAs on this. In the future, I think we should revert the JIRA using the JIRA that was committed. Let me summarize what I think happened and you all can let me know if I have it right. YARN-4946 committed to 3.2, so it is in 3.2, 3.3, and trunk YARN-9848 reverted YARN-4946 from 3.3, so YARN-4946 only remains in 3.2 YARN-10244 reverted YARN-4946 from 3.2, so YARN-4946 has been completely reverted It's really confusing to me because YARN-4946 has the Fix Version set as 3.2. And then this JIRA says it is backporting YARN-9848, instead of saying it's reverting YARN-4946. Anyway, like I said above, if we're going to revert stuff, I think it is better to do it on the JIRA where it was committed so that we have a clear linear log of where it was committed to and reverted from. We can also then look at the Fix Version for that particular JIRA and know where it is actually committed > backport YARN-9848 to branch-3.2 > > > Key: YARN-10244 > URL: https://issues.apache.org/jira/browse/YARN-10244 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, resourcemanager >Reporter: Steven Rand >Assignee: Steven Rand >Priority: Major > Attachments: YARN-10244-branch-3.2.001.patch, > YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch > > > Backporting YARN-9848 to branch-3.2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10461) Update the ResourceManagerRest.md with the introduced endpoints
[ https://issues.apache.org/jira/browse/YARN-10461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-10461: Assignee: Benjamin Teke > Update the ResourceManagerRest.md with the introduced endpoints > --- > > Key: YARN-10461 > URL: https://issues.apache.org/jira/browse/YARN-10461 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > > The new APIs introduced in YARN-10421 should be added to the file > hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10461) Update the ResourceManagerRest.md with the introduced endpoints
Benjamin Teke created YARN-10461: Summary: Update the ResourceManagerRest.md with the introduced endpoints Key: YARN-10461 URL: https://issues.apache.org/jira/browse/YARN-10461 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke The new APIs introduced in YARN-10421 should be added to the file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerRest.md. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10433) Extend the DiagnosticService to initiate the diagnostic bundle collection
[ https://issues.apache.org/jira/browse/YARN-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-10433. -- Resolution: Abandoned Migrated to YARN-10421 > Extend the DiagnosticService to initiate the diagnostic bundle collection > - > > Key: YARN-10433 > URL: https://issues.apache.org/jira/browse/YARN-10433 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > > YARN-10421 introduces the new DiagnosticService class, the two new endpoints > for listing the available actions and starting the diagnostic script collect > method, and a basic diagnostic collector script. After the scripts form is > finalized (YARN-10422) the DiagnosticService should be extended to spawn the > requested collection method based on the input parameters and return the > collected bundle as a response. > To ease the load on the RM, the servlet should allow only one HTTP request at > a time. If a new request comes in while serving another an appropriate > response code should be returned, with the message "Diagnostics Collection in > Progress”. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
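The "only one HTTP request at a time" behaviour described above can be captured with a simple single-flight guard. The class below is a sketch under that assumption (the names are invented for illustration, not taken from a patch): it returns 409 Conflict with the "Diagnostics Collection in Progress" message when a collection is already running, so a rejected caller gets an immediate response instead of tying up an RM servlet thread.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

import javax.ws.rs.core.Response;

public class DiagnosticCollectionGuard {

  private final AtomicBoolean inProgress = new AtomicBoolean(false);

  public Response collect(Runnable bundleCollector) {
    // Reject concurrent requests instead of queueing them on the RM.
    if (!inProgress.compareAndSet(false, true)) {
      return Response.status(Response.Status.CONFLICT)
          .entity("Diagnostics Collection in Progress").build();
    }
    try {
      bundleCollector.run();          // fork the script and build the bundle
      return Response.ok().build();   // the real service would stream the bundle
    } finally {
      inProgress.set(false);
    }
  }
}
{code}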
[jira] [Commented] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries
[ https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214005#comment-17214005 ] Benjamin Teke commented on YARN-10421: -- Hi [~snemeth], Thanks for the insights! 1-2-7-9-11-12-15: corrected them. 3. Other similar methods of this class use the same message so I didn't want to introduce something different. 4. I added some TODO comments. This class is used for testing purposes and the REST testing related tasks are part of YARN-10434. 5. Yes, I tried to follow the convention. 6. Thanks, I would have missed this document. As the overall implementation is subject to changes I'll create a new Jira for the documentation updates. 8. This item is to be changed in the tests, and in YARN-10432 it will be replaced by a configuration entry. Should I uppercase it regardless? 10. It is currently a placeholder, in YARN-10422 I'll bundle the script and after that I'll update the location. 13-14. I refactored the tests. > Create YarnDiagnosticsService to serve diagnostic queries > -- > > Key: YARN-10421 > URL: https://issues.apache.org/jira/browse/YARN-10421 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Attachments: YARN-10421.001.patch, YARN-10421.002.patch, > YARN-10421.003.patch, YARN-10421.004.patch > > > YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet > forks a separate process, which executes a shell/Python/etc script. Based on > the use-cases listed below the script collects information, bundles it and > sends it to UI2. The diagnostic options are the following: > # Application hanging: > ** Application logs > ** Find the hanging container and get multiple Jstacks > ** ResourceManager logs during job lifecycle > ** NodeManager logs from NodeManager where the hanging containers of the > jobs ran > ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez > History URL > # Application failed: > ** Application logs > ** ResourceManager logs during job lifecycle. > ** NodeManager logs from NodeManager where the hanging containers of the > jobs ran > ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez > History URL. > ** Job related metrics like container, attempts. > # Scheduler related issue: > ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes. > ** Multiple Jstacks of ResourceManager > ** YARN and Scheduler Configuration > ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API > _/ws/v1/cluster/nodes response_ > ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response > (YARN-10319) > # ResourceManager / NodeManager daemon fails to start: > ** ResourceManager and NodeManager out and log file > ** YARN and Scheduler Configuration > Two new endpoints should be added to the RM web service: one for listing the > available diagnostic options (_/common-issue/list_), and one for calling a > selected option with the user provided parameters (_/common-issue/collect_). > The service should be transparent to the script changes to help with the > (on-the-fly) extensibility of the diagnostic tool. To split the changes to > smaller chunks the implementation behind _collect_ endpoint is to be provided > in YARN-10433. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
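To make the proposed API shape concrete, here is a rough JAX-RS sketch of the two endpoints named in the description (_/common-issue/list_ and _/common-issue/collect_). The class, method, and option names are guesses for illustration only; the resource added by the actual patches may look different.

{code:java}
import java.util.Arrays;
import java.util.List;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/common-issue")
public class YarnDiagnosticsResource {

  @GET
  @Path("/list")
  @Produces(MediaType.APPLICATION_JSON)
  public List<String> listDiagnosticOptions() {
    // In the real service this list would come from the bundled script, so
    // new diagnostic options can be added without touching the Java code.
    return Arrays.asList("APP_HANGING", "APP_FAILED",
        "SCHEDULER_ISSUE", "DAEMON_START_FAILURE");
  }

  @GET
  @Path("/collect")
  public Response collect(@QueryParam("option") String option,
                          @QueryParam("appId") String appId) {
    // The collect implementation is deferred to YARN-10433; it would fork the
    // collector script with the supplied parameters and return the bundle.
    return Response.status(501)
        .entity("collect is provided in YARN-10433").build();
  }
}
{code}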
[jira] [Updated] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries
[ https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-10421: - Description: YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet forks a separate process, which executes a shell/Python/etc script. Based on the use-cases listed below the script collects information, bundles it and sends it to UI2. The diagnostic options are the following: # Application hanging: ** Application logs ** Find the hanging container and get multiple Jstacks ** ResourceManager logs during job lifecycle ** NodeManager logs from NodeManager where the hanging containers of the jobs ran ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez History URL # Application failed: ** Application logs ** ResourceManager logs during job lifecycle. ** NodeManager logs from NodeManager where the hanging containers of the jobs ran ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez History URL. ** Job related metrics like container, attempts. # Scheduler related issue: ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes. ** Multiple Jstacks of ResourceManager ** YARN and Scheduler Configuration ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API _/ws/v1/cluster/nodes response_ ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response (YARN-10319) # ResourceManager / NodeManager daemon fails to start: ** ResourceManager and NodeManager out and log file ** YARN and Scheduler Configuration Two new endpoints should be added to the RM web service: one for listing the available diagnostic options (_/common-issue/list_), and one for calling a selected option with the user provided parameters (_/common-issue/collect_). was: YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet forks a separate process, which executes a shell/Python/etc script. Based on the use-cases listed below the script collects information, bundles it and sends it to UI2. The diagnostic options are the following: # Application hanging: ** Application logs ** Find the hanging container and get multiple Jstacks ** ResourceManager logs during job lifecycle ** NodeManager logs from NodeManager where the hanging containers of the jobs ran ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez History URL # Application failed: ** Application logs ** ResourceManager logs during job lifecycle. ** NodeManager logs from NodeManager where the hanging containers of the jobs ran ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez History URL. ** Job related metrics like container, attempts. # Scheduler related issue: ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes. ** Multiple Jstacks of ResourceManager ** YARN and Scheduler Configuration ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API _/ws/v1/cluster/nodes response_ ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response (YARN-10319) # ResourceManager / NodeManager daemon fails to start: ** ResourceManager and NodeManager out and log file ** YARN and Scheduler Configuration Two new endpoints should be added to the RM web service: one for listing the available diagnostic options (_/common-issue/list_), and one for calling a selected option with the user provided parameters (_/common-issue/collect_). 
The service should be transparent to the script changes to help with the (on-the-fly) extensibility of the diagnostic tool. To split the changes to smaller chunks the implementation behind _collect_ endpoint is to be provided in YARN-10433. > Create YarnDiagnosticsService to serve diagnostic queries > -- > > Key: YARN-10421 > URL: https://issues.apache.org/jira/browse/YARN-10421 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Attachments: YARN-10421.001.patch, YARN-10421.002.patch, > YARN-10421.003.patch, YARN-10421.004.patch > > > YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet > forks a separate process, which executes a shell/Python/etc script. Based on > the use-cases listed below the script collects information, bundles it and > sends it to UI2. The diagnostic options are the following: > # Application hanging: > ** Application logs > ** Find the hanging container and get multiple Jstacks > ** ResourceManager logs during job lifecycle > ** NodeManager logs from NodeManager where the hanging containers of the > jobs ran > ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez > History URL > # Applicatio
[jira] [Updated] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries
[ https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-10421: - Attachment: YARN-10421.004.patch > Create YarnDiagnosticsService to serve diagnostic queries > -- > > Key: YARN-10421 > URL: https://issues.apache.org/jira/browse/YARN-10421 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Attachments: YARN-10421.001.patch, YARN-10421.002.patch, > YARN-10421.003.patch, YARN-10421.004.patch > > > YarnDiagnosticsServlet should run inside ResourceManager Daemon. The servlet > forks a separate process, which executes a shell/Python/etc script. Based on > the use-cases listed below the script collects information, bundles it and > sends it to UI2. The diagnostic options are the following: > # Application hanging: > ** Application logs > ** Find the hanging container and get multiple Jstacks > ** ResourceManager logs during job lifecycle > ** NodeManager logs from NodeManager where the hanging containers of the > jobs ran > ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez > History URL > # Application failed: > ** Application logs > ** ResourceManager logs during job lifecycle. > ** NodeManager logs from NodeManager where the hanging containers of the > jobs ran > ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez > History URL. > ** Job related metrics like container, attempts. > # Scheduler related issue: > ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes. > ** Multiple Jstacks of ResourceManager > ** YARN and Scheduler Configuration > ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API > _/ws/v1/cluster/nodes response_ > ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response > (YARN-10319) > # ResourceManager / NodeManager daemon fails to start: > ** ResourceManager and NodeManager out and log file > ** YARN and Scheduler Configuration > Two new endpoints should be added to the RM web service: one for listing the > available diagnostic options (_/common-issue/list_), and one for calling a > selected option with the user provided parameters (_/common-issue/collect_). > The service should be transparent to the script changes to help with the > (on-the-fly) extensibility of the diagnostic tool. To split the changes to > smaller chunks the implementation behind _collect_ endpoint is to be provided > in YARN-10433. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10460: Attachment: YARN-10460-POC.patch > Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail > - > > Key: YARN-10460 > URL: https://issues.apache.org/jira/browse/YARN-10460 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10460-POC.patch > > > In our downstream build environment, we're using JUnit 4.13. Recently, we > discovered a truly weird test failure in TestNodeStatusUpdater. > The problem is that timeout handling has changed in Junit 4.13. See the > difference between these two snippets: > 4.12 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw throwable; > } > } > {noformat} > > 4.13 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > try { > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw throwable; > } > } finally { > try { > thread.join(1); > } catch (InterruptedException e) { > Thread.currentThread().interrupt(); > } > try { > threadGroup.destroy(); < This > } catch (IllegalThreadStateException e) { > // If a thread from the group is still alive, the ThreadGroup > cannot be destroyed. > // Swallow the exception to keep the same behavior prior to > this change. > } > } > } > {noformat} > The change comes from [https://github.com/junit-team/junit4/pull/1517]. > Unfortunately, destroying the thread group causes an issue because there are > all sorts of object caching in the IPC layer. 
The exception is: > {noformat} > java.lang.IllegalThreadStateException > at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867) > at java.lang.Thread.init(Thread.java:402) > at java.lang.Thread.init(Thread.java:349) > at java.lang.Thread.(Thread.java:675) > at > java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613) > at > com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163) > at > java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) > at > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) > at > org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136) > at org.apache.hadoop.ipc.Client.call(Client.java:1458) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy81.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576) > {noformat} > Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the > client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} are stored as > long as they're needed. But since the backing thr
[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10460: Attachment: YARN-10460-POC.patch > Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail > - > > Key: YARN-10460 > URL: https://issues.apache.org/jira/browse/YARN-10460 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > In our downstream build environment, we're using JUnit 4.13. Recently, we > discovered a truly weird test failure in TestNodeStatusUpdater. > The problem is that timeout handling has changed in Junit 4.13. See the > difference between these two snippets: > 4.12 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw throwable; > } > } > {noformat} > > 4.13 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > try { > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw throwable; > } > } finally { > try { > thread.join(1); > } catch (InterruptedException e) { > Thread.currentThread().interrupt(); > } > try { > threadGroup.destroy(); < This > } catch (IllegalThreadStateException e) { > // If a thread from the group is still alive, the ThreadGroup > cannot be destroyed. > // Swallow the exception to keep the same behavior prior to > this change. > } > } > } > {noformat} > The change comes from [https://github.com/junit-team/junit4/pull/1517]. > Unfortunately, destroying the thread group causes an issue because there are > all sorts of object caching in the IPC layer. 
The exception is: > {noformat} > java.lang.IllegalThreadStateException > at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867) > at java.lang.Thread.init(Thread.java:402) > at java.lang.Thread.init(Thread.java:349) > at java.lang.Thread.(Thread.java:675) > at > java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613) > at > com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163) > at > java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) > at > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) > at > org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136) > at org.apache.hadoop.ipc.Client.call(Client.java:1458) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy81.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576) > {noformat} > Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the > client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} are stored as > long as they're needed. But since the backing thread group is destroyed in > the previous test
[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10460: Attachment: (was: YARN-10460-POC.patch) > Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail > - > > Key: YARN-10460 > URL: https://issues.apache.org/jira/browse/YARN-10460 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > In our downstream build environment, we're using JUnit 4.13. Recently, we > discovered a truly weird test failure in TestNodeStatusUpdater. > The problem is that timeout handling has changed in Junit 4.13. See the > difference between these two snippets: > 4.12 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw throwable; > } > } > {noformat} > > 4.13 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > try { > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw throwable; > } > } finally { > try { > thread.join(1); > } catch (InterruptedException e) { > Thread.currentThread().interrupt(); > } > try { > threadGroup.destroy(); < This > } catch (IllegalThreadStateException e) { > // If a thread from the group is still alive, the ThreadGroup > cannot be destroyed. > // Swallow the exception to keep the same behavior prior to > this change. > } > } > } > {noformat} > The change comes from [https://github.com/junit-team/junit4/pull/1517]. > Unfortunately, destroying the thread group causes an issue because there are > all sorts of object caching in the IPC layer. 
The exception is: > {noformat} > java.lang.IllegalThreadStateException > at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867) > at java.lang.Thread.init(Thread.java:402) > at java.lang.Thread.init(Thread.java:349) > at java.lang.Thread.(Thread.java:675) > at > java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613) > at > com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163) > at > java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) > at > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) > at > org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136) > at org.apache.hadoop.ipc.Client.call(Client.java:1458) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy81.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576) > {noformat} > Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the > client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} are stored as > long as they're needed. But since the backing thread group is destroyed in > the pr
[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213830#comment-17213830 ] Peter Bacsko commented on YARN-10460: - cc [~Jim_Brennan] [~ebadger] [~weichiu] [~aajisaka] what do you guys think? Sooner or later we'll bump the JUnit version and this will happen. > Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail > - > > Key: YARN-10460 > URL: https://issues.apache.org/jira/browse/YARN-10460 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > In our downstream build environment, we're using JUnit 4.13. Recently, we > discovered a truly weird test failure in TestNodeStatusUpdater. > The problem is that timeout handling has changed in Junit 4.13. See the > difference between these two snippets: > 4.12 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw throwable; > } > } > {noformat} > > 4.13 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > try { > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw throwable; > } > } finally { > try { > thread.join(1); > } catch (InterruptedException e) { > Thread.currentThread().interrupt(); > } > try { > threadGroup.destroy(); < This > } catch (IllegalThreadStateException e) { > // If a thread from the group is still alive, the ThreadGroup > cannot be destroyed. > // Swallow the exception to keep the same behavior prior to > this change. > } > } > } > {noformat} > The change comes from [https://github.com/junit-team/junit4/pull/1517]. > Unfortunately, destroying the thread group causes an issue because there are > all sorts of object caching in the IPC layer. 
The exception is: > {noformat} > java.lang.IllegalThreadStateException > at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867) > at java.lang.Thread.init(Thread.java:402) > at java.lang.Thread.init(Thread.java:349) > at java.lang.Thread.(Thread.java:675) > at > java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613) > at > com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163) > at > java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) > at > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) > at > org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136) > at org.apache.hadoop.ipc.Client.call(Client.java:1458) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy81.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576) > {noformat} > Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the > client ob
[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10460: Description: In our downstream build environment, we're using JUnit 4.13. Recently, we discovered a truly weird test failure in TestNodeStatusUpdater. The problem is that timeout handling has changed in Junit 4.13. See the difference between these two snippets: 4.12 {noformat} @Override public void evaluate() throws Throwable { CallableStatement callable = new CallableStatement(); FutureTask task = new FutureTask(callable); threadGroup = new ThreadGroup("FailOnTimeoutGroup"); Thread thread = new Thread(threadGroup, task, "Time-limited test"); thread.setDaemon(true); thread.start(); callable.awaitStarted(); Throwable throwable = getResult(task, thread); if (throwable != null) { throw throwable; } } {noformat} 4.13 {noformat} @Override public void evaluate() throws Throwable { CallableStatement callable = new CallableStatement(); FutureTask task = new FutureTask(callable); ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup"); Thread thread = new Thread(threadGroup, task, "Time-limited test"); try { thread.setDaemon(true); thread.start(); callable.awaitStarted(); Throwable throwable = getResult(task, thread); if (throwable != null) { throw throwable; } } finally { try { thread.join(1); } catch (InterruptedException e) { Thread.currentThread().interrupt(); } try { threadGroup.destroy(); < This } catch (IllegalThreadStateException e) { // If a thread from the group is still alive, the ThreadGroup cannot be destroyed. // Swallow the exception to keep the same behavior prior to this change. } } } {noformat} The change comes from [https://github.com/junit-team/junit4/pull/1517]. Unfortunately, destroying the thread group causes an issue because there are all sorts of object caching in the IPC layer. 
The exception is: {noformat} java.lang.IllegalThreadStateException at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867) at java.lang.Thread.init(Thread.java:402) at java.lang.Thread.init(Thread.java:349) at java.lang.Thread.(Thread.java:675) at java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613) at com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163) at java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136) at org.apache.hadoop.ipc.Client.call(Client.java:1458) at org.apache.hadoop.ipc.Client.call(Client.java:1405) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy81.startContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576) {noformat} Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} are stored as long as they're needed. But since the backing thread group is destroyed in the previous test, it's no longer possible to create new threads. A quick workaround is to stop the clients and completely clear the {{ClientCache}} in {{ProtobufRpcEngine}} before each testcase. I tried this and it solves the problem but it feels hacky. Not sure if there is a better approach. was: In our downstream build environment, we're using JUnit 4.13. Recently, we discovered a truly weird test failure in TestNodeStatusUpdater. The problem is that timeout handling has changed in Junit 4.13. See the difference between these two snippets: 4.12 {noformat} @Override public void evaluate() throws Throwable { Ca
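A sketch of the "hacky" workaround mentioned at the end of the description: stop the cached IPC clients and clear ProtobufRpcEngine's ClientCache before each test case, so the next test creates fresh client threads outside the destroyed ThreadGroup. The private field names ("CLIENTS", "clients") are assumptions about Hadoop internals and may differ between versions.

{code:java}
import java.lang.reflect.Field;
import java.util.Map;

import org.apache.hadoop.ipc.Client;
import org.apache.hadoop.ipc.ProtobufRpcEngine;
import org.junit.Before;

public abstract class IpcClientCacheResetTestBase {

  @Before
  @SuppressWarnings("unchecked")
  public void resetIpcClientCache() throws Exception {
    // ProtobufRpcEngine keeps a static ClientCache; field name assumed.
    Field cacheField = ProtobufRpcEngine.class.getDeclaredField("CLIENTS");
    cacheField.setAccessible(true);
    Object clientCache = cacheField.get(null);

    // ClientCache maps SocketFactory -> Client; field name assumed.
    Field clientsField = clientCache.getClass().getDeclaredField("clients");
    clientsField.setAccessible(true);
    Map<Object, Client> clients =
        (Map<Object, Client>) clientsField.get(clientCache);

    synchronized (clientCache) {
      for (Client client : clients.values()) {
        client.stop();   // joins the cached connection/executor threads
      }
      clients.clear();   // forces a new Client (and new threads) next time
    }
  }
}
{code}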
[jira] [Created] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
Peter Bacsko created YARN-10460: --- Summary: Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail Key: YARN-10460 URL: https://issues.apache.org/jira/browse/YARN-10460 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, test Reporter: Peter Bacsko Assignee: Peter Bacsko In our downstream build environment, we're using JUnit 4.13. Recently, we discovered a truly weird test failure in TestNodeStatusUpdater. The problem is that timeout handling has changed in Junit 4.13. See the difference between these two snippets: 4.12 {noformat} @Override public void evaluate() throws Throwable { CallableStatement callable = new CallableStatement(); FutureTask task = new FutureTask(callable); threadGroup = new ThreadGroup("FailOnTimeoutGroup"); Thread thread = new Thread(threadGroup, task, "Time-limited test"); thread.setDaemon(true); thread.start(); callable.awaitStarted(); Throwable throwable = getResult(task, thread); if (throwable != null) { throw throwable; } } {noformat} 4.13 {noformat} @Override public void evaluate() throws Throwable { CallableStatement callable = new CallableStatement(); FutureTask task = new FutureTask(callable); ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup"); Thread thread = new Thread(threadGroup, task, "Time-limited test"); try { thread.setDaemon(true); thread.start(); callable.awaitStarted(); Throwable throwable = getResult(task, thread); if (throwable != null) { throw throwable; } } finally { try { thread.join(1); } catch (InterruptedException e) { Thread.currentThread().interrupt(); } try { threadGroup.destroy(); < This } catch (IllegalThreadStateException e) { // If a thread from the group is still alive, the ThreadGroup cannot be destroyed. // Swallow the exception to keep the same behavior prior to this change. } } } {noformat} The change comes from [https://github.com/junit-team/junit4/pull/1517]. Unfortunately, destroying the thread group causes an issue because there are all sorts of object caching in the IPC layer. 
The exception is: {noformat} java.lang.IllegalThreadStateException at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867) at java.lang.Thread.init(Thread.java:402) at java.lang.Thread.init(Thread.java:349) at java.lang.Thread.(Thread.java:675) at java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613) at com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163) at java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136) at org.apache.hadoop.ipc.Client.call(Client.java:1458) at org.apache.hadoop.ipc.Client.call(Client.java:1405) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy81.startContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576) {noformat} Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} is stored as long as they're needed. But since the backing thread group is destroyed in the previous test, it's no longer possible to create new threads. A quick workaround is to stop the clients and completely clear the {{ClientCache}} in {{ProtobufRpcEngine}} before each testcase. I tried this and it solves the problem but it feels hacky. Not sure if there is a better approach. -- This message was sent by Atlassian Jira (v8.3.4#803005) ---
[jira] [Updated] (YARN-10459) containerLaunchedOnNode method does not need to hold SchedulerApplicationAttempt lock
[ https://issues.apache.org/jira/browse/YARN-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Wu updated YARN-10459: --- Description: Now, the containerLaunchedOnNode method hold the SchedulerApplicationAttempt writelock, but looking at the method, it does not change any field. And more seriously, this will affect the scheduler. {code:java} // public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) { // Inform the container writelock.lock try { RMContainer rmContainer = getRMContainer(containerId); if (rmContainer == null) { // Some unknown container sneaked into the system. Kill it. rmContext.getDispatcher().getEventHandler().handle( new RMNodeCleanContainerEvent(nodeId, containerId)); return; } rmContainer.handle( new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED)); }finally { writeLock.unlock(); } } {code} was: Now, the containerLaunchedOnNode method hold the SchedulerApplicationAttempt writelock, but looking at the method, it does not change any field. And more seriously, this will affect the scheduler. {code:java} // code placeholder {code} public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) \{ // Inform the container writelock.lock try { RMContainer rmContainer = getRMContainer(containerId); if (rmContainer == null) { // Some unknown container sneaked into the system. Kill it. rmContext.getDispatcher().getEventHandler().handle( new RMNodeCleanContainerEvent(nodeId, containerId)); return; } rmContainer.handle( new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED)); }finally \{ writeLock.unlock(); } } > containerLaunchedOnNode method not need to hold schedulerApptemt lock > -- > > Key: YARN-10459 > URL: https://issues.apache.org/jira/browse/YARN-10459 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.2.0, 3.1.3 >Reporter: Ryan Wu >Priority: Major > Fix For: 3.2.1 > > > > Now, the containerLaunchedOnNode method hold the SchedulerApplicationAttempt > writelock, but looking at the method, it does not change any field. And more > seriously, this will affect the scheduler. > {code:java} > // public void containerLaunchedOnNode(ContainerId containerId, NodeId > nodeId) { > // Inform the container > writelock.lock > try { > RMContainer rmContainer = getRMContainer(containerId); > if (rmContainer == null) { > // Some unknown container sneaked into the system. Kill it. > rmContext.getDispatcher().getEventHandler().handle( new > RMNodeCleanContainerEvent(nodeId, containerId)); return; > } > rmContainer.handle( new RMContainerEvent(containerId, > RMContainerEventType.LAUNCHED)); >}finally { > writeLock.unlock(); >} > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10459) containerLaunchedOnNode method does not need to hold the SchedulerApplicationAttempt lock
[ https://issues.apache.org/jira/browse/YARN-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Wu updated YARN-10459:
---
Description: Now, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt write lock, but looking at the method, it does not change any field. More seriously, this will affect the scheduler. {code:java} public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) { // Inform the container writeLock.lock(); try { RMContainer rmContainer = getRMContainer(containerId); if (rmContainer == null) { // Some unknown container sneaked into the system. Kill it. rmContext.getDispatcher().getEventHandler().handle( new RMNodeCleanContainerEvent(nodeId, containerId)); return; } rmContainer.handle( new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED)); } finally { writeLock.unlock(); } } {code}

was: Now, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt write lock, but looking at the method, it does not change any field. More seriously, this will affect the scheduler. {code:java} public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) { // Inform the container writeLock.lock(); try { RMContainer rmContainer = getRMContainer(containerId); if (rmContainer == null) { // Some unknown container sneaked into the system. Kill it. rmContext.getDispatcher().getEventHandler().handle( new RMNodeCleanContainerEvent(nodeId, containerId)); return; } rmContainer.handle( new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED)); } finally { writeLock.unlock(); } } {code}

> containerLaunchedOnNode method does not need to hold the SchedulerApplicationAttempt lock
> ------------------------------------------------------------------------------------------
>
> Key: YARN-10459
> URL: https://issues.apache.org/jira/browse/YARN-10459
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 3.2.0, 3.1.3
> Reporter: Ryan Wu
> Priority: Major
> Fix For: 3.2.1
>
> Now, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt write lock, but looking at the method, it does not change any field. More seriously, this will affect the scheduler. {code:java} public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) { // Inform the container writeLock.lock(); try { RMContainer rmContainer = getRMContainer(containerId); if (rmContainer == null) { // Some unknown container sneaked into the system. Kill it. rmContext.getDispatcher().getEventHandler().handle( new RMNodeCleanContainerEvent(nodeId, containerId)); return; } rmContainer.handle( new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED)); } finally { writeLock.unlock(); } } {code}
>
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10459) containerLaunchedOnNode method does not need to hold the SchedulerApplicationAttempt lock
[ https://issues.apache.org/jira/browse/YARN-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Wu updated YARN-10459:
---
Description: Now, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt write lock, but looking at the method, it does not change any field. More seriously, this will affect the scheduler. {code:java} // code placeholder {code} public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) { // Inform the container writeLock.lock(); try { RMContainer rmContainer = getRMContainer(containerId); if (rmContainer == null) { // Some unknown container sneaked into the system. Kill it. rmContext.getDispatcher().getEventHandler().handle( new RMNodeCleanContainerEvent(nodeId, containerId)); return; } rmContainer.handle( new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED)); } finally { writeLock.unlock(); } }

was: Now, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt write lock, but looking at the method, it does not change any field. More seriously, this will affect the scheduler. {code:java} public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) { // Inform the container writeLock.lock(); try { RMContainer rmContainer = getRMContainer(containerId); if (rmContainer == null) { // Some unknown container sneaked into the system. Kill it. rmContext.getDispatcher().getEventHandler().handle( new RMNodeCleanContainerEvent(nodeId, containerId)); return; } rmContainer.handle( new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED)); } finally { writeLock.unlock(); } } {code}

> containerLaunchedOnNode method does not need to hold the SchedulerApplicationAttempt lock
> ------------------------------------------------------------------------------------------
>
> Key: YARN-10459
> URL: https://issues.apache.org/jira/browse/YARN-10459
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 3.2.0, 3.1.3
> Reporter: Ryan Wu
> Priority: Major
> Fix For: 3.2.1
>
> Now, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt write lock, but looking at the method, it does not change any field. More seriously, this will affect the scheduler. {code:java} // code placeholder {code} public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) { // Inform the container writeLock.lock(); try { RMContainer rmContainer = getRMContainer(containerId); if (rmContainer == null) { // Some unknown container sneaked into the system. Kill it. rmContext.getDispatcher().getEventHandler().handle( new RMNodeCleanContainerEvent(nodeId, containerId)); return; } rmContainer.handle( new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED)); } finally { writeLock.unlock(); } }
>
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10459) containerLaunchedOnNode method does not need to hold the SchedulerApplicationAttempt lock
[ https://issues.apache.org/jira/browse/YARN-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Wu updated YARN-10459:
---
Description: Now, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt write lock, but looking at the method, it does not change any field. More seriously, this will affect the scheduler. {code:java} public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) { // Inform the container writeLock.lock(); try { RMContainer rmContainer = getRMContainer(containerId); if (rmContainer == null) { // Some unknown container sneaked into the system. Kill it. rmContext.getDispatcher().getEventHandler().handle( new RMNodeCleanContainerEvent(nodeId, containerId)); return; } rmContainer.handle( new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED)); } finally { writeLock.unlock(); } } {code}

was: Now, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt write lock, but looking at the method, it does not change any field. More seriously, this will affect the scheduler. public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) { // Inform the container writeLock.lock(); try { RMContainer rmContainer = getRMContainer(containerId); if (rmContainer == null) { // Some unknown container sneaked into the system. Kill it. rmContext.getDispatcher().getEventHandler().handle( new RMNodeCleanContainerEvent(nodeId, containerId)); return; } rmContainer.handle( new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED)); } finally { writeLock.unlock(); } }

> containerLaunchedOnNode method does not need to hold the SchedulerApplicationAttempt lock
> ------------------------------------------------------------------------------------------
>
> Key: YARN-10459
> URL: https://issues.apache.org/jira/browse/YARN-10459
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 3.2.0, 3.1.3
> Reporter: Ryan Wu
> Priority: Major
> Fix For: 3.2.1
>
> Now, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt write lock, but looking at the method, it does not change any field. More seriously, this will affect the scheduler. {code:java} public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) { // Inform the container writeLock.lock(); try { RMContainer rmContainer = getRMContainer(containerId); if (rmContainer == null) { // Some unknown container sneaked into the system. Kill it. rmContext.getDispatcher().getEventHandler().handle( new RMNodeCleanContainerEvent(nodeId, containerId)); return; } rmContainer.handle( new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED)); } finally { writeLock.unlock(); } } {code}
>
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10459) containerLaunchedOnNode method does not need to hold the SchedulerApplicationAttempt lock
Ryan Wu created YARN-10459:
--
Summary: containerLaunchedOnNode method does not need to hold the SchedulerApplicationAttempt lock
Key: YARN-10459
URL: https://issues.apache.org/jira/browse/YARN-10459
Project: Hadoop YARN
Issue Type: Improvement
Affects Versions: 3.1.3, 3.2.0
Reporter: Ryan Wu
Fix For: 3.2.1

Now, the containerLaunchedOnNode method holds the SchedulerApplicationAttempt write lock, but looking at the method, it does not change any field. More seriously, this will affect the scheduler.

public void containerLaunchedOnNode(ContainerId containerId, NodeId nodeId) {
  // Inform the container
  writeLock.lock();
  try {
    RMContainer rmContainer = getRMContainer(containerId);
    if (rmContainer == null) {
      // Some unknown container sneaked into the system. Kill it.
      rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeCleanContainerEvent(nodeId, containerId));
      return;
    }
    rmContainer.handle(
        new RMContainerEvent(containerId, RMContainerEventType.LAUNCHED));
  } finally {
    writeLock.unlock();
  }
}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
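On the "this will affect the scheduler" point above, the snippet below is a tiny, self-contained JDK illustration (not ResourceManager code): while one thread holds the write lock for a slow containerLaunchedOnNode-style operation, a scheduler-style thread that needs the same lock simply waits, which is the contention the report is worried about.
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: a long operation under the write lock stalls any other
// thread that wants the same lock for the full duration.
public class WriteLockContentionDemo {
  public static void main(String[] args) throws InterruptedException {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    Thread launchedOnNode = new Thread(() -> {
      lock.writeLock().lock();
      try {
        Thread.sleep(1000);            // pretend the event handling is slow
      } catch (InterruptedException ignored) {
      } finally {
        lock.writeLock().unlock();
      }
    });

    Thread scheduler = new Thread(() -> {
      long start = System.nanoTime();
      lock.writeLock().lock();         // allocation paths also take this lock
      lock.writeLock().unlock();
      System.out.printf("scheduler thread waited %d ms%n",
          (System.nanoTime() - start) / 1_000_000);
    });

    launchedOnNode.start();
    Thread.sleep(100);                 // let the first thread grab the lock
    scheduler.start();
    launchedOnNode.join();
    scheduler.join();
  }
}
{code}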
[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213786#comment-17213786 ] Hadoop QA commented on YARN-10244: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 15s{color} | | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | | {color:green} The patch appears to include 4 new or modified test files. {color} | || || || || {color:brown} branch-3.2 Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 29m 2s{color} | | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 43s{color} | | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 36s{color} | | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 53s{color} | | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 46s{color} | | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s{color} | | {color:green} branch-3.2 passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 39s{color} | | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 38s{color} | | {color:green} branch-3.2 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 49s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 40s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} blanks {color} | {color:green} 0m 0s{color} | | {color:green} The patch has no blanks issues. 
{color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 33s{color} | [/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/233/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt] | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 197 unchanged - 6 fixed = 199 total (was 203) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 44s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 18s{color} | | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 41s{color} | | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 78m 22s{color} | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/233/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt] | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 30s{color} | | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}147m 26s{color} | | {color:black}{color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisherForV2 | | | hadoop.y
[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213762#comment-17213762 ] Hadoop QA commented on YARN-10244: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 17m 1s{color} | | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | | {color:green} The patch appears to include 4 new or modified test files. {color} | || || || || {color:brown} branch-3.2 Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 28s{color} | | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 42s{color} | | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 35s{color} | | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 45s{color} | | {color:green} branch-3.2 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 58s{color} | | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s{color} | | {color:green} branch-3.2 passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 50s{color} | | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 48s{color} | | {color:green} branch-3.2 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 50s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 41s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} blanks {color} | {color:green} 0m 0s{color} | | {color:green} The patch has no blanks issues. 
{color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 32s{color} | [/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/232/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt] | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 197 unchanged - 6 fixed = 199 total (was 203) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 48s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 30s{color} | | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s{color} | | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 49s{color} | | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 78m 31s{color} | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/232/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt] | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}158m 39s{color} | | {color:black}{color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisherForV2 | | | hadoop.y
[jira] [Updated] (YARN-9351) user can't use total resources of one partition even when yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent is set to 100
[ https://issues.apache.org/jira/browse/YARN-9351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juanjuan Tian updated YARN-9351:
-
Summary: user can't use total resources of one partition even when yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent is set to 100 (was: user can't use total resources of one partition even yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent is set to 100)

> user can't use total resources of one partition even when yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent is set to 100
> ----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-9351
> URL: https://issues.apache.org/jira/browse/YARN-9351
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 3.1.2
> Reporter: Juanjuan Tian
> Assignee: Juanjuan Tian
> Priority: Major
>
> If we configure queue capacity in absolute terms, users can't use the total resources of one partition even when yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent is set to 100.
> For example, there are two partitions A and B: partition A has (120G memory, 30 vcores) and partition B has (180G memory, 60 vcores). Queue Prod is configured with (75G memory, 25 vcores) of partition A resource, like yarn.scheduler.capacity.root.Prod.accessible-node-labels.A.capacity=[memory=75Gi,vcores=25], yarn.scheduler.capacity.root.Prod.accessible-node-labels.A.maximum-capacity=[memory=120Gi,vcores=30] and yarn.scheduler.capacity.root.Prod.minimum-user-limit-percent=100. At one point the used resource of queue Prod is (90G memory, 10 vcores); at this time, even though yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent is set to 100, users in queue Prod can't get more resources on partition A.
>
> The reason is that in {color:#d04437}*computeUserLimit*{color}, partitionResource is used when comparing consumed and queueCapacity, so in the example (75G memory, 25 vcores) becomes the user limit:
> Resource currentCapacity = Resources.lessThan(resourceCalculator,
>     partitionResource, consumed, queueCapacity)
>     ? queueCapacity
>     : Resources.add(consumed, required);
> Resource userLimitResource = Resources.max(resourceCalculator,
>     partitionResource,
>     Resources.divideAndCeil(resourceCalculator, resourceUsed, usersSummedByWeight),
>     Resources.divideAndCeil(resourceCalculator,
>         Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()), 100));
>
> But in *{color:#d04437}canAssignToUser{color}* the check is Resources.greaterThan(resourceCalculator, clusterResource, user.getUsed(nodePartition), limit), i.e. *{color:#d04437}clusterResource{color}* is used for comparing *used* and *limit*, so the result is false and the user is refused. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
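To see how the two comparisons disagree in this example, the following standalone sketch just redoes the arithmetic above. It is not CapacityScheduler code; it assumes DominantResourceCalculator-style comparisons (dominant share relative to the reference resource) and assumes the cluster total is simply partition A plus partition B, i.e. (300G memory, 90 vcores).
{code:java}
// Standalone re-run of the numbers from the example above; illustrative only.
public class UserLimitInconsistencyExample {

  // Dominant share of (mem, vcores) relative to a reference resource,
  // mimicking how a dominant-resource comparison works.
  static double share(double mem, double vcores, double refMem, double refVcores) {
    return Math.max(mem / refMem, vcores / refVcores);
  }

  public static void main(String[] args) {
    double[] partitionA = {120, 30};  // partitionResource for partition A
    double[] cluster    = {300, 90};  // assumed clusterResource = A + B
    double[] queueCap   = {75, 25};   // queue Prod capacity on partition A
    double[] consumed   = {90, 10};   // current usage of queue Prod on A

    // computeUserLimit compares against partitionResource: consumed (0.75)
    // is "less than" queueCapacity (0.83), so currentCapacity stays at the
    // queue capacity and, with minimum-user-limit-percent = 100, the user
    // limit comes out around (75G, 25 vcores).
    boolean consumedLessThanCapacity =
        share(consumed[0], consumed[1], partitionA[0], partitionA[1])
            < share(queueCap[0], queueCap[1], partitionA[0], partitionA[1]);

    // canAssignToUser compares against clusterResource instead: used (0.30)
    // is "greater than" the limit (0.28), so no more resources are handed
    // out even though partition A still has 30G memory free.
    boolean usedGreaterThanLimit =
        share(consumed[0], consumed[1], cluster[0], cluster[1])
            > share(queueCap[0], queueCap[1], cluster[0], cluster[1]);

    System.out.println("computeUserLimit sees consumed < capacity: " + consumedLessThanCapacity);
    System.out.println("canAssignToUser  sees used > limit:        " + usedGreaterThanLimit);
  }
}
{code}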
[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213697#comment-17213697 ] Akira Ajisaka commented on YARN-10244: -- +1 for the 003 patch. Thank you [~Steven Rand], [~sunilg], and [~hexiaoqiao]. > backport YARN-9848 to branch-3.2 > > > Key: YARN-10244 > URL: https://issues.apache.org/jira/browse/YARN-10244 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, resourcemanager >Reporter: Steven Rand >Assignee: Steven Rand >Priority: Major > Attachments: YARN-10244-branch-3.2.001.patch, > YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch > > > Backporting YARN-9848 to branch-3.2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org