[jira] [Commented] (YARN-10282) CLONE - hadoop-yarn-server-nodemanager build failed: make failed with error code 2
[ https://issues.apache.org/jira/browse/YARN-10282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337038#comment-17337038 ] Wenhui Xu commented on YARN-10282:
--
I have the same problem with 3.3.0 on a Mac. Can anyone help?

Input:
{noformat}
mvn package -Pdist,native -DskipTests -Dmaven.javadoc.skip -e -X
{noformat}

Output:
{noformat}
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:48 min
[INFO] Finished at: 2021-04-30T10:43:40+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.3.0:cmake-compile (cmake-compile) on project hadoop-yarn-server-nodemanager: make failed with error code 2 -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.3.0:cmake-compile (cmake-compile) on project hadoop-yarn-server-nodemanager: make failed with error code 2
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:957)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:289)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:193)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:64)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:564)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
Caused by: org.apache.maven.plugin.MojoExecutionException: make failed with error code 2
    at org.apache.hadoop.maven.plugin.cmakebuilder.CompileMojo.runMake (CompileMojo.java:229)
    at org.apache.hadoop.maven.plugin.cmakebuilder.CompileMojo.execute (CompileMojo.java:98)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:957)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:289)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:193)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:64)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl
{noformat}
[jira] [Commented] (YARN-10571) Refactor dynamic queue handling logic
[ https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335845#comment-17335845 ] Hadoop QA commented on YARN-10571:
--
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 2m 58s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 31m 12s{color} | {color:red}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/946/artifact/out/branch-mvninstall-root.txt{color} | {color:red} root in trunk failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 28s{color} | {color:red}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/946/artifact/out/branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04.txt{color} | {color:red} hadoop-yarn-server-resourcemanager in trunk failed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 27s{color} | {color:red}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/946/artifact/out/branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08.txt{color} | {color:red} hadoop-yarn-server-resourcemanager in trunk failed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 29s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/946/artifact/out/buildtool-branch-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt{color} | {color:orange} The patch fails to run checkstyle in hadoop-yarn-server-resourcemanager {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 31s{color} | {color:red}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/946/artifact/out/branch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt{color} | {color:red} hadoop-yarn-server-resourcemanager in trunk failed. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 1m 41s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 33s{color} | {color:red}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/946/artifact/out/branch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04.txt{color} | {color:red} hadoop-yarn-server-resourcemanager in trunk failed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 30s{color} | {color:red}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/946/artifact/out/branch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08.txt{color} | {color:red} hadoop-yarn-server-resourcemanager in trunk failed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08. {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 3m 17s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. {color} |
| {color:red}-1{color} | {color:red} spotbugs {color} | {color:red} 0m 31s{color} | {color:red}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/946/artifact/out/branch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt{color} | {color:red} hadoop-yarn-server-resourcemanager in trunk failed. {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 22s{color} | {color:red}https://ci-hadoop.apache.org/job/PreCommit-
[jira] [Commented] (YARN-9927) RM multi-thread event processing mechanism
[ https://issues.apache.org/jira/browse/YARN-9927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335804#comment-17335804 ] Eric Badger commented on YARN-9927:
---
{noformat}
+// Test multi thread dispatcher
+conf.setBoolean(YarnConfiguration.
+MULTI_THREAD_DISPATCHER_ENABLED, true);
{noformat}
If this is a feature that is disabled by default, I don't think we should have it enabled by default in all of the RM tests. I would be happier running it as a parameterized test with both the multi-threaded and single-threaded dispatchers. In general I think the patch looks reasonable, but I would like to see testing done to determine whether this makes the problem better or worse. I would expect it to make things better, but until we run some real tests on it, we won't really know. So getting something similar to what [~hcarrot] provided originally would be good. That way we can merge this with confidence.

> RM multi-thread event processing mechanism
> ------------------------------------------
>
> Key: YARN-9927
> URL: https://issues.apache.org/jira/browse/YARN-9927
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn
> Affects Versions: 3.0.0, 2.9.2
> Reporter: hcarrot
> Assignee: Qi Zhu
> Priority: Major
> Attachments: RM multi-thread event processing mechanism.pdf, YARN-9927.001.patch, YARN-9927.002.patch, YARN-9927.003.patch, YARN-9927.004.patch, YARN-9927.005.patch
>
> Recently, we have observed serious event blocking in the RM event dispatcher queue. After analysis of RM event monitoring data and RM event processing logic, we found that:
> 1) environment: a cluster with thousands of nodes
> 2) RMNodeStatusEvent dominates 90% of the time consumed by the RM event scheduler
> 3) meanwhile, RM event processing runs in single-threaded mode, which results in low headroom for the RM event scheduler and thus limits RM performance.
> So we proposed an RM multi-thread event processing mechanism to improve RM performance.
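The parameterized-test idea above can be sketched in plain Java as follows. This is a hypothetical stand-in, not the actual patch: {{FakeConf}} and {{runUnderBothModes}} are illustrative names, standing in for a JUnit {{@Parameterized}} test that would run each RM test once per dispatcher mode instead of hard-coding {{MULTI_THREAD_DISPATCHER_ENABLED}} to true.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for YarnConfiguration: only records the flag under test.
class FakeConf {
    boolean multiThreadDispatcherEnabled;
}

public class DispatcherModeSweep {
    // Runs the same test body once per dispatcher mode, mirroring what a
    // JUnit parameterized test over {false, true} would do.
    static List<Boolean> runUnderBothModes(Consumer<FakeConf> testBody) {
        List<Boolean> modesRun = new ArrayList<>();
        for (boolean multiThread : new boolean[] {false, true}) {
            FakeConf conf = new FakeConf();
            conf.multiThreadDispatcherEnabled = multiThread; // the toggle under test
            testBody.accept(conf); // test body would build an RM from conf and assert behavior
            modesRun.add(multiThread);
        }
        return modesRun;
    }

    public static void main(String[] args) {
        List<Boolean> modes = runUnderBothModes(conf -> { /* RM assertions here */ });
        // Both the single-thread (false) and multi-thread (true) modes ran.
        assert modes.size() == 2 && !modes.get(0) && modes.get(1);
        System.out.println("modes run: " + modes);
    }
}
```

The point of the sweep is that neither mode is privileged: a regression that only shows up with the multi-threaded dispatcher fails the same test that passes in single-threaded mode.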
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10571) Refactor dynamic queue handling logic
[ https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Gyori updated YARN-10571:
Attachment: (was: YARN-10571.003.patch)

> Refactor dynamic queue handling logic
> -------------------------------------
>
> Key: YARN-10571
> URL: https://issues.apache.org/jira/browse/YARN-10571
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Andras Gyori
> Assignee: Andras Gyori
> Priority: Minor
> Attachments: YARN-10571.001.patch, YARN-10571.002.patch, YARN-10571.003.patch
>
> As per YARN-10506 we have introduced another mode for auto queue creation and a new class which handles it. We should move the old managed-queue logic to CSAutoQueueHandler as well, and do additional cleanup regarding queue management.
[jira] [Updated] (YARN-10571) Refactor dynamic queue handling logic
[ https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Gyori updated YARN-10571:
Attachment: YARN-10571.003.patch
[jira] [Commented] (YARN-10760) Number of allocated OPPORTUNISTIC containers can dip below 0
[ https://issues.apache.org/jira/browse/YARN-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335761#comment-17335761 ] Andrew Chung commented on YARN-10760:
-
[~inigoiri] Sure thing!

> Number of allocated OPPORTUNISTIC containers can dip below 0
> ------------------------------------------------------------
>
> Key: YARN-10760
> URL: https://issues.apache.org/jira/browse/YARN-10760
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.2
> Reporter: Andrew Chung
> Assignee: Andrew Chung
> Priority: Minor
>
> {{AbstractYarnScheduler.completedContainers}} can potentially be called from multiple sources, yet it appears that there are scenarios in which the caller does not hold the appropriate lock, which can lead to the count of {{OpportunisticSchedulerMetrics.AllocatedOContainers}} falling below 0.
> To prevent double counting when releasing allocated O containers, a simple fix might be to check if the {{RMContainer}} has already been removed beforehand, though that may not fix the underlying issue that causes the race condition.
> Following is a "capture" of {{OpportunisticSchedulerMetrics.AllocatedOContainers}} falling below 0 via a JMX query:
> {noformat}
> {
>   "name" : "Hadoop:service=ResourceManager,name=OpportunisticSchedulerMetrics",
>   "modelerType" : "OpportunisticSchedulerMetrics",
>   "tag.OpportunisticSchedulerMetrics" : "ResourceManager",
>   "tag.Context" : "yarn",
>   "tag.Hostname" : "",
>   "AllocatedOContainers" : -2716,
>   "AggregateOContainersAllocated" : 306020,
>   "AggregateOContainersReleased" : 308736,
>   "AggregateNodeLocalOContainersAllocated" : 0,
>   "AggregateRackLocalOContainersAllocated" : 0,
>   "AggregateOffSwitchOContainersAllocated" : 306020,
>   "AllocateLatencyOQuantilesNumOps" : 0,
>   "AllocateLatencyOQuantiles50thPercentileTime" : 0,
>   "AllocateLatencyOQuantiles75thPercentileTime" : 0,
>   "AllocateLatencyOQuantiles90thPercentileTime" : 0,
>   "AllocateLatencyOQuantiles95thPercentileTime" : 0,
>   "AllocateLatencyOQuantiles99thPercentileTime" : 0
> }
> {noformat}
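The double-release guard suggested in the YARN-10760 description can be sketched as below. This is a hypothetical stand-in, not Hadoop's actual {{OpportunisticSchedulerMetrics}}: the idea is simply that the metric is decremented only if the container was still tracked, so a second (racing) release of the same container becomes a no-op instead of driving the gauge negative.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical metrics sketch: tracks live opportunistic container IDs and
// guards the AllocatedOContainers-style gauge against double release.
class OMetricsSketch {
    private final AtomicInteger allocatedOContainers = new AtomicInteger();
    // IDs of containers currently counted as allocated.
    private final Set<String> live = ConcurrentHashMap.newKeySet();

    void allocate(String containerId) {
        if (live.add(containerId)) {              // count each container once
            allocatedOContainers.incrementAndGet();
        }
    }

    void release(String containerId) {
        // Decrement only if the container was still tracked; a duplicate
        // release from an unsynchronized caller is ignored.
        if (live.remove(containerId)) {
            allocatedOContainers.decrementAndGet();
        }
    }

    int allocated() {
        return allocatedOContainers.get();
    }
}

public class ReleaseGuardDemo {
    public static void main(String[] args) {
        OMetricsSketch m = new OMetricsSketch();
        m.allocate("container_1_000001");
        m.release("container_1_000001");
        m.release("container_1_000001"); // racing duplicate release
        assert m.allocated() == 0;       // stays at 0 instead of dipping to -1
        System.out.println("allocated = " + m.allocated());
    }
}
```

As the reporter notes, this only masks the symptom; the underlying issue is that {{completedContainers}} can be entered without the scheduler lock, and the duplicate call itself is what a real fix would have to prevent.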
[jira] [Commented] (YARN-10707) Support custom resources in ResourceUtilization, and update Node GPU Utilization to use.
[ https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335726#comment-17335726 ] Eric Badger commented on YARN-10707:
--
Thanks for the updates, [~zhuqi]! +1
I've committed this to trunk (3.4) and branch-3.3. There are conflicts when backporting further back than that.

> Support custom resources in ResourceUtilization, and update Node GPU Utilization to use.
> -----------------------------------------------------------------------------------------
>
> Key: YARN-10707
> URL: https://issues.apache.org/jira/browse/YARN-10707
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Major
> Attachments: YARN-10707.001.patch, YARN-10707.002.patch, YARN-10707.003.patch, YARN-10707.004.patch, YARN-10707.005.patch, YARN-10707.006.patch, YARN-10707.007.patch, YARN-10707.008.patch, YARN-10707.009.patch, YARN-10707.010.patch, YARN-10707.011.patch
>
> Support GPU in ResourceUtilization, and update Node GPU Utilization to use it first.
> This will be very helpful for other use cases around GPU utilization.
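The shape of the change — carrying custom-resource utilization (e.g. GPU) alongside the fixed CPU/memory fields — can be sketched as below. This is a hypothetical illustration; {{UtilizationSketch}} and its members are invented names, not Hadoop's actual {{ResourceUtilization}} API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a ResourceUtilization-like type extended with a
// per-name map of custom-resource utilization (e.g. "yarn.io/gpu").
class UtilizationSketch {
    private final int physicalMemoryMB;
    private final float cpu; // fraction of vcores in use, 0.0-1.0
    // custom resource name -> utilization fraction, 0.0-1.0
    private final Map<String, Float> customResources = new HashMap<>();

    UtilizationSketch(int physicalMemoryMB, float cpu) {
        this.physicalMemoryMB = physicalMemoryMB;
        this.cpu = cpu;
    }

    void setCustomResource(String name, float utilization) {
        customResources.put(name, utilization);
    }

    float getCustomResource(String name) {
        // default to 0 for resources the node never reported
        return customResources.getOrDefault(name, 0.0f);
    }
}

public class CustomUtilizationDemo {
    public static void main(String[] args) {
        UtilizationSketch node = new UtilizationSketch(1024, 0.5f);
        node.setCustomResource("yarn.io/gpu", 0.75f);
        assert node.getCustomResource("yarn.io/gpu") == 0.75f;
        assert node.getCustomResource("fpga") == 0.0f; // unreported resource
    }
}
```

Keeping custom resources in a map rather than adding a dedicated GPU field is what makes the same mechanism reusable for other resource types, which is the "other use cases" point in the issue description.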
[jira] [Updated] (YARN-10707) Support custom resources in ResourceUtilization, and update Node GPU Utilization to use.
[ https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10707:
---
Fix Version/s: 3.3.1
               3.4.0
[jira] [Commented] (YARN-10760) Number of allocated OPPORTUNISTIC containers can dip below 0
[ https://issues.apache.org/jira/browse/YARN-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335605#comment-17335605 ] Íñigo Goiri commented on YARN-10760:
Thanks [~afchung90], could you create a PR for this?
[jira] [Assigned] (YARN-10760) Number of allocated OPPORTUNISTIC containers can dip below 0
[ https://issues.apache.org/jira/browse/YARN-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri reassigned YARN-10760:
--
Assignee: Andrew Chung
[jira] [Updated] (YARN-10760) Number of allocated OPPORTUNISTIC containers can dip below 0
[ https://issues.apache.org/jira/browse/YARN-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Chung updated YARN-10760:
Affects Version/s: 3.1.2
[jira] [Updated] (YARN-10760) Number of allocated OPPORTUNISTIC containers can dip below 0
[ https://issues.apache.org/jira/browse/YARN-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Chung updated YARN-10760:
Description:
{{AbstractYarnScheduler.completedContainers}} can potentially be called from multiple sources, yet it appears that there are scenarios in which the caller does not hold the appropriate lock, which can lead to the count of {{OpportunisticSchedulerMetrics.AllocatedOContainers}} falling below 0.
To prevent double counting when releasing allocated O containers, a simple fix might be to check if the {{RMContainer}} has already been removed beforehand, though that may not fix the underlying issue that causes the race condition.
Following is a "capture" of {{OpportunisticSchedulerMetrics.AllocatedOContainers}} falling below 0 via a JMX query:
{noformat}
{
  "name" : "Hadoop:service=ResourceManager,name=OpportunisticSchedulerMetrics",
  "modelerType" : "OpportunisticSchedulerMetrics",
  "tag.OpportunisticSchedulerMetrics" : "ResourceManager",
  "tag.Context" : "yarn",
  "tag.Hostname" : "",
  "AllocatedOContainers" : -2716,
  "AggregateOContainersAllocated" : 306020,
  "AggregateOContainersReleased" : 308736,
  "AggregateNodeLocalOContainersAllocated" : 0,
  "AggregateRackLocalOContainersAllocated" : 0,
  "AggregateOffSwitchOContainersAllocated" : 306020,
  "AllocateLatencyOQuantilesNumOps" : 0,
  "AllocateLatencyOQuantiles50thPercentileTime" : 0,
  "AllocateLatencyOQuantiles75thPercentileTime" : 0,
  "AllocateLatencyOQuantiles90thPercentileTime" : 0,
  "AllocateLatencyOQuantiles95thPercentileTime" : 0,
  "AllocateLatencyOQuantiles99thPercentileTime" : 0
}
{noformat}

was: (the identical description, except it read "Following is a screenshot of" instead of 'Following is a "capture" of', followed by the same {noformat} JMX capture)
[jira] [Created] (YARN-10760) Number of allocated OPPORTUNISTIC containers can dip below 0
Andrew Chung created YARN-10760:
---
Summary: Number of allocated OPPORTUNISTIC containers can dip below 0
Key: YARN-10760
URL: https://issues.apache.org/jira/browse/YARN-10760
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Andrew Chung

{{AbstractYarnScheduler.completedContainers}} can potentially be called from multiple sources, yet it appears that there are scenarios in which the caller does not hold the appropriate lock, which can lead to the count of {{OpportunisticSchedulerMetrics.AllocatedOContainers}} falling below 0.
To prevent double counting when releasing allocated O containers, a simple fix might be to check if the {{RMContainer}} has already been removed beforehand, though that may not fix the underlying issue that causes the race condition.
Following is a screenshot of {{OpportunisticSchedulerMetrics.AllocatedOContainers}} falling below 0 via a JMX query:
{noformat}
{
  "name" : "Hadoop:service=ResourceManager,name=OpportunisticSchedulerMetrics",
  "modelerType" : "OpportunisticSchedulerMetrics",
  "tag.OpportunisticSchedulerMetrics" : "ResourceManager",
  "tag.Context" : "yarn",
  "tag.Hostname" : "",
  "AllocatedOContainers" : -2716,
  "AggregateOContainersAllocated" : 306020,
  "AggregateOContainersReleased" : 308736,
  "AggregateNodeLocalOContainersAllocated" : 0,
  "AggregateRackLocalOContainersAllocated" : 0,
  "AggregateOffSwitchOContainersAllocated" : 306020,
  "AllocateLatencyOQuantilesNumOps" : 0,
  "AllocateLatencyOQuantiles50thPercentileTime" : 0,
  "AllocateLatencyOQuantiles75thPercentileTime" : 0,
  "AllocateLatencyOQuantiles90thPercentileTime" : 0,
  "AllocateLatencyOQuantiles95thPercentileTime" : 0,
  "AllocateLatencyOQuantiles99thPercentileTime" : 0
}
{noformat}
[jira] [Commented] (YARN-10745) Change Log level from info to debug for few logs and remove unnecessary debuglog checks
[ https://issues.apache.org/jira/browse/YARN-10745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335401#comment-17335401 ] Bilwa S T commented on YARN-10745: -- Hi [~dmmkr] Thanks for the patch. I have a few minor comments:
* In ProportionalCapacityPreemptionPolicy.java the LOG.isDebugEnabled() check can be removed for the below log, since parameterized logging does no formatting work unless debug is enabled:
{quote} LOG.debug("Send to scheduler: in app={} " + "#containers-to-be-preemptionCandidates={}", appAttemptId, e.getValue().size()); {quote}
* Why do we need the LOG.isDebugEnabled() check in AsyncDispatcher.java?

A few suggestions:
* In NodesListManager.java we can print the below log only if at least one of the two sets is non-empty:
{quote} LOG.info("hostsReader include:\{" + StringUtils.join(",", hostsReader.getHosts()) + "} exclude:{" + StringUtils.join(",", hostsReader.getExcludedHosts()) + "}"); {quote}
> Change Log level from info to debug for few logs and remove unnecessary
> debuglog checks
> ---
>
> Key: YARN-10745
> URL: https://issues.apache.org/jira/browse/YARN-10745
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: D M Murali Krishna Reddy
> Assignee: D M Murali Krishna Reddy
> Priority: Minor
> Attachments: YARN-10745.001.patch
>
> Change the info log level to debug for a few logs so that the load on the logger decreases in large clusters and performance improves.
> Remove the unnecessary isDebugEnabled() checks for printing strings that involve no string concatenation.
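To illustrate why the isDebugEnabled() guard is redundant for parameterized logging, here is a minimal stand-in for an SLF4J-style logger (the MiniLogger class is hypothetical, not the real SLF4J implementation): the {} placeholders are substituted only after the level check passes, so a disabled debug call formats nothing. The guard stays useful only when computing the arguments themselves is expensive, e.g. a StringUtils.join over a large set.

```java
// Hypothetical minimal logger mimicking SLF4J's parameterized API.
// It demonstrates that argument formatting is skipped entirely when
// the debug level is disabled, making an explicit guard unnecessary.
class MiniLogger {
    private final boolean debugEnabled;
    int formatCalls = 0; // counts how often arguments were actually formatted

    MiniLogger(boolean debugEnabled) {
        this.debugEnabled = debugEnabled;
    }

    void debug(String template, Object... args) {
        if (!debugEnabled) {
            return; // no substitution, no concatenation, near-zero cost
        }
        formatCalls++;
        String msg = template;
        for (Object a : args) {
            msg = msg.replaceFirst("\\{\\}", String.valueOf(a));
        }
        System.out.println(msg);
    }
}
```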
[jira] [Resolved] (YARN-10505) Extend the maximum-capacity property to react to weight mode changes
[ https://issues.apache.org/jira/browse/YARN-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Gyori resolved YARN-10505. - Resolution: Duplicate
> Extend the maximum-capacity property to react to weight mode changes
>
> Key: YARN-10505
> URL: https://issues.apache.org/jira/browse/YARN-10505
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Benjamin Teke
> Priority: Major
>
> The property root.users.maximum-capacity could mean any of the following:
> * Relative percentage: maximum capacity relative to its parent. If it is set to 50, the capacity is capped with respect to the parent.
> * Absolute percentage: maximum capacity expressed as a percentage of the overall cluster capacity.
> * Percentages of different resource types: this would refer to vCores, memory, GPU, etc. Similar to the single percentage value, this could mean either a percentage of the parent or a percentage of the overall cluster resources.
> * Absolute limit: explicit definition of vCores and memory, e.g. vcores=20, memory-mb=16384.
>
> Note that Fair Scheduler supports the following settings:
> * Single percentage (absolute)
> * Two percentages (absolute)
> * Absolute resources
>
> It is recommended that all three formats be supported for maximum-capacity after introducing weight mode. The final form of the configuration could, for example, look like this:
> root.users.maximum-capacity = 100% - single percentage
> root.users.maximum-capacity = (vcores=100%, memory-mb=100%) - two percentages
> root.users.maximum-capacity = (vcores=10, memory-mb=1mb) - absolute
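For illustration, the three proposed value forms could be recognized by a small parser like the sketch below. This is hypothetical code under the syntax proposed in the description (single percentage, parenthesized per-resource percentages, parenthesized absolute resources); it is not CapacityScheduler code and the class name is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: split a proposed maximum-capacity value into
// resource-name -> value pairs. A bare value (no parentheses) is the
// single-percentage form; a parenthesized list names each resource.
class MaxCapacityParser {
    static Map<String, String> parse(String value) {
        Map<String, String> out = new LinkedHashMap<>();
        String v = value.trim();
        if (v.startsWith("(") && v.endsWith(")")) {
            // e.g. "(vcores=100%, memory-mb=100%)" or "(vcores=10, memory-mb=16384)"
            for (String pair : v.substring(1, v.length() - 1).split(",")) {
                String[] kv = pair.trim().split("=", 2);
                out.put(kv[0].trim(), kv[1].trim());
            }
        } else {
            // e.g. "100%" - single percentage applied to all resources
            out.put("percentage", v);
        }
        return out;
    }
}
```

Whether a percentage is relative to the parent or to the whole cluster would still have to be decided separately, which is exactly the ambiguity the issue describes.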
[jira] [Commented] (YARN-10505) Extend the maximum-capacity property to react to weight mode changes
[ https://issues.apache.org/jira/browse/YARN-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335286#comment-17335286 ] Andras Gyori commented on YARN-10505: - This will be covered in YARN-9936. Closing it.
> Extend the maximum-capacity property to react to weight mode changes
>
> Key: YARN-10505
> URL: https://issues.apache.org/jira/browse/YARN-10505
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Benjamin Teke
> Priority: Major
[jira] [Commented] (YARN-9443) Fast RM Failover using Ratis (Raft protocol)
[ https://issues.apache.org/jira/browse/YARN-9443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335205#comment-17335205 ] Qi Zhu commented on YARN-9443: -- [~prabhujoseph] [~ztang] [~ebadger] [~epayne] Is this still in progress? Currently the state store keeps its data in ZooKeeper, which does not perform well in large clusters. YARN-5123 uses an SQL-based store instead, but that is still not hot HA in the way the HDFS NameNode is. If we want hot HA for the ResourceManager, using Ratis (Raft) to keep the state consistent in HA mode is a very good choice: the active RM's state stays consistent with the standby's through Raft log commits, so when the standby transitions to active it does not need to fence and reload a large state from ZooKeeper. Thanks.
> Fast RM Failover using Ratis (Raft protocol)
>
> Key: YARN-9443
> URL: https://issues.apache.org/jira/browse/YARN-9443
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: resourcemanager
> Affects Versions: 3.2.0
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
> Priority: Major
>
> During failover, the standby RM lags because it has to recover state from the ZooKeeper / FileSystem state store. RM HA using Ratis (Raft protocol) can achieve fast failover, as all RMs are already in sync. This is used by Ozone - HDDS-505.
>
> cc [~nandakumar131]
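The hot-standby idea above can be illustrated with a toy replicated state machine. This is a conceptual sketch only (it uses none of the Ratis API, and all names are made up): if both RMs apply the same committed log entries, which is what Raft guarantees, the standby's state already equals the active's at failover, so no state reload from ZooKeeper is needed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of the Raft replicated-state-machine model:
// every replica applies the committed log in order, so all replicas
// converge to the same state without copying state from each other.
class ReplicatedStateMachine {
    final Map<String, String> state = new HashMap<>();

    // Each entry is a {key, value} pair standing in for an RM state
    // mutation (e.g. an application's lifecycle transition).
    void apply(List<String[]> committedLog) {
        for (String[] entry : committedLog) {
            state.put(entry[0], entry[1]);
        }
    }
}
```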