[jira] [Created] (YARN-5487) Can't kill nodemanager process when running in the foreground
Allen Wittenauer created YARN-5487:
--------------------------------------

             Summary: Can't kill nodemanager process when running in the foreground
                 Key: YARN-5487
                 URL: https://issues.apache.org/jira/browse/YARN-5487
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Allen Wittenauer

It looks like there is a misconfigured signal handler somewhere in the mix. Hitting ctrl-c results in this message:

16/08/08 20:26:37 ERROR nodemanager.NodeManager: RECEIVED SIGNAL 2: SIGINT

... which is then summarily ignored.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
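The failure mode described above — a handler that logs the signal but never terminates the process — is easy to reproduce outside of Hadoop. The snippet below is a minimal Python illustration of the bug pattern and the usual fix (restore the default disposition and re-raise so the default action runs); it is not the NodeManager's actual handler code, which is Java.

```python
import os
import signal

received = []

def logging_only_handler(signum, frame):
    # Bug pattern: log the signal and return. Because this handler
    # replaces the default disposition, SIGINT no longer kills the process.
    received.append(signum)

def logging_then_exit_handler(signum, frame):
    # Fix pattern: log, then restore the default handler and re-raise
    # the signal so the process actually dies on ctrl-c.
    received.append(signum)
    signal.signal(signum, signal.SIG_DFL)
    os.kill(os.getpid(), signum)

# Demonstrate the bug: install the logging-only handler and send SIGINT.
signal.signal(signal.SIGINT, logging_only_handler)
os.kill(os.getpid(), signal.SIGINT)   # delivered and logged... then ignored
assert received == [signal.SIGINT]    # the process is still running
```

The fix variant is defined but deliberately not installed in the demo run, since re-raising with the default disposition would terminate the interpreter.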
Re: [DISCUSS] Release numbering semantics with concurrent (>2) releases [Was Setting JIRA fix versions for 3.0.0 releases]
I think that incompatible API between 3.0.0-alpha and 3.1.0-beta is less confusing than incompatibility between 2.8/2.9 and 2.98.x alphas/2.99.x betas. Why not just follow our previous practice from the beginning of branch-2? We can have 3.0.0-alpha and 3.1.0-alpha/beta, but once we are finalizing our APIs, we should bump the trunk version to 4.x for landing new incompatible changes.

Thanks,

Junping

From: Karthik Kambatla
Sent: Monday, August 08, 2016 6:54 PM
Cc: common-...@hadoop.apache.org; yarn-dev@hadoop.apache.org; hdfs-...@hadoop.apache.org; mapreduce-...@hadoop.apache.org
Subject: Re: [DISCUSS] Release numbering semantics with concurrent (>2) releases [Was Setting JIRA fix versions for 3.0.0 releases]

I like the 3.0.0-alphaX approach, primarily for simpler understanding of compatibility guarantees. Calling 3.0.0 alpha and 3.1.0 beta is confusing because it is not immediately clear that 3.0.0 and 3.1.0 could be incompatible in APIs.

I am open to something like 2.98.x for alphas and 2.99.x for betas leading to a 3.0.0 GA. I have seen other projects use this without causing much confusion.

On Thu, Aug 4, 2016 at 6:01 PM, Konstantin Shvachko wrote:

> On Thu, Aug 4, 2016 at 11:20 AM, Andrew Wang wrote:
> >
> > Hi Konst, thanks for commenting,
> >
> > On Wed, Aug 3, 2016 at 11:29 PM, Konstantin Shvachko <shv.had...@gmail.com> wrote:
> >
> >> 1. I probably missed something, but I didn't get how "alpha"s made their
> >> way into release numbers again. This was discussed on several occasions,
> >> and I thought the common perception was to use just three-level numbers
> >> for release versioning and avoid branding them.
> >> It is particularly confusing to have 3.0.0-alpha1 and 3.0.0-alpha2. What
> >> is alphaX - a fourth level? I think releasing 3.0.0 and setting trunk to
> >> 3.1.0 would be perfectly in line with our current release practices.
> >
> > We discussed release numbering a while ago when discussing the release
> > plan for 3.0.0, and agreed on this scheme. "-alphaX" is essentially a
> > fourth level as you say, but the intent is to only use it (and "-betaX")
> > in the leadup to 3.0.0.
> >
> > The goal here is clarity for end users, since most other enterprise
> > software uses an a.0.0 version to denote the GA of a new major version.
> > Same for a.b.0 for a new minor version, though we haven't talked about
> > that yet. The alphaX and betaX scheme also shares similarity to the
> > release versioning of other enterprise software.
>
> As you remember, we did this (alpha, beta) for Hadoop-2, and I don't think
> it went well with user perception.
> Say, release 2.0.5-alpha turned out to be quite good even though still
> branded "alpha", while 2.2 was not, and was not branded.
> We should move a release to stable when people have run it and agree it
> is GA-worthy. Otherwise you never know.
>
> >> 2. I do not see any confusion with releasing 2.8.0 after 3.0.0.
> >> The release number is not intended to reflect historical release
> >> sequence, but rather the point in the source tree which it was branched
> >> off. So one can release 2.8, 2.9, etc. after or before 3.0.
> >
> > As described earlier in this thread, the issue here is setting the fix
> > versions such that the changelog is a useful diff from a previous
> > version, and also clear about what changes are present in each branch.
> > If we do not order a specific 2.x before 3.0, then we don't know what
> > 2.x to diff from.
>
> So the problem is in determining the latest commit which was not present
> in the last release, when the last release bears a higher number than the
> one being released.
> Interesting problem. I don't have a strong opinion on that. I guess it's
> OK to have overlapping changelogs, as long as we keep following the rule
> that commits should be made to trunk first and then propagated to lower
> branches until the target branch is reached.
>
> >> 3. I agree that the current 3.0.0 branch can be dropped and re-cut. We
> >> may think of another rule that if a release branch is not released in 3
> >> months it should be abandoned. Which is applicable to branch 2.8.0, and
> >> it is too much work syncing it with branch-2.
> >
> > Time-based rules are tough here. I'd prefer we continue to leave this up
> > to release managers. If you think we should recut branch-2.8, I recommend
> > pinging Vinod and discussing on a new thread.
>
> Not recut, but abandon 2.8.0. And Vinod (or anybody who volunteers to RM)
> can recut from the desired point.
> People were committing to branch-2 and branch-2.8 for months, and they
> are out of sync anyway. So what's the point of the extra commit?
> Probably still a different thread.
>
> Thanks,
> --Konst
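The numbering scheme under discussion can be made concrete: treat "-alphaX" and "-betaX" as a pre-release fourth level that sorts before the GA of the same a.b.c version, while plain a.b.c releases order normally. The sketch below is one illustrative way to encode that ordering; it is not tied to any Hadoop release tooling.

```python
import re

# Pre-release stages sort before GA: alpha < beta < (no suffix, i.e. GA).
STAGE_RANK = {"alpha": 0, "beta": 1, None: 2}

def version_key(v):
    """Turn a version like '3.0.0-alpha1' into a sortable tuple."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:-(alpha|beta)(\d+))?", v)
    major, minor, patch, stage, n = m.groups()
    return (int(major), int(minor), int(patch),
            STAGE_RANK[stage], int(n or 0))

releases = ["3.0.0", "2.8.0", "3.0.0-alpha2", "3.0.0-beta1", "3.0.0-alpha1"]
ordered = sorted(releases, key=version_key)
# 2.8.0 sorts before every 3.0.0 pre-release, alphas before betas before GA.
assert ordered == ["2.8.0", "3.0.0-alpha1", "3.0.0-alpha2",
                   "3.0.0-beta1", "3.0.0"]
```

This is the property Karthik is arguing for: a user comparing 3.0.0-alpha1 and 3.0.0 can see at a glance which one is the GA.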
[jira] [Created] (YARN-5486) Update OpportunisticConatinerAllocatioAMService allocate method to handle OPPORTUNISTIC container requests
Arun Suresh created YARN-5486:
---------------------------------

             Summary: Update OpportunisticConatinerAllocatioAMService allocate method to handle OPPORTUNISTIC container requests
                 Key: YARN-5486
                 URL: https://issues.apache.org/jira/browse/YARN-5486
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Arun Suresh
            Assignee: Arun Suresh

YARN-5457 refactors the Distributed Scheduling framework to move the container allocator to yarn-server-common. This JIRA proposes to update the allocate method in the new AM service to use the OpportunisticContainerAllocator to allocate opportunistic containers.
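The change described above amounts to routing requests by execution type inside the AM service's allocate call: OPPORTUNISTIC asks go to the opportunistic allocator, everything else to the regular central-scheduler path. A hypothetical Python sketch of that dispatch follows; the real code is Java inside the AM service, and the dict-based request shape here is purely illustrative.

```python
GUARANTEED, OPPORTUNISTIC = "GUARANTEED", "OPPORTUNISTIC"

def split_by_execution_type(asks):
    """Partition resource requests so OPPORTUNISTIC ones can be handed
    to the opportunistic allocator and the rest to the central scheduler."""
    opportunistic, guaranteed = [], []
    for ask in asks:
        if ask.get("execution_type") == OPPORTUNISTIC:
            opportunistic.append(ask)
        else:
            # Requests without an explicit type default to the
            # guaranteed (centrally scheduled) path.
            guaranteed.append(ask)
    return opportunistic, guaranteed

asks = [
    {"priority": 1, "num_containers": 2, "execution_type": OPPORTUNISTIC},
    {"priority": 0, "num_containers": 1, "execution_type": GUARANTEED},
    {"priority": 2, "num_containers": 4},  # no type: treated as guaranteed
]
opp, guar = split_by_execution_type(asks)
assert len(opp) == 1 and len(guar) == 2
```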
[jira] [Resolved] (YARN-5484) YARN client will still retry for many times on failover even though RM server throw AccessControlException
[ https://issues.apache.org/jira/browse/YARN-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Naganarasimha G R resolved YARN-5484.
-------------------------------------
    Resolution: Duplicate

Yes guys, it's a duplicate; we were discussing the solution and planned to post the same.

> YARN client will still retry for many times on failover even though RM
> server throw AccessControlException
> ---
>
> Key: YARN-5484
> URL: https://issues.apache.org/jira/browse/YARN-5484
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.9.0
> Reporter: Bob.zhao
>
> 1. Enable yarn.acl.enable.
> 2. Set up some queues for users on YARN, for example queue1 only for user1.
> 3. Use user1 to submit an app to queue1; it runs successfully.
> 4. Use user2 to submit an app to queue1; it is not permitted to submit to
> queue1.
> At the RM server side, this throws an IOException
> (ClientRMService.java#submitApplication), which is the parent class of the
> AccessControlException (RMAppManager.java#createAndPopulateNewRMApp). This
> IOException is thrown to the client, which causes the YARN client to fail
> over many times.
> We'd better avoid this behavior: if the client gets a permission denial
> from the server, it should try once and exit, with no need to retry.
> This issue was introduced by YARN-4522.
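The behavior proposed above — retry transient failover errors, but fail fast on permission denials — is a standard retry-policy pattern. Below is a minimal Python sketch of that idea; it is not Hadoop's actual RetryPolicy API, and PermissionError stands in for AccessControlException.

```python
class TransientError(Exception):
    """Stands in for a retriable condition, e.g. RM failover in progress."""

def call_with_retry(fn, max_attempts=30):
    """Retry transient errors, but surface permission denials immediately:
    retrying cannot change an authorization decision."""
    for _ in range(max_attempts):
        try:
            return fn()
        except PermissionError:
            raise                      # fail fast: an ACL denial is permanent
        except TransientError:
            continue                   # transient: try the next attempt
    raise TransientError("gave up after %d attempts" % max_attempts)

attempts = []
def denied():
    attempts.append(1)
    raise PermissionError("user2 cannot submit to queue1")

try:
    call_with_retry(denied)
except PermissionError:
    pass
assert len(attempts) == 1   # tried once and gave up, no 30-attempt loop
```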
Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86
For more details, see https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/

[Aug 7, 2016 9:29:26 PM] (shv) HDFS-10693. metaSave should print blocks, not LightWeightHashSet.

-1 overall

The following subsystems voted -1:
    asflicense unit

The following subsystems voted -1 but were configured to be filtered/ignored:
    cc checkstyle javac javadoc pylint shellcheck shelldocs whitespace

The following subsystems are considered long running (runtime bigger than 1h 0m 0s):
    unit

Specific tests:

    Failed junit tests:
        hadoop.tracing.TestTracing
        hadoop.security.TestRefreshUserMappings
        hadoop.yarn.logaggregation.TestAggregatedLogFormat
        hadoop.yarn.server.nodemanager.containermanager.queuing.TestQueuingContainerManager
        hadoop.yarn.server.applicationhistoryservice.webapp.TestAHSWebServices
        hadoop.yarn.server.TestMiniYarnClusterNodeUtilization
        hadoop.yarn.server.TestContainerManagerSecurity
        hadoop.yarn.client.api.impl.TestYarnClient
        hadoop.mapreduce.v2.hs.server.TestHSAdminServer

    cc:
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/diff-compile-cc-root.txt [4.0K]
    javac:
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/diff-compile-javac-root.txt [172K]
    checkstyle:
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/diff-checkstyle-root.txt [16M]
    pylint:
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/diff-patch-pylint.txt [16K]
    shellcheck:
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/diff-patch-shellcheck.txt [20K]
    shelldocs:
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/diff-patch-shelldocs.txt [16K]
    whitespace:
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/whitespace-eol.txt [12M]
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/whitespace-tabs.txt [1.3M]
    javadoc:
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/diff-javadoc-javadoc-root.txt [2.2M]
    unit:
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt [312K]
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-common.txt [24K]
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt [36K]
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-applicationhistoryservice.txt [12K]
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-tests.txt [268K]
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-client.txt [12K]
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-hs.txt [12K]
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-nativetask.txt [124K]
    asflicense:
        https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/127/artifact/out/patch-asflicense-problems.txt [4.0K]

Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org
[jira] [Resolved] (YARN-5485) YARN client will still retry for many times on failover even though RM server throw AccessControlException(IOException)
[ https://issues.apache.org/jira/browse/YARN-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunil G resolved YARN-5485.
---------------------------
    Resolution: Duplicate

Dup of YARN-5484?

> YARN client will still retry for many times on failover even though RM
> server throw AccessControlException(IOException)
>
> Key: YARN-5485
> URL: https://issues.apache.org/jira/browse/YARN-5485
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.9.0
> Reporter: Bob.zhao
>
> Issue reproduction steps:
> 1. Enable yarn.acl.enable.
> 2. Set up some queues for users on YARN, for example queue1 only for user1.
> 3. Use user1 to submit an app to queue1; it runs successfully.
> 4. Use user2 to submit an app to queue1; it is not permitted to submit to
> queue1.
> At the RM server side, this throws an IOException
> (ClientRMService.java#submitApplication), which is the parent class of the
> AccessControlException (RMAppManager.java#createAndPopulateNewRMApp). This
> IOException is thrown to the client, which causes the YARN client to fail
> over many times.
> We'd better avoid this behavior: if the client gets a permission denial
> from the server, it should try once and exit, with no need to retry.
> This issue was introduced by YARN-4522.
[jira] [Created] (YARN-5485) YARN client will still retry for many times on failover even though RM server throw AccessControlException
Bob.zhao created YARN-5485:
------------------------------

             Summary: YARN client will still retry for many times on failover even though RM server throw AccessControlException
                 Key: YARN-5485
                 URL: https://issues.apache.org/jira/browse/YARN-5485
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.9.0
            Reporter: Bob.zhao

1. Enable yarn.acl.enable.
2. Set up some queues for users on YARN, for example queue1 only for user1.
3. Use user1 to submit an app to queue1; it runs successfully.
4. Use user2 to submit an app to queue1; it is not permitted to submit to queue1.

At the RM server side, this throws an IOException (ClientRMService.java#submitApplication), which is the parent class of the AccessControlException (RMAppManager.java#createAndPopulateNewRMApp). This IOException is thrown to the client, which causes the YARN client to fail over many times.

We'd better avoid this behavior: if the client gets a permission denial from the server, it should try once and exit, with no need to retry.

This issue was introduced by YARN-4522.
[jira] [Created] (YARN-5484) YARN client will still retry for many times on failover even though RM server throw AccessControlException
Bob.zhao created YARN-5484:
------------------------------

             Summary: YARN client will still retry for many times on failover even though RM server throw AccessControlException
                 Key: YARN-5484
                 URL: https://issues.apache.org/jira/browse/YARN-5484
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.9.0
            Reporter: Bob.zhao

1. Enable yarn.acl.enable.
2. Set up some queues for users on YARN, for example queue1 only for user1.
3. Use user1 to submit an app to queue1; it runs successfully.
4. Use user2 to submit an app to queue1; it is not permitted to submit to queue1.

At the RM server side, this throws an IOException (ClientRMService.java#submitApplication), which is the parent class of the AccessControlException (RMAppManager.java#createAndPopulateNewRMApp). This IOException is thrown to the client, which causes the YARN client to fail over many times.

We'd better avoid this behavior: if the client gets a permission denial from the server, it should try once and exit, with no need to retry.

This issue was introduced by YARN-4522.
[jira] [Created] (YARN-5483) Optimize RMAppAttempt#pullJustFinishedContainers
sandflee created YARN-5483:
------------------------------

             Summary: Optimize RMAppAttempt#pullJustFinishedContainers
                 Key: YARN-5483
                 URL: https://issues.apache.org/jira/browse/YARN-5483
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: sandflee
            Assignee: sandflee

With about 1000 apps running on the cluster, jprofiler found that pullJustFinishedContainers costs too much CPU.
[jira] [Resolved] (YARN-5482) ContainerMetric Lead to memory leaks
[ https://issues.apache.org/jira/browse/YARN-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

tangshangwen resolved YARN-5482.
--------------------------------
    Resolution: Duplicate

> ContainerMetric Lead to memory leaks
>
> Key: YARN-5482
> URL: https://issues.apache.org/jira/browse/YARN-5482
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: oom1.png, oom2.png
>
> In our cluster I often see the NodeManager OOM. I dumped the heap file and
> found that ContainerMetric takes up a lot of memory.
> {code}
> export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M
> -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote
> -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution
> -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
> {code}
[jira] [Resolved] (YARN-5482) ContainerMetric Lead to memory leaks
[ https://issues.apache.org/jira/browse/YARN-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

tangshangwen resolved YARN-5482.
--------------------------------
    Resolution: Fixed

> ContainerMetric Lead to memory leaks
>
> Key: YARN-5482
> URL: https://issues.apache.org/jira/browse/YARN-5482
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: tangshangwen
> Assignee: tangshangwen
> Attachments: oom1.png, oom2.png
>
> In our cluster I often see the NodeManager OOM. I dumped the heap file and
> found that ContainerMetric takes up a lot of memory.
> {code}
> export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M
> -XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote
> -Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution
> -XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
> {code}
[jira] [Created] (YARN-5482) ContainerMetric Lead to memory leaks
tangshangwen created YARN-5482:
----------------------------------

             Summary: ContainerMetric Lead to memory leaks
                 Key: YARN-5482
                 URL: https://issues.apache.org/jira/browse/YARN-5482
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.7.1
            Reporter: tangshangwen
            Assignee: tangshangwen

In our cluster I often see the NodeManager OOM. I dumped the heap file and found that ContainerMetric takes up a lot of memory.

{code}
export YARN_NODEMANAGER_OPTS="-Xmx2g -Xms2g -Xmn1g -XX:PermSize=128M
-XX:MaxPermSize=128M -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/data1/yarn-logs/nm_dump.log -Dcom.sun.management.jmxremote
-Xloggc:/data1/yarn-logs/nm_gc.log -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution
-XX:ErrorFile=/data1/yarn-logs/nm_err_pid"
{code}
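The leak reported above follows a common pattern: per-container metrics objects registered in a long-lived registry and never unregistered when the container finishes, so the NodeManager's heap grows with every container it has ever run. A simplified Python sketch of the pattern and the fix follows; it is illustrative only, the real objects being Hadoop's ContainerMetrics and metrics system in Java.

```python
class MetricsRegistry:
    """Long-lived registry; lives as long as the NodeManager process."""
    def __init__(self):
        self.by_container = {}

    def register(self, container_id):
        # Called on container start: allocate a per-container metrics record.
        self.by_container[container_id] = {"cpu": 0, "mem": 0}

    def unregister(self, container_id):
        # The fix: drop per-container metrics once the container is done
        # (possibly after a short delay so the final values get flushed).
        self.by_container.pop(container_id, None)

registry = MetricsRegistry()

# Leak pattern: register on container start, never unregister on finish.
for i in range(10_000):
    registry.register("container_%d" % i)
assert len(registry.by_container) == 10_000   # grows without bound

# With cleanup on completion, the registry stays bounded.
for i in range(10_000):
    registry.unregister("container_%d" % i)
assert len(registry.by_container) == 0
```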