[jira] [Commented] (YARN-4434) NodeManager Disk Checker parameter documentation is not correct
[ https://issues.apache.org/jira/browse/YARN-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048225#comment-15048225 ] ASF GitHub Bot commented on YARN-4434: -- Github user aajisaka commented on the pull request: https://github.com/apache/hadoop/pull/62#issuecomment-163142562 Thank you for the pull request. I reviewed this patch (A) and the other patch attached to the YARN-4434 JIRA (B), and decided to commit patch (B) because it also replaces "i.e. the entire disk" with "i.e. 90% of the disk". > NodeManager Disk Checker parameter documentation is not correct > --- > > Key: YARN-4434 > URL: https://issues.apache.org/jira/browse/YARN-4434 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, nodemanager >Affects Versions: 2.6.0, 2.7.1 >Reporter: Takashi Ohnishi >Assignee: Weiwei Yang >Priority: Minor > Fix For: 2.8.0, 2.6.3, 2.7.3 > > Attachments: YARN-4434.001.patch, YARN-4434.branch-2.6.patch > > > In the description of > yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, > it says > {noformat} > The default value is 100 i.e. the entire disk can be used. > {noformat} > But, in yarn-default.xml and source code, the default value is 90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
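For readers who want to see the corrected behavior in code, here is a minimal, self-contained sketch (not part of the committed patch) that reads the property named in the description with the 90% default discussed above. It assumes only the stock Hadoop {{Configuration}} API; the class name and the explicit default are illustrative.

{code:title=DiskUtilizationDefaultCheck.java}
import org.apache.hadoop.conf.Configuration;

public class DiskUtilizationDefaultCheck {
  // Property name copied verbatim from the JIRA description.
  private static final String MAX_DISK_UTILIZATION_PCT =
      "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage";

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // When yarn-site.xml does not override the value, the effective default is 90,
    // i.e. a local dir is marked bad once the disk is more than 90% full, not 100%.
    float maxUtilization = conf.getFloat(MAX_DISK_UTILIZATION_PCT, 90.0f);
    System.out.println("Disk marked unhealthy above " + maxUtilization + "% utilization");
  }
}
{code}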
[jira] [Commented] (YARN-4434) NodeManager Disk Checker parameter documentation is not correct
[ https://issues.apache.org/jira/browse/YARN-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048226#comment-15048226 ] ASF GitHub Bot commented on YARN-4434: -- Github user aajisaka commented on the pull request: https://github.com/apache/hadoop/pull/62#issuecomment-163142634 I've committed the patch (B), so would you close this pull request? > NodeManager Disk Checker parameter documentation is not correct > --- > > Key: YARN-4434 > URL: https://issues.apache.org/jira/browse/YARN-4434 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, nodemanager >Affects Versions: 2.6.0, 2.7.1 >Reporter: Takashi Ohnishi >Assignee: Weiwei Yang >Priority: Minor > Fix For: 2.8.0, 2.6.3, 2.7.3 > > Attachments: YARN-4434.001.patch, YARN-4434.branch-2.6.patch > > > In the description of > yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, > it says > {noformat} > The default value is 100 i.e. the entire disk can be used. > {noformat} > But, in yarn-default.xml and source code, the default value is 90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-4301: - Assignee: Akihiro Suda > NM disk health checker should have a timeout > > > Key: YARN-4301 > URL: https://issues.apache.org/jira/browse/YARN-4301 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Akihiro Suda >Assignee: Akihiro Suda > Attachments: YARN-4301-1.patch, YARN-4301-2.patch, > concept-async-diskchecker.txt > > > The disk health checker [verifies a > disk|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L371-L385] > by executing {{mkdir}} and {{rmdir}} periodically. > If these operations does not return in a moderate timeout, the disk should be > marked bad, and thus {{nodeInfo.nodeHealthy}} should flip to {{false}}. > I confirmed that current YARN does not have an implicit timeout (on JDK7, > Linux 4.2, ext4) using [Earthquake|https://github.com/osrg/earthquake], our > fault injector for distributed systems. > (I'll introduce the reproduction script in a while) > I consider we can fix this issue by making > [{{NodeHealthCheckerServer.isHealthy()}}|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java#L69-L73] > return {{false}} if the value of {{this.getLastHealthReportTime()}} is too > old. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048221#comment-15048221 ] Tsuyoshi Ozawa commented on YARN-4301: -- [~suda] thank you for the pointer. I have some comments about the v2 patch - could you update it?
1. About the synchronization of DirectoryCollection, I see the point you mentioned. The change, however, introduces a race condition among the state in the class (localDirs, fullDirs, errorDirs, and numFailures) - e.g. {{DirectoryCollection.concat(errorDirs, fullDirs)}}, {{createNonExistentDirs}} and other functions cannot work correctly without synchronization. I think the root cause of the problem is calling {{DC.testDirs}} while holding the lock in {{DC.checkDirs}}. How about releasing the lock before calling {{testDirs}} and reacquiring it after {{testDirs}} returns?
{quote} synchronized DC.getFailedDirs() can be blocked by synchronized DC.checkDirs(), when File.mkdir() (called from DC.checkDirs(), via DC.testDirs()) does not return in a moderate timeout. Hence NodeHealthCheckerServer.isHealthy() gets also blocked. So I would like to make DC.getXXXs unsynchronized. {quote}
2. If the thread is preempted by the OS and moves to another CPU in a multicore environment, the gap can be negative. Hence I prefer not to abort the NodeManager here.
{code:title=NodeHealthCheckerService.java}
+long diskCheckTime = dirsHandler.getLastDisksCheckTime();
+long now = System.currentTimeMillis();
+long gap = now - diskCheckTime;
+if (gap < 0) {
+  throw new AssertionError("implementation error - now=" + now
+      + ", diskCheckTime=" + diskCheckTime);
+}
{code}
3. Please move the configuration validation to serviceInit to avoid aborting at runtime.
{code:title=NodeHealthCheckerService.java}
+long allowedGap = this.diskHealthCheckInterval + this.diskHealthCheckTimeout;
+if (allowedGap <= 0) {
+  throw new AssertionError("implementation error - interval=" + this.diskHealthCheckInterval
+      + ", timeout=" + this.diskHealthCheckTimeout);
+}
{code}
> NM disk health checker should have a timeout > > > Key: YARN-4301 > URL: https://issues.apache.org/jira/browse/YARN-4301 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Akihiro Suda > Attachments: YARN-4301-1.patch, YARN-4301-2.patch, > concept-async-diskchecker.txt > > > The disk health checker [verifies a > disk|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L371-L385] > by executing {{mkdir}} and {{rmdir}} periodically. > If these operations does not return in a moderate timeout, the disk should be > marked bad, and thus {{nodeInfo.nodeHealthy}} should flip to {{false}}. > I confirmed that current YARN does not have an implicit timeout (on JDK7, > Linux 4.2, ext4) using [Earthquake|https://github.com/osrg/earthquake], our > fault injector for distributed systems. > (I'll introduce the reproduction script in a while) > I consider we can fix this issue by making > [{{NodeHealthCheckerServer.isHealthy()}}|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java#L69-L73] > return {{false}} if the value of {{this.getLastHealthReportTime()}} is too > old. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
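To make the timeout idea above concrete, here is a hedged, self-contained sketch of the staleness check being discussed. It is not the v2 patch itself; the names (interval, timeout, last disk-check timestamp) are borrowed from the quoted snippets rather than from committed code. It validates the configuration once up front (the analogue of serviceInit) and clamps a negative gap instead of aborting the NodeManager, per comments 2 and 3.

{code:title=DiskCheckStalenessGuard.java}
/**
 * Hedged sketch of the timeout idea discussed above, not the committed patch.
 * Names mirror the snippets quoted in the comment and are assumptions about
 * the surrounding NodeManager code.
 */
public class DiskCheckStalenessGuard {
  private final long diskHealthCheckIntervalMs; // time between periodic disk checks
  private final long diskHealthCheckTimeoutMs;  // extra time a single check may take

  public DiskCheckStalenessGuard(long intervalMs, long timeoutMs) {
    // Validate once, up front (the analogue of serviceInit), instead of
    // throwing on every health query at runtime.
    if (intervalMs <= 0 || timeoutMs < 0) {
      throw new IllegalArgumentException(
          "interval=" + intervalMs + ", timeout=" + timeoutMs);
    }
    this.diskHealthCheckIntervalMs = intervalMs;
    this.diskHealthCheckTimeoutMs = timeoutMs;
  }

  /** Returns false when the last disk check is too old, i.e. a check is likely hung. */
  public boolean isRecentEnough(long lastDisksCheckTimeMs) {
    long gap = System.currentTimeMillis() - lastDisksCheckTimeMs;
    // The gap can come out negative when timestamps are taken on different CPUs;
    // clamp it rather than aborting the NodeManager.
    gap = Math.max(0L, gap);
    return gap <= diskHealthCheckIntervalMs + diskHealthCheckTimeoutMs;
  }

  public static void main(String[] args) {
    DiskCheckStalenessGuard guard = new DiskCheckStalenessGuard(2 * 60 * 1000L, 30 * 1000L);
    // A check that finished one minute ago is fine; one from an hour ago is not.
    System.out.println(guard.isRecentEnough(System.currentTimeMillis() - 60 * 1000L));
    System.out.println(guard.isRecentEnough(System.currentTimeMillis() - 60 * 60 * 1000L));
  }
}
{code}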
[jira] [Commented] (YARN-4421) Remove dead code in RmAppImpl.RMAppRecoveredTransition
[ https://issues.apache.org/jira/browse/YARN-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048215#comment-15048215 ] Hudson commented on YARN-4421: -- FAILURE: Integrated in Hadoop-trunk-Commit #8946 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8946/]) YARN-4421. Remove dead code in RmAppImpl.RMAppRecoveredTransition. (rohithsharmaks: rev a5e2e1ecb06a3942903cb79f61f0f4bb02480f19) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/CHANGES.txt > Remove dead code in RmAppImpl.RMAppRecoveredTransition > -- > > Key: YARN-4421 > URL: https://issues.apache.org/jira/browse/YARN-4421 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Trivial > Fix For: 2.8.0 > > Attachments: YARN-4421.001.patch > > > The {{transition()}} method contains the following: > {code} > // Last attempt is in final state, return ACCEPTED waiting for last > // RMAppAttempt to send finished or failed event back. > if (app.currentAttempt != null > && (app.currentAttempt.getState() == RMAppAttemptState.KILLED > || app.currentAttempt.getState() == RMAppAttemptState.FINISHED > || (app.currentAttempt.getState() == RMAppAttemptState.FAILED > && app.getNumFailedAppAttempts() == app.maxAppAttempts))) { > return RMAppState.ACCEPTED; > } > // YARN-1507 is saving the application state after the application is > // accepted. So after YARN-1507, an app is saved meaning it is accepted. > // Thus we return ACCECPTED state on recovery. > return RMAppState.ACCEPTED; > {code} > The {{if}} statement is fully redundant and can be eliminated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4434) NodeManager Disk Checker parameter documentation is not correct
[ https://issues.apache.org/jira/browse/YARN-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-4434: Attachment: YARN-4434.branch-2.6.patch I had to rebase the patch for branch-2.6. Attaching the rebased patch. > NodeManager Disk Checker parameter documentation is not correct > --- > > Key: YARN-4434 > URL: https://issues.apache.org/jira/browse/YARN-4434 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, nodemanager >Affects Versions: 2.6.0, 2.7.1 >Reporter: Takashi Ohnishi >Assignee: Weiwei Yang >Priority: Minor > Attachments: YARN-4434.001.patch, YARN-4434.branch-2.6.patch > > > In the description of > yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, > it says > {noformat} > The default value is 100 i.e. the entire disk can be used. > {noformat} > But, in yarn-default.xml and source code, the default value is 90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4434) NodeManager Disk Checker parameter documentation is not correct
[ https://issues.apache.org/jira/browse/YARN-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-4434: Affects Version/s: 2.6.0 Labels: (was: documentation) Hadoop Flags: Reviewed Component/s: documentation > NodeManager Disk Checker parameter documentation is not correct > --- > > Key: YARN-4434 > URL: https://issues.apache.org/jira/browse/YARN-4434 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, nodemanager >Affects Versions: 2.6.0, 2.7.1 >Reporter: Takashi Ohnishi >Assignee: Weiwei Yang >Priority: Minor > Attachments: YARN-4434.001.patch > > > In the description of > yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, > it says > {noformat} > The default value is 100 i.e. the entire disk can be used. > {noformat} > But, in yarn-default.xml and source code, the default value is 90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4434) NodeManager Disk Checker parameter documentation is not correct
[ https://issues.apache.org/jira/browse/YARN-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048194#comment-15048194 ] Akira AJISAKA commented on YARN-4434: - Thanks [~bwtakacy] and [~cheersyang]. I'm +1 for Weiwei's patch. Committing this. > NodeManager Disk Checker parameter documentation is not correct > --- > > Key: YARN-4434 > URL: https://issues.apache.org/jira/browse/YARN-4434 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.7.1 >Reporter: Takashi Ohnishi >Assignee: Weiwei Yang >Priority: Minor > Labels: documentation > Attachments: YARN-4434.001.patch > > > In the description of > yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, > it says > {noformat} > The default value is 100 i.e. the entire disk can be used. > {noformat} > But, in yarn-default.xml and source code, the default value is 90. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4421) Remove dead code in RmAppImpl.RMAppRecoveredTransition
[ https://issues.apache.org/jira/browse/YARN-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4421: Priority: Trivial (was: Minor) Issue Type: Bug (was: Improvement) > Remove dead code in RmAppImpl.RMAppRecoveredTransition > -- > > Key: YARN-4421 > URL: https://issues.apache.org/jira/browse/YARN-4421 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Trivial > Fix For: 2.8.0 > > Attachments: YARN-4421.001.patch > > > The {{transition()}} method contains the following: > {code} > // Last attempt is in final state, return ACCEPTED waiting for last > // RMAppAttempt to send finished or failed event back. > if (app.currentAttempt != null > && (app.currentAttempt.getState() == RMAppAttemptState.KILLED > || app.currentAttempt.getState() == RMAppAttemptState.FINISHED > || (app.currentAttempt.getState() == RMAppAttemptState.FAILED > && app.getNumFailedAppAttempts() == app.maxAppAttempts))) { > return RMAppState.ACCEPTED; > } > // YARN-1507 is saving the application state after the application is > // accepted. So after YARN-1507, an app is saved meaning it is accepted. > // Thus we return ACCECPTED state on recovery. > return RMAppState.ACCEPTED; > {code} > The {{if}} statement is fully redundant and can be eliminated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4356) ensure the timeline service v.2 is disabled cleanly and has no impact when it's turned off
[ https://issues.apache.org/jira/browse/YARN-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048106#comment-15048106 ] Li Lu commented on YARN-4356: - Latest patch LGTM. +1 pending Jenkins. I'll wait for one more day and if there's no objection I'll commit it. > ensure the timeline service v.2 is disabled cleanly and has no impact when > it's turned off > -- > > Key: YARN-4356 > URL: https://issues.apache.org/jira/browse/YARN-4356 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: YARN-4356-feature-YARN-2928.002.patch, > YARN-4356-feature-YARN-2928.003.patch, YARN-4356-feature-YARN-2928.004.patch, > YARN-4356-feature-YARN-2928.poc.001.patch > > > For us to be able to merge the first milestone drop to trunk, we want to > ensure that once disabled the timeline service v.2 has no impact from the > server side to the client side. If the timeline service is not enabled, no > action should be done. If v.1 is enabled but not v.2, v.1 should behave the > same as it does before the merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4421) Remove dead code in RmAppImpl.RMAppRecoveredTransition
[ https://issues.apache.org/jira/browse/YARN-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048093#comment-15048093 ] Rohith Sharma K S commented on YARN-4421: - Initially, in the RM restart feature, there was code between those two lines that handled some functionality. Later on, because of improvements and bug fixes, that code was removed, and what is left now is dead code. It can be removed now. > Remove dead code in RmAppImpl.RMAppRecoveredTransition > -- > > Key: YARN-4421 > URL: https://issues.apache.org/jira/browse/YARN-4421 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Minor > Attachments: YARN-4421.001.patch > > > The {{transition()}} method contains the following: > {code} > // Last attempt is in final state, return ACCEPTED waiting for last > // RMAppAttempt to send finished or failed event back. > if (app.currentAttempt != null > && (app.currentAttempt.getState() == RMAppAttemptState.KILLED > || app.currentAttempt.getState() == RMAppAttemptState.FINISHED > || (app.currentAttempt.getState() == RMAppAttemptState.FAILED > && app.getNumFailedAppAttempts() == app.maxAppAttempts))) { > return RMAppState.ACCEPTED; > } > // YARN-1507 is saving the application state after the application is > // accepted. So after YARN-1507, an app is saved meaning it is accepted. > // Thus we return ACCECPTED state on recovery. > return RMAppState.ACCEPTED; > {code} > The {{if}} statement is fully redundant and can be eliminated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
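For clarity, here is a tiny self-contained sketch of why the quoted {{if}} is dead code: the guarded return and the fall-through return yield the same value, so dropping the guard cannot change behavior. The names below are stand-ins for illustration, not the actual RMAppImpl code or the YARN-4421 patch.

{code:title=DeadBranchDemo.java}
public class DeadBranchDemo {
  enum RMAppState { ACCEPTED }

  // Before: the guard and the fall-through both yield ACCEPTED.
  static RMAppState transitionBefore(boolean lastAttemptInFinalState) {
    if (lastAttemptInFinalState) {
      return RMAppState.ACCEPTED;
    }
    return RMAppState.ACCEPTED;
  }

  // After: the guard is gone; behavior is identical for every input.
  static RMAppState transitionAfter() {
    return RMAppState.ACCEPTED;
  }

  public static void main(String[] args) {
    System.out.println(transitionBefore(true) == transitionAfter());   // true
    System.out.println(transitionBefore(false) == transitionAfter());  // true
  }
}
{code}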
[jira] [Commented] (YARN-4431) Not necessary to do unRegisterNM() if NM get stop due to failed to connect to RM
[ https://issues.apache.org/jira/browse/YARN-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048088#comment-15048088 ] Hudson commented on YARN-4431: -- FAILURE: Integrated in Hadoop-trunk-Commit #8945 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8945/]) YARN-4431. Not necessary to do unRegisterNM() if NM get stop due to (rohithsharmaks: rev 15c3e7ffe3d1c57ad36afd993f09fc47889c93bd) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java > Not necessary to do unRegisterNM() if NM get stop due to failed to connect to > RM > > > Key: YARN-4431 > URL: https://issues.apache.org/jira/browse/YARN-4431 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du > Fix For: 2.8.0 > > Attachments: YARN-4431.patch > > > {noformat} > 2015-12-07 12:16:57,873 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 8 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 12:16:58,874 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 12:16:58,876 WARN > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: > Unregistration of the Node 10.200.10.53:25454 failed. > java.net.ConnectException: Call From jduMBP.local/10.200.10.53 to > 0.0.0.0:8031 failed on connection exception: java.net.ConnectException: > Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused > at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown > Source) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1385) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > at com.sun.proxy.$Proxy74.unRegisterNodeManager(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.unRegisterNodeManager(ResourceTrackerPBClientImpl.java:98) > at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:255) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) > at com.sun.proxy.$Proxy75.unRegisterNodeManager(Unknown Source) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.unRegisterNM(NodeStatusUpdaterImpl.java:267) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStop(NodeStatusUpdaterImpl.java:245) > at > org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) > at > 
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) > at > org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) > at > org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:377) > {noformat} > If RM down for some reason, NM's NodeStatusUpdaterImpl will retry the > connection with proper retry policy. After retry the maximum times (15 > minutes by default), it will send NodeManagerEventType.SHUTDOWN to shutdown > NM. But NM shutdown will call NodeStatusUpdaterImpl.serviceStop() which will > call unRegisterNM() to unregister NM from RM and get retry again (another 15 > minutes). This is completely unnecessary and we should skip unRegisterNM when > NM get shutdown because of connection issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
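The general shape of the fix described above can be sketched as follows: remember why the NM is stopping and skip the unregister RPC when the shutdown was caused by an unreachable RM. This is only an illustration of the idea with made-up class and method names, not the contents of the actual YARN-4431.patch.

{code:title=UnregisterOnStopGuard.java}
/**
 * Hedged sketch of the idea in the description: skip unRegisterNM() on stop
 * when the NM is shutting down precisely because it could not reach the RM,
 * so serviceStop() does not block for another full retry window.
 */
public class UnregisterOnStopGuard {
  private volatile boolean shutdownCausedByConnectionFailure = false;

  /** Called when the registration/heartbeat retry policy is finally exhausted. */
  public void onConnectionRetriesExhausted() {
    shutdownCausedByConnectionFailure = true;
    // ... the real code would dispatch NodeManagerEventType.SHUTDOWN here ...
  }

  /** Called from stop; only unregister when the RM is believed reachable. */
  public void stop(Runnable unRegisterNM) {
    if (shutdownCausedByConnectionFailure) {
      System.out.println("Skipping unRegisterNM(): RM was unreachable, "
          + "another retry cycle would be pointless.");
      return;
    }
    unRegisterNM.run();
  }

  public static void main(String[] args) {
    UnregisterOnStopGuard guard = new UnregisterOnStopGuard();
    guard.onConnectionRetriesExhausted();
    guard.stop(() -> System.out.println("unRegisterNM() called"));
  }
}
{code}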
[jira] [Commented] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048082#comment-15048082 ] Hadoop QA commented on YARN-4225: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 52s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 6s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 26s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 0s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 52s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 26s {color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 25s {color} | {color:red} hadoop-yarn-server-resourcemanager in trunk failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 0s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 51s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 53s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 53s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 53s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 10s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 10s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 28s {color} | {color:red} Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 50, now 50). 
{color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 59s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 52s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 40s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common introduced 1 new FindBugs issues. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 24s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 2s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 25s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 3s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 60m 3s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 64m 22s {color} | {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 25s {color} | {color:green} hadoop-yarn-api in the patch passed with
[jira] [Updated] (YARN-4431) Not necessary to do unRegisterNM() if NM get stop due to failed to connect to RM
[ https://issues.apache.org/jira/browse/YARN-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4431: Component/s: nodemanager > Not necessary to do unRegisterNM() if NM get stop due to failed to connect to > RM > > > Key: YARN-4431 > URL: https://issues.apache.org/jira/browse/YARN-4431 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du > Fix For: 2.8.0 > > Attachments: YARN-4431.patch > > > {noformat} > 2015-12-07 12:16:57,873 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 8 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 12:16:58,874 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 12:16:58,876 WARN > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: > Unregistration of the Node 10.200.10.53:25454 failed. > java.net.ConnectException: Call From jduMBP.local/10.200.10.53 to > 0.0.0.0:8031 failed on connection exception: java.net.ConnectException: > Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused > at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown > Source) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1385) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > at com.sun.proxy.$Proxy74.unRegisterNodeManager(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.unRegisterNodeManager(ResourceTrackerPBClientImpl.java:98) > at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:255) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) > at com.sun.proxy.$Proxy75.unRegisterNodeManager(Unknown Source) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.unRegisterNM(NodeStatusUpdaterImpl.java:267) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStop(NodeStatusUpdaterImpl.java:245) > at > org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) > at > org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) > at > org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:377) > {noformat} > If RM down for some reason, NM's NodeStatusUpdaterImpl will retry the > connection with proper retry policy. 
After retry the maximum times (15 > minutes by default), it will send NodeManagerEventType.SHUTDOWN to shutdown > NM. But NM shutdown will call NodeStatusUpdaterImpl.serviceStop() which will > call unRegisterNM() to unregister NM from RM and get retry again (another 15 > minutes). This is completely unnecessary and we should skip unRegisterNM when > NM get shutdown because of connection issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4356) ensure the timeline service v.2 is disabled cleanly and has no impact when it's turned off
[ https://issues.apache.org/jira/browse/YARN-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-4356: -- Attachment: YARN-4356-feature-YARN-2928.004.patch Posted patch v.4 that addresses Li's comments. > ensure the timeline service v.2 is disabled cleanly and has no impact when > it's turned off > -- > > Key: YARN-4356 > URL: https://issues.apache.org/jira/browse/YARN-4356 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: YARN-4356-feature-YARN-2928.002.patch, > YARN-4356-feature-YARN-2928.003.patch, YARN-4356-feature-YARN-2928.004.patch, > YARN-4356-feature-YARN-2928.poc.001.patch > > > For us to be able to merge the first milestone drop to trunk, we want to > ensure that once disabled the timeline service v.2 has no impact from the > server side to the client side. If the timeline service is not enabled, no > action should be done. If v.1 is enabled but not v.2, v.1 should behave the > same as it does before the merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4194) Extend Reservation Definition Language (RDL) extensions to support node labels
[ https://issues.apache.org/jira/browse/YARN-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048069#comment-15048069 ] Hadoop QA commented on YARN-4194: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 10m 50s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 37s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 2s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 30s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 33s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 49s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 57s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 18s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 39s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 3m 39s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 39s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 3s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 3m 3s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 3s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 36s {color} | {color:red} Patch generated 2 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 17, now 19). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 2s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 50s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 27s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 43s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 6s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 36s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 51s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_91. {color} | | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 25s {color} | {color:red} Patch generated 1 ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 60m 49s {color} | {color:black} {color} | \\ \\ || Subsys
[jira] [Commented] (YARN-4356) ensure the timeline service v.2 is disabled cleanly and has no impact when it's turned off
[ https://issues.apache.org/jira/browse/YARN-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048067#comment-15048067 ] Sangjin Lee commented on YARN-4356: --- Oh I see. Yes, there is no NM metrics publisher in ATS v.1.x, so it should be fine. Thanks for that. > ensure the timeline service v.2 is disabled cleanly and has no impact when > it's turned off > -- > > Key: YARN-4356 > URL: https://issues.apache.org/jira/browse/YARN-4356 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: YARN-4356-feature-YARN-2928.002.patch, > YARN-4356-feature-YARN-2928.003.patch, > YARN-4356-feature-YARN-2928.poc.001.patch > > > For us to be able to merge the first milestone drop to trunk, we want to > ensure that once disabled the timeline service v.2 has no impact from the > server side to the client side. If the timeline service is not enabled, no > action should be done. If v.1 is enabled but not v.2, v.1 should behave the > same as it does before the merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4356) ensure the timeline service v.2 is disabled cleanly and has no impact when it's turned off
[ https://issues.apache.org/jira/browse/YARN-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048057#comment-15048057 ] Li Lu commented on YARN-4356: - Thanks [~sjlee0]! bq. The v.1 behavior should be essentially the same as today. The existing v.1 behavior is to check TIMELINE_SERVICE_ENABLED and then RM_SYSTEM_METRICS_PUBLISHER_ENABLED. You'll see that the current patch checks TIMELINE_SERVICE_ENABLED, TIMELINE_SERVICE_VERSION == 1, and RM_SYSTEM_METRICS_PUBLISHER_ENABLED. I do see that it's checking strictly for version = 1. I'll change it to check for version < 2 so it can match 1.5 as well. Sorry about the confusion here but I was talking about this part of the code:
{code}
// initialize the metrics publisher if the timeline service v.2 is enabled
// and the system publisher is enabled
Configuration conf = context.getConf();
if (YarnConfiguration.timelineServiceV2Enabled(conf) &&
    YarnConfiguration.systemMetricsPublisherEnabled(conf)) {
  LOG.info("YARN system metrics publishing service is enabled");
  nmMetricsPublisher = createNMTimelinePublisher(context);
  context.setNMTimelinePublisher(nmMetricsPublisher);
}
{code}
It looks like the ATS v1.x branch doesn't have the nmMetricsPublisher, so it should be fine? Just want to double check this part. > ensure the timeline service v.2 is disabled cleanly and has no impact when > it's turned off > -- > > Key: YARN-4356 > URL: https://issues.apache.org/jira/browse/YARN-4356 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: YARN-4356-feature-YARN-2928.002.patch, > YARN-4356-feature-YARN-2928.003.patch, > YARN-4356-feature-YARN-2928.poc.001.patch > > > For us to be able to merge the first milestone drop to trunk, we want to > ensure that once disabled the timeline service v.2 has no impact from the > server side to the client side. If the timeline service is not enabled, no > action should be done. If v.1 is enabled but not v.2, v.1 should behave the > same as it does before the merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4431) Not necessary to do unRegisterNM() if NM get stop due to failed to connect to RM
[ https://issues.apache.org/jira/browse/YARN-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048052#comment-15048052 ] Rohith Sharma K S commented on YARN-4431: - committing shortly > Not necessary to do unRegisterNM() if NM get stop due to failed to connect to > RM > > > Key: YARN-4431 > URL: https://issues.apache.org/jira/browse/YARN-4431 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-4431.patch > > > {noformat} > 2015-12-07 12:16:57,873 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 8 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 12:16:58,874 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 12:16:58,876 WARN > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: > Unregistration of the Node 10.200.10.53:25454 failed. > java.net.ConnectException: Call From jduMBP.local/10.200.10.53 to > 0.0.0.0:8031 failed on connection exception: java.net.ConnectException: > Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused > at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown > Source) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1385) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > at com.sun.proxy.$Proxy74.unRegisterNodeManager(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.unRegisterNodeManager(ResourceTrackerPBClientImpl.java:98) > at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:255) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) > at com.sun.proxy.$Proxy75.unRegisterNodeManager(Unknown Source) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.unRegisterNM(NodeStatusUpdaterImpl.java:267) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStop(NodeStatusUpdaterImpl.java:245) > at > org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) > at > org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) > at > org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:377) > {noformat} > If RM down for some reason, NM's NodeStatusUpdaterImpl will retry the > connection with proper retry policy. 
After retry the maximum times (15 > minutes by default), it will send NodeManagerEventType.SHUTDOWN to shutdown > NM. But NM shutdown will call NodeStatusUpdaterImpl.serviceStop() which will > call unRegisterNM() to unregister NM from RM and get retry again (another 15 > minutes). This is completely unnecessary and we should skip unRegisterNM when > NM get shutdown because of connection issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4356) ensure the timeline service v.2 is disabled cleanly and has no impact when it's turned off
[ https://issues.apache.org/jira/browse/YARN-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048049#comment-15048049 ] Sangjin Lee commented on YARN-4356: --- Thanks for your review [~gtCarrera9]. bq. I noticed in some files we're verifying v2 in a hard-coded fashion (version == 2). Why do we still need this especially when we have timelineServiceV2Enabled()? The only reason for them is because timelineServiceV2Enabled() is timelineServiceEnabled() + (timelineServiceVersion == 2). In those cases, timelineServiceEnabled() was already checked. Thus, as a (small) optimization I just checked the version directly. Having said that, I'm comfortable with changing them to call timelineServiceV2Enabled() even though it may check timelineServiceEnabled() one extra time. I'll make those changes. bq. That is, if the timeline-service.version is set to 2.x in future, are the applications allowed to use other versions of ATS? It should be possible in principle with the assumption that some compatibility mechanism is in place so an old API invocation can succeed. The config is there to discover what's running on the cluster. If there is a compatibility mechanism, applications may invoke a different API (it's entirely up to them at that point). bq. ApplicationMaster, function names "...OnNewTimelineService" can be more specific like "...V2"? Sounds good. I didn't rename methods as part of this work, but let me see if I can rename them to use "v2". bq. ContainerManagerImpl, I just want to double check one behavior: the SMP is enabled for the NM only when timeline version is v2 and SMP is enabled in the config? What about v1.x versions? If this is a v2 only feature, shall we clarify that in the log message? The v.1 behavior should be essentially the same as today. The existing v.1 behavior is to check TIMELINE_SERVICE_ENABLED and then RM_SYSTEM_METRICS_PUBLISHER_ENABLED. You'll see that the current patch checks TIMELINE_SERVICE_ENABLED, TIMELINE_SERVICE_VERSION == 1, and RM_SYSTEM_METRICS_PUBLISHER_ENABLED. I do see that it's checking strictly for version = 1. I'll change it to check for version < 2 so it can match 1.5 as well. > ensure the timeline service v.2 is disabled cleanly and has no impact when > it's turned off > -- > > Key: YARN-4356 > URL: https://issues.apache.org/jira/browse/YARN-4356 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: YARN-4356-feature-YARN-2928.002.patch, > YARN-4356-feature-YARN-2928.003.patch, > YARN-4356-feature-YARN-2928.poc.001.patch > > > For us to be able to merge the first milestone drop to trunk, we want to > ensure that once disabled the timeline service v.2 has no impact from the > server side to the client side. If the timeline service is not enabled, no > action should be done. If v.1 is enabled but not v.2, v.1 should behave the > same as it does before the merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
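A hedged sketch of the "version < 2" check being agreed on here, so that v1.0 and v1.5 are both treated as v.1. The helper names below are illustrative, the property keys and defaults are assumptions based on the configuration names mentioned in this thread, and the real patch goes through the YarnConfiguration helpers (timelineServiceEnabled(), timelineServiceV2Enabled()) rather than raw keys.

{code:title=TimelineVersionCheck.java}
import org.apache.hadoop.conf.Configuration;

/**
 * Illustrative helpers only; property keys and defaults are assumptions
 * based on the configuration names mentioned in this thread.
 */
public class TimelineVersionCheck {
  static final String TIMELINE_SERVICE_ENABLED = "yarn.timeline-service.enabled";
  static final String TIMELINE_SERVICE_VERSION = "yarn.timeline-service.version";

  /** v.1 path: timeline service enabled and version below 2 (matches 1.0 and 1.5). */
  static boolean timelineServiceV1Enabled(Configuration conf) {
    return conf.getBoolean(TIMELINE_SERVICE_ENABLED, false)
        && conf.getFloat(TIMELINE_SERVICE_VERSION, 1.0f) < 2.0f;
  }

  /** v.2 path: timeline service enabled and version 2 or above. */
  static boolean timelineServiceV2Enabled(Configuration conf) {
    return conf.getBoolean(TIMELINE_SERVICE_ENABLED, false)
        && conf.getFloat(TIMELINE_SERVICE_VERSION, 1.0f) >= 2.0f;
  }
}
{code}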
[jira] [Commented] (YARN-4431) Not necessary to do unRegisterNM() if NM get stop due to failed to connect to RM
[ https://issues.apache.org/jira/browse/YARN-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048043#comment-15048043 ] Rohith Sharma K S commented on YARN-4431: - +1 lgtm > Not necessary to do unRegisterNM() if NM get stop due to failed to connect to > RM > > > Key: YARN-4431 > URL: https://issues.apache.org/jira/browse/YARN-4431 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-4431.patch > > > {noformat} > 2015-12-07 12:16:57,873 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 8 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 12:16:58,874 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 12:16:58,876 WARN > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: > Unregistration of the Node 10.200.10.53:25454 failed. > java.net.ConnectException: Call From jduMBP.local/10.200.10.53 to > 0.0.0.0:8031 failed on connection exception: java.net.ConnectException: > Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused > at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown > Source) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1385) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > at com.sun.proxy.$Proxy74.unRegisterNodeManager(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.unRegisterNodeManager(ResourceTrackerPBClientImpl.java:98) > at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:255) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) > at com.sun.proxy.$Proxy75.unRegisterNodeManager(Unknown Source) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.unRegisterNM(NodeStatusUpdaterImpl.java:267) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStop(NodeStatusUpdaterImpl.java:245) > at > org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) > at > org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) > at > org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:377) > {noformat} > If RM down for some reason, NM's NodeStatusUpdaterImpl will retry the > connection with proper retry policy. 
After retry the maximum times (15 > minutes by default), it will send NodeManagerEventType.SHUTDOWN to shutdown > NM. But NM shutdown will call NodeStatusUpdaterImpl.serviceStop() which will > call unRegisterNM() to unregister NM from RM and get retry again (another 15 > minutes). This is completely unnecessary and we should skip unRegisterNM when > NM get shutdown because of connection issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4424) Fix deadlock in RMAppImpl
[ https://issues.apache.org/jira/browse/YARN-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048029#comment-15048029 ] Hudson commented on YARN-4424: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #677 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/677/]) YARN-4424. Fix deadlock in RMAppImpl. (Jian he via wangda) (wangda: rev 7e4715186d31ac889fba26d453feedcebb11fc70) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/CHANGES.txt > Fix deadlock in RMAppImpl > - > > Key: YARN-4424 > URL: https://issues.apache.org/jira/browse/YARN-4424 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Jian He >Priority: Blocker > Fix For: 2.7.2, 2.6.3 > > Attachments: YARN-4424.1.patch > > > {code} > yarn@XXX:/mnt/hadoopqe$ /usr/hdp/current/hadoop-yarn-client/bin/yarn > application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING > 15/12/04 21:59:54 INFO impl.TimelineClientImpl: Timeline service address: > http://XXX:8188/ws/v1/timeline/ > 15/12/04 21:59:54 INFO client.RMProxy: Connecting to ResourceManager at > XXX/0.0.0.0:8050 > 15/12/04 21:59:55 INFO client.AHSProxy: Connecting to Application History > server at XXX/0.0.0.0:10200 > {code} > {code:title=RM log} > 2015-12-04 21:59:19,744 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 237000 > 2015-12-04 22:00:50,945 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 238000 > 2015-12-04 22:02:22,416 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 239000 > 2015-12-04 22:03:53,593 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 24 > 2015-12-04 22:05:24,856 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 241000 > 2015-12-04 22:06:56,235 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 242000 > 2015-12-04 22:08:27,510 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 243000 > 2015-12-04 22:09:58,786 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 244000 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4248) REST API for submit/update/delete Reservations
[ https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048030#comment-15048030 ] Hudson commented on YARN-4248: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #677 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/677/]) YARN-4248. Followup patch adding asf-licence exclusions for json test (cdouglas: rev 9f50e13d5dc329c3a6df7f9bcaf2f29b35adc52b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml > REST API for submit/update/delete Reservations > -- > > Key: YARN-4248 > URL: https://issues.apache.org/jira/browse/YARN-4248 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Carlo Curino >Assignee: Carlo Curino > Fix For: 2.8.0 > > Attachments: YARN-4248-asflicense.patch, YARN-4248.2.patch, > YARN-4248.3.patch, YARN-4248.5.patch, YARN-4248.6.patch, YARN-4248.patch > > > This JIRA tracks work to extend the RMWebService to support REST APIs to > submit/update/delete reservations. This will ease integration with external > tools that are not java-based. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4417) Make RM and Timeline-server REST APIs more consistent
[ https://issues.apache.org/jira/browse/YARN-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047995#comment-15047995 ] Hadoop QA commented on YARN-4417: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 56s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 16s {color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 24s {color} | {color:red} hadoop-yarn-server-resourcemanager in trunk failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 37s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 33s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 14s {color} | {color:red} Patch generated 5 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager (total was 49, now 52). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 38s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 25s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 24s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 22s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 8s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 159m 29s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler | | JDK v1.7.0_91 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Not
[jira] [Commented] (YARN-4234) New put APIs in TimelineClient for ats v1.5
[ https://issues.apache.org/jira/browse/YARN-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047987#comment-15047987 ] Li Lu commented on YARN-4234: - Patch generally LGTM. The only issue is that if NameNode is unavailable and retry is not set, the timeline client will quickly retry and then fail. This will cause either application attempts to fail, or the RM to fail to start. Maybe we can try some mechanisms like in FileSystemRMStateStore#startInternal, where we explicitly change related retry policy config? Other than this corner case issue I'm fine with this patch. Right now people are reaching agreements on YARN-3623, so probably YARN-3623 can go in very soon. This said, could some committers please review the current patch? Thanks! > New put APIs in TimelineClient for ats v1.5 > --- > > Key: YARN-4234 > URL: https://issues.apache.org/jira/browse/YARN-4234 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-4234-2015-11-13.1.patch, > YARN-4234-2015-11-16.1.patch, YARN-4234-2015-11-16.2.patch, > YARN-4234-2015.2.patch, YARN-4234.1.patch, YARN-4234.2.patch, > YARN-4234.2015-11-12.1.patch, YARN-4234.2015-11-12.1.patch, > YARN-4234.2015-11-18.1.patch, YARN-4234.2015-11-18.2.patch, > YARN-4234.20151109.patch, YARN-4234.20151110.1.patch, > YARN-4234.2015.1.patch, YARN-4234.3.patch > > > In this ticket, we will add new put APIs in timelineClient to let > clients/applications have the option to use ATS v1.5 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
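A rough sketch of the retry-policy override mentioned above, in the spirit of FileSystemRMStateStore#startInternal. The DFS config keys are standard HDFS client settings, but the spec value and the exact wiring below are assumptions for illustration, not the patch under review:

{code:title=Sketch: forcing a bounded HDFS client retry policy (illustrative only)}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Override the HDFS client retry policy before touching a possibly-down NameNode,
// so the timeline client neither fails immediately nor retries forever.
Configuration conf = new Configuration(new YarnConfiguration());
conf.setBoolean("dfs.client.retry.policy.enabled", true);
// Retry spec is "pause-ms,retries" pairs; the value below is only an example.
conf.set("dfs.client.retry.policy.spec", "2000,500");
FileSystem fs = FileSystem.newInstance(conf);
{code}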
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047931#comment-15047931 ] Akihiro Suda commented on YARN-4301: The warning is for {{concept-async-diskchecker.txt}}, which is just a concept document, not a patch. I didn't know that Yetus recognizes {{*.txt}} file as a patch. > NM disk health checker should have a timeout > > > Key: YARN-4301 > URL: https://issues.apache.org/jira/browse/YARN-4301 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Akihiro Suda > Attachments: YARN-4301-1.patch, YARN-4301-2.patch, > concept-async-diskchecker.txt > > > The disk health checker [verifies a > disk|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L371-L385] > by executing {{mkdir}} and {{rmdir}} periodically. > If these operations does not return in a moderate timeout, the disk should be > marked bad, and thus {{nodeInfo.nodeHealthy}} should flip to {{false}}. > I confirmed that current YARN does not have an implicit timeout (on JDK7, > Linux 4.2, ext4) using [Earthquake|https://github.com/osrg/earthquake], our > fault injector for distributed systems. > (I'll introduce the reproduction script in a while) > I consider we can fix this issue by making > [{{NodeHealthCheckerServer.isHealthy()}}|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java#L69-L73] > return {{false}} if the value of {{this.getLastHealthReportTime()}} is too > old. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
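As a note on the proposed fix in the description, a minimal sketch of the staleness check could look like the following. The threshold field and its default are assumptions, not part of the attached patches:

{code:title=Sketch: failing NodeHealthCheckerService#isHealthy() on a stale report (illustrative only)}
// Assumption: a configurable staleness threshold (the default here is arbitrary).
private long healthReportStalenessThresholdMs = 10 * 60 * 1000L;

public boolean isHealthy() {
  long sinceLastReport = System.currentTimeMillis() - getLastHealthReportTime();
  if (sinceLastReport > healthReportStalenessThresholdMs) {
    // The disk checker has not reported back in time, e.g. mkdir/rmdir is hanging.
    return false;
  }
  return nodeHealthScriptRunner.isHealthy() && dirsHandler.areDisksHealthy();
}
{code}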
[jira] [Commented] (YARN-4356) ensure the timeline service v.2 is disabled cleanly and has no impact when it's turned off
[ https://issues.apache.org/jira/browse/YARN-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047912#comment-15047912 ] Li Lu commented on YARN-4356: - Hi [~sjlee0], thanks for the work! Mostly LGTM, just a few thing to check: 1. I noticed in some files we're verifying v2 in a hard-coded fashion (version == 2). Why do we still need this especially when we have timelineServiceV2Enabled()? 2. MapRed will use the timeline.version config as the current active API version. I'm fine with this design. One thing to check: do we allow other applications to customize the active API version for themselves? That is, if the timeline-service.version is set to 2.x in future, are the applications allowed to use other versions of ATS? (I think in this case the compatibility story should be made by the application itself? ) 3. ApplicationMaster, function names "...OnNewTimelineService" can be more specific like "...V2"? 4. ContainerManagerImpl, I just want to double check one behavior: the SMP is enabled for the NM only when timeline version is v2 and SMP is enabled in the config? What about v1.x versions? If this is a v2 only feature, shall we clarify that in the log message? Thanks! > ensure the timeline service v.2 is disabled cleanly and has no impact when > it's turned off > -- > > Key: YARN-4356 > URL: https://issues.apache.org/jira/browse/YARN-4356 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: YARN-4356-feature-YARN-2928.002.patch, > YARN-4356-feature-YARN-2928.003.patch, > YARN-4356-feature-YARN-2928.poc.001.patch > > > For us to be able to merge the first milestone drop to trunk, we want to > ensure that once disabled the timeline service v.2 has no impact from the > server side to the client side. If the timeline service is not enabled, no > action should be done. If v.1 is enabled but not v.2, v.1 should behave the > same as it does before the merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
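On point 1, a single shared check could replace the hard-coded {{version == 2}} comparisons; a rough sketch is below. The version config key and the float parsing are assumptions based on the YARN-3623 discussion, not the patch itself:

{code:title=Sketch: one shared timelineServiceV2Enabled() helper (illustrative only)}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public static boolean timelineServiceV2Enabled(Configuration conf) {
  // Enabled flag plus major version 2 of "yarn.timeline-service.version" (assumed key).
  return conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED, false)
      && (int) conf.getFloat("yarn.timeline-service.version", 1.0f) == 2;
}
{code}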
[jira] [Commented] (YARN-4415) Scheduler Web Ui shows max capacity for the queue is 100% but when we submit application doesnt get assigned
[ https://issues.apache.org/jira/browse/YARN-4415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047909#comment-15047909 ] Xianyin Xin commented on YARN-4415: --- +1 for the general idea from [~Naganarasimha]. I think there is a lot of improper handling in the code, especially when dealing with label "*", for example {{setupQueueConfigs()}} in {{AbstractCSQueue}}, and {{PartitionQueueCapacitiesInfo}} and {{QueueCapacitiesInfo}} when returning the actual capacities. I suggest you just upload a preview patch so that the problem can be exposed in another way; does that sound feasible? > Scheduler Web Ui shows max capacity for the queue is 100% but when we submit > application doesnt get assigned > > > Key: YARN-4415 > URL: https://issues.apache.org/jira/browse/YARN-4415 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler, resourcemanager >Affects Versions: 2.7.2 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: App info with diagnostics info.png, > capacity-scheduler.xml, screenshot-1.png > > > Steps to reproduce the issue : > Scenario 1: > # Configure a queue(default) with accessible node labels as * > # create a exclusive partition *xxx* and map a NM to it > # ensure no capacities are configured for default for label xxx > # start an RM app with queue as default and label as xxx > # application is stuck but scheduler ui shows 100% as max capacity for that > queue > Scenario 2: > # create a nonexclusive partition *sharedPartition* and map a NM to it > # ensure no capacities are configured for default queue > # start an RM app with queue as *default* and label as *sharedPartition* > # application is stuck but scheduler ui shows 100% as max capacity for that > queue for *sharedPartition* > For both issues cause is the same default max capacity and abs max capacity > is set to Zero % -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047902#comment-15047902 ] Tsuyoshi Ozawa commented on YARN-4301: -- [~suda] thank you for updating. The warning by findbugs looks related to the change. Could you fix it? > NM disk health checker should have a timeout > > > Key: YARN-4301 > URL: https://issues.apache.org/jira/browse/YARN-4301 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Akihiro Suda > Attachments: YARN-4301-1.patch, YARN-4301-2.patch, > concept-async-diskchecker.txt > > > The disk health checker [verifies a > disk|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L371-L385] > by executing {{mkdir}} and {{rmdir}} periodically. > If these operations does not return in a moderate timeout, the disk should be > marked bad, and thus {{nodeInfo.nodeHealthy}} should flip to {{false}}. > I confirmed that current YARN does not have an implicit timeout (on JDK7, > Linux 4.2, ext4) using [Earthquake|https://github.com/osrg/earthquake], our > fault injector for distributed systems. > (I'll introduce the reproduction script in a while) > I consider we can fix this issue by making > [{{NodeHealthCheckerServer.isHealthy()}}|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java#L69-L73] > return {{false}} if the value of {{this.getLastHealthReportTime()}} is too > old. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4340) Add "list" API to reservation system
[ https://issues.apache.org/jira/browse/YARN-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Po updated YARN-4340: -- Attachment: YARN-4340.v7.patch This patch addresses the remaining findbugs, checkstyle errors. The following unit tests are failing and have associated Jira tickets: hadoop.yarn.server.resourcemanager.TestClientRMTokens -- YARN-4306 hadoop.yarn.server.resourcemanager.TestAMAuthorization -- YARN-4318 hadoop.yarn.client.TestGetGroups -- YARN-4351 The following unit tests are passing locally and flakiness may be related to YARN-4352: org.apache.hadoop.yarn.client.api.impl.TestYarnClient -- testShouldNotRetryForeverForNonNetworkExceptions also fails locally on trunk -- testAMMRToken passes locally on trunk org.apache.hadoop.yarn.client.api.impl.TestAMRMClient org.apache.hadoop.yarn.client.api.impl.TestNMClient The following tests are passing locally: hadoop.mapreduce.v2.TestMRJobsWithProfiler > Add "list" API to reservation system > > > Key: YARN-4340 > URL: https://issues.apache.org/jira/browse/YARN-4340 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Carlo Curino >Assignee: Sean Po > Attachments: YARN-4340.v1.patch, YARN-4340.v2.patch, > YARN-4340.v3.patch, YARN-4340.v4.patch, YARN-4340.v5.patch, > YARN-4340.v6.patch, YARN-4340.v7.patch > > > This JIRA tracks changes to the APIs of the reservation system, and enables > querying the reservation system on which reservation exists by "time-range, > reservation-id". > YARN-4420 has a dependency on this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4341) add doc about timeline performance tool usage
[ https://issues.apache.org/jira/browse/YARN-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047879#comment-15047879 ] Sangjin Lee commented on YARN-4341: --- Sorry [~lichangleo] it took me a while to get to this. Some corrections and suggestions below. (Highlights) - "Timeline..." -> "The timeline..." - "help measure..." -> "helps measure..." - "Test will launch..." -> "The test launches..." - "JobHistoryFileReplay mapper" -> "JobHistoryFileReplay mappers" - ".. to timeline server." -> "... to the timeline server." - "In the end," -> "At the end," - "transaction rate(ops/s)" -> "the transaction rate (ops/s)" - "and transaction rate in total" -> "and the total transaction rate" - "print out" -> "printed out" - "To run the test..." -> "Running the test..." - "IO rate(KB/s)" -> "the I/O rate (KB/s)" - "IO rate total." -> "the total I/O rate." (Usages) - "Usages" -> "Usage" - "Each mapper write user specified number of timeline entities to timelineserver and each timeline entity is created with user specified size." -> "Each mapper writes a user-specified number of timeline entities with a user-specified size to the timeline server." - "Each mappe replay..." -> "Each mapper replays..." - "... to be replayed. suggest to launch mappers no more than..." -> "...to be replayed; the number of mappers should be no more than..." > add doc about timeline performance tool usage > - > > Key: YARN-4341 > URL: https://issues.apache.org/jira/browse/YARN-4341 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4341.2.patch, YARN-4341.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4194) Extend Reservation Definition Langauge (RDL) extensions to support node labels
[ https://issues.apache.org/jira/browse/YARN-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Tumanov updated YARN-4194: - Attachment: YARN-4194-v1.patch patch now attached. > Extend Reservation Definition Langauge (RDL) extensions to support node labels > -- > > Key: YARN-4194 > URL: https://issues.apache.org/jira/browse/YARN-4194 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Carlo Curino >Assignee: Alexey Tumanov > Attachments: YARN-4194-v1.patch > > > This JIRA tracks changes to the APIs to the reservation system to support > the expressivity of node-labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3368) [Umbrella] Improve YARN web UI
[ https://issues.apache.org/jira/browse/YARN-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047794#comment-15047794 ] Wangda Tan commented on YARN-3368: -- Also I've renamed this JIRA to an umbrella JIRA for efforts of YARN web UI improvements. Please feel free to file tickets for bugs/features. > [Umbrella] Improve YARN web UI > -- > > Key: YARN-3368 > URL: https://issues.apache.org/jira/browse/YARN-3368 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He > Attachments: (Dec 3 2015) yarn-ui-screenshots.zip, (POC, Aug-2015)) > yarn-ui-screenshots.zip > > > The goal is to improve YARN UI for better usability. > We may take advantage of some existing front-end frameworks to build a > fancier, easier-to-use UI. > The old UI continue to exist until we feel it's ready to flip to the new UI. > This serves as an umbrella jira to track the tasks. we can do this in a > branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3368) [Umbrella] Improve YARN web UI
[ https://issues.apache.org/jira/browse/YARN-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047793#comment-15047793 ] Wangda Tan commented on YARN-3368: -- Thanks, I've created YARN-3368 branch and committed patches to it. You can follow steps in {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/README.md}} to try this patch. > [Umbrella] Improve YARN web UI > -- > > Key: YARN-3368 > URL: https://issues.apache.org/jira/browse/YARN-3368 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He > Attachments: (Dec 3 2015) yarn-ui-screenshots.zip, (POC, Aug-2015)) > yarn-ui-screenshots.zip > > > The goal is to improve YARN UI for better usability. > We may take advantage of some existing front-end frameworks to build a > fancier, easier-to-use UI. > The old UI continue to exist until we feel it's ready to flip to the new UI. > This serves as an umbrella jira to track the tasks. we can do this in a > branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3368) [Umbrella] Improve YARN web UI
[ https://issues.apache.org/jira/browse/YARN-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3368: - Summary: [Umbrella] Improve YARN web UI (was: Improve YARN web UI) > [Umbrella] Improve YARN web UI > -- > > Key: YARN-3368 > URL: https://issues.apache.org/jira/browse/YARN-3368 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He > Attachments: (Dec 3 2015) yarn-ui-screenshots.zip, (POC, Aug-2015)) > yarn-ui-screenshots.zip > > > The goal is to improve YARN UI for better usability. > We may take advantage of some existing front-end frameworks to build a > fancier, easier-to-use UI. > The old UI continue to exist until we feel it's ready to flip to the new UI. > This serves as an umbrella jira to track the tasks. we can do this in a > branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4309) Add debug information to application logs when a container fails
[ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047788#comment-15047788 ] Ivan Mitic commented on YARN-4309: -- Thanks [~vvasudev]. Latest patch looks good, I am +1 on the Windows side changes. Please also have someone actively working on Yarn to +1 on the overall approach and Linux side. > Add debug information to application logs when a container fails > > > Key: YARN-4309 > URL: https://issues.apache.org/jira/browse/YARN-4309 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: YARN-4309.001.patch, YARN-4309.002.patch, > YARN-4309.003.patch, YARN-4309.004.patch, YARN-4309.005.patch, > YARN-4309.006.patch, YARN-4309.007.patch, YARN-4309.008.patch, > YARN-4309.009.patch > > > Sometimes when a container fails, it can be pretty hard to figure out why it > failed. > My proposal is that if a container fails, we collect information about the > container local dir and dump it into the container log dir. Ideally, I'd like > to tar up the directory entirely, but I'm not sure of the security and space > implications of such a approach. At the very least, we can list all the files > in the container local dir, and dump the contents of launch_container.sh(into > the container log dir). > When log aggregation occurs, all this information will automatically get > collected and make debugging such failures much easier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
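A minimal sketch of the idea in the description -- list the container local dir and keep {{launch_container.sh}} with the container logs -- is shown below. The helper name and the {{directory.info}} file name are illustrative assumptions, not the attached patches:

{code:title=Sketch: dumping container debug info on failure (illustrative only)}
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

private void dumpDebugInfo(File containerWorkDir, File containerLogDir)
    throws IOException {
  // List every file in the container local dir into the container log dir.
  try (PrintWriter out = new PrintWriter(
      new File(containerLogDir, "directory.info"), "UTF-8")) {
    File[] entries = containerWorkDir.listFiles();
    if (entries != null) {
      for (File f : entries) {
        out.println(f.getAbsolutePath() + " " + f.length());
      }
    }
  }
  // Keep the generated launch script alongside the container logs for debugging.
  Files.copy(new File(containerWorkDir, "launch_container.sh").toPath(),
      new File(containerLogDir, "launch_container.sh").toPath(),
      StandardCopyOption.REPLACE_EXISTING);
}
{code}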
[jira] [Updated] (YARN-4417) Make RM and Timeline-server REST APIs more consistent
[ https://issues.apache.org/jira/browse/YARN-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4417: - Attachment: YARN-4417.2.patch Attached ver.2 patch fixed test failures. > Make RM and Timeline-server REST APIs more consistent > - > > Key: YARN-4417 > URL: https://issues.apache.org/jira/browse/YARN-4417 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4417.1.patch, YARN-4417.2.patch > > > There're some differences between RM and timeline-server's REST APIs, for > example, RM REST API doesn't support get application attempt info by app-id > and attempt-id but timeline server supports. We could make them more > consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3623) We should have a config to indicate the Timeline Service version
[ https://issues.apache.org/jira/browse/YARN-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047758#comment-15047758 ] Li Lu commented on YARN-3623: - bq. it may be sufficient to note it here and carry on that discussion on a v.2 subtask. Sound good? Agree. We can further investigate this issue in the v2 branch. I'm also fine with the current config name in [~xgong]'s patch. > We should have a config to indicate the Timeline Service version > > > Key: YARN-3623 > URL: https://issues.apache.org/jira/browse/YARN-3623 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Xuan Gong > Attachments: YARN-3623-2015-11-19.1.patch > > > So far RM, MR AM, DA AM added/changed new config to enable the feature to > write the timeline data to v2 server. It's good to have a YARN > timeline-service.version config like timeline-service.enable to indicate the > version of the running timeline service with the given YARN cluster. It's > beneficial for users to more smoothly move from v1 to v2, as they don't need > to change the existing config, but switch this config from v1 to v2. And each > framework doesn't need to have their own v1/v2 config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4381) Add container launchEvent and container localizeFailed metrics in container
[ https://issues.apache.org/jira/browse/YARN-4381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047742#comment-15047742 ] Lin Yiqun commented on YARN-4381: - [~djp], the jenkin report shows that checkstyle warnings is not need to modify and license warnings likes not related. Could you review my patch again? > Add container launchEvent and container localizeFailed metrics in container > --- > > Key: YARN-4381 > URL: https://issues.apache.org/jira/browse/YARN-4381 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.7.1 >Reporter: Lin Yiqun >Assignee: Lin Yiqun > Attachments: YARN-4381.001.patch, YARN-4381.002.patch > > > Recently, I found a issue on nodemanager metrics.That's > {{NodeManagerMetrics#containersLaunched}} is not actually means the container > succeed launched times.Because in some time, it will be failed when receiving > the killing command or happening container-localizationFailed.This will lead > to a failed container.But now,this counter value will be increased in these > code whenever the container is started successfully or failed. > {code} > Credentials credentials = parseCredentials(launchContext); > Container container = > new ContainerImpl(getConfig(), this.dispatcher, > context.getNMStateStore(), launchContext, > credentials, metrics, containerTokenIdentifier); > ApplicationId applicationID = > containerId.getApplicationAttemptId().getApplicationId(); > if (context.getContainers().putIfAbsent(containerId, container) != null) { > NMAuditLogger.logFailure(user, AuditConstants.START_CONTAINER, > "ContainerManagerImpl", "Container already running on this node!", > applicationID, containerId); > throw RPCUtil.getRemoteException("Container " + containerIdStr > + " already is running on this node!!"); > } > this.readLock.lock(); > try { > if (!serviceStopped) { > // Create the application > Application application = > new ApplicationImpl(dispatcher, user, applicationID, credentials, > context); > if (null == context.getApplications().putIfAbsent(applicationID, > application)) { > LOG.info("Creating a new application reference for app " + > applicationID); > LogAggregationContext logAggregationContext = > containerTokenIdentifier.getLogAggregationContext(); > Map appAcls = > container.getLaunchContext().getApplicationACLs(); > context.getNMStateStore().storeApplication(applicationID, > buildAppProto(applicationID, user, credentials, appAcls, > logAggregationContext)); > dispatcher.getEventHandler().handle( > new ApplicationInitEvent(applicationID, appAcls, > logAggregationContext)); > } > this.context.getNMStateStore().storeContainer(containerId, request); > dispatcher.getEventHandler().handle( > new ApplicationContainerInitEvent(container)); > > this.context.getContainerTokenSecretManager().startContainerSuccessful( > containerTokenIdentifier); > NMAuditLogger.logSuccess(user, AuditConstants.START_CONTAINER, > "ContainerManageImpl", applicationID, containerId); > // TODO launchedContainer misplaced -> doesn't necessarily mean a > container > // launch. A finished Application will not launch containers. > metrics.launchedContainer(); > metrics.allocateContainer(containerTokenIdentifier.getResource()); > } else { > throw new YarnException( > "Container start failed as the NodeManager is " + > "in the process of shutting down"); > } > {code} > In addition, we are lack of localzationFailed metric in container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
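For illustration, the two counters being proposed could be wired into {{NodeManagerMetrics}} roughly as follows. The names, descriptions, and method signatures are assumptions, not the attached patches:

{code:title=Sketch: possible NodeManagerMetrics additions (illustrative only)}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.lib.MutableCounterInt;

// Count launch events separately from successful launches, and count containers
// that fail during localization.
@Metric("# of container launch events received")
MutableCounterInt containersLaunchEvent;

@Metric("# of containers failed during localization")
MutableCounterInt containersLocalizeFailed;

public void launchEventContainer() {
  containersLaunchEvent.incr();
}

public void localizeFailedContainer() {
  containersLocalizeFailed.incr();
}
{code}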
[jira] [Commented] (YARN-4356) ensure the timeline service v.2 is disabled cleanly and has no impact when it's turned off
[ https://issues.apache.org/jira/browse/YARN-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047738#comment-15047738 ] Hadoop QA commented on YARN-4356: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 12 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 45s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 12m 13s {color} | {color:green} feature-YARN-2928 passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 40s {color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 9s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 5m 17s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 2m 33s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 21s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in feature-YARN-2928 has 3 extant Findbugs warnings. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 32s {color} | {color:red} hadoop-yarn-common in feature-YARN-2928 failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 25s {color} | {color:red} hadoop-yarn-server-resourcemanager in feature-YARN-2928 failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 6m 39s {color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 39s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 51s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 24m 41s {color} | {color:red} root-jdk1.8.0_66 with JDK v1.8.0_66 generated 5 new issues (was 779, now 779). {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 8m 51s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 30s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 34m 11s {color} | {color:red} root-jdk1.7.0_85 with JDK v1.7.0_85 generated 5 new issues (was 772, now 772). 
{color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 9m 30s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 3s {color} | {color:red} Patch generated 26 new checkstyle issues in root (total was 1937, now 1931). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 5m 9s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 2m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 43s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 35s {color} | {color:red} hadoop-yarn-common in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 27s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 6m 27s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 26s {color} | {color:green} hadoop-yarn-ap
[jira] [Commented] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state in CS
[ https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047702#comment-15047702 ] Hadoop QA commented on YARN-3946: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 6 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 15s {color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 25s {color} | {color:red} hadoop-yarn-server-resourcemanager in trunk failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 14s {color} | {color:red} Patch generated 28 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager (total was 654, now 678). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 40s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 28s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 24s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 55s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 28s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 155m 36s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | | JDK v1.7.0_91 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | ha
[jira] [Commented] (YARN-3623) We should have a config to indicate the Timeline Service version
[ https://issues.apache.org/jira/browse/YARN-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047695#comment-15047695 ] Sangjin Lee commented on YARN-3623: --- I agree the rolling upgrade use case from v.1 to v.2 should be addressed. We had some offline discussion on this too. Since it is a pretty major item in and of itself and somewhat separate (being v.2-specific) from this specific JIRA, it may be sufficient to note it here and carry on that discussion on a v.2 subtask. Sound good? I'm fine with the current name "yarn.timeline-service.version". I just want to clarify the interpretation of this config on the cluster side and on the client side. On the cluster side, it *should* always be interpreted as precisely which version of the timeline service should be up. If "yarn.timeline-service.version" is 1.5, and "yarn.timeline-service.enabled" is true, it should be understood as the cluster should bring up the timeline service v.1.5 (and nothing else), and the client can expect that to be the case. On the client side, clearly a client that uses the same version should expect to succeed. If a client chooses to use a smaller version in spite of this, then depending on how robust the compatibility story is between versions, the results may vary (part of the rolling upgrade discussion included). > We should have a config to indicate the Timeline Service version > > > Key: YARN-3623 > URL: https://issues.apache.org/jira/browse/YARN-3623 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Xuan Gong > Attachments: YARN-3623-2015-11-19.1.patch > > > So far RM, MR AM, DA AM added/changed new config to enable the feature to > write the timeline data to v2 server. It's good to have a YARN > timeline-service.version config like timeline-service.enable to indicate the > version of the running timeline service with the given YARN cluster. It's > beneficial for users to more smoothly move from v1 to v2, as they don't need > to change the existing config, but switch this config from v1 to v2. And each > framework doesn't need to have their own v1/v2 config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
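To make the cluster-side interpretation concrete, the daemon-side selection could look something like the sketch below. The config key, the float reading, and the three start methods are all assumptions for illustration; they are not existing YARN APIs:

{code:title=Sketch: cluster-side timeline version selection (illustrative only)}
// Read the version as a float so 1.5 can be expressed (assumed key and type).
float version = conf.getFloat("yarn.timeline-service.version", 1.0f);
if (conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED, false)) {
  if (version >= 2.0f) {
    startTimelineServiceV2();     // hypothetical v.2 startup path
  } else if (version >= 1.5f) {
    startTimelineServiceV15();    // hypothetical v.1.5 startup path
  } else {
    startTimelineServiceV1();     // hypothetical v.1 startup path
  }
}
{code}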
[jira] [Updated] (YARN-4356) ensure the timeline service v.2 is disabled cleanly and has no impact when it's turned off
[ https://issues.apache.org/jira/browse/YARN-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-4356: -- Attachment: YARN-4356-feature-YARN-2928.003.patch Posted patch v.3. Addressed the javadoc, findbugs, and checkstyle errors. The unit tests are tests that are known to fail in the trunk or on our branch (e.g. YARN-4350, MAPREDUCE-6533, MAPREDUCE-6540, etc.). > ensure the timeline service v.2 is disabled cleanly and has no impact when > it's turned off > -- > > Key: YARN-4356 > URL: https://issues.apache.org/jira/browse/YARN-4356 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: YARN-4356-feature-YARN-2928.002.patch, > YARN-4356-feature-YARN-2928.003.patch, > YARN-4356-feature-YARN-2928.poc.001.patch > > > For us to be able to merge the first milestone drop to trunk, we want to > ensure that once disabled the timeline service v.2 has no impact from the > server side to the client side. If the timeline service is not enabled, no > action should be done. If v.1 is enabled but not v.2, v.1 should behave the > same as it does before the merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4424) Fix deadlock in RMAppImpl
[ https://issues.apache.org/jira/browse/YARN-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047679#comment-15047679 ] Hudson commented on YARN-4424: -- FAILURE: Integrated in Hadoop-trunk-Commit #8943 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8943/]) YARN-4424. Fix deadlock in RMAppImpl. (Jian he via wangda) (wangda: rev 7e4715186d31ac889fba26d453feedcebb11fc70) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java > Fix deadlock in RMAppImpl > - > > Key: YARN-4424 > URL: https://issues.apache.org/jira/browse/YARN-4424 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Jian He >Priority: Blocker > Fix For: 2.7.2, 2.6.3 > > Attachments: YARN-4424.1.patch > > > {code} > yarn@XXX:/mnt/hadoopqe$ /usr/hdp/current/hadoop-yarn-client/bin/yarn > application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING > 15/12/04 21:59:54 INFO impl.TimelineClientImpl: Timeline service address: > http://XXX:8188/ws/v1/timeline/ > 15/12/04 21:59:54 INFO client.RMProxy: Connecting to ResourceManager at > XXX/0.0.0.0:8050 > 15/12/04 21:59:55 INFO client.AHSProxy: Connecting to Application History > server at XXX/0.0.0.0:10200 > {code} > {code:title=RM log} > 2015-12-04 21:59:19,744 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 237000 > 2015-12-04 22:00:50,945 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 238000 > 2015-12-04 22:02:22,416 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 239000 > 2015-12-04 22:03:53,593 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 24 > 2015-12-04 22:05:24,856 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 241000 > 2015-12-04 22:06:56,235 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 242000 > 2015-12-04 22:08:27,510 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 243000 > 2015-12-04 22:09:58,786 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 244000 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3623) We should have a config to indicate the Timeline Service version
[ https://issues.apache.org/jira/browse/YARN-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047675#comment-15047675 ] Li Lu commented on YARN-3623: - Thanks for the review [~djp]! ATS v1.5 introduces some new APIs on top of the ATS v1 APIs. However, ATS v2 is not compatible with either version. I agree that a config would suffice to specify the "active" ATS version or the version of the writer API a client should use. Right now I think a config named "yarn.timeline-service.version" is fine because this leaves flexibility to allow a set of active ATS writer API versions in the system. Marking only a latest version may not be very useful since ATS 1.x is not API-compatible with ATS v2.x. On the other hand, I totally agree there should be a comprehensive story for ATS rolling upgrade. IIUC, ATS v1 can be upgraded in a rolling fashion to v1.5. Meanwhile, if the ATS v1/1.5 server is available in the system, the v1.x server should be able to work with v2.x clients (since the v1 server won't be touched by the ATS v2 client). Therefore, I think the rolling upgrade story, from ATS v1.x to ATS v2, can be reduced to the ability for ATS v1 and ATS v2 servers to co-exist in the cluster? We can certainly have more discussion on the rolling upgrade in ATS v2 JIRAs. > We should have a config to indicate the Timeline Service version > > > Key: YARN-3623 > URL: https://issues.apache.org/jira/browse/YARN-3623 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Xuan Gong > Attachments: YARN-3623-2015-11-19.1.patch > > > So far RM, MR AM, DA AM added/changed new config to enable the feature to > write the timeline data to v2 server. It's good to have a YARN > timeline-service.version config like timeline-service.enable to indicate the > version of the running timeline service with the given YARN cluster. It's > beneficial for users to more smoothly move from v1 to v2, as they don't need > to change the existing config, but switch this config from v1 to v2. And each > framework doesn't need to have their own v1/v2 config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4225) Add preemption status to yarn queue -status for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4225: - Attachment: YARN-4225.004.patch Thanks very much [~leftnoteasy], for your review and helpful comments. {quote} I'm OK with both approach - existing one in latest patch or simply return false if there's no such field in proto. {quote} So, if I understand correctly, you are okay with {{QueueInfo#getPreemptionDisabled}} returning {{Boolean}} with the possibility of returning {{null}} if the field doesn't exist. With that understanding, I'm leaving that in the latest patch. {quote} 2) For QueueCLI, is it better to print "preemption is disabled/enabled" instead of "preemption status: disabled/enabled"? {quote} Actually, I think that leaving it as "Preemption : disabled/enabled" is more consistent with the way the other properties are displayed. What do you think? {quote} 3) Is it possible to add a simple test to verify end-to-end behavior? {quote} I added a couple of tests to {{TestYarnCLI}}. Good suggestion. > Add preemption status to yarn queue -status for capacity scheduler > -- > > Key: YARN-4225 > URL: https://issues.apache.org/jira/browse/YARN-4225 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.7.1 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: YARN-4225.001.patch, YARN-4225.002.patch, > YARN-4225.003.patch, YARN-4225.004.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
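For reference, the CLI output under discussion might be rendered along the lines of the sketch below; the null check reflects the {{Boolean}} return of {{getPreemptionDisabled()}} discussed above, and the label text follows the comment. This is only an illustration, not the attached patch:

{code:title=Sketch: printing preemption status in QueueCLI (illustrative only)}
// getPreemptionDisabled() may return null when the field is absent from the proto.
Boolean preemptionDisabled = queueInfo.getPreemptionDisabled();
if (preemptionDisabled != null) {
  writer.print("\tPreemption : ");
  writer.println(preemptionDisabled ? "disabled" : "enabled");
}
{code}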
[jira] [Created] (YARN-4436) DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled
Daniel Templeton created YARN-4436: -- Summary: DistShell ApplicationMaster.ExecBatScripStringtPath is misspelled Key: YARN-4436 URL: https://issues.apache.org/jira/browse/YARN-4436 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.7.1 Reporter: Daniel Templeton Assignee: Devon Michaels Priority: Trivial It should be ExecBatScriptStringPath. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4424) Fix deadlock in RMAppImpl
[ https://issues.apache.org/jira/browse/YARN-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4424: - Summary: Fix deadlock in RMAppImpl (was: YARN CLI command hangs) > Fix deadlock in RMAppImpl > - > > Key: YARN-4424 > URL: https://issues.apache.org/jira/browse/YARN-4424 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Jian He >Priority: Blocker > Attachments: YARN-4424.1.patch > > > {code} > yarn@XXX:/mnt/hadoopqe$ /usr/hdp/current/hadoop-yarn-client/bin/yarn > application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING > 15/12/04 21:59:54 INFO impl.TimelineClientImpl: Timeline service address: > http://XXX:8188/ws/v1/timeline/ > 15/12/04 21:59:54 INFO client.RMProxy: Connecting to ResourceManager at > XXX/0.0.0.0:8050 > 15/12/04 21:59:55 INFO client.AHSProxy: Connecting to Application History > server at XXX/0.0.0.0:10200 > {code} > {code:title=RM log} > 2015-12-04 21:59:19,744 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 237000 > 2015-12-04 22:00:50,945 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 238000 > 2015-12-04 22:02:22,416 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 239000 > 2015-12-04 22:03:53,593 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 24 > 2015-12-04 22:05:24,856 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 241000 > 2015-12-04 22:06:56,235 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 242000 > 2015-12-04 22:08:27,510 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 243000 > 2015-12-04 22:09:58,786 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(243)) - Size of event-queue is 244000 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4435) Add RM Delegation Token DtFetcher Implementation for DtUtil
[ https://issues.apache.org/jira/browse/YARN-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-4435: --- Assignee: Matthew Paduano > Add RM Delegation Token DtFetcher Implementation for DtUtil > --- > > Key: YARN-4435 > URL: https://issues.apache.org/jira/browse/YARN-4435 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Matthew Paduano >Assignee: Matthew Paduano > Attachments: proposed_solution > > > Add a class to yarn project that implements the DtFetcher interface to return > a RM delegation token object. > I attached a proposed class implementation that does this, but it cannot be > added as a patch until the interface is merged in HADOOP-12563 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
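The attached proposal is not reproduced here, but the general shape of such a class might look like the sketch below. The {{DtFetcher}} interface only exists in the unmerged HADOOP-12563 patch, so the interface and its method signatures shown here are assumptions, not a real API:

{code:title=Sketch: an RM delegation-token fetcher (assumed interface, illustrative only)}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.client.ClientRMProxy;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class RMDelegationTokenFetcher implements DtFetcher {

  @Override
  public Text getServiceName() {
    return new Text("yarn");   // assumed registration name
  }

  @Override
  public Token<?> addDelegationTokens(Configuration conf, Credentials creds,
      String renewer, String url) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(conf);
    client.start();
    try {
      // Fetch the RM delegation token and store it under the RM's service name.
      org.apache.hadoop.yarn.api.records.Token rmToken =
          client.getRMDelegationToken(new Text(renewer));
      Text service = ClientRMProxy.getRMDelegationTokenService(conf);
      Token<?> token = ConverterUtils.convertFromYarn(rmToken, service);
      creds.addToken(service, token);
      return token;
    } finally {
      client.stop();
    }
  }
}
{code}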
[jira] [Commented] (YARN-4248) REST API for submit/update/delete Reservations
[ https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047576#comment-15047576 ] Chris Douglas commented on YARN-4248: - Thanks, Chris. > REST API for submit/update/delete Reservations > -- > > Key: YARN-4248 > URL: https://issues.apache.org/jira/browse/YARN-4248 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Carlo Curino >Assignee: Carlo Curino > Fix For: 2.8.0 > > Attachments: YARN-4248-asflicense.patch, YARN-4248.2.patch, > YARN-4248.3.patch, YARN-4248.5.patch, YARN-4248.6.patch, YARN-4248.patch > > > This JIRA tracks work to extend the RMWebService to support REST APIs to > submit/update/delete reservations. This will ease integration with external > tools that are not java-based. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Moved] (YARN-4435) Add RM Delegation Token DtFetcher Implementation for DtUtil
[ https://issues.apache.org/jira/browse/YARN-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Paduano moved HADOOP-12599 to YARN-4435: Assignee: (was: Matthew Paduano) Key: YARN-4435 (was: HADOOP-12599) Project: Hadoop YARN (was: Hadoop Common) > Add RM Delegation Token DtFetcher Implementation for DtUtil > --- > > Key: YARN-4435 > URL: https://issues.apache.org/jira/browse/YARN-4435 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Matthew Paduano > Attachments: proposed_solution > > > Add a class to yarn project that implements the DtFetcher interface to return > a RM delegation token object. > I attached a proposed class implementation that does this, but it cannot be > added as a patch until the interface is merged in HADOOP-12563 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4248) REST API for submit/update/delete Reservations
[ https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047553#comment-15047553 ] Chris Nauroth commented on YARN-4248: - bq. Not sure why it wasn't flagged by test-patch. I decided to dig into this. At the time that pre-commit ran for YARN-4248, there was an unrelated license warning present in HDFS, introduced by HDFS-9414. https://builds.apache.org/job/PreCommit-YARN-Build/9872/artifact/patchprocess/patch-asflicense-problems.txt Unfortunately, if there is a pre-existing license warning, then the {{mvn apache-rat:check}} build halts at that first failing module. Since hadoop-hdfs-client builds before hadoop-yarn-server-resourcemanager, it masked the new license warnings introduced by this patch. This is visible here if you scroll to the bottom and notice module Apache Hadoop HDFS Client failed, followed by skipping all subsequent modules. https://builds.apache.org/job/PreCommit-YARN-Build/9872/artifact/patchprocess/patch-asflicense-root.txt Maybe we can do better when there are pre-existing license warnings, perhaps by using the {{--fail-at-end}} option to make sure we check all modules. I filed YETUS-221. > REST API for submit/update/delete Reservations > -- > > Key: YARN-4248 > URL: https://issues.apache.org/jira/browse/YARN-4248 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Carlo Curino >Assignee: Carlo Curino > Fix For: 2.8.0 > > Attachments: YARN-4248-asflicense.patch, YARN-4248.2.patch, > YARN-4248.3.patch, YARN-4248.5.patch, YARN-4248.6.patch, YARN-4248.patch > > > This JIRA tracks work to extend the RMWebService to support REST APIs to > submit/update/delete reservations. This will ease integration with external > tools that are not java-based. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4248) REST API for submit/update/delete Reservations
[ https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047418#comment-15047418 ] Hudson commented on YARN-4248: -- FAILURE: Integrated in Hadoop-trunk-Commit #8941 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8941/]) YARN-4248. Followup patch adding asf-licence exclusions for json test (cdouglas: rev 9f50e13d5dc329c3a6df7f9bcaf2f29b35adc52b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml > REST API for submit/update/delete Reservations > -- > > Key: YARN-4248 > URL: https://issues.apache.org/jira/browse/YARN-4248 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Carlo Curino >Assignee: Carlo Curino > Fix For: 2.8.0 > > Attachments: YARN-4248-asflicense.patch, YARN-4248.2.patch, > YARN-4248.3.patch, YARN-4248.5.patch, YARN-4248.6.patch, YARN-4248.patch > > > This JIRA tracks work to extend the RMWebService to support REST APIs to > submit/update/delete reservations. This will ease integration with external > tools that are not java-based. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047400#comment-15047400 ] Vinod Kumar Vavilapalli commented on YARN-1856: --- Quick comments on the patch: General - Should add all the configs to yarn-default.xml, saying they are still early configs? - Should update the documentation of the pmem-check-enabled, vmem-check-enabled configs in code and yarn-default.xml to denote their relation to resource.memory.enabled. - Actually, given the existing memory monitoring mechanism, NM_MEMORY_RESOURCE_ENABLED is in reality already true when pmem/vmem checks are enabled. We need to reconcile the old and new configs somehow. Maybe memory is always enabled, but if the vmem/pmem configs are enabled, use the old handler, otherwise use the new one? Thinking out loud. - Do the soft and hard limits also somehow logically relate to pmem-vmem-ratio? If so, we should hint at that in the documentation. - Swappiness seems like a cluster configuration defaulting to zero. So far, this has been an implicit contract with our users; it would be good to document this in yarn-default.xml as well. Code comments - ResourceHandlerModule -- Formatting of new code is a little off: the declaration of {{getCgroupsMemoryResourceHandler()}}. There are other occurrences like this earlier in that class in this patch; you may want to fix those. -- BUG! getCgroupsMemoryResourceHandler() incorrectly locks DiskResourceHandler instead of MemoryResourceHandler. - CGroupsMemoryResourceHandlerImpl -- What is this doing? {{ CGroupsHandler.CGroupController MEMORY = CGroupsHandler.CGroupController.MEMORY; }} Is it forcing a class-load or something? Not sure if this is needed. If it is, you may want to add a comment here. - NM_MEMORY_RESOURCE_CGROUPS_SOFT_LIMIT_PERC -> NM_MEMORY_RESOURCE_CGROUPS_SOFT_LIMIT_PERCENTAGE. Similarly the default constant. - CGROUP_PARAM_MEMORY_HARD_LIMIT_BYTES / CGROUP_PARAM_MEMORY_SOFT_LIMIT_BYTES / CGROUP_PARAM_MEMORY_SWAPPINESS can all be static and final. > cgroups based memory monitoring for containers > -- > > Key: YARN-1856 > URL: https://issues.apache.org/jira/browse/YARN-1856 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Karthik Kambatla >Assignee: Varun Vasudev > Attachments: YARN-1856.001.patch, YARN-1856.002.patch, > YARN-1856.003.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
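As an aside for readers following the review above, here is a minimal, self-contained sketch of the arithmetic a cgroups memory handler along the lines of CGroupsMemoryResourceHandlerImpl would need to perform. The percentage default, the variable names, and the printed cgroup file names are assumptions drawn from this review thread and from the cgroup v1 memory controller, not the committed YARN implementation.
{code}
// Illustrative only: derives the cgroup memory values discussed above from a
// container's memory allocation. Defaults and names are assumptions.
public class CGroupsMemoryLimitSketch {
  public static void main(String[] args) {
    int containerMemoryMb = 2048;      // container allocation in MB
    float softLimitPercentage = 90.0f; // hypothetical NM-wide default, expressed as a percentage
    int swappiness = 0;                // the implicit "default to zero" contract mentioned above

    long hardLimitBytes = containerMemoryMb * 1024L * 1024L;
    double softLimitFraction = softLimitPercentage / 100.0;
    long softLimitBytes = (long) (hardLimitBytes * softLimitFraction);

    // Values that would be written into the container's memory cgroup:
    System.out.println("memory.limit_in_bytes      = " + hardLimitBytes);
    System.out.println("memory.soft_limit_in_bytes = " + softLimitBytes);
    System.out.println("memory.swappiness          = " + swappiness);
  }
}
{code}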
[jira] [Commented] (YARN-4248) REST API for submit/update/delete Reservations
[ https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047395#comment-15047395 ] Chris Douglas commented on YARN-4248: - Pushed to trunk, branch-2, branch-2.8. Sorry to have missed these in review. Not sure why it wasn't flagged by test-patch. > REST API for submit/update/delete Reservations > -- > > Key: YARN-4248 > URL: https://issues.apache.org/jira/browse/YARN-4248 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Carlo Curino >Assignee: Carlo Curino > Fix For: 2.8.0 > > Attachments: YARN-4248-asflicense.patch, YARN-4248.2.patch, > YARN-4248.3.patch, YARN-4248.5.patch, YARN-4248.6.patch, YARN-4248.patch > > > This JIRA tracks work to extend the RMWebService to support REST APIs to > submit/update/delete reservations. This will ease integration with external > tools that are not java-based. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state in CS
[ https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3946: Attachment: YARN-3946.v1.007.patch Hi [~wangda], I have incorporated the changes suggested by you. Please take a look > Allow fetching exact reason as to why a submitted app is in ACCEPTED state in > CS > > > Key: YARN-3946 > URL: https://issues.apache.org/jira/browse/YARN-3946 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler, resourcemanager >Affects Versions: 2.6.0 >Reporter: Sumit Nigam >Assignee: Naganarasimha G R > Attachments: 3946WebImages.zip, YARN-3946.v1.001.patch, > YARN-3946.v1.002.patch, YARN-3946.v1.003.Images.zip, YARN-3946.v1.003.patch, > YARN-3946.v1.004.patch, YARN-3946.v1.005.patch, YARN-3946.v1.006.patch, > YARN-3946.v1.007.patch > > > Currently there is no direct way to get the exact reason as to why a > submitted app is still in ACCEPTED state. It should be possible to know > through RM REST API as to what aspect is not being met - say, queue limits > being reached, or core/ memory requirement not being met, or AM limit being > reached, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4248) REST API for submit/update/delete Reservations
[ https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047396#comment-15047396 ] Carlo Curino commented on YARN-4248: Thanks for spotting this and to [~chris.douglas] for the zero-latency fix. I spoke with him and he will commit it soon (as I am travelling at the moment). > REST API for submit/update/delete Reservations > -- > > Key: YARN-4248 > URL: https://issues.apache.org/jira/browse/YARN-4248 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Carlo Curino >Assignee: Carlo Curino > Fix For: 2.8.0 > > Attachments: YARN-4248-asflicense.patch, YARN-4248.2.patch, > YARN-4248.3.patch, YARN-4248.5.patch, YARN-4248.6.patch, YARN-4248.patch > > > This JIRA tracks work to extend the RMWebService to support REST APIs to > submit/update/delete reservations. This will ease integration with external > tools that are not java-based. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4248) REST API for submit/update/delete Reservations
[ https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047389#comment-15047389 ] Chris Nauroth commented on YARN-4248: - Hi [~curino]. It looks like [~chris.douglas] just uploaded a patch to set up an exclusion of the json files from the license check. +1 for this. Thanks, Chris. > REST API for submit/update/delete Reservations > -- > > Key: YARN-4248 > URL: https://issues.apache.org/jira/browse/YARN-4248 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Carlo Curino >Assignee: Carlo Curino > Fix For: 2.8.0 > > Attachments: YARN-4248-asflicense.patch, YARN-4248.2.patch, > YARN-4248.3.patch, YARN-4248.5.patch, YARN-4248.6.patch, YARN-4248.patch > > > This JIRA tracks work to extend the RMWebService to support REST APIs to > submit/update/delete reservations. This will ease integration with external > tools that are not java-based. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4100) Add Documentation for Distributed Node Labels feature
[ https://issues.apache.org/jira/browse/YARN-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047386#comment-15047386 ] Hadoop QA commented on YARN-4100: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 13s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 23s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 52s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 25s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 46s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 43s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 10s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 21s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 52s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s {color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 3s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 9s {color} | {color:green} hadoop-yarn-site in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 19s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_91. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 9s {color} | {color:green} hadoop-yarn-site in the patch passed with JDK v1.7.0_91. {color} | | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 20s {color} | {color:red} Patch generated 3 ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 46m 16s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12776358/YARN-4100.v1.001.patch | | JIRA Issue | YARN-4100 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit xml | | uname | Linux f60b5fcdd61e 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 9676774 | | JDK v1.7.0_9
[jira] [Updated] (YARN-4248) REST API for submit/update/delete Reservations
[ https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Douglas updated YARN-4248: Attachment: YARN-4248-asflicense.patch > REST API for submit/update/delete Reservations > -- > > Key: YARN-4248 > URL: https://issues.apache.org/jira/browse/YARN-4248 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Carlo Curino >Assignee: Carlo Curino > Fix For: 2.8.0 > > Attachments: YARN-4248-asflicense.patch, YARN-4248.2.patch, > YARN-4248.3.patch, YARN-4248.5.patch, YARN-4248.6.patch, YARN-4248.patch > > > This JIRA tracks work to extend the RMWebService to support REST APIs to > submit/update/delete reservations. This will ease integration with external > tools that are not java-based. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4248) REST API for submit/update/delete Reservations
[ https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047378#comment-15047378 ] Carlo Curino commented on YARN-4248: Chris, I am happy to fix it, but if I am not mistaken, JSON doesn't allow comments... Any advice? > REST API for submit/update/delete Reservations > -- > > Key: YARN-4248 > URL: https://issues.apache.org/jira/browse/YARN-4248 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Carlo Curino >Assignee: Carlo Curino > Fix For: 2.8.0 > > Attachments: YARN-4248.2.patch, YARN-4248.3.patch, YARN-4248.5.patch, > YARN-4248.6.patch, YARN-4248.patch > > > This JIRA tracks work to extend the RMWebService to support REST APIs to > submit/update/delete reservations. This will ease integration with external > tools that are not java-based. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4350) TestDistributedShell fails
[ https://issues.apache.org/jira/browse/YARN-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047361#comment-15047361 ] Sangjin Lee commented on YARN-4350: --- I think either way is fine, although all things being equal I would slightly prefer the dynamic port. It's your call. :) > TestDistributedShell fails > -- > > Key: YARN-4350 > URL: https://issues.apache.org/jira/browse/YARN-4350 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-4350-feature-YARN-2928.001.patch > > > Currently TestDistributedShell does not pass on the feature-YARN-2928 branch. > There seem to be 2 distinct issues. > (1) testDSShellWithoutDomainV2* tests fail sporadically > These test fail more often than not if tested by themselves: > {noformat} > testDSShellWithoutDomainV2DefaultFlow(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) > Time elapsed: 30.998 sec <<< FAILURE! > java.lang.AssertionError: Application created event should be published > atleast once expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.checkTimelineV2(TestDistributedShell.java:451) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:326) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow(TestDistributedShell.java:207) > {noformat} > They start happening after YARN-4129. I suspect this might have to do with > some timing issue. > (2) the whole test times out > If you run the whole TestDistributedShell test, it times out without fail. > This may or may not have to do with the port change introduced by YARN-2859 > (just a hunch). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4350) TestDistributedShell fails
[ https://issues.apache.org/jira/browse/YARN-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047285#comment-15047285 ] Naganarasimha G R commented on YARN-4350: - Thanks [~sjlee0] & [~vrushalic], bq. Can we go back to before YARN-2859 and restore this unit test for the time being Would it be better to revert it entirely, or to apply the change I mentioned ({{ServerSocketUtil.getPort}}) so that we avoid the fixed ports (which YARN-2859 was trying to solve) and also get the current test case passing? I can add a comment in YARN-4372 noting that the fix done here is temporary! > TestDistributedShell fails > -- > > Key: YARN-4350 > URL: https://issues.apache.org/jira/browse/YARN-4350 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-4350-feature-YARN-2928.001.patch > > > Currently TestDistributedShell does not pass on the feature-YARN-2928 branch. > There seem to be 2 distinct issues. > (1) testDSShellWithoutDomainV2* tests fail sporadically > These test fail more often than not if tested by themselves: > {noformat} > testDSShellWithoutDomainV2DefaultFlow(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) > Time elapsed: 30.998 sec <<< FAILURE! > java.lang.AssertionError: Application created event should be published > atleast once expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.checkTimelineV2(TestDistributedShell.java:451) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:326) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow(TestDistributedShell.java:207) > {noformat} > They start happening after YARN-4129. I suspect this might have to do with > some timing issue. > (2) the whole test times out > If you run the whole TestDistributedShell test, it times out without fail. > This may or may not have to do with the port change introduced by YARN-2859 > (just a hunch). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
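For reference on the {{ServerSocketUtil.getPort}} suggestion above, a rough sketch of how a test could pick a free port instead of hard-coding one is below. {{ServerSocketUtil}} is the port-probing helper from Hadoop's test utilities; the starting port, retry count, and the surrounding wiring are illustrative assumptions.
{code}
import java.io.IOException;

import org.apache.hadoop.net.ServerSocketUtil;

public class DynamicPortSketch {
  public static void main(String[] args) throws IOException {
    // Try the preferred port first, then fall back to random free ports
    // (up to 10 retries) so concurrent test runs do not collide.
    int port = ServerSocketUtil.getPort(9188, 10);
    // A test would then point the relevant address/port configuration at
    // this value before starting its mini cluster.
    System.out.println("Using port " + port);
  }
}
{code}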
[jira] [Commented] (YARN-4350) TestDistributedShell fails
[ https://issues.apache.org/jira/browse/YARN-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047254#comment-15047254 ] Vrushali C commented on YARN-4350: -- I see, thanks [~Naganarasimha] for the clarification. +1 on going back to before YARN-2859 and restoring this unit test for the time being. > TestDistributedShell fails > -- > > Key: YARN-4350 > URL: https://issues.apache.org/jira/browse/YARN-4350 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-4350-feature-YARN-2928.001.patch > > > Currently TestDistributedShell does not pass on the feature-YARN-2928 branch. > There seem to be 2 distinct issues. > (1) testDSShellWithoutDomainV2* tests fail sporadically > These test fail more often than not if tested by themselves: > {noformat} > testDSShellWithoutDomainV2DefaultFlow(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) > Time elapsed: 30.998 sec <<< FAILURE! > java.lang.AssertionError: Application created event should be published > atleast once expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.checkTimelineV2(TestDistributedShell.java:451) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:326) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow(TestDistributedShell.java:207) > {noformat} > They start happening after YARN-4129. I suspect this might have to do with > some timing issue. > (2) the whole test times out > If you run the whole TestDistributedShell test, it times out without fail. > This may or may not have to do with the port change introduced by YARN-2859 > (just a hunch). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4415) Scheduler Web Ui shows max capacity for the queue is 100% but when we submit application doesnt get assigned
[ https://issues.apache.org/jira/browse/YARN-4415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4415: Attachment: capacity-scheduler.xml Hi [~wangda], bq. I think QueueCapacitiesInfo should not assume maxCapacity will be > eps. We have normalizations while setting values to QueueCapacities, so we should copy exactly same value from QueueCapacities to QueueCapacitiesInfo (cap it between 0 and 1 is fine). The point I am trying to make here is that none of the capacities are configured for the given queue and partition, and hence QueueCapacities will not have configured capacities for the given label; when QueueCapacitiesInfo is queried for the non-existent label, it returns the default capacities as 0 and max as 100 (though this can be corrected to be 1). bq. It's a valid use case that a queue has max capacity = 0, for example, reservation system (YARN-1051) could dynamically adjust queue capacities. I am not against the concept of configuring the max capacity to zero, but the default should not be zero; otherwise we will not be able to benefit from accessible node labels set to {{*}}. bq. I may not fully understand why we need to fetch parent queue's capacities while setting QueueCapacitiesInfo. As I mentioned above, QueueCapacities should have everything considered and calculated at QueueCapacities (including parent queue's capacities), correct In the example scenarios I mentioned, the queue can access a particular partition, but the capacities for it are not configured, so in that case QueueCapacities will not have the label. Also, when the accessible node label is configured as {{*}}, any new label can be added to the cluster and an NM can be mapped to it, but as the capacities are not configured for the queue, allocations cannot happen. I hope I am clear; if not, I have uploaded my capacity-scheduler.xml. Just create a new partition label xxx and try to submit a job for it in the default queue (the default queue is configured with accessible node labels as {{*}}). The job will not be able to proceed. > Scheduler Web Ui shows max capacity for the queue is 100% but when we submit > application doesnt get assigned > > > > Key: YARN-4415 > URL: https://issues.apache.org/jira/browse/YARN-4415 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler, resourcemanager >Affects Versions: 2.7.2 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: App info with diagnostics info.png, > capacity-scheduler.xml, screenshot-1.png > > > Steps to reproduce the issue : > Scenario 1: > # Configure a queue(default) with accessible node labels as * > # create a exclusive partition *xxx* and map a NM to it > # ensure no capacities are configured for default for label xxx > # start an RM app with queue as default and label as xxx > # application is stuck but scheduler ui shows 100% as max capacity for that > queue > Scenario 2: > # create a nonexclusive partition *sharedPartition* and map a NM to it > # ensure no capacities are configured for default queue > # start an RM app with queue as *default* and label as *sharedPartition* > # application is stuck but scheduler ui shows 100% as max capacity for that > queue for *sharedPartition* > For both issues cause is the same default max capacity and abs max capacity > is set to Zero % -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4356) ensure the timeline service v.2 is disabled cleanly and has no impact when it's turned off
[ https://issues.apache.org/jira/browse/YARN-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047232#comment-15047232 ] Hadoop QA commented on YARN-4356: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 12 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 5s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 15s {color} | {color:green} feature-YARN-2928 passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 0s {color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 6s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 5m 0s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 2m 27s {color} | {color:green} feature-YARN-2928 passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 17s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in feature-YARN-2928 has 3 extant Findbugs warnings. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 28s {color} | {color:red} hadoop-yarn-common in feature-YARN-2928 failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 21s {color} | {color:red} hadoop-yarn-server-resourcemanager in feature-YARN-2928 failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 58s {color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 50s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 22m 2s {color} | {color:red} root-jdk1.8.0_66 with JDK v1.8.0_66 generated 5 new issues (was 780, now 780). {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 50s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 46s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 30m 49s {color} | {color:red} root-jdk1.7.0_91 with JDK v1.7.0_91 generated 5 new issues (was 772, now 772). 
{color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 8m 46s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 1s {color} | {color:red} Patch generated 26 new checkstyle issues in root (total was 1938, now 1932). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 4m 55s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 2m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 6s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 4m 11s {color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-api-jdk1.8.0_66 with JDK v1.8.0_66 generated 6 new issues (was 100, now 100). {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 27s {color} | {color:red} hadoop-yarn-common in the patch failed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 21s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} j
[jira] [Updated] (YARN-4368) Support Multiple versions of the timeline service at the same time
[ https://issues.apache.org/jira/browse/YARN-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4368: - Labels: yarn-2928-1st-milestone (was: ) > Support Multiple versions of the timeline service at the same time > -- > > Key: YARN-4368 > URL: https://issues.apache.org/jira/browse/YARN-4368 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > During rolling updgrade it will be helpfull to have the older version of the > timeline server to be also running so that the existing apps can submit to > the older version of ATS . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4240) Add documentation for delegated-centralized node labels feature
[ https://issues.apache.org/jira/browse/YARN-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R resolved YARN-4240. - Resolution: Duplicate Will be handled as part of YARN-4100 itself > Add documentation for delegated-centralized node labels feature > --- > > Key: YARN-4240 > URL: https://issues.apache.org/jira/browse/YARN-4240 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Dian Fu >Assignee: Dian Fu > > As a follow up of YARN-3964, we should add documentation for > delegated-centralized node labels feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4100) Add Documentation for Distributed Node Labels feature
[ https://issues.apache.org/jira/browse/YARN-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-4100: Attachment: NodeLabel.html YARN-4100.v1.001.patch Hi [~wangda], [~dian.fu], [~devaraj.k] & [~rohithsharma], Please review the attached patch for the documentation update covering the different configuration types of Node Labels. This covers the scope of YARN-4240. > Add Documentation for Distributed Node Labels feature > - > > Key: YARN-4100 > URL: https://issues.apache.org/jira/browse/YARN-4100 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, client, resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: NodeLabel.html, YARN-4100.v1.001.patch > > > Add Documentation for Distributed Node Labels feature -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4248) REST API for submit/update/delete Reservations
[ https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047165#comment-15047165 ] Chris Nauroth commented on YARN-4248: - This patch introduced license warnings on the testing json files. Here is an example from the latest pre-commit run on HADOOP-11505. https://builds.apache.org/job/PreCommit-HADOOP-Build/8202/artifact/patchprocess/patch-asflicense-problems.txt Would you please either revert or quickly correct the license warning? Thank you. > REST API for submit/update/delete Reservations > -- > > Key: YARN-4248 > URL: https://issues.apache.org/jira/browse/YARN-4248 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Carlo Curino >Assignee: Carlo Curino > Fix For: 2.8.0 > > Attachments: YARN-4248.2.patch, YARN-4248.3.patch, YARN-4248.5.patch, > YARN-4248.6.patch, YARN-4248.patch > > > This JIRA tracks work to extend the RMWebService to support REST APIs to > submit/update/delete reservations. This will ease integration with external > tools that are not java-based. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047156#comment-15047156 ] Sidharta Seethana commented on YARN-1856: - Ugh. IDE snafu - I somehow ended up looking at an older version of the patch. +1 on the latest version of the patch. > cgroups based memory monitoring for containers > -- > > Key: YARN-1856 > URL: https://issues.apache.org/jira/browse/YARN-1856 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Karthik Kambatla >Assignee: Varun Vasudev > Attachments: YARN-1856.001.patch, YARN-1856.002.patch, > YARN-1856.003.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047143#comment-15047143 ] Hadoop QA commented on YARN-4403: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 6s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 20s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 30s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 11s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 27s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 36s {color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 24s {color} | {color:red} hadoop-yarn-server-resourcemanager in trunk failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 4s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 7s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 4s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 6m 26s {color} | {color:red} hadoop-yarn-project_hadoop-yarn-jdk1.8.0_66 with JDK v1.8.0_66 generated 1 new issues (was 14, now 14). 
{color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 4s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 19s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 19s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 0s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 24s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 3s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 0s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 0s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 20s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_85. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 26s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_85. {color} | | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 23s {color} | {color:red} Patch generated 3 ASF License warnings
[jira] [Commented] (YARN-4427) NPE on handleNMContainerStatus when NM is registering to RM
[ https://issues.apache.org/jira/browse/YARN-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047139#comment-15047139 ] Sunil G commented on YARN-4427: --- Thanks [~brahma] for the details. During recovery of an AppAttempt, {{masterContainer}} can be null only if the AttemptState doesn't have one, and that is very unlikely. If ZK was unstable, do you mean a partial recovery happened here for the AppAttempt, leaving {{masterContainer}} null? Could you also please share the final state of that attempt during recovery? > NPE on handleNMContainerStatus when NM is registering to RM > --- > > Key: YARN-4427 > URL: https://issues.apache.org/jira/browse/YARN-4427 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula >Priority: Critical > > *Seen the following in one of our environments when the AM got an allocated container but failed to update it in ZK, while the cluster was having network problems for some time (up and down).* > {noformat} > 2015-12-07 16:39:38,489 | WARN | IPC Server handler 49 on 26003 | IPC Server > handler 49 on 26003, call > org.apache.hadoop.yarn.server.api.ResourceTrackerPB.registerNodeManager from > 9.91.8.220:52169 Call#17 Retry#0 | Server.java:2107 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.handleNMContainerStatus(ResourceTrackerService.java:286) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:395) > at > org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceTrackerPBServiceImpl.registerNodeManager(ResourceTrackerPBServiceImpl.java:54) > at > org.apache.hadoop.yarn.proto.ResourceTracker$ResourceTrackerService$2.callBlockingMethod(ResourceTracker.java:79) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2088) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2084) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1673) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2082) > {noformat} > Corresponding code (it might not match {{branch-2.7/Trunk}} since we had modified it internally): > {code} > 284 RMAppAttempt rmAppAttempt = rmApp.getRMAppAttempt(appAttemptId); > 285 Container masterContainer = rmAppAttempt.getMasterContainer(); > 286 if (masterContainer.getId().equals(containerStatus.getContainerId()) > 287 && containerStatus.getContainerState() == ContainerState.COMPLETE) > { > 288 ContainerStatus status = > 289 ContainerStatus.newInstance(containerStatus.getContainerId(), > 290 containerStatus.getContainerState(), > containerStatus.getDiagnostics(), > 291 containerStatus.getContainerExitStatus()); > 292 // sending master container finished event. > 293 RMAppAttemptContainerFinishedEvent evt = > 294 new RMAppAttemptContainerFinishedEvent(appAttemptId, status, > 295 nodeId); > 296 rmContext.getDispatcher().getEventHandler().handle(evt); > 297 } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
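One way to read the exchange above is that {{handleNMContainerStatus}} dereferences {{masterContainer}} without checking it. Below is a minimal sketch of the kind of guard that would avoid the NPE; it reuses the types visible in the quoted snippet, but it is an illustration of the idea, not the fix that was agreed on in this issue.
{code}
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerState;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttempt;

final class MasterContainerCheckSketch {
  // Returns true only when the reported status refers to the attempt's AM
  // container and that container has completed. Attempts that are null or
  // have no recovered master container are skipped instead of throwing.
  static boolean isCompletedMasterContainer(RMAppAttempt attempt,
      ContainerStatus status) {
    if (attempt == null || attempt.getMasterContainer() == null) {
      return false; // e.g. a partially recovered attempt after ZK trouble
    }
    Container master = attempt.getMasterContainer();
    return master.getId().equals(status.getContainerId())
        && status.getContainerState() == ContainerState.COMPLETE;
  }
}
{code}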
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047125#comment-15047125 ] Hudson commented on YARN-4348: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #675 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/675/]) Update CHANGES.txt for commit of YARN-4348 to branch-2.7 and branch-2.6. (ozawa: rev d7b3f8dbe818cff5fee4f4c0c70d306776aa318e) * hadoop-yarn-project/CHANGES.txt > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Fix For: 2.6.3, 2.7.3 > > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3623) We should have a config to indicate the Timeline Service version
[ https://issues.apache.org/jira/browse/YARN-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047084#comment-15047084 ] Junping Du commented on YARN-3623: -- I am not very familiar with the requirements of ATS v1.5. However, as far as ATS v2 is concerned, I would agree with [~sjlee0]'s comments above to make this take effect on the writer side only (TimelineClient). More clarifications: 1. This version configuration is meant to give applications/frameworks the flexibility to run on top of a YARN cluster with either ATS v1 or v2 running, by indicating the latest stable version of the ATS service that the cluster can support. The ATS v1 and v2 clients are different binary bits and, so far, use different and incompatible APIs to put information such as events, metrics, etc. By getting the proper configuration from YARN, an application can learn the ATS service version when landing on a YARN cluster, choose the corresponding TimelineClient to push info, and get rid of our pains in doing TestDistributedCache for the v1/v2 timeline service. 2. We shouldn't break the rolling upgrade scenario, or this could be seen as an incompatible feature which cannot land on the 2.x branch. That also means we should support ATS v1 and v2 services at the same time during a cluster upgrade, so legacy/existing applications can still access their old ATS service, the same as in many rolling upgrade stories. Clarification 2 is more related to this change: we'd better change "yarn.timeline-service.version" to "yarn.timeline-service.latest.version" and use "indicate to clients what is the latest stable version of the running timeline service" to get rid of any confusion here. It is also better to explicitly mention that our support range for ATS is [X-1, X] for rolling upgrade (assuming X is the latest stable ATS version). > We should have a config to indicate the Timeline Service version > > > Key: YARN-3623 > URL: https://issues.apache.org/jira/browse/YARN-3623 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Xuan Gong > Attachments: YARN-3623-2015-11-19.1.patch > > > So far RM, MR AM, DA AM added/changed new config to enable the feature to > write the timeline data to v2 server. It's good to have a YARN > timeline-service.version config like timeline-service.enable to indicate the > version of the running timeline service with the given YARN cluster. It's > beneficial for users to more smoothly move from v1 to v2, as they don't need > to change the existing config, but switch this config from v1 to v2. And each > framework doesn't need to have their own v1/v2 config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
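To make the versioning discussion above concrete, here is a small sketch of how an application-side writer might branch on such a key. The key string is the one named in this thread; the default value and the branching logic are purely illustrative assumptions, not an agreed-upon API.
{code}
import org.apache.hadoop.conf.Configuration;

public class TimelineVersionSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // "yarn.timeline-service.version" is the key being discussed here;
    // defaulting to 1.0f is an assumption for illustration only.
    float version = conf.getFloat("yarn.timeline-service.version", 1.0f);
    if (version >= 2.0f) {
      System.out.println("Cluster advertises ATS v2: use the v2 TimelineClient path");
    } else {
      System.out.println("Cluster advertises ATS v1/v1.5: keep using the v1 TimelineClient path");
    }
  }
}
{code}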
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047082#comment-15047082 ] Sidharta Seethana commented on YARN-1856: - [~vvasudev] , there is still an issue with the handling of the soft limit percentage. Isn't there a divide by 100 missing? {code} long softLimit = (long) (container.getResource().getMemory() * softLimitPerc); {code} The test code below needs to be updated too - instead of specifying the value of the soft limit percentage here in test code, maybe we should use DEFAULT_NM_MEMORY_RESOURCE_CGROUPS_SOFT_LIMIT_PERC ? It also looks like the validation of the memory value is not happening correctly below. You could use Mockito's {{eq()}} to verify argument values. {code} verify(mockCGroupsHandler, times(1)) .updateCGroupParam(CGroupsHandler.CGroupController.MEMORY, id, CGroupsHandler.CGROUP_PARAM_MEMORY_SOFT_LIMIT_BYTES, String.valueOf((int) (memory * 0.9)) + "M"); {code} > cgroups based memory monitoring for containers > -- > > Key: YARN-1856 > URL: https://issues.apache.org/jira/browse/YARN-1856 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Karthik Kambatla >Assignee: Varun Vasudev > Attachments: YARN-1856.001.patch, YARN-1856.002.patch, > YARN-1856.003.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
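To illustrate the divide-by-100 point above: if {{softLimitPerc}} holds a percentage such as 90.0f, the quoted expression overshoots by a factor of 100. A corrected version of the arithmetic, under that assumption, would look like the following sketch (the variable names mirror the review comment but are otherwise hypothetical).
{code}
public class SoftLimitFixSketch {
  public static void main(String[] args) {
    int containerMemoryMb = 2048;
    float softLimitPerc = 90.0f; // a percentage, not a fraction

    // Buggy form quoted in the review: 2048 * 90 = 184320 MB.
    // long softLimit = (long) (containerMemoryMb * softLimitPerc);

    // Corrected form: divide by 100 so that 90% of 2048 MB = 1843 MB.
    long softLimitMb = (long) (containerMemoryMb * softLimitPerc / 100.0f);
    System.out.println(softLimitMb + "M");
  }
}
{code}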
[jira] [Commented] (YARN-4309) Add debug information to application logs when a container fails
[ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047066#comment-15047066 ] Hadoop QA commented on YARN-4309: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 59s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 2s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 18s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 30s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 32s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 55s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 49s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 4s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 4s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 19s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 19s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 28s {color} | {color:red} Patch generated 3 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 359, now 359). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 33s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 14s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 29s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 49s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 25s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 1s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 48s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 26s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 16s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 17s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_85. {color} | | {col
[jira] [Commented] (YARN-4356) ensure the timeline service v.2 is disabled cleanly and has no impact when it's turned off
[ https://issues.apache.org/jira/browse/YARN-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047055#comment-15047055 ] Sangjin Lee commented on YARN-4356: --- The jenkins build didn't fire automatically. Kicking off a manual build. > ensure the timeline service v.2 is disabled cleanly and has no impact when > it's turned off > -- > > Key: YARN-4356 > URL: https://issues.apache.org/jira/browse/YARN-4356 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Sangjin Lee >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: YARN-4356-feature-YARN-2928.002.patch, > YARN-4356-feature-YARN-2928.poc.001.patch > > > For us to be able to merge the first milestone drop to trunk, we want to > ensure that once disabled the timeline service v.2 has no impact from the > server side to the client side. If the timeline service is not enabled, no > action should be done. If v.1 is enabled but not v.2, v.1 should behave the > same as it does before the merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4350) TestDistributedShell fails
[ https://issues.apache.org/jira/browse/YARN-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047052#comment-15047052 ] Sangjin Lee commented on YARN-4350: --- [~Naganarasimha]: {quote} I am not sure how to proceed with this jira, as its introduced by YARN-2859 but actual cause is YARN-4372. YARN-4372 not sure we have any definitive solution. as temporary fix shall i revert YARN-2859 solution and have my solution so that we can proceed smoothly till YARN-4372 has some proper solution? {quote} I agree. Can we go back to before YARN-2859 and restore this unit test for the time being? It's not clear if YARN-4372 has a quick solution. > TestDistributedShell fails > -- > > Key: YARN-4350 > URL: https://issues.apache.org/jira/browse/YARN-4350 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Sangjin Lee >Assignee: Naganarasimha G R > Attachments: YARN-4350-feature-YARN-2928.001.patch > > > Currently TestDistributedShell does not pass on the feature-YARN-2928 branch. > There seem to be 2 distinct issues. > (1) testDSShellWithoutDomainV2* tests fail sporadically > These test fail more often than not if tested by themselves: > {noformat} > testDSShellWithoutDomainV2DefaultFlow(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) > Time elapsed: 30.998 sec <<< FAILURE! > java.lang.AssertionError: Application created event should be published > atleast once expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.checkTimelineV2(TestDistributedShell.java:451) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:326) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow(TestDistributedShell.java:207) > {noformat} > They start happening after YARN-4129. I suspect this might have to do with > some timing issue. > (2) the whole test times out > If you run the whole TestDistributedShell test, it times out without fail. > This may or may not have to do with the port change introduced by YARN-2859 > (just a hunch). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046995#comment-15046995 ] Sunil G commented on YARN-4403: --- Yes. That's definitely a valid reason as per the current usage. Thank you very much for clarifying. > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403-v2.patch, YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4413) Nodes in the includes list should not be listed as decommissioned in the UI
[ https://issues.apache.org/jira/browse/YARN-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046994#comment-15046994 ] Sunil G commented on YARN-4413: --- Hi [~templedf] Thank you for the updated patch. I have some doubts about it. I am not very sure about the move from DECOMMISSIONED to SHUTDOWN on a RECOMMISSION event; that event doesn't sound clean or correct. Why could we not send the SHUTDOWN event itself? I see no harm in doing that: after a refresh, the node is found to be in a valid state as per the config but is DECOMMISSIONED in the RM, so such nodes can be moved via a SHUTDOWN event. Please correct me if I am missing something here. > Nodes in the includes list should not be listed as decommissioned in the UI > --- > > Key: YARN-4413 > URL: https://issues.apache.org/jira/browse/YARN-4413 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Daniel Templeton > Attachments: YARN-4413.001.patch > > > If I decommission a node and then move it from the excludes list back to the > includes list, but I don't restart the node, the node will still be listed by > the web UI as decommissioned until either the NM or RM is restarted. Ideally, > removing the node from the excludes list and putting it back into the > includes list should cause the node to be reported as shutdown instead. > CC [~kshukla] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
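To make the suggestion above concrete, here is a minimal sketch of reporting previously decommissioned nodes as shut down once they are back in the includes list. It assumes the RMNodeEventType.SHUTDOWN transition discussed in this issue is available on the branch and that the caller already has the refreshed includes list; it is not the attached YARN-4413.001.patch:

{code}
import java.util.Set;

import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEvent;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType;

/** Illustrative sketch only -- not the YARN-4413 patch. */
final class ShutdownOnRefreshSketch {
  private ShutdownOnRefreshSketch() {
  }

  /** After a host-list refresh, report DECOMMISSIONED nodes that are valid again as SHUTDOWN. */
  static void reportAsShutdown(RMContext rmContext, Set<String> includedHosts) {
    for (RMNode node : rmContext.getInactiveRMNodes().values()) {
      if (node.getState() == NodeState.DECOMMISSIONED
          && includedHosts.contains(node.getHostName())) {
        rmContext.getDispatcher().getEventHandler()
            .handle(new RMNodeEvent(node.getNodeID(), RMNodeEventType.SHUTDOWN));
      }
    }
  }
}
{code}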
[jira] [Issue Comment Deleted] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4348: - Comment: was deleted (was: Sounds good. Thanks!) > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Fix For: 2.6.3, 2.7.3 > > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046991#comment-15046991 ] Junping Du commented on YARN-4348: -- Sounds good. Thanks! > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Fix For: 2.6.3, 2.7.3 > > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046990#comment-15046990 ] Junping Du commented on YARN-4348: -- Sounds good. Thanks! > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Fix For: 2.6.3, 2.7.3 > > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046989#comment-15046989 ] Junping Du commented on YARN-4403: -- It depends on whether any consumer of SystemClock is using it to track absolute time rather than a duration or interval. I didn't check the other calling places in YARN/MR, and theoretically there could be consumers outside of YARN given that this is a public API. We may consider marking this API as deprecated later if we confirm that all known calling places use it only for durations or intervals. But for now, it is better to keep the annotation unchanged and just add a NOTE. > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403-v2.patch, YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
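The distinction drawn in the comment above is easy to show in code. A minimal sketch (not part of any patch in this thread) of an expiry check based on {{org.apache.hadoop.util.Time.monotonicNow()}}, whose values only ever move forward, so differences between them stay correct even if the wall clock is changed by settimeofday or NTP:

{code}
import org.apache.hadoop.util.Time;

/** Illustrative sketch only -- not the YARN-4403 patch. */
public final class ExpiryCheckSketch {
  private ExpiryCheckSketch() {
  }

  /**
   * Returns true when the last heartbeat is older than the expiry interval.
   * The timestamp must have been captured with Time.monotonicNow(); only
   * differences of monotonic values are meaningful, which is exactly what a
   * liveliness/expiry check needs.
   */
  public static boolean isExpired(long lastHeartbeatMonotonicMs, long expireIntervalMs) {
    return Time.monotonicNow() - lastHeartbeatMonotonicMs > expireIntervalMs;
  }
}
{code}

The corresponding caveat is that monotonic timestamps must never be compared with wall-clock values such as System.currentTimeMillis(), which is why the comment above is cautious about existing SystemClock callers that may track absolute time.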
[jira] [Updated] (YARN-4309) Add debug information to application logs when a container fails
[ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-4309: Attachment: YARN-4309.009.patch Thanks for the reviews [~ivanmi] and [~leftnoteasy]. bq. I think it would be helpful to document what the methods are supposed to do. Fixed. bq. Do we want to remove the error check above, to be consistent with Linux, and to avoid failing due to a "logging" failure? Also, the cp command does not exist on Windows. Please use copy instead. Fixed. bq. Why do you have both "dir" and "dir /AL /S" on Windows? Can you please include an inline comment with the rationale. The original intent was to try to detect broken symlinks, but I'm not sure if that's possible using the dir command. I've removed the dir /AL /S command. bq. In copyDebugInformation() you are also doing a chmod() internally. Wondering if this command should be injected by the call site, given that only the caller has context on what the destination is and whether special permission handling is needed. It might be possible to change the method to only accept src and copy the file to the current folder, in which case it might be fine to use chmod() given that there is an assumption on what the current folder is. Just a thought; you make the call. Good point. Copying the file to the current folder doesn't work because the container launch script runs in the container work dir and we want these files to be uploaded as part of log aggregation. I've just added a check to make sure the path is absolute before attempting the chmod. bq. I meant to print comments in the generated container_launch.sh for better readability. Such as: Fixed. > Add debug information to application logs when a container fails > > > Key: YARN-4309 > URL: https://issues.apache.org/jira/browse/YARN-4309 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: YARN-4309.001.patch, YARN-4309.002.patch, > YARN-4309.003.patch, YARN-4309.004.patch, YARN-4309.005.patch, > YARN-4309.006.patch, YARN-4309.007.patch, YARN-4309.008.patch, > YARN-4309.009.patch > > > Sometimes when a container fails, it can be pretty hard to figure out why it > failed. > My proposal is that if a container fails, we collect information about the > container local dir and dump it into the container log dir. Ideally, I'd like > to tar up the directory entirely, but I'm not sure of the security and space > implications of such an approach. At the very least, we can list all the files > in the container local dir, and dump the contents of launch_container.sh (into > the container log dir). > When log aggregation occurs, all this information will automatically get > collected and make debugging such failures much easier. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
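For readers following the review, here is a rough sketch of the kind of platform-specific debug commands under discussion: list the container work dir and copy the launch script into the container log dir (so log aggregation picks it up), with the absolute-path guard before chmod. The command strings and helper name are illustrative assumptions, not the committed YARN-4309 code:

{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Shell;

/** Illustrative sketch only -- not the committed YARN-4309 implementation. */
final class ContainerDebugInfoSketch {
  private ContainerDebugInfoSketch() {
  }

  /** Shell commands to append to the container launch script so debug info lands in the log dir. */
  static List<String> debugCommands(Path launchScript, Path containerLogDir) {
    List<String> cmds = new ArrayList<String>();
    Path copied = new Path(containerLogDir, launchScript.getName());
    if (Shell.WINDOWS) {
      cmds.add("dir");                                            // list the container work dir
      cmds.add("copy \"" + launchScript + "\" \"" + copied + "\"");
    } else {
      cmds.add("ls -l");                                          // list the container work dir
      cmds.add("cp \"" + launchScript + "\" \"" + copied + "\"");
      // The guard discussed above: only chmod when the destination is known to be absolute.
      if (copied.isAbsolute()) {
        cmds.add("chmod 640 \"" + copied + "\"");
      }
    }
    return cmds;
  }
}
{code}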
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046961#comment-15046961 ] Sunil G commented on YARN-4403: --- Hi [~djp] Thanks for the updated patch. I have one doubt here: we can see that {{SystemClock#getTime}} is *Public* and *Stable*. Now that there is a note saying it's advisable to use {{MonotonicClock}}, is any annotation change needed for {{SystemClock}}? > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403-v2.patch, YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046959#comment-15046959 ] Hudson commented on YARN-4348: -- FAILURE: Integrated in Hadoop-trunk-Commit #8938 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8938/]) Update CHANGES.txt for commit of YARN-4348 to branch-2.7 and branch-2.6. (ozawa: rev d7b3f8dbe818cff5fee4f4c0c70d306776aa318e) * hadoop-yarn-project/CHANGES.txt > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Fix For: 2.6.3, 2.7.3 > > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4386) refreshNodesGracefully() looks at active RMNode list for recommissioning decommissioned nodes
[ https://issues.apache.org/jira/browse/YARN-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046954#comment-15046954 ] Sunil G commented on YARN-4386: --- Hi [~kshukla] Sorry for replying late here. bq. Unless there are 2 refreshNodes done in parallel such that the first deactivateNodeTransition has not finished and the other refreshNodes is also trying to do the same transition Since the transitions happen under the write lock, this may not happen. I have one suggestion here: you could mark a node for GRACEFUL DECOMMISSION and ensure that the node is in the DECOMMISSIONING state (you can fire an event to RMNodeImpl directly to do this). Then invoke {{refreshNodesGracefully}} and verify whether a RECOMMISSION event is raised to the dispatcher. Similarly, mark a node as DECOMMISSIONED, invoke {{refreshNodesGracefully}}, and verify that the RECOMMISSION event is *NOT* raised. In the second case, it will not enter the *for* loop, but I feel this will clearly cover our case here, though it's not direct. Please correct me if I am wrong. > refreshNodesGracefully() looks at active RMNode list for recommissioning > decommissioned nodes > - > > Key: YARN-4386 > URL: https://issues.apache.org/jira/browse/YARN-4386 > Project: Hadoop YARN > Issue Type: Bug > Components: graceful >Affects Versions: 3.0.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Minor > Attachments: YARN-4386-v1.patch > > > In refreshNodesGracefully(), during recommissioning, the entry set from > getRMNodes(), which has only active nodes (RUNNING, DECOMMISSIONING, etc.), is > used for checking 'decommissioned' nodes, which are present in the > getInactiveRMNodes() map alone. > {code} > for (Entry entry:rmContext.getRMNodes().entrySet()) { > . > // Recommissioning the nodes > if (entry.getValue().getState() == NodeState.DECOMMISSIONING > || entry.getValue().getState() == NodeState.DECOMMISSIONED) { > this.rmContext.getDispatcher().getEventHandler() > .handle(new RMNodeEvent(nodeId, RMNodeEventType.RECOMMISSION)); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
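To make the direction of the fix concrete (separately from the test discussion above): DECOMMISSIONING nodes sit in the active getRMNodes() map, while DECOMMISSIONED nodes sit only in the inactive map, so a graceful refresh has to consider both before firing RECOMMISSION. A hedged sketch of that idea, assuming a branch that has the graceful-decommission NodeState.DECOMMISSIONING state and that the caller supplies the refreshed includes list; this is not the attached YARN-4386-v1.patch:

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEvent;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType;

/** Illustrative sketch only -- not the YARN-4386 patch. */
final class GracefulRecommissionSketch {
  private GracefulRecommissionSketch() {
  }

  static void recommissionEligibleNodes(RMContext rmContext, Set<String> includedHosts) {
    List<RMNode> candidates = new ArrayList<RMNode>();
    candidates.addAll(rmContext.getRMNodes().values());          // RUNNING, DECOMMISSIONING, ...
    candidates.addAll(rmContext.getInactiveRMNodes().values());  // DECOMMISSIONED, LOST, ...
    for (RMNode node : candidates) {
      boolean decommissioningOrDecommissioned =
          node.getState() == NodeState.DECOMMISSIONING
              || node.getState() == NodeState.DECOMMISSIONED;
      if (decommissioningOrDecommissioned && includedHosts.contains(node.getHostName())) {
        rmContext.getDispatcher().getEventHandler()
            .handle(new RMNodeEvent(node.getNodeID(), RMNodeEventType.RECOMMISSION));
      }
    }
  }
}
{code}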
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046951#comment-15046951 ] Tsuyoshi Ozawa commented on YARN-4348: -- Now I committed this to branch-2.6.3 too. Thanks! > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Fix For: 2.6.3, 2.7.3 > > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046944#comment-15046944 ] Tsuyoshi Ozawa commented on YARN-4348: -- Ran the tests locally and they pass on branch-2.6. Committing this to branch-2.6. > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Fix For: 2.6.3, 2.7.3 > > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046950#comment-15046950 ] Tsuyoshi Ozawa commented on YARN-4348: -- [~djp] I committed this to branch-2.6, which is targeting 2.6.3. Can I push this to branch-2.6.3? > ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding > blocking ZK's event thread > -- > > Key: YARN-4348 > URL: https://issues.apache.org/jira/browse/YARN-4348 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.2 >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > Fix For: 2.6.3, 2.7.3 > > Attachments: YARN-4348-branch-2.7.002.patch, > YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, > YARN-4348.001.patch, YARN-4348.001.patch, log.txt > > > Jian mentioned that the current internal ZK configuration of ZKRMStateStore > can cause a following situation: > 1. syncInternal timeouts, > 2. but sync succeeded later on. > We should use zkResyncWaitTime as the timeout value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
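For context on what is being committed in the comments above: the description amounts to issuing ZooKeeper's asynchronous sync() and waiting only a bounded time for its callback, so the caller is never blocked indefinitely. A minimal, self-contained sketch of that pattern against the stock ZooKeeper client API; the class and method names are illustrative and the timeout parameter plays the role of zkResyncWaitTime, so this is not the actual ZKRMStateStore code:

{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.AsyncCallback;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

/** Illustrative sketch only -- not the committed ZKRMStateStore change. */
final class BoundedZkSync {
  private BoundedZkSync() {
  }

  /** Returns true if the async sync() completed successfully within timeoutMs. */
  static boolean syncWithTimeout(ZooKeeper zk, String path, long timeoutMs)
      throws InterruptedException {
    final CountDownLatch done = new CountDownLatch(1);
    final int[] resultCode = new int[1];
    zk.sync(path, new AsyncCallback.VoidCallback() {
      @Override
      public void processResult(int rc, String p, Object ctx) {
        resultCode[0] = rc;
        done.countDown();
      }
    }, null);
    // Bounded wait: if the sync has not completed in time, give up and let the
    // caller decide how to proceed instead of blocking forever.
    return done.await(timeoutMs, TimeUnit.MILLISECONDS)
        && resultCode[0] == KeeperException.Code.OK.intValue();
  }
}
{code}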
[jira] [Commented] (YARN-4381) Add container launchEvent and container localizeFailed metrics in container
[ https://issues.apache.org/jira/browse/YARN-4381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046933#comment-15046933 ] Hadoop QA commented on YARN-4381: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 11s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 12s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 0s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 12s {color} | {color:red} Patch generated 4 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager (total was 116, now 120). {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 9s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 59s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 27s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_85. {color} | | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 20s {color} | {color:red} Patch generated 3 ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 36m 3s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12776320/YARN-4381.002.patch | | JIRA Issue | YARN-4381 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 767fc930d7b8 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit
[jira] [Updated] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4403: - Attachment: YARN-4403-v2.patch Updated the patch to incorporate Jian's comments. > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403-v2.patch, YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4381) Add container launchEvent and container localizeFailed metrics in container
[ https://issues.apache.org/jira/browse/YARN-4381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Yiqun updated YARN-4381: Attachment: YARN-4381.002.patch Thanks [~djp] for the review. I updated the container metrics to be more fine-grained. As you said, a container failure is not only caused by localization failure, so it is not suitable to add the metric on the launch event. Instead, I add the metric {{containerLaunchedSuccess}} when the container transitions to the running state and {{wasLaunched}} is set to true. Besides this, I add another two metrics for the container-failed cases: * one for containerFailedBeforeLaunched * the other for containerKilledAfterLaunched I think these metrics will help us understand a container's lifecycle more concretely. > Add container launchEvent and container localizeFailed metrics in container > --- > > Key: YARN-4381 > URL: https://issues.apache.org/jira/browse/YARN-4381 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.7.1 >Reporter: Lin Yiqun >Assignee: Lin Yiqun > Attachments: YARN-4381.001.patch, YARN-4381.002.patch > > > Recently, I found an issue with the NodeManager metrics: > {{NodeManagerMetrics#containersLaunched}} does not actually count the number of > successfully launched containers, because sometimes the launch fails after > receiving a kill command or when container localization fails. This leads to a > failed container, but currently this counter is incremented in the following > code whether the container starts successfully or fails. > {code} > Credentials credentials = parseCredentials(launchContext); > Container container = > new ContainerImpl(getConfig(), this.dispatcher, > context.getNMStateStore(), launchContext, > credentials, metrics, containerTokenIdentifier); > ApplicationId applicationID = > containerId.getApplicationAttemptId().getApplicationId(); > if (context.getContainers().putIfAbsent(containerId, container) != null) { > NMAuditLogger.logFailure(user, AuditConstants.START_CONTAINER, > "ContainerManagerImpl", "Container already running on this node!", > applicationID, containerId); > throw RPCUtil.getRemoteException("Container " + containerIdStr > + " already is running on this node!!"); > } > this.readLock.lock(); > try { > if (!serviceStopped) { > // Create the application > Application application = > new ApplicationImpl(dispatcher, user, applicationID, credentials, > context); > if (null == context.getApplications().putIfAbsent(applicationID, > application)) { > LOG.info("Creating a new application reference for app " + > applicationID); > LogAggregationContext logAggregationContext = > containerTokenIdentifier.getLogAggregationContext(); > Map appAcls = > container.getLaunchContext().getApplicationACLs(); > context.getNMStateStore().storeApplication(applicationID, > buildAppProto(applicationID, user, credentials, appAcls, > logAggregationContext)); > dispatcher.getEventHandler().handle( > new ApplicationInitEvent(applicationID, appAcls, > logAggregationContext)); > } > this.context.getNMStateStore().storeContainer(containerId, request); > dispatcher.getEventHandler().handle( > new ApplicationContainerInitEvent(container)); > > this.context.getContainerTokenSecretManager().startContainerSuccessful( > containerTokenIdentifier); > NMAuditLogger.logSuccess(user, AuditConstants.START_CONTAINER, > "ContainerManageImpl", applicationID, containerId); > // TODO launchedContainer misplaced -> doesn't necessarily mean a > container > // launch. 
A finished Application will not launch containers. > metrics.launchedContainer(); > metrics.allocateContainer(containerTokenIdentifier.getResource()); > } else { > throw new YarnException( > "Container start failed as the NodeManager is " + > "in the process of shutting down"); > } > {code} > In addition, we lack a localizationFailed metric in the container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
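A rough sketch of counters with the semantics described in the comment above, using the standard Hadoop metrics2 annotations. The class and counter names follow the comment and are assumptions for illustration; this does not reproduce the attached YARN-4381.002.patch:

{code}
import org.apache.hadoop.metrics2.MetricsSystem;
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterInt;

/** Illustrative sketch only -- not the attached YARN-4381 patch. */
@Metrics(about = "Container lifecycle metrics sketch", context = "yarn")
public class ContainerLifecycleMetricsSketch {
  @Metric("# of containers that reached the RUNNING state")
  MutableCounterInt containersLaunchedSuccess;
  @Metric("# of containers that failed before being launched")
  MutableCounterInt containersFailedBeforeLaunched;
  @Metric("# of containers killed after being launched")
  MutableCounterInt containersKilledAfterLaunched;

  public static ContainerLifecycleMetricsSketch create() {
    MetricsSystem ms = DefaultMetricsSystem.instance();
    // register() fills in the @Metric fields via the metrics2 annotation machinery.
    return ms.register("ContainerLifecycleMetricsSketch",
        "Finer-grained container lifecycle counters (sketch)",
        new ContainerLifecycleMetricsSketch());
  }

  public void launchedContainerSuccessfully() {
    containersLaunchedSuccess.incr();
  }

  public void failedBeforeLaunch() {
    containersFailedBeforeLaunched.incr();
  }

  public void killedAfterLaunch() {
    containersKilledAfterLaunched.incr();
  }
}
{code}

Call sites in the container state machine would then bump launchedContainerSuccessfully() only on the transition to RUNNING, and one of the two failure counters depending on whether wasLaunched had been set.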
[jira] [Issue Comment Deleted] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4403: - Comment: was deleted (was: Ok. That sounds good. Will update the patch soon.) > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046873#comment-15046873 ] Junping Du commented on YARN-4403: -- Ok. That sounds good. Will update the patch soon. > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046871#comment-15046871 ] Junping Du commented on YARN-4403: -- Ok. That sounds good. Will update the patch soon. > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period
[ https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046872#comment-15046872 ] Junping Du commented on YARN-4403: -- Ok. That sounds good. Will update the patch soon. > (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating > period > > > Key: YARN-4403 > URL: https://issues.apache.org/jira/browse/YARN-4403 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: YARN-4403.patch > > > Currently, (AM/NM/Container)LivelinessMonitor use current system time to > calculate a duration of expire which could be broken by settimeofday. We > should use Time.monotonicNow() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)