[jira] [Moved] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API
[ https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran moved HADOOP-11826 to YARN-3539: --- Component/s: (was: documentation) documentation Affects Version/s: (was: 2.7.0) 2.7.0 Key: YARN-3539 (was: HADOOP-11826) Project: Hadoop YARN (was: Hadoop Common) Compatibility doc to state that ATS v1 is a stable REST API --- Key: YARN-3539 URL: https://issues.apache.org/jira/browse/YARN-3539 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch The ATS v2 discussion and YARN-2423 have raised the question: how stable are the ATSv1 APIs? The existing compatibility document actually states that the History Server is [a stable REST API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs], which effectively means that ATSv1 has already been declared as a stable API. Clarify this by patching the compatibility document appropriately -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-3480: --- Summary: Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable (was: Make AM max attempts stored in RMStateStore to be configurable) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries (attempts), so it is better to set 'yarn.resourcemanager.am.max-attempts' larger. However, that makes the RMStateStore (FileSystem/HDFS/ZK) store more attempts and makes the RM recovery process much slower. It might be better to make the number of attempts stored in the RMStateStore configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
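To make the proposal concrete, here is a minimal sketch of the idea; the configuration key below is hypothetical and purely illustrative, and the trimming helper is not the actual patch:

{code}
import java.util.List;

// Hypothetical sketch only: the property key is illustrative, not the actual
// key added by YARN-3480. The idea is to cap how many attempts are persisted
// to the RMStateStore so recovery stays fast.
public final class StoredAttemptsCap {
  // hypothetical configuration key, for illustration
  static final String MAX_STORED_AM_ATTEMPTS =
      "yarn.resourcemanager.am.max-attempts.stored";

  // Keep only the most recent maxStored attempts for persistence.
  static <T> List<T> attemptsToStore(List<T> attempts, int maxStored) {
    int from = Math.max(0, attempts.size() - maxStored);
    return attempts.subList(from, attempts.size());
  }
}
{code}

With something like this, recovery replays at most a bounded number of attempts regardless of how large 'yarn.resourcemanager.am.max-attempts' is set.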
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509256#comment-14509256 ] Peng Zhang commented on YARN-3535: -- As per [~jlowe]'s thoughts, I understand there are two separate things here:
# During NM reconnection, RM and NM should sync at the container level. For this issue's scenario, container 04 should not be killed and rescheduled, so the AM can acquire and launch it on the NM after the NM has registered.
# We still need a fix in RMContainerImpl: restore the request during the transition from ALLOCATED to KILLED, because a real NM loss may cause a transition from ALLOCATED to KILLED with very small probability (the AM may heartbeat and acquire the container after the NM heartbeat times out).
I think the first item is an improvement to save time and preserve scheduling work already done. Or did I make a mistake? ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Attachments: syslog.tgz, yarn-app.log During a rolling update of the NM, the AM's start of a container on the NM failed, and the job then hung there. AM logs attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
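As a rough sketch of what "restore the request back to the scheduler" could look like in the kill transition, with all types below as simplified stand-ins for the real RM classes (this is not the actual YARN-3535 patch):

{code}
// Illustrative sketch, not the committed fix. Simplified stand-in types:
interface ResourceRequest {}
interface RMContainer { ResourceRequest getResourceRequest(); }
interface Scheduler { void recoverResourceRequest(ResourceRequest ask); }

final class KilledWhileAllocatedTransition {
  // On a kill while the container is still ALLOCATED (never acquired by the
  // AM), hand the original ask back to the scheduler so it can be
  // rescheduled instead of silently lost.
  void transition(RMContainer container, Scheduler scheduler) {
    ResourceRequest ask = container.getResourceRequest();
    if (ask != null) {
      scheduler.recoverResourceRequest(ask);
    }
  }
}
{code}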
[jira] [Updated] (YARN-3480) Make AM max attempts stored in RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-3480: --- Attachment: YARN-3480.01.patch Attached an initial patch. I will add test cases later. Make AM max attempts stored in RMStateStore to be configurable -- Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries (attempts), so it is better to set 'yarn.resourcemanager.am.max-attempts' larger. However, that makes the RMStateStore (FileSystem/HDFS/ZK) store more attempts and makes the RM recovery process much slower. It might be better to make the number of attempts stored in the RMStateStore configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3538) TimelineServer doesn't catch/translate all exceptions raised
Steve Loughran created YARN-3538: Summary: TimelineServer doesn't catch/translate all exceptions raised Key: YARN-3538 URL: https://issues.apache.org/jira/browse/YARN-3538 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Steve Loughran Priority: Minor Not all exceptions in TimelineServer are translated into web exceptions; only IOEs are. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
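A minimal sketch of the kind of blanket translation the report asks for, using the standard JAX-RS WebApplicationException; the helper and its status mapping are illustrative, not the TimelineServer's actual code:

{code}
import javax.ws.rs.WebApplicationException;
import javax.ws.rs.core.Response;

// Illustrative helper: wrap *all* server-side failures as web exceptions,
// not only IOExceptions. The status mapping here is an assumption.
final class WebExceptions {
  static WebApplicationException translate(Exception e) {
    if (e instanceof WebApplicationException) {
      return (WebApplicationException) e;   // already a web exception
    }
    Response.Status status = (e instanceof IllegalArgumentException)
        ? Response.Status.BAD_REQUEST
        : Response.Status.INTERNAL_SERVER_ERROR;
    // keep the original exception as the cause so it is not lost
    return new WebApplicationException(e, status.getStatusCode());
  }
}
{code}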
[jira] [Commented] (YARN-3537) NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked
[ https://issues.apache.org/jira/browse/YARN-3537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509158#comment-14509158 ] Jason Lowe commented on YARN-3537: -- The code checks for a null store to avoid invoking the stop method, but then a few lines later it has the potential to invoke the canRecover method. Seems like we want to avoid doing anything at all in this method if the store is null. NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked Key: YARN-3537 URL: https://issues.apache.org/jira/browse/YARN-3537 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3537.patch 2015-04-23 19:30:34,961 INFO [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service NodeManager failed in state STOPPED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:181) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:326) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.tearDown(TestNodeManagerShutdown.java:106) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
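A minimal sketch of the guard being described, with a stand-in store type rather than the real NMStateStoreService: bail out of the whole method when the store was never created, instead of guarding only the stop() call.

{code}
import java.io.IOException;

// Sketch only, with simplified types; not the actual NodeManager code.
final class RecoveryStoreShutdown {
  interface RecoveryStore {   // stand-in for the NM state store service
    void stop() throws IOException;
    boolean canRecover();
  }

  void stopRecoveryStore(RecoveryStore store, boolean decommissioned)
      throws IOException {
    if (store == null) {
      return;  // serviceInit failed before the store existed: nothing to do
    }
    store.stop();
    if (decommissioned && store.canRecover()) {
      // remove persisted recovery state on decommission (elided)
    }
  }
}
{code}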
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509153#comment-14509153 ] Jason Lowe commented on YARN-3535: -- I think we need to fix the RMContainerImpl ALLOCATED to KILLED transition, but I think there's another bug here. I believe the container was killed in the first place because the RMNodeImpl reconnect transition makes an assumption that is racy. When the node reconnects, it checks whether the node reports no applications running. If it has no applications then it sends a removed-node event followed by an added-node event to the scheduler. This will cause the scheduler to kill all containers allocated on that node. However, the node will only know about a container if the AM acquires the container and tries to launch it on the node. That can take minutes to transpire, so it's dangerous to assume that a node not reporting any applications means it doesn't have anything pending. I think we'll have to revisit the solution to YARN-2561 to either eliminate this race or make it safe if it does occur. Ideally we shouldn't be sending a remove/add event to the scheduler if the node is reconnecting, but we need to make sure we cancel containers on the node that are no longer running. Since the node reports what containers it has when it reconnects, it seems like we can convey that information to the scheduler to correct anything that doesn't match up. Any container in the RUNNING state that no longer appears in the list of containers when registering can be killed by the scheduler, as it does when a node is removed, and I believe that will fix YARN-2561 and also avoid this race (see the sketch after this entry). cc: [~djp] as this also has potential ramifications for graceful decommission. If we try to gracefully decommission a node that isn't currently reporting applications, we may also need to verify the scheduler hasn't allocated or handed out a container for that node that hasn't reached the node yet. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Attachments: syslog.tgz, yarn-app.log During a rolling update of the NM, the AM's start of a container on the NM failed, and the job then hung there. AM logs attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
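A rough sketch of the reconciliation proposed above, with stand-in types for the real RM classes (not the committed fix): on reconnect, rather than remove/add, kill only the RUNNING containers the node no longer reports, and leave ALLOCATED ones for the AM to acquire.

{code}
import java.util.List;
import java.util.Set;

// Sketch only; all types are simplified stand-ins.
final class ReconnectReconciler {
  interface RMContainerView {
    String getContainerId();
    boolean isRunning();   // true only once the container reached RUNNING
  }
  interface SchedulerNode {
    List<RMContainerView> getAllocatedContainers();
  }
  interface Scheduler {
    void killContainer(RMContainerView container);
  }

  void reconcile(SchedulerNode node, Set<String> containersReportedByNM,
      Scheduler scheduler) {
    for (RMContainerView c : node.getAllocatedContainers()) {
      // ALLOCATED containers are left alone: the AM may not have acquired
      // them yet, which is exactly the race described above.
      if (c.isRunning()
          && !containersReportedByNM.contains(c.getContainerId())) {
        scheduler.killContainer(c);  // removed-node handling, but targeted
      }
    }
  }
}
{code}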
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509152#comment-14509152 ] Thomas Graves commented on YARN-3517: -
{code}
+    // non-secure mode with no acls enabled
+    if (!isAdmin && !UserGroupInformation.isSecurityEnabled()
+        && !adminACLsManager.areACLsEnabled()) {
+      isAdmin = true;
+    }
{code}
We don't need the isSecurityEnabled check; just keep the one for areACLsEnabled. This could be combined with the previous if (make this the else-if part), but that isn't a big deal. Also, in QueuesBlock we are creating the AdminACLsManager on every web page load. Perhaps a better way would be to use this.rm.getApplicationACLsManager() and extend the ApplicationACLsManager to expose isAdmin functionality. RM web ui for dumping scheduler logs should be for admins only -- Key: YARN-3517 URL: https://issues.apache.org/jira/browse/YARN-3517 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, security Affects Versions: 2.7.0 Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Labels: security Attachments: YARN-3517.001.patch, YARN-3517.002.patch, YARN-3517.003.patch YARN-3294 allows users to dump scheduler logs from the web UI. This should be for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
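For clarity, a sketch of the simplified check being suggested, assuming AdminACLsManager's isAdmin/areACLsEnabled accessors; this is not the actual QueuesBlock code:

{code}
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.security.AdminACLsManager;

// Sketch of the suggested simplification: drop the redundant
// security-enabled check and fold the ACLs-disabled case into an else-if.
final class AdminCheck {
  static boolean isAdmin(UserGroupInformation caller, AdminACLsManager acls) {
    if (acls.isAdmin(caller)) {
      return true;               // explicitly listed in the admin ACL
    } else if (!acls.areACLsEnabled()) {
      return true;               // no ACLs configured: everyone is an admin
    }
    return false;
  }
}
{code}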
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509237#comment-14509237 ] Sangjin Lee commented on YARN-3437: --- Thanks Junping. I initially went with ConcurrentHashMap when I first created this, as that is my preference as well. But it was really the need to prevent multiple threads from starting their collector (should that situation arise) that made ConcurrentHashMap not an option. Again, if we want both, we would need to look at the LoadingCache, but since this is really a low-contention situation, that would be overkill. The chances of this code running into lock contention should be low. convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch, YARN-3437.003.patch, YARN-3437.004.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
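To make the trade-off concrete, a small illustration with simplified types (not the actual collector code): with ConcurrentHashMap.putIfAbsent alone, the loser of the race may already have started its collector, so a plain synchronized check-then-act performs the start exactly once, at the cost of a little contention on this low-traffic path.

{code}
import java.util.HashMap;
import java.util.Map;

// Sketch: start a collector at most once per key, under a lock.
final class CollectorRegistry<K> {
  interface Collector { void start(); }

  private final Map<K, Collector> collectors = new HashMap<>();

  synchronized Collector getOrStart(K appId, Collector fresh) {
    Collector existing = collectors.get(appId);
    if (existing != null) {
      return existing;     // someone else already started one
    }
    fresh.start();         // start exactly once, inside the lock
    collectors.put(appId, fresh);
    return fresh;
  }
}
{code}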
[jira] [Commented] (YARN-3471) Fix timeline client retry
[ https://issues.apache.org/jira/browse/YARN-3471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509138#comment-14509138 ] Steve Loughran commented on YARN-3471: -- Looking at this patch, it doesn't address the issues I've encountered in YARN-3477:
# when the retries time out, the exception causing the attempts to fail is not rethrown
# interrupts are being swallowed, making it impossible to reliably interrupt the thread
I'd like to get the YARN-3477 patches in as well as these ones, so let's see if we can do them in turn. Fix timeline client retry - Key: YARN-3471 URL: https://issues.apache.org/jira/browse/YARN-3471 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.8.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3471.1.patch, YARN-3471.2.patch I found that the client retry has some problems:
1. The new put methods will retry on all exceptions, but they should only do so upon ConnectException.
2. We can reuse TimelineClientConnectionRetry to simplify the retry logic.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
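A sketch of the retry behavior being asked for here, not the TimelineClient code itself: remember the last failure and rethrow it when retries run out, and surface interrupts instead of swallowing them.

{code}
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.Callable;

// Illustrative retry loop; names and structure are assumptions.
final class Retries {
  static <T> T withRetries(Callable<T> op, int maxRetries, long sleepMs)
      throws IOException {
    IOException last = null;
    for (int i = 0; i <= maxRetries; i++) {
      try {
        return op.call();
      } catch (IOException e) {
        last = e;                          // remember the real cause
      } catch (Exception e) {
        throw new IOException(e);
      }
      if (i < maxRetries) {
        try {
          Thread.sleep(sleepMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();  // keep the interrupt status
          throw (IOException) new InterruptedIOException(
              "interrupted during retry " + i).initCause(ie);
        }
      }
    }
    throw last;  // rethrow the last failure rather than a bare runtime error
  }
}
{code}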
[jira] [Commented] (YARN-3537) NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked
[ https://issues.apache.org/jira/browse/YARN-3537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509208#comment-14509208 ] Brahma Reddy Battula commented on YARN-3537: [~jlowe] thanks for taking a look into this issue. Yes, you are correct. Updated the patch; kindly review. NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked Key: YARN-3537 URL: https://issues.apache.org/jira/browse/YARN-3537 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3537-002.patch, YARN-3537.patch 2015-04-23 19:30:34,961 INFO [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service NodeManager failed in state STOPPED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:181) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:326) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.tearDown(TestNodeManagerShutdown.java:106) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509169#comment-14509169 ] Hudson commented on YARN-3434: -- FAILURE: Integrated in Hadoop-trunk-Commit #7646 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7646/])
YARN-3434. Interaction between reservations and userlimit can result in significant ULF violation (tgraves: rev 189a63a719c63b67a1783a280bfc2f72dcb55277)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/ResourceLimits.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch ULF was set to 1.0. The user was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to let the user limit be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3537) NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked
[ https://issues.apache.org/jira/browse/YARN-3537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3537: --- Attachment: YARN-3537-002.patch NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked Key: YARN-3537 URL: https://issues.apache.org/jira/browse/YARN-3537 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3537-002.patch, YARN-3537.patch 2015-04-23 19:30:34,961 INFO [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service NodeManager failed in state STOPPED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:181) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:326) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.tearDown(TestNodeManagerShutdown.java:106) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3537) NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked
[ https://issues.apache.org/jira/browse/YARN-3537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509288#comment-14509288 ] Hadoop QA commented on YARN-3537: -
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 38s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | javac | 7m 34s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 32s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 5m 19s | There were no new checkstyle issues. |
| {color:green}+1{color} | install | 1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 3s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 5m 56s | Tests passed in hadoop-yarn-server-nodemanager. |
| | | 46m 52s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727625/YARN-3537-002.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 189a63a |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7472/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7472/testReport/ |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7472/console |
This message was automatically generated. NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked Key: YARN-3537 URL: https://issues.apache.org/jira/browse/YARN-3537 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3537-002.patch, YARN-3537.patch 2015-04-23 19:30:34,961 INFO [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service NodeManager failed in state STOPPED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:181) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:326) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.tearDown(TestNodeManagerShutdown.java:106) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3537) NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked
[ https://issues.apache.org/jira/browse/YARN-3537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509303#comment-14509303 ] Jason Lowe commented on YARN-3537: -- If the store is not null then we want to close it regardless of whether the context is null or not, because that means we opened it earlier. My previous point is that if we don't have a store then this method has nothing to do. If the context is null but the store isn't then there _is_ something that still needs to be done. NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked Key: YARN-3537 URL: https://issues.apache.org/jira/browse/YARN-3537 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3537-002.patch, YARN-3537.patch 2015-04-23 19:30:34,961 INFO [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service NodeManager failed in state STOPPED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:181) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:326) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.tearDown(TestNodeManagerShutdown.java:106) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API
[ https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509313#comment-14509313 ] Hadoop QA commented on YARN-3539: -
| (/) *{color:green}+1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 2m 52s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | release audit | 0m 20s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | site | 2m 55s | Site still builds. |
| | | 6m 15s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727643/YARN-3539-003.patch |
| Optional Tests | site |
| git revision | trunk / 189a63a |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7474/console |
This message was automatically generated. Compatibility doc to state that ATS v1 is a stable REST API --- Key: YARN-3539 URL: https://issues.apache.org/jira/browse/YARN-3539 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, YARN-3539-003.patch The ATS v2 discussion and YARN-2423 have raised the question: how stable are the ATSv1 APIs? The existing compatibility document actually states that the History Server is [a stable REST API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs], which effectively means that ATSv1 has already been declared as a stable API. Clarify this by patching the compatibility document appropriately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2740) ResourceManager side should properly handle node label modifications when distributed node label configuration enabled
[ https://issues.apache.org/jira/browse/YARN-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-2740: Attachment: YARN-2740.20150423-1.patch Hi [~wangda], the patch failed to apply on top of yesterday's check-ins, hence rebased; I have also corrected the trailing whitespace. ResourceManager side should properly handle node label modifications when distributed node label configuration enabled -- Key: YARN-2740 URL: https://issues.apache.org/jira/browse/YARN-2740 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Fix For: 2.8.0 Attachments: YARN-2740-20141024-1.patch, YARN-2740.20150320-1.patch, YARN-2740.20150327-1.patch, YARN-2740.20150411-1.patch, YARN-2740.20150411-2.patch, YARN-2740.20150411-3.patch, YARN-2740.20150417-1.patch, YARN-2740.20150420-1.patch, YARN-2740.20150421-1.patch, YARN-2740.20150422-2.patch, YARN-2740.20150423-1.patch According to YARN-2495, when distributed node label configuration is enabled: - RMAdmin / REST API should reject change-labels-on-node operations. - CommonNodeLabelsManager shouldn't persist labels on nodes when NMs heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3477) TimelineClientImpl swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-3477: - Attachment: YARN-3477-001.patch Patch -001:
# rethrows the last received exception on a retry-count failure
# converts caught InterruptedExceptions to InterruptedIOException, which allows recipients to selectively look for that exception
# no longer swallows InterruptedExceptions during sleep
There are no tests here, because there's no easy way to exercise the failure paths. Close review is encouraged. There's one more thing we may want to do when handling the interrupts: re-enable the thread's interrupted flag. See [http://www.ibm.com/developerworks/library/j-jtp05236/] for the specifics. I don't see any harm in doing this, and as it helps preserve the interrupted state, it can only be a good thing. TimelineClientImpl swallows exceptions -- Key: YARN-3477 URL: https://issues.apache.org/jira/browse/YARN-3477 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0, 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-3477-001.patch If the timeline client fails more than the retry count, the original exception is not thrown. Instead some runtime exception is raised saying retries have run out.
# the failing exception should be rethrown, ideally via NetUtils.wrapException, to include the URL of the failing endpoint
# otherwise, the raised RTE should (a) state that URL and (b) set the original fault as the inner cause
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
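A minimal sketch of the interrupt-preserving pattern from the linked article (not the patch itself): restore the interrupt status before rethrowing, so callers that poll Thread.interrupted() still observe it.

{code}
import java.io.IOException;
import java.io.InterruptedIOException;

// Sketch only; method name and structure are assumptions.
final class InterruptibleSleep {
  static void sleep(long millis) throws IOException {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();  // re-enable the interrupted flag
      throw (IOException)
          new InterruptedIOException("sleep interrupted").initCause(ie);
    }
  }
}
{code}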
[jira] [Commented] (YARN-3413) Node label attributes (like exclusivity) should be settable via addToClusterNodeLabels but shouldn't be changeable at runtime
[ https://issues.apache.org/jira/browse/YARN-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509556#comment-14509556 ] Hudson commented on YARN-3413: -- FAILURE: Integrated in Hadoop-trunk-Commit #7650 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7650/]) YARN-3413. Changed Nodelabel attributes (like exclusivity) to be settable only via addToClusterNodeLabels but not changeable at runtime. (Wangda Tan via vinodkv) (vinodkv: rev f5fe35e297ed4a00a1ba93d090207ef67cebcc9d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/server/api/impl/pb/client/ResourceManagerAdministrationProtocolPBClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/main/java/org/apache/hadoop/mapred/ResourceMgrDelegate.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/event/StoreUpdateNodeLabelsEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/NullRMNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestQueueParsing.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/server/resourcemanager_administration_protocol.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/event/NodeLabelsStoreEventType.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/server/api/impl/pb/service/ResourceManagerAdministrationProtocolPBServiceImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/YarnClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeLabel.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/ClusterCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/server/yarn_server_resourcemanager_service_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/server/api/ResourceManagerAdministrationProtocol.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/TestRMNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestNodeLabelContainerAllocation.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShellWithNodeLabels.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/NodeLabelsStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/nodelabels/TestCommonNodeLabelsManager.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/nodelabels/DummyCommonNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/UpdateNodeLabelsRequestPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/GetClusterNodeLabelsResponsePBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestWorkPreservingRMRestartForNodeLabel.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/RMNodeLabel.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/RMAdminCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java *
[jira] [Commented] (YARN-3537) NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked
[ https://issues.apache.org/jira/browse/YARN-3537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509366#comment-14509366 ] Brahma Reddy Battula commented on YARN-3537: Thanks again for your input. Updated patch. NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked Key: YARN-3537 URL: https://issues.apache.org/jira/browse/YARN-3537 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3537-002.patch, YARN-3537-003.patch, YARN-3537.patch 2015-04-23 19:30:34,961 INFO [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service NodeManager failed in state STOPPED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:181) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:326) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.tearDown(TestNodeManagerShutdown.java:106) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3537) NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked
[ https://issues.apache.org/jira/browse/YARN-3537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509452#comment-14509452 ] Hadoop QA commented on YARN-3537: - (!) The patch artifact directory on has been removed! This is a fatal error for test-patch.sh. Aborting. Jenkins (node H8) information at https://builds.apache.org/job/PreCommit-YARN-Build/7477/ may provide some hints. NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked Key: YARN-3537 URL: https://issues.apache.org/jira/browse/YARN-3537 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3537-002.patch, YARN-3537-003.patch, YARN-3537.patch 2015-04-23 19:30:34,961 INFO [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service NodeManager failed in state STOPPED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:181) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:326) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.tearDown(TestNodeManagerShutdown.java:106) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2740) ResourceManager side should properly handle node label modifications when distributed node label configuration enabled
[ https://issues.apache.org/jira/browse/YARN-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509455#comment-14509455 ] Wangda Tan commented on YARN-2740: -- Seems like an env issue; re-kicked Jenkins. ResourceManager side should properly handle node label modifications when distributed node label configuration enabled -- Key: YARN-2740 URL: https://issues.apache.org/jira/browse/YARN-2740 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Fix For: 2.8.0 Attachments: YARN-2740-20141024-1.patch, YARN-2740.20150320-1.patch, YARN-2740.20150327-1.patch, YARN-2740.20150411-1.patch, YARN-2740.20150411-2.patch, YARN-2740.20150411-3.patch, YARN-2740.20150417-1.patch, YARN-2740.20150420-1.patch, YARN-2740.20150421-1.patch, YARN-2740.20150422-2.patch, YARN-2740.20150423-1.patch According to YARN-2495, when distributed node label configuration is enabled: - RMAdmin / REST API should reject change-labels-on-node operations. - CommonNodeLabelsManager shouldn't persist labels on nodes when NMs heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3413) Node label attributes (like exclusivity) should be settable via addToClusterNodeLabels but shouldn't be changeable at runtime
[ https://issues.apache.org/jira/browse/YARN-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509497#comment-14509497 ] Vinod Kumar Vavilapalli commented on YARN-3413: --- I tried running test-patch.sh on my own box, but ran into issues w.r.t. checkstyle. I'd think some of the checkstyle issues are the same as HADOOP-11869. Will create a ticket to address all of the YARN checkstyle issues once HADOOP-11869 gets addressed. Checking this in now. Node label attributes (like exclusivity) should be settable via addToClusterNodeLabels but shouldn't be changeable at runtime -- Key: YARN-3413 URL: https://issues.apache.org/jira/browse/YARN-3413 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3413.1.patch, YARN-3413.2.patch, YARN-3413.3.patch, YARN-3413.4.patch, YARN-3413.5.patch, YARN-3413.6.patch, YARN-3413.7.patch As mentioned in https://issues.apache.org/jira/browse/YARN-3345?focusedCommentId=14384947&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14384947, changing node label exclusivity and/or other attributes may not be a real use case, and we should also support setting node label attributes while adding them to the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3413) Node label attributes (like exclusivity) should be settable via addToClusterNodeLabels but shouldn't be changeable at runtime
[ https://issues.apache.org/jira/browse/YARN-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509559#comment-14509559 ] Wangda Tan commented on YARN-3413: -- Thanks Vinod for review & commit! Node label attributes (like exclusivity) should be settable via addToClusterNodeLabels but shouldn't be changeable at runtime -- Key: YARN-3413 URL: https://issues.apache.org/jira/browse/YARN-3413 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3413.1.patch, YARN-3413.2.patch, YARN-3413.3.patch, YARN-3413.4.patch, YARN-3413.5.patch, YARN-3413.6.patch, YARN-3413.7.patch As mentioned in https://issues.apache.org/jira/browse/YARN-3345?focusedCommentId=14384947&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14384947, changing node label exclusivity and/or other attributes may not be a real use case, and we should also support setting node label attributes while adding them to the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2740) ResourceManager side should properly handle node label modifications when distributed node label configuration enabled
[ https://issues.apache.org/jira/browse/YARN-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509451#comment-14509451 ] Hadoop QA commented on YARN-2740: -
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 42s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:red}-1{color} | javac | 7m 45s | The applied patch generated 122 additional warning messages. |
| {color:red}-1{color} | javadoc | 9m 56s | The applied patch generated 9 additional warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 5m 30s | The applied patch generated 3 additional checkstyle issues. |
| {color:green}+1{color} | install | 1m 37s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 4m 3s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 2m 3s | Tests passed in hadoop-yarn-common. |
| {color:red}-1{color} | yarn tests | 19m 31s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 66m 32s | |
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler |
| | hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler |
| | hadoop.yarn.server.resourcemanager.TestApplicationACLs |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs |
| | hadoop.yarn.server.resourcemanager.TestApplicationCleanup |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler |
| | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSLeafQueue |
| | hadoop.yarn.server.resourcemanager.TestRM |
| | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes |
| | hadoop.yarn.server.resourcemanager.TestResourceTrackerService |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification |
| | hadoop.yarn.server.resourcemanager.security.TestAMRMTokens |
| | hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter |
| | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerFairShare |
| | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSAppAttempt |
| | hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerDynamicBehavior |
| | hadoop.yarn.server.resourcemanager.TestMoveApplication |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue |
| | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerEventLog |
| | hadoop.yarn.server.resourcemanager.TestApplicationMasterLauncher |
| | hadoop.yarn.server.resourcemanager.TestRMProxyUsersConf |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation |
| | hadoop.yarn.server.resourcemanager.resourcetracker.TestNMReconnect |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesHttpStaticUserPermissions |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueMappings |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices |
| | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer |
| | hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler |
| | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart |
| | hadoop.yarn.server.resourcemanager.TestApplicationMasterService |
| | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerQueueACLs |
| | hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy |
| | hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodeLabels |
| | hadoop.yarn.server.resourcemanager.scheduler.TestQueueMetrics |
[jira] [Commented] (YARN-3319) Implement a FairOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-3319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509470#comment-14509470 ] Hudson commented on YARN-3319: -- FAILURE: Integrated in Hadoop-trunk-Commit #7648 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7648/])
YARN-3319. Implement a FairOrderingPolicy. (Craig Welch via wangda) (wangda: rev 395205444e8a9ae6fc86f0a441e98486a775511a)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/TestFairOrderingPolicy.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/FairOrderingPolicy.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/policy/CompoundComparator.java
Implement a FairOrderingPolicy -- Key: YARN-3319 URL: https://issues.apache.org/jira/browse/YARN-3319 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.8.0 Attachments: YARN-3319.13.patch, YARN-3319.14.patch, YARN-3319.17.patch, YARN-3319.35.patch, YARN-3319.39.patch, YARN-3319.45.patch, YARN-3319.47.patch, YARN-3319.53.patch, YARN-3319.58.patch, YARN-3319.70.patch, YARN-3319.71.patch, YARN-3319.72.patch, YARN-3319.73.patch, YARN-3319.74.patch, YARN-3319.75.patch Implement a FairOrderingPolicy which prefers to allocate to SchedulerProcesses with the least current usage, very similar to the FairScheduler's FairSharePolicy. The policy will offer allocations to applications in a queue in order of least resources used, and preempt applications in reverse order (from most resources used). This will include conditional support for sizeBasedWeight-style adjustment. Optionally, based on a conditional configuration to enable sizeBasedWeight (default false), an adjustment to boost larger applications (to offset the natural preference for smaller applications) will adjust the resource usage value based on demand, dividing it by the value below: Math.log1p(app memory demand) / Math.log(2). In cases where the above is indeterminate (two applications are equal after this comparison), behavior falls back to comparison based on the application id, which is generally lexically FIFO for that comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
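A worked sketch of the comparison described above, with a simplified process type standing in for the real SchedulerProcess; this is not the actual FairOrderingPolicy code:

{code}
import java.util.Comparator;

// Sketch: order by adjusted usage, smallest first, tie-broken by app id.
final class FairOrderingSketch {
  static final class Proc {
    final String appId; final long usedMB; final long demandMB;
    Proc(String appId, long usedMB, long demandMB) {
      this.appId = appId; this.usedMB = usedMB; this.demandMB = demandMB;
    }
  }

  // sizeBasedWeight adjustment: divide usage by log2(1 + demand), per the
  // formula in the description, so large apps are not permanently starved
  // by the natural preference for small ones.
  static double adjustedUsage(Proc p, boolean sizeBasedWeight) {
    double usage = p.usedMB;
    if (sizeBasedWeight && p.demandMB > 0) {
      usage /= Math.log1p(p.demandMB) / Math.log(2);
    }
    return usage;
  }

  static Comparator<Proc> comparator(boolean sizeBasedWeight) {
    return Comparator
        .comparingDouble((Proc p) -> adjustedUsage(p, sizeBasedWeight))
        .thenComparing((Proc p) -> p.appId);  // lexical fallback, roughly FIFO
  }
}
{code}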
[jira] [Updated] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3517: Affects Version/s: (was: 2.7.0) RM web ui for dumping scheduler logs should be for admins only -- Key: YARN-3517 URL: https://issues.apache.org/jira/browse/YARN-3517 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, security Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Labels: security Attachments: YARN-3517.001.patch, YARN-3517.002.patch, YARN-3517.003.patch YARN-3294 allows users to dump scheduler logs from the web UI. This should be for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3477) TimelineClientImpl swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509409#comment-14509409 ] Hadoop QA commented on YARN-3477: -
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 35s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | javac | 7m 31s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 5m 22s | The applied patch generated 1 additional checkstyle issues. |
| {color:green}+1{color} | install | 1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 8s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 7m 32s | Tests passed in hadoop-yarn-client. |
| {color:red}-1{color} | yarn tests | 1m 53s | Tests failed in hadoop-yarn-common. |
| | | 51m 18s | |
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.client.api.impl.TestTimelineClient |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727644/YARN-3477-001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 49f6e3d |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7475/artifact/patchprocess/checkstyle-result-diff.txt |
| hadoop-yarn-client test log | https://builds.apache.org/job/PreCommit-YARN-Build/7475/artifact/patchprocess/testrun_hadoop-yarn-client.txt |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7475/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7475/testReport/ |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7475/console |
This message was automatically generated. TimelineClientImpl swallows exceptions -- Key: YARN-3477 URL: https://issues.apache.org/jira/browse/YARN-3477 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0, 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-3477-001.patch If the timeline client fails more than the retry count, the original exception is not thrown. Instead some runtime exception is raised saying retries have run out.
# the failing exception should be rethrown, ideally via NetUtils.wrapException, to include the URL of the failing endpoint
# otherwise, the raised RTE should (a) state that URL and (b) set the original fault as the inner cause
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2740) ResourceManager side should properly handle node label modifications when distributed node label configuration enabled
[ https://issues.apache.org/jira/browse/YARN-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509449#comment-14509449 ] Wangda Tan commented on YARN-2740: -- Thanks [~Naganarasimha], will commit once Jenkins gets back. ResourceManager side should properly handle node label modifications when distributed node label configuration enabled -- Key: YARN-2740 URL: https://issues.apache.org/jira/browse/YARN-2740 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Fix For: 2.8.0 Attachments: YARN-2740-20141024-1.patch, YARN-2740.20150320-1.patch, YARN-2740.20150327-1.patch, YARN-2740.20150411-1.patch, YARN-2740.20150411-2.patch, YARN-2740.20150411-3.patch, YARN-2740.20150417-1.patch, YARN-2740.20150420-1.patch, YARN-2740.20150421-1.patch, YARN-2740.20150422-2.patch, YARN-2740.20150423-1.patch According to YARN-2495, when distributed node label configuration is enabled: - RMAdmin / REST API should reject change-labels-on-node operations. - CommonNodeLabelsManager shouldn't persist labels on nodes when NMs heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3190) NM can't aggregate logs: token can't be found in cache
[ https://issues.apache.org/jira/browse/YARN-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-3190. --- Resolution: Duplicate. Issue is fixed by YARN-2964. NM can't aggregate logs: token can't be found in cache --- Key: YARN-3190 URL: https://issues.apache.org/jira/browse/YARN-3190 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Environment: CDH 5.3.1, HA HDFS, Kerberos Reporter: Andrejs Dubovskis Priority: Minor In rare cases the node manager cannot aggregate logs, generating this exception:
{code}
2015-02-12 13:04:03,703 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Starting aggregate log-file for app application_1423661043235_2150 at /tmp/logs/catalyst/logs/application_1423661043235_2150/catdn001.intrum.net_8041.tmp
2015-02-12 13:04:03,707 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /data5/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150/container_1423661043235_2150_01_000442
2015-02-12 13:04:03,707 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /data6/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150/container_1423661043235_2150_01_000442
2015-02-12 13:04:03,707 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /data7/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150/container_1423661043235_2150_01_000442
2015-02-12 13:04:03,709 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /data1/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150
2015-02-12 13:04:03,709 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:catalyst (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in cache
2015-02-12 13:04:03,709 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in cache
2015-02-12 13:04:03,709 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:catalyst (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in cache
2015-02-12 13:04:03,712 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:catalyst (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in cache
2015-02-12 13:04:03,712 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Cannot create writer for app application_1423661043235_2150. Disabling log-aggregation for this app.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in cache
at org.apache.hadoop.ipc.Client.call(Client.java:1411)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy19.getServerDefaults(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getServerDefaults(ClientNamenodeProtocolTranslatorPB.java:259)
at sun.reflect.GeneratedMethodAccessor114.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy20.getServerDefaults(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:966)
at org.apache.hadoop.fs.Hdfs.getServerDefaults(Hdfs.java:159)
at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:543)
at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:680)
at
{code}
[jira] [Commented] (YARN-3522) DistributedShell uses the wrong user to put timeline data
[ https://issues.apache.org/jira/browse/YARN-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509464#comment-14509464 ] Jian He commented on YARN-3522: --- lgtm, thanks Zhijie! DistributedShell uses the wrong user to put timeline data - Key: YARN-3522 URL: https://issues.apache.org/jira/browse/YARN-3522 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-3522.1.patch, YARN-3522.2.patch, YARN-3522.3.patch YARN-3287 breaks the timeline access control of distributed shell. In the distributed shell AM:
{code}
if (conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED,
    YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED)) {
  // Creating the Timeline Client
  timelineClient = TimelineClient.createTimelineClient();
  timelineClient.init(conf);
  timelineClient.start();
} else {
  timelineClient = null;
  LOG.warn("Timeline service is not enabled");
}
{code}
{code}
ugi.doAs(new PrivilegedExceptionAction<TimelinePutResponse>() {
  @Override
  public TimelinePutResponse run() throws Exception {
    return timelineClient.putEntities(entity);
  }
});
{code}
YARN-3287 changes the timeline client to get the right ugi at serviceInit, but the DS AM still doesn't use the submitter ugi to init the timeline client; it only uses that ugi for each put-entity call. This results in the wrong user on the put request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
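For context, a minimal sketch of the fix direction described above — creating and starting the client inside the submitter's UGI so its UGI-sensitive initialization happens as the right user, rather than wrapping only each putEntities() call. An {{appSubmitterUgi}} is assumed to be in scope; the exact wiring in the committed patch may differ.
{code}
timelineClient = appSubmitterUgi.doAs(
    new PrivilegedExceptionAction<TimelineClient>() {
      @Override
      public TimelineClient run() throws Exception {
        // init/start as the submitter, so the client captures the right UGI
        TimelineClient client = TimelineClient.createTimelineClient();
        client.init(conf);
        client.start();
        return client;
      }
    });
{code}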
[jira] [Commented] (YARN-3522) DistributedShell uses the wrong user to put timeline data
[ https://issues.apache.org/jira/browse/YARN-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509508#comment-14509508 ] Hudson commented on YARN-3522: -- FAILURE: Integrated in Hadoop-trunk-Commit #7649 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7649/]) YARN-3522. Fixed DistributedShell to instantiate TimeLineClient as the correct user. Contributed by Zhijie Shen (jianhe: rev aa4a192feb8939353254d058c5f81bddbd0335c0)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSAppMaster.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSFailedAppMaster.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
* hadoop-yarn-project/CHANGES.txt
DistributedShell uses the wrong user to put timeline data - Key: YARN-3522 URL: https://issues.apache.org/jira/browse/YARN-3522 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Fix For: 2.7.1 Attachments: YARN-3522.1.patch, YARN-3522.2.patch, YARN-3522.3.patch YARN-3287 breaks the timeline access control of distributed shell. In the distributed shell AM:
{code}
if (conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED,
    YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED)) {
  // Creating the Timeline Client
  timelineClient = TimelineClient.createTimelineClient();
  timelineClient.init(conf);
  timelineClient.start();
} else {
  timelineClient = null;
  LOG.warn("Timeline service is not enabled");
}
{code}
{code}
ugi.doAs(new PrivilegedExceptionAction<TimelinePutResponse>() {
  @Override
  public TimelinePutResponse run() throws Exception {
    return timelineClient.putEntities(entity);
  }
});
{code}
YARN-3287 changes the timeline client to get the right ugi at serviceInit, but the DS AM still doesn't use the submitter ugi to init the timeline client; it only uses that ugi for each put-entity call. This results in the wrong user on the put request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509230#comment-14509230 ] Brahma Reddy Battula commented on YARN-3528: Planning to give 0 for all HTTP and RPC ports... Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible for scheduled or precommit tests to run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep for 12345 shows up many places in the test suite where this practice has developed:
* All {{BaseContainerManagerTest}} subclasses
* {{TestNodeManagerShutdown}}
* {{TestContainerManager}} + others
This needs to be addressed through port scanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
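A minimal sketch of the dynamic-port idea (assuming a {{Configuration conf}} in scope; {{YarnConfiguration.NM_ADDRESS}} is shown as one example key, other services would be analogous):
{code}
// Bind to port 0 so the OS picks a free ephemeral port, then feed the chosen
// port into the test configuration instead of a hard-coded 12345.
try (ServerSocket socket = new ServerSocket(0)) {
  int port = socket.getLocalPort();
  conf.set(YarnConfiguration.NM_ADDRESS, "127.0.0.1:" + port);
}
{code}
Note that probing with a throwaway socket still leaves a small reuse race; having the service under test bind to port 0 itself and report the bound port avoids that race entirely.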
[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509263#comment-14509263 ] Hadoop QA commented on YARN-3480: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12727636/YARN-3480.01.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 189a63a | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7473/console | This message was automatically generated. Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries (attempts), so it is better to set 'yarn.resourcemanager.am.max-attempts' larger. However, that makes the RMStateStore (FileSystem/HDFS/ZK) store more attempts and makes the RM recovery process much slower. It might be better to make the number of attempts stored in the RMStateStore configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
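A sketch of the shape of the proposal; the configuration key and helper below are hypothetical, used only to illustrate capping stored attempts independently of the scheduling limit:
{code}
// Cap how many attempts the RMStateStore keeps; older attempts are evicted
// so RM recovery does not have to replay every historical attempt.
// "storedAttemptIds" is assumed to be a FIFO queue of stored attempt ids.
int maxStoredAttempts = conf.getInt(
    "yarn.resourcemanager.state-store.max-completed-attempts", // hypothetical key
    conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
        YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS));
while (storedAttemptIds.size() > maxStoredAttempts) {
  removeApplicationAttemptState(storedAttemptIds.poll()); // hypothetical helper
}
{code}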
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509276#comment-14509276 ] Jason Lowe commented on YARN-3535: -- The first item is to avoid containers failing due to an NM restart. As it is now, a container handed out by the RM to an idle NM can fail if the NM restarts before the AM launches the container. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Attachments: syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
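A sketch of the RMContainerImpl-side fix idea (all names here are hypothetical; the real patch may hang this off the ALLOCATED -> KILLED transition differently):
{code}
// When a container is killed while still ALLOCATED (the AM never acquired
// it), hand its ResourceRequests back to the application so the scheduler
// can allocate a replacement instead of losing the pending ask.
public void onContainerKilledAtAllocated(RMContainer rmContainer,
    SchedulerApplicationAttempt app) {
  List<ResourceRequest> asks = rmContainer.getResourceRequests(); // hypothetical accessor
  if (asks != null && !asks.isEmpty()) {
    app.recoverResourceRequests(asks); // hypothetical: restore pending requests
  }
}
{code}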
[jira] [Updated] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API
[ https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-3539: - Attachment: YARN-3539-003.patch Patch -003 updates {{timelineserver.md}}:
# specify the REST API (non-normative)
# add some more on futures
# review configuration options
# fix up broken internal links by adding the anchors
# yarn/index.html includes a link to the ATS REST API as one of the listed REST APIs
Compatibility doc to state that ATS v1 is a stable REST API --- Key: YARN-3539 URL: https://issues.apache.org/jira/browse/YARN-3539 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, YARN-3539-003.patch The ATS v2 discussion and YARN-2423 have raised the question: how stable are the ATSv1 APIs? The existing compatibility document actually states that the History Server is [a stable REST API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs], which effectively means that ATSv1 has already been declared as a stable API. Clarify this by patching the compatibility document appropriately -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3537) NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked
[ https://issues.apache.org/jira/browse/YARN-3537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3537: --- Attachment: YARN-3537-003.patch NPE when NodeManager.serviceInit fails and stopRecoveryStore invoked Key: YARN-3537 URL: https://issues.apache.org/jira/browse/YARN-3537 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3537-002.patch, YARN-3537-003.patch, YARN-3537.patch
2015-04-23 19:30:34,961 INFO [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service NodeManager failed in state STOPPED; cause: java.lang.NullPointerException
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:181)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:326)
	at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
	at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.tearDown(TestNodeManagerShutdown.java:106)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509774#comment-14509774 ] Li Lu commented on YARN-3431: - Hi [~zjshen], thanks for the update! The latest patch LGTM. Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch, YARN-3431.4.patch, YARN-3431.5.patch, YARN-3431.6.patch We have TimelineEntity and some other entities as subclass that inherit from it. However, we only have a single endpoint, which consume TimelineEntity rather than sub-classes and this endpoint will check the incoming request body contains exactly TimelineEntity object. However, the json data which is serialized from sub-class object seems not to be treated as an TimelineEntity object, and won't be deserialized into the corresponding sub-class object which cause deserialization failure as some discussions in YARN-3334 : https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3319) Implement a FairOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-3319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509641#comment-14509641 ] Craig Welch commented on YARN-3319: --- Yes, it's configured in the capacity scheduler configuration with something like this:
{code}
<property>
  <name>(yarn-queue-prefix).ordering-policy.fair.enable-size-based-weight</name>
  <value>true</value>
</property>
{code}
Implement a FairOrderingPolicy -- Key: YARN-3319 URL: https://issues.apache.org/jira/browse/YARN-3319 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.8.0 Attachments: YARN-3319.13.patch, YARN-3319.14.patch, YARN-3319.17.patch, YARN-3319.35.patch, YARN-3319.39.patch, YARN-3319.45.patch, YARN-3319.47.patch, YARN-3319.53.patch, YARN-3319.58.patch, YARN-3319.70.patch, YARN-3319.71.patch, YARN-3319.72.patch, YARN-3319.73.patch, YARN-3319.74.patch, YARN-3319.75.patch Implement a FairOrderingPolicy which prefers to allocate to SchedulerProcesses with least current usage, very similar to the FairScheduler's FairSharePolicy. The Policy will offer allocations to applications in a queue in order of least resources used, and preempt applications in reverse order (from most resources used). This will include conditional support for sizeBasedWeight style adjustment. Optionally, based on a conditional configuration to enable sizeBasedWeight (default false), an adjustment to boost larger applications (to offset the natural preference for smaller applications) will adjust the resource usage value based on demand, dividing it by the below value: Math.log1p(app memory demand) / Math.log(2). In cases where the above is indeterminate (two applications are equal after this comparison), behavior falls back to comparison based on the application id, which is generally lexically FIFO for that comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
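To make the description concrete, here is a sketch of the comparison key with the size-based-weight adjustment; the accessor names are illustrative, not the actual patch's API:
{code}
// Smaller magnitude is served first. With sizeBasedWeight enabled, usage is
// divided by a demand-based boost so large apps are not perpetually starved.
double getMagnitude(SchedulableEntity app) {
  double usage = getCurrentUsageMemory(app);   // illustrative accessor
  if (sizeBasedWeight) {
    double weight = Math.log1p(getDemandMemory(app)) / Math.log(2);
    usage /= weight;
  }
  return usage; // ties fall back to application id comparison (lexical FIFO)
}
{code}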
[jira] [Resolved] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen resolved YARN-2032. --- Resolution: Won't Fix It will be covered in YARN-2928 Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Li Lu Attachments: YARN-2032-091114.patch, YARN-2032-branch-2-1.patch, YARN-2032-branch2-2.patch As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3540) Fetcher#copyMapOutput is leaking usedMemory upon IOException during InMemoryMapOutput shuffle handler
Eric Payne created YARN-3540: Summary: Fetcher#copyMapOutput is leaking usedMemory upon IOException during InMemoryMapOutput shuffle handler Key: YARN-3540 URL: https://issues.apache.org/jira/browse/YARN-3540 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Eric Payne Assignee: Eric Payne Priority: Blocker We are seeing this happen when:
- an NM's disk goes bad during the creation of map output(s)
- the reducer's fetcher can read the shuffle header and reserve the memory
- but gets an IOException when trying to shuffle for InMemoryMapOutput
- shuffle fetch retry is enabled
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
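A sketch of where the leak sits in the fetcher; the calls mirror the MR shuffle code, but treat the exact signatures as illustrative:
{code}
// Memory is reserved when the shuffle header is read; if the in-memory
// shuffle then fails with an IOException, the reservation must be released
// (e.g. via abort()) before the fetch is retried, or usedMemory leaks.
MapOutput<K, V> mapOutput = merger.reserve(mapId, decompressedLength, id);
try {
  mapOutput.shuffle(host, input, compressedLength, decompressedLength,
      metrics, reporter);
} catch (IOException ioe) {
  mapOutput.abort(); // release the reserved memory on failure
  throw ioe;
}
{code}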
[jira] [Updated] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3517: Attachment: YARN-3517.005.patch Uploaded a new patch to fix the whitespace and checkstyle errors. The test failure is unrelated to the patch. RM web ui for dumping scheduler logs should be for admins only -- Key: YARN-3517 URL: https://issues.apache.org/jira/browse/YARN-3517 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, security Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Labels: security Attachments: YARN-3517.001.patch, YARN-3517.002.patch, YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch YARN-3294 allows users to dump scheduler logs from the web UI. This should be for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
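For reference, a minimal sketch of what an admin-only guard for this endpoint could look like (hedged: the actual patch's wiring may differ; {{getCallerUGI}} is a hypothetical helper):
{code}
// Reject non-admin callers before serving the "dump scheduler logs" action.
UserGroupInformation callerUGI = getCallerUGI(request); // hypothetical helper
if (callerUGI == null || !adminACLsManager.isAdmin(callerUGI)) {
  throw new ForbiddenException("Only admins can dump scheduler logs.");
}
{code}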
[jira] [Commented] (YARN-3363) add localization and container launch time to ContainerMetrics at NM to show these timing information for each active container.
[ https://issues.apache.org/jira/browse/YARN-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509738#comment-14509738 ] Hadoop QA commented on YARN-3363: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 39s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | javac | 7m 30s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 32s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 5m 26s | The applied patch generated 3 additional checkstyle issues. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 3s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 6m 0s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 46m 41s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12727688/YARN-3363.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 416b843 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7480/artifact/patchprocess/checkstyle-result-diff.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7480/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7480/testReport/ | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7480/console | This message was automatically generated. add localization and container launch time to ContainerMetrics at NM to show these timing information for each active container. Key: YARN-3363 URL: https://issues.apache.org/jira/browse/YARN-3363 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Attachments: YARN-3363.000.patch, YARN-3363.001.patch add localization and container launch time to ContainerMetrics at NM to show these timing information for each active container. Currently ContainerMetrics has container's actual memory usage(YARN-2984), actual CPU usage(YARN-3122), resource and pid(YARN-3022). It will be better to have localization and container launch time in ContainerMetrics for each active container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2740) ResourceManager side should properly handle node label modifications when distributed node label configuration enabled
[ https://issues.apache.org/jira/browse/YARN-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509648#comment-14509648 ] Hadoop QA commented on YARN-2740: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 44s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | javac | 7m 34s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 35s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 5m 25s | The applied patch generated 3 additional checkstyle issues. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 2s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 2m 2s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 52m 24s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 98m 43s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12727647/YARN-2740.20150423-1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 3952054 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7479/artifact/patchprocess/checkstyle-result-diff.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7479/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7479/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7479/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7479/testReport/ | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7479/console | This message was automatically generated. ResourceManager side should properly handle node label modifications when distributed node label configuration enabled -- Key: YARN-2740 URL: https://issues.apache.org/jira/browse/YARN-2740 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Fix For: 2.8.0 Attachments: YARN-2740-20141024-1.patch, YARN-2740.20150320-1.patch, YARN-2740.20150327-1.patch, YARN-2740.20150411-1.patch, YARN-2740.20150411-2.patch, YARN-2740.20150411-3.patch, YARN-2740.20150417-1.patch, YARN-2740.20150420-1.patch, YARN-2740.20150421-1.patch, YARN-2740.20150422-2.patch, YARN-2740.20150423-1.patch According to YARN-2495, when distributed node label configuration is enabled: - RMAdmin / REST API should reject change labels on node operations. 
- CommonNodeLabelsManager shouldn't persist labels on nodes when NMs heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509722#comment-14509722 ] zhihai xu commented on YARN-2893: - Hi [~jianhe], Do you think any of my earlier suggestions are reasonable? AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch, YARN-2893.001.patch, YARN-2893.002.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509717#comment-14509717 ] Zhijie Shen commented on YARN-3437: --- Sorry for the late comments. This patch has half MR code and half YARN code; it's not good to commit it as one patch. I have a thought on managing the commits:
1. The YARN code nearly duplicates YARN-3390. As YARN-3390 is almost ready, we can get that patch in first.
2. Move this jira to the MR project, retain only the MR code in the patch, and do some minor rebasing according to YARN-3390.
3. TimelineServicePerformanceTest is in a different package and has a different name, so hopefully it won't conflict with YARN-2556. Once YARN-2556 gets committed, we just need to refactor TimelineServicePerformanceTest to reuse the YARN-2556 code. BTW, can we put TimelineServicePerformanceTest into the same package as TimelineServicePerformance in YARN-2556, and rename it to TimelineServicePerformanceTestv2?
What do you think about the plan for the commits? W.r.t. the patch, I'm a bit concerned that a write containing one event per entity is not very typical of real use cases, and configuration and metrics are not covered at all. Would it be more realistic to write an entity with 10 events and 10 metrics, which have 100 points in the time series? And one nit in the patch: {{entity.setEntityType(TEZ_DAG_ID);}}. How about not mentioning TEZ in the MR code? convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch, YARN-3437.003.patch, YARN-3437.004.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3319) Implement a FairOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-3319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509630#comment-14509630 ] Vinod Kumar Vavilapalli commented on YARN-3319: --- Is FairOrderingPolicy.ENABLE_SIZE_BASED_WEIGHT supposed to be admin visible? If so, we need a better, fully qualified name. Implement a FairOrderingPolicy -- Key: YARN-3319 URL: https://issues.apache.org/jira/browse/YARN-3319 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.8.0 Attachments: YARN-3319.13.patch, YARN-3319.14.patch, YARN-3319.17.patch, YARN-3319.35.patch, YARN-3319.39.patch, YARN-3319.45.patch, YARN-3319.47.patch, YARN-3319.53.patch, YARN-3319.58.patch, YARN-3319.70.patch, YARN-3319.71.patch, YARN-3319.72.patch, YARN-3319.73.patch, YARN-3319.74.patch, YARN-3319.75.patch Implement a FairOrderingPolicy which prefers to allocate to SchedulerProcesses with least current usage, very similar to the FairScheduler's FairSharePolicy. The Policy will offer allocations to applications in a queue in order of least resources used, and preempt applications in reverse order (from most resources used). This will include conditional support for sizeBasedWeight style adjustment. Optionally, based on a conditional configuration to enable sizeBasedWeight (default false), an adjustment to boost larger applications (to offset the natural preference for smaller applications) will adjust the resource usage value based on demand, dividing it by the below value: Math.log1p(app memory demand) / Math.log(2). In cases where the above is indeterminate (two applications are equal after this comparison), behavior falls back to comparison based on the application id, which is generally lexically FIFO for that comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509591#comment-14509591 ] Hadoop QA commented on YARN-3517: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 38s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. | | {color:green}+1{color} | javac | 7m 30s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 38s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 5m 24s | The applied patch generated 2 additional checkstyle issues. | | {color:green}+1{color} | install | 1m 32s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 39s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 2m 2s | Tests passed in hadoop-yarn-common. | | {color:red}-1{color} | yarn tests | 53m 59s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 98m 18s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerNodeLabelUpdate | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12727652/YARN-3517.004.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 49f6e3d | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7478/artifact/patchprocess/whitespace.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7478/artifact/patchprocess/checkstyle-result-diff.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7478/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7478/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7478/testReport/ | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7478/console | This message was automatically generated. RM web ui for dumping scheduler logs should be for admins only -- Key: YARN-3517 URL: https://issues.apache.org/jira/browse/YARN-3517 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, security Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Labels: security Attachments: YARN-3517.001.patch, YARN-3517.002.patch, YARN-3517.003.patch, YARN-3517.004.patch YARN-3294 allows users to dump scheduler logs from the web UI. This should be for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3529) Add miniHBase cluster and Phoenix support to ATS v2 unit tests
[ https://issues.apache.org/jira/browse/YARN-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509689#comment-14509689 ] Zhijie Shen commented on YARN-3529: --- Previously, we ran into a compatibility issue with depending on HBase 0.9x. I'm not sure if HBase has resolved it, or whether we have a way to work around it. Please take a look at HADOOP-10995 and YARN-2032. Add miniHBase cluster and Phoenix support to ATS v2 unit tests -- Key: YARN-3529 URL: https://issues.apache.org/jira/browse/YARN-3529 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Li Lu Assignee: Li Lu Attachments: AbstractMiniHBaseClusterTest.java, output_minicluster2.txt After we have our HBase and Phoenix writer implementations, we may want to find a way to set up HBase and Phoenix in our unit tests. We need to do this integration before the branch gets merged back to trunk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509905#comment-14509905 ] Sangjin Lee commented on YARN-3437: --- Well it's not entirely true. It seems I still need to change TimelineCollector.getTimelineEntityContext() from protected to public. But creating another YARN JIRA just to make those several lines of changes seems too much. Thoughts folks? convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch, YARN-3437.003.patch, YARN-3437.004.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509932#comment-14509932 ] Zhijie Shen commented on YARN-3437: ---
bq. How's that sound?
That also sounds good to me.
bq. We would use that one for more realistic load whereas we could keep this mode as a simpler test. Thoughts?
It's okay to make it a simpler case, but could we at least cover one config and one metric, so we can verify that the DB storing this info also works?
bq. But creating another YARN JIRA just to make those several lines of changes seems too much
A couple of lines of YARN changes in an MR patch is okay.
convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch, YARN-3437.003.patch, YARN-3437.004.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3437: -- Issue Type: New Feature (was: Sub-task) Parent: (was: YARN-3378) convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch, YARN-3437.003.patch, YARN-3437.004.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3390) Reuse TimelineCollectorManager for RM
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3390: -- Attachment: YARN-3390.2.patch Reuse TimelineCollectorManager for RM - Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3390.1.patch, YARN-3390.2.patch RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509961#comment-14509961 ] Li Lu commented on YARN-3411: - Oh, and one thing to add, in the added pom file, maybe we can centralize the version of hbase (the Phoenix patch also has this problem)? This may make version management slightly easier. Maybe we can address this problem together with the Phoenix one in YARN-3529? [Storage implementation] explore the native HBase write schema for storage -- Key: YARN-3411 URL: https://issues.apache.org/jira/browse/YARN-3411 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Vrushali C Priority: Critical Attachments: ATSv2BackendHBaseSchemaproposal.pdf, YARN-3411.poc.2.txt, YARN-3411.poc.txt There is work that's in progress to implement the storage based on a Phoenix schema (YARN-3134). In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries. Once we have basic implementations of both options, we could evaluate them in terms of performance, scalability, usability, etc. and make a call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3390) Reuse TimelineCollectorManager for RM
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509962#comment-14509962 ] Zhijie Shen commented on YARN-3390: --- Thanks for the comments. I've addressed Sangjin's and Li's comments except:
bq. maybe we'd like to mark it as unstable?
It's not an API for users, hence it's okay to leave it unannotated.
bq. In TimelineCollectorWebService, why are we removing the utility function getCollector?
After the refactoring, we don't need to convert appId to a string, and it's not necessary to wrap a single statement in a method.
In addition, I changed it to use a hook in TimelineCollectorManager, but postRemove is called before stopping the collector, because once the collector is stopped, the hook may not be able to do anything with it. Moreover, I moved RMApp.stopTimelineCollector into FinalTransition. Assuming the collector only collects application lifecycle events, it doesn't need to stay around after the app is finished. We can adjust this later if we find the collector needs to stay after the app is finished. Reuse TimelineCollectorManager for RM - Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3390.1.patch, YARN-3390.2.patch RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
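A small illustration of the ordering described above (names simplified; this is a sketch, not the patch itself):
{code}
// Run the post-removal hook before stopping the collector, so the hook can
// still interact with a live collector.
TimelineCollector collector = collectors.remove(appId);
if (collector != null) {
  postRemove(appId, collector); // hook sees the collector still running
  collector.stop();
}
{code}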
[jira] [Commented] (YARN-2498) Respect labels in preemption policy of capacity scheduler for inter-queue preemption
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509967#comment-14509967 ] Jian He commented on YARN-2498: ---
- remove below
{code}
NavigableSet<FiCaSchedulerApp> ns =
    (NavigableSet<FiCaSchedulerApp>) leafQueue.getApplications();
{code}
- this piece duplicates the addToPreemptMap method?
{code}
Set<RMContainer> toPreemptContainers =
    preemptMap.get(fc.getApplicationAttemptId());
if (null == toPreemptContainers) {
  toPreemptContainers = new HashSet<RMContainer>();
}
preemptMap.put(fc.getApplicationAttemptId(), toPreemptContainers);
{code}
- the code below at line 744 duplicates the check at line 650?
{code}
if (resToObtainByPartition.isEmpty()) {
  return;
}
{code}
- tryPreemptContainerAndDeductResToObtain can also include the addToPreemptMap method so that every caller doesn't need to invoke that.
- rename TempQueuePartition -> TempQueuePerPartition
- a few long lines: e.g. tryPreemptContainerAndDeductResToObtain
- remove LeafQueue#getIgnoreExclusivityResourceByPartition
- simplify below a bit (one possible version is sketched after this message)
{code}
private TempQueuePartition getQueueByPartition(String queueName,
    String partition) {
  if (!queueToPartitions.containsKey(queueName)) {
    return null;
  }
  if (!queueToPartitions.get(queueName).containsKey(partition)) {
    return null;
  }
  return queueToPartitions.get(queueName).get(partition);
}
{code}
Respect labels in preemption policy of capacity scheduler for inter-queue preemption Key: YARN-2498 URL: https://issues.apache.org/jira/browse/YARN-2498 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: STALED-YARN-2498.zip, YARN-2498.1.patch, YARN-2498.2.patch There are 3 stages in ProportionalCapacityPreemptionPolicy:
# Recursively calculate {{ideal_assigned}} for each queue. This depends on available resource, resource used/pending in each queue, and the guaranteed capacity of each queue.
# Mark to-be-preempted containers: for each over-satisfied queue, it will mark some containers to be preempted.
# Notify the scheduler about to-be-preempted containers.
We need to respect labels in the cluster for both #1 and #2: For #1, when calculating ideal_assigned for each queue, we need to get the by-partition ideal_assigned according to the queue's guaranteed/maximum/used/pending resource on the specific partition. For #2, when we decide whether to preempt a container, we need to make sure the resource of this container is *possibly* usable by a queue which is under-satisfied and has pending resource. In addition, we need to handle the ignore_partition_exclusivity case: when we need to preempt containers from a queue's partition, we will first preempt ignore_partition_exclusivity allocated containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
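As referenced in the review item above, one possible simplification of {{getQueueByPartition}} that avoids the repeated map lookups (assuming {{queueToPartitions}} is a {{Map<String, Map<String, TempQueuePerPartition>>}}, with the rename applied):
{code}
private TempQueuePerPartition getQueueByPartition(String queueName,
    String partition) {
  // one lookup for the inner map, one for the partition entry
  Map<String, TempQueuePerPartition> partitions =
      queueToPartitions.get(queueName);
  return partitions == null ? null : partitions.get(partition);
}
{code}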
[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509802#comment-14509802 ] zhihai xu commented on YARN-3536: - Is this issue similar to YARN-2834? ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover -- Key: YARN-3536 URL: https://issues.apache.org/jira/browse/YARN-3536 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.4.1 Reporter: gu-chi Here is a scenario where the Application status is FAILED/FINISHED but the AppAttempt status is null; this causes an NPE during recovery with yarn.resourcemanager.work-preserving-recovery.enabled set to true. RM should handle recovery gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3476) Nodemanager can fail to delete local logs if log aggregation fails
[ https://issues.apache.org/jira/browse/YARN-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509841#comment-14509841 ] Jason Lowe commented on YARN-3476: -- I'm OK with deleting the logs upon error uploading. It should be a rare occurrence, and log availability is already a best-effort rather than guaranteed service. Even if we try to retain the logs it has questionable benefit in practice, as the history of a job always points to the aggregated logs, not the node's copy of the logs, and thus the logs will still be lost from the end-user's point of view. Savvy users may realize the logs could still be on the original node, but most won't know to check there or how to form the URL to find them. If we always point to the node then that defeats one of the features of log aggregation, since loss of the node will mean the node's URL is bad and we fail to show the logs even if they are aggregated. So for now I say we keep it simple and just cleanup the files on errors to prevent leaks. Speaking of which I took a look at the patch. It will fix the particular error we saw with TFiles, but there could easily be other non-IOExceptions that creep out of the code, especially as it is maintained over time. Would it be better to wrap the cleanup in a finally block or something a little more broadly applicable to errors that occur? Nodemanager can fail to delete local logs if log aggregation fails -- Key: YARN-3476 URL: https://issues.apache.org/jira/browse/YARN-3476 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation, nodemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Rohith Attachments: 0001-YARN-3476.patch If log aggregation encounters an error trying to upload the file then the underlying TFile can throw an illegalstateexception which will bubble up through the top of the thread and prevent the application logs from being deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
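A sketch of the finally-based cleanup suggested above (the method names here are hypothetical placeholders for the aggregator's upload and deletion steps):
{code}
// Make local-log cleanup unconditional so that any Throwable from the TFile
// writer, not just IOException, cannot leak the local logs.
try {
  uploadLogsForContainers(); // may throw IllegalStateException from TFile
} finally {
  // best-effort: delete local logs even if the upload failed
  deleteLocalLogs();
}
{code}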
[jira] [Commented] (YARN-3509) CollectorNodemanagerProtocol's authorization doesn't work
[ https://issues.apache.org/jira/browse/YARN-3509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509864#comment-14509864 ] Junping Du commented on YARN-3509: -- [~zjshen], thanks for updating the patch. Shall we wait for the security design for the v2 timeline service to be finalized, then come back to your patch? CollectorNodemanagerProtocol's authorization doesn't work - Key: YARN-3509 URL: https://issues.apache.org/jira/browse/YARN-3509 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, security, timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3509.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509934#comment-14509934 ] Zhijie Shen commented on YARN-3437: --- Oh, previously I said TimelineServicePerformanceTestv2, but actually I meant TimelineServicePerformanceV2. Just a minor suggestion, and it's up to you to find the suitable class name. convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch, YARN-3437.003.patch, YARN-3437.004.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509890#comment-14509890 ] Hadoop QA commented on YARN-3517: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 52s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | javac | 7m 34s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 31s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 5m 24s | There were no new checkstyle issues. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 39s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 1m 55s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 52m 23s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 96m 50s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12727691/YARN-3517.005.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 416b843 | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7481/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7481/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7481/testReport/ | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7481/console | This message was automatically generated. RM web ui for dumping scheduler logs should be for admins only -- Key: YARN-3517 URL: https://issues.apache.org/jira/browse/YARN-3517 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, security Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Labels: security Attachments: YARN-3517.001.patch, YARN-3517.002.patch, YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch YARN-3294 allows users to dump scheduler logs from the web UI. This should be for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509862#comment-14509862 ] Sangjin Lee commented on YARN-3437: --- Thanks for your comments [~zjshen].
{quote}
1. The YARN code nearly duplicates YARN-3390. As YARN-3390 is almost ready, we can get that patch in first.
2. Move this jira to the MR project, retain only the MR code in the patch, and do some minor rebasing according to YARN-3390.
{quote}
Let's do this. While working on YARN-3438, I'm realizing that for the performance tests it is probably OK to use the TimelineCollectors directly and bypass the TimelineCollectorManager altogether. If we do that, then this could become purely an MR patch. I'll update this patch to remove the use of TimelineCollectorManager and move this JIRA to MAPREDUCE. How's that sound?
{quote}
3. TimelineServicePerformanceTest is in a different package and has a different name, so hopefully it won't conflict with YARN-2556. Once YARN-2556 gets committed, we just need to refactor TimelineServicePerformanceTest to reuse the YARN-2556 code. BTW, can we put TimelineServicePerformanceTest into the same package as TimelineServicePerformance in YARN-2556, and rename it to TimelineServicePerformanceTestv2?
{quote}
That's fine. I'll move it back to the same package.
{quote}
W.r.t. the patch, I'm a bit concerned that a write containing one event per entity is not very typical of real use cases, and configuration and metrics are not covered at all. Would it be more realistic to write an entity with 10 events and 10 metrics, which have 100 points in the time series? And one nit in the patch: entity.setEntityType(TEZ_DAG_ID);. How about not mentioning TEZ in the MR code?
{quote}
Note that this is adding simple entity writes. The more realistic part of the test is coming in YARN-3438 (I'm nearly finished with that), and it will have multiple levels of entities as well as metrics and configuration. We would use that one for a more realistic load whereas we could keep this mode as a simpler test. Thoughts? I'll change the name of the entity to be something else. convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch, YARN-3437.003.patch, YARN-3437.004.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509951#comment-14509951 ] Li Lu commented on YARN-3411: - Hi [~vrushalic], thanks for the patch! I'm OK with the major part of this patch for now. Here, I'm listing some questions that we can discuss.
# About null checks: so far we do not have a fixed standard on whether and where we need to do null checks. I noticed you assumed info, config, event, and other similar fields are not null. Maybe we'd like to explicitly decide when all those fields can be null or empty.
# Maybe we'd like to change TimelineWriterUtils to the default access modifier? I think it would be sufficient to make it visible within the package.
# One thing I'd like to open for discussion is how to store and process metrics. Currently, in the hbase patch, startTime and endTime are not used. In the Phoenix patch, I store time series as flattened, non-queryable strings. I think this part also requires some hints from the time-based aggregations.
# Another thing I'd like to discuss here is whether and how we'd like to set up a separate fast path for metric-only updates. On the storage layer, I'd strongly +1 a separate fast path such that we only touch the (frequently updated) metrics table. Any proposals, everyone?
[Storage implementation] explore the native HBase write schema for storage -- Key: YARN-3411 URL: https://issues.apache.org/jira/browse/YARN-3411 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Vrushali C Priority: Critical Attachments: ATSv2BackendHBaseSchemaproposal.pdf, YARN-3411.poc.2.txt, YARN-3411.poc.txt There is work that's in progress to implement the storage based on a Phoenix schema (YARN-3134). In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries. Once we have basic implementations of both options, we could evaluate them in terms of performance, scalability, usability, etc. and make a call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not be cached in RMApps
[ https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509330#comment-14509330 ] Junping Du commented on YARN-3505: -- Thanks for uploading a patch to fix this problem, [~xgong]! I am reviewing your patch. In the meantime, can you fix the findbugs warning message, which seems to be related to the code? Node's Log Aggregation Report with SUCCEED should not be cached in RMApps -- Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Attachments: YARN-3505.1.patch Per discussions in YARN-1402, we shouldn't cache all nodes' log aggregation reports in RMApps forever, especially for those finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3154) Should not upload partial logs for MR jobs or other short-running applications
[ https://issues.apache.org/jira/browse/YARN-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-3154: - Release Note: Applications that made use of the LogAggregationContext will need to revisit this code in order to make sure that their logs continue to get rolled out. Hadoop Flags: Incompatible change,Reviewed (was: Reviewed) Should not upload partial logs for MR jobs or other short-running applications - Key: YARN-3154 URL: https://issues.apache.org/jira/browse/YARN-3154 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3154.1.patch, YARN-3154.2.patch, YARN-3154.3.patch, YARN-3154.4.patch Currently, if we are running an MR job and we do not set the log interval properly, its partial logs will be uploaded and then removed from the local filesystem, which is not right. We should only upload partial logs for LRS applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3363) add localization and container launch time to ContainerMetrics at NM to show this timing information for each active container.
[ https://issues.apache.org/jira/browse/YARN-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3363: Attachment: YARN-3363.001.patch add localization and container launch time to ContainerMetrics at NM to show this timing information for each active container. Key: YARN-3363 URL: https://issues.apache.org/jira/browse/YARN-3363 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Attachments: YARN-3363.000.patch, YARN-3363.001.patch add localization and container launch time to ContainerMetrics at NM to show this timing information for each active container. Currently ContainerMetrics has the container's actual memory usage (YARN-2984), actual CPU usage (YARN-3122), and resource and pid (YARN-3022). It would be better to also have localization and container launch time in ContainerMetrics for each active container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3363) add localization and container launch time to ContainerMetrics at NM to show this timing information for each active container.
[ https://issues.apache.org/jira/browse/YARN-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509674#comment-14509674 ] zhihai xu commented on YARN-3363: - Hi [~adhoot], thanks for the thorough review. Your suggestions are reasonable. I uploaded a new patch, YARN-3363.001.patch, which addresses all your comments. Please review it. Thanks! add localization and container launch time to ContainerMetrics at NM to show this timing information for each active container. Key: YARN-3363 URL: https://issues.apache.org/jira/browse/YARN-3363 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Labels: metrics, supportability Attachments: YARN-3363.000.patch, YARN-3363.001.patch add localization and container launch time to ContainerMetrics at NM to show this timing information for each active container. Currently ContainerMetrics has the container's actual memory usage (YARN-2984), actual CPU usage (YARN-3122), and resource and pid (YARN-3022). It would be better to also have localization and container launch time in ContainerMetrics for each active container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
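A minimal sketch of how such timings might be recorded, assuming hypothetical gauge names on the metrics2-based ContainerMetrics (the actual patch may differ):
{code}
// Hypothetical sketch: per-container timing gauges in a metrics2 source.
@Metric("Localization duration (ms)")
MutableGaugeLong localizationDurationMs;

@Metric("Container launch duration (ms)")
MutableGaugeLong launchDurationMs;

public void recordLocalizationDuration(long millis) {
  localizationDurationMs.set(millis); // set once localization completes
}
{code}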
[jira] [Commented] (YARN-3516) killing ContainerLocalizer action doesn't take effect when private localizer receives FETCH_FAILURE status.
[ https://issues.apache.org/jira/browse/YARN-3516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510058#comment-14510058 ] Xuan Gong commented on YARN-3516: - +1 LGTM. Will commit. killing ContainerLocalizer action doesn't take effect when private localizer receives FETCH_FAILURE status. --- Key: YARN-3516 URL: https://issues.apache.org/jira/browse/YARN-3516 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-3516.000.patch killing the ContainerLocalizer doesn't take effect when the private localizer receives a FETCH_FAILURE status. This is a typo from YARN-3024. With YARN-3024, the ContainerLocalizer will be killed only if {{action}} is set to {{LocalizerAction.DIE}}; the value set by calling {{response.setLocalizerAction}} will be overwritten. This is also a regression from the old code. It also makes sense to kill the ContainerLocalizer when FETCH_FAILURE happens, because the container will send a CLEANUP_CONTAINER_RESOURCES event after the localization failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
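The overwrite described above can be sketched with hypothetical names as follows; setting the response early is useless when a local action variable is written into the response afterwards:
{code}
// Hypothetical sketch of the bug pattern, not the actual ResourceLocalizationService code.
LocalizerAction action = LocalizerAction.LIVE;
if (fetchFailed) {
  response.setLocalizerAction(LocalizerAction.DIE); // has no lasting effect...
}
// ...because the response is unconditionally overwritten later:
response.setLocalizerAction(action);

// The fix direction: set the local variable instead, so the final write carries DIE.
if (fetchFailed) {
  action = LocalizerAction.DIE;
}
{code}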
[jira] [Updated] (YARN-2498) Respect labels in preemption policy of capacity scheduler for inter-queue preemption
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2498: - Attachment: YARN-2498.4.patch Attached ver.4; updated the test to make sure ignore_partition_exclusivity containers will be added to/removed from the queue's map. Respect labels in preemption policy of capacity scheduler for inter-queue preemption Key: YARN-2498 URL: https://issues.apache.org/jira/browse/YARN-2498 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: STALED-YARN-2498.zip, YARN-2498.1.patch, YARN-2498.2.patch, YARN-2498.3.patch, YARN-2498.4.patch There are 3 stages in ProportionalCapacityPreemptionPolicy: # Recursively calculate {{ideal_assigned}} for each queue. This depends on the available resource, the resource used/pending in each queue, and the guaranteed capacity of each queue. # Mark to-be-preempted containers: for each over-satisfied queue, it will mark some containers to be preempted. # Notify the scheduler about to-be-preempted containers. We need to respect labels in the cluster for both #1 and #2: For #1, when calculating ideal_assigned for each queue, we need to get the by-partition ideal-assigned according to the queue's guaranteed/maximum/used/pending resource on the specific partition. For #2, when we make a decision about whether we need to preempt a container, we need to make sure the resource of this container is *possibly* usable by a queue which is under-satisfied and has pending resource. In addition, we need to handle the ignore_partition_exclusivity case: when we need to preempt containers from a queue's partition, we will first preempt ignore_partition_exclusivity-allocated containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
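A minimal sketch of what the by-partition version of stage #1 might look like; the types and helpers below are simplified stand-ins, not the actual ProportionalCapacityPreemptionPolicy code:
{code}
// Hypothetical sketch: compute ideal_assigned per partition rather than per cluster.
for (String partition : nodePartitions) {
  long available = availableByPartition.get(partition);
  for (TempQueue q : queues) {
    long guaranteed = q.guaranteedOn(partition);
    long demand = q.usedOn(partition) + q.pendingOn(partition);
    // Start each queue from the smaller of its demand and its guarantee on
    // this partition; leftover partition capacity is distributed afterwards.
    long ideal = Math.min(demand, guaranteed);
    q.setIdealAssigned(partition, ideal);
    available -= ideal;
  }
  // ... then distribute the remaining 'available' to queues whose demand
  // exceeds their guarantee on this partition ...
}
{code}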
[jira] [Commented] (YARN-3516) killing ContainerLocalizer action doesn't take effect when private localizer receives FETCH_FAILURE status.
[ https://issues.apache.org/jira/browse/YARN-3516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510080#comment-14510080 ] Xuan Gong commented on YARN-3516: - Committed into trunk/branch-2. Thanks, zhihai! killing ContainerLocalizer action doesn't take effect when private localizer receives FETCH_FAILURE status. --- Key: YARN-3516 URL: https://issues.apache.org/jira/browse/YARN-3516 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Fix For: 2.8.0 Attachments: YARN-3516.000.patch killing the ContainerLocalizer doesn't take effect when the private localizer receives a FETCH_FAILURE status. This is a typo from YARN-3024. With YARN-3024, the ContainerLocalizer will be killed only if {{action}} is set to {{LocalizerAction.DIE}}; the value set by calling {{response.setLocalizerAction}} will be overwritten. This is also a regression from the old code. It also makes sense to kill the ContainerLocalizer when FETCH_FAILURE happens, because the container will send a CLEANUP_CONTAINER_RESOURCES event after the localization failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3536) ZK exception occurs when updating AppAttempt status, then NPE is thrown when RM does recovery
[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510320#comment-14510320 ] gu-chi commented on YARN-3536: -- Thanks. As the exception stack trace is almost the same, I once looked into this ticket. That patch is already merged into the current environment I use, so it's not the same cause. ZK exception occurs when updating AppAttempt status, then NPE is thrown when RM does recovery -- Key: YARN-3536 URL: https://issues.apache.org/jira/browse/YARN-3536 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.4.1 Reporter: gu-chi Here is a scenario where the Application status is FAILED/FINISHED but the AppAttempt status is null. This causes an NPE when doing recovery with yarn.resourcemanager.work-preserving-recovery.enabled set to true; the RM should handle recovery gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
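A minimal sketch of the kind of defensive guard recovery could use; the names are hypothetical and the actual fix may differ:
{code}
// Hypothetical sketch: tolerate an attempt whose state was never persisted.
ApplicationAttemptStateData attemptState = appState.getAttempt(attemptId);
if (attemptState == null || attemptState.getState() == null) {
  LOG.warn("Attempt " + attemptId + " has no persisted state; recovering "
      + appId + " from its final application state instead of failing with an NPE");
} else {
  recoverAttempt(attemptState);
}
{code}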
[jira] [Commented] (YARN-3492) AM fails to come up because RM and NM can't connect to each other
[ https://issues.apache.org/jira/browse/YARN-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510099#comment-14510099 ] Vinod Kumar Vavilapalli commented on YARN-3492: --- [~kasha], can you try this on a different box or something to see if this is an env issue? Tx. AM fails to come up because RM and NM can't connect to each other - Key: YARN-3492 URL: https://issues.apache.org/jira/browse/YARN-3492 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: pseudo-distributed cluster on a mac Reporter: Karthik Kambatla Priority: Blocker Attachments: mapred-site.xml, yarn-kasha-nodemanager-kasha-mbp.local.log, yarn-kasha-resourcemanager-kasha-mbp.local.log, yarn-site.xml Stood up a pseudo-distributed cluster with 2.7.0 RC0. Submitted a pi job. The container gets allocated, but doesn't get launched. The NM can't talk to the RM. Logs to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3534) Report node resource utilization
[ https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Inigo Goiri updated YARN-3534: -- Attachment: YARN-3534-1.patch No unit tests yet. Report node resource utilization Key: YARN-3534 URL: https://issues.apache.org/jira/browse/YARN-3534 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Assignee: Inigo Goiri Attachments: YARN-3534-1.patch Original Estimate: 336h Remaining Estimate: 336h YARN should be aware of the resource utilization of the nodes when scheduling containers. For this, this task will implement the NodeResourceMonitor and send this information to the Resource Manager in the heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3534) Report node resource utilization
[ https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Inigo Goiri updated YARN-3534: -- Attachment: YARN-3534-2.patch Merged to trunk. Report node resource utilization Key: YARN-3534 URL: https://issues.apache.org/jira/browse/YARN-3534 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Assignee: Inigo Goiri Attachments: YARN-3534-1.patch, YARN-3534-2.patch Original Estimate: 336h Remaining Estimate: 336h YARN should be aware of the resource utilization of the nodes when scheduling containers. For this, this task will implement the NodeResourceMonitor and send this information to the Resource Manager in the heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3541) Add version info on timeline service / generic history web UI and REST API
Zhijie Shen created YARN-3541: - Summary: Add version info on timeline service / generic history web UI and REST API Key: YARN-3541 URL: https://issues.apache.org/jira/browse/YARN-3541 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3438) add a mode to replay MR job history files to the timeline service
[ https://issues.apache.org/jira/browse/YARN-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510142#comment-14510142 ] Hadoop QA commented on YARN-3438: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12727769/YARN-3438.000.patch | | Optional Tests | javac unit findbugs checkstyle javadoc | | git revision | trunk / 0b3f895 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7486/console | This message was automatically generated. add a mode to replay MR job history files to the timeline service - Key: YARN-3438 URL: https://issues.apache.org/jira/browse/YARN-3438 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3438.000.patch The subtask covers the work on top of YARN-3437 to add a mode to replay MR job history files to the timeline service storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3438) add a mode to replay MR job history files to the timeline service
[ https://issues.apache.org/jira/browse/YARN-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510150#comment-14510150 ] Junping Du commented on YARN-3438: -- Thanks [~sjlee0] for uploading a patch! Given that 90% of the work is on the MR side, let's move this to the MapReduce project, under the umbrella of MAPREDUCE-6331. add a mode to replay MR job history files to the timeline service - Key: YARN-3438 URL: https://issues.apache.org/jira/browse/YARN-3438 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3438.000.patch The subtask covers the work on top of YARN-3437 to add a mode to replay MR job history files to the timeline service storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3541) Add version info on timeline service / generic history web UI and REST API
[ https://issues.apache.org/jira/browse/YARN-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3541: -- Attachment: YARN-3541.1.patch Uploaded a patch: 1. Include version info in /ws/v1/timeline of the timeline API 2. Add the endpoint at /ws/v1/applicationhistory/about and show version info of the generic history API 3. Add an about page of the generic history service and show version info. 4. Add test cases correspondingly. I've tried the patch locally. The web service and UI look good. Add version info on timeline service / generic history web UI and REST API - Key: YARN-3541 URL: https://issues.apache.org/jira/browse/YARN-3541 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3541.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
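A minimal sketch of the shape such an about endpoint could return, assuming a hypothetical JAXB bean on top of the existing YarnVersionInfo/VersionInfo utilities (the committed patch may differ):
{code}
// Hypothetical sketch of a version-info DAO for the /about endpoints.
@XmlRootElement(name = "about")
@XmlAccessorType(XmlAccessType.FIELD)
public class AboutInfo {
  private String timelineServiceVersion = YarnVersionInfo.getVersion();
  private String timelineServiceBuildVersion = YarnVersionInfo.getBuildVersion();
  private String hadoopVersion = VersionInfo.getVersion();
  private String hadoopBuildVersion = VersionInfo.getBuildVersion();
}
{code}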
[jira] [Updated] (YARN-3382) Some of UserMetricsInfo metrics are incorrectly set to root queue metrics
[ https://issues.apache.org/jira/browse/YARN-3382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3382: -- Fix Version/s: (was: 2.8.0) 2.7.1 This seems like an important fix. I merged this into branch-2.7. Some of UserMetricsInfo metrics are incorrectly set to root queue metrics - Key: YARN-3382 URL: https://issues.apache.org/jira/browse/YARN-3382 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.6.0 Reporter: Rohit Agarwal Assignee: Rohit Agarwal Fix For: 2.7.1 Attachments: YARN-3382.patch {{appsCompleted}}, {{appsPending}}, {{appsRunning}} etc. in {{UserMetricsInfo}} are incorrectly set to the root queue's value instead of the user's value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
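A minimal sketch of the likely shape of the fix, assuming the per-user counters exposed by QueueMetrics.getUserMetrics (the committed patch may differ):
{code}
// Hypothetical sketch: read the user's counters instead of the root queue's.
QueueMetrics userMetrics = rootQueueMetrics.getUserMetrics(user);
if (userMetrics != null) {
  this.appsCompleted = userMetrics.getAppsCompleted();
  this.appsPending = userMetrics.getAppsPending();
  this.appsRunning = userMetrics.getAppsRunning();
}
{code}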
[jira] [Updated] (YARN-3351) AppMaster tracking URL is broken in HA
[ https://issues.apache.org/jira/browse/YARN-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3351: -- Fix Version/s: (was: 2.8.0) 2.7.1 This seems like an important fix. I merged this into branch-2.7. AppMaster tracking URL is broken in HA -- Key: YARN-3351 URL: https://issues.apache.org/jira/browse/YARN-3351 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.7.1 Attachments: YARN-3351.001.patch, YARN-3351.002.patch, YARN-3351.003.patch After YARN-2713, the AppMaster link is broken in HA. To repro: a) set up RM HA and ensure the first RM is not active, b) run a long sleep job and view the tracking URL on the RM applications page. The log and full stack trace are shown below. {noformat} 2015-02-05 20:47:43,478 WARN org.mortbay.log: /proxy/application_1423182188062_0002/: java.net.BindException: Cannot assign requested address {noformat} {noformat} java.net.BindException: Cannot assign requested address at java.net.PlainSocketImpl.socketBind(Native Method) at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376) at java.net.Socket.bind(Socket.java:631) at java.net.Socket.<init>(Socket.java:423) at java.net.Socket.<init>(Socket.java:280) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:188) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:345) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3472) Possible leak in DelegationTokenRenewer#allTokens
[ https://issues.apache.org/jira/browse/YARN-3472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3472: -- Target Version/s: 2.7.1 (was: 2.8.0) This seems like an important fix. I merged this into branch-2.7. Possible leak in DelegationTokenRenewer#allTokens -- Key: YARN-3472 URL: https://issues.apache.org/jira/browse/YARN-3472 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Rohith Fix For: 2.7.1 Attachments: 0001-YARN-3472.patch, 0002-YARN-3472.patch When an old token is expiring and being removed, it's not removed from the allTokens map, resulting in a possible leak. {code} if (t.token.getKind().equals(new Text("HDFS_DELEGATION_TOKEN"))) { iter.remove(); t.cancelTimer(); LOG.info("Removed expiring token " + t); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
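A minimal sketch of the corresponding fix, assuming an allTokens map keyed by the token as in the surrounding DelegationTokenRenewer code (the committed patch may differ):
{code}
if (t.token.getKind().equals(new Text("HDFS_DELEGATION_TOKEN"))) {
  iter.remove();
  allTokens.remove(t.token); // also drop the expiring token from the map to plug the leak
  t.cancelTimer();
  LOG.info("Removed expiring token " + t);
}
{code}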
[jira] [Updated] (YARN-3472) Possible leak in DelegationTokenRenewer#allTokens
[ https://issues.apache.org/jira/browse/YARN-3472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3472: -- Target Version/s: 2.8.0 (was: 2.7.1) Fix Version/s: (was: 2.8.0) 2.7.1 Possible leak in DelegationTokenRenewer#allTokens -- Key: YARN-3472 URL: https://issues.apache.org/jira/browse/YARN-3472 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Rohith Fix For: 2.7.1 Attachments: 0001-YARN-3472.patch, 0002-YARN-3472.patch When an old token is expiring and being removed, it's not removed from the allTokens map, resulting in a possible leak. {code} if (t.token.getKind().equals(new Text("HDFS_DELEGATION_TOKEN"))) { iter.remove(); t.cancelTimer(); LOG.info("Removed expiring token " + t); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3534) Report node resource utilization
[ https://issues.apache.org/jira/browse/YARN-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510205#comment-14510205 ] Hadoop QA commented on YARN-3534: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 35s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 78 line(s) that end in whitespace. | | {color:green}+1{color} | javac | 7m 29s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 34s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 5m 32s | The applied patch generated 11 additional checkstyle issues. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 40s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 0m 21s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 2m 1s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-server-common. | | {color:green}+1{color} | yarn tests | 5m 55s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 53m 2s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12727766/YARN-3534-2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 0b3f895 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7485/artifact/patchprocess/whitespace.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7485/artifact/patchprocess/checkstyle-result-diff.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7485/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7485/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7485/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7485/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7485/testReport/ | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7485/console | This message was automatically generated. 
Report node resource utilization Key: YARN-3534 URL: https://issues.apache.org/jira/browse/YARN-3534 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Assignee: Inigo Goiri Attachments: YARN-3534-1.patch, YARN-3534-2.patch Original Estimate: 336h Remaining Estimate: 336h YARN should be aware of the resource utilization of the nodes when scheduling containers. For this, this task will implement the NodeResourceMonitor and send this information to the Resource Manager in the heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3390) Reuse TimelineCollectorManager for RM
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510053#comment-14510053 ] Sangjin Lee commented on YARN-3390: --- [~zjshen], you might want to check out [~djp]'s comments and my response in the other JIRA here: https://issues.apache.org/jira/browse/MAPREDUCE-6335?focusedCommentId=14508378&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14508378 I think those were small but useful changes. See this patch for the changes: https://issues.apache.org/jira/secure/attachment/12727521/YARN-3437.004.patch It would be good to preserve those changes. Thanks! Reuse TimelineCollectorManager for RM - Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3390.1.patch, YARN-3390.2.patch RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3516) killing ContainerLocalizer action doesn't take effect when private localizer receives FETCH_FAILURE status.
[ https://issues.apache.org/jira/browse/YARN-3516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510092#comment-14510092 ] Hudson commented on YARN-3516: -- FAILURE: Integrated in Hadoop-trunk-Commit #7656 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7656/]) YARN-3516. killing ContainerLocalizer action doesn't take effect when (xgong: rev 0b3f8957a87ada1a275c9904b211fdbdcefafb02) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/CHANGES.txt killing ContainerLocalizer action doesn't take effect when private localizer receives FETCH_FAILURE status. --- Key: YARN-3516 URL: https://issues.apache.org/jira/browse/YARN-3516 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Fix For: 2.8.0 Attachments: YARN-3516.000.patch killing the ContainerLocalizer doesn't take effect when the private localizer receives a FETCH_FAILURE status. This is a typo from YARN-3024. With YARN-3024, the ContainerLocalizer will be killed only if {{action}} is set to {{LocalizerAction.DIE}}; the value set by calling {{response.setLocalizerAction}} will be overwritten. This is also a regression from the old code. It also makes sense to kill the ContainerLocalizer when FETCH_FAILURE happens, because the container will send a CLEANUP_CONTAINER_RESOURCES event after the localization failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2893) AMLauncher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510163#comment-14510163 ] Gera Shegalov commented on YARN-2893: - Hi [~zxu], for me personally it's easier to review if you simply make the change and upload a new patch. The additional benefit is that we'll hopefully see whether our assumptions are validated by unit tests. AMLauncher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: zhihai xu Attachments: YARN-2893.000.patch, YARN-2893.001.patch, YARN-2893.002.patch MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2498) Respect labels in preemption policy of capacity scheduler for inter-queue preemption
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510177#comment-14510177 ] Hadoop QA commented on YARN-2498: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 34s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 5 new or modified test files. | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 3 line(s) that end in whitespace. | | {color:green}+1{color} | javac | 7m 29s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 31s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 5m 23s | The applied patch generated 4 additional checkstyle issues. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 16s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 52m 10s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 92m 55s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12727748/YARN-2498.3.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / ac281e3 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7482/artifact/patchprocess/whitespace.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7482/artifact/patchprocess/checkstyle-result-diff.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7482/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7482/testReport/ | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7482/console | This message was automatically generated. Respect labels in preemption policy of capacity scheduler for inter-queue preemption Key: YARN-2498 URL: https://issues.apache.org/jira/browse/YARN-2498 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: STALED-YARN-2498.zip, YARN-2498.1.patch, YARN-2498.2.patch, YARN-2498.3.patch, YARN-2498.4.patch There are 3 stages in ProportionalCapacityPreemptionPolicy: # Recursively calculate {{ideal_assigned}} for each queue. This depends on the available resource, the resource used/pending in each queue, and the guaranteed capacity of each queue. # Mark to-be-preempted containers: for each over-satisfied queue, it will mark some containers to be preempted. # Notify the scheduler about to-be-preempted containers. We need to respect labels in the cluster for both #1 and #2: For #1, when calculating ideal_assigned for each queue, we need to get the by-partition ideal-assigned according to the queue's guaranteed/maximum/used/pending resource on the specific partition.
For #2, when we make a decision about whether we need to preempt a container, we need to make sure the resource of this container is *possibly* usable by a queue which is under-satisfied and has pending resource. In addition, we need to handle the ignore_partition_exclusivity case: when we need to preempt containers from a queue's partition, we will first preempt ignore_partition_exclusivity-allocated containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3458) CPU resource monitoring in Windows
[ https://issues.apache.org/jira/browse/YARN-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Inigo Goiri reassigned YARN-3458: - Assignee: Inigo Goiri CPU resource monitoring in Windows -- Key: YARN-3458 URL: https://issues.apache.org/jira/browse/YARN-3458 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Affects Versions: 2.7.0 Environment: Windows Reporter: Inigo Goiri Assignee: Inigo Goiri Priority: Minor Labels: containers, metrics, windows Attachments: YARN-3458-1.patch, YARN-3458-2.patch, YARN-3458-3.patch Original Estimate: 168h Remaining Estimate: 168h The current implementation of getCpuUsagePercent() for WindowsBasedProcessTree is left as unavailable. Attached a proposal of how to do it. I reused the CpuTimeTracker using 1 jiffy=1ms. This was left open by YARN-3122. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3541) Add version info on timeline service / generic history web UI and REST API
[ https://issues.apache.org/jira/browse/YARN-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510216#comment-14510216 ] Hadoop QA commented on YARN-3541: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 37s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | javac | 7m 31s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 35s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 7m 43s | The applied patch generated 5 additional checkstyle issues. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 47s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 2m 49s | Tests passed in hadoop-yarn-server-applicationhistoryservice. | | | | 45m 35s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12727771/YARN-3541.1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / bcf89dd | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7487/artifact/patchprocess/checkstyle-result-diff.txt | | hadoop-yarn-server-applicationhistoryservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/7487/artifact/patchprocess/testrun_hadoop-yarn-server-applicationhistoryservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7487/testReport/ | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7487/console | This message was automatically generated. Add version info on timeline service / generic history web UI and REST API - Key: YARN-3541 URL: https://issues.apache.org/jira/browse/YARN-3541 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3541.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509106#comment-14509106 ] Renan DelValle commented on YARN-2408: -- Hi Nikhil, While I would be glad to finish the development of this feature, the fact is that since being proposed on August 12, 2014 (more than 8 months ago), no member of the Hadoop team has shown an interest in including this feature as part of the main software. Thus, using this feature would mean always having to patch the Hadoop source in use and hoping that nothing breaks in future versions. As Adam pointed out, alternative solutions exist which may let you achieve this in a much more future-proof and painless way, such as the approach Myriad takes (https://github.com/mesos/myriad). That having been said, I'd gladly release the source code for what I have working. As for me, unfortunately, at this time I don't feel it is within my best interests to put forth the time necessary to complete this feature. -Renan Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API (a JSON counterpart is also available):
{code:xml}
<resourceRequests>
  <MB>7680</MB>
  <VCores>7</VCores>
  <appMaster>
    <applicationId>application_1412191664217_0001</applicationId>
    <applicationAttemptId>appattempt_1412191664217_0001_01</applicationAttemptId>
    <queueName>default</queueName>
    <totalMB>6144</totalMB>
    <totalVCores>6</totalVCores>
    <numResourceRequests>3</numResourceRequests>
    <requests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <numContainers>6</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
        <resourceNames>
          <resourceName>localMachine</resourceName>
          <resourceName>/default-rack</resourceName>
          <resourceName>*</resourceName>
        </resourceNames>
      </request>
    </requests>
  </appMaster>
  <appMaster>
    ...
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3477) TimelineClientImpl swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-3477: - Affects Version/s: 2.6.0 Summary: TimelineClientImpl swallows exceptions (was: TimelineClientImpl swallows root cause of retry failures) {{TimelineClientImpl}} also catches InterruptedExceptions and either # converts them to an IOE, making them potentially treated as retries # catches and discards them during a sleep(). Issue #2 means it is impossible to reliably interrupt a thread which is in the attempt-and-retry process of trying to talk to a non-responsive ATS instance. While this does not impact normal operations, it does make it hard to shut down threads talking to ATS. TimelineClientImpl swallows exceptions -- Key: YARN-3477 URL: https://issues.apache.org/jira/browse/YARN-3477 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0, 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran If the timeline client fails more than the retry count, the original exception is not thrown. Instead some runtime exception is raised saying retries ran out. # the failing exception should be rethrown, ideally via NetUtils.wrapException to include the URL of the failing endpoint # Otherwise, the raised RTE should (a) state that URL and (b) set the original fault as the inner cause -- This message was sent by Atlassian JIRA (v6.3.4#6332)
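For contrast, a minimal sketch of an interrupt-friendly retry loop (not the actual TimelineClientImpl code); the interrupt status is preserved and surfaced instead of being converted or discarded:
{code}
IOException lastFailure = null;
for (int attempt = 0; attempt < maxRetries; attempt++) {
  try {
    return doPost(entities); // hypothetical single-attempt call
  } catch (IOException e) {
    lastFailure = e; // remember the root cause for the final rethrow
  }
  try {
    Thread.sleep(retryIntervalMs);
  } catch (InterruptedException ie) {
    Thread.currentThread().interrupt(); // preserve the interrupt status
    InterruptedIOException iioe =
        new InterruptedIOException("Interrupted while retrying against ATS");
    iioe.initCause(lastFailure);
    throw iioe;
  }
}
throw new IOException("Gave up after " + maxRetries + " retries", lastFailure);
{code}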
[jira] [Updated] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3431: -- Attachment: YARN-3431.6.patch Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch, YARN-3431.4.patch, YARN-3431.5.patch, YARN-3431.6.patch We have TimelineEntity and some other entities as subclasses that inherit from it. However, we only have a single endpoint, which consumes TimelineEntity rather than the subclasses, and this endpoint checks that the incoming request body contains exactly a TimelineEntity object. The JSON data serialized from a subclass object seems not to be treated as a TimelineEntity object, and it won't be deserialized into the corresponding subclass object, which causes deserialization failures, as in some discussions in YARN-3334: https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508529#comment-14508529 ] Zhijie Shen commented on YARN-3431: --- bq. It would be a little more consistent and perform slightly better if the type check in getChildren() is consolidated into validateChildren(). Refactored the code so that we don't iterate the set twice. bq. maybe we'd like to add some prefix to the fields we (implicitly) add to the info field of an entity? I changed the info keys a bit to make them start with SYSTEM_INFO_. Hopefully this will reduce conflicts. Anyway, we need to identify the system info keys in the documentation to tell users not to use them. Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch, YARN-3431.4.patch, YARN-3431.5.patch, YARN-3431.6.patch We have TimelineEntity and some other entities as subclasses that inherit from it. However, we only have a single endpoint, which consumes TimelineEntity rather than the subclasses, and this endpoint checks that the incoming request body contains exactly a TimelineEntity object. The JSON data serialized from a subclass object seems not to be treated as a TimelineEntity object, and it won't be deserialized into the corresponding subclass object, which causes deserialization failures, as in some discussions in YARN-3334: https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3522) DistributedShell uses the wrong user to put timeline data
[ https://issues.apache.org/jira/browse/YARN-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508557#comment-14508557 ] Zhijie Shen commented on YARN-3522: --- I took a look at the checkstyle errors and commented on [HADOOP-11869|https://issues.apache.org/jira/browse/HADOOP-11869?focusedCommentId=14508555&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14508555]. It seems more like noise now. DistributedShell uses the wrong user to put timeline data - Key: YARN-3522 URL: https://issues.apache.org/jira/browse/YARN-3522 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-3522.1.patch, YARN-3522.2.patch, YARN-3522.3.patch YARN-3287 breaks the timeline access control of distributed shell. In the distributed shell AM:
{code}
if (conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED,
    YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED)) {
  // Creating the Timeline Client
  timelineClient = TimelineClient.createTimelineClient();
  timelineClient.init(conf);
  timelineClient.start();
} else {
  timelineClient = null;
  LOG.warn("Timeline service is not enabled");
}
{code}
{code}
ugi.doAs(new PrivilegedExceptionAction<TimelinePutResponse>() {
  @Override
  public TimelinePutResponse run() throws Exception {
    return timelineClient.putEntities(entity);
  }
});
{code}
YARN-3287 changes the timeline client to get the right ugi at serviceInit, but the DS AM still doesn't use the submitter ugi to init the timeline client; instead it uses the ugi for each putEntities call. This results in the wrong user on the put request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
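A minimal sketch of the fix direction described above: create and start the timeline client as the submitter, so the UGI captured at serviceInit is the right one. Here appSubmitterUgi is assumed to be the submitter's UserGroupInformation; the committed patch may differ:
{code}
timelineClient = appSubmitterUgi.doAs(
    new PrivilegedExceptionAction<TimelineClient>() {
      @Override
      public TimelineClient run() throws Exception {
        // The client captures the submitter UGI at serviceInit time,
        // so later putEntities calls run as the right user.
        TimelineClient client = TimelineClient.createTimelineClient();
        client.init(conf);
        client.start();
        return client;
      }
    });
{code}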
[jira] [Commented] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508586#comment-14508586 ] Nikhil Mulley commented on YARN-2408: - Hi [~rdelvalle] There are 8 people voting for it and 15 people watching this issue. I am not sure what level of general interest the community requires to move something forward, but I would be happy to help by deploying the patch on my test cluster, giving it a whirl, and seeing where it goes. I am also interested in the REST API as a means to monitor cluster resources in general, to watch slow/starving jobs, and to track the resources requested/consumed per app/job. Nikhil Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API (a JSON counterpart is also available):
{code:xml}
<resourceRequests>
  <MB>7680</MB>
  <VCores>7</VCores>
  <appMaster>
    <applicationId>application_1412191664217_0001</applicationId>
    <applicationAttemptId>appattempt_1412191664217_0001_01</applicationAttemptId>
    <queueName>default</queueName>
    <totalMB>6144</totalMB>
    <totalVCores>6</totalVCores>
    <numResourceRequests>3</numResourceRequests>
    <requests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <numContainers>6</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
        <resourceNames>
          <resourceName>localMachine</resourceName>
          <resourceName>/default-rack</resourceName>
          <resourceName>*</resourceName>
        </resourceNames>
      </request>
    </requests>
  </appMaster>
  <appMaster>
    ...
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
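For the monitoring use case described above, a consumer could be as small as the following sketch. The endpoint path is an assumption based on this proposal, not a shipped YARN API:
{code}
// Hypothetical sketch: poll the proposed endpoint and inspect pending requests.
URL url = new URL("http://rm-host:8088/ws/v1/cluster/resourceRequests"); // assumed path
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty("Accept", "application/json");
try (BufferedReader in = new BufferedReader(
    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
  StringBuilder body = new StringBuilder();
  String line;
  while ((line = in.readLine()) != null) {
    body.append(line);
  }
  // Parse the JSON counterpart and flag applications whose requests remain
  // unsatisfied across successive polls, i.e. potentially starved apps.
  System.out.println(body);
}
{code}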
[jira] [Resolved] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renan DelValle resolved YARN-2408. -- Resolution: Done Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API (a JSON counterpart is also available):
{code:xml}
<resourceRequests>
  <MB>7680</MB>
  <VCores>7</VCores>
  <appMaster>
    <applicationId>application_1412191664217_0001</applicationId>
    <applicationAttemptId>appattempt_1412191664217_0001_01</applicationAttemptId>
    <queueName>default</queueName>
    <totalMB>6144</totalMB>
    <totalVCores>6</totalVCores>
    <numResourceRequests>3</numResourceRequests>
    <requests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <numContainers>6</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
        <resourceNames>
          <resourceName>localMachine</resourceName>
          <resourceName>/default-rack</resourceName>
          <resourceName>*</resourceName>
        </resourceNames>
      </request>
    </requests>
  </appMaster>
  <appMaster>
    ...
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508593#comment-14508593 ] Adam B commented on YARN-2408: -- FYI, one of the original use cases (Myriad, to run YARN on Mesos) now just implements the YARN scheduler API directly, so it no longer needs a REST API for resource requests. Other tools may be able to take a similar approach of wrapping a traditional YARN scheduler, but that means that the tool is forced to live on the RM node, in-process. Some tools (especially non-Java tools) will not be able to take this approach. Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API (a JSON counterpart is also available):
{code:xml}
<resourceRequests>
  <MB>7680</MB>
  <VCores>7</VCores>
  <appMaster>
    <applicationId>application_1412191664217_0001</applicationId>
    <applicationAttemptId>appattempt_1412191664217_0001_01</applicationAttemptId>
    <queueName>default</queueName>
    <totalMB>6144</totalMB>
    <totalVCores>6</totalVCores>
    <numResourceRequests>3</numResourceRequests>
    <requests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <numContainers>6</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
        <resourceNames>
          <resourceName>localMachine</resourceName>
          <resourceName>/default-rack</resourceName>
          <resourceName>*</resourceName>
        </resourceNames>
      </request>
    </requests>
  </appMaster>
  <appMaster>
    ...
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between siblings in some cases
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508536#comment-14508536 ] Peng Zhang commented on YARN-3405: -- Updated patch: only preempt from children when the queue is not starved, and added a test case. FairScheduler's preemption cannot happen between siblings in some cases - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: YARN-3405.01.patch, YARN-3405.02.patch Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume the cluster resource is 100. # queue-1-1 and queue-2 have apps. Each gets 50 usage and 50 fairshare. # When queue-1-2 becomes active, it causes a new preemption request for a fairshare of 25. # When preempting from root, it is possible to find that the preemption candidate is queue-2. If so, preemptContainerPreCheck for queue-2 returns false because it's equal to its fairshare. # Finally queue-1-2 will be waiting for resources released by queue-1-1 itself. What I expect here is that queue-1-2 preempts from queue-1-1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
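A minimal sketch of the child-first walk that the patch description implies; the types and helpers are simplified stand-ins, not the actual FairScheduler code:
{code}
// Hypothetical sketch: when a parent is not itself starved, recurse into
// children that are over their own fair share, so a starved sibling
// (queue-1-2) can be served from its over-share sibling (queue-1-1).
long preemptFrom(Queue queue, long toPreempt) {
  if (queue.isLeaf()) {
    return queue.preemptUpTo(toPreempt);
  }
  long preempted = 0;
  for (Queue child : queue.getChildren()) {
    if (preempted >= toPreempt) {
      break;
    }
    if (child.getUsage() > child.getFairShare()) {
      preempted += preemptFrom(child, toPreempt - preempted);
    }
  }
  return preempted;
}
{code}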
[jira] [Resolved] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renan DelValle resolved YARN-2408. -- Resolution: Won't Fix Work on this feature has been dropped by the original author due to general lack of interest. Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API (a JSON counterpart is also available):
{code:xml}
<resourceRequests>
  <MB>7680</MB>
  <VCores>7</VCores>
  <appMaster>
    <applicationId>application_1412191664217_0001</applicationId>
    <applicationAttemptId>appattempt_1412191664217_0001_01</applicationAttemptId>
    <queueName>default</queueName>
    <totalMB>6144</totalMB>
    <totalVCores>6</totalVCores>
    <numResourceRequests>3</numResourceRequests>
    <requests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <numContainers>6</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
        <resourceNames>
          <resourceName>localMachine</resourceName>
          <resourceName>/default-rack</resourceName>
          <resourceName>*</resourceName>
        </resourceNames>
      </request>
    </requests>
  </appMaster>
  <appMaster>
    ...
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renan DelValle reopened YARN-2408: -- Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API (a JSON counterpart is also available):
{code:xml}
<resourceRequests>
  <MB>7680</MB>
  <VCores>7</VCores>
  <appMaster>
    <applicationId>application_1412191664217_0001</applicationId>
    <applicationAttemptId>appattempt_1412191664217_0001_01</applicationAttemptId>
    <queueName>default</queueName>
    <totalMB>6144</totalMB>
    <totalVCores>6</totalVCores>
    <numResourceRequests>3</numResourceRequests>
    <requests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <numContainers>6</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
        <resourceNames>
          <resourceName>localMachine</resourceName>
          <resourceName>/default-rack</resourceName>
          <resourceName>*</resourceName>
        </resourceNames>
      </request>
    </requests>
  </appMaster>
  <appMaster>
    ...
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)