[jira] [Commented] (YARN-3459) Fix failiure of TestLog4jWarningErrorMetricsAppender
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487812#comment-14487812 ] Hudson commented on YARN-3459: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #159 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/159/]) YARN-3459. Fix failiure of TestLog4jWarningErrorMetricsAppender. (Varun Vasudev via wangda) (wangda: rev 7af086a515d573dc90ea4deec7f4e3f23622e0e8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLog4jWarningErrorMetricsAppender.java * hadoop-yarn-project/CHANGES.txt Fix failiure of TestLog4jWarningErrorMetricsAppender Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Varun Vasudev Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message:
{code}
Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender
Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec <<< FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender
testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender)  Time elapsed: 2.01 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.junit.Assert.assertEquals(Assert.java:542)
	at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3431: -- Attachment: (was: YARN-3431.2.patch) Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch We have TimelineEntity and some other entities as subclass that inherit from it. However, we only have a single endpoint, which consume TimelineEntity rather than sub-classes and this endpoint will check the incoming request body contains exactly TimelineEntity object. However, the json data which is serialized from sub-class object seems not to be treated as an TimelineEntity object, and won't be deserialized into the corresponding sub-class object which cause deserialization failure as some discussions in YARN-3334 : https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3266) RMContext inactiveNodes should have NodeId as map key
[ https://issues.apache.org/jira/browse/YARN-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487265#comment-14487265 ] Chengbing Liu commented on YARN-3266: - [~jianhe], would you like to take a look at this? Thanks! RMContext inactiveNodes should have NodeId as map key - Key: YARN-3266 URL: https://issues.apache.org/jira/browse/YARN-3266 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-3266.01.patch, YARN-3266.02.patch Under the default NM port configuration, which is 0, we have observed in the current version that the "lost nodes" count is greater than the length of the lost node list. This will happen when we consecutively restart the same NM twice:
* NM started at port 10001
* NM restarted at port 10002
* NM restarted at port 10003
* NM:10001 timeout, {{ClusterMetrics#incrNumLostNMs()}}, # lost node=1; {{rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode)}}, {{inactiveNodes}} has 1 element
* NM:10002 timeout, {{ClusterMetrics#incrNumLostNMs()}}, # lost node=2; {{rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode)}}, {{inactiveNodes}} still has 1 element
Since we allow multiple NodeManagers on one host (as discussed in YARN-1888), {{inactiveNodes}} should be of type {{ConcurrentMap<NodeId, RMNode>}}. If this would break the current API, then the key string should include the NM's port as well. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
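To make the host-keyed vs NodeId-keyed behaviour described above concrete, here is a small, self-contained sketch (the class and values are illustrative only and not part of the YARN-3266 patch; it only assumes the YARN API jar for {{NodeId}}):
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.hadoop.yarn.api.records.NodeId;

public class InactiveNodesKeySketch {
  public static void main(String[] args) {
    // Keyed by host: two NMs on the same host overwrite each other, so the
    // map keeps 1 entry while the lost-NM metric counts 2.
    ConcurrentMap<String, NodeId> byHost = new ConcurrentHashMap<String, NodeId>();
    byHost.put("host1", NodeId.newInstance("host1", 10001));
    byHost.put("host1", NodeId.newInstance("host1", 10002));
    System.out.println("keyed by host: " + byHost.size());      // 1

    // Keyed by NodeId (host:port): one entry per NM instance, matching the metric.
    ConcurrentMap<NodeId, String> byNodeId = new ConcurrentHashMap<NodeId, String>();
    byNodeId.put(NodeId.newInstance("host1", 10001), "lost");
    byNodeId.put(NodeId.newInstance("host1", 10002), "lost");
    System.out.println("keyed by NodeId: " + byNodeId.size());  // 2
  }
}
{code}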
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487314#comment-14487314 ] Junping Du commented on YARN-3391: -- Thanks [~vrushalic] for review, v5 patch LGTM too. [~vinodkv], any additional comments? Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch, YARN-3391.2.patch, YARN-3391.3.patch, YARN-3391.4.patch, YARN-3391.5.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487581#comment-14487581 ] Hadoop QA commented on YARN-3301: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12724229/YARN-3301.2.patch against trunk revision 6495940. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7273//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7273//console This message is automatically generated. Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3301.1.patch, YARN-3301.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3468) NM should not blindly rename usercache/filecache/nmPrivate on restart
[ https://issues.apache.org/jira/browse/YARN-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487979#comment-14487979 ] Siqi Li commented on YARN-3468: --- Could anyone share some comments on this jira? NM should not blindly rename usercache/filecache/nmPrivate on restart - Key: YARN-3468 URL: https://issues.apache.org/jira/browse/YARN-3468 Project: Hadoop YARN Issue Type: Bug Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-3468.v1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3465) Use LinkedHashMap to preserve order of resource requests
[ https://issues.apache.org/jira/browse/YARN-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487199#comment-14487199 ] Hudson commented on YARN-3465: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #149 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/149/]) YARN-3465. Use LinkedHashMap to preserve order of resource requests. (Zhihai Xu via kasha) (kasha: rev 6495940eae09418a939882a8955845f9241a6485) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java * hadoop-yarn-project/CHANGES.txt Use LinkedHashMap to preserve order of resource requests Key: YARN-3465 URL: https://issues.apache.org/jira/browse/YARN-3465 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3465.000.patch use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
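As background for why the map type matters in the change above, a tiny standalone illustration of insertion-order preservation (the resource names are made up and these are not the actual {{ContainerImpl}} types):
{code}
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class ResourceOrderSketch {
  public static void main(String[] args) {
    // HashMap iteration order is unspecified, so pending resource requests could be
    // walked in a different order than they were added.
    Map<String, String> unordered = new HashMap<String, String>();
    // LinkedHashMap iterates in insertion order, which is what preserving the
    // order of LocalResourceRequest additions requires.
    Map<String, String> ordered = new LinkedHashMap<String, String>();
    for (String r : new String[] {"job.jar", "job.xml", "archive.tgz"}) {
      unordered.put(r, "PENDING");
      ordered.put(r, "PENDING");
    }
    System.out.println("HashMap order:       " + unordered.keySet());
    System.out.println("LinkedHashMap order: " + ordered.keySet());
  }
}
{code}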
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487802#comment-14487802 ] Wangda Tan commented on YARN-3434: -- [~tgraves], you're right. But I'm wondering why this could happen: When continuousReservation is enabled, it will do this check in assignContainer:
{code}
if (reservationsContinueLooking && rmContainer == null) {
  // we could possibly ignoring parent queue capacity limits when
  // reservationsContinueLooking is set.
  // If we're trying to reserve a container here, not container will be
  // unreserved for reserving the new one. Check limits again before
  // reserve the new container
  if (!checkLimitsToReserve(clusterResource, application, capability)) {
    return Resources.none();
  }
}
{code}
When continuousReservation is disabled, assignContainers will ensure user-limit will not be violated. My point is, *user-limit and queue max capacity are all checked before reserving a new container*. And an allocation from a reserved container will unreserve it before continuing. So I think in your case, https://issues.apache.org/jira/browse/YARN-3434?focusedCommentId=14485834&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14485834: job-2 cannot reserve 25 * 12 GB containers. Did I miss anything? And I've a question about continuous reservation checking behavior, which may or may not be related to this issue: Now it will try to unreserve all containers under a user, but actually it will only unreserve at most one container to allocate a new container. Do you think it is fine to change the logic to be: When (continuousReservation-enabled) && (user.usage + required - min(max-allocation, user.total-reserved) <= user.limit), assignContainers will continue. This will prevent doing impossible allocations when a user has reserved lots of containers. (Same as the queue reservation checking.) Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
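A rough sketch of the gate proposed above, written out as plain Java over memory sizes (all names here are hypothetical and not CapacityScheduler APIs):
{code}
// Hypothetical illustration of the proposed continuous-reservation check; it is
// not actual CapacityScheduler code and uses plain longs in place of Resource.
static boolean shouldKeepAssigning(boolean continuousReservationEnabled,
    long userUsage, long required, long maxAllocation,
    long userTotalReserved, long userLimit) {
  if (!continuousReservationEnabled) {
    return true; // the existing user-limit checks in assignContainers still apply
  }
  // At most one reserved container (capped at max-allocation) can be unreserved
  // to satisfy this request, so only that much is credited back against the limit.
  long creditableReservation = Math.min(maxAllocation, userTotalReserved);
  return userUsage + required - creditableReservation <= userLimit;
}
{code}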
[jira] [Updated] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3348: Attachment: apache-yarn-3348.5.patch Uploaded a new patch fixing the scrolling issue in the sort, fields screen. Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch, apache-yarn-3348.3.patch, apache-yarn-3348.4.patch, apache-yarn-3348.5.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3301: Attachment: YARN-3301.2.patch Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3301.1.patch, YARN-3301.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487992#comment-14487992 ] Varun Vasudev commented on YARN-3348: - Hit submit a little too soon. Deleted the patch I uploaded. Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch, apache-yarn-3348.3.patch, apache-yarn-3348.4.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3431: -- Attachment: YARN-3431.3.patch Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch We have TimelineEntity and some other entities as subclass that inherit from it. However, we only have a single endpoint, which consume TimelineEntity rather than sub-classes and this endpoint will check the incoming request body contains exactly TimelineEntity object. However, the json data which is serialized from sub-class object seems not to be treated as an TimelineEntity object, and won't be deserialized into the corresponding sub-class object which cause deserialization failure as some discussions in YARN-3334 : https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488180#comment-14488180 ] Hudson commented on YARN-3055: -- SUCCESS: Integrated in Hadoop-trunk-Commit #7552 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7552/]) YARN-3055. Fixed ResourceManager's DelegationTokenRenewer to not stop token renewal of applications part of a bigger workflow. Contributed by Daryn Sharp. (vinodkv: rev 9c5911294e0ba71aefe4763731b0e780cde9d0ca) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Daryn Sharp Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise for the existing submitted applications which share this token will not get renew any more, and for new submitted applications which share this token, the token will be renew immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and is scheduled when app1 is submitted *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487620#comment-14487620 ] Daryn Sharp commented on YARN-3055: --- Two apps could double renew tokens (completely benign) before this patch. In practice the possibility is slim and its harmless. However, currently it's quite buggy. Both apps renewed and then stomped over each other's dttrs in allTokens. Now both apps reference separate yet equivalent dttr instances, when the intention was only one app should reference a token. A second/duplicate timer task was also scheduled. Haven't bothered to check later fallout from the inconsistencies. Patch: A double renew can still occur (unavoidable) but only one timer is scheduled. All apps reference the same dttr instance. Moving the logic down only creates 3 loops instead of 2 loops but I'll do if you feel strongly. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise for the existing submitted applications which share this token will not get renew any more, and for new submitted applications which share this token, the token will be renew immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and is scheduled when app1 is submitted *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
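To illustrate the "one shared renewal entry per token, one timer" intent described above, a small made-up model (the names are invented and this is not the DelegationTokenRenewer implementation):
{code}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative model only: all apps sharing a token reference the same entry,
// the renewal timer is scheduled once, and the entry survives until the last
// referencing app finishes.
public class SharedTokenRenewalSketch {
  static final class TokenEntry {
    final Set<String> referencingApps = ConcurrentHashMap.newKeySet();
    boolean renewTimerScheduled;
  }

  private final Map<String, TokenEntry> allTokens = new ConcurrentHashMap<>();

  synchronized void appSubmitted(String tokenId, String appId) {
    TokenEntry entry = allTokens.computeIfAbsent(tokenId, k -> new TokenEntry());
    entry.referencingApps.add(appId);
    if (!entry.renewTimerScheduled) {
      entry.renewTimerScheduled = true;   // schedule the renewal timer exactly once
    }
  }

  synchronized void appFinished(String tokenId, String appId) {
    TokenEntry entry = allTokens.get(tokenId);
    if (entry == null) {
      return;
    }
    entry.referencingApps.remove(appId);
    // Keep the entry (and its timer) while other apps still share the token;
    // only drop it when the last referencing app is gone.
    if (entry.referencingApps.isEmpty()) {
      allTokens.remove(tokenId);
    }
  }
}
{code}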
[jira] [Updated] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3348: Attachment: apache-yarn-3348.4.patch Uploaded a new patch which fixes an issue with yarn top output not clearing itself correctly. Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch, apache-yarn-3348.3.patch, apache-yarn-3348.4.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487818#comment-14487818 ] Hudson commented on YARN-2901: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #159 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/159/]) YARN-2901 addendum: Fixed findbugs warning caused by previously patch (wangda: rev ba9ee22ca4ed2c5ff447b66b2e2dfe25f6880fe0) * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender?(I'm open to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
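Since the description above suggests writing a custom appender, here is a minimal log4j 1.2 sketch of the idea (illustrative only; the real {{Log4jWarningErrorMetricsAppender}} additionally buckets counts by time window and purges old entries):
{code}
import java.util.concurrent.atomic.AtomicLong;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Level;
import org.apache.log4j.spi.LoggingEvent;

// Minimal sketch: count WARN and ERROR log events so a web UI page can report them.
public class WarnErrorCountingAppender extends AppenderSkeleton {
  private final AtomicLong warnings = new AtomicLong();
  private final AtomicLong errors = new AtomicLong();

  @Override
  protected void append(LoggingEvent event) {
    if (event.getLevel().equals(Level.WARN)) {
      warnings.incrementAndGet();
    } else if (event.getLevel().isGreaterOrEqual(Level.ERROR)) {
      errors.incrementAndGet();
    }
  }

  @Override public void close() { }
  @Override public boolean requiresLayout() { return false; }

  public long getWarningCount() { return warnings.get(); }
  public long getErrorCount() { return errors.get(); }
}
{code}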
[jira] [Updated] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3348: Attachment: apache-yarn-3348.5.patch Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch, apache-yarn-3348.3.patch, apache-yarn-3348.4.patch, apache-yarn-3348.5.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2890) MiniYarnCluster should turn on timeline service if configured to do so
[ https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487824#comment-14487824 ] Hudson commented on YARN-2890: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #159 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/159/]) YARN-2890. MiniYarnCluster should turn on timeline service if configured to do so. Contributed by Mit Desai. (hitesh: rev 265ed1fe804743601a8b62cabc1e4dc2ec8e502f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestMiniYarnCluster.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestMRTimelineEventHandling.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/jobhistory/TestJobHistoryEventHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java MiniYarnCluster should turn on timeline service if configured to do so -- Key: YARN-2890 URL: https://issues.apache.org/jira/browse/YARN-2890 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Fix For: 2.8.0 Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, YARN-2890.4.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch Currently the MiniMRYarnCluster does not consider the configuration value for enabling timeline service before starting. The MiniYarnCluster should only start the timeline service if it is configured to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
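The fix described above boils down to consulting the timeline-service flag before starting that service; a minimal standalone sketch of reading the flag (the class name is made up and the actual MiniYARNCluster wiring is omitted):
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelineFlagSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Only start the timeline service when the flag is set; otherwise skip it.
    boolean timelineEnabled = conf.getBoolean(
        YarnConfiguration.TIMELINE_SERVICE_ENABLED,
        YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED);
    System.out.println("start ApplicationHistoryServer? " + timelineEnabled);
  }
}
{code}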
[jira] [Commented] (YARN-3465) Use LinkedHashMap to preserve order of resource requests
[ https://issues.apache.org/jira/browse/YARN-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487811#comment-14487811 ] Hudson commented on YARN-3465: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #159 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/159/]) YARN-3465. Use LinkedHashMap to preserve order of resource requests. (Zhihai Xu via kasha) (kasha: rev 6495940eae09418a939882a8955845f9241a6485) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java Use LinkedHashMap to preserve order of resource requests Key: YARN-3465 URL: https://issues.apache.org/jira/browse/YARN-3465 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3465.000.patch use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488023#comment-14488023 ] Varun Vasudev commented on YARN-3348: - Uploaded a new patch that fixes a scrolling issue and makes the new method in YarnClient abstract. Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch, apache-yarn-3348.3.patch, apache-yarn-3348.4.patch, apache-yarn-3348.5.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3465) Use LinkedHashMap to preserve order of resource requests
[ https://issues.apache.org/jira/browse/YARN-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486909#comment-14486909 ] Hudson commented on YARN-3465: -- FAILURE: Integrated in Hadoop-trunk-Commit #7545 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7545/]) YARN-3465. Use LinkedHashMap to preserve order of resource requests. (Zhihai Xu via kasha) (kasha: rev 6495940eae09418a939882a8955845f9241a6485) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java Use LinkedHashMap to preserve order of resource requests Key: YARN-3465 URL: https://issues.apache.org/jira/browse/YARN-3465 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3465.000.patch use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3471) Fix timeline client retry
[ https://issues.apache.org/jira/browse/YARN-3471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3471: -- Attachment: YARN-3471.1.patch Uploaded a patch to fix the problem, and added test cases to verify that the retry works properly. Fix timeline client retry - Key: YARN-3471 URL: https://issues.apache.org/jira/browse/YARN-3471 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3471.1.patch I found that the client retry has some problems: 1. The new put methods will retry on all exceptions, but they should only do it upon ConnectException. 2. We can reuse TimelineClientConnectionRetry to simplify the retry logic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
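For reference, a small sketch of what "retry only upon ConnectException" means in practice (a hypothetical helper, not the actual {{TimelineClientConnectionRetry}} API):
{code}
import java.net.ConnectException;
import java.util.concurrent.Callable;

// Hypothetical retry helper: only ConnectException is retried, up to maxRetries;
// every other exception propagates immediately. Not the real timeline client code.
static <T> T retryOnConnect(int maxRetries, long retryIntervalMs, Callable<T> op)
    throws Exception {
  for (int attempt = 0; ; attempt++) {
    try {
      return op.call();
    } catch (ConnectException e) {
      if (attempt >= maxRetries) {
        throw e;                      // retries exhausted, give up
      }
      Thread.sleep(retryIntervalMs);  // back off, then try the connection again
    }
  }
}
{code}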
[jira] [Commented] (YARN-2890) MiniYarnCluster should turn on timeline service if configured to do so
[ https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487915#comment-14487915 ] Hudson commented on YARN-2890: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2108 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2108/]) YARN-2890. MiniYarnCluster should turn on timeline service if configured to do so. Contributed by Mit Desai. (hitesh: rev 265ed1fe804743601a8b62cabc1e4dc2ec8e502f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestMRTimelineEventHandling.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/jobhistory/TestJobHistoryEventHandler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestMiniYarnCluster.java MiniYarnCluster should turn on timeline service if configured to do so -- Key: YARN-2890 URL: https://issues.apache.org/jira/browse/YARN-2890 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Fix For: 2.8.0 Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, YARN-2890.4.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch Currently the MiniMRYarnCluster does not consider the configuration value for enabling timeline service before starting. The MiniYarnCluster should only start the timeline service if it is configured to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487357#comment-14487357 ] Hadoop QA commented on YARN-3348: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12724206/apache-yarn-3348.3.patch against trunk revision 6495940. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7270//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7270//console This message is automatically generated. Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch, apache-yarn-3348.3.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3348: Attachment: (was: apache-yarn-3348.5.patch) Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch, apache-yarn-3348.3.patch, apache-yarn-3348.4.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3469) Do not set watch for most cases in ZKRMStateStore
Jun Gong created YARN-3469: -- Summary: Do not set watch for most cases in ZKRMStateStore Key: YARN-3469 URL: https://issues.apache.org/jira/browse/YARN-3469 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Priority: Minor In ZKRMStateStore, most operations (e.g. {{getDataWithRetries}}) set watches on znodes. A large number of watches will cause problems such as [ZOOKEEPER-706: large numbers of watches can cause session re-establishment to fail|https://issues.apache.org/jira/browse/ZOOKEEPER-706]. Although there is a workaround of setting jute.maxbuffer to a larger value, we would need to keep adjusting this value as more apps and attempts are stored in ZK. And those watches are useless now. It might be better not to set watches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
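For context, the relevant ZooKeeper client calls take a boolean watch flag, so "do not set watches" amounts to passing false on the read paths; a minimal illustrative sketch (not the actual ZKRMStateStore code):
{code}
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustrative only: reading a znode without registering a watch, so loading a
// large RM state store does not accumulate watches on the ZooKeeper session.
static byte[] readWithoutWatch(ZooKeeper zk, String znodePath) throws Exception {
  Stat stat = new Stat();
  return zk.getData(znodePath, false /* no watch */, stat);
}
{code}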
[jira] [Commented] (YARN-3469) Do not set watch for most cases in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487515#comment-14487515 ] Hadoop QA commented on YARN-3469: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12724219/YARN-3469.01.patch against trunk revision 6495940. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7271//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7271//console This message is automatically generated. Do not set watch for most cases in ZKRMStateStore - Key: YARN-3469 URL: https://issues.apache.org/jira/browse/YARN-3469 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Priority: Minor Attachments: YARN-3469.01.patch In ZKRMStateStore, most operations(e.g. getDataWithRetries, getDataWithRetries, getDataWithRetries) set watches on znode. Large watches will cause problem such as [ZOOKEEPER-706: large numbers of watches can cause session re-establishment to fail|https://issues.apache.org/jira/browse/ZOOKEEPER-706]. Although there is a workaround that setting jute.maxbuffer to a larger value, we need to adjust this value once there are more app and attempts stored in ZK. And those watches are useless now. It might be better that do not set watches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3318) Create Initial OrderingPolicy Framework and FifoOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487631#comment-14487631 ] Craig Welch commented on YARN-3318: --- bq. ...Do we really see non-comparator based ordering-policy. We are unnecessarily adding two abstractions - adding policies and comparators... In the context of the code so far, the comparator based approach is specific to compounding comparators to achieve functionality (priority + fifo, fair + fifo, etc). This was the initial motivation for the two level api configuration, the broader surface of the policy which would allow for different collection types, sorting on demand, etc, (the original policy) and the narrower one within that (comparator) for the cases where comparator logic was sufficient, e.g. sharing a collection (for composition) and a collection type (a tree, for efficient resorting of individual elements when required) was possible. The two level api configuration was not well received. Offline Wangda has indicated that he thinks there are policies coming up which will need the wider, initial api, with control over the collection, sorting, etc. Supporting policy composition for those cases would be very awkward is not really worth pursuing. The various competing tradeoffs, the aversion to a multilevel api, the need for the higher level api, and the ability to compose policies creates something of a tension, I don't think it's realistic to try and accomplish it all together, the result will be Frankensteinian at best. Something has to go. Originally, I chose the multilevel api to resolve the dilemma, I like that choice, it seems unpopular with the crowd. Given that, the other optional dynamic is the ability to compose policies (there's no requirement for either of these as far as I can tell, it is a bonus feature). While I like the composition approach, it can't be maintained as such with the broader api and without the multilevel config/api. As one of these has to go and it appears it can't be the broader api or the multilevel api I suppose it will have to be composition. Internally there can be some composition of course, but it won't be transparent/exposed/configurable as it was initially. I'll put out a patch with that in a bit. Create Initial OrderingPolicy Framework and FifoOrderingPolicy -- Key: YARN-3318 URL: https://issues.apache.org/jira/browse/YARN-3318 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3318.13.patch, YARN-3318.14.patch, YARN-3318.17.patch, YARN-3318.34.patch, YARN-3318.35.patch, YARN-3318.36.patch, YARN-3318.39.patch, YARN-3318.45.patch, YARN-3318.47.patch, YARN-3318.48.patch Create the initial framework required for using OrderingPolicies and an initial FifoOrderingPolicy -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3466) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488039#comment-14488039 ] Jason Lowe commented on YARN-3466: -- This is really low risk, and Wangda and I both manually tested the fix. This was broken in 2.7 and it would be great to fix it in the same release so we don't regress in a public release. Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column - Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address, container count and node label column on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3470) Make PermissionStatusFormat public
Arun Suresh created YARN-3470: - Summary: Make PermissionStatusFormat public Key: YARN-3470 URL: https://issues.apache.org/jira/browse/YARN-3470 Project: Hadoop YARN Issue Type: Bug Reporter: Arun Suresh Priority: Minor Implementations of {{INodeAttributeProvider}} are required to provide an implementation of the {{getPermissionLong()}} method. Unfortunately, the long permission format is an encoding of the user, group and mode, with each field converted to an int using {{SerialNumberManager}}, which is package protected. Thus it would be nice to make the {{PermissionStatusFormat}} enum public (and also make the {{toLong()}} static method public) so that user-specified implementations of {{INodeAttributeProvider}} may use it. This would also make it more consistent with {{AclStatusFormat}}, which I guess has been made public for the same reason. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
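To illustrate the kind of encoding being discussed, a purely made-up example of packing user/group/mode serial numbers into one long; the real {{PermissionStatusFormat}} field widths and {{SerialNumberManager}} lookups in HDFS differ:
{code}
// Made-up bit layout, only to show the shape of a (user, group, mode) -> long encoding.
static long toLongSketch(int userSerial, int groupSerial, short mode) {
  return ((long) userSerial << 40) | ((long) groupSerial << 16) | (mode & 0xFFFFL);
}
{code}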
[jira] [Commented] (YARN-3465) Use LinkedHashMap to preserve order of resource requests
[ https://issues.apache.org/jira/browse/YARN-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487181#comment-14487181 ] Hudson commented on YARN-3465: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2090 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2090/]) YARN-3465. Use LinkedHashMap to preserve order of resource requests. (Zhihai Xu via kasha) (kasha: rev 6495940eae09418a939882a8955845f9241a6485) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java * hadoop-yarn-project/CHANGES.txt Use LinkedHashMap to preserve order of resource requests Key: YARN-3465 URL: https://issues.apache.org/jira/browse/YARN-3465 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3465.000.patch use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3466) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487714#comment-14487714 ] Vinod Kumar Vavilapalli commented on YARN-3466: --- bq. It would be nice to get this into 2.7. Seeing as how long 2.7.0 release has taken, I propose that we put this in 2.7.1. I'll start a discussion on the dev lists to immediately follow up 2.7.0 with a 2.7.1 within 2-3 weeks. That works? Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column - Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address, container count and node label column on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3465) Use LinkedHashMap to preserve order of resource requests
[ https://issues.apache.org/jira/browse/YARN-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3465: --- Summary: Use LinkedHashMap to preserve order of resource requests (was: use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl) Use LinkedHashMap to preserve order of resource requests Key: YARN-3465 URL: https://issues.apache.org/jira/browse/YARN-3465 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-3465.000.patch use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488149#comment-14488149 ] Vinod Kumar Vavilapalli commented on YARN-3055: --- bq. Not related to this patch, it's a bug in YARN-2704 .We should remove the token from the allTokens, otherwise, it's a leak in allTokens. it can be fixed separately. Good catch. Agree that this is not related to this patch, can you please file a ticket? Checking this in now. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise for the existing submitted applications which share this token will not get renew any more, and for new submitted applications which share this token, the token will be renew immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and is scheduled when app1 is submitted *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-3055: -- Attachment: YARN-3055.patch The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise for the existing submitted applications which share this token will not get renew any more, and for new submitted applications which share this token, the token will be renew immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and is scheduled when app1 is submitted *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3471) Fix timeline client retry
[ https://issues.apache.org/jira/browse/YARN-3471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14488138#comment-14488138 ] Steve Loughran commented on YARN-3471: --
# all raised exceptions need to include the URL of the timeline server in them. Otherwise nobody will ever be able to track down the problem if it's any
# you can actually test TimelineClient (or any Yarn service) in a try-with-resources clause, to get the service automatically stopped ({{Service extends Closeable}}), see
{code}
try (TimelineClient client = createClient()) {
}
{code}
Fix timeline client retry - Key: YARN-3471 URL: https://issues.apache.org/jira/browse/YARN-3471 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3471.1.patch I found that the client retry has some problems: 1. The new put methods will retry on all exceptions, but they should only do it upon ConnectException. 2. We can reuse TimelineClientConnectionRetry to simplify the retry logic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487642#comment-14487642 ] Hadoop QA commented on YARN-3348: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12724231/apache-yarn-3348.4.patch against trunk revision 6495940. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7274//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7274//console This message is automatically generated. Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch, apache-yarn-3348.3.patch, apache-yarn-3348.4.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3318) Create Initial OrderingPolicy Framework and FifoOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3318: -- Attachment: YARN-3318.52.patch Update, removing composition in favor of broader interface Create Initial OrderingPolicy Framework and FifoOrderingPolicy -- Key: YARN-3318 URL: https://issues.apache.org/jira/browse/YARN-3318 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3318.13.patch, YARN-3318.14.patch, YARN-3318.17.patch, YARN-3318.34.patch, YARN-3318.35.patch, YARN-3318.36.patch, YARN-3318.39.patch, YARN-3318.45.patch, YARN-3318.47.patch, YARN-3318.48.patch, YARN-3318.52.patch Create the initial framework required for using OrderingPolicies and an initial FifoOrderingPolicy -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3469) Do not set watch for most cases in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487989#comment-14487989 ] Karthik Kambatla commented on YARN-3469: When working on YARN-2716, I was wondering about the same. I think not setting watches makes sense. I'll let [~jianhe] also comment before committing this. Do not set watch for most cases in ZKRMStateStore - Key: YARN-3469 URL: https://issues.apache.org/jira/browse/YARN-3469 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Priority: Minor Attachments: YARN-3469.01.patch In ZKRMStateStore, most operations(e.g. getDataWithRetries, getDataWithRetries, getDataWithRetries) set watches on znode. Large watches will cause problem such as [ZOOKEEPER-706: large numbers of watches can cause session re-establishment to fail|https://issues.apache.org/jira/browse/ZOOKEEPER-706]. Although there is a workaround that setting jute.maxbuffer to a larger value, we need to adjust this value once there are more app and attempts stored in ZK. And those watches are useless now. It might be better that do not set watches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
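For illustration, here is a small sketch of what the proposal means at the ZooKeeper API level, using plain ZooKeeper calls rather than the ZKRMStateStore code itself: passing watch=false reads a znode without registering a watch, so reads of app/attempt state do not accumulate watches on the session.
{code}
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class NoWatchReadSketch {
  // Read znode data without leaving a watch behind (watch = false).
  public static byte[] readAppState(ZooKeeper zk, String znodePath) throws Exception {
    Stat stat = new Stat();
    return zk.getData(znodePath, false, stat);
  }
}
{code}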
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488001#comment-14488001 ] Li Lu commented on YARN-3431: - Hi [~zjshen], I checked your proposal and in general it LGTM. I have some minor concerns, however: # In general we're using v1 object model for data transferring and storage. Rebuilding the special info for subclasses may be challenging, as the special keys may be mixed with user defined keys. Even though the chance is low, we may want to find a more elegant solution on this. # How do the sub-class instances identify their own types? I think this is the core challenge here. Are we using duck typing here? That said, maybe we want to have a new data transfer type, that can accommodate the extra data in subclasses in extension fields, and can self-identify its type? I'm just thinking out loud here... Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch We have TimelineEntity and some other entities as subclass that inherit from it. However, we only have a single endpoint, which consume TimelineEntity rather than sub-classes and this endpoint will check the incoming request body contains exactly TimelineEntity object. However, the json data which is serialized from sub-class object seems not to be treated as an TimelineEntity object, and won't be deserialized into the corresponding sub-class object which cause deserialization failure as some discussions in YARN-3334 : https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487245#comment-14487245 ] Hudson commented on YARN-2901: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #892 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/892/]) YARN-2901 addendum: Fixed findbugs warning caused by previously patch (wangda: rev ba9ee22ca4ed2c5ff447b66b2e2dfe25f6880fe0) * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender?(I'm open to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2890) MiniYarnCluster should turn on timeline service if configured to do so
[ https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487251#comment-14487251 ] Hudson commented on YARN-2890: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #892 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/892/]) YARN-2890. MiniYarnCluster should turn on timeline service if configured to do so. Contributed by Mit Desai. (hitesh: rev 265ed1fe804743601a8b62cabc1e4dc2ec8e502f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestMiniYarnCluster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestMRTimelineEventHandling.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/jobhistory/TestJobHistoryEventHandler.java MiniYarnCluster should turn on timeline service if configured to do so -- Key: YARN-2890 URL: https://issues.apache.org/jira/browse/YARN-2890 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Fix For: 2.8.0 Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, YARN-2890.4.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch Currently the MiniMRYarnCluster does not consider the configuration value for enabling timeline service before starting. The MiniYarnCluster should only start the timeline service if it is configured to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487640#comment-14487640 ] Hadoop QA commented on YARN-3136: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12724243/00010-YARN-3136.patch against trunk revision 6495940. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7275//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7275//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7275//console This message is automatically generated. getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488011#comment-14488011 ] Thomas Graves commented on YARN-3434: - The code you mention is in the else part of that check, where it would do a reservation. The situation I'm talking about actually allocates a container, not reserves one. I'll try to explain better: the application asks for lots of containers. It acquires some containers, then it reserves some. At this point it hits its normal user limit, which in my example = capacity. It hasn't hit the max amount it can allocate or reserve (shouldAllocOrReserveNewContainer()). The next node that heartbeats in isn't yet reserved and has enough space to place a container on. It is first checked in assignContainers -> canAssignToThisQueue. That passes since we haven't hit max capacity. Then it checks assignContainers -> canAssignToUser. That passes, but only because used - reserved < the user limit. This allows it to continue down into assignContainer. In assignContainer the node has available space and we haven't hit shouldAllocOrReserveNewContainer(). reservationsContinueLooking is on and labels are empty, so it does the check: {noformat} if (!shouldAllocOrReserveNewContainer || Resources.greaterThan(resourceCalculator, clusterResource, minimumUnreservedResource, Resources.none())) {noformat} As I said before, it's allowed to allocate or reserve, so it passes that test. It also hasn't met its maximum capacity yet (capacity = 30% and max capacity = 100%), so minimumUnreservedResource is none and that check doesn't kick in, and it doesn't go into the block to findNodeToUnreserve(). Then it goes ahead and allocates when it should have been forced to unreserve. Basically we needed to also do the user limit check again and force it to do the findNodeToUnreserve. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
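A toy numeric illustration of the gap described above (made-up numbers, not CapacityScheduler code): canAssignToUser compares used minus reserved against the user limit, so a user that is already at its limit can still be handed a new container unless the user limit is re-checked in assignContainer and findNodeToUnreserve() is forced.
{code}
// Toy illustration only -- shows why the canAssignToUser check passes even
// though the user is already at its limit, per the discussion above.
public class UserLimitCheckDemo {
  public static void main(String[] args) {
    long userLimitMB = 30_000;   // hypothetical user limit (30% of a 100 GB queue)
    long usedMB      = 30_000;   // the user has already reached that limit
    long reservedMB  = 8_000;    // but 8 GB of the usage is only reserved

    boolean canAssignToUser = (usedMB - reservedMB) < userLimitMB; // true: check passes
    boolean atUserLimit     = usedMB >= userLimitMB;               // true: already at limit

    System.out.println("canAssignToUser = " + canAssignToUser);
    System.out.println("already at user limit = " + atUserLimit);
    // Without re-doing the user limit check in assignContainer and calling
    // findNodeToUnreserve(), the new allocation pushes the user past its limit.
  }
}
{code}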
[jira] [Updated] (YARN-3347) Improve YARN log command to get AMContainer logs as well as running containers logs
[ https://issues.apache.org/jira/browse/YARN-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3347: Attachment: YARN-3347.3.rebase.patch Improve YARN log command to get AMContainer logs as well as running containers logs --- Key: YARN-3347 URL: https://issues.apache.org/jira/browse/YARN-3347 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3347.1.patch, YARN-3347.1.rebase.patch, YARN-3347.2.patch, YARN-3347.2.rebase.patch, YARN-3347.3.patch, YARN-3347.3.rebase.patch Right now, we could specify applicationId, node http address and container ID to get the specify container log. Or we could only specify applicationId to get all the container logs. It is very hard for the users to get logs for AM container since the AMContainer logs have more useful information. Users need to know the AMContainer's container ID and related Node http address. We could improve the YARN Log Command to allow users to get AMContainer logs directly -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486863#comment-14486863 ] Zhijie Shen commented on YARN-3431: --- I uploaded a patch that resolves the problem the other way. I thought about the sub-classes again, and found it is not necessary to have corresponding web service resources for them. In fact, there are two levels: 1. Java API level: we want to have the sub-classes of TimelineEntity as first-class citizens, which makes it easier for users to operate on the predefined entities. They may have special setters/getters. 2. REST API level: the JSON schema isn't polymorphic, so we should have one schema that is generic enough to describe different kinds of entities. Fortunately, the entity schema is able to do that. The sub-classes of TimelineEntity contain the following additional information: a) Special attributes: they can be put into the info map of the entity and treated as predefined info. For example, the queue of an application entity can be put into info with key=QUEUE_INFO_KEY and value = some queue name. b) Parent-child relationships: they can be put into the relate/is_related_to relationship map of the entity. The relate/is_related_to relationship can describe an arbitrary directed graph, and a tree is one type of directed graph. In the new patch, I fixed the API records instead of the endpoint. Therefore, we will still have a single endpoint to accept entities, while the Java APIs stay unchanged too. In terms of JSON content for communication, we will always use the generic entity schema for TimelineEntity and all of its sub-classes. BTW, I fixed some minor issues together in this patch, such as renaming UserEntity and QueueEntity, and FlowEntity attributes. Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch We have TimelineEntity and some other entities as subclass that inherit from it. However, we only have a single endpoint, which consume TimelineEntity rather than sub-classes and this endpoint will check the incoming request body contains exactly TimelineEntity object. However, the json data which is serialized from sub-class object seems not to be treated as an TimelineEntity object, and won't be deserialized into the corresponding sub-class object which cause deserialization failure as some discussions in YARN-3334 : https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
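A rough sketch of points a) and b) above, using plain Java maps rather than the real TimelineEntity class (whose setter names and predefined keys may differ): the queue goes into the generic info map and the parent-child link into the is_related_to map, so a single generic schema can carry the sub-class data.
{code}
// Sketch only: key names such as YARN_APPLICATION_QUEUE and YARN_FLOW_RUN are
// illustrative, not necessarily the predefined constants used by the patch.
import java.util.HashMap;
import java.util.Map;

public class GenericEntitySketch {
  public static void main(String[] args) {
    Map<String, Object> entity = new HashMap<>();
    entity.put("type", "YARN_APPLICATION");
    entity.put("id", "application_1428000000000_0001");

    // a) special attributes go into the predefined info map
    Map<String, Object> info = new HashMap<>();
    info.put("YARN_APPLICATION_QUEUE", "default");
    entity.put("info", info);

    // b) parent-child relationships go into the is_related_to map
    Map<String, String[]> isRelatedTo = new HashMap<>();
    isRelatedTo.put("YARN_FLOW_RUN", new String[] {"flow_run_1"});
    entity.put("isrelatedto", isRelatedTo);

    System.out.println(entity);
  }
}
{code}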
[jira] [Commented] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users
[ https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487339#comment-14487339 ] Steve Loughran commented on YARN-2423: -- @rkanter: if the YARN-2444 patch gets in (do you want to review that?), all my production-side recommendations will be there. TimelineClient should wrap all GET APIs to facilitate Java users Key: YARN-2423 URL: https://issues.apache.org/jira/browse/YARN-2423 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Robert Kanter Attachments: YARN-2423.004.patch, YARN-2423.005.patch, YARN-2423.006.patch, YARN-2423.007.patch, YARN-2423.patch, YARN-2423.patch, YARN-2423.patch TimelineClient provides the Java method to put timeline entities. It's also good to wrap over all GET APIs (both entity and domain), and deserialize the json response into Java POJO objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487050#comment-14487050 ] Xuan Gong commented on YARN-3301: - bq. should it be the fix for 2.7? This is a format issue. It is OK to fix it in 2.7, but it does not need to be a blocker. Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3301.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487482#comment-14487482 ] Daryn Sharp commented on YARN-3055: --- Thanks Vinod, I'll revise this morning. The ignores shouldn't be there. I did that for our internal emergency fix because I didn't handle proxy refresh tokens, so I didn't care that the tests failed. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise the existing submitted applications which share this token will not get it renewed any more, and for newly submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3459) Fix failiure of TestLog4jWarningErrorMetricsAppender
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487239#comment-14487239 ] Hudson commented on YARN-3459: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #892 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/892/]) YARN-3459. Fix failiure of TestLog4jWarningErrorMetricsAppender. (Varun Vasudev via wangda) (wangda: rev 7af086a515d573dc90ea4deec7f4e3f23622e0e8) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLog4jWarningErrorMetricsAppender.java Fix failiure of TestLog4jWarningErrorMetricsAppender Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Varun Vasudev Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message: {code} Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender) Time elapsed: 2.01 sec FAILURE! java.lang.AssertionError: expected:0 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2801) Documentation development for Node labels requirment
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487859#comment-14487859 ] Wangda Tan commented on YARN-2801: -- I have a doc for 2.6 in apt format; I will try to cover the new changes in trunk and convert it to markdown soon. Will keep you posted. Documentation development for Node labels requirment Key: YARN-2801 URL: https://issues.apache.org/jira/browse/YARN-2801 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gururaj Shetty Assignee: Wangda Tan Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1376) NM need to notify the log aggregation status to RM through Node heartbeat
[ https://issues.apache.org/jira/browse/YARN-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487551#comment-14487551 ] Junping Du commented on YARN-1376: -- Thanks [~xgong] for updating the patch. I just reviewed it and have some comments below; most are minor issues except the last one:
{code}
public static final long DEFAULT_LOG_AGGREGATION_STATUS_TIME_OUT_MS = 10*60*1000;
{code}
Add a space between the number and *. In LogAggregationReportPBImpl.java,
{code}
+  private LogAggregationStatus convertFromProtoFormat(
+      LogAggregationStatusProto s) {
+    return LogAggregationStatus.valueOf(s.name().replace(
+        LOGAGGREGATION_STATUS_PREFIX, ""));
+  }
+
+  private LogAggregationStatusProto
+      convertToProtoFormat(LogAggregationStatus s) {
+    return LogAggregationStatusProto.valueOf(LOGAGGREGATION_STATUS_PREFIX
+        + s.name());
+  }
{code}
Looks like we are adding/removing LOGAGGREGATION_STATUS_PREFIX between the java obj and the proto obj. I think this is not necessary? Am I missing something here? In NodeStatusUpdaterImpl.java,
{code}
+    if (!latestLogAggregationReports.containsKey(logAggregationReport
+        .getApplicationId())) {
       ... // A
+    } else {
       ... // B
     }
{code}
Can we remove the ! in the if condition and swap the order of A and B, which looks simpler? In uploadLogsForContainers() of AppLogAggregatorImpl.java,
{code}
    } catch (Exception e) {
      LOG.error("Failed to move temporary log file to final location: ["
          + remoteNodeTmpLogFileForApp + "] to [" + renamedPath + "]", e);
+     diagnosticMessage =
+         "Log uploaded failed for Application: " + appId
+             + " in NodeManager: "
+             + LogAggregationUtils.getNodeString(nodeId) + " at "
+             + Times.format(currentTime) + "\n";
    }
+
+   LogAggregationReport report =
+       Records.newRecord(LogAggregationReport.class);
+   report.setApplicationId(appId);
+   report.setNodeId(nodeId);
+   report.setDiagnosticMessage(diagnosticMessage);
+   if (appFinished) {
+     report.setLogAggregationStatus(LogAggregationStatus.FINISHED);
+   } else {
+     report.setLogAggregationStatus(LogAggregationStatus.RUNNING);
+   }
+   this.context.getLogAggregationStatusForApps().add(report);
{code}
Looks like we only set LogAggregationStatus to FINISHED or RUNNING here even when moving the temp log to HDFS failed. That doesn't seem correct to me. We should add a FAILED state for LogAggregationStatus to address this case? The rest looks fine to me. NM need to notify the log aggregation status to RM through Node heartbeat - Key: YARN-1376 URL: https://issues.apache.org/jira/browse/YARN-1376 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: Screen Shot 2015-04-07 at 9.30.42 AM.png, YARN-1376.1.patch, YARN-1376.2.patch, YARN-1376.2.patch, YARN-1376.2015-04-04.patch, YARN-1376.2015-04-06.patch, YARN-1376.2015-04-07.patch, YARN-1376.3.patch, YARN-1376.4.patch Expose a client API to allow clients to figure out if log aggregation is complete. This ticket is used to track the changes on the NM side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
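To make the last point concrete, here is a small sketch (assumed names, not the committed YARN-1376 change) of what adding a FAILED value and reporting it when the upload to HDFS fails could look like:
{code}
// Sketch only: the real enum lives in the YARN API records; FAILED is the
// value suggested above for the case where moving the temp log to HDFS fails.
public class LogAggregationStatusSketch {
  enum LogAggregationStatus { RUNNING, FINISHED, FAILED }

  static LogAggregationStatus statusFor(boolean appFinished, boolean uploadFailed) {
    if (uploadFailed) {
      return LogAggregationStatus.FAILED;      // instead of RUNNING/FINISHED
    }
    return appFinished ? LogAggregationStatus.FINISHED
                       : LogAggregationStatus.RUNNING;
  }

  public static void main(String[] args) {
    System.out.println(statusFor(true, true));    // FAILED
    System.out.println(statusFor(true, false));   // FINISHED
    System.out.println(statusFor(false, false));  // RUNNING
  }
}
{code}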
[jira] [Commented] (YARN-3361) CapacityScheduler side changes to support non-exclusive node labels
[ https://issues.apache.org/jira/browse/YARN-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487975#comment-14487975 ] Hadoop QA commented on YARN-3361: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723968/YARN-3361.4.patch against trunk revision 3fe61e0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 11 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationLimits org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7278//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7278//artifact/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7278//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7278//console This message is automatically generated. CapacityScheduler side changes to support non-exclusive node labels --- Key: YARN-3361 URL: https://issues.apache.org/jira/browse/YARN-3361 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3361.1.patch, YARN-3361.2.patch, YARN-3361.3.patch, YARN-3361.4.patch According to design doc attached in YARN-3214, we need implement following logic in CapacityScheduler: 1) When allocate a resource request with no node-label specified, it should get preferentially allocated to node without labels. 2) When there're some available resource in a node with label, they can be used by applications with following order: - Applications under queues which can access the label and ask for same labeled resource. - Applications under queues which can access the label and ask for non-labeled resource. - Applications under queues cannot access the label and ask for non-labeled resource. 3) Expose necessary information that can be used by preemption policy to make preemption decisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3465) Use LinkedHashMap to preserve order of resource requests
[ https://issues.apache.org/jira/browse/YARN-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487238#comment-14487238 ] Hudson commented on YARN-3465: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #892 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/892/]) YARN-3465. Use LinkedHashMap to preserve order of resource requests. (Zhihai Xu via kasha) (kasha: rev 6495940eae09418a939882a8955845f9241a6485) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java * hadoop-yarn-project/CHANGES.txt Use LinkedHashMap to preserve order of resource requests Key: YARN-3465 URL: https://issues.apache.org/jira/browse/YARN-3465 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3465.000.patch use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3426) Add jdiff support to YARN
[ https://issues.apache.org/jira/browse/YARN-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487933#comment-14487933 ] Li Lu commented on YARN-3426: - The problem with the current solution is that we're duplicating a lot of maven code for hadoop-common/hdfs and yarn. We're also introducing duplication into mapreduce with the current approach. The next step for this work should be removing the duplication in that maven code. Meanwhile, for YARN, we may want to add maven code to generate javadocs for public APIs only, similar to hadoop-common/hdfs. Add jdiff support to YARN - Key: YARN-3426 URL: https://issues.apache.org/jira/browse/YARN-3426 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Priority: Blocker Attachments: YARN-3426-040615-1.patch, YARN-3426-040615.patch, YARN-3426-040715.patch, YARN-3426-040815.patch Maybe we'd like to extend our current jdiff tool for hadoop-common and hdfs to YARN as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3471) Fix timeline client retry
Zhijie Shen created YARN-3471: - Summary: Fix timeline client retry Key: YARN-3471 URL: https://issues.apache.org/jira/browse/YARN-3471 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen I found that the client retry has some problems: 1. The new put methods will retry on all exceptions, but they should only do it upon ConnectException. 2. We can reuse TimelineClientConnectionRetry to simplify the retry logic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3347) Improve YARN log command to get AMContainer logs as well as running containers logs
[ https://issues.apache.org/jira/browse/YARN-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487441#comment-14487441 ] Hadoop QA commented on YARN-3347: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12724216/YARN-3347.3.1.patch against trunk revision 6495940. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.client.api.impl.TestAMRMClient Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7272//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7272//console This message is automatically generated. Improve YARN log command to get AMContainer logs as well as running containers logs --- Key: YARN-3347 URL: https://issues.apache.org/jira/browse/YARN-3347 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3347.1.patch, YARN-3347.1.rebase.patch, YARN-3347.2.patch, YARN-3347.2.rebase.patch, YARN-3347.3.1.patch, YARN-3347.3.patch, YARN-3347.3.rebase.patch Right now, we could specify applicationId, node http address and container ID to get the specify container log. Or we could only specify applicationId to get all the container logs. It is very hard for the users to get logs for AM container since the AMContainer logs have more useful information. Users need to know the AMContainer's container ID and related Node http address. We could improve the YARN Log Command to allow users to get AMContainer logs directly -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488115#comment-14488115 ] Jian He commented on YARN-3055: --- Thanks Daryn, the patch looks good to me too. +1. Not related to this patch, but there is a bug in YARN-2704: we should remove the token from allTokens here, otherwise the entry leaks in allTokens. It can be fixed separately.
{code}
if (t.token.getKind().equals(new Text("HDFS_DELEGATION_TOKEN"))) {
  iter.remove();
  t.cancelTimer();
  LOG.info("Removed expiring token " + t);
}
{code}
The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise the existing submitted applications which share this token will not get it renewed any more, and for newly submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
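To illustrate the leak pointed out above, here is a self-contained toy example using plain JDK classes (not the actual DelegationTokenRenewer): if the expiring token's timer is cancelled but its entry is never removed from the allTokens-style map, the map keeps growing.
{code}
// Toy example only: allTokens here is a plain map keyed by a token name,
// standing in for DelegationTokenRenewer's internal bookkeeping.
import java.util.Iterator;
import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentHashMap;

public class AllTokensCleanupSketch {
  static final Map<String, TimerTask> allTokens = new ConcurrentHashMap<>();

  public static void main(String[] args) {
    Timer timer = new Timer(true);
    TimerTask renewal = new TimerTask() { public void run() { /* renew */ } };
    allTokens.put("HDFS_DELEGATION_TOKEN-1", renewal);
    timer.schedule(renewal, 60_000);

    // Expiration path: cancel the renewal timer AND drop the map entry.
    for (Iterator<Map.Entry<String, TimerTask>> it =
             allTokens.entrySet().iterator(); it.hasNext();) {
      Map.Entry<String, TimerTask> e = it.next();
      if (e.getKey().startsWith("HDFS_DELEGATION_TOKEN")) {
        e.getValue().cancel();
        it.remove();   // without this, the entry leaks in allTokens
        System.out.println("Removed expiring token " + e.getKey());
      }
    }
    timer.cancel();
  }
}
{code}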
[jira] [Commented] (YARN-3225) New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
[ https://issues.apache.org/jira/browse/YARN-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487146#comment-14487146 ] Hadoop QA commented on YARN-3225: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12724185/YARN-3225-4.patch against trunk revision 6495940. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.api.impl.TestTimelineClient Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7268//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7268//console This message is automatically generated. New parameter or CLI for decommissioning node gracefully in RMAdmin CLI --- Key: YARN-3225 URL: https://issues.apache.org/jira/browse/YARN-3225 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Devaraj K Attachments: YARN-3225-1.patch, YARN-3225-2.patch, YARN-3225-3.patch, YARN-3225-4.patch, YARN-3225.patch, YARN-914.patch New CLI (or existing CLI with parameters) should put each node on decommission list to decommissioning status and track timeout to terminate the nodes that haven't get finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3459) Fix failiure of TestLog4jWarningErrorMetricsAppender
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487121#comment-14487121 ] Hudson commented on YARN-3459: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #158 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/158/]) YARN-3459. Fix failiure of TestLog4jWarningErrorMetricsAppender. (Varun Vasudev via wangda) (wangda: rev 7af086a515d573dc90ea4deec7f4e3f23622e0e8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLog4jWarningErrorMetricsAppender.java * hadoop-yarn-project/CHANGES.txt Fix failiure of TestLog4jWarningErrorMetricsAppender Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Varun Vasudev Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message: {code} Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender) Time elapsed: 2.01 sec FAILURE! java.lang.AssertionError: expected:0 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487127#comment-14487127 ] Hudson commented on YARN-2901: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #158 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/158/]) YARN-2901 addendum: Fixed findbugs warning caused by previously patch (wangda: rev ba9ee22ca4ed2c5ff447b66b2e2dfe25f6880fe0) * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender?(I'm open to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2890) MiniYarnCluster should turn on timeline service if configured to do so
[ https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487133#comment-14487133 ] Hudson commented on YARN-2890: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #158 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/158/]) YARN-2890. MiniYarnCluster should turn on timeline service if configured to do so. Contributed by Mit Desai. (hitesh: rev 265ed1fe804743601a8b62cabc1e4dc2ec8e502f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestMRTimelineEventHandling.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestMiniYarnCluster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/jobhistory/TestJobHistoryEventHandler.java MiniYarnCluster should turn on timeline service if configured to do so -- Key: YARN-2890 URL: https://issues.apache.org/jira/browse/YARN-2890 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Fix For: 2.8.0 Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, YARN-2890.4.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch Currently the MiniMRYarnCluster does not consider the configuration value for enabling timeline service before starting. The MiniYarnCluster should only start the timeline service if it is configured to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3466) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487733#comment-14487733 ] Wangda Tan commented on YARN-3466: -- Just committed to trunk/branch-2. [~jlowe], [~vinodkv], please let me know when you figure out whether we need to put this into 2.7.x. Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column - Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address, container count and node label column on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3347) Improve YARN log command to get AMContainer logs as well as running containers logs
[ https://issues.apache.org/jira/browse/YARN-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3347: Attachment: YARN-3347.3.1.patch Improve YARN log command to get AMContainer logs as well as running containers logs --- Key: YARN-3347 URL: https://issues.apache.org/jira/browse/YARN-3347 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3347.1.patch, YARN-3347.1.rebase.patch, YARN-3347.2.patch, YARN-3347.2.rebase.patch, YARN-3347.3.1.patch, YARN-3347.3.patch, YARN-3347.3.rebase.patch Right now, we could specify applicationId, node http address and container ID to get the specify container log. Or we could only specify applicationId to get all the container logs. It is very hard for the users to get logs for AM container since the AMContainer logs have more useful information. Users need to know the AMContainer's container ID and related Node http address. We could improve the YARN Log Command to allow users to get AMContainer logs directly -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3431: -- Attachment: YARN-3431.2.patch Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.2.patch We have TimelineEntity and some other entities as subclass that inherit from it. However, we only have a single endpoint, which consume TimelineEntity rather than sub-classes and this endpoint will check the incoming request body contains exactly TimelineEntity object. However, the json data which is serialized from sub-class object seems not to be treated as an TimelineEntity object, and won't be deserialized into the corresponding sub-class object which cause deserialization failure as some discussions in YARN-3334 : https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3469) Do not set watch for most cases in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-3469: --- Description: In ZKRMStateStore, most operations(e.g. getDataWithRetries, getDataWithRetries, getDataWithRetries) set watches on znode. Large watches will cause problem such as [ZOOKEEPER-706: large numbers of watches can cause session re-establishment to fail|https://issues.apache.org/jira/browse/ZOOKEEPER-706]. Although there is a workaround that setting jute.maxbuffer to a larger value, we need to adjust this value once there are more app and attempts stored in ZK. And those watches are useless now. It might be better that do not set watches. was: In ZKRMStateStore, most operations(e.g. getDataWithRetries, getDataWithRetries, getDataWithRetries) set watches on znode. Large watches will cause problem such as [ZOOKEEPER-706: large numbers of watches can cause session re-establishment to fail](https://issues.apache.org/jira/browse/ZOOKEEPER-706). Although there is a workaround that setting jute.maxbuffer to a larger value, we need to adjust this value once there are more app and attempts stored in ZK. And those watches are useless now. It might be better that do not set watches. Do not set watch for most cases in ZKRMStateStore - Key: YARN-3469 URL: https://issues.apache.org/jira/browse/YARN-3469 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Priority: Minor In ZKRMStateStore, most operations(e.g. getDataWithRetries, getDataWithRetries, getDataWithRetries) set watches on znode. Large watches will cause problem such as [ZOOKEEPER-706: large numbers of watches can cause session re-establishment to fail|https://issues.apache.org/jira/browse/ZOOKEEPER-706]. Although there is a workaround that setting jute.maxbuffer to a larger value, we need to adjust this value once there are more app and attempts stored in ZK. And those watches are useless now. It might be better that do not set watches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3347) Improve YARN log command to get AMContainer logs as well as running containers logs
[ https://issues.apache.org/jira/browse/YARN-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487145#comment-14487145 ] Hadoop QA commented on YARN-3347: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12724191/YARN-3347.3.rebase.patch against trunk revision 6495940. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7269//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7269//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-client.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7269//console This message is automatically generated. Improve YARN log command to get AMContainer logs as well as running containers logs --- Key: YARN-3347 URL: https://issues.apache.org/jira/browse/YARN-3347 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3347.1.patch, YARN-3347.1.rebase.patch, YARN-3347.2.patch, YARN-3347.2.rebase.patch, YARN-3347.3.patch, YARN-3347.3.rebase.patch Right now, we could specify applicationId, node http address and container ID to get the specify container log. Or we could only specify applicationId to get all the container logs. It is very hard for the users to get logs for AM container since the AMContainer logs have more useful information. Users need to know the AMContainer's container ID and related Node http address. We could improve the YARN Log Command to allow users to get AMContainer logs directly -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3225) New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
[ https://issues.apache.org/jira/browse/YARN-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3225: Attachment: YARN-3225-4.patch New parameter or CLI for decommissioning node gracefully in RMAdmin CLI --- Key: YARN-3225 URL: https://issues.apache.org/jira/browse/YARN-3225 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Devaraj K Attachments: YARN-3225-1.patch, YARN-3225-2.patch, YARN-3225-3.patch, YARN-3225-4.patch, YARN-3225.patch, YARN-914.patch New CLI (or existing CLI with parameters) should put each node on decommission list to decommissioning status and track timeout to terminate the nodes that haven't get finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3293) Track and display capacity scheduler health metrics in web UI
[ https://issues.apache.org/jira/browse/YARN-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486854#comment-14486854 ] Hadoop QA commented on YARN-3293: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723930/apache-yarn-3293.6.patch against trunk revision b1e0590. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7267//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7267//console This message is automatically generated. Track and display capacity scheduler health metrics in web UI - Key: YARN-3293 URL: https://issues.apache.org/jira/browse/YARN-3293 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Screen Shot 2015-03-30 at 4.30.14 PM.png, apache-yarn-3293.0.patch, apache-yarn-3293.1.patch, apache-yarn-3293.2.patch, apache-yarn-3293.4.patch, apache-yarn-3293.5.patch, apache-yarn-3293.6.patch It would be good to display metrics that let users know about the health of the capacity scheduler in the web UI. Today it is hard to get an idea if the capacity scheduler is functioning correctly. Metrics such as the time for the last allocation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3459) Fix failiure of TestLog4jWarningErrorMetricsAppender
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487182#comment-14487182 ] Hudson commented on YARN-3459: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2090 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2090/]) YARN-3459. Fix failiure of TestLog4jWarningErrorMetricsAppender. (Varun Vasudev via wangda) (wangda: rev 7af086a515d573dc90ea4deec7f4e3f23622e0e8) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLog4jWarningErrorMetricsAppender.java Fix failiure of TestLog4jWarningErrorMetricsAppender Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Varun Vasudev Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message: {code} Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender) Time elapsed: 2.01 sec FAILURE! java.lang.AssertionError: expected:0 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2890) MiniYarnCluster should turn on timeline service if configured to do so
[ https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487212#comment-14487212 ] Hudson commented on YARN-2890: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #149 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/149/]) YARN-2890. MiniYarnCluster should turn on timeline service if configured to do so. Contributed by Mit Desai. (hitesh: rev 265ed1fe804743601a8b62cabc1e4dc2ec8e502f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestMiniYarnCluster.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestMRTimelineEventHandling.java * hadoop-yarn-project/CHANGES.txt * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/jobhistory/TestJobHistoryEventHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java MiniYarnCluster should turn on timeline service if configured to do so -- Key: YARN-2890 URL: https://issues.apache.org/jira/browse/YARN-2890 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Fix For: 2.8.0 Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, YARN-2890.4.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch Currently the MiniMRYarnCluster does not consider the configuration value for enabling timeline service before starting. The MiniYarnCluster should only start the timeline service if it is configured to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
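For illustration only (this is not the committed YARN-2890 patch): a minimal sketch of the kind of guard the description asks for, reading the existing yarn.timeline-service.enabled flag before a mini cluster would start the timeline service. The class and method names below are made up for the example.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelineServiceGateExample {
  // Returns true only when the timeline service is explicitly enabled.
  static boolean shouldStartTimelineService(Configuration conf) {
    return conf.getBoolean(
        YarnConfiguration.TIMELINE_SERVICE_ENABLED,
        YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED);
  }

  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    conf.setBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED, false);
    // A mini cluster would consult this before adding the timeline service.
    System.out.println("start timeline service? " + shouldStartTimelineService(conf));
  }
}
{code}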
[jira] [Commented] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487188#comment-14487188 ] Hudson commented on YARN-2901: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2090 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2090/]) YARN-2901 addendum: Fixed findbugs warning caused by previously patch (wangda: rev ba9ee22ca4ed2c5ff447b66b2e2dfe25f6880fe0) * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender?(I'm open to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
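As a rough illustration of the "custom appender" idea mentioned in the description (this is not the Log4jWarningErrorMetricsAppender that was actually committed), a log4j 1.x appender could simply count WARN and ERROR events for a web page to display later; the class name is hypothetical.
{code}
import java.util.concurrent.atomic.AtomicLong;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Level;
import org.apache.log4j.spi.LoggingEvent;

public class CountingAppender extends AppenderSkeleton {
  private final AtomicLong errors = new AtomicLong();
  private final AtomicLong warnings = new AtomicLong();

  @Override
  protected void append(LoggingEvent event) {
    // Bucket each event by its log level.
    if (event.getLevel().isGreaterOrEqual(Level.ERROR)) {
      errors.incrementAndGet();
    } else if (event.getLevel().isGreaterOrEqual(Level.WARN)) {
      warnings.incrementAndGet();
    }
  }

  public long getErrorCount()   { return errors.get(); }
  public long getWarningCount() { return warnings.get(); }

  @Override
  public void close() { }

  @Override
  public boolean requiresLayout() { return false; }
}
{code}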
[jira] [Commented] (YARN-3459) Fix failiure of TestLog4jWarningErrorMetricsAppender
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487200#comment-14487200 ] Hudson commented on YARN-3459: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #149 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/149/]) YARN-3459. Fix failiure of TestLog4jWarningErrorMetricsAppender. (Varun Vasudev via wangda) (wangda: rev 7af086a515d573dc90ea4deec7f4e3f23622e0e8) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLog4jWarningErrorMetricsAppender.java Fix failiure of TestLog4jWarningErrorMetricsAppender Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Varun Vasudev Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message: {code} Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender) Time elapsed: 2.01 sec FAILURE! java.lang.AssertionError: expected:0 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2890) MiniYarnCluster should turn on timeline service if configured to do so
[ https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487194#comment-14487194 ] Hudson commented on YARN-2890: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2090 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2090/]) YARN-2890. MiniYarnCluster should turn on timeline service if configured to do so. Contributed by Mit Desai. (hitesh: rev 265ed1fe804743601a8b62cabc1e4dc2ec8e502f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/jobhistory/TestJobHistoryEventHandler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/TestMRTimelineEventHandling.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestMiniYarnCluster.java MiniYarnCluster should turn on timeline service if configured to do so -- Key: YARN-2890 URL: https://issues.apache.org/jira/browse/YARN-2890 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Fix For: 2.8.0 Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, YARN-2890.4.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch Currently the MiniMRYarnCluster does not consider the configuration value for enabling timeline service before starting. The MiniYarnCluster should only start the timeline service if it is configured to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487206#comment-14487206 ] Hudson commented on YARN-2901: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #149 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/149/]) YARN-2901 addendum: Fixed findbugs warning caused by previously patch (wangda: rev ba9ee22ca4ed2c5ff447b66b2e2dfe25f6880fe0) * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender?(I'm open to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2801) Documentation development for Node labels requirment
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486785#comment-14486785 ] Tsuyoshi Ozawa commented on YARN-2801: -- It would be good to add documentation under ./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/*.md. Documentation development for Node labels requirment Key: YARN-2801 URL: https://issues.apache.org/jira/browse/YARN-2801 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gururaj Shetty Assignee: Wangda Tan Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487416#comment-14487416 ] Thomas Graves commented on YARN-3434: - [~wangda] I'm not sure I follow what you are saying. The reservations are already counted in the user's usage, and we do consider reserved resources when doing the user limit calculations. Look at LeafQueue.assignContainers: the call to allocateResource is where it ends up adding to user usage. The canAssignToUser is where it does the user limit check and subtracts the reservations off to see if it can continue. Note I do think we should just get rid of the config for reservationsContinueLooking, but that is a separate issue. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488061#comment-14488061 ] Thomas Graves commented on YARN-3434: - {quote} And I have a question about the continuous reservation checking behavior, which may or may not be related to this issue: now it will try to unreserve all containers under a user, but actually it will only unreserve at most one container to allocate a new container. Do you think it is fine to change the logic to be: when (continousReservation-enabled) && (user.usage + required - min(max-allocation, user.total-reserved) <= user.limit), assignContainers will continue. This will prevent doing impossible allocations when a user has reserved lots of containers. (Same as the queue reservation checking.) {quote} I do think the reservation checking and unreserving can be improved. I basically started with a very simple thing and figured we could improve it. I'm not sure how much that check would help in practice. I guess it might help the cases where you have 1 user in the queue and a second one shows up and your user limit gets decreased by a lot. In that case it may prevent it from continuing when it can short circuit here. So it would seem to be ok for that. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
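To make the quoted condition concrete, here is a deliberately simplified, hypothetical sketch of that short-circuit using plain memory values in MB; the real scheduler would go through the Resource and user-limit machinery in LeafQueue rather than raw longs, and all names below are made up for the example.
{code}
public class UserLimitShortCircuit {
  static boolean mayContinueAssigning(long userUsageMb, long requiredMb,
      long userReservedMb, long maxAllocationMb, long userLimitMb) {
    // Only credit back what unreserving a single container could free at most.
    long creditableReservation = Math.min(maxAllocationMb, userReservedMb);
    return userUsageMb + requiredMb - creditableReservation <= userLimitMb;
  }

  public static void main(String[] args) {
    // User at 90 GB of a 96 GB limit asks for 8 GB while holding 40 GB of
    // reservations with an 8 GB max allocation: 90 + 8 - 8 <= 96, so continue.
    System.out.println(mayContinueAssigning(92160, 8192, 40960, 8192, 98304));
  }
}
{code}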
[jira] [Commented] (YARN-3465) use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl
[ https://issues.apache.org/jira/browse/YARN-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486776#comment-14486776 ] Hadoop QA commented on YARN-3465: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12724015/YARN-3465.000.patch against trunk revision b1e0590. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7266//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7266//console This message is automatically generated. use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl Key: YARN-3465 URL: https://issues.apache.org/jira/browse/YARN-3465 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-3465.000.patch use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
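For readers unfamiliar with the distinction the patch relies on: LinkedHashMap iterates in insertion order, while HashMap makes no ordering guarantee. A standalone demo follows; the keys are placeholders, not the actual LocalResourceRequest entries in ContainerImpl.
{code}
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderDemo {
  public static void main(String[] args) {
    Map<String, String> ordered = new LinkedHashMap<String, String>();
    Map<String, String> unordered = new HashMap<String, String>();
    for (String key : new String[] {"job.jar", "job.xml", "dist.cache/archive"}) {
      ordered.put(key, "resource-" + key);
      unordered.put(key, "resource-" + key);
    }
    System.out.println(ordered.keySet());   // always [job.jar, job.xml, dist.cache/archive]
    System.out.println(unordered.keySet()); // iteration order is unspecified
  }
}
{code}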
[jira] [Commented] (YARN-2801) Documentation development for Node labels requirment
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486799#comment-14486799 ] Naganarasimha G R commented on YARN-2801: - Hi [~Wangda] [~ozawa], I can help if anything is required for this. Documentation development for Node labels requirment Key: YARN-2801 URL: https://issues.apache.org/jira/browse/YARN-2801 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gururaj Shetty Assignee: Wangda Tan Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487898#comment-14487898 ] Junping Du commented on YARN-3391: -- Synced offline with Vinod; he is fine with the latest patch. I will go ahead and commit it soon. Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch, YARN-3391.2.patch, YARN-3391.3.patch, YARN-3391.4.patch, YARN-3391.5.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Should the flow run id be a number as opposed to a generic string? - What is the default behavior for the flow run id if it is missing (i.e. the client did not set it)? - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3465) Use LinkedHashMap to preserve order of resource requests
[ https://issues.apache.org/jira/browse/YARN-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487902#comment-14487902 ] Hudson commented on YARN-3465: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2108 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2108/]) YARN-3465. Use LinkedHashMap to preserve order of resource requests. (Zhihai Xu via kasha) (kasha: rev 6495940eae09418a939882a8955845f9241a6485) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java * hadoop-yarn-project/CHANGES.txt Use LinkedHashMap to preserve order of resource requests Key: YARN-3465 URL: https://issues.apache.org/jira/browse/YARN-3465 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3465.000.patch use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3466) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487738#comment-14487738 ] Hudson commented on YARN-3466: -- FAILURE: Integrated in Hadoop-trunk-Commit #7547 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7547/]) YARN-3466. Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column. (Jason Lowe via wangda) (wangda: rev 1885141e90837252934192040a40047c7adbc1b5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/NodesPage.java * hadoop-yarn-project/CHANGES.txt Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column - Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address, container count and node label column on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487047#comment-14487047 ] Xuan Gong commented on YARN-3301: - Test failures are not related Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3301.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487781#comment-14487781 ] Hadoop QA commented on YARN-3055: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12724256/YARN-3055.patch against trunk revision 6495940. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7276//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7276//console This message is automatically generated. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, the existing submitted applications which share this token will not get renewed any more, and for new submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3465) Use LinkedHashMap to preserve order of resource requests
[ https://issues.apache.org/jira/browse/YARN-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487120#comment-14487120 ] Hudson commented on YARN-3465: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #158 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/158/]) YARN-3465. Use LinkedHashMap to preserve order of resource requests. (Zhihai Xu via kasha) (kasha: rev 6495940eae09418a939882a8955845f9241a6485) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java Use LinkedHashMap to preserve order of resource requests Key: YARN-3465 URL: https://issues.apache.org/jira/browse/YARN-3465 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3465.000.patch use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3469) Do not set watch for most cases in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-3469: --- Attachment: YARN-3469.01.patch Do not set watch for most cases in ZKRMStateStore - Key: YARN-3469 URL: https://issues.apache.org/jira/browse/YARN-3469 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Priority: Minor Attachments: YARN-3469.01.patch In ZKRMStateStore, most operations (e.g. getDataWithRetries) set watches on znodes. Large numbers of watches will cause problems such as [ZOOKEEPER-706: large numbers of watches can cause session re-establishment to fail|https://issues.apache.org/jira/browse/ZOOKEEPER-706]. Although there is a workaround of setting jute.maxbuffer to a larger value, we would need to adjust this value again once more apps and attempts are stored in ZK. And those watches are useless now. It might be better not to set watches at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
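The gist of the proposed change, sketched against the plain ZooKeeper client API (this is not the ZKRMStateStore code itself): read znodes with watch=false so no watcher is registered per read. Method and path names here are illustrative.
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class NoWatchReadExample {
  static byte[] readAppState(ZooKeeper zk, String znodePath)
      throws KeeperException, InterruptedException {
    Stat stat = new Stat();
    // watch=false: no watcher is registered, so session re-establishment
    // does not have to replay thousands of watches (see ZOOKEEPER-706).
    return zk.getData(znodePath, false, stat);
  }
}
{code}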
[jira] [Commented] (YARN-3459) Fix failiure of TestLog4jWarningErrorMetricsAppender
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487903#comment-14487903 ] Hudson commented on YARN-3459: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2108 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2108/]) YARN-3459. Fix failiure of TestLog4jWarningErrorMetricsAppender. (Varun Vasudev via wangda) (wangda: rev 7af086a515d573dc90ea4deec7f4e3f23622e0e8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLog4jWarningErrorMetricsAppender.java * hadoop-yarn-project/CHANGES.txt Fix failiure of TestLog4jWarningErrorMetricsAppender Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Varun Vasudev Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message: {code} Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender) Time elapsed: 2.01 sec FAILURE! java.lang.AssertionError: expected:0 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487909#comment-14487909 ] Hudson commented on YARN-2901: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2108 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2108/]) YARN-2901 addendum: Fixed findbugs warning caused by previously patch (wangda: rev ba9ee22ca4ed2c5ff447b66b2e2dfe25f6880fe0) * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n'(20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender?(I'm open to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3136: -- Attachment: 00010-YARN-3136.patch Yes [~jianhe], I added that to fix a findbugs warning, and it is not needed. I have updated the patch as per the initial understanding. Kindly check. getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
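One hypothetical way to remove the scheduler lock from this path, sketched only to illustrate the direction discussed in the description (it is not what any of the attached patches necessarily do): keep transferred containers in a concurrent map keyed by attempt id so AM registration reads never contend with scheduling. All names below are placeholders.
{code}
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;

// C stands in for the RM's Container type.
public class TransferredContainerStore<C> {
  private final ConcurrentMap<String, List<C>> byAttemptId =
      new ConcurrentHashMap<String, List<C>>();

  public void recordTransferred(String attemptId, C container) {
    List<C> list = byAttemptId.get(attemptId);
    if (list == null) {
      List<C> fresh = new CopyOnWriteArrayList<C>();
      List<C> existing = byAttemptId.putIfAbsent(attemptId, fresh);
      list = (existing == null) ? fresh : existing;
    }
    list.add(container);
  }

  // Lock-free read path that AM registration could use.
  public List<C> getTransferredContainers(String attemptId) {
    List<C> list = byAttemptId.get(attemptId);
    return list == null ? Collections.<C>emptyList()
                        : Collections.unmodifiableList(list);
  }
}
{code}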
[jira] [Commented] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488206#comment-14488206 ] Hadoop QA commented on YARN-3348: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12724296/apache-yarn-3348.5.patch against trunk revision 61dc2ea. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7279//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7279//console This message is automatically generated. Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch, apache-yarn-3348.3.patch, apache-yarn-3348.4.patch, apache-yarn-3348.5.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487871#comment-14487871 ] Vinod Kumar Vavilapalli commented on YARN-3055: --- You are right that the previous code also had the same issue. I am good with the patch. Will check this in unless jenkins or [~jianhe] say no. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Daryn Sharp Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, and we should not remove it from {{allTokens}} either. Otherwise, the existing submitted applications which share this token will not get renewed any more, and for new submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3, and they share token1. See the following scenario: *1).* app1 is submitted first, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
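A hypothetical sketch of the invariant under discussion, independent of the actual DelegationTokenRenewer code: track which applications reference each shared token and only cancel the renewal timer when the last reference is removed. Class, method, and timer-wiring names are made up for the example.
{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// T stands in for the delegation token type.
public class SharedTokenTracker<T> {
  private final Map<T, Set<String>> appsPerToken = new HashMap<T, Set<String>>();

  public synchronized void applicationAdded(T token, String appId) {
    Set<String> apps = appsPerToken.get(token);
    if (apps == null) {
      apps = new HashSet<String>();
      appsPerToken.put(token, apps);
      scheduleRenewal(token);          // first reference: start the renewal timer
    }
    apps.add(appId);
  }

  public synchronized void applicationFinished(T token, String appId) {
    Set<String> apps = appsPerToken.get(token);
    if (apps == null) {
      return;
    }
    apps.remove(appId);
    if (apps.isEmpty()) {
      appsPerToken.remove(token);
      cancelRenewal(token);            // last reference gone: stop the timer
    }
  }

  // Placeholders for the real timer wiring.
  private void scheduleRenewal(T token) { /* schedule a renewal TimerTask */ }
  private void cancelRenewal(T token)   { /* cancel the token's TimerTask */ }
}
{code}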
[jira] [Updated] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3348: Attachment: apache-yarn-3348.3.patch bq. In YarnClusterMetricsPBImpl, should the default num*NodeManagers return 0 ? Fixed. bq. getApplications in YarnClient.java may be an abstract method? It's a public class. Adding an abstract method will break compatibility. bq. “Queue Applications:” - if it’s aggregated number , maybe Queue(s) ? Fixed. Some other changes: # Based on an offline conversation with Jian, I've moved the app reports cache into TopCLI itself for now. # Improved help # The queue memory statistics are in GB instead of MB. Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch, apache-yarn-3348.3.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)