[jira] [Commented] (YARN-2605) [RM HA] Rest api endpoints doing redirect incorrectly
[ https://issues.apache.org/jira/browse/YARN-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498937#comment-14498937 ] Xuan Gong commented on YARN-2605: - [~adhoot] Feel free to assign back to yourself, if you have already worked on this ticket. [~steve_l] Please take a look. Removed the refresh header, added the Location header to hold the redirect URL, and set the status to 307. {code} $ curl -i http://127.0.0.1:33188/ws/v1/cluster/metrics HTTP/1.1 307 TEMPORARY_REDIRECT Cache-Control: no-cache Expires: Thu, 16 Apr 2015 23:01:47 GMT Date: Thu, 16 Apr 2015 23:01:47 GMT Pragma: no-cache Expires: Thu, 16 Apr 2015 23:01:47 GMT Date: Thu, 16 Apr 2015 23:01:47 GMT Pragma: no-cache Content-Type: text/plain; charset=UTF-8 Location: http://localhost:23188/ws/v1/cluster/metrics Content-Length: 84 Server: Jetty(6.1.26) This is standby RM. The redirect url ishttp://localhost:23188/ws/v1/cluster/metrics {code} If I do {code} $ curl -i -L http://127.0.0.1:33188/ws/v1/cluster/metrics {code} it will redirect to the active RM and get the metrics. [RM HA] Rest api endpoints doing redirect incorrectly - Key: YARN-2605 URL: https://issues.apache.org/jira/browse/YARN-2605 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: bc Wong Assignee: Anubhav Dhoot Labels: newbie Attachments: YARN-2605.1.patch The standby RM's webui tries to do a redirect via meta-refresh. That is fine for pages designed to be viewed by web browsers. But the API endpoints shouldn't do that. Most programmatic HTTP clients do not do meta-refresh. I'd suggest HTTP 303, or return a well-defined error message (json or xml) stating the standby status and a link to the active RM. The standby RM is returning this today: {noformat} $ curl -i http://bcsec-1.ent.cloudera.com:8088/ws/v1/cluster/metrics HTTP/1.1 200 OK Cache-Control: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Content-Type: text/plain; charset=UTF-8 Refresh: 3; url=http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics Content-Length: 117 Server: Jetty(6.1.26) This is standby RM. Redirecting to the current active RM: http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
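For orientation, a minimal sketch of a standby-side filter that answers REST calls with an explicit 307 plus a Location header, rather than a meta-refresh, might look like the following. The class name and the way the active RM address is obtained are illustrative assumptions; this is not the attached YARN-2605.1.patch.
{code}
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative only: redirect REST calls hitting a standby RM to the active RM.
public class StandbyApiRedirectFilter implements Filter {

  // Assumed to be resolved from configuration, e.g. "http://localhost:23188".
  private final String activeRMWebAddress;

  public StandbyApiRedirectFilter(String activeRMWebAddress) {
    this.activeRMWebAddress = activeRMWebAddress;
  }

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    HttpServletResponse response = (HttpServletResponse) res;
    String redirect = activeRMWebAddress + request.getRequestURI();
    // 307 preserves the HTTP method; curl -L and most programmatic clients follow it.
    response.setStatus(HttpServletResponse.SC_TEMPORARY_REDIRECT);
    response.setHeader("Location", redirect);
    // Keep a human-readable body; REST clients will ignore it.
    response.getWriter().println("This is standby RM. The redirect url is " + redirect);
    // Do not continue the filter chain: the standby answers the request itself.
  }

  @Override
  public void init(FilterConfig filterConfig) throws ServletException {
  }

  @Override
  public void destroy() {
  }
}
{code}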
[jira] [Commented] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499204#comment-14499204 ] Craig Welch commented on YARN-3463: --- bq. how about change it to Map<String, String> to explicitly pass option_key=value pairs to configure OrderingPolicy Signature changed; will add configuration to pass in sizeBasedWeight as part of the FairOrderingPolicy patch, as that's where it would belong... bq. you can suppress them to avoid javac warning Suppressed. Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch, YARN-3463.67.patch, YARN-3463.68.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3499) Optimize ResourceManager Web loading speed
Peter Shi created YARN-3499: --- Summary: Optimize ResourceManager Web loading speed Key: YARN-3499 URL: https://issues.apache.org/jira/browse/YARN-3499 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Peter Shi Priority: Minor After running 10k jobs, the ResourceManager web UI becomes slow to load. As the server side sends information for all 10k jobs in one response, parsing and rendering the page takes a long time. The current paging logic is done on the browser side. This issue makes the server side do the paging logic, so that loading will be fast. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3501) problem in running yarn scheduler load simulator
[ https://issues.apache.org/jira/browse/YARN-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3501: --- Fix Version/s: (was: 2.3.0) problem in running yarn scheduler load simulator Key: YARN-3501 URL: https://issues.apache.org/jira/browse/YARN-3501 Project: Hadoop YARN Issue Type: Test Components: scheduler-load-simulator Affects Versions: 2.6.0 Environment: ubuntu Reporter: Awadhesh kumar shukla -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3501) problem in running yarn scheduler load simulator
[ https://issues.apache.org/jira/browse/YARN-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3501: --- Target Version/s: (was: 2.6.0) problem in running yarn scheduler load simulator Key: YARN-3501 URL: https://issues.apache.org/jira/browse/YARN-3501 Project: Hadoop YARN Issue Type: Test Components: scheduler-load-simulator Affects Versions: 2.6.0 Environment: ubuntu Reporter: Awadhesh kumar shukla -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499247#comment-14499247 ] Zhijie Shen commented on YARN-3437: --- Per my comment on [YARN-3390 | https://issues.apache.org/jira/browse/YARN-3390?focusedCommentId=14499245page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14499245]. Please feel free to move forward. convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3501) problem in running yarn scheduler load simulator
[ https://issues.apache.org/jira/browse/YARN-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499375#comment-14499375 ] Naganarasimha G R commented on YARN-3501: - Can some more information be given on this JIRA? problem in running yarn scheduler load simulator Key: YARN-3501 URL: https://issues.apache.org/jira/browse/YARN-3501 Project: Hadoop YARN Issue Type: Test Components: scheduler-load-simulator Affects Versions: 2.6.0 Environment: ubuntu Reporter: Awadhesh kumar shukla -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499285#comment-14499285 ] Hadoop QA commented on YARN-3463: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726059/YARN-3463.68.patch against trunk revision bb6dde6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7372//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7372//console This message is automatically generated. Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch, YARN-3463.67.patch, YARN-3463.68.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499326#comment-14499326 ] Li Lu commented on YARN-3134: - Hi [~djp] and [~zjshen], thanks a lot for the review! I'll fix them pretty soon and upload a new patch. For now, I'm focusing on correctness, readability, and exception handling. Does that plan sound good to you? Thanks! [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simply our implementation read/write data from/to HBase, and can easily build index and compose complex query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
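As background for the quoted Phoenix description, here is a minimal sketch of what reading and writing through the client-embedded JDBC driver looks like; the table and column names are made up for illustration and are not the schema in YARN-3134DataSchema.pdf.
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PhoenixSketch {
  public static void main(String[] args) throws SQLException {
    // The Phoenix JDBC URL points at the HBase ZooKeeper quorum.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
      conn.createStatement().execute(
          "CREATE TABLE IF NOT EXISTS ENTITY (ID VARCHAR PRIMARY KEY, CREATED BIGINT)");
      try (PreparedStatement ps =
          conn.prepareStatement("UPSERT INTO ENTITY (ID, CREATED) VALUES (?, ?)")) {
        ps.setString(1, "application_0001");
        ps.setLong(2, System.currentTimeMillis());
        ps.executeUpdate();
      }
      conn.commit(); // Phoenix batches mutations until the connection is committed.
      try (ResultSet rs =
          conn.createStatement().executeQuery("SELECT ID, CREATED FROM ENTITY")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + " created at " + rs.getLong(2));
        }
      }
    }
  }
}
{code}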
[jira] [Resolved] (YARN-3499) Optimize ResourceManager Web loading speed
[ https://issues.apache.org/jira/browse/YARN-3499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Shi resolved YARN-3499. - Resolution: Duplicate Duplicate of YARN-3500. Optimize ResourceManager Web loading speed -- Key: YARN-3499 URL: https://issues.apache.org/jira/browse/YARN-3499 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Peter Shi Priority: Minor After running 10k jobs, the ResourceManager web UI becomes slow to load. As the server side sends information for all 10k jobs in one response, parsing and rendering the page takes a long time. The current paging logic is done on the browser side. This issue makes the server side do the paging logic, so that loading will be fast. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499314#comment-14499314 ] Zhijie Shen commented on YARN-3431: --- Right, TimelineEntity is the generic Java form for us to compose a timeline entity in java code, while its corresponding JSON object is the payload during REST communication. Subclasses of TimelineEntity are defined to facilitate us/users to easily manipulate some predefined, specific attributes. bq. My main problem is with the prototype field of TimelineEntity. Maybe I should change prototype to real. After receiving the entity from the endpoint of the web server, not matter it was the generic TimelineEntity or the subclass object, it will be deserialized as TimelineEntity object. If it was the subclass object, the content is preserved, but the Java class hierarchy is lost after deserialization. However, we can use TimelineEntity and its type to construct the right subclass object in a *proxy* way. bq. For HierarchicalTimelineEntity, seems like we're not adding any special tags when we addIsRelatedToEntity() in setParent() Yeah, relates to/ is related to is used to construct a directed graph among entities. Parent-child relationship is a tree, which can be described by relates to/ is related. bq. Are we prohibiting the users from using isRelatedToEntities in HierarchicalTimelineEntity completely to avoid problems? Sounds good. I used to think about it, but not include it in this patch. bq. , I'm not sure if we really need the subclass information. I'm not pretty sure, but I guess we may probably not need the subclasses' Java APIs, and that's why I put a comment there. However, since it's not a big overhead given the way we construct the subclass object, I prefer to leave the code there, in case we want subclass APIs somewhere (e.g., aggregation). There're two additional bugs in this patch. I'll fix the outstanding issues and upload a new one later. Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch We have TimelineEntity and some other entities as subclass that inherit from it. However, we only have a single endpoint, which consume TimelineEntity rather than sub-classes and this endpoint will check the incoming request body contains exactly TimelineEntity object. However, the json data which is serialized from sub-class object seems not to be treated as an TimelineEntity object, and won't be deserialized into the corresponding sub-class object which cause deserialization failure as some discussions in YARN-3334 : https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
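To make the proxy-style construction described above concrete, here is a hedged, heavily simplified sketch; the class and method names are illustrative and not the timeline service v.2 API. The subclass view delegates to the generic entity that was deserialized from JSON instead of deep-copying it.
{code}
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the generic entity that comes off the wire.
class GenericTimelineEntity {
  private String type;
  private final Map<String, Object> info = new HashMap<>();

  public String getType() { return type; }
  public void setType(String type) { this.type = type; }
  public Map<String, Object> getInfo() { return info; }
  public void addInfo(String key, Object value) { info.put(key, value); }
}

// Illustrative subclass "view": it wraps the deserialized entity (the
// "prototype"/"real" object in the discussion) rather than copying its content.
class ApplicationEntityView extends GenericTimelineEntity {
  private final GenericTimelineEntity real;

  ApplicationEntityView(GenericTimelineEntity real) { this.real = real; }

  @Override public String getType() { return real.getType(); }
  @Override public Map<String, Object> getInfo() { return real.getInfo(); }

  // Subclass-specific convenience accessor, assuming a hypothetical "QUEUE" info key.
  public String getQueue() { return (String) real.getInfo().get("QUEUE"); }
}
{code}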
[jira] [Commented] (YARN-3493) RM fails to come up with error Failed to load/recover state when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499288#comment-14499288 ] Hadoop QA commented on YARN-3493: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726049/YARN-3493.3.patch against trunk revision bb6dde6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestWorkPreservingRMRestartForNodeLabel Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7373//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7373//console This message is automatically generated. RM fails to come up with error Failed to load/recover state when mem settings are changed Key: YARN-3493 URL: https://issues.apache.org/jira/browse/YARN-3493 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Sumana Sathish Assignee: Jian He Priority: Critical Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, yarn-yarn-resourcemanager.log.zip RM fails to come up for the following case: 1. Change yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in background and wait for the job to reach running state 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 before the above job completes 4. Restart RM 5. 
RM fails to come up with the below error {code:title= RM error for Mem settings changed} - RM app submission failed in validating AM resource request for application application_1429094976272_0008 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory 0, or requested memory max configured, requestedMemory=3072, maxMemory=2048 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031) at
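For reference, a minimal self-contained sketch of the check that trips during recovery, using the numbers from the reproduction steps above; the method is an illustration of the validation, not SchedulerUtils itself.
{code}
public class MaxAllocationCheck {

  // Mirrors the kind of validation applied to each recovered resource request.
  static void validate(int requestedMemory, int maxMemory) {
    if (requestedMemory < 0 || requestedMemory > maxMemory) {
      throw new IllegalArgumentException(
          "Invalid resource request, requested memory < 0, or requested memory > "
              + "max configured, requestedMemory=" + requestedMemory
              + ", maxMemory=" + maxMemory);
    }
  }

  public static void main(String[] args) {
    validate(3072, 4000); // passes while yarn.scheduler.maximum-allocation-mb is 4000
    validate(3072, 2048); // throws after the limit is restored to 2048, aborting recovery
  }
}
{code}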
[jira] [Created] (YARN-3501) problem in running yarn scheduler load simulator
Awadhesh kumar shukla created YARN-3501: --- Summary: problem in running yarn scheduler load simulator Key: YARN-3501 URL: https://issues.apache.org/jira/browse/YARN-3501 Project: Hadoop YARN Issue Type: Test Components: scheduler-load-simulator Affects Versions: 2.6.0 Environment: ubuntu Reporter: Awadhesh kumar shukla Fix For: 2.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily
[ https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499393#comment-14499393 ] Sunil G commented on YARN-3487: --- Hi [~leftnoteasy], I am sorry for providing less context earlier. After seeing your comment again, I could see that my comment was also along the same lines. Runtime updates can add or change some ACLs for a Queue. So if the synchronized keyword is removed, checkAccess is open and some checks may pass/fail based on the partial information available for the Queue's ACLs. So we may run into partial errors, which is a race condition. CapacityScheduler scheduler lock obtained unnecessarily --- Key: YARN-3487 URL: https://issues.apache.org/jira/browse/YARN-3487 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-3487.001.patch, YARN-3487.002.patch Recently saw a significant slowdown of applications on a large cluster, and we noticed there were a large number of blocked threads on the RM. Most of the blocked threads were waiting for the CapacityScheduler lock while calling getQueueInfo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
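To illustrate the concern, here is a hedged sketch of the usual way checkAccess can be made safe without holding the scheduler lock; this is not the YARN-3487 patch, just the pattern: the ACLs are published as an immutable snapshot behind a volatile reference, so a concurrent queue refresh never exposes partially updated state.
{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class QueueAclSnapshot {
  // Volatile reference to an immutable snapshot; replaced wholesale on refresh.
  private volatile Map<String, String> acls = Collections.emptyMap();

  // Called from the (still synchronized) queue refresh path.
  void refresh(Map<String, String> newAcls) {
    acls = Collections.unmodifiableMap(new HashMap<>(newAcls));
  }

  // Lock-free read path, analogous to an unsynchronized checkAccess.
  boolean checkAccess(String user, String operation) {
    String allowedUsers = acls.get(operation);
    return allowedUsers != null && allowedUsers.contains(user);
  }
}
{code}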
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499264#comment-14499264 ] Zhijie Shen commented on YARN-3134: --- Some thoughts about backend POC, not just limited to Phoenix writer, but HBase writer too. 1. At the current stage, I suggest we focus on logic correctness and performance tuning. We may have multiple iterations between improving and doing benchmark. 2. At the beginning, we may not implement storing everything of timeline entity (such as relationship), but we should at lease make sure what Phoenix writer and HBase writer have implemented are identical in terms of the data to store. 3. It's good if we can have rich test suites like TimelineStoreTestUtils to ensure the robustness of the writer. Moreover, it's black box testing, and we can use them to check if Phoenix writer and HBase writer behave the same. /cc [~vrushalic] For Phoenix implementation only: I used Phoenix writer for a real deployment, and I could see the implementation is not thread safe. ConcurrentModificatioException will be thrown upon committing the statements. [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simply our implementation read/write data from/to HBase, and can easily build index and compose complex query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
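On the thread-safety observation, one hedged possibility (an assumption about how it could be addressed, not the actual follow-up patch) is to stop sharing a single Phoenix Connection across writer threads, since the ConcurrentModificationException points at concurrent mutation of per-connection statement state.
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Illustrative helper: each writer thread gets its own Phoenix connection and
// commits on it independently, so no statement list is shared across threads.
class PerThreadPhoenixConnection {
  private final String url;
  private final ThreadLocal<Connection> conn = new ThreadLocal<>();

  PerThreadPhoenixConnection(String url) {
    this.url = url;
  }

  Connection get() throws SQLException {
    Connection c = conn.get();
    if (c == null || c.isClosed()) {
      c = DriverManager.getConnection(url);
      conn.set(c);
    }
    return c;
  }
}
{code}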
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499399#comment-14499399 ] Sunil G commented on YARN-1963: --- Yes. We could try to support both Integer and Label (with mappings). We may open independent JIRAs to handle this case (we have both patches and will sync them up as one), which should achieve the same goal without complexity. And we will look for a simpler version for now, not complex re-mappings etc. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: 0001-YARN-1963-prototype.patch, YARN Application Priorities Design.pdf, YARN Application Priorities Design_01.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3463: -- Attachment: YARN-3463.68.patch Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch, YARN-3463.67.patch, YARN-3463.68.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499205#comment-14499205 ] Craig Welch commented on YARN-3463: --- btw, the tests pass on my box with the change, failures not related to the patch Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch, YARN-3463.67.patch, YARN-3463.68.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3390) Reuse TimelineCollectorManager for RM
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499245#comment-14499245 ] Zhijie Shen commented on YARN-3390: --- The only conflict part between YARN-3437 and this Jira is TimelineCollectorManager base class. And we happened to resort to the similar refactoring method. I'm okay to commit YARN-3437 first. However, the comments about the base TimelineCollectorManager also apply. At least, I think we should use ApplicationId instead of String. Reuse TimelineCollectorManager for RM - Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3390.1.patch RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3500) Optimize ResourceManager Web loading speed
[ https://issues.apache.org/jira/browse/YARN-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499297#comment-14499297 ] Peter Shi commented on YARN-3500: - Yes, I have marked YARN-3499 as a duplicate. Optimize ResourceManager Web loading speed -- Key: YARN-3500 URL: https://issues.apache.org/jira/browse/YARN-3500 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Peter Shi Priority: Minor After running 10k jobs, the ResourceManager web UI becomes slow to load. As the server side sends information for all 10k jobs in one response, parsing and rendering the page takes a long time. The current paging logic is done on the browser side. This issue makes the server side do the paging logic, so that loading will be fast. Loading 10k jobs costs 55 sec; loading 2k costs 7 sec. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
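A tiny hedged sketch of the server-side paging idea (the helper and its parameters are illustrative, not the eventual patch): the web layer renders only the requested window of application rows instead of serializing all 10k reports into one response.
{code}
import java.util.List;

class AppPaging {
  // Returns the slice of rows for the given zero-based page.
  static <T> List<T> page(List<T> allApps, int pageIndex, int pageSize) {
    int from = Math.min(pageIndex * pageSize, allApps.size());
    int to = Math.min(from + pageSize, allApps.size());
    return allApps.subList(from, to);
  }
}
{code}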
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499398#comment-14499398 ] Li Lu commented on YARN-3134: - Hi [~zjshen] could you please provide some more information to reproduce the failures? Or, the exception stack would also be helpful. I'm trying to setup a deployment but would like to make sure we're seeing consistent problems. Thanks! [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simply our implementation read/write data from/to HBase, and can easily build index and compose complex query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499413#comment-14499413 ] Li Lu commented on YARN-3431: - bq. After receiving the entity from the endpoint of the web server, not matter it was the generic TimelineEntity or the subclass object, it will be deserialized as TimelineEntity object. If it was the subclass object, the content is preserved, but the Java class hierarchy is lost after deserialization. However, we can use TimelineEntity and its type to construct the right subclass object in a proxy way. OK, I agree the current design would save us one deep copy every time we receive a timeline entity. I'm still thinking about an appropriate name for the prototype field to better represent its nature... bq. For HierarchicalTimelineEntity, seems like we're not adding any special tags when we addIsRelatedToEntity() in setParent() bq. Yeah, relates to/ is related to is used to construct a directed graph among entities. Parent-child relationship is a tree, which can be described by relates to/ is related. bq. Are we prohibiting the users from using isRelatedToEntities in HierarchicalTimelineEntity completely to avoid problems? bq. Sounds good. I used to think about it, but not include it in this patch. That sounds good. It would be very helpful to explicitly prohibit direct usages of isRelatedToEntities and relatesToEntities IMHO. Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch We have TimelineEntity and some other entities as subclass that inherit from it. However, we only have a single endpoint, which consume TimelineEntity rather than sub-classes and this endpoint will check the incoming request body contains exactly TimelineEntity object. However, the json data which is serialized from sub-class object seems not to be treated as an TimelineEntity object, and won't be deserialized into the corresponding sub-class object which cause deserialization failure as some discussions in YARN-3334 : https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2605) [RM HA] Rest api endpoints doing redirect incorrectly
[ https://issues.apache.org/jira/browse/YARN-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499435#comment-14499435 ] Steve Loughran commented on YARN-2605: -- patch looks good in production; 307 is the error code rest apps need; these will ignore the text so that can stay human-readable. Why is a test now tagged as @ignore? [RM HA] Rest api endpoints doing redirect incorrectly - Key: YARN-2605 URL: https://issues.apache.org/jira/browse/YARN-2605 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: bc Wong Assignee: Xuan Gong Labels: newbie Attachments: YARN-2605.1.patch The standby RM's webui tries to do a redirect via meta-refresh. That is fine for pages designed to be viewed by web browsers. But the API endpoints shouldn't do that. Most programmatic HTTP clients do not do meta-refresh. I'd suggest HTTP 303, or return a well-defined error message (json or xml) stating that the standby status and a link to the active RM. The standby RM is returning this today: {noformat} $ curl -i http://bcsec-1.ent.cloudera.com:8088/ws/v1/cluster/metrics HTTP/1.1 200 OK Cache-Control: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Content-Type: text/plain; charset=UTF-8 Refresh: 3; url=http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics Content-Length: 117 Server: Jetty(6.1.26) This is standby RM. Redirecting to the current active RM: http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3503) Expose disk utilization percentage on NM via JMX
Varun Vasudev created YARN-3503: --- Summary: Expose disk utilization percentage on NM via JMX Key: YARN-3503 URL: https://issues.apache.org/jira/browse/YARN-3503 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Varun Vasudev Assignee: Varun Vasudev -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3503) Expose disk utilization percentage on NM via JMX
[ https://issues.apache.org/jira/browse/YARN-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3503: Description: It would be useful to expose the disk utilization on the NMs via JMX so that alerts can be setup for nodes. Expose disk utilization percentage on NM via JMX Key: YARN-3503 URL: https://issues.apache.org/jira/browse/YARN-3503 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Varun Vasudev Assignee: Varun Vasudev It would be useful to expose the disk utilization on the NMs via JMX so that alerts can be setup for nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
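A hedged sketch of how such a value is typically surfaced through the Hadoop metrics2 system, which exposes registered sources over JMX; the class, metric name, and wiring below are illustrative assumptions rather than the eventual YARN-3503 change.
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableGaugeInt;

@Metrics(about = "NodeManager disk metrics", context = "yarn")
public class NodeDiskMetrics {

  @Metric("Disk utilization percentage across good local dirs")
  MutableGaugeInt goodLocalDirsDiskUtilizationPerc;

  public static NodeDiskMetrics create() {
    // Registering the source makes the gauge visible via JMX and other sinks.
    return DefaultMetricsSystem.instance().register(new NodeDiskMetrics());
  }

  public void setDiskUtilization(int percentage) {
    goodLocalDirsDiskUtilizationPerc.set(percentage);
  }
}
{code}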
[jira] [Commented] (YARN-3496) Add a configuration to disable/enable storing localization state in NMLeveldbStateStore
[ https://issues.apache.org/jira/browse/YARN-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499537#comment-14499537 ] zhihai xu commented on YARN-3496: - Hi [~jlowe], you are right. Based on my profiling at YARN-3491, the levelDB overhead is minor. Thanks. Add a configuration to disable/enable storing localization state in NMLeveldbStateStore --- Key: YARN-3496 URL: https://issues.apache.org/jira/browse/YARN-3496 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Add a configuration to disable/enable storing localization state in NMLeveldbStateStore. Storing localization state in the levelDB may have some overhead, which may affect NM performance. It would be better to have a configuration to disable/enable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Description: Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. It will cause public resource localization is serialized most of the time. was: Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently FSDownload submission to the thread pool is done in PublicLocalizer#addResource which is running in Dispatcher thread and completed localization handling is done in PublicLocalizer#run which is running in PublicLocalizer thread. Because PublicLocalizer#addResource is time consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. Also there are two more benefits with this change: 1. The Dispatcher thread won't be blocked by PublicLocalizer#addResource . Dispatcher thread handles most of time critical events at Node manager. 2. don't need synchronization on HashMap (pending). Because pending will be only accessed in PublicLocalizer thread. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. It will cause public resource localization is serialized most of the time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499401#comment-14499401 ] Sunil G commented on YARN-2003: --- Hi [~leftnoteasy] bq. authenticateApplicationPriority, I'm wondering if we really need it. I understand your point. But we may fire a new APP_ADDED event from RMAppManager to the respective scheduler with an unapproved priority, and then reject it from there. If we can do this much earlier with the help of a single API, it may avoid some extra event handling in the case of a wrong (invalid priority) app submission. Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side] -- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from the Submission Context and store it. Later this can be used by the Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
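A hedged sketch of the early, single-API validation being argued for; the names are illustrative, not the actual patch. The requested priority is checked against the queue's limit at submission time, before any APP_ADDED event is fired.
{code}
class PriorityCheck {
  // Returns the priority to use, or throws before any scheduler event is fired.
  static int checkAndGet(int requested, int queueMax, int queueDefault) {
    if (requested < 0) {
      return queueDefault; // unspecified priority: fall back to the queue default
    }
    if (requested > queueMax) {
      throw new IllegalArgumentException(
          "Requested priority " + requested + " exceeds queue maximum " + queueMax);
    }
    return requested;
  }
}
{code}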
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499534#comment-14499534 ] zhihai xu commented on YARN-3491: - Hi [~jlowe], You are right, I am really sorry all my previous guesses are wrong. I did the profiling and I find out the bottleneck is at the following code {code} getInitializedLocalDirs(); getInitializedLogDirs(); {code} More accurately the bottleneck is at checkLocalDir which call getFileStatus. I did two round profiling: 1.I measure the time in PublicLocalizer#addResource: the following code include levelDB operation take 1 ms. {code} Path publicRootPath = dirsHandler.getLocalPathForWrite(. + Path.SEPARATOR + ContainerLocalizer.FILECACHE, ContainerLocalizer.getEstimatedSize(resource), true); Path publicDirDestPath = publicRsrc.getPathForLocalization(key, publicRootPath); if (!publicDirDestPath.getParent().equals(publicRootPath)) { DiskChecker.checkDir(new File(publicDirDestPath.toUri().getPath())); } {code} getInitializedLocalDirs and getInitializedLogDirs take 12 ms together And the following queue.submit code take less than 1 ms. {code} synchronized (pending) { pending.put(queue.submit(new FSDownload(lfs, null, conf, publicDirDestPath, resource, request.getContext().getStatCache())), request); } {code} 2. then I measure the time in getInitializedLocalDirs and getInitializedLogDirs. I find out checkLocalDir is really slow which is called by getInitializedLocalDirs. checkLocalDir takes 14 ms. There is only one local Dir in my test environment. {code} synchronized private ListString getInitializedLocalDirs() { ListString dirs = dirsHandler.getLocalDirs(); ListString checkFailedDirs = new ArrayListString(); for (String dir : dirs) { try { checkLocalDir(dir); } catch (YarnRuntimeException e) { checkFailedDirs.add(dir); } } {code} The log in my previous comment has more than 10 local Dirs, which will call checkLocalDir more than 10 times 10 * 14 is about 100+ms, So I find out where the 100+ms delay come from. I attached a patch YARN-3491.000.patch to fix the issue, The patch will call getInitializedLocalDirs only once for each container. The original code will call getInitializedLocalDirs for each public resource. Each container can have hundreds of public resource, which is the situation in my previous log. [~jlowe], Could you review it? thanks PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently FSDownload submission to the thread pool is done in PublicLocalizer#addResource which is running in Dispatcher thread and completed localization handling is done in PublicLocalizer#run which is running in PublicLocalizer thread. Because PublicLocalizer#addResource is time consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. Also there are two more benefits with this change: 1. The Dispatcher thread won't be blocked by PublicLocalizer#addResource . Dispatcher thread handles most of time critical events at Node manager. 2. don't need synchronization on HashMap (pending). 
Because pending will be only accessed in PublicLocalizer thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
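A hedged sketch of the idea behind the attached fix as described above (illustrative, not the literal YARN-3491.000.patch): remember which local dirs have already passed the slow checkLocalDir-style verification so the roughly 10 ms check runs once per dir rather than once per public resource.
{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class CheckedLocalDirs {
  // Wraps the expensive per-dir verification (getFileStatus etc.).
  interface DirChecker {
    void check(String dir);
  }

  private final Set<String> verifiedDirs = ConcurrentHashMap.newKeySet();

  void ensureInitialized(Iterable<String> localDirs, DirChecker checker) {
    for (String dir : localDirs) {
      // add() returns false for dirs already verified, so the slow path
      // is taken at most once per dir instead of once per resource.
      if (verifiedDirs.add(dir)) {
        checker.check(dir);
      }
    }
  }
}
{code}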
[jira] [Commented] (YARN-3261) rewrite resourcemanager restart doc to remove roadmap bits
[ https://issues.apache.org/jira/browse/YARN-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499517#comment-14499517 ] Gururaj Shetty commented on YARN-3261: -- Hi [~aw] Kindly review the updated patch and let me know if I need to change anything. Thanks Regards, Gururaj rewrite resourcemanager restart doc to remove roadmap bits --- Key: YARN-3261 URL: https://issues.apache.org/jira/browse/YARN-3261 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Assignee: Gururaj Shetty Attachments: YARN-3261.01.patch Another mixture of roadmap and instruction manual that seems to be ever present in a lot of the recently written documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Description: Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. was: Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. It will cause public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Description: Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. It will cause public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. was: Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. It will cause public resource localization is serialized most of the time. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. It will cause public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3502) Expose number of unhealthy disks on NM via JMX
[ https://issues.apache.org/jira/browse/YARN-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3502: Description: It would be useful to expose the number of unhealthy disks on the NMs via JM so that alerts can be setup for the nodes. Expose number of unhealthy disks on NM via JMX -- Key: YARN-3502 URL: https://issues.apache.org/jira/browse/YARN-3502 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Varun Vasudev Assignee: Varun Vasudev It would be useful to expose the number of unhealthy disks on the NMs via JM so that alerts can be setup for the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Attachment: YARN-3491.000.patch PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently FSDownload submission to the thread pool is done in PublicLocalizer#addResource which is running in Dispatcher thread and completed localization handling is done in PublicLocalizer#run which is running in PublicLocalizer thread. Because PublicLocalizer#addResource is time consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. Also there are two more benefits with this change: 1. The Dispatcher thread won't be blocked by PublicLocalizer#addResource . Dispatcher thread handles most of time critical events at Node manager. 2. don't need synchronization on HashMap (pending). Because pending will be only accessed in PublicLocalizer thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499402#comment-14499402 ] Sunil G commented on YARN-2003: --- Thank you [~leftnoteasy] for sharing the comments. I will rebase the patch and address the comments mentioned above. Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side] -- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from the Submission Context and store it. Later this can be used by the Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3502) Expose number of unhealthy disks on NM via JMX
Varun Vasudev created YARN-3502: --- Summary: Expose number of unhealthy disks on NM via JMX Key: YARN-3502 URL: https://issues.apache.org/jira/browse/YARN-3502 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Varun Vasudev Assignee: Varun Vasudev -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3496) Add a configuration to disable/enable storing localization state in NMLeveldbStateStore
[ https://issues.apache.org/jira/browse/YARN-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-3496. - Resolution: Not A Problem Add a configuration to disable/enable storing localization state in NMLeveldbStateStore --- Key: YARN-3496 URL: https://issues.apache.org/jira/browse/YARN-3496 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Add a configuration to disable/enable storing localization state in NMLeveldbStateStore. Storing localization state in the levelDB may have some overhead, which may affect NM performance. It would be better to have a configuration to disable/enable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499764#comment-14499764 ] Junping Du commented on YARN-3134: -- bq. 1. At the current stage, I suggest we focus on logic correctness and performance tuning. We may have multiple iterations between improving and doing benchmark +1. We should get some performance data which help us better understanding on the direction and priority. bq. For now, I'm focusing on correctness, readability, and exception handling. Does that plan sound good to you? Sounds like a good plan. Thanks [~gtCarrera9]. [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134DataSchema.pdf Quote the introduction on Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simply our implementation read/write data from/to HBase, and can easily build index and compose complex query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3381) A typographical error in InvalidStateTransitonException
[ https://issues.apache.org/jira/browse/YARN-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3381: --- Attachment: YARN-3381-002.patch A typographical error in InvalidStateTransitonException - Key: YARN-3381 URL: https://issues.apache.org/jira/browse/YARN-3381 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Xiaoshuang LU Assignee: Brahma Reddy Battula Attachments: YARN-3381-002.patch, YARN-3381.patch Appears that InvalidStateTransitonException should be InvalidStateTransitionException. Transition was misspelled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499750#comment-14499750 ] Hadoop QA commented on YARN-2268: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726128/0001-YARN-2268.patch against trunk revision 76e7264. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 1 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/7375//artifact/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRM Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7375//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7375//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7375//console This message is automatically generated. Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: 0001-YARN-2268.patch YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499609#comment-14499609 ] Hudson commented on YARN-3021: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #166 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/166/]) YARN-3021. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp. Contributed by Yongjun Zhang (jianhe: rev bb6dde68f19be1885a9e7f7949316a03825b6f3e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-yarn-project/CHANGES.txt * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
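The behavioural change described above — probe the renewal at submission time, but on failure skip only the automatic-renewal scheduling instead of failing the app — boils down to the pattern sketched below. This is a simplified illustration, not the actual DelegationTokenRenewer code; the TokenOps interface stands in for the real internals.

{code}
import java.io.IOException;

// Simplified sketch of the tolerant-renewal idea described in this JIRA.
public class TolerantRenewalSketch {

  // Hypothetical stand-ins for the real token-handling internals.
  interface TokenOps {
    void renew() throws IOException;
    void scheduleAutomaticRenewal();
  }

  public void handleTokenOnSubmission(TokenOps token, String appId) {
    try {
      token.renew();                      // may fail when the remote realm does not trust the RM
      token.scheduleAutomaticRenewal();   // only schedule renewal when the probe succeeds
    } catch (IOException e) {
      // Do not bubble the error back to the client: skip scheduling, keep the submission alive.
      System.err.println("Skipping automatic renewal for " + appId + ": " + e.getMessage());
    }
  }
}
{code}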
[jira] [Commented] (YARN-3181) FairScheduler: Fix up outdated findbugs issues
[ https://issues.apache.org/jira/browse/YARN-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499667#comment-14499667 ] Tsuyoshi Ozawa commented on YARN-3181: -- I agree with Karthik's comment - this is not an urgent concern. In fact, findbugs-exclude.xml is not outdated, since FairScheduler does some optimizations for better concurrency that findbugs warns about. On this JIRA, we should check the lock ordering very carefully without degrading performance. As a result, it would be good if we could remove the IS2_INCONSISTENT_SYNC exclusion from the findbugs-exclude file. FairScheduler: Fix up outdated findbugs issues -- Key: YARN-3181 URL: https://issues.apache.org/jira/browse/YARN-3181 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Brahma Reddy Battula Attachments: YARN-3181-002.patch, yarn-3181-1.patch In FairScheduler, we have excluded some findbugs-reported errors. Some of them aren't applicable anymore, and there are a few that can be easily fixed without needing an exclusion. It would be nice to fix them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3495) Confusing log generated by FairScheduler
[ https://issues.apache.org/jira/browse/YARN-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499702#comment-14499702 ] Tsuyoshi Ozawa commented on YARN-3495: -- Oops, I meant discussion on YARN-3197. Confusing log generated by FairScheduler Key: YARN-3495 URL: https://issues.apache.org/jira/browse/YARN-3495 Project: Hadoop YARN Issue Type: Bug Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3495.patch 2015-04-16 12:03:48,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499803#comment-14499803 ] Thomas Graves commented on YARN-3434: - Ok, I'll make the changes and post an updated patch Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0 User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, each 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3381) A typographical error in InvalidStateTransitonException
[ https://issues.apache.org/jira/browse/YARN-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499836#comment-14499836 ] Hadoop QA commented on YARN-3381: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726152/YARN-3381-002.patch against trunk revision 76e7264. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7376//console This message is automatically generated. A typographical error in InvalidStateTransitonException - Key: YARN-3381 URL: https://issues.apache.org/jira/browse/YARN-3381 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Xiaoshuang LU Assignee: Brahma Reddy Battula Attachments: YARN-3381-002.patch, YARN-3381.patch Appears that InvalidStateTransitonException should be InvalidStateTransitionException. Transition was misspelled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3381) A typographical error in InvalidStateTransitonException
[ https://issues.apache.org/jira/browse/YARN-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499797#comment-14499797 ] Brahma Reddy Battula commented on YARN-3381: Rebased the patch.. Kindly review.. A typographical error in InvalidStateTransitonException - Key: YARN-3381 URL: https://issues.apache.org/jira/browse/YARN-3381 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Xiaoshuang LU Assignee: Brahma Reddy Battula Attachments: YARN-3381-002.patch, YARN-3381.patch Appears that InvalidStateTransitonException should be InvalidStateTransitionException. Transition was misspelled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499580#comment-14499580 ] Hadoop QA commented on YARN-3491: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726118/YARN-3491.000.patch against trunk revision 76e7264. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7374//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7374//console This message is automatically generated. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which is about 10 ms. The total delay will be approximately number of local dirs * 10 ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499641#comment-14499641 ] Rohith commented on YARN-2268: -- I verified the patch by deploying it in an HA cluster and a non-HA cluster. If any active RM is found in the cluster, an exception will be thrown back to the console. Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: 0001-YARN-2268.patch YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
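One simple way to detect a running RM before allowing a format, sketched below, is to probe the RM's REST endpoint. This is only an illustration of the guard's intent; the attached patch may implement the check differently (for example, via the state store itself).

{code}
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative pre-format guard; not the mechanism used by the attached patch.
public class FormatGuardSketch {

  static boolean rmLooksActive(String rmWebAddress) {
    try {
      // /ws/v1/cluster/info is the standard RM REST endpoint; the host below is an assumption.
      HttpURLConnection conn =
          (HttpURLConnection) new URL(rmWebAddress + "/ws/v1/cluster/info").openConnection();
      conn.setConnectTimeout(2000);
      return conn.getResponseCode() == 200;
    } catch (Exception e) {
      return false; // no response: assume no RM is serving on that address
    }
  }

  public static void main(String[] args) {
    if (rmLooksActive("http://localhost:8088")) {
      throw new IllegalStateException(
          "An RM appears to be running; refusing to format the state store");
    }
    System.out.println("No active RM detected; formatting could proceed");
  }
}
{code}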
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499684#comment-14499684 ] Hudson commented on YARN-3021: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #157 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/157/]) YARN-3021. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp. Contributed by Yongjun Zhang (jianhe: rev bb6dde68f19be1885a9e7f7949316a03825b6f3e) * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java * hadoop-yarn-project/CHANGES.txt * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499621#comment-14499621 ] Hudson commented on YARN-3021: -- FAILURE: Integrated in Hadoop-Yarn-trunk #900 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/900/]) YARN-3021. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp. Contributed by Yongjun Zhang (jianhe: rev bb6dde68f19be1885a9e7f7949316a03825b6f3e) * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java * hadoop-yarn-project/CHANGES.txt YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499675#comment-14499675 ] Hudson commented on YARN-3021: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2098 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2098/]) YARN-3021. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp. Contributed by Yongjun Zhang (jianhe: rev bb6dde68f19be1885a9e7f7949316a03825b6f3e) * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3495) Confusing log generated by FairScheduler
[ https://issues.apache.org/jira/browse/YARN-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499701#comment-14499701 ] Tsuyoshi Ozawa commented on YARN-3495: -- +1. I checked the discussion on YARN-3495, and the contents of this log look good to me. I also checked that containerStatus cannot be null in any case. I'll commit this two days from now. Confusing log generated by FairScheduler Key: YARN-3495 URL: https://issues.apache.org/jira/browse/YARN-3495 Project: Hadoop YARN Issue Type: Bug Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Attachments: YARN-3495.patch 2015-04-16 12:03:48,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2268: - Issue Type: Improvement (was: Bug) Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2268: - Attachment: 0001-YARN-2268.patch Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: 0001-YARN-2268.patch YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499638#comment-14499638 ] Rohith commented on YARN-2268: -- Attached the patch for disallowing format store using previous approach. Kindly review the patch Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: 0001-YARN-2268.patch YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3503) Expose disk utilization percentage on NM via JMX
[ https://issues.apache.org/jira/browse/YARN-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499885#comment-14499885 ] Vinod Kumar Vavilapalli commented on YARN-3503: --- Given we are starting afresh on exposing resource-usage, how about we make this a REST API and merge it into YARN-3332? Expose disk utilization percentage on NM via JMX Key: YARN-3503 URL: https://issues.apache.org/jira/browse/YARN-3503 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Varun Vasudev Assignee: Varun Vasudev It would be useful to expose the disk utilization on the NMs via JMX so that alerts can be setup for nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode
[ https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499939#comment-14499939 ] Arun Suresh commented on YARN-2962: --- [~varun_saxena], wondering if you need any help with this. Would like to get this in soon. ZKRMStateStore: Limit the number of znodes under a znode Key: YARN-2962 URL: https://issues.apache.org/jira/browse/YARN-2962 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Varun Saxena Priority: Critical Attachments: YARN-2962.01.patch We ran into this issue where we were hitting the default ZK server message size configs, primarily because the message had too many znodes even though individually they were all small. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1442#comment-1442 ] Hudson commented on YARN-3021: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #167 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/167/]) YARN-3021. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp. Contributed by Yongjun Zhang (jianhe: rev bb6dde68f19be1885a9e7f7949316a03825b6f3e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1402) Related Web UI, CLI changes on exposing client API to check log aggregation status
[ https://issues.apache.org/jira/browse/YARN-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500019#comment-14500019 ] Junping Du commented on YARN-1402: -- Latest patch looks good to me. [~xgong], can you file a separate JIRA to track the test failure, in case we don't have one? Related Web UI, CLI changes on exposing client API to check log aggregation status -- Key: YARN-1402 URL: https://issues.apache.org/jira/browse/YARN-1402 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-1402.1.patch, YARN-1402.2.patch, YARN-1402.3.1.patch, YARN-1402.3.2.patch, YARN-1402.3.patch, YARN-1402.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500287#comment-14500287 ] Inigo Goiri commented on YARN-3482: --- Yes, that one is good. My proposal for the third one was meaningless... I'll go code this. Report NM available resources in heartbeat -- Key: YARN-3482 URL: https://issues.apache.org/jira/browse/YARN-3482 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Original Estimate: 504h Remaining Estimate: 504h NMs are usually collocated with other processes like HDFS, Impala or HBase. To manage this scenario correctly, YARN should be aware of the actual available resources. The proposal is to have an interface to dynamically change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3410: - Attachment: 0004-YARN-3410.patch Updated the patch fixing usage format.. kindly review updated patch YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch, 0004-YARN-3410.patch When RM state store entered an unexpected state, one example is YARN-2340, when an attempt is not in final state but app already completed, RM can never get up unless format RMStateStore. I think we should support remove individual application records from RMStateStore to unblock RM admin make choice of either waiting for a fix or format state store. In addition, RM should be able to report all fatal errors (which will shutdown RM) when doing app recovery, this can save admin some time to remove apps in bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500257#comment-14500257 ] Hadoop QA commented on YARN-3136: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726183/00011-YARN-3136.patch against trunk revision 76e7264. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7377//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7377//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7377//console This message is automatically generated. getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 00011-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500277#comment-14500277 ] Karthik Kambatla commented on YARN-3482: For the third config, would something like yarn.nodemanager.dynamic-resource-availability=true/false be more descriptive? Admin interface (with a special command) sounds reasonable. Report NM available resources in heartbeat -- Key: YARN-3482 URL: https://issues.apache.org/jira/browse/YARN-3482 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Original Estimate: 504h Remaining Estimate: 504h NMs are usually collocated with other processes like HDFS, Impala or HBase. To manage this scenario correctly, YARN should be aware of the actual available resources. The proposal is to have an interface to dynamically change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3136: -- Attachment: 00012-YARN-3136.patch Rebased against trunk. Also changed the findbugs suppression for getTransferredContainers method. getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 00011-YARN-3136.patch, 00012-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
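The general idea behind removing the scheduler lock from this path — keep the per-application live-container view in a concurrent structure that AM registration can read without the scheduler-wide lock — is sketched below. This is an illustration of the approach only, not the contents of the attached patches.

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch: serve getTransferredContainers-style reads from a concurrent map, lock-free.
public class TransferredContainersSketch<C> {

  // One entry per application attempt; written by the scheduler, read by AM registration.
  private final Map<String, List<C>> liveContainers = new ConcurrentHashMap<>();

  public void containerLaunched(String appAttemptId, C container) {
    liveContainers.computeIfAbsent(appAttemptId, k -> new CopyOnWriteArrayList<>())
        .add(container);
  }

  // Equivalent of getTransferredContainers: no scheduler-wide lock is taken.
  public List<C> getTransferredContainers(String appAttemptId) {
    List<C> containers = liveContainers.get(appAttemptId);
    return containers == null ? Collections.emptyList() : new ArrayList<>(containers);
  }
}
{code}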
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500308#comment-14500308 ] Hadoop QA commented on YARN-3410: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726185/0003-YARN-3410.patch against trunk revision 76e7264. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7378//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7378//console This message is automatically generated. YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch When RM state store entered an unexpected state, one example is YARN-2340, when an attempt is not in final state but app already completed, RM can never get up unless format RMStateStore. I think we should support remove individual application records from RMStateStore to unblock RM admin make choice of either waiting for a fix or format state store. In addition, RM should be able to report all fatal errors (which will shutdown RM) when doing app recovery, this can save admin some time to remove apps in bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500398#comment-14500398 ] Sangjin Lee commented on YARN-3431: --- I know [~zjshen]'s updating the patch, but I'll provide some feedback based on the current patch and the discussion here. Generally I agree with the approach of using fields in TimelineEntity to store/retrieve specialized information. That would definitely help with JSON's (lack of) support for polymorphism. With regard to the parent-child relationship and relationships in general, this might be a bit of a change, but would it be better to have some kind of key or label for a relationship? It would help locate a particular relationship (e.g. parent) quickly, and help other use cases identify exactly the relationship they need to retrieve. Thoughts? On a related note, I have problems with prohibiting hierarchical timeline entities from having any relationships other than parent-child. For example, frameworks (e.g. mapreduce) may use hierarchical timeline entities to describe their hierarchy (job -> task -> task attempts), and these entities would have dotted lines to YARN system entities (app, containers, etc.) and vice versa. It would be a pretty severe restriction to prohibit them. If we adopt the above approach, we should be able to allow both, right? (FlowEntity.java) - l. 58: do we want to set the id once we calculate it from scratch? (TimelineEntity.java) - l.88: Some javadoc would be helpful in explaining this constructor. It doesn't come through as very obvious. Sub resources of timeline entity needs to be passed to a separate endpoint. --- Key: YARN-3431 URL: https://issues.apache.org/jira/browse/YARN-3431 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch We have TimelineEntity and some other entities as subclasses that inherit from it. However, we only have a single endpoint, which consumes TimelineEntity rather than its sub-classes, and this endpoint checks that the incoming request body contains exactly a TimelineEntity object. However, the json data serialized from a sub-class object does not seem to be treated as a TimelineEntity object, and won't be deserialized into the corresponding sub-class object, which causes a deserialization failure, as discussed in YARN-3334 : https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
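The "key or label for a relationship" suggestion above amounts to something like the structure sketched below; it is only an illustration of the shape of the proposal, not the actual TimelineEntity API or the patch under review.

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of labelled entity relationships, e.g. "PARENT" -> {"flow_run_42"}.
public class KeyedRelationsSketch {

  private final Map<String, Set<String>> relatedEntities = new HashMap<>();

  public void addRelation(String label, String entityId) {
    relatedEntities.computeIfAbsent(label, k -> new HashSet<>()).add(entityId);
  }

  // Locating a particular relationship (e.g. the parent) becomes a direct lookup by label.
  public Set<String> getRelated(String label) {
    return relatedEntities.getOrDefault(label, new HashSet<>());
  }
}
{code}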
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500395#comment-14500395 ] Sunil G commented on YARN-3482: --- Hi [~elgoiri] bq. better to report the resources utilized by the machine. Do you mean Total CPU, and Total Memory etc. Could you please elaborate how this can help in doing a better resource allotment. As I see, if affinity is not set in CPU, distribution will be more generic and it may not be so easy to derive from that. Report NM available resources in heartbeat -- Key: YARN-3482 URL: https://issues.apache.org/jira/browse/YARN-3482 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Original Estimate: 504h Remaining Estimate: 504h NMs are usually collocated with other processes like HDFS, Impala or HBase. To manage this scenario correctly, YARN should be aware of the actual available resources. The proposal is to have an interface to dynamically change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500405#comment-14500405 ] zhihai xu commented on YARN-3491: - I uploaded a new patch, YARN-3491.001.patch, for review. Thinking about it a little more deeply, the old patch may have a big delay if multiple containers are submitted at the same time. For example, the following log shows 4 containers submitted at very close times: {code} 2015-04-07 21:42:22,071 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e30_1426628374875_110648_01_078264 transitioned from NEW to LOCALIZING 2015-04-07 21:42:22,074 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e30_1426628374875_110652_01_093777 transitioned from NEW to LOCALIZING 2015-04-07 21:42:22,076 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e30_1426628374875_110668_01_049049 transitioned from NEW to LOCALIZING 2015-04-07 21:42:22,078 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e30_1426628374875_110668_01_085183 transitioned from NEW to LOCALIZING {code} The new patch can overlap the delay with the public localization from the previous container, which will be a little better and more consistent with the behavior of the old code. It will also be better for a container which only has private resources and no public resources; in this case, no delay will be added to the Dispatcher thread. Finally, the change in the new patch is a little smaller than in the first patch. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch Based on the profiling, The bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. checkLocalDir is very slow which takes about 10+ ms. The total delay will be approximately number of local dirs * 10+ ms. This delay will be added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel(multithreading), public resource localization is serialized most of the time. And also PublicLocalizer#addResource is running in Dispatcher thread, So the Dispatcher thread will be blocked by PublicLocalizer#addResource for long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
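The structure of the fix being discussed — keep the slow getInitializedLocalDirs/checkLocalDir work off the dispatcher thread by pushing it into the localization thread pool along with the download itself — can be pictured roughly as below. This is a simplified sketch under assumed names, not the code in YARN-3491.001.patch.

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: enqueue the slow per-resource work so it overlaps across resources.
public class PublicLocalizerSketch {

  // Hypothetical stand-in for a public resource and its localization steps.
  interface Resource {
    void ensureLocalDirsInitialized(); // the ~10 ms-per-dir checkLocalDir work
    void download();
  }

  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  // Called from the dispatcher thread: only enqueues, so the dispatcher is never blocked.
  public void addResource(Resource rsrc) {
    pool.submit(() -> {
      rsrc.ensureLocalDirsInitialized();
      rsrc.download();
    });
  }
}
{code}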
[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500417#comment-14500417 ] Hadoop QA commented on YARN-2003: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726222/0007-YARN-2003.patch against trunk revision c6b5203. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7382//console This message is automatically generated. Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side] -- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from Submission Context and store. Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500338#comment-14500338 ] Wangda Tan commented on YARN-3410: -- bq. Yes, with the same user two RMs cannot be started. It checks for the PID and fails. YARN-2268 disallows formatting the state store while the RM is running. The same verification can be made for this also in that JIRA. Yes we should, it's the same problem. The latest patch LGTM, +1. YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch, 0004-YARN-3410.patch When RM state store entered an unexpected state, one example is YARN-2340, when an attempt is not in final state but app already completed, RM can never get up unless format RMStateStore. I think we should support remove individual application records from RMStateStore to unblock RM admin make choice of either waiting for a fix or format state store. In addition, RM should be able to report all fatal errors (which will shutdown RM) when doing app recovery, this can save admin some time to remove apps in bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily
[ https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500342#comment-14500342 ] Wangda Tan commented on YARN-3487: -- Thanks for feedback from [~sunilg], [~jlowe]. Make this as a sub JIRA of YARN-3091, and w/r lock for CS is tracked by YARN-3139. The latest patch LGTM, will commit when Jenkins get back. CapacityScheduler scheduler lock obtained unnecessarily --- Key: YARN-3487 URL: https://issues.apache.org/jira/browse/YARN-3487 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-3487.001.patch, YARN-3487.002.patch, YARN-3487.003.patch Recently saw a significant slowdown of applications on a large cluster, and we noticed there were a large number of blocked threads on the RM. Most of the blocked threads were waiting for the CapacityScheduler lock while calling getQueueInfo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500348#comment-14500348 ] Inigo Goiri commented on YARN-3482: --- To make it match yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb, I'm calling it yarn.nodemanager.resource.dynamic-availability. Report NM available resources in heartbeat -- Key: YARN-3482 URL: https://issues.apache.org/jira/browse/YARN-3482 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Original Estimate: 504h Remaining Estimate: 504h NMs are usually collocated with other processes like HDFS, Impala or HBase. To manage this scenario correctly, YARN should be aware of the actual available resources. The proposal is to have an interface to dynamically change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500223#comment-14500223 ] Inigo Goiri commented on YARN-3482: --- Makes sense. I think we should implement both and give the option to use one or the other. Proposal for the names of the variables? yarn.nodemanager.track-utilization.node=true/false yarn.nodemanager.track-utilization.containers=true/false yarn.nodemanager.resource=true/false (The second one would be for YARN-3481.) For the interface, the simplest thing is to edit yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb in yarn-site.xml. However, this implies modifying the XML periodically which is kind of dirty for this purpose. I guess the cleanest is using the admin interface, preferences? Report NM available resources in heartbeat -- Key: YARN-3482 URL: https://issues.apache.org/jira/browse/YARN-3482 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Original Estimate: 504h Remaining Estimate: 504h NMs are usually collocated with other processes like HDFS, Impala or HBase. To manage this scenario correctly, YARN should be aware of the actual available resources. The proposal is to have an interface to dynamically change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
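Whichever property name wins out, the NM side would read it through the usual Hadoop Configuration API, roughly as sketched below. The property key and default are assumptions taken from this discussion, not settled configuration names.

{code}
import org.apache.hadoop.conf.Configuration;

// Sketch of reading the proposed flag; the key below is still under discussion on this JIRA.
public class DynamicResourceFlagSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Default to false so existing clusters keep today's static NM resource sizing.
    boolean dynamic =
        conf.getBoolean("yarn.nodemanager.resource.dynamic-availability", false);
    System.out.println("Dynamic resource availability enabled: " + dynamic);
  }
}
{code}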
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500298#comment-14500298 ] Rohith commented on YARN-3410: -- bq. I think the RM will check the PID at startup to avoid this case, correct? Yes, two RMs cannot be started as the same user; the RM checks for the PID and fails. YARN-2268 disallows formatting the state store while the RM is running. The same verification can be added for this case as well in that JIRA. YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch When RM state store entered an unexpected state, one example is YARN-2340, when an attempt is not in final state but app already completed, RM can never get up unless format RMStateStore. I think we should support remove individual application records from RMStateStore to unblock RM admin make choice of either waiting for a fix or format state store. In addition, RM should be able to report all fatal errors (which will shutdown RM) when doing app recovery, this can save admin some time to remove apps in bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily
[ https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3487: - Issue Type: Sub-task (was: Bug) Parent: YARN-3091 CapacityScheduler scheduler lock obtained unnecessarily --- Key: YARN-3487 URL: https://issues.apache.org/jira/browse/YARN-3487 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-3487.001.patch, YARN-3487.002.patch, YARN-3487.003.patch Recently saw a significant slowdown of applications on a large cluster, and we noticed there were a large number of blocked threads on the RM. Most of the blocked threads were waiting for the CapacityScheduler lock while calling getQueueInfo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Attachment: YARN-3491.001.patch PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch, YARN-3491.001.patch Based on the profiling, the bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is very slow and takes about 10+ ms. The total delay will be approximately the number of local dirs * 10+ ms, and this delay is added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time. In addition, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
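The arithmetic in the description above is the key point: with, say, four local dirs at 10+ ms each, every public resource pays roughly 40+ ms on the dispatcher thread before the download is even handed to the thread pool. Purely as a hedged illustration of the caching idea the description implies, and not the attached patch, here is a sketch that remembers which local dirs have already passed the expensive check; all class and method names are hypothetical.
{code}
// Hedged sketch only; hypothetical names, not the actual YARN-3491 change.
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class InitializedLocalDirsCache {
  // Local dirs that have already passed the expensive verification.
  private final Set<String> verifiedDirs =
      Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

  /** Stands in for the ~10+ ms per-dir check described in the profiling above. */
  private void checkLocalDir(String dir) {
    // permission/ownership checks on the local dir would go here
  }

  /** Pays the checkLocalDir cost only the first time each dir is seen. */
  public void ensureInitialized(Iterable<String> localDirs) {
    for (String dir : localDirs) {
      if (verifiedDirs.add(dir)) { // add() returns true only for unseen dirs
        checkLocalDir(dir);
      }
    }
  }
}
{code}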
[jira] [Updated] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang updated YARN-3021: Release Note: ResourceManager renews delegation tokens for applications. This behavior has been changed to renew tokens only if the token's renewer is a non-empty string. MapReduce jobs can instruct ResourceManager to skip renewal of tokens obtained from certain hosts by specifying the hosts with configuration mapreduce.job.hdfs-servers.token-renewal.exclude=host1,host2,..,hostN. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
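Since the release note above names a concrete configuration key, here is a hedged example of how a MapReduce job could set it using the standard Job/Configuration API. The host name is a placeholder and the job setup is trimmed to the relevant line; this illustrates the documented key, it is not code from the patch.
{code}
// Illustration only: the host name is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TokenRenewalExcludeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask the RM to skip renewal of delegation tokens obtained from this host.
    conf.set("mapreduce.job.hdfs-servers.token-renewal.exclude",
        "nn.realm-b.example.com");
    Job job = Job.getInstance(conf, "cross-realm copy");
    // ... configure mapper, input/output paths, etc., then submit, e.g.:
    // job.waitForCompletion(true);
  }
}
{code}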
[jira] [Commented] (YARN-3046) [Event producers] Implement MapReduce AM writing some MR metrics to ATS
[ https://issues.apache.org/jira/browse/YARN-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500037#comment-14500037 ] Junping Du commented on YARN-3046: -- Thanks [~rkanter] and [~zjshen] for the review and comments! bq. One minor thing: Is there a JIRA for this TODO? Yes, YARN-3367. Will add the JIRA number to the TODO comment here. bq. Task entity Id should not be the job Id, but the task Id. PS: there's a typo here Nice catch! This is definitely a bug. Fixed it (and the typo) in the v3 patch. [Event producers] Implement MapReduce AM writing some MR metrics to ATS --- Key: YARN-3046 URL: https://issues.apache.org/jira/browse/YARN-3046 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Junping Du Attachments: YARN-3046-no-test-v2.patch, YARN-3046-no-test.patch, YARN-3046-v1-rebase.patch, YARN-3046-v1.patch, YARN-3046-v2.patch Per design in YARN-2928, select a handful of MR metrics (e.g. HDFS bytes written) and have the MR AM write the framework-specific metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3390) Reuse TimelineCollectorManager for RM
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500062#comment-14500062 ] Sangjin Lee commented on YARN-3390: --- Thanks Zhijie. I'll move forward with the existing patch for YARN-3437. You can still make the change of String -> ApplicationId as part of this JIRA (as it involves more refactoring). How's that sound? Reuse TimelineCollectorManager for RM - Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3390.1.patch RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500268#comment-14500268 ] Wangda Tan commented on YARN-3410: -- One question: what will happen if a running app is removed from the state store while the RM is running? Will it leave the state corrupted? I think the RM will check the PID at startup to avoid this case, correct? I also deployed a local cluster to try this, and everything works fine. One minor comment about the usage: {code} Usage: java ResourceManager [-format-state-store] | [-remove-application-from-state-store ApplicationId] {code} Better to format it like this? {code} Usage: yarn resourcemanager [-format-state-store] [-remove..] appId {code} YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch When RM state store entered an unexpected state, one example is YARN-2340, when an attempt is not in final state but app already completed, RM can never get up unless format RMStateStore. I think we should support remove individual application records from RMStateStore to unblock RM admin make choice of either waiting for a fix or format state store. In addition, RM should be able to report all fatal errors (which will shutdown RM) when doing app recovery, this can save admin some time to remove apps in bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2605) [RM HA] Rest api endpoints doing redirect incorrectly
[ https://issues.apache.org/jira/browse/YARN-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500307#comment-14500307 ] Xuan Gong commented on YARN-2605: - Thanks for the review, [~ste...@apache.org]. bq. Why is a test now tagged as @ignore? The test case does not work at all once we make the changes; it gives a "too many redirect loops" exception. [RM HA] Rest api endpoints doing redirect incorrectly - Key: YARN-2605 URL: https://issues.apache.org/jira/browse/YARN-2605 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: bc Wong Assignee: Xuan Gong Labels: newbie Attachments: YARN-2605.1.patch The standby RM's webui tries to do a redirect via meta-refresh. That is fine for pages designed to be viewed by web browsers. But the API endpoints shouldn't do that. Most programmatic HTTP clients do not do meta-refresh. I'd suggest HTTP 303, or return a well-defined error message (json or xml) stating that the standby status and a link to the active RM. The standby RM is returning this today: {noformat} $ curl -i http://bcsec-1.ent.cloudera.com:8088/ws/v1/cluster/metrics HTTP/1.1 200 OK Cache-Control: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Content-Type: text/plain; charset=UTF-8 Refresh: 3; url=http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics Content-Length: 117 Server: Jetty(6.1.26) This is standby RM. Redirecting to the current active RM: http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily
[ https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-3487: - Attachment: YARN-3487.003.patch Thanks for the feedback, Wangda and Sunil. In the interest of keeping this JIRA simple to expedite the getQueueInfo and getQueue fix, this version of the patch restores the lock on checkAccess. IIRC there's already another JIRA proposing to add read/write locks to the CapacityScheduler to handle rare events like queue config refresh. CapacityScheduler scheduler lock obtained unnecessarily --- Key: YARN-3487 URL: https://issues.apache.org/jira/browse/YARN-3487 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-3487.001.patch, YARN-3487.002.patch, YARN-3487.003.patch Recently saw a significant slowdown of applications on a large cluster, and we noticed there were a large number of blocked threads on the RM. Most of the blocked threads were waiting for the CapacityScheduler lock while calling getQueueInfo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
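For readers unfamiliar with the read/write-lock idea mentioned above (tracked separately, per the earlier comments about YARN-3139), here is a hedged, self-contained sketch of the pattern rather than the YARN-3487 patch itself: read-mostly calls such as getQueueInfo take a shared read lock, while rare operations like a queue configuration refresh take the exclusive write lock, so readers no longer serialize behind one scheduler-wide monitor. All names below are hypothetical.
{code}
// Hedged sketch of the read/write-lock pattern; not actual Hadoop code.
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class QueueInfoLockingSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private String queueSnapshot = "root"; // stands in for queue state

  /** Read-mostly path: many concurrent readers, no exclusive monitor. */
  public String getQueueInfo() {
    lock.readLock().lock();
    try {
      return queueSnapshot;
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Rare path, e.g. a queue config refresh: takes the exclusive write lock. */
  public void refreshQueues(String newSnapshot) {
    lock.writeLock().lock();
    try {
      queueSnapshot = newSnapshot;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}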
[jira] [Updated] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2003: -- Attachment: 0007-YARN-2003.patch Rebased the patch and addressed the comments. Thank you [~leftnoteasy] Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side] -- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from Submission Context and store. Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500368#comment-14500368 ] Yongjun Zhang commented on YARN-3021: - Thanks also to [~ka...@cloudera.com] for the earlier discussions; we worked out a release note, which I just updated. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500029#comment-14500029 ] Yongjun Zhang commented on YARN-3021: - Thanks again [~jianhe] for the reviews/suggestions and committing! Thanks [~qwertymaniac] for diagnosing and reporting the issue, Harsh, [~vinodkv], [~adhoot] for the reviews and discussions! YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3390) Reuse TimelineCollectorManager for RM
[ https://issues.apache.org/jira/browse/YARN-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500152#comment-14500152 ] Zhijie Shen commented on YARN-3390: --- bq. How's that sound? Sure, I'll take care of it. Reuse TimelineCollectorManager for RM - Key: YARN-3390 URL: https://issues.apache.org/jira/browse/YARN-3390 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3390.1.patch RMTimelineCollector should have the context info of each app whose entity has been put -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500184#comment-14500184 ] Wangda Tan commented on YARN-3463: -- [~cwelch], Thanks for update, patch LGTM, +1. Integrate OrderingPolicy Framework with CapacityScheduler - Key: YARN-3463 URL: https://issues.apache.org/jira/browse/YARN-3463 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3463.50.patch, YARN-3463.61.patch, YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch, YARN-3463.67.patch, YARN-3463.68.patch Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1402) Related Web UI, CLI changes on exposing client API to check log aggregation status
[ https://issues.apache.org/jira/browse/YARN-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500193#comment-14500193 ] Junping Du commented on YARN-1402: -- bq. This is a good point. The reports also need to be used in generating the RMAppLogAggregationWebUI. So, we can not simply delete them. Agree that we cannot simply delete them. Will file one to start discussion on solutions. +1. Will commit the latest patch shortly. Related Web UI, CLI changes on exposing client API to check log aggregation status -- Key: YARN-1402 URL: https://issues.apache.org/jira/browse/YARN-1402 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-1402.1.patch, YARN-1402.2.patch, YARN-1402.3.1.patch, YARN-1402.3.2.patch, YARN-1402.3.patch, YARN-1402.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3046) [Event producers] Implement MapReduce AM writing some MR metrics to ATS
[ https://issues.apache.org/jira/browse/YARN-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3046: - Attachment: YARN-3046-v3.patch Incorporated [~zjshen]'s and [~rkanter]'s comments in the v3 patch! Also identified an NPE issue in the previous patch for MiniMRYarnCluster (when the auxiliary service is not set explicitly). Verified that the related tests pass. [Event producers] Implement MapReduce AM writing some MR metrics to ATS --- Key: YARN-3046 URL: https://issues.apache.org/jira/browse/YARN-3046 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Junping Du Attachments: YARN-3046-no-test-v2.patch, YARN-3046-no-test.patch, YARN-3046-v1-rebase.patch, YARN-3046-v1.patch, YARN-3046-v2.patch, YARN-3046-v3.patch Per design in YARN-2928, select a handful of MR metrics (e.g. HDFS bytes written) and have the MR AM write the framework-specific metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3136: -- Attachment: (was: 00011-YARN-3136.patch) getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 00011-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500097#comment-14500097 ] Karthik Kambatla commented on YARN-3482: bq. With this and the containers utilization, we can estimate the utilization of external processes. True, but I fear that will be too conservative. If we go that route, HBase RegionServers could grow aggressively and adversely affect resources under YARN. By having an interface for available resources, we ensure YARN aggressively schedules work to claim all available resources. Changing these available resources could be done through a secure interface that admins or a white-list of processes can access. Report NM available resources in heartbeat -- Key: YARN-3482 URL: https://issues.apache.org/jira/browse/YARN-3482 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.7.0 Reporter: Inigo Goiri Original Estimate: 504h Remaining Estimate: 504h NMs are usually collocated with other processes like HDFS, Impala or HBase. To manage this scenario correctly, YARN should be aware of the actual available resources. The proposal is to have an interface to dynamically change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3136: -- Attachment: 00011-YARN-3136.patch Checking jenkins again getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 00011-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500187#comment-14500187 ] Jonathan Eagles commented on YARN-3437: --- Now that I have dug into timeline server performance (YARN-3448), I have a better understanding of what types of writes are costly. For example, a single entity will generate dozens of writes to the database. The number of primary keys, the number of related entities, and the write batch size (entities per put) greatly affect the time an entity put takes. While this is a good start, I think there should at least be a follow-up that addresses these issues to better measure the write performance. convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3437) convert load test driver to timeline service v.2
[ https://issues.apache.org/jira/browse/YARN-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500112#comment-14500112 ] Zhijie Shen commented on YARN-3437: --- Will take a look today. convert load test driver to timeline service v.2 Key: YARN-3437 URL: https://issues.apache.org/jira/browse/YARN-3437 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3437.001.patch, YARN-3437.002.patch This subtask covers the work for converting the proposed patch for the load test driver (YARN-2556) to work with the timeline service v.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500044#comment-14500044 ] Hudson commented on YARN-3021: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2116 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2116/]) YARN-3021. YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp. Contributed by Yongjun Zhang (jianhe: rev bb6dde68f19be1885a9e7f7949316a03825b6f3e) * hadoop-yarn-project/CHANGES.txt * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp --- Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Assignee: Yongjun Zhang Fix For: 2.8.0 Attachments: YARN-3021.001.patch, YARN-3021.002.patch, YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.007.patch, YARN-3021.patch Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3491: Description: Based on the profiling, the bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is very slow and takes about 10+ ms. The total delay will be approximately the number of local dirs * 10+ ms, and this delay is added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time. In addition, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. was: Based on the profiling, the bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is very slow, at about 10 ms. The total delay will be approximately the number of local dirs * 10 ms, and this delay is added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time. In addition, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. PublicLocalizer#addResource is too slow. Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3491.000.patch Based on the profiling, the bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is very slow and takes about 10+ ms. The total delay will be approximately the number of local dirs * 10+ ms, and this delay is added for each public resource localization. Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized; instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time. In addition, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2696) Queue sorting in CapacityScheduler should consider node label
[ https://issues.apache.org/jira/browse/YARN-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500200#comment-14500200 ] Wangda Tan commented on YARN-2696: -- Failed test is tracked by YARN-2483 Queue sorting in CapacityScheduler should consider node label - Key: YARN-2696 URL: https://issues.apache.org/jira/browse/YARN-2696 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2696.1.patch, YARN-2696.2.patch, YARN-2696.3.patch, YARN-2696.4.patch In the past, when trying to allocate containers under a parent queue in CapacityScheduler. The parent queue will choose child queues by the used resource from smallest to largest. Now we support node label in CapacityScheduler, we should also consider used resource in child queues by node labels when allocating resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3410: - Attachment: 0003-YARN-3410.patch YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch When RM state store entered an unexpected state, one example is YARN-2340, when an attempt is not in final state but app already completed, RM can never get up unless format RMStateStore. I think we should support remove individual application records from RMStateStore to unblock RM admin make choice of either waiting for a fix or format state store. In addition, RM should be able to report all fatal errors (which will shutdown RM) when doing app recovery, this can save admin some time to remove apps in bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3500) Optimize ResourceManager Web loading speed
[ https://issues.apache.org/jira/browse/YARN-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3500: --- Priority: Major (was: Minor) Optimize ResourceManager Web loading speed -- Key: YARN-3500 URL: https://issues.apache.org/jira/browse/YARN-3500 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Peter Shi After running 10k jobs, the ResourceManager web UI becomes slow to load. Since the server side sends information for all 10k jobs in one response, parsing and rendering the page takes a long time. The current paging logic is done on the browser side. This issue moves the paging logic to the server side, so that loading will be fast. Loading 10k jobs costs 55 sec; loading 2k costs 7 sec. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
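Purely as a hedged illustration of the server-side paging idea described above, and not the ResourceManager web framework code, the sketch below slices the full job list on the server so only one page of rows is serialized and rendered per request; the class and method names are hypothetical. With the numbers from the description, returning a page of, say, 50 rows instead of all 10k should bring the load time much closer to the small-list case.
{code}
// Hedged, generic illustration of server-side paging; hypothetical names.
import java.util.List;

public class ServerSidePagingSketch {
  /** Returns only the rows for the requested page instead of every entry. */
  public static <T> List<T> page(List<T> allRows, int pageIndex, int pageSize) {
    int from = Math.min(pageIndex * pageSize, allRows.size());
    int to = Math.min(from + pageSize, allRows.size());
    return allRows.subList(from, to); // a view over the backing list
  }
}
{code}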
[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500481#comment-14500481 ] Hadoop QA commented on YARN-3136: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726204/00012-YARN-3136.patch against trunk revision c6b5203. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7379//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7379//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7379//console This message is automatically generated. getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 00011-YARN-3136.patch, 00012-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)