[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142931#comment-14142931 ] Karthik Kambatla commented on YARN-2453: The latest patch looks good. +1. Checking this in. TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch, YARN-2453.001.patch, YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The error message is the following: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler:
{code}
if (scheduler instanceof PreemptableResourceScheduler &&
    conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) {
{code}
This is because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
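To make the scheduler dependency above concrete, here is a hedged sketch (not necessarily the actual YARN-2453 patch) of how a test can pin its configuration to the CapacityScheduler so the RM actually creates the SchedulingMonitor service, regardless of which scheduler the surrounding test environment defaults to. MockRM is the RM test harness used in this test module; the rest are real YarnConfiguration keys.
{code}
// Hedged sketch: force CapacityScheduler + monitors so SchedulingMonitor is created.
Configuration conf = new YarnConfiguration();
conf.setClass(YarnConfiguration.RM_SCHEDULER,
    CapacityScheduler.class, ResourceScheduler.class);
conf.setBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
conf.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
    ProportionalCapacityPreemptionPolicy.class.getCanonicalName());
MockRM rm = new MockRM(conf);   // RM test harness
rm.start();
// the test can now look up the SchedulingMonitor service on rm.getRMContext()/rm services
{code}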
[jira] [Updated] (YARN-2453) TestProportionalCapacityPreemptionPolicy fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2453: --- Summary: TestProportionalCapacityPreemptionPolicy fails with FairScheduler (was: TestProportionalCapacityPreemptionPolicy is failed for FairScheduler) TestProportionalCapacityPreemptionPolicy fails with FairScheduler - Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch, YARN-2453.001.patch, YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The error message is the following: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler:
{code}
if (scheduler instanceof PreemptableResourceScheduler &&
    conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) {
{code}
This is because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143017#comment-14143017 ] Hadoop QA commented on YARN-2198: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670374/YARN-2198.trunk.9.patch against trunk revision 9721e2c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5070//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5070//console This message is automatically generated. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However, this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to running the entire NM as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. 
The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2498) [YARN-796] Respect labels in preemption policy of capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143019#comment-14143019 ] Wangda Tan commented on YARN-2498: -- Hi [~sunilg], Many thanks for reviewing this patch. Feedback: 1) bq. A scenario where node1 has more than 50% (say 60) of cluster resources, and queue A is given 50% in CS. In that case, is there any chance of under utilization? Yes, queue-A can be under-utilized. By design of YARN-796, this is acceptable. Now we calculate, in real time, the maximum resource that can be accessed by each queue, and the user/admin can get a warning about queue under-utilization from the web UI scheduler page. 2) bq. Here I feel, we may need to split up the resource of label in each node level. It's a very good question; I thought about this again for a while, and found a negative example that shows you're right:
{code}
node1: x,y   node2: x,y   node3: z   (each node has resource 10)

resource tree: total = 30
     /|\
  20x 20y 10z

First request 20 resource with label = x

resource tree: total = 10
     /|\
   0x 20y 10z

The correct result should be y = 0; we cannot request resource with label=y.
{code}
So it's best to split up the resource of each label to the node level, but the problem is that it will have a much larger time complexity. For each assign operation, we need O(n), where n = #unique-sets-of-labels-on-nodes, which can be very large in a big cluster. And considering m = #iterations and p = #leaf-queues, we need O(n * m * p) to get the ideal_assigned of each queue. There may be a better way to calculate ideal_assigned; I will think about this. For now, it can only get a correct ideal_assigned when every node in the cluster has <= 1 label. That is the hard-partition use case (the cluster is partitioned into several smaller clusters by label). 3) bq. For preemption, we just calculate to match the totalResourceToPreempt from the over utilized queues. But whether this container is from which node, and also under which label, and whether this label is coming under which queue. Do we need to do this check for each container? I think the answer is yes if we want this property: every preempted container can be accessed by at least one under-satisfied queue (one with ideal_assigned > current). Please let me know if you have more comments. Thanks, Wangda [YARN-796] Respect labels in preemption policy of capacity scheduler Key: YARN-2498 URL: https://issues.apache.org/jira/browse/YARN-2498 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2498.patch, YARN-2498.patch, YARN-2498.patch, yarn-2498-implementation-notes.pdf There're 3 stages in ProportionalCapacityPreemptionPolicy, # Recursively calculate {{ideal_assigned}} for each queue. This depends on available resource, resource used/pending in each queue and guaranteed capacity of each queue. # Mark to-be preempted containers: For each over-satisfied queue, it will mark some containers to be preempted. # Notify scheduler about to-be preempted containers. We need to respect labels in the cluster for both #1 and #2: For #1, when there is some resource available in the cluster, we shouldn't assign it to a queue (by increasing {{ideal_assigned}}) if the queue cannot access such labels. For #2, when we make a decision about whether we need to preempt a container, we need to make sure the resource of this container is *possibly* usable by a queue which is under-satisfied and has pending resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
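The per-container check discussed in point 3) can be illustrated with a simplified, hypothetical model (plain Java collections, not the YARN-2498 patch): a container running on a node is only worth preempting if at least one under-satisfied queue (ideal_assigned > current) can access one of that node's labels. The class and field names are made up for the example.
{code}
import java.util.Collection;
import java.util.Collections;
import java.util.Set;

class QueueUsage {
  long idealAssigned;            // computed in stage #1 of the policy
  long current;                  // resource currently used by the queue
  Set<String> accessibleLabels;  // labels this queue can access
}

class PreemptionLabelCheck {
  static boolean worthPreempting(Set<String> nodeLabels, Collection<QueueUsage> queues) {
    for (QueueUsage q : queues) {
      boolean underSatisfied = q.idealAssigned > q.current;
      boolean canAccessNode = nodeLabels.isEmpty()      // unlabeled node: accessible to all queues
          || !Collections.disjoint(q.accessibleLabels, nodeLabels);
      if (underSatisfied && canAccessNode) {
        return true;   // at least one needy queue could reuse this node's resource
      }
    }
    return false;      // preempting here would free resource nobody needy can use
  }
}
{code}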
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143034#comment-14143034 ] Zhijie Shen commented on YARN-1530: --- bq. Scenario 1. ATS service goes down bq. Scenario 2. ATS service partially down In general, I agree that the concerns about the scenario when the timeline server is (partially) down make sense. However, if we change the subject from ATS to HDFS/Kafka, I'm afraid we reach a similar conclusion. For example, HDFS can be temporarily not writable (we have actually observed this issue around YARN log aggregation). I can see the judgement has an obvious implication that the timeline server can be down, but HDFS/Kafka will not. That's correct to some extent, based on the current timeline server SLA. Therefore, is making the timeline server reliable (or always-up) the essential solution? If the timeline server is reliable, it's going to relax the requirement to persist entities in a third place (this is the basic benefit I can see with the HDFS/Kafka channel). While it may take a while to make the timeline server as reliable as HDFS/Kafka, we can make progress step by step; for example, YARN-2520 should be realistic to achieve within a reasonable timeline. Of course, there may still be a reliability gap between ATS/HBase and HDFS/Kafka (actually, I'm not experienced with the reliability of the latter components; please let me know what the exact gap would be). It could be arguable that we still need to persist the entities in HDFS/Kafka when ATS/HBase is not available but HDFS/Kafka is still available. However, if we need to improve the timeline server reliability anyway, perhaps we should think carefully about the cost/performance of implementing and maintaining another writing channel to bridge the gap. bq. Scenario 3. ATS backing store fails In this scenario, the entities have already reached the timeline server, right? I'm considering it as the internal reliability problem of the timeline server. As I mentioned in the previous threads, the requirement is that once an entity has reached the timeline server, the timeline server should take the responsibility to prevent it from being lost. I think it's a good point that the data store may have an outage (just as HDFS can be temporarily not writable). Having a local backup for the outstanding received requests should be an answer for this scenario. bq. However, with the HDFS channel, the ATS can essentially throttle the events Suppose you have a cluster pushing X events/second to the ATS. With the REST implementation, the ATS must try to handle X events every second; if it can’t keep up, or if it gets too many incoming connections, there’s not too much we can do here. This may not be an accurate judgement. I suppose you are comparing pushing each event in one request via the REST API with writing a batch of X events into HDFS. The REST API allows you to batch X events and send one request. Please refer to TimelineClient#putEntities for the details. bq. In making the write path pluggable, we’d have to have two pieces: one to do the writing from the TimelineClient and one to the receiving in the ATS. These would have to be in pairs. We’ve already discussed some different implementations for this: REST, Kafka, and HDFS. bq. The backing store is already pluggable. No problem; it's feasible to make the write path pluggable. However, though the store is pluggable, LevelDB and HBase are relatively similar to each other compared with the HTTP REST vs HDFS/Kafka pair. 
The more important thing is that it's more difficult to manage different writing channels than to manage different stores, because one is client-side and the other is server-side. On the server side, the YARN cluster operator has full control of the servers and a limited set of hosts to deal with. On the client side, the YARN cluster operator may not have access to it, and doesn't know how many clients and how many types of apps he/she needs to deal with. TimelineClient is a generic tool (not for a particular application such as Spark), such that it's good to make it lightweight and portable. Again, it's still a cost/performance question. bq. Though as bc pointed out before, it’s fine for more experienced users to use HBase, but “regular” users should have a solution as well that is hopefully more scalable and reliable than LevelDB. Right, and this is also my concern about the HDFS/Kafka channel, in particular using it as a default. Regular users may not be experienced enough about HBase as well as HDFS/Kafka. It very much depends on the users and the use cases. [~bcwalrus] and [~rkanter], thanks for putting new ideas into the timeline service. In general, the timeline service is still a young project. We have different problems to solve and multiple ways to solve them. An additional writing channel is interesting, while given the whole roadmap
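The batching point above (TimelineClient#putEntities) can be made concrete with a hedged illustration: putEntities is a varargs call, so a client can accumulate N entities locally and publish them in one REST request instead of one request per event. The entity and event type names here are made up for the example.
{code}
// Hedged sketch of batched publishing via the existing REST client.
void publishBatch(Configuration conf) throws IOException, YarnException {
  TimelineClient client = TimelineClient.createTimelineClient();
  client.init(conf);
  client.start();
  try {
    TimelineEntity[] batch = new TimelineEntity[100];
    for (int i = 0; i < batch.length; i++) {
      TimelineEntity entity = new TimelineEntity();
      entity.setEntityType("MY_APP_EVENT");            // illustrative type name
      entity.setEntityId("event_" + i);
      TimelineEvent event = new TimelineEvent();
      event.setEventType("TASK_FINISHED");             // illustrative event type
      event.setTimestamp(System.currentTimeMillis());
      entity.addEvent(event);
      batch[i] = entity;
    }
    client.putEntities(batch);                         // one HTTP request for 100 entities
  } finally {
    client.stop();
  }
}
{code}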
[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143039#comment-14143039 ] Zhijie Shen commented on YARN-2556: --- [~jeagles], it sounds like interesting work. Is it possible to see the throughput difference between TimelineDataManager and the web front interface? I suspect the web front interface is going to be the bottleneck that throttles end-to-end performance. With this analysis, we can have a clearer picture of the reasonable number of timeline server instances required to get rid of the web front interface bottleneck (YARN-2520). Tool to measure the performance of the timeline server -- Key: YARN-2556 URL: https://issues.apache.org/jira/browse/YARN-2556 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: chang li We need to be able to understand the capacity model for the timeline server to give users the tools they need to deploy a timeline server with the correct capacity. I propose we create a mapreduce job that can measure timeline server write and read performance. Transactions per second and I/O for both read and write would be a good start. This could be done as an example or test job that could be tied into gridmix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
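A hedged sketch of the kind of write-path measurement such a job could take: time a fixed number of putEntities calls through the REST front end and report transactions per second. Here {{client}} is assumed to be a started TimelineClient (as in the batching sketch earlier), and the entity shape is illustrative.
{code}
// Hedged sketch: single-threaded write throughput measurement.
int numPuts = 1000;
long start = System.nanoTime();
for (int i = 0; i < numPuts; i++) {
  TimelineEntity entity = new TimelineEntity();
  entity.setEntityType("PERF_TEST");           // illustrative type
  entity.setEntityId("entity_" + i);
  entity.setStartTime(System.currentTimeMillis());
  client.putEntities(entity);                  // one REST round trip per entity
}
double seconds = (System.nanoTime() - start) / 1e9;
System.out.printf("write TPS: %.1f%n", numPuts / seconds);
{code}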
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143100#comment-14143100 ] Hudson commented on YARN-2452: -- FAILURE: Integrated in Hadoop-Yarn-trunk #688 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/688/]) YARN-2452. TestRMApplicationHistoryWriter fails with FairScheduler. (Zhihai Xu via kasha) (kasha: rev c50fc92502934aa2a8f84ea2466d4da1e3eace9d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/TestRMApplicationHistoryWriter.java TestRMApplicationHistoryWriter fails with FairScheduler --- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 Attachments: YARN-2452.000.patch, YARN-2452.001.patch, YARN-2452.002.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:1 but was:200 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143099#comment-14143099 ] Hudson commented on YARN-2453: -- FAILURE: Integrated in Hadoop-Yarn-trunk #688 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/688/]) YARN-2453. TestProportionalCapacityPreemptionPolicy fails with FairScheduler. (Zhihai Xu via kasha) (kasha: rev 9721e2c1feb5aecea3a6dab5bda96af1cd0f8de3) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java TestProportionalCapacityPreemptionPolicy fails with FairScheduler - Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 Attachments: YARN-2453.000.patch, YARN-2453.001.patch, YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler. The following is error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for capacity scheduler because the following source code in ResourceManager.java prove it will only work for capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is instance of PreemptableResourceScheduler and FairScheduler is not instance of PreemptableResourceScheduler. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2551) Windows Secure Container Executor: Add checks to validate that the wsce-site.xml is write restricted to Administrators only
[ https://issues.apache.org/jira/browse/YARN-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2551: --- Attachment: YARN-2551.1.patch Windows Secure Container Executor: Add checks to validate that the wsce-site.xml is write restricted to Administrators only --- Key: YARN-2551 URL: https://issues.apache.org/jira/browse/YARN-2551 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows, wsce Attachments: YARN-2551.1.patch The wsce-site.xml contains the impersonate.allowed and impersonate.denied keys that restrict/control the users that can be impersonated by the WSCE containers. The impersonation frameworks in winutils should validate that only Administrators have write control on this file. This is similar to how LCE validates that only root has write permissions on the container-executor.cfg file on secure Linux clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
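An illustrative Java sketch of the kind of validation described above; the real check lives in native winutils code, so this is only a model of the logic. The trusted group/account names and the exact permission set are assumptions for the example.
{code}
// Hypothetical sketch: fail if any ACL entry grants write access to a
// principal other than the Administrators group or SYSTEM.
static void validateAdminOnlyWrite(java.nio.file.Path wsceSite) throws IOException {
  Set<AclEntryPermission> writePerms = EnumSet.of(
      AclEntryPermission.WRITE_DATA, AclEntryPermission.APPEND_DATA,
      AclEntryPermission.WRITE_ACL, AclEntryPermission.WRITE_OWNER);
  AclFileAttributeView view =
      Files.getFileAttributeView(wsceSite, AclFileAttributeView.class);
  for (AclEntry entry : view.getAcl()) {
    boolean grantsWrite = entry.type() == AclEntryType.ALLOW
        && !Collections.disjoint(entry.permissions(), writePerms);
    boolean trusted = entry.principal().getName().endsWith("\\Administrators")  // assumed names
        || entry.principal().getName().equals("NT AUTHORITY\\SYSTEM");
    if (grantsWrite && !trusted) {
      throw new IOException("wsce-site.xml is writable by " + entry.principal());
    }
  }
}
{code}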
[jira] [Resolved] (YARN-2551) Windows Secure Container Executor: Add checks to validate that the wsce-site.xml is write restricted to Administrators only
[ https://issues.apache.org/jira/browse/YARN-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu resolved YARN-2551. Resolution: Implemented The patch will be contained in YARN-2198 patch 10 and forward Windows Secure Container Executor: Add checks to validate that the wsce-site.xml is write restricted to Administrators only --- Key: YARN-2551 URL: https://issues.apache.org/jira/browse/YARN-2551 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows, wsce Attachments: YARN-2551.1.patch The wsce-site.xml contains the impersonate.allowed and impersonate.denied keys that restrict/control the users that can be impersonated by the WSCE containers. The impersonation frameworks in winutils should validate that only Administrators have write control on this file. This is similar to how LCE validates that only root has write permissions on the container-executor.cfg file on secure Linux clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143158#comment-14143158 ] Tsuyoshi OZAWA commented on YARN-2312: -- [~vinodkv], [~jianhe] do you have any feedback? [~jlowe], I would appreciate it if you could give us comments about WrappedJvmID. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch {{ContainerId#getId}} will only return a partial value of the containerId (only the sequence number of the container id, without the epoch) after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2579) Both RM's state is Active, but 1 RM is not really active.
Rohith created YARN-2579: Summary: Both RM's state is Active, but 1 RM is not really active. Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith I encountered a situation where both RMs' web pages were accessible and their state was displayed as Active, but one RM's ActiveServices was stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2579) Both RM's state is Active, but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143200#comment-14143200 ] Rohith commented on YARN-2579: -- This scenario can occur if 2 threads try to access ResourceManager#transitionToStandby(): one from AdminService#transitionToStandby() first and then RMFatalEventDispatcher#transitionToStandBy(). I simulated this using debug points. The main problem is that resetting the dispatcher stops the dispatcher. If AdminService is stopping the dispatcher but the dispatcher thread is blocked trying to acquire the lock on ResourceManager, then the ResourceManager never gets transitioned to StandBy. It waits infinitely.
{code}
"AsyncDispatcher event handler" daemon prio=10 tid=0x007ea000 nid=0x39d1 waiting for monitor entry [0x7fe0a77f6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:976)
	- waiting to lock <0xc1f7d438> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:701)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:678)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
	at java.lang.Thread.run(Thread.java:745)

"IPC Server handler 0 on 45021" daemon prio=10 tid=0x7fe0a9026800 nid=0x30ab in Object.wait() [0x7fe0a7cfa000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0xeb3310e8> (a java.lang.Thread)
	at java.lang.Thread.join(Thread.java:1281)
	- locked <0xeb3310e8> (a java.lang.Thread)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
	at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
	- locked <0xeb32fef8> (a java.lang.Object)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.resetDispatcher(ResourceManager.java:1166)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:987)
	- locked <0xc1f7d438> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:308)
	- locked <0xc2038d10> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
	at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToStandby(HAServiceProtocolServerSideTranslatorPB.java:119)
	at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4462)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}
Both RM's state is Active, but 1 RM is not really active. 
-- Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith I encountered a situation where both RMs' web pages were accessible and their state was displayed as Active, but one RM's ActiveServices was stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
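The thread dump above shows a classic lock-ordering deadlock: the IPC handler holds the ResourceManager monitor inside a synchronized transitionToStandby() and then joins the dispatcher thread, while the dispatcher thread is handling an RMFatalEvent and is blocked trying to enter the same synchronized method. The following self-contained Java program is a minimal, hypothetical illustration of that pattern (not RM code); running it hangs forever, which is the point.
{code}
// Hypothetical demo of the hang: hold a lock, join a thread that needs the same lock.
public class TransitionDeadlockDemo {
  private final Object rmLock = new Object();   // stands in for the ResourceManager monitor
  private volatile Thread dispatcher;

  void startDispatcher() {
    dispatcher = new Thread(() -> {
      sleepQuietly(100);          // dispatcher picks up a "fatal event"...
      transitionToStandby();      // ...and blocks waiting for rmLock
    }, "AsyncDispatcher event handler");
    dispatcher.start();
  }

  void transitionToStandby() {
    synchronized (rmLock) {
      // resetDispatcher(): stop the dispatcher and wait for its thread to exit
      try {
        dispatcher.join();        // never returns: dispatcher is blocked on rmLock
      } catch (InterruptedException ignored) { }
    }
  }

  static void sleepQuietly(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
  }

  public static void main(String[] args) {
    TransitionDeadlockDemo demo = new TransitionDeadlockDemo();
    demo.startDispatcher();
    demo.transitionToStandby();   // the AdminService path; hangs forever
  }
}
{code}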
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143235#comment-14143235 ] Hudson commented on YARN-2453: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1879 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1879/]) YARN-2453. TestProportionalCapacityPreemptionPolicy fails with FairScheduler. (Zhihai Xu via kasha) (kasha: rev 9721e2c1feb5aecea3a6dab5bda96af1cd0f8de3) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/CHANGES.txt TestProportionalCapacityPreemptionPolicy fails with FairScheduler - Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 Attachments: YARN-2453.000.patch, YARN-2453.001.patch, YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler. The following is error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for capacity scheduler because the following source code in ResourceManager.java prove it will only work for capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is instance of PreemptableResourceScheduler and FairScheduler is not instance of PreemptableResourceScheduler. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143236#comment-14143236 ] Hudson commented on YARN-2452: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1879 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1879/]) YARN-2452. TestRMApplicationHistoryWriter fails with FairScheduler. (Zhihai Xu via kasha) (kasha: rev c50fc92502934aa2a8f84ea2466d4da1e3eace9d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/TestRMApplicationHistoryWriter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerConfiguration.java * hadoop-yarn-project/CHANGES.txt TestRMApplicationHistoryWriter fails with FairScheduler --- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 Attachments: YARN-2452.000.patch, YARN-2452.001.patch, YARN-2452.002.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:1 but was:200 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2580) Windows Secure Container Executor: grant job query privileges to the container user
[ https://issues.apache.org/jira/browse/YARN-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu reassigned YARN-2580: -- Assignee: Remus Rusanu Windows Secure Container Executor: grant job query privileges to the container user --- Key: YARN-2580 URL: https://issues.apache.org/jira/browse/YARN-2580 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu mapred.MapTask.initialize uses WindowsBasedProcessTree, which uses winutils to query the container NT JOB. This must be granted query permission by the hadoopwinutilsvc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2580) Windows Secure Container Executor: grant job query privileges to the container user
Remus Rusanu created YARN-2580: -- Summary: Windows Secure Container Executor: grant job query privileges to the container user Key: YARN-2580 URL: https://issues.apache.org/jira/browse/YARN-2580 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu mapred.MapTask.initialize uses WindowsBasedProcessTree, which uses winutils to query the container NT JOB. This must be granted query permission by the hadoopwinutilsvc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143301#comment-14143301 ] Wangda Tan commented on YARN-1198: -- Hi [~cwelch], Sorry for the late response. I've just looked at your ver.8 patch and comments. My reply: bq. -re we don't need write HeadroomProvider for each scheduler And bq. Provider vs Reference I agree with this. I think we need to write different HeadroomProviders, and it's better to keep Provider since it's more general. bq. -re As mentioned by Jason, currently ... Agree, this can be done in a separate JIRA. bq. -re the cost of the calculation Agree, it's just a small computation effort. In the past, I suggested doing as I mentioned in https://issues.apache.org/jira/browse/YARN-1198?focusedCommentId=14108991&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14108991 because I thought that would make the code cleaner. But according to your ver.8 patch, I realized that may not be doable. In LeafQueue#computeUserLimit, it uses required to get the user limit. In your patch, you save the lastRequired to the user class. However, we need a different required for different apps under the same user. We can only do the calculation when the app heartbeats (we could also loop and set every app's headroom, but that's an approach we abandoned before). So basically, IMHO, I think your ver.7 is the more correct way to go, which keeps complexity/efficiency balanced. Any thoughts? [~jianhe], [~cwelch]. Wangda Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Craig Welch Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, YARN-1198.8.patch Today headroom calculation (for the app) takes place only when * a new node is added/removed from the cluster * a new container is getting assigned to the application. However there are potentially a lot of situations which are not considered for this calculation: * If a container finishes then the headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to either application app1/app2 then both AMs should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted in the same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also, today headroom is an absolute number (I think it should be normalized, but then this is not going to be backward compatible..) * Also, when the admin user refreshes queues, headroom has to be updated. These are all potential bugs in headroom calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
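A minimal, hypothetical sketch of the "HeadroomProvider" idea discussed above: rather than caching a stale headroom value per application, recompute it from shared queue/user state when the app heartbeats. The class names, fields, and the formula are illustrative, not the actual CapacityScheduler code.
{code}
// Hypothetical model: QueueState and UserState stand in for the scheduler's shared state.
class QueueState { Resource maxCapacity; Resource used; }
class UserState  { Resource userLimit;   Resource consumed; }

interface HeadroomProvider {
  Resource getHeadroom();            // evaluated lazily, at allocate/heartbeat time
}

class UserLimitHeadroomProvider implements HeadroomProvider {
  private final QueueState queue;    // shared, mutable queue-level state
  private final UserState user;      // shared, mutable per-user state

  UserLimitHeadroomProvider(QueueState queue, UserState user) {
    this.queue = queue;
    this.user = user;
  }

  @Override
  public Resource getHeadroom() {
    // headroom = min(userLimit - userConsumed, queueMaxCapacity - queueUsed)
    Resource fromUser  = Resources.subtract(user.userLimit, user.consumed);
    Resource fromQueue = Resources.subtract(queue.maxCapacity, queue.used);
    return Resources.componentwiseMin(fromUser, fromQueue);
  }
}
{code}
Because every app under the same user shares the provider's underlying state, a container finishing for app1 is automatically reflected the next time app2 asks for its headroom, without looping over all applications.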
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143386#comment-14143386 ] Hudson commented on YARN-2453: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1904 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1904/]) YARN-2453. TestProportionalCapacityPreemptionPolicy fails with FairScheduler. (Zhihai Xu via kasha) (kasha: rev 9721e2c1feb5aecea3a6dab5bda96af1cd0f8de3) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/CHANGES.txt TestProportionalCapacityPreemptionPolicy fails with FairScheduler - Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 Attachments: YARN-2453.000.patch, YARN-2453.001.patch, YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler. The following is error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for capacity scheduler because the following source code in ResourceManager.java prove it will only work for capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is instance of PreemptableResourceScheduler and FairScheduler is not instance of PreemptableResourceScheduler. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143387#comment-14143387 ] Hudson commented on YARN-2452: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1904 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1904/]) YARN-2452. TestRMApplicationHistoryWriter fails with FairScheduler. (Zhihai Xu via kasha) (kasha: rev c50fc92502934aa2a8f84ea2466d4da1e3eace9d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/TestRMApplicationHistoryWriter.java TestRMApplicationHistoryWriter fails with FairScheduler --- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 Attachments: YARN-2452.000.patch, YARN-2452.001.patch, YARN-2452.002.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:1 but was:200 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143401#comment-14143401 ] bc Wong commented on YARN-1530: --- Hi [~zjshen]. First, glad to see that we're discussing approaches. You seem to agree with the premise that *ATS write path should not slow down apps*. bq. Therefore, is making the timeline server reliable (or always-up) the essential solution? If the timeline server is reliable, ... In theory, you can make the ATS *always-up*. In practice, we both know what real life distributed systems do. Always-up isn't the only thing. The write path needs to have good uptime and latency regardless of what's happening to the read path or the backing store. HDFS is a good default for the write channel because: * We don't have to design an ATS that is always-up. If you really want to, I'm sure you can eventually build something with good uptime. But it took other projects (HDFS, ZK) lots of hard work to get to that point. * If we reuse HDFS, cluster admins know how to operate HDFS and get good uptime from it. But it'll take training and hard-learned lessons for operators to figure out how to get good uptime from ATS, even after you build an always-up ATS. * All the popular YARN app frameworks (MR, Spark, etc.) already rely on HDFS by default. So do most of the 3rd party applications that I know of. Architecturally, it seems easier for admins to accept that ATS write path depends on HDFS for reliability, instead of a new component that (we claim) will be as reliable as HDFS/ZK. bq. given the whole roadmap of the timeline service, let's think critically of work that can improve the timeline service most significantly. Exactly. Strong +1. If we can drop the high uptime + low write latency requirement from the ATS service, we can avoid tons of effort. ATS doesn't need to be as reliable as HDFS. We don't need to worry about insulating the write path from the read path. We don't need to worry about occasional hiccups in HBase (or whatever the store is). And at the end of all this, everybody sleeps better because ATS service going down isn't a catastrophic failure. [Umbrella] Store, manage and serve per-framework application-timeline data -- Key: YARN-1530 URL: https://issues.apache.org/jira/browse/YARN-1530 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, application timeline design-20140116.pdf, application timeline design-20140130.pdf, application timeline design-20140210.pdf This is a sibling JIRA for YARN-321. Today, each application/framework has to do store, and serve per-framework data all by itself as YARN doesn't have a common solution. This JIRA attempts to solve the storage, management and serving of per-framework data from various applications, both running and finished. The aim is to change YARN to collect and store data in a generic manner with plugin points for frameworks to do their own thing w.r.t interpretation and serving. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2581) NMs need to find a way to get LogAggregationContext
Xuan Gong created YARN-2581: --- Summary: NMs need to find a way to get LogAggregationContext Key: YARN-2581 URL: https://issues.apache.org/jira/browse/YARN-2581 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong After YARN-2569, we have a LogAggregationContext for the application in ApplicationSubmissionContext. NMs need to find a way to get this information. We have this requirement: all containers in the same application should honor the same LogAggregationContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143460#comment-14143460 ] Allen Wittenauer commented on YARN-913: --- bq. Summary: need to fix ZK client and then have curator configure it, so the rest of us don't have to care. This might be a blocker then. If a client needs to talk to more than one ZK, it sounds like they are basically screwed. bq. do you mean in the endpoint fields? It should ... let me clarify that in the example. I was mainly looking at the hostname pattern:
{code}
+  String HOSTNAME_PATTERN =
+      "([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9])";
{code}
It doesn't appear to support periods/dots. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
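To make the "no periods/dots" observation concrete, here is a small, self-contained Java check. The COMPONENT string is the pattern quoted from the patch; the FQDN variant (a dot-separated repetition of that component) is only an illustrative suggestion, not code from the patch.
{code}
import java.util.regex.Pattern;

public class HostnamePatternCheck {
  // pattern as quoted from the patch
  static final String COMPONENT = "([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9])";
  // illustrative dot-separated variant
  static final String FQDN = COMPONENT + "(\\." + COMPONENT + ")*";

  public static void main(String[] args) {
    System.out.println(Pattern.matches(COMPONENT, "node01.example.com")); // false: dots rejected
    System.out.println(Pattern.matches(FQDN, "node01.example.com"));      // true
  }
}
{code}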
[jira] [Created] (YARN-2582) Log related CLI and Web UI changes for LRS
Xuan Gong created YARN-2582: --- Summary: Log related CLI and Web UI changes for LRS Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong After YARN-2468, we have changed the log layout to support log aggregation for Long Running Services. The log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143473#comment-14143473 ] Jian He commented on YARN-1372: --- +1 for the latest patch, committing Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.002_NMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, YARN-1372.004.patch, YARN-1372.005.patch, YARN-1372.005.patch, YARN-1372.006.patch, YARN-1372.007.patch, YARN-1372.008.patch, YARN-1372.009.patch, YARN-1372.009.patch, YARN-1372.010.patch, YARN-1372.prelim.patch, YARN-1372.prelim2.patch Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
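The protocol described in the YARN-1372 summary (NM keeps reported completions until the RM confirms the AM has pulled them, and replays the whole set on re-registration) can be sketched with a minimal, hypothetical NM-side tracker. This is an illustration of the bookkeeping only, not the actual patch, and the types are simplified to plain strings.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CompletedContainerTracker {
  // containerId -> completion status, kept until the RM acks the AM pull
  private final Map<String, String> unacked = new ConcurrentHashMap<>();

  void containerCompleted(String containerId, String status) {
    unacked.put(containerId, status);
  }

  // included in every heartbeat, and replayed in full on re-registration after RM restart
  List<String> containersToReport() {
    return new ArrayList<>(unacked.keySet());
  }

  // called when the RM reports that the AM has pulled these completions
  void ackPulledByAM(List<String> containerIds) {
    containerIds.forEach(unacked::remove);
  }
}
{code}
Duplicate reports are possible by design: if the RM dies after the AM pulls but before the ack reaches the NM, the completion is simply reported again, which matches the "may be reported more than once" note in the description.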
[jira] [Created] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
Xuan Gong created YARN-2583: --- Summary: Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Currently, AggregatedLogDeletionService will delete old logs from HDFS. It will directly delete the app-log-dir from HDFS. This will not work for LRS. We expect an LRS application to keep running for a long time. Deleting the app-log-dir for LRS applications is not the right way to handle it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong reassigned YARN-2583: --- Assignee: Xuan Gong Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Currently, AggregatedLogDeletionService will delete old logs from HDFS. It will directly delete the app-log-dir from HDFS. This will not work for LRS. We expect an LRS application to keep running for a long time. Deleting the app-log-dir for LRS applications is not the right way to handle it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
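A hedged sketch of the direction described above (not the actual YARN-2583 patch): for a long-running application, delete only individual rolled log files older than the retention window instead of removing the whole app-log-dir. The flat directory layout (rolled log files directly under the app-log-dir) is an assumption for the example.
{code}
// Hedged sketch using the Hadoop FileSystem API.
void deleteOldLogsForRunningApp(FileSystem fs, Path appLogDir, long retentionMs)
    throws IOException {
  long cutoff = System.currentTimeMillis() - retentionMs;
  for (FileStatus logFile : fs.listStatus(appLogDir)) {
    if (logFile.isFile() && logFile.getModificationTime() < cutoff) {
      fs.delete(logFile.getPath(), false);   // delete old rolled files, keep the app-log-dir itself
    }
  }
}
{code}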
[jira] [Commented] (YARN-2554) Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy
[ https://issues.apache.org/jira/browse/YARN-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143495#comment-14143495 ] Steve Loughran commented on YARN-2554: -- Vinod, this patch is independent of kerberos, secure AMs, etc. This patch is to allow any AM to export an HTTPS URL; you can't do this on a secure or insecure cluster today. It doesn't mean that clients can trust something just because it is on HTTPS; that's an independent issue. Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy - Key: YARN-2554 URL: https://issues.apache.org/jira/browse/YARN-2554 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.6.0 Reporter: Jonathan Maron Attachments: YARN-2554.1.patch, YARN-2554.2.patch, YARN-2554.3.patch, YARN-2554.3.patch If the HTTP policy to enable HTTPS is specified, the RM and AM are initialized with SSL listeners. The RM has a web app proxy servlet that acts as a proxy for incoming AM requests. In order to forward the requests to the AM, the proxy servlet makes use of HttpClient. However, the HttpClient utilized is not initialized correctly with the necessary certs to allow for successful one way SSL invocations to the other nodes in the cluster (it is not configured to access/load the client truststore specified in ssl-client.xml). I imagine SSLFactory.createSSLSocketFactory() could be utilized to create an instance that can be assigned to the HttpClient. The symptoms of this issue are: AM: Displays unknown_certificate exception RM: Displays an exception such as javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target -- This message was sent by Atlassian JIRA (v6.3.4#6332)
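A hedged sketch of the SSLFactory idea mentioned in the description: build a client SSLSocketFactory from ssl-client.xml (truststore included) and hand it to whatever HTTP client the proxy servlet uses. Only the factory creation is shown; the HttpClient wiring itself is elided and would depend on the HttpClient version in use.
{code}
// Hedged sketch: create a client-mode socket factory from Hadoop's SSL config.
SSLFactory sslFactory = new SSLFactory(SSLFactory.Mode.CLIENT, new Configuration());
sslFactory.init();                                          // loads keystores/truststores per ssl-client.xml
SSLSocketFactory socketFactory = sslFactory.createSSLSocketFactory();
// ... configure the proxy's HttpClient to use socketFactory for https:// connections ...
sslFactory.destroy();                                       // when the proxy shuts down
{code}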
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143499#comment-14143499 ] Xuan Gong commented on YARN-2468: - This is a very big patch and is hard to review. I would like to split the big patch into several smaller patches: 1) API changes will be tracked by YARN-2569 2) NMs need to find a way to get LogAggregationContext. This will be tracked by YARN-2581 3) The current ticket will be used to track changes for NM handling of the logs for LRS, which will include the log layout changes 4) Log Deletion Service changes will be tracked by YARN-2583 5) Related CLI and web UI changes will be tracked by YARN-2582 Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch Currently, when an application finishes, the NM starts log aggregation. But for long-running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2562) ContainerId@toString() is unreadable for epoch > 0 after YARN-2182
[ https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143567#comment-14143567 ] Jian He commented on YARN-2562: --- Patch looks good, thanks Tsuyoshi! Could you add a brief comment in the toString method that the epoch will increase if the RM restarts or fails over? ContainerId@toString() is unreadable for epoch > 0 after YARN-2182 - Key: YARN-2562 URL: https://issues.apache.org/jira/browse/YARN-2562 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-2562.1.patch ContainerID string format is unreadable for RMs that restarted at least once (epoch > 0) after YARN-2182. E.g., container_1410901177871_0001_01_05_17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143594#comment-14143594 ] Zhijie Shen commented on YARN-1530: --- Hi, [~bcwalrus]. Thanks for your further comments. bq. You seem to agree with the premise that ATS write path should not slow down apps. Definitely. The arguable point is whether the current timeline client is going to slow down the app, given we have a scalable and reliable timeline server. bq. If we can drop the high uptime + low write latency requirement from the ATS service, we can avoid tons of effort. I'm not sure such fundamental requirements can be dropped from the timeline service. Projecting into the future, scalable and highly available timeline servers have multiple benefits and enable different use cases. For example, 1. We can use it to serve realtime or near-realtime data, such that we can go to the timeline server to see what is happening to an application. It's particularly useful for long-running services, which never shut down. 2. We can build checkpoints on the timeline server for the app to do recovery once it crashes. It's pretty much like what we've done for MR jobs. I bundled scalable and reliable together because a multiple-instance solution will improve the timeline server in both dimensions. Moreover, no matter how scalable and reliable the channel could be, we eventually want to get the timeline data accommodated into the timeline server, right? Otherwise, it is not going to be accessible by users (of course, tricks can be played to fetch it directly from HDFS, but that's a completely different story from the timeline server). If the apps are publishing 10GB of data per hour, while the server can only process 1GB per hour, the 9GB of outstanding data per hour that resides in some temp location on HDFS is going to be useless writes. We have narrowed the discussion down to the reliability of the write path, but if we look at the big picture, *the timeline server is not just a place to store data, but also serves it to users* (e.g., YARN-2513). In terms of use cases, users may want to monitor completed apps as well as running apps and the cluster. If the timeline server doesn't have the capacity to serve the data for a particular use case, the cost spent on aggregating it is actually wasted. IMHO, a scalable and reliable timeline server is going to be *the eventual solution to satisfy multiple stakeholders*, regardless of whether the use case is read intensive, write intensive, or both. That's why I think improving the timeline server could be high-margin work. It may be hard work, but we should definitely pick it up. [Umbrella] Store, manage and serve per-framework application-timeline data -- Key: YARN-1530 URL: https://issues.apache.org/jira/browse/YARN-1530 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, application timeline design-20140116.pdf, application timeline design-20140130.pdf, application timeline design-20140210.pdf This is a sibling JIRA for YARN-321. Today, each application/framework has to store and serve per-framework data all by itself as YARN doesn't have a common solution. This JIRA attempts to solve the storage, management and serving of per-framework data from various applications, both running and finished.
The aim is to change YARN to collect and store data in a generic manner with plugin points for frameworks to do their own thing w.r.t interpretation and serving. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2583: Description: Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time; if all logs for an application are older than the cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS. We expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configured the rollingIntervalSeconds, new log files will always be uploaded to HDFS. The number of log files for this application will become larger and larger, and no log files will ever be deleted. 2) If we did not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application is finished. It is very possible that the logs are uploaded after the cut-off time, which will cause a problem because at that time the app-log-dir for this application in HDFS has already been deleted. was:Currently, AggregatedLogDeletionService will delete old logs from HDFS. It will directly delete the app-log-dir from HDFS. This will not work for LRS. We expect a LRS application can keep running for a long time. Deleting the app-log-dir for the LRS applications is not a right way to handle it. Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time; if all logs for an application are older than the cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS. We expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configured the rollingIntervalSeconds, new log files will always be uploaded to HDFS. The number of log files for this application will become larger and larger, and no log files will ever be deleted. 2) If we did not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application is finished. It is very possible that the logs are uploaded after the cut-off time, which will cause a problem because at that time the app-log-dir for this application in HDFS has already been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
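As a rough illustration of the per-file deletion the updated description implies for running LRS applications (the whole app-log-dir is only dropped once the application has finished), here is a hedged sketch; the class, method and parameter names are made up for the example and are not taken from the actual patch:
{code}
// Sketch of LRS-aware log deletion: delete only rolled log files past the
// cut-off for still-running apps, keep the existing whole-directory removal
// for finished apps.
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LrsLogDeletionSketch {
  public static void deleteOldLogFiles(FileSystem fs, Path appLogDir,
      long cutoffMillis, boolean appStillRunning) throws IOException {
    if (!appStillRunning) {
      // finished apps keep the existing behaviour: drop the whole directory
      fs.delete(appLogDir, true);
      return;
    }
    // running (LRS) apps: only remove individual rolled files past the cut-off
    for (FileStatus file : fs.listStatus(appLogDir)) {
      if (!file.isDirectory() && file.getModificationTime() < cutoffMillis) {
        fs.delete(file.getPath(), false);
      }
    }
  }
}
{code}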
[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143665#comment-14143665 ] Craig Welch commented on YARN-2494: --- The other day [~vinodkv] suggested changing the addLabel, removeLabel, ... APIs to addNodeLabel, removeNodeLabel, ... to make them clearer (and presumably make it smoother to add other possible types of labels in the future). This would not affect the label APIs, and the node-to-label ones are OK already, I think. Thoughts? [YARN-796] Node label manager API and storage implementations - Key: YARN-2494 URL: https://issues.apache.org/jira/browse/YARN-2494 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch This JIRA includes APIs and storage implementations of the node label manager. NodeLabelManager is an abstract class used to manage labels of nodes in the cluster; it has APIs to query/modify - Nodes according to a given label - Labels according to a given hostname - Add/remove labels - Set labels of nodes in the cluster - Persist/recover changes of labels/labels-on-nodes to/from storage And it has two implementations to store modifications - Memory based storage: It will not persist changes, so all labels will be lost when the RM restarts - FileSystem based storage: It will persist/recover to/from a FileSystem (like HDFS), and all labels and labels-on-nodes will be recovered upon RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
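If it helps to picture the rename being proposed, a tiny illustrative interface is below; the method names and signatures are ours, not the ones in the YARN-2494 patches:
{code}
// Illustrative only: the rename discussed above, from addLabel/removeLabel to
// addNodeLabel/removeNodeLabel, leaving the node-to-label methods untouched.
import java.util.Set;

public interface NodeLabelManagerSketch {
  void addNodeLabels(Set<String> labels) throws Exception;
  void removeNodeLabels(Set<String> labels) throws Exception;

  // node-to-label mappings keep their existing, already unambiguous names
  void setLabelsOnNode(String host, Set<String> labels) throws Exception;
  Set<String> getLabelsOnNode(String host);
}
{code}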
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143699#comment-14143699 ] Jason Lowe commented on YARN-2312: -- bq. One idea is to add id for upper 32 bits of container Id to ID class. The ID class is used by much more than just JvmID objects. I'm not a fan of making all IDs pay for this extra storage when we only need it for this one case. I'd rather store the extra bits in JvmID. Actually I don't think it's critical that JvmID derives from ID. We could have JvmID store the long itself rather than try to hack an extra 4-bytes onto ID and then need to explain why JvmID.getId doesn't do what one would expect. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
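A toy sketch of the alternative Jason describes - carrying the full 64-bit container id in the JVM id type itself instead of widening the shared ID base class. The class and method names below are hypothetical, not the real org.apache.hadoop.mapred types:
{code}
// Sketch: a JvmID-like type that stores the long itself, so the epoch bits
// from ContainerId#getContainerId() are preserved without touching ID.
public class JvmIdSketch {
  private final String jobId;
  private final long containerId; // full value from ContainerId#getContainerId()

  public JvmIdSketch(String jobId, long containerId) {
    this.jobId = jobId;
    this.containerId = containerId;
  }

  // the full 64-bit id, including the epoch bits
  public long getFullContainerId() {
    return containerId;
  }

  @Override
  public String toString() {
    return "jvm_" + jobId + "_" + containerId;
  }
}
{code}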
[jira] [Updated] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2320: -- Attachment: YARN-2320.1.patch Removing old application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2320.1.patch After YARN-2033, we should deprecate application history store set. There's no need to maintain two sets of store interfaces. In addition, we should conclude the outstanding jira's under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143783#comment-14143783 ] Zhijie Shen commented on YARN-2320: --- Uploaded a huge patch. It doesn't have complex logic; it mostly removes the old application history store stack, including: 1. Null|Memory|FileSystemApplicationHistoryStore, the related protobuf classes, and the ApplicationHistoryManagerImpl based on them. 2. RMApplicationHistoryWriter, the events used by it, and its invocations within the RM. 3. Unnecessary configurations in YarnConfiguration. In addition, I've fixed the test cases based on ApplicationHistoryStore, and renamed ApplicationHistoryManagerOnTimelineStore to ApplicationHistoryManagerImpl. Removing old application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2320.1.patch After YARN-2033, we should deprecate application history store set. There's no need to maintain two sets of store interfaces. In addition, we should conclude the outstanding jira's under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2540) Fair Scheduler : queue filters not working on scheduler page in RM UI
[ https://issues.apache.org/jira/browse/YARN-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143793#comment-14143793 ] Ashwin Shankar commented on YARN-2540: -- Hi [~kasha], when you get a chance can you please review/commit the latest patch ? Fair Scheduler : queue filters not working on scheduler page in RM UI - Key: YARN-2540 URL: https://issues.apache.org/jira/browse/YARN-2540 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0, 2.5.1 Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: YARN-2540-v1.txt, YARN-2540-v2.txt, YARN-2540-v3.txt Steps to reproduce : 1. Run an app in default queue. 2. While the app is running, go to the scheduler page on RM UI. 3. You would see the app in the apptable at the bottom. 4. Now click on default queue to filter the apptable on root.default. 5. App disappears from apptable although it is running on default queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2320: -- Attachment: YARN-2320.2.patch Remove the unnecessary proto file as well Removing old application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2320.1.patch, YARN-2320.2.patch After YARN-2033, we should deprecate application history store set. There's no need to maintain two sets of store interfaces. In addition, we should conclude the outstanding jira's under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) [YARN-796] Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143855#comment-14143855 ] Craig Welch commented on YARN-2505: --- 1) -re rename all-nodes-to-labels to nodes-to-labels - done -re node-filter, I don't think that it makes sense to switch it. While code-wise I see where it is awkward to do a value filter, this follows the spec and it makes sense from a use-case perspective - I expect that the desire is to find all of the nodes which have a particular label on them; that is the purpose of this filter, it makes sense to me that someone would want to do that, and it seems to fit in with this API. I think there are easier ways to see what labels are on a node; adding it as a filter to this kind of API call makes little sense to me anyway, as it is more-or-less a direct property of a node - if it's missing, I think it belongs elsewhere anyway. I have shortened lines where found. [YARN-796] Support get/add/remove/change labels in RM REST API -- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2505) [YARN-796] Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2505: -- Attachment: YARN-2505.1.patch [YARN-796] Support get/add/remove/change labels in RM REST API -- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143890#comment-14143890 ] Karthik Kambatla commented on YARN-2578: Would it be possible to add a test case for this? NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Attachments: YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. The NM should than re-register with the new active RM. This re-register takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment: - create a cluster with 3 nodes node 1: ZK, NN, JN, ZKFC, DN, RM, NM node 2: ZK, NN, JN, ZKFC, DN, RM, NM node 3: ZK, JN, DN, NM - start all services make sure they are in good health - kill the network connection of the RM that is active using one of the network kills from above - observe the NN and RM failover - the DN's fail over to the new active NN - the NM does not recover for a long time - the logs show a long delay and traces show no change at all The stack traces of the NM all show the same set of threads. The main thread which should be used in the re-register is the Node Status Updater This thread is stuck in: {code} Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.ipc.Client.call(Client.java:1395) - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1362) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) {code} The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
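For the last point in the description (the resource tracker proxy should be created through a variant that takes the RPC timeout from configuration), a rough sketch of the kind of call involved is below. The configuration key and default are placeholders, not the values used by the attached patch:
{code}
// Sketch: create a protocol proxy with an explicit rpc timeout so a dead RM
// node causes the call to fail instead of blocking indefinitely.
import java.net.InetSocketAddress;

import javax.net.SocketFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.net.NetUtils;
import org.apache.hadoop.security.UserGroupInformation;

public class TimedProxySketch {
  public static <T> T createProxy(Class<T> protocol, long version,
      InetSocketAddress rmAddress, Configuration conf) throws Exception {
    int rpcTimeout = conf.getInt("yarn.client.rm-rpc.timeout-ms", 60000); // assumed key/default
    SocketFactory factory = NetUtils.getDefaultSocketFactory(conf);
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    // the seventh argument is the rpc timeout in milliseconds
    return RPC.getProxy(protocol, version, rmAddress, ugi, conf, factory, rpcTimeout);
  }
}
{code}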
[jira] [Updated] (YARN-2540) FairScheduler: Queue filters not working on scheduler page in RM UI
[ https://issues.apache.org/jira/browse/YARN-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2540: --- Summary: FairScheduler: Queue filters not working on scheduler page in RM UI (was: Fair Scheduler : queue filters not working on scheduler page in RM UI) FairScheduler: Queue filters not working on scheduler page in RM UI --- Key: YARN-2540 URL: https://issues.apache.org/jira/browse/YARN-2540 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0, 2.5.1 Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: YARN-2540-v1.txt, YARN-2540-v2.txt, YARN-2540-v3.txt Steps to reproduce : 1. Run an app in default queue. 2. While the app is running, go to the scheduler page on RM UI. 3. You would see the app in the apptable at the bottom. 4. Now click on default queue to filter the apptable on root.default. 5. App disappears from apptable although it is running on default queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2569: Attachment: YARN-2569.4.patch fix all the latest comments Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143951#comment-14143951 ] Vinod Kumar Vavilapalli commented on YARN-2578: --- bq. The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. I am surprised that RM itself fails over (in the context of firewall rule that drops traffic) - we never implemented health monitoring like in ZKFC with HDFS. It seems like if the rpc port gets blocked the RM will not failover as the embedded ZK continues to use the local loop-back and so doesn't detect the network failure. NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Attachments: YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. The NM should than re-register with the new active RM. This re-register takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment: - create a cluster with 3 nodes node 1: ZK, NN, JN, ZKFC, DN, RM, NM node 2: ZK, NN, JN, ZKFC, DN, RM, NM node 3: ZK, JN, DN, NM - start all services make sure they are in good health - kill the network connection of the RM that is active using one of the network kills from above - observe the NN and RM failover - the DN's fail over to the new active NN - the NM does not recover for a long time - the logs show a long delay and traces show no change at all The stack traces of the NM all show the same set of threads. The main thread which should be used in the re-register is the Node Status Updater This thread is stuck in: {code} Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.ipc.Client.call(Client.java:1395) - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1362) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) {code} The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1959: Attachment: YARN-1959.001.patch Addressed feedback Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2562) ContainerId@toString() is unreadable for epoch > 0 after YARN-2182
[ https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143964#comment-14143964 ] Vinod Kumar Vavilapalli commented on YARN-2562: --- How about {{container_e17_1410901177871_0001_01_05}}? A number at the end for me always pointed to the container-id. We also don't need to be verbose with epoch. And we can still parse it in a backwards compatible fashion. If nothing else, my fourth preference is to have something like {{container_1410901177871_0001_01_05_e17}}, the first three preferences are what I proposed above :P ContainerId@toString() is unreadable for epoch > 0 after YARN-2182 - Key: YARN-2562 URL: https://issues.apache.org/jira/browse/YARN-2562 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-2562.1.patch ContainerID string format is unreadable for RMs that restarted at least once (epoch > 0) after YARN-2182. E.g., container_1410901177871_0001_01_05_17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
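To make the proposed format concrete, here is a hedged sketch of how the epoch-prefixed string could be built from ContainerId#getContainerId(). It assumes the post-YARN-2229 layout with the epoch in the bits above the 40-bit sequence number, and it follows the first format suggested above rather than whatever was finally committed:
{code}
// Sketch of the container_e<epoch>_... format; the epoch prefix is omitted
// for epoch 0 so old-style ids stay readable and parseable.
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ContainerId;

public class ContainerIdFormatSketch {
  // Assumed layout: epoch in the bits above the 40-bit sequence number.
  public static String format(ContainerId containerId) {
    long fullId = containerId.getContainerId();
    long epoch = fullId >>> 40;
    long seq = fullId & 0xffffffffffL;
    ApplicationAttemptId attempt = containerId.getApplicationAttemptId();
    ApplicationId app = attempt.getApplicationId();
    StringBuilder sb = new StringBuilder("container_");
    if (epoch > 0) {
      sb.append("e").append(epoch).append("_"); // no prefix for epoch 0
    }
    sb.append(app.getClusterTimestamp()).append("_")
      .append(String.format("%04d", app.getId())).append("_")
      .append(String.format("%02d", attempt.getAttemptId())).append("_")
      .append(String.format("%06d", seq));
    return sb.toString();
    // epoch 0  -> container_1410901177871_0001_01_000005
    // epoch 17 -> container_e17_1410901177871_0001_01_000005
  }
}
{code}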
[jira] [Commented] (YARN-2540) FairScheduler: Queue filters not working on scheduler page in RM UI
[ https://issues.apache.org/jira/browse/YARN-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143967#comment-14143967 ] Karthik Kambatla commented on YARN-2540: Verified the patch fixes the issue on a pseudo-dist cluster. +1. Committing this. FairScheduler: Queue filters not working on scheduler page in RM UI --- Key: YARN-2540 URL: https://issues.apache.org/jira/browse/YARN-2540 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0, 2.5.1 Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: YARN-2540-v1.txt, YARN-2540-v2.txt, YARN-2540-v3.txt Steps to reproduce : 1. Run an app in default queue. 2. While the app is running, go to the scheduler page on RM UI. 3. You would see the app in the apptable at the bottom. 4. Now click on default queue to filter the apptable on root.default. 5. App disappears from apptable although it is running on default queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143974#comment-14143974 ] Jason Lowe commented on YARN-90: Thanks, Varun! Comments on the latest patch: It's a bit odd to have a hash map from disk error types to lists of directories, fill them all in, but in practice only look at one type in the map, and that's DISK_FULL. It'd be simpler (and faster, and less space since there's no hashmap involved) to just track full disks as a separate collection like we already do for localDirs and failedDirs. Nit: DISK_ERROR_CAUSE should be DiskErrorCause (if we keep the enum) to match the style of other enum types in the code. In verifyDirUsingMkdir, if an error occurs during the finally clause then that exception will mask the original exception. isDiskUsageUnderPercentageLimit is named backwards. Disk usage being under the configured limit shouldn't be a full-disk error, and the error message is inconsistent with the method name (the method talks about being under but the error message says it's above). {code} if (isDiskUsageUnderPercentageLimit(testDir)) { msg = "used space above threshold of " + diskUtilizationPercentageCutoff + "%, removing from the list of valid directories."; {code} We should only call getDisksHealthReport() once in the following code: {code} +String report = getDisksHealthReport(); +if (!report.isEmpty()) { + LOG.info("Disk(s) failed. " + getDisksHealthReport()); {code} Should updateDirsAfterTest always say "Disk(s) failed" if the report isn't empty? Thinking of the case where two disks go bad, then one is later restored. The health report will still have something, but that last update is a disk turning good, not failing. Before, this code was only called when a new disk failed, and now that's not always the case. Maybe it should just be something like "Disk health update: " instead? Is it really necessary to stat a directory before we try to delete it? Seems like we can just try to delete it. The idiom of getting the directories and adding the full directories seems pretty common. Might be good to have dir-handler methods that already do this, like getLocalDirsForCleanup or getLogDirsForCleanup. I'm a bit worried that getInitializedLocalDirs could potentially try to delete an entire directory tree for a disk. If this fails in some sector-specific way but other containers are currently using their files from other sectors just fine on the same disk, removing these files from underneath active containers could be very problematic and difficult to debug. NodeManager should identify failed disks becoming good back again - Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs a restart. This JIRA is to improve NodeManager to reuse good disks (which could have been bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
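To illustrate two of the review points above (a threshold check whose name matches its meaning, and an error message consistent with it), here is a small standalone sketch; it is not the DirectoryCollection code from the patch, and the names are ours:
{code}
// Sketch: a correctly-named disk-full check plus a message that agrees with it.
import java.io.File;

public class DiskThresholdCheckSketch {
  private final float diskUtilizationPercentageCutoff;

  public DiskThresholdCheckSketch(float cutoff) {
    this.diskUtilizationPercentageCutoff = cutoff;
  }

  // true when the disk holding 'dir' is fuller than the configured cutoff
  boolean isDiskUsageOverPercentageLimit(File dir) {
    long total = dir.getTotalSpace();
    if (total == 0) {
      return true; // treat an unreadable disk as full
    }
    float freePct = 100.0f * dir.getUsableSpace() / total;
    return (100.0f - freePct) > diskUtilizationPercentageCutoff;
  }

  String checkDir(File dir) {
    if (isDiskUsageOverPercentageLimit(dir)) {
      return "used space above threshold of " + diskUtilizationPercentageCutoff
          + "%, removing from the list of valid directories";
    }
    return null; // healthy
  }
}
{code}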
[jira] [Commented] (YARN-2540) FairScheduler: Queue filters not working on scheduler page in RM UI
[ https://issues.apache.org/jira/browse/YARN-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143980#comment-14143980 ] Ashwin Shankar commented on YARN-2540: -- Thanks [~kasha], [~ywskycn] ! FairScheduler: Queue filters not working on scheduler page in RM UI --- Key: YARN-2540 URL: https://issues.apache.org/jira/browse/YARN-2540 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0, 2.5.1 Reporter: Ashwin Shankar Assignee: Ashwin Shankar Fix For: 2.6.0 Attachments: YARN-2540-v1.txt, YARN-2540-v2.txt, YARN-2540-v3.txt Steps to reproduce : 1. Run an app in default queue. 2. While the app is running, go to the scheduler page on RM UI. 3. You would see the app in the apptable at the bottom. 4. Now click on default queue to filter the apptable on root.default. 5. App disappears from apptable although it is running on default queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143993#comment-14143993 ] Craig Welch commented on YARN-2496: --- So, re the headroom issue (2) - the short version - I don't think we can put off addressing this, because I think it is going to be a typical case and will be problematic. I think the most realistic solution is to support only a short list of pre-configured label expressions per queue. Another option is to limit nodes to supporting only 1 label per node (which, realistically, might be sufficient). A third option is to limit the number of labels which a queue can access to a very small value + the all value (1-2). Basically, one of the factors pushing the large set of possible values which must be considered to properly calculate headroom needs to be made finite/drastically reduced. longer version... I don't think we should move forward without addressing it. I say this because I think it is likely to be a typical situation to have a queue which has more than one label associated with it - most likely, the simple case of a queue which can address all nodes, some of which have a label and some of which do not. Jobs entering these queues using a restrictive label expression will hit this headroom issue - it's especially true in cases where there are fewer resources, which is what one would expect from a small set of special machines (e.g. the typical node label case). It's important to make sure headroom is correctly handled as we add node labels, and as things stand, we know it is not. I'm afraid it is something of a design issue; allowing arbitrary node label expressions with multiple labels on queues, etc., is leading to something of a combinatorial explosion. It may be that the right solution is to narrow the feature set a bit for this iteration. We could choose to only support a restricted set of expressions on a given queue. This could even mean only supporting the default label expression - I'm concerned that this may be too restrictive - and so we would need to support a set of expressions. This could then be a finite list which is pre-calculated. I think, in practical terms, this will probably meet people's needs. A second option is to restrict the number of labels supported on a queue; a small enough set could be pre-calculated for all possibilities. I'm suspicious of this latter option, though: it would have to be a very small number of labels to be manageable, and I think it reduces, realistically, to the restricted set of expressions. I also don't see any performant way to support arbitrary node-label expressions on every request with unlimited labels per queue and node - things as they are. It appears to me you would need to keep track of all resource values for the intersection of all label combinations. If we limited the number of possible labels on a node to one, then we could calculate based on expressions at runtime (possibly for a very small number above 1, but again, growth is exponential, I believe, and functionally complex). [YARN-796] Changes for capacity scheduler to support allocate resource respect labels - Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch This JIRA includes: - Add/parse labels option to {{capacity-scheduler.xml}} similar to other options of a queue like capacity/maximum-capacity, etc. - Include a default-label-expression option in queue config; if an app doesn't specify a label-expression, the default-label-expression of the queue will be used. - Check if labels can be accessed by the queue when submitting an app with a labels-expression to the queue or updating a ResourceRequest with a label-expression - Check labels on the NM when trying to allocate a ResourceRequest on the NM with a label-expression - Respect labels when calculating headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
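To make the "finite, pre-configured list of label expressions" idea a bit more concrete, here is a hedged sketch of keeping a per-expression resource total up to date as nodes join, so headroom only needs a map lookup. All names are illustrative and this is not the approach committed for YARN-796:
{code}
// Sketch: pre-aggregate cluster resource per configured label expression.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class LabelExpressionResourceIndexSketch {
  // expression -> total resource on nodes matching that expression
  private final Map<String, Resource> byExpression = new HashMap<String, Resource>();

  public void nodeAdded(Set<String> nodeLabels, Resource nodeResource,
      Set<String> configuredExpressions) {
    for (String expr : configuredExpressions) {
      if (matches(expr, nodeLabels)) {
        Resource cur = byExpression.get(expr);
        byExpression.put(expr, cur == null
            ? Resources.clone(nodeResource) : Resources.add(cur, nodeResource));
      }
    }
  }

  // simplistic matcher: a single label, or "" meaning "no label required"
  private boolean matches(String expr, Set<String> nodeLabels) {
    return expr.isEmpty() || nodeLabels.contains(expr);
  }

  public Resource totalFor(String expr) {
    Resource r = byExpression.get(expr);
    return r == null ? Resources.none() : r;
  }
}
{code}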
[jira] [Commented] (YARN-2539) FairScheduler: Update the default value for maxAMShare
[ https://issues.apache.org/jira/browse/YARN-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143997#comment-14143997 ] Karthik Kambatla commented on YARN-2539: +1 FairScheduler: Update the default value for maxAMShare -- Key: YARN-2539 URL: https://issues.apache.org/jira/browse/YARN-2539 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2539-1.patch Currently, the maxAMShare per queue is -1 in default, which disables the AM share constraint. Change to 0.5f would be good. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) [YARN-796] Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143998#comment-14143998 ] Hadoop QA commented on YARN-2505: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670512/YARN-2505.1.patch against trunk revision 23e17ce. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5073//console This message is automatically generated. [YARN-796] Support get/add/remove/change labels in RM REST API -- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2539) FairScheduler: Set the default value for maxAMShare to 0.5
[ https://issues.apache.org/jira/browse/YARN-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2539: --- Summary: FairScheduler: Set the default value for maxAMShare to 0.5 (was: FairScheduler: Update the default value for maxAMShare) FairScheduler: Set the default value for maxAMShare to 0.5 -- Key: YARN-2539 URL: https://issues.apache.org/jira/browse/YARN-2539 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2539-1.patch Currently, the maxAMShare per queue is -1 in default, which disables the AM share constraint. Change to 0.5f would be good. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144004#comment-14144004 ] Hadoop QA commented on YARN-2578: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670359/YARN-2578.patch against trunk revision 23e17ce. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5071//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5071//console This message is automatically generated. NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Attachments: YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. The NM should than re-register with the new active RM. This re-register takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment: - create a cluster with 3 nodes node 1: ZK, NN, JN, ZKFC, DN, RM, NM node 2: ZK, NN, JN, ZKFC, DN, RM, NM node 3: ZK, JN, DN, NM - start all services make sure they are in good health - kill the network connection of the RM that is active using one of the network kills from above - observe the NN and RM failover - the DN's fail over to the new active NN - the NM does not recover for a long time - the logs show a long delay and traces show no change at all The stack traces of the NM all show the same set of threads. 
The main thread which should be used in the re-register is the Node Status Updater This thread is stuck in: {code} Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.ipc.Client.call(Client.java:1395) - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1362) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) {code} The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2129) Add scheduling priority to the WindowsSecureContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144006#comment-14144006 ] Hadoop QA commented on YARN-2129: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649565/YARN-2129.2.patch against trunk revision 43efdd3. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5074//console This message is automatically generated. Add scheduling priority to the WindowsSecureContainerExecutor - Key: YARN-2129 URL: https://issues.apache.org/jira/browse/YARN-2129 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 3.0.0 Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2129.1.patch, YARN-2129.2.patch The WCE (YARN-1972) could and should honor NM_CONTAINER_EXECUTOR_SCHED_PRIORITY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144016#comment-14144016 ] Wilfred Spiegelenburg commented on YARN-2578: - To address [~vinodkv] comments: The active RM is completely shut off from the network so are all the other services on the node, including zookeeper. The RM can update zookeeper but that will never be propagated outside of the node to the other zookeeper nodes. It can thus not be seen by the standby RM. The standby RM detects no updates in zookeeper for the timeout period and becomes the active node. That is the normal HA behaviour from the standby node as if the RM would have crashed. NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Attachments: YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. The NM should than re-register with the new active RM. This re-register takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment: - create a cluster with 3 nodes node 1: ZK, NN, JN, ZKFC, DN, RM, NM node 2: ZK, NN, JN, ZKFC, DN, RM, NM node 3: ZK, JN, DN, NM - start all services make sure they are in good health - kill the network connection of the RM that is active using one of the network kills from above - observe the NN and RM failover - the DN's fail over to the new active NN - the NM does not recover for a long time - the logs show a long delay and traces show no change at all The stack traces of the NM all show the same set of threads. The main thread which should be used in the re-register is the Node Status Updater This thread is stuck in: {code} Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.ipc.Client.call(Client.java:1395) - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1362) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) {code} The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144022#comment-14144022 ] Wilfred Spiegelenburg commented on YARN-2578: - I looked into automated testing but like in HDFS-4858 I have not been able to find a way to test this using junit tests. Manual testing is really simple using the above reproduction scenario. NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Attachments: YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. The NM should than re-register with the new active RM. This re-register takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment: - create a cluster with 3 nodes node 1: ZK, NN, JN, ZKFC, DN, RM, NM node 2: ZK, NN, JN, ZKFC, DN, RM, NM node 3: ZK, JN, DN, NM - start all services make sure they are in good health - kill the network connection of the RM that is active using one of the network kills from above - observe the NN and RM failover - the DN's fail over to the new active NN - the NM does not recover for a long time - the logs show a long delay and traces show no change at all The stack traces of the NM all show the same set of threads. The main thread which should be used in the re-register is the Node Status Updater This thread is stuck in: {code} Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.ipc.Client.call(Client.java:1395) - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1362) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) {code} The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144037#comment-14144037 ] Tsuyoshi OZAWA commented on YARN-2312: -- Talked with Jian offline. {quote} 2. Priority. Can we change the definition of Proto? It's used widely and one concern is backward compatibility. {quote} Priority class is used with ContainerId#getId only in test code(e.g. ApplicationHistoryStoreTestUtils). We can leave it for now. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144043#comment-14144043 ] Karthik Kambatla commented on YARN-1959: Thanks Anubhav. Thought about this a little more, and I wonder if we need to have separate headroom calculations for policies. Would DRF#getHeadroom not work for other policies? Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144053#comment-14144053 ] Hadoop QA commented on YARN-2320: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670492/YARN-2320.2.patch against trunk revision 23e17ce. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 23 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.server.TestContainerManagerSecurity The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5072//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5072//console This message is automatically generated. Removing old application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2320.1.patch, YARN-2320.2.patch After YARN-2033, we should deprecate application history store set. There's no need to maintain two sets of store interfaces. In addition, we should conclude the outstanding jira's under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144075#comment-14144075 ] Hadoop QA commented on YARN-2569: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670525/YARN-2569.4.patch against trunk revision 43efdd3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5075//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5075//console This message is automatically generated. Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144076#comment-14144076 ] Anubhav Dhoot commented on YARN-1959: - The queue fair share for the FIFO and fair policies always sets CPU to zero. Thus, using the DRF calculation would cause the headroom to always report zero CPU, which the user could incorrectly interpret as having no CPU headroom (rather than CPU headroom simply not being tracked). Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
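For illustration, here is a rough Java sketch of why the headroom computation ends up policy-specific (this is not the YARN-1959 patch; the helper names are hypothetical). Memory-only policies bound only memory by the fair share and leave CPU limited by what is physically available, while DRF bounds both dimensions:
{code}
// Sketch only: illustrates the policy-specific headroom discussed above.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

final class HeadroomSketch {
  // FIFO / fair-share policies: the fair share only carries memory, so only the
  // memory dimension is capped by it; CPU falls back to what is still available.
  static Resource memoryOnlyHeadroom(Resource fairShare, Resource usage, Resource available) {
    int memory = Math.min(available.getMemory(),
        Math.max(0, fairShare.getMemory() - usage.getMemory()));
    return Resources.createResource(memory, available.getVirtualCores());
  }

  // DRF: both dimensions of the fair share are meaningful, so cap both.
  static Resource drfHeadroom(Resource fairShare, Resource usage, Resource available) {
    int memory = Math.min(available.getMemory(),
        Math.max(0, fairShare.getMemory() - usage.getMemory()));
    int vcores = Math.min(available.getVirtualCores(),
        Math.max(0, fairShare.getVirtualCores() - usage.getVirtualCores()));
    return Resources.createResource(memory, vcores);
  }
}
{code}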
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144087#comment-14144087 ] Eric Payne commented on YARN-2056: -- [~leftnoteasy]: Good catch! It's actually even worse than what you specified. The way the patch is written now, if the preemption-disabled queue is 1) over capacity and 2) asking for more resources, it will preempt from other queues and make them go below their guarantee! I don't have a good suggestion to fix the problem you have outlined other than stating the following: If a queue is over capacity and has untouchable resources in its pool, it cannot preempt other queues at that level. In other words, if you disable preemption on a queue, the only way it will get over its capacity is when other resources free up. Those other resources won't be preempted to fulfill a non-preemptable queue's request if that non-preemptable queue is already over capacity. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt We need to be able to disable preemption at the individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
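As a purely hypothetical illustration of the rule described above (not the attached YARN-2056 patches), the preemption policy could refuse to preempt on behalf of a preemption-disabled queue that is already at or over its guarantee:
{code}
// Sketch only: a queue that cannot be preempted from should also not trigger
// preemption of others once it is at or over its guaranteed capacity.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;

final class PreemptionEligibilitySketch {
  static boolean mayPreemptFor(boolean queuePreemptionDisabled,
                               Resource used, Resource guaranteed,
                               ResourceCalculator rc, Resource clusterResource) {
    boolean atOrOverCapacity = rc.compare(clusterResource, used, guaranteed) >= 0;
    // A non-preemptable queue that is already over its guarantee must wait for
    // resources to free up naturally instead of preempting other queues.
    return !(queuePreemptionDisabled && atOrOverCapacity);
  }
}
{code}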
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144090#comment-14144090 ] Craig Welch commented on YARN-796: -- It looks like the FileSystemNodeLabelManager will just append changes to the edit log forever until it is restarted. Is that correct? If so, a long-running cluster with many label changes could end up with a rather large edit log. I think that every N writes a recovery should be forced to consolidate state and clean up the edit log (i.e., do a recover). Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
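A minimal sketch of that suggestion, assuming a hypothetical write counter inside the label manager (the names and the interval are illustrative, not YARN-796 code):
{code}
// Sketch only: consolidate the label store every N edits so the edit log
// cannot grow without bound on a long-running cluster.
final class EditLogCompactionSketch {
  private static final int COMPACTION_INTERVAL = 1000; // "N writes"; value is illustrative
  private int editsSinceLastCompaction = 0;

  void onEditLogged() throws Exception {
    if (++editsSinceLastCompaction >= COMPACTION_INTERVAL) {
      // Write a fresh mirror of the full label state and truncate the edit log,
      // the same work a restart-time recovery would do.
      consolidateStateAndTruncateEditLog();
      editsSinceLastCompaction = 0;
    }
  }

  private void consolidateStateAndTruncateEditLog() throws Exception {
    // Placeholder: the actual mechanics depend on the FileSystemNodeLabelManager internals.
  }
}
{code}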
[jira] [Commented] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144099#comment-14144099 ] Hadoop QA commented on YARN-1959: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670524/YARN-1959.001.patch against trunk revision 43efdd3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5076//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5076//console This message is automatically generated. Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144111#comment-14144111 ] Karthik Kambatla commented on YARN-1959: Thanks for the clarification here and offline. I understand why the headroom needs to be policy-specific. A couple of nits: # In FifoPolicy and FairSharePolicy, we can avoid one instance of Resource ({{queueAvailable}}) and use an int for memory instead. Maybe we should just use two ints in DRFPolicy as well. # TestFSAppAttempt#VerifyHeadroom should be verifyHeadroom. Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1959: Attachment: YARN-1959.002.patch Addressed feedback Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.002.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2168) SCM/Client/NM/Admin protocols
[ https://issues.apache.org/jira/browse/YARN-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144152#comment-14144152 ] Chris Trezzo commented on YARN-2168: Thanks for the comments [~vinodkv]. I will make changes to reflect all of these comments in the appropriate sub-patches. SCM/Client/NM/Admin protocols - Key: YARN-2168 URL: https://issues.apache.org/jira/browse/YARN-2168 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2168-trunk-v1.patch, YARN-2168-trunk-v2.patch This jira is meant to be used to review the main shared cache APIs. They are as follows: * ClientSCMProtocol - The protocol between the yarn client and the cache manager. This protocol controls how resources in the cache are claimed and released. ** UseSharedCacheResourceRequest ** UseSharedCacheResourceResponse ** ReleaseSharedCacheResourceRequest ** ReleaseSharedCacheResourceResponse * SCMAdminProtocol - This is an administrative protocol for the cache manager. It allows administrators to manually trigger cleaner runs. ** RunSharedCacheCleanerTaskRequest ** RunSharedCacheCleanerTaskResponse * NMCacheUploaderSCMProtocol - The protocol between the NodeManager and the cache manager. This allows the NodeManager to coordinate with the cache manager when uploading new resources to the shared cache. ** NotifySCMRequest ** NotifySCMResponse -- This message was sent by Atlassian JIRA (v6.3.4#6332)
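For readers new to the proposal, a hypothetical Java sketch of the shape of the client-facing protocol follows; the method names and the placeholder request/response types are assumptions for illustration, not the reviewed patch:
{code}
// Sketch only: claim and release resources in the shared cache.
import java.io.IOException;

interface ClientSCMProtocolSketch {
  // Claim a cached resource for an application; the response would typically
  // carry the cache path when the resource is present.
  UseSharedCacheResourceResponseSketch use(UseSharedCacheResourceRequestSketch request)
      throws IOException;

  // Tell the cache manager the application no longer needs the resource.
  ReleaseSharedCacheResourceResponseSketch release(ReleaseSharedCacheResourceRequestSketch request)
      throws IOException;
}

// Minimal placeholder types so the sketch is self-contained.
final class UseSharedCacheResourceRequestSketch { String applicationId; String resourceKey; }
final class UseSharedCacheResourceResponseSketch { String cachedPath; }
final class ReleaseSharedCacheResourceRequestSketch { String applicationId; String resourceKey; }
final class ReleaseSharedCacheResourceResponseSketch { }
{code}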
[jira] [Updated] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2569: Attachment: YARN-2569.4.1.patch Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.1.patch, YARN-2569.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144184#comment-14144184 ] Hadoop QA commented on YARN-1959: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670555/YARN-1959.002.patch against trunk revision 7b8df93. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5077//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5077//console This message is automatically generated. Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.002.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144198#comment-14144198 ] Hadoop QA commented on YARN-2569: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670568/YARN-2569.4.1.patch against trunk revision 7b8df93. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5078//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5078//console This message is automatically generated. Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.1.patch, YARN-2569.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144268#comment-14144268 ] Zhijie Shen commented on YARN-2569: --- LGTM in general. Some comments about the patch. 1. Per the offline discussion, is it a bit aggressive to mark the new APIs \@Stable, in particular when the class itself is marked \@Evolving? BTW, should we make LogAggregationContext \@Public? 2. It would be good to describe what kind of pattern the user should use. Wildcard pattern? http://en.wikipedia.org/wiki/Wildcard_character#File_and_directory_patterns 3. Missing a full stop? {code} + * how often the logAggregationSerivce uploads container logs in seconds {code} 4. The description is broken? {code} + * to set {code} 5. It shouldn't be part of the API? {code} + + @Private + public abstract LogAggregationContextProto getProto(); {code} Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.1.patch, YARN-2569.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
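For context, a hypothetical usage sketch of the API under review; the exact factory signature shown here (include pattern plus exclude pattern) is an assumption, since settling those method shapes is part of this JIRA:
{code}
// Sketch only: attach a LogAggregationContext with wildcard-style patterns
// to an application submission.
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.LogAggregationContext;

final class LogAggregationContextUsageSketch {
  static void attach(ApplicationSubmissionContext submissionContext) {
    // Aggregate *.log files but skip the chatty GC logs.
    LogAggregationContext logContext =
        LogAggregationContext.newInstance("*.log", "*gc.log*");
    submissionContext.setLogAggregationContext(logContext);
  }
}
{code}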
[jira] [Updated] (YARN-2581) NMs need to find a way to get LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2581: Attachment: YARN-2581.1.patch NMs need to find a way to get LogAggregationContext --- Key: YARN-2581 URL: https://issues.apache.org/jira/browse/YARN-2581 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2581.1.patch After YARN-2569, we have a LogAggregationContext for the application in ApplicationSubmissionContext. NMs need to find a way to get this information. We have this requirement: all containers in the same application should honor the same LogAggregationContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
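One possible way to satisfy that requirement, sketched purely for illustration (this is not the attached patch), is for the NM to cache the first LogAggregationContext it sees for an application and reuse it for every later container of that application:
{code}
// Sketch only: all containers of the same application honor the same context.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.LogAggregationContext;

final class PerAppLogAggregationContextCache {
  private final ConcurrentMap<ApplicationId, LogAggregationContext> contexts =
      new ConcurrentHashMap<ApplicationId, LogAggregationContext>();

  // Called when a container start request arrives with (possibly null) context info.
  LogAggregationContext contextFor(ApplicationId appId, LogAggregationContext fromRequest) {
    if (fromRequest != null) {
      contexts.putIfAbsent(appId, fromRequest);
    }
    return contexts.get(appId);
  }

  void onApplicationFinished(ApplicationId appId) {
    contexts.remove(appId);
  }
}
{code}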
[jira] [Created] (YARN-2584) TestContainerManagerSecurity fails on trunk
Zhijie Shen created YARN-2584: - Summary: TestContainerManagerSecurity fails on trunk Key: YARN-2584 URL: https://issues.apache.org/jira/browse/YARN-2584 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen {code} Tests run: 4, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 561.964 sec FAILURE! - in org.apache.hadoop.yarn.server.TestContainerManagerSecurity testContainerManager[0](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 259.553 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) testContainerManager[1](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 258.762 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144301#comment-14144301 ] Zhijie Shen commented on YARN-2320: --- The console log only shows TestContainerManagerSecurity, which seems to fail on trunk as well. Filed a Jira for it: YARN-2584 Removing old application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2320.1.patch, YARN-2320.2.patch After YARN-2033, we should deprecate the application history store set. There's no need to maintain two sets of store interfaces. In addition, we should conclude the outstanding jiras under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2585) TestContainerManagerSecurity failed on trunk
Junping Du created YARN-2585: Summary: TestContainerManagerSecurity failed on trunk Key: YARN-2585 URL: https://issues.apache.org/jira/browse/YARN-2585 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Assignee: Jian He -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2584) TestContainerManagerSecurity fails on trunk
[ https://issues.apache.org/jira/browse/YARN-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-2584: - Assignee: Jian He TestContainerManagerSecurity fails on trunk --- Key: YARN-2584 URL: https://issues.apache.org/jira/browse/YARN-2584 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Jian He {code} Tests run: 4, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 561.964 sec FAILURE! - in org.apache.hadoop.yarn.server.TestContainerManagerSecurity testContainerManager[0](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 259.553 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) testContainerManager[1](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 258.762 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2584) TestContainerManagerSecurity fails on trunk
[ https://issues.apache.org/jira/browse/YARN-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2584: -- Attachment: YARN-2584.1.patch uploaded a patch TestContainerManagerSecurity fails on trunk --- Key: YARN-2584 URL: https://issues.apache.org/jira/browse/YARN-2584 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Jian He Attachments: YARN-2584.1.patch {code} Tests run: 4, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 561.964 sec FAILURE! - in org.apache.hadoop.yarn.server.TestContainerManagerSecurity testContainerManager[0](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 259.553 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) testContainerManager[1](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 258.762 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2585) TestContainerManagerSecurity failed on trunk
[ https://issues.apache.org/jira/browse/YARN-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-2585. -- Resolution: Duplicate TestContainerManagerSecurity failed on trunk Key: YARN-2585 URL: https://issues.apache.org/jira/browse/YARN-2585 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Assignee: Jian He -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2581) NMs need to find a way to get LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2581: Attachment: YARN-2581.2.patch NMs need to find a way to get LogAggregationContext --- Key: YARN-2581 URL: https://issues.apache.org/jira/browse/YARN-2581 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2581.1.patch, YARN-2581.2.patch After YARN-2569, we have a LogAggregationContext for the application in ApplicationSubmissionContext. NMs need to find a way to get this information. We have this requirement: all containers in the same application should honor the same LogAggregationContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2584) TestContainerManagerSecurity fails on trunk
[ https://issues.apache.org/jira/browse/YARN-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144321#comment-14144321 ] Junping Du commented on YARN-2584: -- Patch looks good to me. +1 pending on Jenkins result. TestContainerManagerSecurity fails on trunk --- Key: YARN-2584 URL: https://issues.apache.org/jira/browse/YARN-2584 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Jian He Attachments: YARN-2584.1.patch {code} Tests run: 4, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 561.964 sec FAILURE! - in org.apache.hadoop.yarn.server.TestContainerManagerSecurity testContainerManager[0](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 259.553 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) testContainerManager[1](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 258.762 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2569: Attachment: YARN-2569.5.patch Addressed all the comments Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.1.patch, YARN-2569.4.patch, YARN-2569.5.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2312: - Attachment: YARN-2312.1.patch Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch After YARN-2229, {{ContainerId#getId}} will only return a partial value of the containerId: the sequence number of the container id without the epoch. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144335#comment-14144335 ] Tsuyoshi OZAWA commented on YARN-2312: -- Attached a first patch. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch After YARN-2229, {{ContainerId#getId}} will only return a partial value of the containerId: the sequence number of the container id without the epoch. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
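A simplified sketch of the deprecation being proposed (not the attached patch) could look like this:
{code}
// Sketch only: steer callers from the lossy int accessor to the 64-bit accessor
// that keeps the epoch introduced by YARN-2229.
public abstract class ContainerIdSketch {
  /**
   * @deprecated After YARN-2229 this only carries the container sequence number
   * and drops the epoch. Use {@link #getContainerId()} instead.
   */
  @Deprecated
  public abstract int getId();

  /** Full container id: epoch in the high bits, sequence number in the low bits. */
  public abstract long getContainerId();
}
{code}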
[jira] [Updated] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2468: Attachment: YARN-2468.7.patch Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
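The rolling idea behind this JIRA can be sketched as follows; the scheduling code is purely illustrative and is not the attached patches:
{code}
// Sketch only: upload logs on a fixed interval instead of waiting for the
// application to finish, so a long-running service does not accumulate one
// ever-growing aggregated file.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class RollingLogAggregationSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start(long rollingIntervalSeconds, Runnable uploadLogsProducedSinceLastRun) {
    scheduler.scheduleAtFixedRate(uploadLogsProducedSinceLastRun,
        rollingIntervalSeconds, rollingIntervalSeconds, TimeUnit.SECONDS);
  }

  void stopOnApplicationFinish() {
    scheduler.shutdown();
  }
}
{code}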
[jira] [Commented] (YARN-2584) TestContainerManagerSecurity fails on trunk
[ https://issues.apache.org/jira/browse/YARN-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144349#comment-14144349 ] Hadoop QA commented on YARN-2584: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670603/YARN-2584.1.patch against trunk revision 7b8df93. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5079//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5079//console This message is automatically generated. TestContainerManagerSecurity fails on trunk --- Key: YARN-2584 URL: https://issues.apache.org/jira/browse/YARN-2584 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Jian He Attachments: YARN-2584.1.patch {code} Tests run: 4, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 561.964 sec FAILURE! - in org.apache.hadoop.yarn.server.TestContainerManagerSecurity testContainerManager[0](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 259.553 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) testContainerManager[1](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 258.762 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144367#comment-14144367 ] Hadoop QA commented on YARN-2569: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670606/YARN-2569.5.patch against trunk revision 7b8df93. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5080//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5080//console This message is automatically generated. Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.1.patch, YARN-2569.4.patch, YARN-2569.5.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144381#comment-14144381 ] Zhijie Shen commented on YARN-2569: --- +1 for the latest patch. I'll leave it until tomorrow in case Vinod has further comments about it. Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.1.patch, YARN-2569.4.patch, YARN-2569.5.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)