[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142931#comment-14142931 ] Karthik Kambatla commented on YARN-2453: The latest patch looks good. +1. Checking this in. TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch, YARN-2453.001.patch, YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The error message is the following: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler:
{code}
if (scheduler instanceof PreemptableResourceScheduler &&
    conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) {
{code}
This is because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
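To make the scheduler dependency above concrete, here is a hedged sketch (not necessarily the actual YARN-2453 patch) of how a test can pin its configuration to the CapacityScheduler so the RM actually creates the SchedulingMonitor service, regardless of which scheduler the surrounding test environment defaults to. MockRM is the RM test harness used in this test module; the rest are real YarnConfiguration keys.
{code}
// Hedged sketch: force CapacityScheduler + monitors so SchedulingMonitor is created.
Configuration conf = new YarnConfiguration();
conf.setClass(YarnConfiguration.RM_SCHEDULER,
    CapacityScheduler.class, ResourceScheduler.class);
conf.setBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
conf.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
    ProportionalCapacityPreemptionPolicy.class.getCanonicalName());
MockRM rm = new MockRM(conf);   // RM test harness
rm.start();
// the test can now look up the SchedulingMonitor service on rm.getRMContext()/rm services
{code}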
[jira] [Updated] (YARN-2453) TestProportionalCapacityPreemptionPolicy fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2453: --- Summary: TestProportionalCapacityPreemptionPolicy fails with FairScheduler (was: TestProportionalCapacityPreemptionPolicy is failed for FairScheduler) TestProportionalCapacityPreemptionPolicy fails with FairScheduler - Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch, YARN-2453.001.patch, YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The error message is the following: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler:
{code}
if (scheduler instanceof PreemptableResourceScheduler &&
    conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) {
{code}
This is because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143017#comment-14143017 ] Hadoop QA commented on YARN-2198: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670374/YARN-2198.trunk.9.patch against trunk revision 9721e2c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5070//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5070//console This message is automatically generated. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However, this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to running the entire NM as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. 
The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2498) [YARN-796] Respect labels in preemption policy of capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143019#comment-14143019 ] Wangda Tan commented on YARN-2498: -- Hi [~sunilg], Many thanks for reviewing this patch. Feedback: 1) bq. A scenario where node1 has more than 50% (say 60) of cluster resources, and queue A is given 50% in CS. In that case, is there any chance of under utilization? Yes, queue-A can be under-utilized. By design of YARN-796, this is acceptable. Now we calculate, in real time, the maximum resource that can be accessed by each queue, and the user/admin can get a warning about queue under-utilization from the web UI scheduler page. 2) bq. Here I feel, we may need to split up the resource of label in each node level. It's a very good question; I thought about this again for a while, and found a negative example that shows you're right:
{code}
node1: x,y   node2: x,y   node3: z   (each node has resource 10)

resource tree: total = 30
     /|\
  20x 20y 10z

First request 20 resource with label = x

resource tree: total = 10
     /|\
   0x 20y 10z

The correct result should be y = 0; we cannot request resource with label=y.
{code}
So it's best to split up the resource of each label to the node level, but the problem is that it will have a much larger time complexity. For each assign operation, we need O(n), where n = #unique-sets-of-labels-on-nodes, which can be very large in a big cluster. And considering m = #iterations and p = #leaf-queues, we need O(n * m * p) to get the ideal_assigned of each queue. There may be a better way to calculate ideal_assigned; I will think about this. For now, it can only get a correct ideal_assigned when every node in the cluster has <= 1 label. That is the hard-partition use case (the cluster is partitioned into several smaller clusters by label). 3) bq. For preemption, we just calculate to match the totalResourceToPreempt from the over utilized queues. But whether this container is from which node, and also under which label, and whether this label is coming under which queue. Do we need to do this check for each container? I think the answer is yes if we want this property: every preempted container can be accessed by at least one under-satisfied queue (one with ideal_assigned > current). Please let me know if you have more comments. Thanks, Wangda [YARN-796] Respect labels in preemption policy of capacity scheduler Key: YARN-2498 URL: https://issues.apache.org/jira/browse/YARN-2498 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2498.patch, YARN-2498.patch, YARN-2498.patch, yarn-2498-implementation-notes.pdf There're 3 stages in ProportionalCapacityPreemptionPolicy, # Recursively calculate {{ideal_assigned}} for each queue. This depends on available resource, resource used/pending in each queue and guaranteed capacity of each queue. # Mark to-be preempted containers: For each over-satisfied queue, it will mark some containers to be preempted. # Notify scheduler about to-be preempted containers. We need to respect labels in the cluster for both #1 and #2: For #1, when there is some resource available in the cluster, we shouldn't assign it to a queue (by increasing {{ideal_assigned}}) if the queue cannot access such labels. For #2, when we make a decision about whether we need to preempt a container, we need to make sure the resource of this container is *possibly* usable by a queue which is under-satisfied and has pending resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
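The per-container check discussed in point 3) can be illustrated with a simplified, hypothetical model (plain Java collections, not the YARN-2498 patch): a container running on a node is only worth preempting if at least one under-satisfied queue (ideal_assigned > current) can access one of that node's labels. The class and field names are made up for the example.
{code}
import java.util.Collection;
import java.util.Collections;
import java.util.Set;

class QueueUsage {
  long idealAssigned;            // computed in stage #1 of the policy
  long current;                  // resource currently used by the queue
  Set<String> accessibleLabels;  // labels this queue can access
}

class PreemptionLabelCheck {
  static boolean worthPreempting(Set<String> nodeLabels, Collection<QueueUsage> queues) {
    for (QueueUsage q : queues) {
      boolean underSatisfied = q.idealAssigned > q.current;
      boolean canAccessNode = nodeLabels.isEmpty()      // unlabeled node: accessible to all queues
          || !Collections.disjoint(q.accessibleLabels, nodeLabels);
      if (underSatisfied && canAccessNode) {
        return true;   // at least one needy queue could reuse this node's resource
      }
    }
    return false;      // preempting here would free resource nobody needy can use
  }
}
{code}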
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143034#comment-14143034 ] Zhijie Shen commented on YARN-1530: --- bq. Scenario 1. ATS service goes down bq. Scenario 2. ATS service partially down In general, I agree that the concerns about the scenario when the timeline server is (partially) down make sense. However, if we change the subject from ATS to HDFS/Kafka, I'm afraid we reach a similar conclusion. For example, HDFS can be temporarily not writable (we have actually observed this issue around YARN log aggregation). I can see the judgement has an obvious implication that the timeline server can be down, but HDFS/Kafka will not. That's correct to some extent, based on the current timeline server SLA. Therefore, is making the timeline server reliable (or always-up) the essential solution? If the timeline server is reliable, it's going to relax the requirement to persist entities in a third place (this is the basic benefit I can see with the HDFS/Kafka channel). While it may take a while to make the timeline server as reliable as HDFS/Kafka, we can make progress step by step; for example, YARN-2520 should be realistic to achieve within a reasonable timeline. Of course, there may still be a reliability gap between ATS/HBase and HDFS/Kafka (actually, I'm not experienced with the reliability of the latter components; please let me know what the exact gap would be). It could be arguable that we still need to persist the entities in HDFS/Kafka when ATS/HBase is not available but HDFS/Kafka is still available. However, if we need to improve the timeline server reliability anyway, perhaps we should think carefully about the cost/performance of implementing and maintaining another writing channel to bridge the gap. bq. Scenario 3. ATS backing store fails In this scenario, the entities have already reached the timeline server, right? I'm considering it as the internal reliability problem of the timeline server. As I mentioned in the previous threads, the requirement is that once an entity has reached the timeline server, the timeline server should take the responsibility to prevent it from being lost. I think it's a good point that the data store may have an outage (just as HDFS can be temporarily not writable). Having a local backup for the outstanding received requests should be an answer for this scenario. bq. However, with the HDFS channel, the ATS can essentially throttle the events Suppose you have a cluster pushing X events/second to the ATS. With the REST implementation, the ATS must try to handle X events every second; if it can’t keep up, or if it gets too many incoming connections, there’s not too much we can do here. This may not be an accurate judgement. I suppose you are comparing pushing each event in one request via the REST API with writing a batch of X events into HDFS. The REST API allows you to batch X events and send one request. Please refer to TimelineClient#putEntities for the details. bq. In making the write path pluggable, we’d have to have two pieces: one to do the writing from the TimelineClient and one to the receiving in the ATS. These would have to be in pairs. We’ve already discussed some different implementations for this: REST, Kafka, and HDFS. bq. The backing store is already pluggable. No problem; it's feasible to make the write path pluggable. However, though the store is pluggable, LevelDB and HBase are relatively similar to each other compared with the HTTP REST vs HDFS/Kafka pair. 
The more important thing is that it's more difficult to manage different writing channels than to manage different stores, because one is client-side and the other is server-side. On the server side, the YARN cluster operator has full control of the servers and a limited set of hosts to deal with. On the client side, the YARN cluster operator may not have access to it, and doesn't know how many clients and how many types of apps he/she needs to deal with. TimelineClient is a generic tool (not for a particular application such as Spark), such that it's good to make it lightweight and portable. Again, it's still a cost/performance question. bq. Though as bc pointed out before, it’s fine for more experienced users to use HBase, but “regular” users should have a solution as well that is hopefully more scalable and reliable than LevelDB. Right, and this is also my concern about the HDFS/Kafka channel, in particular using it as a default. Regular users may not be experienced enough about HBase as well as HDFS/Kafka. It very much depends on the users and the use cases. [~bcwalrus] and [~rkanter], thanks for putting new ideas into the timeline service. In general, the timeline service is still a young project. We have different problems to solve and multiple ways to solve them. An additional writing channel is interesting, while given the whole roadmap
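The batching point above (TimelineClient#putEntities) can be made concrete with a hedged illustration: putEntities is a varargs call, so a client can accumulate N entities locally and publish them in one REST request instead of one request per event. The entity and event type names here are made up for the example.
{code}
// Hedged sketch of batched publishing via the existing REST client.
void publishBatch(Configuration conf) throws IOException, YarnException {
  TimelineClient client = TimelineClient.createTimelineClient();
  client.init(conf);
  client.start();
  try {
    TimelineEntity[] batch = new TimelineEntity[100];
    for (int i = 0; i < batch.length; i++) {
      TimelineEntity entity = new TimelineEntity();
      entity.setEntityType("MY_APP_EVENT");            // illustrative type name
      entity.setEntityId("event_" + i);
      TimelineEvent event = new TimelineEvent();
      event.setEventType("TASK_FINISHED");             // illustrative event type
      event.setTimestamp(System.currentTimeMillis());
      entity.addEvent(event);
      batch[i] = entity;
    }
    client.putEntities(batch);                         // one HTTP request for 100 entities
  } finally {
    client.stop();
  }
}
{code}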
[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143039#comment-14143039 ] Zhijie Shen commented on YARN-2556: --- [~jeagles], it sounds like interesting work. Is it possible to see the throughput difference between TimelineDataManager and the web front interface? I suspect the web front interface is going to be the bottleneck that throttles end-to-end performance. With this analysis, we can have a clearer picture of the reasonable number of timeline server instances required to get rid of the web front interface bottleneck (YARN-2520). Tool to measure the performance of the timeline server -- Key: YARN-2556 URL: https://issues.apache.org/jira/browse/YARN-2556 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: chang li We need to be able to understand the capacity model for the timeline server to give users the tools they need to deploy a timeline server with the correct capacity. I propose we create a mapreduce job that can measure timeline server write and read performance. Transactions per second and I/O for both read and write would be a good start. This could be done as an example or test job that could be tied into gridmix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
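A hedged sketch of the kind of write-path measurement such a job could take: time a fixed number of putEntities calls through the REST front end and report transactions per second. Here {{client}} is assumed to be a started TimelineClient (as in the batching sketch earlier), and the entity shape is illustrative.
{code}
// Hedged sketch: single-threaded write throughput measurement.
int numPuts = 1000;
long start = System.nanoTime();
for (int i = 0; i < numPuts; i++) {
  TimelineEntity entity = new TimelineEntity();
  entity.setEntityType("PERF_TEST");           // illustrative type
  entity.setEntityId("entity_" + i);
  entity.setStartTime(System.currentTimeMillis());
  client.putEntities(entity);                  // one REST round trip per entity
}
double seconds = (System.nanoTime() - start) / 1e9;
System.out.printf("write TPS: %.1f%n", numPuts / seconds);
{code}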
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143100#comment-14143100 ] Hudson commented on YARN-2452: -- FAILURE: Integrated in Hadoop-Yarn-trunk #688 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/688/]) YARN-2452. TestRMApplicationHistoryWriter fails with FairScheduler. (Zhihai Xu via kasha) (kasha: rev c50fc92502934aa2a8f84ea2466d4da1e3eace9d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/TestRMApplicationHistoryWriter.java TestRMApplicationHistoryWriter fails with FairScheduler --- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 Attachments: YARN-2452.000.patch, YARN-2452.001.patch, YARN-2452.002.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:1 but was:200 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143099#comment-14143099 ] Hudson commented on YARN-2453: -- FAILURE: Integrated in Hadoop-Yarn-trunk #688 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/688/]) YARN-2453. TestProportionalCapacityPreemptionPolicy fails with FairScheduler. (Zhihai Xu via kasha) (kasha: rev 9721e2c1feb5aecea3a6dab5bda96af1cd0f8de3) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java TestProportionalCapacityPreemptionPolicy fails with FairScheduler - Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 Attachments: YARN-2453.000.patch, YARN-2453.001.patch, YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler. The following is error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for capacity scheduler because the following source code in ResourceManager.java prove it will only work for capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is instance of PreemptableResourceScheduler and FairScheduler is not instance of PreemptableResourceScheduler. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2551) Windows Secure Container Executor: Add checks to validate that the wsce-site.xml is write restricted to Administrators only
[ https://issues.apache.org/jira/browse/YARN-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2551: --- Attachment: YARN-2551.1.patch Windows Secure Container Executor: Add checks to validate that the wsce-site.xml is write restricted to Administrators only --- Key: YARN-2551 URL: https://issues.apache.org/jira/browse/YARN-2551 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows, wsce Attachments: YARN-2551.1.patch The wsce-site.xml contains the impersonate.allowed and impersonate.denied keys that restrict/control the users that can be impersonated by the WSCE containers. The impersonation frameworks in winutils should validate that only Administrators have write control on this file. This is similar to how LCE validates that only root has write permissions on the container-executor.cfg file on secure Linux clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
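An illustrative Java sketch of the kind of validation described above; the real check lives in native winutils code, so this is only a model of the logic. The trusted group/account names and the exact permission set are assumptions for the example.
{code}
// Hypothetical sketch: fail if any ACL entry grants write access to a
// principal other than the Administrators group or SYSTEM.
static void validateAdminOnlyWrite(java.nio.file.Path wsceSite) throws IOException {
  Set<AclEntryPermission> writePerms = EnumSet.of(
      AclEntryPermission.WRITE_DATA, AclEntryPermission.APPEND_DATA,
      AclEntryPermission.WRITE_ACL, AclEntryPermission.WRITE_OWNER);
  AclFileAttributeView view =
      Files.getFileAttributeView(wsceSite, AclFileAttributeView.class);
  for (AclEntry entry : view.getAcl()) {
    boolean grantsWrite = entry.type() == AclEntryType.ALLOW
        && !Collections.disjoint(entry.permissions(), writePerms);
    boolean trusted = entry.principal().getName().endsWith("\\Administrators")  // assumed names
        || entry.principal().getName().equals("NT AUTHORITY\\SYSTEM");
    if (grantsWrite && !trusted) {
      throw new IOException("wsce-site.xml is writable by " + entry.principal());
    }
  }
}
{code}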
[jira] [Resolved] (YARN-2551) Windows Secure Container Executor: Add checks to validate that the wsce-site.xml is write restricted to Administrators only
[ https://issues.apache.org/jira/browse/YARN-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu resolved YARN-2551. Resolution: Implemented The patch will be contained in YARN-2198 patch 10 and forward Windows Secure Container Executor: Add checks to validate that the wsce-site.xml is write restricted to Administrators only --- Key: YARN-2551 URL: https://issues.apache.org/jira/browse/YARN-2551 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows, wsce Attachments: YARN-2551.1.patch The wsce-site.xml contains the impersonate.allowed and impersonate.denied keys that restrict/control the users that can be impersonated by the WSCE containers. The impersonation frameworks in winutils should validate that only Administrators have write control on this file. This is similar to how LCE validates that only root has write permissions on the container-executor.cfg file on secure Linux clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143158#comment-14143158 ] Tsuyoshi OZAWA commented on YARN-2312: -- [~vinodkv], [~jianhe] do you have any feedback? [~jlowe], I would appreciate it if you could give us comments about WrappedJvmID. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch {{ContainerId#getId}} will only return a partial value of the containerId (only the sequence number of the container id, without the epoch) after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2579) Both RM's state is Active, but 1 RM is not really active.
Rohith created YARN-2579: Summary: Both RM's state is Active, but 1 RM is not really active. Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith I encountered a situation where both RMs' web pages were accessible and their state was displayed as Active, but one RM's ActiveServices was stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2579) Both RM's state is Active, but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143200#comment-14143200 ] Rohith commented on YARN-2579: -- This scenario can occur if 2 threads try to access ResourceManager#transitionToStandby(): one from AdminService#transitionToStandby() first and then RMFatalEventDispatcher#transitionToStandBy(). I simulated this using debug points. The main problem is that resetting the dispatcher stops the dispatcher. If AdminService is stopping the dispatcher but the dispatcher thread is blocked trying to acquire the lock on ResourceManager, then the ResourceManager never gets transitioned to StandBy. It waits infinitely.
{code}
"AsyncDispatcher event handler" daemon prio=10 tid=0x007ea000 nid=0x39d1 waiting for monitor entry [0x7fe0a77f6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:976)
	- waiting to lock <0xc1f7d438> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:701)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:678)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
	at java.lang.Thread.run(Thread.java:745)

"IPC Server handler 0 on 45021" daemon prio=10 tid=0x7fe0a9026800 nid=0x30ab in Object.wait() [0x7fe0a7cfa000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0xeb3310e8> (a java.lang.Thread)
	at java.lang.Thread.join(Thread.java:1281)
	- locked <0xeb3310e8> (a java.lang.Thread)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
	at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
	- locked <0xeb32fef8> (a java.lang.Object)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.resetDispatcher(ResourceManager.java:1166)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:987)
	- locked <0xc1f7d438> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:308)
	- locked <0xc2038d10> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
	at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToStandby(HAServiceProtocolServerSideTranslatorPB.java:119)
	at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4462)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}
Both RM's state is Active, but 1 RM is not really active. 
-- Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith I encountered a situation where both RMs' web pages were accessible and their state was displayed as Active, but one RM's ActiveServices was stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
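The thread dump above shows a classic lock-ordering deadlock: the IPC handler holds the ResourceManager monitor inside a synchronized transitionToStandby() and then joins the dispatcher thread, while the dispatcher thread is handling an RMFatalEvent and is blocked trying to enter the same synchronized method. The following self-contained Java program is a minimal, hypothetical illustration of that pattern (not RM code); running it hangs forever, which is the point.
{code}
// Hypothetical demo of the hang: hold a lock, join a thread that needs the same lock.
public class TransitionDeadlockDemo {
  private final Object rmLock = new Object();   // stands in for the ResourceManager monitor
  private volatile Thread dispatcher;

  void startDispatcher() {
    dispatcher = new Thread(() -> {
      sleepQuietly(100);          // dispatcher picks up a "fatal event"...
      transitionToStandby();      // ...and blocks waiting for rmLock
    }, "AsyncDispatcher event handler");
    dispatcher.start();
  }

  void transitionToStandby() {
    synchronized (rmLock) {
      // resetDispatcher(): stop the dispatcher and wait for its thread to exit
      try {
        dispatcher.join();        // never returns: dispatcher is blocked on rmLock
      } catch (InterruptedException ignored) { }
    }
  }

  static void sleepQuietly(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
  }

  public static void main(String[] args) {
    TransitionDeadlockDemo demo = new TransitionDeadlockDemo();
    demo.startDispatcher();
    demo.transitionToStandby();   // the AdminService path; hangs forever
  }
}
{code}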
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143235#comment-14143235 ] Hudson commented on YARN-2453: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1879 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1879/]) YARN-2453. TestProportionalCapacityPreemptionPolicy fails with FairScheduler. (Zhihai Xu via kasha) (kasha: rev 9721e2c1feb5aecea3a6dab5bda96af1cd0f8de3) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/CHANGES.txt TestProportionalCapacityPreemptionPolicy fails with FairScheduler - Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 Attachments: YARN-2453.000.patch, YARN-2453.001.patch, YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler. The following is error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for capacity scheduler because the following source code in ResourceManager.java prove it will only work for capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is instance of PreemptableResourceScheduler and FairScheduler is not instance of PreemptableResourceScheduler. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143236#comment-14143236 ] Hudson commented on YARN-2452: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1879 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1879/]) YARN-2452. TestRMApplicationHistoryWriter fails with FairScheduler. (Zhihai Xu via kasha) (kasha: rev c50fc92502934aa2a8f84ea2466d4da1e3eace9d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/TestRMApplicationHistoryWriter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerConfiguration.java * hadoop-yarn-project/CHANGES.txt TestRMApplicationHistoryWriter fails with FairScheduler --- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 Attachments: YARN-2452.000.patch, YARN-2452.001.patch, YARN-2452.002.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:1 but was:200 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2580) Windows Secure Container Executor: grant job query privileges to the container user
[ https://issues.apache.org/jira/browse/YARN-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu reassigned YARN-2580: -- Assignee: Remus Rusanu Windows Secure Container Executor: grant job query privileges to the container user --- Key: YARN-2580 URL: https://issues.apache.org/jira/browse/YARN-2580 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu mapred.MapTask.initialize uses WindowsBasedProcessTree, which uses winutils to query the container NT JOB. This must be granted query permission by the hadoopwinutilsvc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2580) Windows Secure Container Executor: grant job query privileges to the container user
Remus Rusanu created YARN-2580: -- Summary: Windows Secure Container Executor: grant job query privileges to the container user Key: YARN-2580 URL: https://issues.apache.org/jira/browse/YARN-2580 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu mapred.MapTask.initialize uses WindowsBasedProcessTree, which uses winutils to query the container NT JOB. This must be granted query permission by the hadoopwinutilsvc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143301#comment-14143301 ] Wangda Tan commented on YARN-1198: -- Hi [~cwelch], Sorry for the late response. I've just looked at your ver.8 patch and comments. My reply: bq. -re we don't need write HeadroomProvider for each scheduler And bq. Provider vs Reference I agree with this. I think we need to write different HeadroomProviders, and it's better to keep Provider since it's more general. bq. -re As mentioned by Jason, currently ... Agree, this can be done in a separate JIRA. bq. -re the cost of the calculation Agree, it's just a small computation effort. In the past, I suggested doing as I mentioned in https://issues.apache.org/jira/browse/YARN-1198?focusedCommentId=14108991&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14108991 because I thought that would make the code cleaner. But according to your ver.8 patch, I realized that may not be doable. In LeafQueue#computeUserLimit, it uses required to get the user limit. In your patch, you save the lastRequired to the user class. However, we need a different required for different apps under the same user. We can only do the calculation when the app heartbeats (we could also loop and set every app's headroom, but that's an approach we abandoned before). So basically, IMHO, I think your ver.7 is the more correct way to go, which keeps complexity/efficiency balanced. Any thoughts? [~jianhe], [~cwelch]. Wangda Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Craig Welch Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, YARN-1198.8.patch Today headroom calculation (for the app) takes place only when * a new node is added/removed from the cluster * a new container is getting assigned to the application. However there are potentially a lot of situations which are not considered for this calculation: * If a container finishes then the headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to either application app1/app2 then both AMs should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted in the same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also, today headroom is an absolute number (I think it should be normalized, but then this is not going to be backward compatible..) * Also, when the admin user refreshes queues, headroom has to be updated. These are all potential bugs in headroom calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
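A minimal, hypothetical sketch of the "HeadroomProvider" idea discussed above: rather than caching a stale headroom value per application, recompute it from shared queue/user state when the app heartbeats. The class names, fields, and the formula are illustrative, not the actual CapacityScheduler code.
{code}
// Hypothetical model: QueueState and UserState stand in for the scheduler's shared state.
class QueueState { Resource maxCapacity; Resource used; }
class UserState  { Resource userLimit;   Resource consumed; }

interface HeadroomProvider {
  Resource getHeadroom();            // evaluated lazily, at allocate/heartbeat time
}

class UserLimitHeadroomProvider implements HeadroomProvider {
  private final QueueState queue;    // shared, mutable queue-level state
  private final UserState user;      // shared, mutable per-user state

  UserLimitHeadroomProvider(QueueState queue, UserState user) {
    this.queue = queue;
    this.user = user;
  }

  @Override
  public Resource getHeadroom() {
    // headroom = min(userLimit - userConsumed, queueMaxCapacity - queueUsed)
    Resource fromUser  = Resources.subtract(user.userLimit, user.consumed);
    Resource fromQueue = Resources.subtract(queue.maxCapacity, queue.used);
    return Resources.componentwiseMin(fromUser, fromQueue);
  }
}
{code}
Because every app under the same user shares the provider's underlying state, a container finishing for app1 is automatically reflected the next time app2 asks for its headroom, without looping over all applications.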
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143386#comment-14143386 ] Hudson commented on YARN-2453: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1904 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1904/]) YARN-2453. TestProportionalCapacityPreemptionPolicy fails with FairScheduler. (Zhihai Xu via kasha) (kasha: rev 9721e2c1feb5aecea3a6dab5bda96af1cd0f8de3) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * hadoop-yarn-project/CHANGES.txt TestProportionalCapacityPreemptionPolicy fails with FairScheduler - Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 Attachments: YARN-2453.000.patch, YARN-2453.001.patch, YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler. The following is error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for capacity scheduler because the following source code in ResourceManager.java prove it will only work for capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is instance of PreemptableResourceScheduler and FairScheduler is not instance of PreemptableResourceScheduler. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143387#comment-14143387 ] Hudson commented on YARN-2452: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1904 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1904/]) YARN-2452. TestRMApplicationHistoryWriter fails with FairScheduler. (Zhihai Xu via kasha) (kasha: rev c50fc92502934aa2a8f84ea2466d4da1e3eace9d) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ahs/TestRMApplicationHistoryWriter.java TestRMApplicationHistoryWriter fails with FairScheduler --- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 Attachments: YARN-2452.000.patch, YARN-2452.001.patch, YARN-2452.002.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:1 but was:200 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143401#comment-14143401 ] bc Wong commented on YARN-1530: --- Hi [~zjshen]. First, glad to see that we're discussing approaches. You seem to agree with the premise that *ATS write path should not slow down apps*. bq. Therefore, is making the timeline server reliable (or always-up) the essential solution? If the timeline server is reliable, ... In theory, you can make the ATS *always-up*. In practice, we both know what real life distributed systems do. Always-up isn't the only thing. The write path needs to have good uptime and latency regardless of what's happening to the read path or the backing store. HDFS is a good default for the write channel because: * We don't have to design an ATS that is always-up. If you really want to, I'm sure you can eventually build something with good uptime. But it took other projects (HDFS, ZK) lots of hard work to get to that point. * If we reuse HDFS, cluster admins know how to operate HDFS and get good uptime from it. But it'll take training and hard-learned lessons for operators to figure out how to get good uptime from ATS, even after you build an always-up ATS. * All the popular YARN app frameworks (MR, Spark, etc.) already rely on HDFS by default. So do most of the 3rd party applications that I know of. Architecturally, it seems easier for admins to accept that ATS write path depends on HDFS for reliability, instead of a new component that (we claim) will be as reliable as HDFS/ZK. bq. given the whole roadmap of the timeline service, let's think critically of work that can improve the timeline service most significantly. Exactly. Strong +1. If we can drop the high uptime + low write latency requirement from the ATS service, we can avoid tons of effort. ATS doesn't need to be as reliable as HDFS. We don't need to worry about insulating the write path from the read path. We don't need to worry about occasional hiccups in HBase (or whatever the store is). And at the end of all this, everybody sleeps better because ATS service going down isn't a catastrophic failure. [Umbrella] Store, manage and serve per-framework application-timeline data -- Key: YARN-1530 URL: https://issues.apache.org/jira/browse/YARN-1530 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, application timeline design-20140116.pdf, application timeline design-20140130.pdf, application timeline design-20140210.pdf This is a sibling JIRA for YARN-321. Today, each application/framework has to do store, and serve per-framework data all by itself as YARN doesn't have a common solution. This JIRA attempts to solve the storage, management and serving of per-framework data from various applications, both running and finished. The aim is to change YARN to collect and store data in a generic manner with plugin points for frameworks to do their own thing w.r.t interpretation and serving. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2581) NMs need to find a way to get LogAggregationContext
Xuan Gong created YARN-2581: --- Summary: NMs need to find a way to get LogAggregationContext Key: YARN-2581 URL: https://issues.apache.org/jira/browse/YARN-2581 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong After YARN-2569, we have a LogAggregationContext for the application in ApplicationSubmissionContext. NMs need to find a way to get this information. We have this requirement: all containers in the same application should honor the same LogAggregationContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143460#comment-14143460 ] Allen Wittenauer commented on YARN-913: --- bq. Summary: need to fix ZK client and then have curator configure it, so the rest of us don't have to care. This might be a blocker then. If a client needs to talk to more than one ZK, it sounds like they are basically screwed. bq. do you mean in the endpoint fields? It should ... let me clarify that in the example. I was mainly looking at the hostname pattern:
{code}
+  String HOSTNAME_PATTERN =
+      "([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9])";
{code}
It doesn't appear to support periods/dots. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
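To make the "no periods/dots" observation concrete, here is a small, self-contained Java check. The COMPONENT string is the pattern quoted from the patch; the FQDN variant (a dot-separated repetition of that component) is only an illustrative suggestion, not code from the patch.
{code}
import java.util.regex.Pattern;

public class HostnamePatternCheck {
  // pattern as quoted from the patch
  static final String COMPONENT = "([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9])";
  // illustrative dot-separated variant
  static final String FQDN = COMPONENT + "(\\." + COMPONENT + ")*";

  public static void main(String[] args) {
    System.out.println(Pattern.matches(COMPONENT, "node01.example.com")); // false: dots rejected
    System.out.println(Pattern.matches(FQDN, "node01.example.com"));      // true
  }
}
{code}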
[jira] [Created] (YARN-2582) Log related CLI and Web UI changes for LRS
Xuan Gong created YARN-2582: --- Summary: Log related CLI and Web UI changes for LRS Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong After YARN-2468, we have changed the log layout to support log aggregation for Long Running Services. The log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143473#comment-14143473 ] Jian He commented on YARN-1372: --- +1 for the latest patch, committing Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.002_NMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, YARN-1372.004.patch, YARN-1372.005.patch, YARN-1372.005.patch, YARN-1372.006.patch, YARN-1372.007.patch, YARN-1372.008.patch, YARN-1372.009.patch, YARN-1372.009.patch, YARN-1372.010.patch, YARN-1372.prelim.patch, YARN-1372.prelim2.patch Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
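The protocol described in the YARN-1372 summary (NM keeps reported completions until the RM confirms the AM has pulled them, and replays the whole set on re-registration) can be sketched with a minimal, hypothetical NM-side tracker. This is an illustration of the bookkeeping only, not the actual patch, and the types are simplified to plain strings.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CompletedContainerTracker {
  // containerId -> completion status, kept until the RM acks the AM pull
  private final Map<String, String> unacked = new ConcurrentHashMap<>();

  void containerCompleted(String containerId, String status) {
    unacked.put(containerId, status);
  }

  // included in every heartbeat, and replayed in full on re-registration after RM restart
  List<String> containersToReport() {
    return new ArrayList<>(unacked.keySet());
  }

  // called when the RM reports that the AM has pulled these completions
  void ackPulledByAM(List<String> containerIds) {
    containerIds.forEach(unacked::remove);
  }
}
{code}
Duplicate reports are possible by design: if the RM dies after the AM pulls but before the ack reaches the NM, the completion is simply reported again, which matches the "may be reported more than once" note in the description.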
[jira] [Created] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
Xuan Gong created YARN-2583: --- Summary: Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Currently, AggregatedLogDeletionService will delete old logs from HDFS. It will directly delete the app-log-dir from HDFS. This will not work for LRS. We expect an LRS application to keep running for a long time. Deleting the app-log-dir for LRS applications is not the right way to handle it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong reassigned YARN-2583: --- Assignee: Xuan Gong Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Currently, AggregatedLogDeletionService will delete old logs from HDFS. It will directly delete the app-log-dir from HDFS. This will not work for LRS. We expect an LRS application to keep running for a long time. Deleting the app-log-dir for LRS applications is not the right way to handle it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
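A hedged sketch of the direction described above (not the actual YARN-2583 patch): for a long-running application, delete only individual rolled log files older than the retention window instead of removing the whole app-log-dir. The flat directory layout (rolled log files directly under the app-log-dir) is an assumption for the example.
{code}
// Hedged sketch using the Hadoop FileSystem API.
void deleteOldLogsForRunningApp(FileSystem fs, Path appLogDir, long retentionMs)
    throws IOException {
  long cutoff = System.currentTimeMillis() - retentionMs;
  for (FileStatus logFile : fs.listStatus(appLogDir)) {
    if (logFile.isFile() && logFile.getModificationTime() < cutoff) {
      fs.delete(logFile.getPath(), false);   // delete old rolled files, keep the app-log-dir itself
    }
  }
}
{code}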
[jira] [Commented] (YARN-2554) Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy
[ https://issues.apache.org/jira/browse/YARN-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143495#comment-14143495 ] Steve Loughran commented on YARN-2554: -- Vinod, this patch is independent of kerberos, secure AMs, etc. This patch is to allow any AM to export an HTTPS URL; you can't do this on a secure or insecure cluster today. It doesn't mean that clients can trust something just because it is on HTTPS; that's an independent issue. Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy - Key: YARN-2554 URL: https://issues.apache.org/jira/browse/YARN-2554 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.6.0 Reporter: Jonathan Maron Attachments: YARN-2554.1.patch, YARN-2554.2.patch, YARN-2554.3.patch, YARN-2554.3.patch If the HTTP policy to enable HTTPS is specified, the RM and AM are initialized with SSL listeners. The RM has a web app proxy servlet that acts as a proxy for incoming AM requests. In order to forward the requests to the AM, the proxy servlet makes use of HttpClient. However, the HttpClient utilized is not initialized correctly with the necessary certs to allow for successful one way SSL invocations to the other nodes in the cluster (it is not configured to access/load the client truststore specified in ssl-client.xml). I imagine SSLFactory.createSSLSocketFactory() could be utilized to create an instance that can be assigned to the HttpClient. The symptoms of this issue are: AM: Displays unknown_certificate exception RM: Displays an exception such as javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target -- This message was sent by Atlassian JIRA (v6.3.4#6332)
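A hedged sketch of the SSLFactory idea mentioned in the description: build a client SSLSocketFactory from ssl-client.xml (truststore included) and hand it to whatever HTTP client the proxy servlet uses. Only the factory creation is shown; the HttpClient wiring itself is elided and would depend on the HttpClient version in use.
{code}
// Hedged sketch: create a client-mode socket factory from Hadoop's SSL config.
SSLFactory sslFactory = new SSLFactory(SSLFactory.Mode.CLIENT, new Configuration());
sslFactory.init();                                          // loads keystores/truststores per ssl-client.xml
SSLSocketFactory socketFactory = sslFactory.createSSLSocketFactory();
// ... configure the proxy's HttpClient to use socketFactory for https:// connections ...
sslFactory.destroy();                                       // when the proxy shuts down
{code}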
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143499#comment-14143499 ] Xuan Gong commented on YARN-2468: - This is a very big patch and is hard to review. I would like to split the big patch into several smaller patches: 1) API changes will be tracked by YARN-2569 2) NMs need to find a way to get LogAggregationContext. This will be tracked by YARN-2581 3) The current ticket will be used to track changes for NM handling of the logs for LRS, which will include the log layout changes 4) Log Deletion Service changes will be tracked by YARN-2583 5) Related CLI and web UI changes will be tracked by YARN-2582 Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch Currently, when an application finishes, the NM starts log aggregation. But for long-running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2562) ContainerId@toString() is unreadable for epoch > 0 after YARN-2182
[ https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143567#comment-14143567 ] Jian He commented on YARN-2562: --- Patch looks good, thanks Tsuyoshi! Could you add a brief comment in the toString method that the epoch will increase if the RM restarts or fails over? ContainerId@toString() is unreadable for epoch > 0 after YARN-2182 - Key: YARN-2562 URL: https://issues.apache.org/jira/browse/YARN-2562 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-2562.1.patch ContainerID string format is unreadable for RMs that restarted at least once (epoch > 0) after YARN-2182. E.g., container_1410901177871_0001_01_05_17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143594#comment-14143594 ] Zhijie Shen commented on YARN-1530: --- Hi, [~bcwalrus]. Thanks for your further comments. bq. You seem to agree with the premise that ATS write path should not slow down apps. Definitely. The arguable point is whether the current timeline client is going to slow down the app, given we have a scalable and reliable timeline server. bq. If we can drop the high uptime + low write latency requirement from the ATS service, we can avoid tons of effort. I'm not sure such fundamental requirements can be dropped from the timeline service. Projecting into the future, scalable and highly available timeline servers have multiple benefits and enable different use cases. For example, 1. We can use it to serve realtime or near-realtime data, such that we can go to the timeline server to see what is happening to an application. It's particularly useful for long-running services, which never shut down. 2. We can build checkpoints on the timeline server for the app to do recovery once it crashes. It's pretty much like what we've done for MR jobs. I bundled scalable and reliable together because a multiple-instance solution will improve the timeline server in both dimensions. Moreover, no matter how scalable and reliable the channel could be, we eventually want to get the timeline data accommodated into the timeline server, right? Otherwise, it is not going to be accessible by users (of course, tricks can be played to fetch it directly from HDFS, but that's a completely different story from the timeline server). If the apps are publishing 10GB of data per hour, while the server can only process 1GB per hour, the 9GB of outstanding data per hour that resides in some temp location on HDFS is going to be useless writes. We have narrowed the discussion down to the reliability of the write path, but if we look at the big picture, *the timeline server is not just a place to store data, but also serves it to users* (e.g., YARN-2513). In terms of use cases, users may want to monitor completed apps as well as running apps and the cluster. If the timeline server doesn't have the capacity to serve the data for a particular use case, the cost spent on aggregating it is actually wasted. IMHO, a scalable and reliable timeline server is going to be *the eventual solution to satisfy multiple stakeholders*, regardless of whether the use case is read intensive, write intensive, or both. That's why I think improving the timeline server could be high-margin work. It may be hard work, but we should definitely pick it up. [Umbrella] Store, manage and serve per-framework application-timeline data -- Key: YARN-1530 URL: https://issues.apache.org/jira/browse/YARN-1530 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, application timeline design-20140116.pdf, application timeline design-20140130.pdf, application timeline design-20140210.pdf This is a sibling JIRA for YARN-321. Today, each application/framework has to store and serve per-framework data all by itself as YARN doesn't have a common solution. This JIRA attempts to solve the storage, management and serving of per-framework data from various applications, both running and finished.
The aim is to change YARN to collect and store data in a generic manner with plugin points for frameworks to do their own thing w.r.t interpretation and serving. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2583: Description: Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time; if all logs for an application are older than the cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS. We expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configured the rollingIntervalSeconds, new log files will always be uploaded to HDFS. The number of log files for this application will become larger and larger, and no log files will ever be deleted. 2) If we did not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application is finished. It is very possible that the logs are uploaded after the cut-off time, which will cause a problem because at that time the app-log-dir for this application in HDFS has already been deleted. was:Currently, AggregatedLogDeletionService will delete old logs from HDFS. It will directly delete the app-log-dir from HDFS. This will not work for LRS. We expect a LRS application can keep running for a long time. Deleting the app-log-dir for the LRS applications is not a right way to handle it. Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time; if all logs for an application are older than the cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS. We expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configured the rollingIntervalSeconds, new log files will always be uploaded to HDFS. The number of log files for this application will become larger and larger, and no log files will ever be deleted. 2) If we did not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application is finished. It is very possible that the logs are uploaded after the cut-off time, which will cause a problem because at that time the app-log-dir for this application in HDFS has already been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
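As a rough illustration of the per-file deletion the updated description implies for running LRS applications (the whole app-log-dir is only dropped once the application has finished), here is a hedged sketch; the class, method and parameter names are made up for the example and are not taken from the actual patch:
{code}
// Sketch of LRS-aware log deletion: delete only rolled log files past the
// cut-off for still-running apps, keep the existing whole-directory removal
// for finished apps.
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LrsLogDeletionSketch {
  public static void deleteOldLogFiles(FileSystem fs, Path appLogDir,
      long cutoffMillis, boolean appStillRunning) throws IOException {
    if (!appStillRunning) {
      // finished apps keep the existing behaviour: drop the whole directory
      fs.delete(appLogDir, true);
      return;
    }
    // running (LRS) apps: only remove individual rolled files past the cut-off
    for (FileStatus file : fs.listStatus(appLogDir)) {
      if (!file.isDirectory() && file.getModificationTime() < cutoffMillis) {
        fs.delete(file.getPath(), false);
      }
    }
  }
}
{code}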
[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143665#comment-14143665 ] Craig Welch commented on YARN-2494: --- The other day [~vinodkv] suggested changing the addLabel, removeLabel, ... APIs to addNodeLabel, removeNodeLabel, ... to make them clearer (and presumably make it smoother to add other possible types of labels in the future). This would not affect the label APIs, and the node-to-label ones are OK already, I think. Thoughts? [YARN-796] Node label manager API and storage implementations - Key: YARN-2494 URL: https://issues.apache.org/jira/browse/YARN-2494 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch This JIRA includes APIs and storage implementations of the node label manager. NodeLabelManager is an abstract class used to manage labels of nodes in the cluster; it has APIs to query/modify - Nodes according to a given label - Labels according to a given hostname - Add/remove labels - Set labels of nodes in the cluster - Persist/recover changes of labels/labels-on-nodes to/from storage And it has two implementations to store modifications - Memory based storage: It will not persist changes, so all labels will be lost when the RM restarts - FileSystem based storage: It will persist/recover to/from a FileSystem (like HDFS), and all labels and labels-on-nodes will be recovered upon RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
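If it helps to picture the rename being proposed, a tiny illustrative interface is below; the method names and signatures are ours, not the ones in the YARN-2494 patches:
{code}
// Illustrative only: the rename discussed above, from addLabel/removeLabel to
// addNodeLabel/removeNodeLabel, leaving the node-to-label methods untouched.
import java.util.Set;

public interface NodeLabelManagerSketch {
  void addNodeLabels(Set<String> labels) throws Exception;
  void removeNodeLabels(Set<String> labels) throws Exception;

  // node-to-label mappings keep their existing, already unambiguous names
  void setLabelsOnNode(String host, Set<String> labels) throws Exception;
  Set<String> getLabelsOnNode(String host);
}
{code}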
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143699#comment-14143699 ] Jason Lowe commented on YARN-2312: -- bq. One idea is to add id for upper 32 bits of container Id to ID class. The ID class is used by much more than just JvmID objects. I'm not a fan of making all IDs pay for this extra storage when we only need it for this one case. I'd rather store the extra bits in JvmID. Actually I don't think it's critical that JvmID derives from ID. We could have JvmID store the long itself rather than try to hack an extra 4-bytes onto ID and then need to explain why JvmID.getId doesn't do what one would expect. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
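A toy sketch of the alternative Jason describes - carrying the full 64-bit container id in the JVM id type itself instead of widening the shared ID base class. The class and method names below are hypothetical, not the real org.apache.hadoop.mapred types:
{code}
// Sketch: a JvmID-like type that stores the long itself, so the epoch bits
// from ContainerId#getContainerId() are preserved without touching ID.
public class JvmIdSketch {
  private final String jobId;
  private final long containerId; // full value from ContainerId#getContainerId()

  public JvmIdSketch(String jobId, long containerId) {
    this.jobId = jobId;
    this.containerId = containerId;
  }

  // the full 64-bit id, including the epoch bits
  public long getFullContainerId() {
    return containerId;
  }

  @Override
  public String toString() {
    return "jvm_" + jobId + "_" + containerId;
  }
}
{code}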
[jira] [Updated] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2320: -- Attachment: YARN-2320.1.patch Removing old application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2320.1.patch After YARN-2033, we should deprecate application history store set. There's no need to maintain two sets of store interfaces. In addition, we should conclude the outstanding jira's under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143783#comment-14143783 ] Zhijie Shen commented on YARN-2320: --- Uploaded a huge patch. It doesn't have complex logic; it mostly removes the old application history store stack, including: 1. Null|Memory|FileSystemApplicationHistoryStore, the related protobuf classes, and the ApplicationHistoryManagerImpl based on them. 2. RMApplicationHistoryWriter, the events used by it, and its invocations within the RM. 3. Unnecessary configurations in YarnConfiguration. In addition, I've fixed the test cases based on ApplicationHistoryStore, and renamed ApplicationHistoryManagerOnTimelineStore to ApplicationHistoryManagerImpl. Removing old application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2320.1.patch After YARN-2033, we should deprecate application history store set. There's no need to maintain two sets of store interfaces. In addition, we should conclude the outstanding jira's under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2540) Fair Scheduler : queue filters not working on scheduler page in RM UI
[ https://issues.apache.org/jira/browse/YARN-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143793#comment-14143793 ] Ashwin Shankar commented on YARN-2540: -- Hi [~kasha], when you get a chance can you please review/commit the latest patch ? Fair Scheduler : queue filters not working on scheduler page in RM UI - Key: YARN-2540 URL: https://issues.apache.org/jira/browse/YARN-2540 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0, 2.5.1 Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: YARN-2540-v1.txt, YARN-2540-v2.txt, YARN-2540-v3.txt Steps to reproduce : 1. Run an app in default queue. 2. While the app is running, go to the scheduler page on RM UI. 3. You would see the app in the apptable at the bottom. 4. Now click on default queue to filter the apptable on root.default. 5. App disappears from apptable although it is running on default queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2320: -- Attachment: YARN-2320.2.patch Remove the unnecessary proto file as well Removing old application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2320.1.patch, YARN-2320.2.patch After YARN-2033, we should deprecate application history store set. There's no need to maintain two sets of store interfaces. In addition, we should conclude the outstanding jira's under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) [YARN-796] Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143855#comment-14143855 ] Craig Welch commented on YARN-2505: --- 1) -re rename all-nodes-to-labels to nodes-to-labels - done -re node-filter, I don't think that it makes sense to switch it. While code-wise I see where it is awkward to do a value filter, this follows the spec and it makes sense from a use-case perspective - I expect that the desire is to find all of the nodes which have a particular label on them; that is the purpose of this filter, it makes sense to me that someone would want to do that, and it seems to fit in with this API. I think there are easier ways to see what labels are on a node; adding it as a filter to this kind of API call makes little sense to me anyway, as it is more-or-less a direct property of a node - if it's missing, I think it belongs elsewhere anyway. I have shortened lines where found. [YARN-796] Support get/add/remove/change labels in RM REST API -- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2505) [YARN-796] Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2505: -- Attachment: YARN-2505.1.patch [YARN-796] Support get/add/remove/change labels in RM REST API -- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143890#comment-14143890 ] Karthik Kambatla commented on YARN-2578: Would it be possible to add a test case for this? NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Attachments: YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. The NM should than re-register with the new active RM. This re-register takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment: - create a cluster with 3 nodes node 1: ZK, NN, JN, ZKFC, DN, RM, NM node 2: ZK, NN, JN, ZKFC, DN, RM, NM node 3: ZK, JN, DN, NM - start all services make sure they are in good health - kill the network connection of the RM that is active using one of the network kills from above - observe the NN and RM failover - the DN's fail over to the new active NN - the NM does not recover for a long time - the logs show a long delay and traces show no change at all The stack traces of the NM all show the same set of threads. The main thread which should be used in the re-register is the Node Status Updater This thread is stuck in: {code} Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.ipc.Client.call(Client.java:1395) - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1362) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) {code} The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
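For the last point in the description (the resource tracker proxy should be created through a variant that takes the RPC timeout from configuration), a rough sketch of the kind of call involved is below. The configuration key and default are placeholders, not the values used by the attached patch:
{code}
// Sketch: create a protocol proxy with an explicit rpc timeout so a dead RM
// node causes the call to fail instead of blocking indefinitely.
import java.net.InetSocketAddress;

import javax.net.SocketFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.net.NetUtils;
import org.apache.hadoop.security.UserGroupInformation;

public class TimedProxySketch {
  public static <T> T createProxy(Class<T> protocol, long version,
      InetSocketAddress rmAddress, Configuration conf) throws Exception {
    int rpcTimeout = conf.getInt("yarn.client.rm-rpc.timeout-ms", 60000); // assumed key/default
    SocketFactory factory = NetUtils.getDefaultSocketFactory(conf);
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    // the seventh argument is the rpc timeout in milliseconds
    return RPC.getProxy(protocol, version, rmAddress, ugi, conf, factory, rpcTimeout);
  }
}
{code}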
[jira] [Updated] (YARN-2540) FairScheduler: Queue filters not working on scheduler page in RM UI
[ https://issues.apache.org/jira/browse/YARN-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2540: --- Summary: FairScheduler: Queue filters not working on scheduler page in RM UI (was: Fair Scheduler : queue filters not working on scheduler page in RM UI) FairScheduler: Queue filters not working on scheduler page in RM UI --- Key: YARN-2540 URL: https://issues.apache.org/jira/browse/YARN-2540 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0, 2.5.1 Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: YARN-2540-v1.txt, YARN-2540-v2.txt, YARN-2540-v3.txt Steps to reproduce : 1. Run an app in default queue. 2. While the app is running, go to the scheduler page on RM UI. 3. You would see the app in the apptable at the bottom. 4. Now click on default queue to filter the apptable on root.default. 5. App disappears from apptable although it is running on default queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2569: Attachment: YARN-2569.4.patch fix all the latest comments Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143951#comment-14143951 ] Vinod Kumar Vavilapalli commented on YARN-2578: --- bq. The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. I am surprised that RM itself fails over (in the context of firewall rule that drops traffic) - we never implemented health monitoring like in ZKFC with HDFS. It seems like if the rpc port gets blocked the RM will not failover as the embedded ZK continues to use the local loop-back and so doesn't detect the network failure. NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Attachments: YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. The NM should than re-register with the new active RM. This re-register takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment: - create a cluster with 3 nodes node 1: ZK, NN, JN, ZKFC, DN, RM, NM node 2: ZK, NN, JN, ZKFC, DN, RM, NM node 3: ZK, JN, DN, NM - start all services make sure they are in good health - kill the network connection of the RM that is active using one of the network kills from above - observe the NN and RM failover - the DN's fail over to the new active NN - the NM does not recover for a long time - the logs show a long delay and traces show no change at all The stack traces of the NM all show the same set of threads. The main thread which should be used in the re-register is the Node Status Updater This thread is stuck in: {code} Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.ipc.Client.call(Client.java:1395) - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1362) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) {code} The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1959: Attachment: YARN-1959.001.patch Addressed feedback Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2562) ContainerId@toString() is unreadable for epoch > 0 after YARN-2182
[ https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143964#comment-14143964 ] Vinod Kumar Vavilapalli commented on YARN-2562: --- How about {{container_e17_1410901177871_0001_01_05}}? A number at the end for me always pointed to the container-id. We also don't need to be verbose with epoch. And we can still parse it in a backwards compatible fashion. If nothing else, my fourth preference is to have something like {{container_1410901177871_0001_01_05_e17}}, the first three preferences are what I proposed above :P ContainerId@toString() is unreadable for epoch > 0 after YARN-2182 - Key: YARN-2562 URL: https://issues.apache.org/jira/browse/YARN-2562 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-2562.1.patch ContainerID string format is unreadable for RMs that restarted at least once (epoch > 0) after YARN-2182. E.g., container_1410901177871_0001_01_05_17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
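To make the proposed format concrete, here is a hedged sketch of how the epoch-prefixed string could be built from ContainerId#getContainerId(). It assumes the post-YARN-2229 layout with the epoch in the bits above the 40-bit sequence number, and it follows the first format suggested above rather than whatever was finally committed:
{code}
// Sketch of the container_e<epoch>_... format; the epoch prefix is omitted
// for epoch 0 so old-style ids stay readable and parseable.
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ContainerId;

public class ContainerIdFormatSketch {
  // Assumed layout: epoch in the bits above the 40-bit sequence number.
  public static String format(ContainerId containerId) {
    long fullId = containerId.getContainerId();
    long epoch = fullId >>> 40;
    long seq = fullId & 0xffffffffffL;
    ApplicationAttemptId attempt = containerId.getApplicationAttemptId();
    ApplicationId app = attempt.getApplicationId();
    StringBuilder sb = new StringBuilder("container_");
    if (epoch > 0) {
      sb.append("e").append(epoch).append("_"); // no prefix for epoch 0
    }
    sb.append(app.getClusterTimestamp()).append("_")
      .append(String.format("%04d", app.getId())).append("_")
      .append(String.format("%02d", attempt.getAttemptId())).append("_")
      .append(String.format("%06d", seq));
    return sb.toString();
    // epoch 0  -> container_1410901177871_0001_01_000005
    // epoch 17 -> container_e17_1410901177871_0001_01_000005
  }
}
{code}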
[jira] [Commented] (YARN-2540) FairScheduler: Queue filters not working on scheduler page in RM UI
[ https://issues.apache.org/jira/browse/YARN-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143967#comment-14143967 ] Karthik Kambatla commented on YARN-2540: Verified the patch fixes the issue on a pseudo-dist cluster. +1. Committing this. FairScheduler: Queue filters not working on scheduler page in RM UI --- Key: YARN-2540 URL: https://issues.apache.org/jira/browse/YARN-2540 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0, 2.5.1 Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: YARN-2540-v1.txt, YARN-2540-v2.txt, YARN-2540-v3.txt Steps to reproduce : 1. Run an app in default queue. 2. While the app is running, go to the scheduler page on RM UI. 3. You would see the app in the apptable at the bottom. 4. Now click on default queue to filter the apptable on root.default. 5. App disappears from apptable although it is running on default queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143974#comment-14143974 ] Jason Lowe commented on YARN-90: Thanks, Varun! Comments on the latest patch: It's a bit odd to have a hash map from disk error types to lists of directories, fill them all in, but in practice only look at one type in the map, and that's DISK_FULL. It'd be simpler (and faster, and less space since there's no hashmap involved) to just track full disks as a separate collection like we already do for localDirs and failedDirs. Nit: DISK_ERROR_CAUSE should be DiskErrorCause (if we keep the enum) to match the style of other enum types in the code. In verifyDirUsingMkdir, if an error occurs during the finally clause then that exception will mask the original exception. isDiskUsageUnderPercentageLimit is named backwards. Disk usage being under the configured limit shouldn't be a full-disk error, and the error message is inconsistent with the method name (the method talks about being under but the error message says it's above). {code} if (isDiskUsageUnderPercentageLimit(testDir)) { msg = "used space above threshold of " + diskUtilizationPercentageCutoff + "%, removing from the list of valid directories."; {code} We should only call getDisksHealthReport() once in the following code: {code} +String report = getDisksHealthReport(); +if (!report.isEmpty()) { + LOG.info("Disk(s) failed. " + getDisksHealthReport()); {code} Should updateDirsAfterTest always say "Disk(s) failed" if the report isn't empty? Thinking of the case where two disks go bad, then one is later restored. The health report will still have something, but that last update is a disk turning good, not failing. Before, this code was only called when a new disk failed, and now that's not always the case. Maybe it should just be something like "Disk health update: " instead? Is it really necessary to stat a directory before we try to delete it? Seems like we can just try to delete it. The idiom of getting the directories and adding the full directories seems pretty common. Might be good to have dir-handler methods that already do this, like getLocalDirsForCleanup or getLogDirsForCleanup. I'm a bit worried that getInitializedLocalDirs could potentially try to delete an entire directory tree for a disk. If this fails in some sector-specific way but other containers are currently using their files from other sectors just fine on the same disk, removing these files from underneath active containers could be very problematic and difficult to debug. NodeManager should identify failed disks becoming good back again - Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs a restart. This JIRA is to improve NodeManager to reuse good disks (which could have been bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
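To illustrate two of the review points above (a threshold check whose name matches its meaning, and an error message consistent with it), here is a small standalone sketch; it is not the DirectoryCollection code from the patch, and the names are ours:
{code}
// Sketch: a correctly-named disk-full check plus a message that agrees with it.
import java.io.File;

public class DiskThresholdCheckSketch {
  private final float diskUtilizationPercentageCutoff;

  public DiskThresholdCheckSketch(float cutoff) {
    this.diskUtilizationPercentageCutoff = cutoff;
  }

  // true when the disk holding 'dir' is fuller than the configured cutoff
  boolean isDiskUsageOverPercentageLimit(File dir) {
    long total = dir.getTotalSpace();
    if (total == 0) {
      return true; // treat an unreadable disk as full
    }
    float freePct = 100.0f * dir.getUsableSpace() / total;
    return (100.0f - freePct) > diskUtilizationPercentageCutoff;
  }

  String checkDir(File dir) {
    if (isDiskUsageOverPercentageLimit(dir)) {
      return "used space above threshold of " + diskUtilizationPercentageCutoff
          + "%, removing from the list of valid directories";
    }
    return null; // healthy
  }
}
{code}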
[jira] [Commented] (YARN-2540) FairScheduler: Queue filters not working on scheduler page in RM UI
[ https://issues.apache.org/jira/browse/YARN-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143980#comment-14143980 ] Ashwin Shankar commented on YARN-2540: -- Thanks [~kasha], [~ywskycn] ! FairScheduler: Queue filters not working on scheduler page in RM UI --- Key: YARN-2540 URL: https://issues.apache.org/jira/browse/YARN-2540 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0, 2.5.1 Reporter: Ashwin Shankar Assignee: Ashwin Shankar Fix For: 2.6.0 Attachments: YARN-2540-v1.txt, YARN-2540-v2.txt, YARN-2540-v3.txt Steps to reproduce : 1. Run an app in default queue. 2. While the app is running, go to the scheduler page on RM UI. 3. You would see the app in the apptable at the bottom. 4. Now click on default queue to filter the apptable on root.default. 5. App disappears from apptable although it is running on default queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143993#comment-14143993 ] Craig Welch commented on YARN-2496: --- So, re the headroom issue (2) - the short version - I don't think we can put off addressing this, because I think it is going to be a typical case and will be problematic. I think the most realistic solution is to support only a short list of pre-configured label expressions per queue. Another option is to limit nodes to supporting only 1 label per node (which, realistically, might be sufficient). A third option is to limit the number of labels which a queue can access to a very small value + the all value (1-2). Basically, one of the factors pushing the large set of possible values which must be considered to properly calculate headroom needs to be made finite/drastically reduced. longer version... I don't think we should move forward without addressing it. I say this because I think it is likely to be a typical situation to have a queue which has more than one label associated with it - most likely, the simple case of a queue which can address all nodes, some of which have a label and some of which do not. Jobs entering these queues using a restrictive label expression will hit this headroom issue - it's especially true in cases where there are fewer resources, which is what one would expect from a small set of special machines (e.g. the typical node label case). It's important to make sure headroom is correctly handled as we add node labels, and as things stand, we know it is not. I'm afraid it is something of a design issue; allowing arbitrary node label expressions with multiple labels on queues, etc., is leading to something of a combinatorial explosion. It may be that the right solution is to narrow the feature set a bit for this iteration. We could choose to only support a restricted set of expressions on a given queue. This could even mean only supporting the default label expression - I'm concerned that this may be too restrictive - and so we would need to support a set of expressions. This could then be a finite list which is pre-calculated. I think, in practical terms, this will probably meet people's needs. A second option is to restrict the number of labels supported on a queue; a small enough set could be pre-calculated for all possibilities. I'm suspicious of this latter option, though: it would have to be a very small number of labels to be manageable, and I think it reduces, realistically, to the restricted set of expressions. I also don't see any performant way to support arbitrary node-label expressions on every request with unlimited labels per queue and node - things as they are. It appears to me you would need to keep track of all resource values for the intersection of all label combinations. If we limited the number of possible labels on a node to one, then we could calculate based on expressions at runtime (possibly for a very small number above 1, but again, growth is exponential, I believe, and functionally complex). [YARN-796] Changes for capacity scheduler to support allocate resource respect labels - Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch This JIRA includes: - Add/parse labels option to {{capacity-scheduler.xml}} similar to other options of a queue like capacity/maximum-capacity, etc. - Include a default-label-expression option in queue config; if an app doesn't specify a label-expression, the default-label-expression of the queue will be used. - Check if labels can be accessed by the queue when submitting an app with a labels-expression to the queue or updating a ResourceRequest with a label-expression - Check labels on the NM when trying to allocate a ResourceRequest on the NM with a label-expression - Respect labels when calculating headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
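To make the "finite, pre-configured list of label expressions" idea a bit more concrete, here is a hedged sketch of keeping a per-expression resource total up to date as nodes join, so headroom only needs a map lookup. All names are illustrative and this is not the approach committed for YARN-796:
{code}
// Sketch: pre-aggregate cluster resource per configured label expression.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class LabelExpressionResourceIndexSketch {
  // expression -> total resource on nodes matching that expression
  private final Map<String, Resource> byExpression = new HashMap<String, Resource>();

  public void nodeAdded(Set<String> nodeLabels, Resource nodeResource,
      Set<String> configuredExpressions) {
    for (String expr : configuredExpressions) {
      if (matches(expr, nodeLabels)) {
        Resource cur = byExpression.get(expr);
        byExpression.put(expr, cur == null
            ? Resources.clone(nodeResource) : Resources.add(cur, nodeResource));
      }
    }
  }

  // simplistic matcher: a single label, or "" meaning "no label required"
  private boolean matches(String expr, Set<String> nodeLabels) {
    return expr.isEmpty() || nodeLabels.contains(expr);
  }

  public Resource totalFor(String expr) {
    Resource r = byExpression.get(expr);
    return r == null ? Resources.none() : r;
  }
}
{code}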
[jira] [Commented] (YARN-2539) FairScheduler: Update the default value for maxAMShare
[ https://issues.apache.org/jira/browse/YARN-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143997#comment-14143997 ] Karthik Kambatla commented on YARN-2539: +1 FairScheduler: Update the default value for maxAMShare -- Key: YARN-2539 URL: https://issues.apache.org/jira/browse/YARN-2539 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2539-1.patch Currently, the maxAMShare per queue is -1 in default, which disables the AM share constraint. Change to 0.5f would be good. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) [YARN-796] Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143998#comment-14143998 ] Hadoop QA commented on YARN-2505: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670512/YARN-2505.1.patch against trunk revision 23e17ce. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5073//console This message is automatically generated. [YARN-796] Support get/add/remove/change labels in RM REST API -- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2539) FairScheduler: Set the default value for maxAMShare to 0.5
[ https://issues.apache.org/jira/browse/YARN-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2539: --- Summary: FairScheduler: Set the default value for maxAMShare to 0.5 (was: FairScheduler: Update the default value for maxAMShare) FairScheduler: Set the default value for maxAMShare to 0.5 -- Key: YARN-2539 URL: https://issues.apache.org/jira/browse/YARN-2539 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor Attachments: YARN-2539-1.patch Currently, the maxAMShare per queue is -1 in default, which disables the AM share constraint. Change to 0.5f would be good. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144004#comment-14144004 ] Hadoop QA commented on YARN-2578: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670359/YARN-2578.patch against trunk revision 23e17ce. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5071//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5071//console This message is automatically generated. NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Attachments: YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. The NM should than re-register with the new active RM. This re-register takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment: - create a cluster with 3 nodes node 1: ZK, NN, JN, ZKFC, DN, RM, NM node 2: ZK, NN, JN, ZKFC, DN, RM, NM node 3: ZK, JN, DN, NM - start all services make sure they are in good health - kill the network connection of the RM that is active using one of the network kills from above - observe the NN and RM failover - the DN's fail over to the new active NN - the NM does not recover for a long time - the logs show a long delay and traces show no change at all The stack traces of the NM all show the same set of threads. 
The main thread which should be used in the re-register is the Node Status Updater This thread is stuck in: {code} Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.ipc.Client.call(Client.java:1395) - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1362) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) {code} The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2129) Add scheduling priority to the WindowsSecureContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144006#comment-14144006 ] Hadoop QA commented on YARN-2129: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649565/YARN-2129.2.patch against trunk revision 43efdd3. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5074//console This message is automatically generated. Add scheduling priority to the WindowsSecureContainerExecutor - Key: YARN-2129 URL: https://issues.apache.org/jira/browse/YARN-2129 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 3.0.0 Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2129.1.patch, YARN-2129.2.patch The WCE (YARN-1972) could and should honor NM_CONTAINER_EXECUTOR_SCHED_PRIORITY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144016#comment-14144016 ] Wilfred Spiegelenburg commented on YARN-2578: - To address [~vinodkv] comments: The active RM is completely shut off from the network so are all the other services on the node, including zookeeper. The RM can update zookeeper but that will never be propagated outside of the node to the other zookeeper nodes. It can thus not be seen by the standby RM. The standby RM detects no updates in zookeeper for the timeout period and becomes the active node. That is the normal HA behaviour from the standby node as if the RM would have crashed. NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Attachments: YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. The NM should than re-register with the new active RM. This re-register takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment: - create a cluster with 3 nodes node 1: ZK, NN, JN, ZKFC, DN, RM, NM node 2: ZK, NN, JN, ZKFC, DN, RM, NM node 3: ZK, JN, DN, NM - start all services make sure they are in good health - kill the network connection of the RM that is active using one of the network kills from above - observe the NN and RM failover - the DN's fail over to the new active NN - the NM does not recover for a long time - the logs show a long delay and traces show no change at all The stack traces of the NM all show the same set of threads. The main thread which should be used in the re-register is the Node Status Updater This thread is stuck in: {code} Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.ipc.Client.call(Client.java:1395) - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1362) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) {code} The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144022#comment-14144022 ] Wilfred Spiegelenburg commented on YARN-2578: - I looked into automated testing but like in HDFS-4858 I have not been able to find a way to test this using junit tests. Manual testing is really simple using the above reproduction scenario. NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Attachments: YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. The NM should than re-register with the new active RM. This re-register takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment: - create a cluster with 3 nodes node 1: ZK, NN, JN, ZKFC, DN, RM, NM node 2: ZK, NN, JN, ZKFC, DN, RM, NM node 3: ZK, JN, DN, NM - start all services make sure they are in good health - kill the network connection of the RM that is active using one of the network kills from above - observe the NN and RM failover - the DN's fail over to the new active NN - the NM does not recover for a long time - the logs show a long delay and traces show no change at all The stack traces of the NM all show the same set of threads. The main thread which should be used in the re-register is the Node Status Updater This thread is stuck in: {code} Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.ipc.Client.call(Client.java:1395) - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1362) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) {code} The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144037#comment-14144037 ] Tsuyoshi OZAWA commented on YARN-2312: -- Talked with Jian offline. {quote} 2. Priority. Can we change the definition of Proto? It's used widely and one concern is backward compatibility. {quote} Priority class is used with ContainerId#getId only in test code(e.g. ApplicationHistoryStoreTestUtils). We can leave it for now. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144043#comment-14144043 ] Karthik Kambatla commented on YARN-1959: Thanks Anubhav. Thought about this a little more, and I wonder if we need to have separate headroom calculations for policies. Would DRF#getHeadroom not work for other policies? Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144053#comment-14144053 ] Hadoop QA commented on YARN-2320: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670492/YARN-2320.2.patch against trunk revision 23e17ce. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 23 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.server.TestContainerManagerSecurity The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5072//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5072//console This message is automatically generated. Removing old application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2320.1.patch, YARN-2320.2.patch After YARN-2033, we should deprecate application history store set. There's no need to maintain two sets of store interfaces. In addition, we should conclude the outstanding jira's under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144075#comment-14144075 ] Hadoop QA commented on YARN-2569: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670525/YARN-2569.4.patch against trunk revision 43efdd3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5075//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5075//console This message is automatically generated. Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144076#comment-14144076 ] Anubhav Dhoot commented on YARN-1959: - The queue fair share for the FIFO and fair policies always sets CPU to zero. Thus, using the DRF calculation would cause the headroom to always report zero CPU, which the user could incorrectly interpret as having no CPU headroom (rather than CPU headroom simply not being tracked). Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
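For illustration, here is a rough Java sketch of why the headroom computation ends up policy-specific (this is not the YARN-1959 patch; the helper names are hypothetical). Memory-only policies bound only memory by the fair share and leave CPU limited by what is physically available, while DRF bounds both dimensions:
{code}
// Sketch only: illustrates the policy-specific headroom discussed above.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

final class HeadroomSketch {
  // FIFO / fair-share policies: the fair share only carries memory, so only the
  // memory dimension is capped by it; CPU falls back to what is still available.
  static Resource memoryOnlyHeadroom(Resource fairShare, Resource usage, Resource available) {
    int memory = Math.min(available.getMemory(),
        Math.max(0, fairShare.getMemory() - usage.getMemory()));
    return Resources.createResource(memory, available.getVirtualCores());
  }

  // DRF: both dimensions of the fair share are meaningful, so cap both.
  static Resource drfHeadroom(Resource fairShare, Resource usage, Resource available) {
    int memory = Math.min(available.getMemory(),
        Math.max(0, fairShare.getMemory() - usage.getMemory()));
    int vcores = Math.min(available.getVirtualCores(),
        Math.max(0, fairShare.getVirtualCores() - usage.getVirtualCores()));
    return Resources.createResource(memory, vcores);
  }
}
{code}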
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144087#comment-14144087 ] Eric Payne commented on YARN-2056: -- [~leftnoteasy]: Good catch! It's actually even worse than what you specified. The way the patch is written now, if the preemption-disabled queue is 1) over capacity and 2) asking for more resources, it will preempt from other queues and make them go below their guarantee! I don't have a good suggestion to fix the problem you have outlined other than stating the following: If a queue is over capacity and has untouchable resources in its pool, it cannot preempt other queues at that level. In other words, if you disable preemption on a queue, the only way it will get over its capacity is when other resources free up. Those other resources won't be preempted to fulfill a non-preemptable queue's request if that non-preemptable queue is already over capacity. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt We need to be able to disable preemption at the individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
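As a purely hypothetical illustration of the rule described above (not the attached YARN-2056 patches), the preemption policy could refuse to preempt on behalf of a preemption-disabled queue that is already at or over its guarantee:
{code}
// Sketch only: a queue that cannot be preempted from should also not trigger
// preemption of others once it is at or over its guaranteed capacity.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;

final class PreemptionEligibilitySketch {
  static boolean mayPreemptFor(boolean queuePreemptionDisabled,
                               Resource used, Resource guaranteed,
                               ResourceCalculator rc, Resource clusterResource) {
    boolean atOrOverCapacity = rc.compare(clusterResource, used, guaranteed) >= 0;
    // A non-preemptable queue that is already over its guarantee must wait for
    // resources to free up naturally instead of preempting other queues.
    return !(queuePreemptionDisabled && atOrOverCapacity);
  }
}
{code}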
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144090#comment-14144090 ] Craig Welch commented on YARN-796: -- It looks like the FileSystemNodeLabelManager will just append changes to the edit log forever until it is restarted. Is that correct? If so, a long-running cluster with many label changes could end up with a rather large edit log. I think that every N writes a recovery should be forced to consolidate state and clean up the edit log (i.e., do a recover). Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
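A minimal sketch of that suggestion, assuming a hypothetical write counter inside the label manager (the names and the interval are illustrative, not YARN-796 code):
{code}
// Sketch only: consolidate the label store every N edits so the edit log
// cannot grow without bound on a long-running cluster.
final class EditLogCompactionSketch {
  private static final int COMPACTION_INTERVAL = 1000; // "N writes"; value is illustrative
  private int editsSinceLastCompaction = 0;

  void onEditLogged() throws Exception {
    if (++editsSinceLastCompaction >= COMPACTION_INTERVAL) {
      // Write a fresh mirror of the full label state and truncate the edit log,
      // the same work a restart-time recovery would do.
      consolidateStateAndTruncateEditLog();
      editsSinceLastCompaction = 0;
    }
  }

  private void consolidateStateAndTruncateEditLog() throws Exception {
    // Placeholder: the actual mechanics depend on the FileSystemNodeLabelManager internals.
  }
}
{code}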
[jira] [Commented] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144099#comment-14144099 ] Hadoop QA commented on YARN-1959: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670524/YARN-1959.001.patch against trunk revision 43efdd3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5076//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5076//console This message is automatically generated. Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144111#comment-14144111 ] Karthik Kambatla commented on YARN-1959: Thanks for the clarification here and offline. I understand why the headroom needs to be policy-specific. A couple of nits: # In FifoPolicy and FairSharePolicy, we can avoid one instance of Resource ({{queueAvailable}}) and use an int for memory instead. Maybe we should just use two ints in DRFPolicy as well. # TestFSAppAttempt#VerifyHeadroom should be verifyHeadroom. Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1959: Attachment: YARN-1959.002.patch Addressed feedback Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.002.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2168) SCM/Client/NM/Admin protocols
[ https://issues.apache.org/jira/browse/YARN-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144152#comment-14144152 ] Chris Trezzo commented on YARN-2168: Thanks for the comments [~vinodkv]. I will make changes to reflect all of these comments in the appropriate sub-patches. SCM/Client/NM/Admin protocols - Key: YARN-2168 URL: https://issues.apache.org/jira/browse/YARN-2168 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2168-trunk-v1.patch, YARN-2168-trunk-v2.patch This jira is meant to be used to review the main shared cache APIs. They are as follows: * ClientSCMProtocol - The protocol between the yarn client and the cache manager. This protocol controls how resources in the cache are claimed and released. ** UseSharedCacheResourceRequest ** UseSharedCacheResourceResponse ** ReleaseSharedCacheResourceRequest ** ReleaseSharedCacheResourceResponse * SCMAdminProtocol - This is an administrative protocol for the cache manager. It allows administrators to manually trigger cleaner runs. ** RunSharedCacheCleanerTaskRequest ** RunSharedCacheCleanerTaskResponse * NMCacheUploaderSCMProtocol - The protocol between the NodeManager and the cache manager. This allows the NodeManager to coordinate with the cache manager when uploading new resources to the shared cache. ** NotifySCMRequest ** NotifySCMResponse -- This message was sent by Atlassian JIRA (v6.3.4#6332)
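For readers new to the proposal, a hypothetical Java sketch of the shape of the client-facing protocol follows; the method names and the placeholder request/response types are assumptions for illustration, not the reviewed patch:
{code}
// Sketch only: claim and release resources in the shared cache.
import java.io.IOException;

interface ClientSCMProtocolSketch {
  // Claim a cached resource for an application; the response would typically
  // carry the cache path when the resource is present.
  UseSharedCacheResourceResponseSketch use(UseSharedCacheResourceRequestSketch request)
      throws IOException;

  // Tell the cache manager the application no longer needs the resource.
  ReleaseSharedCacheResourceResponseSketch release(ReleaseSharedCacheResourceRequestSketch request)
      throws IOException;
}

// Minimal placeholder types so the sketch is self-contained.
final class UseSharedCacheResourceRequestSketch { String applicationId; String resourceKey; }
final class UseSharedCacheResourceResponseSketch { String cachedPath; }
final class ReleaseSharedCacheResourceRequestSketch { String applicationId; String resourceKey; }
final class ReleaseSharedCacheResourceResponseSketch { }
{code}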
[jira] [Updated] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2569: Attachment: YARN-2569.4.1.patch Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.1.patch, YARN-2569.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144184#comment-14144184 ] Hadoop QA commented on YARN-1959: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670555/YARN-1959.002.patch against trunk revision 7b8df93. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5077//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5077//console This message is automatically generated. Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.001.patch, YARN-1959.002.patch, YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144198#comment-14144198 ] Hadoop QA commented on YARN-2569: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670568/YARN-2569.4.1.patch against trunk revision 7b8df93. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5078//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5078//console This message is automatically generated. Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.1.patch, YARN-2569.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144268#comment-14144268 ] Zhijie Shen commented on YARN-2569: --- LGTM in general. Some comments about the patch. 1. Per the offline discussion, is it a bit aggressive to mark the new APIs \@Stable, in particular when the class itself is marked \@Evolving? BTW, should we make LogAggregationContext \@Public? 2. It would be good to describe what kind of pattern the user should use. Wildcard pattern? http://en.wikipedia.org/wiki/Wildcard_character#File_and_directory_patterns 3. Missing a full stop? {code} + * how often the logAggregationSerivce uploads container logs in seconds {code} 4. The description is broken? {code} + * to set {code} 5. It shouldn't be part of the API? {code} + + @Private + public abstract LogAggregationContextProto getProto(); {code} Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.1.patch, YARN-2569.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
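For context, a hypothetical usage sketch of the API under review; the exact factory signature shown here (include pattern plus exclude pattern) is an assumption, since settling those method shapes is part of this JIRA:
{code}
// Sketch only: attach a LogAggregationContext with wildcard-style patterns
// to an application submission.
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.LogAggregationContext;

final class LogAggregationContextUsageSketch {
  static void attach(ApplicationSubmissionContext submissionContext) {
    // Aggregate *.log files but skip the chatty GC logs.
    LogAggregationContext logContext =
        LogAggregationContext.newInstance("*.log", "*gc.log*");
    submissionContext.setLogAggregationContext(logContext);
  }
}
{code}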
[jira] [Updated] (YARN-2581) NMs need to find a way to get LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2581: Attachment: YARN-2581.1.patch NMs need to find a way to get LogAggregationContext --- Key: YARN-2581 URL: https://issues.apache.org/jira/browse/YARN-2581 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2581.1.patch After YARN-2569, we have a LogAggregationContext for the application in ApplicationSubmissionContext. NMs need to find a way to get this information. We have this requirement: all containers in the same application should honor the same LogAggregationContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
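One possible way to satisfy that requirement, sketched purely for illustration (this is not the attached patch), is for the NM to cache the first LogAggregationContext it sees for an application and reuse it for every later container of that application:
{code}
// Sketch only: all containers of the same application honor the same context.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.LogAggregationContext;

final class PerAppLogAggregationContextCache {
  private final ConcurrentMap<ApplicationId, LogAggregationContext> contexts =
      new ConcurrentHashMap<ApplicationId, LogAggregationContext>();

  // Called when a container start request arrives with (possibly null) context info.
  LogAggregationContext contextFor(ApplicationId appId, LogAggregationContext fromRequest) {
    if (fromRequest != null) {
      contexts.putIfAbsent(appId, fromRequest);
    }
    return contexts.get(appId);
  }

  void onApplicationFinished(ApplicationId appId) {
    contexts.remove(appId);
  }
}
{code}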
[jira] [Created] (YARN-2584) TestContainerManagerSecurity fails on trunk
Zhijie Shen created YARN-2584: - Summary: TestContainerManagerSecurity fails on trunk Key: YARN-2584 URL: https://issues.apache.org/jira/browse/YARN-2584 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen {code} Tests run: 4, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 561.964 sec FAILURE! - in org.apache.hadoop.yarn.server.TestContainerManagerSecurity testContainerManager[0](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 259.553 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) testContainerManager[1](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 258.762 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144301#comment-14144301 ] Zhijie Shen commented on YARN-2320: --- The console log only shows TestContainerManagerSecurity, which seems to fail on trunk as well. Filed a Jira for it: YARN-2584 Removing old application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2320.1.patch, YARN-2320.2.patch After YARN-2033, we should deprecate the application history store set. There's no need to maintain two sets of store interfaces. In addition, we should conclude the outstanding jiras under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2585) TestContainerManagerSecurity failed on trunk
Junping Du created YARN-2585: Summary: TestContainerManagerSecurity failed on trunk Key: YARN-2585 URL: https://issues.apache.org/jira/browse/YARN-2585 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Assignee: Jian He -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2584) TestContainerManagerSecurity fails on trunk
[ https://issues.apache.org/jira/browse/YARN-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-2584: - Assignee: Jian He TestContainerManagerSecurity fails on trunk --- Key: YARN-2584 URL: https://issues.apache.org/jira/browse/YARN-2584 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Jian He {code} Tests run: 4, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 561.964 sec FAILURE! - in org.apache.hadoop.yarn.server.TestContainerManagerSecurity testContainerManager[0](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 259.553 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) testContainerManager[1](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 258.762 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2584) TestContainerManagerSecurity fails on trunk
[ https://issues.apache.org/jira/browse/YARN-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2584: -- Attachment: YARN-2584.1.patch uploaded a patch TestContainerManagerSecurity fails on trunk --- Key: YARN-2584 URL: https://issues.apache.org/jira/browse/YARN-2584 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Jian He Attachments: YARN-2584.1.patch {code} Tests run: 4, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 561.964 sec FAILURE! - in org.apache.hadoop.yarn.server.TestContainerManagerSecurity testContainerManager[0](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 259.553 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) testContainerManager[1](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 258.762 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2585) TestContainerManagerSecurity failed on trunk
[ https://issues.apache.org/jira/browse/YARN-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-2585. -- Resolution: Duplicate TestContainerManagerSecurity failed on trunk Key: YARN-2585 URL: https://issues.apache.org/jira/browse/YARN-2585 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Assignee: Jian He -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2581) NMs need to find a way to get LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2581: Attachment: YARN-2581.2.patch NMs need to find a way to get LogAggregationContext --- Key: YARN-2581 URL: https://issues.apache.org/jira/browse/YARN-2581 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2581.1.patch, YARN-2581.2.patch After YARN-2569, we have a LogAggregationContext for the application in ApplicationSubmissionContext. NMs need to find a way to get this information. We have this requirement: all containers in the same application should honor the same LogAggregationContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2584) TestContainerManagerSecurity fails on trunk
[ https://issues.apache.org/jira/browse/YARN-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144321#comment-14144321 ] Junping Du commented on YARN-2584: -- Patch looks good to me. +1 pending on Jenkins result. TestContainerManagerSecurity fails on trunk --- Key: YARN-2584 URL: https://issues.apache.org/jira/browse/YARN-2584 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Jian He Attachments: YARN-2584.1.patch {code} Tests run: 4, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 561.964 sec FAILURE! - in org.apache.hadoop.yarn.server.TestContainerManagerSecurity testContainerManager[0](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 259.553 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) testContainerManager[1](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 258.762 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2569: Attachment: YARN-2569.5.patch Addressed all the comments Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.1.patch, YARN-2569.4.patch, YARN-2569.5.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2312: - Attachment: YARN-2312.1.patch Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch After YARN-2229, {{ContainerId#getId}} will only return a partial value of the containerId: the sequence number of the container id without the epoch. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144335#comment-14144335 ] Tsuyoshi OZAWA commented on YARN-2312: -- Attached a first patch. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch After YARN-2229, {{ContainerId#getId}} will only return a partial value of the containerId: the sequence number of the container id without the epoch. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
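A simplified sketch of the deprecation being proposed (not the attached patch) could look like this:
{code}
// Sketch only: steer callers from the lossy int accessor to the 64-bit accessor
// that keeps the epoch introduced by YARN-2229.
public abstract class ContainerIdSketch {
  /**
   * @deprecated After YARN-2229 this only carries the container sequence number
   * and drops the epoch. Use {@link #getContainerId()} instead.
   */
  @Deprecated
  public abstract int getId();

  /** Full container id: epoch in the high bits, sequence number in the low bits. */
  public abstract long getContainerId();
}
{code}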
[jira] [Updated] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2468: Attachment: YARN-2468.7.patch Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
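The rolling idea behind this JIRA can be sketched as follows; the scheduling code is purely illustrative and is not the attached patches:
{code}
// Sketch only: upload logs on a fixed interval instead of waiting for the
// application to finish, so a long-running service does not accumulate one
// ever-growing aggregated file.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class RollingLogAggregationSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start(long rollingIntervalSeconds, Runnable uploadLogsProducedSinceLastRun) {
    scheduler.scheduleAtFixedRate(uploadLogsProducedSinceLastRun,
        rollingIntervalSeconds, rollingIntervalSeconds, TimeUnit.SECONDS);
  }

  void stopOnApplicationFinish() {
    scheduler.shutdown();
  }
}
{code}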
[jira] [Commented] (YARN-2584) TestContainerManagerSecurity fails on trunk
[ https://issues.apache.org/jira/browse/YARN-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144349#comment-14144349 ] Hadoop QA commented on YARN-2584: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670603/YARN-2584.1.patch against trunk revision 7b8df93. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5079//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5079//console This message is automatically generated. TestContainerManagerSecurity fails on trunk --- Key: YARN-2584 URL: https://issues.apache.org/jira/browse/YARN-2584 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Jian He Attachments: YARN-2584.1.patch {code} Tests run: 4, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 561.964 sec FAILURE! - in org.apache.hadoop.yarn.server.TestContainerManagerSecurity testContainerManager[0](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 259.553 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) testContainerManager[1](org.apache.hadoop.yarn.server.TestContainerManagerSecurity) Time elapsed: 258.762 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:365) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:304) at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:149) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144367#comment-14144367 ] Hadoop QA commented on YARN-2569: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670606/YARN-2569.5.patch against trunk revision 7b8df93. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5080//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5080//console This message is automatically generated. Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.1.patch, YARN-2569.4.patch, YARN-2569.5.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14144381#comment-14144381 ] Zhijie Shen commented on YARN-2569: --- +1 for the latest patch. I'll leave it until tomorrow in case Vinod has further comments about it. Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch, YARN-2569.2.patch, YARN-2569.3.patch, YARN-2569.4.1.patch, YARN-2569.4.patch, YARN-2569.5.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)