[jira] [Created] (YARN-4332) UI timestamps are unconditionally rendered in browser timezone

2015-11-04 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-4332:


 Summary: UI timestamps are unconditionally rendered in browser 
timezone
 Key: YARN-4332
 URL: https://issues.apache.org/jira/browse/YARN-4332
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Jason Lowe


Timestamps are being rendered in the browser's local timezone, which makes it hard 
to line them up with events in task log files when the cluster isn't in the same 
timezone as the browser.  This either needs to be restored to UTC or at least made 
configurable.
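
As a rough illustration of why the zone matters (the UI formats dates in 
yarn.dt.plugins.js and yarn.util.Times; this standalone Java snippet is only an 
example, not that code), the same epoch timestamp reads very differently in UTC 
versus a browser's zone:
{code}
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class TimestampRendering {
  public static void main(String[] args) {
    long ts = 1365582596000L;  // example epoch millis: 10 Apr 2013 08:29:56 UTC
    DateTimeFormatter fmt = DateTimeFormatter.ofPattern("dd-MMM-yyyy HH:mm:ss zzz");
    // Rendered in UTC, matching task log files on a UTC cluster.
    System.out.println(fmt.format(Instant.ofEpochMilli(ts).atZone(ZoneId.of("UTC"))));
    // Rendered in a browser-local zone far from the cluster.
    System.out.println(fmt.format(Instant.ofEpochMilli(ts).atZone(ZoneId.of("Asia/Shanghai"))));
  }
}
{code}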



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4032) Corrupted state from a previous version can still cause RM to fail with NPE due to same reasons as YARN-2834

2015-11-04 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990399#comment-14990399
 ] 

Karthik Kambatla commented on YARN-4032:


[~jianhe]'s suggestion makes sense to me. Maybe do the following:
{code}
if (app-recovery-fails) {
  if (previous attempt is FINISHED) {
    skip this application
  } else if (fail-fast is false) {
    fail application
  } else {
    crash RM
  }
}
{code}
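
A self-contained Java sketch of that decision tree (hypothetical names such as 
{{AppRecoveryPolicy}} and {{RecoveryDecision}}; this is not the actual RMAppManager 
code, just the logic above spelled out):
{code}
public class AppRecoveryPolicy {

  enum AttemptState { FINISHED, FAILED, KILLED, RUNNING }

  enum RecoveryDecision { SKIP_APPLICATION, FAIL_APPLICATION, CRASH_RM }

  static RecoveryDecision onRecoveryFailure(AttemptState lastAttemptState,
                                            boolean failFast) {
    if (lastAttemptState == AttemptState.FINISHED) {
      // The application already completed; skipping it loses nothing.
      return RecoveryDecision.SKIP_APPLICATION;
    } else if (!failFast) {
      // Tolerate the corrupt entry: fail only this application.
      return RecoveryDecision.FAIL_APPLICATION;
    } else {
      // fail-fast is enabled: surface the corruption by stopping the RM.
      return RecoveryDecision.CRASH_RM;
    }
  }

  public static void main(String[] args) {
    System.out.println(onRecoveryFailure(AttemptState.FINISHED, true));  // SKIP_APPLICATION
    System.out.println(onRecoveryFailure(AttemptState.FAILED, false));   // FAIL_APPLICATION
    System.out.println(onRecoveryFailure(AttemptState.FAILED, true));    // CRASH_RM
  }
}
{code}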

> Corrupted state from a previous version can still cause RM to fail with NPE 
> due to same reasons as YARN-2834
> 
>
> Key: YARN-4032
> URL: https://issues.apache.org/jira/browse/YARN-4032
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Anubhav Dhoot
>Priority: Critical
> Attachments: YARN-4032.prelim.patch
>
>
> YARN-2834 ensures in 2.6.0 there will not be any inconsistent state. But if 
> someone is upgrading from a previous version, the state can still be 
> inconsistent and then RM will still fail with NPE after upgrade to 2.6.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4330) MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages

2015-11-04 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena reassigned YARN-4330:
--

Assignee: Varun Saxena

> MiniYARNCluster prints multiple  Failed to instantiate default resource 
> calculator warning messages
> ---
>
> Key: YARN-4330
> URL: https://issues.apache.org/jira/browse/YARN-4330
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, yarn
>Affects Versions: 2.8.0
> Environment: OSX, JUnit
>Reporter: Steve Loughran
>Assignee: Varun Saxena
>Priority: Blocker
>
> Whenever I try to start a MiniYARNCluster on Branch-2 (commit #0b61cca), I 
> see multiple stack traces warning me that a resource calculator plugin could 
> not be created
> {code}
> (ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) - 
> java.lang.UnsupportedOperationException: Could not determine OS: Failed to 
> instantiate default resource calculator.
> java.lang.UnsupportedOperationException: Could not determine OS
> {code}
> This is a minicluster. It doesn't need resource calculation. It certainly 
> doesn't need test logs being cluttered with even more stack traces which will 
> only generate false alarms about tests failing. 
> There needs to be a way to turn this off, and the minicluster should have it 
> that way by default.
> Being ruthless and marking as a blocker, because it's a fairly major 
> regression for anyone testing with the minicluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4314) Adding container wait time as a metric at queue level and application level.

2015-11-04 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990346#comment-14990346
 ] 

Karthik Kambatla commented on YARN-4314:


bq. I feel adding timestamp to each resource request will be costly and all the 
existing applications will need to migrate to use this metric.
I was not suggesting the AM set it. It might not be a bad idea to let the AMs 
set it optionally. 

I was thinking the RM could set this on receiving a ResourceRequest, and use it 
to determine the wait duration.
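
A minimal sketch of that RM-side idea (hypothetical {{ContainerWaitTracker}} class 
and string request keys, not actual scheduler code): record the receive time when a 
request arrives and derive the wait when a container is allocated against it.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ContainerWaitTracker {

  private final Map<String, Long> requestReceivedAt = new ConcurrentHashMap<>();

  /** Called when the RM first receives a ResourceRequest identified by requestKey. */
  public void onRequestReceived(String requestKey) {
    // Remember only the first time this request was seen.
    requestReceivedAt.putIfAbsent(requestKey, System.currentTimeMillis());
  }

  /** Called when a container is allocated against that request; returns the wait in ms. */
  public long onContainerAllocated(String requestKey) {
    Long received = requestReceivedAt.remove(requestKey);
    return received == null ? 0L : System.currentTimeMillis() - received;
  }

  public static void main(String[] args) throws InterruptedException {
    ContainerWaitTracker tracker = new ContainerWaitTracker();
    tracker.onRequestReceived("app_0001/priority-1/*");
    Thread.sleep(50);
    System.out.println("waited ms: " + tracker.onContainerAllocated("app_0001/priority-1/*"));
  }
}
{code}
The queue-level and application-level metrics could then aggregate these durations 
without requiring any change on the AM side.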

> Adding container wait time as a metric at queue level and application level.
> 
>
> Key: YARN-4314
> URL: https://issues.apache.org/jira/browse/YARN-4314
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Lavkesh Lahngir
>Assignee: Lavkesh Lahngir
>
> There is a need for adding the container wait-time which can be tracked at 
> the queue and application level. 
> An application can have two kinds of wait times. One is AM wait time after 
> submission and another is total container wait time between AM asking for 
> containers and getting them. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2015-11-04 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990565#comment-14990565
 ] 

Jason Lowe commented on YARN-4311:
--

Thanks for the patch, Kuhu!  The test failures are related; please look into them.

In general the patch seems like a reasonable approach.  There needs to be some 
way for admins to remove nodes that are no longer relevant to the cluster, and 
AFAIK there's no supported way to do this short of restarting the 
resourcemanager.  As nodes churn in and out of the cluster, they will simply 
accumulate in the decommissioned or lost nodes buckets until the next 
resourcemanager restart.

My main concern is the behavior when someone botches the include list (e.g. 
accidentally truncates the include list file and refreshes).  At that point 
all of the cluster nodes will disappear from the resourcemanager with no 
indication of what happened (except potentially the shutdown metric will 
increment by the number of nodes lost).  Today they will all go into the 
decommissioned bucket, but with this patch they'll simply disappear.  This 
either needs to be "as designed" behavior, or we'd have to implement a separate 
mechanism outside of the include/exclude lists to direct the RM to "forget" a 
node.  I believe HDFS was recently changed to behave this way as well with 
respect to the include/exclude lists and forgetting nodes (see HDFS-8950), so 
I'm inclined to be consistent with that and say it's "as designed."

Some comments on the patch itself:

isInvalidAndAbsent doesn't have the same handling of IPs as isValidNode does.

It might also be clearer if isInvalidAndAbsent were just named isUntracked or 
isUntrackedNode indicating those are nodes we aren't tracking in any way.

isInvalidAndAbsent doesn't lock hostsReader like isValidNode does.

What about refreshNodesGracefully?  That also refreshes the host 
include/exclude lists and arguably needs similar logic.  We need to discuss 
what it means to gracefully refresh the list when the node completely 
disappears from both the include and exclude list.  Should it still gracefully 
decommission, and how do we make sure that node is properly tracked?  If 
graceful, does it automatically disappear when the decommission completes since 
it's not in either list?

Nit: while looping over the nodes, if the node is valid then there's no reason 
to check whether it's invalid and absent.  So it could be simplified to the 
following:
{code}
for (NodeId nodeId : rmContext.getRMNodes().keySet()) {
  if (!isValidNode(nodeId.getHost())) {
    RMNodeEventType nodeEventType = isInvalidAndAbsent(nodeId.getHost())
        ? RMNodeEventType.SHUTDOWN : RMNodeEventType.DECOMMISSION;
    this.rmContext.getDispatcher().getEventHandler().handle(
        new RMNodeEvent(nodeId, nodeEventType));
  }
}
{code}


> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-4311-v1.patch
>
>
> In order to fully forget about a node, removing the node from include and 
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The 
> tricky part that [~jlowe] pointed out was the case when include lists are not 
> used, in that case we don't want the nodes to fall off if they are not active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-570) Time strings are formated in different timezone

2015-11-04 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990603#comment-14990603
 ] 

Jason Lowe commented on YARN-570:
-

So this changed timestamps to be rendered unconditionally in local time?  
That's unfortunate.  See [~aw]'s comment in 
https://issues.apache.org/jira/browse/YARN-2348?focusedCommentId=14073218=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14073218.
  Unfortunately the local timezone isn't always the right thing to use, because the 
timestamps in the task log files will _not_ be in the browser's local timezone when 
running jobs in a distant colo.  So this change makes it worse for users who are 
lining up events they see in the UI with what they see in the logs.

At a minimum this should have been configurable.  Filed YARN-4332 to either revert 
this change or make it configurable.

> Time strings are formated in different timezone
> ---
>
> Key: YARN-570
> URL: https://issues.apache.org/jira/browse/YARN-570
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.2.0
>Reporter: Peng Zhang
>Assignee: Akira AJISAKA
> Fix For: 2.7.0
>
> Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch, 
> YARN-570.3.patch, YARN-570.4.patch, YARN-570.5.patch
>
>
> Time strings on different pages are displayed in different timezones.
> If it is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as 
> "Wed, 10 Apr 2013 08:29:56 GMT"
> If it is formatted by format() in yarn.util.Times, it appears as "10-Apr-2013 
> 16:29:56"
> Same value, but different timezone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2885) LocalRM: distributed scheduling decisions for queueable containers

2015-11-04 Thread Arun Suresh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun Suresh reassigned YARN-2885:
-

Assignee: Arun Suresh

> LocalRM: distributed scheduling decisions for queueable containers
> --
>
> Key: YARN-2885
> URL: https://issues.apache.org/jira/browse/YARN-2885
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Arun Suresh
>
> We propose to add a Local ResourceManager (LocalRM) to the NM in order to 
> support distributed scheduling decisions. 
> Architecturally we leverage the RMProxy, introduced in YARN-2884. 
> The LocalRM makes distributed decisions for queueable container requests. 
> Guaranteed-start requests are still handled by the central RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2882) Introducing container types

2015-11-04 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990881#comment-14990881
 ] 

Arun Suresh commented on YARN-2882:
---

Thanks for the patch [~kkaranasos]. The patch looks mostly good. A few minor nits:
# I feel that instead of adding another *newInstance* method to the 
*ResourceRequest* class, maybe we could replace it with some sort of builder 
pattern, e.g. something like the following (a rough sketch also appears after 
this comment):
{noformat}
ResourceRequest req = new 
ResourceRequestBuilder().setPriority(pri).setHostName(hostname).setContainerType(QUEUEABLE)...build();
{noformat}
(I understand this might impact other parts of the code, but I believe it would 
make things more extensible in the future.)
# In the *yarn_protos.proto* file, can we add *container_type* after the 
*node_label_expression* field? (I feel newer fields should come later.)

Also, it looks like the patch no longer applies cleanly; can you please rebase?
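
A standalone sketch of the builder idea (hypothetical {{Builder}} and 
{{SimpleResourceRequest}} classes used purely for illustration; the real 
ResourceRequest and ContainerType APIs are not reproduced here):
{code}
public class ResourceRequestBuilderExample {

  enum ContainerType { GUARANTEED_START, QUEUEABLE }

  static final class SimpleResourceRequest {
    final int priority;
    final String hostName;
    final ContainerType containerType;

    private SimpleResourceRequest(Builder b) {
      this.priority = b.priority;
      this.hostName = b.hostName;
      this.containerType = b.containerType;
    }
  }

  static final class Builder {
    private int priority;
    private String hostName = "*";  // "any host" by default
    private ContainerType containerType = ContainerType.GUARANTEED_START;

    Builder setPriority(int priority) { this.priority = priority; return this; }
    Builder setHostName(String hostName) { this.hostName = hostName; return this; }
    Builder setContainerType(ContainerType type) { this.containerType = type; return this; }
    SimpleResourceRequest build() { return new SimpleResourceRequest(this); }
  }

  public static void main(String[] args) {
    SimpleResourceRequest req = new Builder()
        .setPriority(1)
        .setHostName("host-1")
        .setContainerType(ContainerType.QUEUEABLE)
        .build();
    System.out.println(req.hostName + " " + req.containerType);
  }
}
{code}
The advantage over adding yet another *newInstance* overload is that new optional 
fields (like *container_type*) can be added later without breaking existing callers.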

> Introducing container types
> ---
>
> Key: YARN-2882
> URL: https://issues.apache.org/jira/browse/YARN-2882
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Konstantinos Karanasos
> Attachments: yarn-2882.patch
>
>
> This JIRA introduces the notion of container types.
> We propose two initial types of containers: guaranteed-start and queueable 
> containers.
> Guaranteed-start containers are the existing containers, which are allocated by 
> the central RM and started immediately once allocated.
> Queueable is a new type of container that can be queued in the NM, so its 
> execution may be arbitrarily delayed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3223) Resource update during NM graceful decommission

2015-11-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990702#comment-14990702
 ] 

Hadoop QA commented on YARN-3223:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 6s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 38s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 24s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
59s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
28s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
44s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 41s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
45s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 27s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 27s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 25s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 25s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 56s 
{color} | {color:red} Patch generated 1 new checkstyle issues in root (total 
was 242, now 242). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
28s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
56s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 0s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 50s 
{color} | {color:green} hadoop-sls in the patch passed with JDK v1.8.0_60. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 59m 22s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 53s 
{color} | {color:green} hadoop-sls in the patch passed with JDK v1.7.0_79. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
24s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 151m 51s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_79 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.7.1 Server=1.7.1 
Image:test-patch-base-hadoop-date2015-11-04 |
| JIRA Patch URL | 

[jira] [Created] (YARN-4333) Fair scheduler should support preemption within queue

2015-11-04 Thread Tao Jie (JIRA)
Tao Jie created YARN-4333:
-

 Summary: Fair scheduler should support preemption within queue
 Key: YARN-4333
 URL: https://issues.apache.org/jira/browse/YARN-4333
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: Tao Jie


Each app in the fair scheduler is allocated its fair share, however that fair-share 
resource is not guaranteed even if fairSharePreemption is enabled.
Consider:
1. When the cluster is idle, we submit app1 to queueA, and it takes the maxResource 
of queueA.
2. Then the cluster becomes busy, but app1 does not release any resources, so 
queueA's resource usage stays over its fair share.
3. Then we submit app2 (maybe with higher priority) to queueA. app2 now has its own 
fair share but cannot obtain any resources, since queueA is still over its fair 
share and no more resources will be assigned to it. Preemption is not triggered in 
this case either.
So we should allow preemption within a queue when an app is starved for its fair 
share.
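
An illustrative starvation check along these lines (a sketch only, not FairScheduler 
code; the names and thresholds are made up): an app would count as starved for fair 
share when its usage stays below its fair share past a preemption timeout, even 
though its queue as a whole is over fair share.
{code}
public class FairShareStarvation {

  static boolean isStarvedForFairShare(long usedMb, long fairShareMb,
                                       long belowFairShareSinceMs, long nowMs,
                                       long fairSharePreemptionTimeoutMs) {
    // Starved: running below fair share for longer than the preemption timeout.
    return usedMb < fairShareMb
        && (nowMs - belowFairShareSinceMs) > fairSharePreemptionTimeoutMs;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    // app2 has held nothing for 10 minutes against a 4 GB fair share, with a
    // 5 minute timeout -> starved, so intra-queue preemption should be allowed
    // to reclaim resources from app1 in the same queue.
    System.out.println(isStarvedForFairShare(0, 4096, now - 600_000L, now, 300_000L));
  }
}
{code}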



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2015-11-04 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991069#comment-14991069
 ] 

Bikas Saha commented on YARN-2047:
--

From the description it seems like the original scope was making sure that a lost 
NM's containers are marked expired by the RM even across an RM restart. For that, 
won't it be enough to save the dead/decommissioned NM info in the state store? Upon 
restart, repopulate the decommissioned/dead status from the state store. The RM can 
take appropriate action at that time - e.g. cancelling the AM containers for those 
NMs when the AM re-registers, or asking those NMs to restart and re-register if 
they heartbeat again.


If this is a required action then it would also imply that saving such nodes would 
be a critical state-change operation. So, e.g., a decommission command from the 
admin should not complete until the store has been updated. Is that the case?
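
A tiny sketch of that ordering constraint (hypothetical {{NodeStateStore}} interface, 
purely illustrative; the real RM state-store API is not shown): the decommission call 
only returns after the node's status has been persisted.
{code}
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DecommissionFlow {

  interface NodeStateStore {
    void storeDecommissionedNode(String nodeId) throws IOException;
  }

  static final class InMemoryStore implements NodeStateStore {
    final Map<String, Boolean> decommissioned = new ConcurrentHashMap<>();
    public void storeDecommissionedNode(String nodeId) {
      decommissioned.put(nodeId, Boolean.TRUE);
    }
  }

  /** Persist the critical state change first, then update the live view. */
  static void decommission(NodeStateStore store, Map<String, String> liveNodes,
                           String nodeId) throws IOException {
    store.storeDecommissionedNode(nodeId);  // must succeed before we report success
    liveNodes.remove(nodeId);
  }

  public static void main(String[] args) throws IOException {
    InMemoryStore store = new InMemoryStore();
    Map<String, String> live = new ConcurrentHashMap<>();
    live.put("nm-1:45454", "RUNNING");
    decommission(store, live, "nm-1:45454");
    System.out.println(store.decommissioned + " " + live);  // {nm-1:45454=true} {}
  }
}
{code}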

> RM should honor NM heartbeat expiry after RM restart
> 
>
> Key: YARN-2047
> URL: https://issues.apache.org/jira/browse/YARN-2047
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>
> After the RM restarts, it forgets about existing NM's (and their potentially 
> decommissioned status too). After restart, the RM cannot maintain the 
> contract to the AM's that a lost NM's containers will be marked finished 
> within the expiry time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4292) ResourceUtilization should be a part of NodeInfo REST API

2015-11-04 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991253#comment-14991253
 ] 

Sunil G commented on YARN-4292:
---

The test case failures seem unrelated.

> ResourceUtilization should be a part of NodeInfo REST API
> -
>
> Key: YARN-4292
> URL: https://issues.apache.org/jira/browse/YARN-4292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Sunil G
> Attachments: 0001-YARN-4292.patch, 0002-YARN-4292.patch, 
> 0003-YARN-4292.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2934) Improve handling of container's stderr

2015-11-04 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991260#comment-14991260
 ] 

Naganarasimha G R commented on YARN-2934:
-

Can one of the watchers please take a look at the patch?


> Improve handling of container's stderr 
> ---
>
> Key: YARN-2934
> URL: https://issues.apache.org/jira/browse/YARN-2934
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Gera Shegalov
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-2934.v1.001.patch, YARN-2934.v1.002.patch, 
> YARN-2934.v1.003.patch
>
>
> Most YARN applications redirect stderr to some file. That's why when 
> container launch fails with {{ExitCodeException}} the message is empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-11-04 Thread Inigo Goiri (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Inigo Goiri updated YARN-3980:
--
Attachment: YARN-3980-v4.patch

Adding utilization to the FIFO and Fair schedulers.

> Plumb resource-utilization info in node heartbeat through to the scheduler
> --
>
> Key: YARN-3980
> URL: https://issues.apache.org/jira/browse/YARN-3980
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.7.1
>Reporter: Karthik Kambatla
>Assignee: Inigo Goiri
> Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch, 
> YARN-3980-v2.patch, YARN-3980-v3.patch, YARN-3980-v4.patch
>
>
> YARN-1012 and YARN-3534 collect resource utilization information for all 
> containers and the node respectively and send it to the RM on node heartbeat. 
> We should plumb it through to the scheduler so the scheduler can make use of 
> it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-11-04 Thread Inigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990284#comment-14990284
 ] 

Inigo Goiri commented on YARN-3980:
---

I added the info to the FIFO and Fair schedulers.
I'm rerunning the checks because I couldn't figure out the errors in the Javadoc 
and the report is no longer available.

Regarding the test, we will add a unit test with the mini cluster.

> Plumb resource-utilization info in node heartbeat through to the scheduler
> --
>
> Key: YARN-3980
> URL: https://issues.apache.org/jira/browse/YARN-3980
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.7.1
>Reporter: Karthik Kambatla
>Assignee: Inigo Goiri
> Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch, 
> YARN-3980-v2.patch, YARN-3980-v3.patch, YARN-3980-v4.patch
>
>
> YARN-1012 and YARN-3534 collect resource utilization information for all 
> containers and the node respectively and send it to the RM on node heartbeat. 
> We should plumb it through to the scheduler so the scheduler can make use of 
> it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4292) ResourceUtilization should be a part of NodeInfo REST API

2015-11-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990210#comment-14990210
 ] 

Hadoop QA commented on YARN-4292:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
27s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 0s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 35s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
4s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
28s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
51s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 37s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 43s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
49s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 40s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 40s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 34s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 34s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 4s 
{color} | {color:red} Patch generated 14 new checkstyle issues in root (total 
was 127, now 141). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
10s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 43s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 41s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 50s 
{color} | {color:green} hadoop-sls in the patch passed with JDK v1.8.0_60. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 59m 3s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 53s 
{color} | {color:green} hadoop-sls in the patch passed with JDK v1.7.0_79. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
24s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 154m 24s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| JDK v1.7.0_79 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.7.1 Server=1.7.1 
Image:test-patch-base-hadoop-date2015-11-04 |
| JIRA Patch URL | 

[jira] [Updated] (YARN-4331) Restarting NodeManager leaves orphaned containers

2015-11-04 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-4331:
-
Summary: Restarting NodeManager leaves orphaned containers  (was: Killing 
NodeManager leaves orphaned containers)

Note that killing the nodemanager itself with SIGKILL should not, by itself, cause 
the containers to be killed.  Instead the problem seems to be that when the 
nodemanager restarts it is either failing to reacquire the containers that were 
running, or it reacquires them and the RM fails to tell the NM to kill them when it 
re-registers.  Updating the summary accordingly.  Also, by "the AM and its 
container" I assume you mean the application master and some other container that 
the AM launched.  Please correct me if I'm wrong.

Is work-preserving nodemanager restart enabled on this cluster?  Without it 
nodemanagers cannot track containers that were previously running, so it will 
not be able to reacquire them and kill them.  If they don't exit on their own 
then they will "leak" and continue running outside of YARN's knowledge.  If 
that feature is not enabled on the nodemanager then this behavior is expected, 
since killing it with SIGKILL gave the nodemanager no chance to perform any 
container cleanup on its own.

If restart is enabled on the nodemanager then this behavior could be correct if 
the running application told the RM that its containers should not be killed when 
AM attempts fail.  In that case the container should be left running and it's up 
to the AM to reacquire it via some means.  (I believe the RM does provide a bit 
of help there in the AM-RM protocol.)

If the containers were supposed to be killed when the AM attempt failed then we 
need to figure out which of the two possibilities above is the problem.  Could 
you look in the NM logs and see if it said it was able to reacquire the 
previously running containers before it was killed?  If it didn't then we need 
to figure out why, and log snippets around the restart/recovery would be a big 
help.  If it did reacquire the containers and register to the RM with those 
containers then apparently the RM didn't tell the NM to kill the undesired 
containers.  In that case the log from the RM side around the time the NM 
re-registered would be helpful.

> Restarting NodeManager leaves orphaned containers
> -
>
> Key: YARN-4331
> URL: https://issues.apache.org/jira/browse/YARN-4331
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.7.1
>Reporter: Joseph
>Priority: Critical
>
> We are seeing a lot of orphaned containers running in our production clusters.
> I tried to simulate this locally on my machine and can replicate the issue by 
> killing nodemanager.
> I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza 
> jobs.
> Steps:
> {quote}1. Deploy a job 
> 2. Issue a kill -9 signal to nodemanager 
> 3. We should see the AM and its container running without nodemanager
> 4. AM should die but the container still keeps running
> 5. Restarting nodemanager brings up new AM and container but leaves the 
> orphaned container running in the background
> {quote}
> This is effectively causing double processing of data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-04 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989637#comment-14989637
 ] 

MENG DING commented on YARN-1510:
-

I just ran these tests locally with latest trunk and YARN-1510 applied, and 
they all passed:

{code}
---
 T E S T S
---

---
 T E S T S
---
Running org.apache.hadoop.yarn.client.TestGetGroups
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.886 sec - in 
org.apache.hadoop.yarn.client.TestGetGroups
Running org.apache.hadoop.yarn.client.api.impl.TestYarnClient
Tests run: 22, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 27.187 sec - 
in org.apache.hadoop.yarn.client.api.impl.TestYarnClient

Results :

Tests run: 28, Failures: 0, Errors: 0, Skipped: 0
{code}

> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197 and YARN-1449, we need to add APIs in NMClient to support
> 1) sending requests to increase/decrease container resource limits
> 2) getting the succeeded/failed changed-container responses from the NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1510) Make NMClient support change container resources

2015-11-04 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989648#comment-14989648
 ] 

MENG DING commented on YARN-1510:
-

Also ran the following tests, they passed:

{code}
---
 T E S T S
---
Running org.apache.hadoop.yarn.client.api.impl.TestAMRMClient
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 51.842 sec - 
in org.apache.hadoop.yarn.client.api.impl.TestAMRMClient
Running org.apache.hadoop.yarn.client.api.impl.TestNMClient
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 73.733 sec - in 
org.apache.hadoop.yarn.client.api.impl.TestNMClient

Results :

Tests run: 12, Failures: 0, Errors: 0, Skipped: 0

{code}


> Make NMClient support change container resources
> 
>
> Key: YARN-1510
> URL: https://issues.apache.org/jira/browse/YARN-1510
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Wangda Tan (No longer used)
>Assignee: MENG DING
> Attachments: YARN-1510-YARN-1197.1.patch, 
> YARN-1510-YARN-1197.2.patch, YARN-1510.3.patch, YARN-1510.4.patch, 
> YARN-1510.5.patch, YARN-1510.6.patch, YARN-1510.7.patch
>
>
> As described in YARN-1197 and YARN-1449, we need to add APIs in NMClient to support
> 1) sending requests to increase/decrease container resource limits
> 2) getting the succeeded/failed changed-container responses from the NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler

2015-11-04 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989664#comment-14989664
 ] 

Karthik Kambatla commented on YARN-3980:


Thanks Inigo. Looks generally good. Comments:
# javadoc error in SchedulerNode
# Can we make the necessary changes to FairScheduler and FifoScheduler as well? 
It should be similar and straightforward.

Not sure if there is a simple way to test this. Can we leverage the SLS to 
verify the node utilization passed in the heartbeat shows up in the scheduler? 
If not, I am comfortable with checking this in without a test. 

> Plumb resource-utilization info in node heartbeat through to the scheduler
> --
>
> Key: YARN-3980
> URL: https://issues.apache.org/jira/browse/YARN-3980
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.7.1
>Reporter: Karthik Kambatla
>Assignee: Inigo Goiri
> Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch, 
> YARN-3980-v2.patch, YARN-3980-v3.patch
>
>
> YARN-1012 and YARN-3534 collect resource utilization information for all 
> containers and the node respectively and send it to the RM on node heartbeat. 
> We should plumb it through to the scheduler so the scheduler can make use of 
> it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2015-11-04 Thread Kuhu Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989697#comment-14989697
 ] 

Kuhu Shukla commented on YARN-4311:
---

[~jlowe], [~leftnoteasy], request for comments. Thanks a lot.

> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
> Attachments: YARN-4311-v1.patch
>
>
> In order to fully forget about a node, removing the node from include and 
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The 
> tricky part that [~jlowe] pointed out was the case when include lists are not 
> used, in that case we don't want the nodes to fall off if they are not active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2015-11-04 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989812#comment-14989812
 ] 

Jun Gong commented on YARN-2047:


For case 1, the RM could save dead NMs in the StateStore; when such an NM registers 
with containers, the RM could tell the NM to kill those containers.

> RM should honor NM heartbeat expiry after RM restart
> 
>
> Key: YARN-2047
> URL: https://issues.apache.org/jira/browse/YARN-2047
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>
> After the RM restarts, it forgets about existing NM's (and their potentially 
> decommissioned status too). After restart, the RM cannot maintain the 
> contract to the AM's that a lost NM's containers will be marked finished 
> within the expiry time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4292) ResourceUtilization should be a part of NodeInfo REST API

2015-11-04 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-4292:
--
Attachment: 0003-YARN-4292.patch

Thank you, [~leftnoteasy].
Yes, it's better to have a separate class for the resourceUtilization details. 
Kindly help to check the updated patch.

> ResourceUtilization should be a part of NodeInfo REST API
> -
>
> Key: YARN-4292
> URL: https://issues.apache.org/jira/browse/YARN-4292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Sunil G
> Attachments: 0001-YARN-4292.patch, 0002-YARN-4292.patch, 
> 0003-YARN-4292.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2015-11-04 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989769#comment-14989769
 ] 

Jun Gong commented on YARN-2047:


I think we could list the cases which cause the problem in this issue:

1. When the RM restarts, the NM stops and cannot restart (e.g. the server is down 
forever).
To deal with this case, the RM might need to save information about NMs and their 
containers, which might not be acceptable as discussed in YARN-3161.

2. The NM stops; after some time, RM1 regards it as dead and completes the 
containers on it; RM1 stops and RM2 becomes the active RM. Then the NM restarts. 
Those containers will become live again when the NM registers them with RM2.
This case is more common than the one above, and we need to solve it. How about 
solving the problem on the NM side? My proposal: add a timestamp to the NMStateStore 
and update it regularly. When the NM restarts, it checks the current time against 
the last updated timestamp; from that it can tell whether the RM has regarded it as 
dead, and it kills its containers if so.

If the proposal in case 2 is OK, I could attach a patch.
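
A minimal sketch of the NM-side check (hypothetical {{LivenessStore}} interface and a 
made-up expiry constant; not the real NMStateStore API):
{code}
public class NmLivenessCheck {

  /** Assumed expiry interval after which the RM would consider this NM dead. */
  static final long NM_EXPIRY_INTERVAL_MS = 10 * 60 * 1000L;

  interface LivenessStore {
    long loadLastHeartbeatTimestamp();      // persisted periodically while running
    void storeHeartbeatTimestamp(long ts);
  }

  /** True if the NM was down long enough that the RM would have expired it. */
  static boolean shouldKillRecoveredContainers(LivenessStore store, long now) {
    long last = store.loadLastHeartbeatTimestamp();
    return last > 0 && (now - last) > NM_EXPIRY_INTERVAL_MS;
  }

  public static void main(String[] args) {
    LivenessStore store = new LivenessStore() {
      long ts = System.currentTimeMillis() - 15 * 60 * 1000L;  // last update 15 min ago
      public long loadLastHeartbeatTimestamp() { return ts; }
      public void storeHeartbeatTimestamp(long t) { ts = t; }
    };
    // Down longer than the expiry interval -> kill the recovered containers.
    System.out.println(shouldKillRecoveredContainers(store, System.currentTimeMillis()));
  }
}
{code}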

> RM should honor NM heartbeat expiry after RM restart
> 
>
> Key: YARN-2047
> URL: https://issues.apache.org/jira/browse/YARN-2047
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>
> After the RM restarts, it forgets about existing NM's (and their potentially 
> decommissioned status too). After restart, the RM cannot maintain the 
> contract to the AM's that a lost NM's containers will be marked finished 
> within the expiry time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4330) MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages

2015-11-04 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990047#comment-14990047
 ] 

Steve Loughran commented on YARN-4330:
--

Looks like YARN-3534 triggered this. 

Full Stack: note the sheer number of repeated traces

{code}
Projects/slider/slider-core/target/teststandalonerest/teststandalonerest-logDir-nm-0_0
2015-11-04 17:49:31,322 [Thread-2] INFO  server.MiniYARNCluster 
(MiniYARNCluster.java:serviceInit(540)) - Starting NM: 0
2015-11-04 17:49:31,383 [Thread-2] INFO  nodemanager.NodeManager 
(NodeManager.java:getNodeHealthScriptRunner(255)) - Node Manager health check 
script is not available or doesn't have execute permission, so not starting the 
node health script runner.
2015-11-04 17:49:31,469 [Thread-2] WARN  util.ResourceCalculatorPlugin 
(ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) - 
java.lang.UnsupportedOperationException: Could not determine OS: Failed to 
instantiate default resource calculator.
java.lang.UnsupportedOperationException: Could not determine OS
at org.apache.hadoop.util.SysInfo.newInstance(SysInfo.java:43)
at 
org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.(ResourceCalculatorPlugin.java:41)
at 
org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getResourceCalculatorPlugin(ResourceCalculatorPlugin.java:182)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl.serviceInit(NodeResourceMonitorImpl.java:73)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:356)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:541)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:273)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.service.Service$init.call(Unknown Source)
at 
org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:45)
at 
org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
at 
org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:120)
at 
org.apache.slider.test.YarnMiniClusterTestBase.createMiniCluster(YarnMiniClusterTestBase.groovy:291)
at 
org.apache.slider.test.YarnZKMiniClusterTestBase.createMiniCluster(YarnZKMiniClusterTestBase.groovy:110)
at 
org.apache.slider.test.YarnZKMiniClusterTestBase.createMiniCluster(YarnZKMiniClusterTestBase.groovy:127)
at 
org.apache.slider.agent.rest.TestStandaloneREST.testStandaloneREST(TestStandaloneREST.groovy:52)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
2015-11-04 17:49:31,472 [Thread-2] INFO  nodemanager.NodeResourceMonitorImpl 
(NodeResourceMonitorImpl.java:serviceInit(76)) -  Using 
ResourceCalculatorPlugin : null
2015-11-04 17:49:31,475 [Thread-2] INFO  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:serviceInit(261)) - AMRMProxyService is disabled
2015-11-04 17:49:31,475 [Thread-2] INFO  localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:validateConf(224)) - per directory file limit 
= 8192
2015-11-04 17:49:31,549 [Thread-2] WARN  util.ResourceCalculatorPlugin 
(ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) - 
java.lang.UnsupportedOperationException: Could not determine OS: Failed to 
instantiate default resource calculator.
java.lang.UnsupportedOperationException: Could not determine OS

[jira] [Commented] (YARN-4330) MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages

2015-11-04 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990051#comment-14990051
 ] 

Steve Loughran commented on YARN-4330:
--

As well as having a way to turn this feature off for miniclusters, the code 
trying to instantiate the resource calculator should recognise the failure and 
fall back, rather than retry. Retrying isn't going to fix this.
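
A sketch of that "fail once, then fall back" idea (illustrative {{CalculatorFactory}} 
class; not the actual ResourceCalculatorPlugin code):
{code}
public class CalculatorFactory {

  interface Calculator { long physicalMemoryBytes(); }

  // Once the platform is known to be unsupported, remember it and stop retrying.
  private static volatile boolean unsupported = false;

  static Calculator getOrNull() {
    if (unsupported) {
      return null;                        // already failed once; don't retry or re-log
    }
    try {
      return createPlatformCalculator();
    } catch (UnsupportedOperationException e) {
      unsupported = true;                 // log once upstream, then fall back to null
      return null;
    }
  }

  static Calculator createPlatformCalculator() {
    // Stand-in for OS detection; on an unknown OS the real code throws.
    throw new UnsupportedOperationException("Could not determine OS");
  }

  public static void main(String[] args) {
    System.out.println(getOrNull());  // null, failure recorded
    System.out.println(getOrNull());  // still null, no second instantiation attempt
  }
}
{code}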

> MiniYARNCluster prints multiple  Failed to instantiate default resource 
> calculator warning messages
> ---
>
> Key: YARN-4330
> URL: https://issues.apache.org/jira/browse/YARN-4330
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, yarn
>Affects Versions: 2.8.0
> Environment: OSX, JUnit
>Reporter: Steve Loughran
>Priority: Blocker
>
> Whenever I try to start a MiniYARNCluster on Branch-2 (commit #0b61cca), I 
> see multiple stack traces warning me that a resource calculator plugin could 
> not be created
> {code}
> (ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) - 
> java.lang.UnsupportedOperationException: Could not determine OS: Failed to 
> instantiate default resource calculator.
> java.lang.UnsupportedOperationException: Could not determine OS
> {code}
> This is a minicluster. It doesn't need resource calculation. It certainly 
> doesn't need test logs being cluttered with even more stack traces which will 
> only generate false alarms about tests failing. 
> There needs to be a way to turn this off, and the minicluster should have it 
> that way by default.
> Being ruthless and marking as a blocker, because it's a fairly major 
> regression for anyone testing with the minicluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4331) Killing NodeManager leaves orphaned containers

2015-11-04 Thread Joseph (JIRA)
Joseph created YARN-4331:


 Summary: Killing NodeManager leaves orphaned containers
 Key: YARN-4331
 URL: https://issues.apache.org/jira/browse/YARN-4331
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, yarn
Affects Versions: 2.7.1
Reporter: Joseph
Priority: Critical


We are seeing a lot of orphaned containers running in our production clusters.
I tried to simulate this locally on my machine and can replicate the issue by 
killing nodemanager.
I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza 
jobs.
Steps:
1. Deploy a job 
2. Issue a kill -9 signal to nodemanager 
3. We should see the AM and its container running without nodemanager
4. AM should die but the container still keeps running
5. Restarting nodemanager brings up new AM and container but leaves the 
orphaned container running in the background

This is effectively causing double processing of data.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4330) MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages

2015-11-04 Thread Steve Loughran (JIRA)
Steve Loughran created YARN-4330:


 Summary: MiniYARNCluster prints multiple  Failed to instantiate 
default resource calculator warning messages
 Key: YARN-4330
 URL: https://issues.apache.org/jira/browse/YARN-4330
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test, yarn
Affects Versions: 2.8.0
 Environment: OSX, JUnit
Reporter: Steve Loughran
Priority: Blocker


Whenever I try to start a MiniYARNCluster on Branch-2 (commit #0b61cca), I see 
multiple stack traces warning me that a resource calculator plugin could not be 
created

{code}
(ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) - 
java.lang.UnsupportedOperationException: Could not determine OS: Failed to 
instantiate default resource calculator.
java.lang.UnsupportedOperationException: Could not determine OS
{code}

This is a minicluster. It doesn't need resource calculation. It certainly 
doesn't need test logs being cluttered with even more stack traces which will 
only generate false alarms about tests failing. 

There needs to be a way to turn this off, and the minicluster should have it 
that way by default.

Being ruthless and marking as a blocker, because it's a fairly major regression 
for anyone testing with the minicluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-11-04 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: YARN-3223-v2.patch

Updated the patch based on feedback. The checkstyle error about the 
CapacityScheduler.java file length is still there.

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.patch, YARN-3223-v1.patch, 
> YARN-3223-v2.patch
>
>
> During NM graceful decommission, we should handle resource updates properly, 
> including: make RMNode keep track of the old resource for possible rollback, keep 
> available resource at 0, and have used resource updated when containers finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4331) Killing NodeManager leaves orphaned containers

2015-11-04 Thread Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph updated YARN-4331:
-
Description: 
We are seeing a lot of orphaned containers running in our production clusters.
I tried to simulate this locally on my machine and can replicate the issue by 
killing nodemanager.
I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza 
jobs.
Steps:
{quote}1. Deploy a job 
2. Issue a kill -9 signal to nodemanager 
3. We should see the AM and its container running without nodemanager
4. AM should die but the container still keeps running
5. Restarting nodemanager brings up new AM and container but leaves the 
orphaned container running in the background
{quote}
This is effectively causing double processing of data.


  was:
We are seeing a lot of orphaned containers running in our production clusters.
I tried to simulate this locally on my machine and can replicate the issue by 
killing nodemanager.
I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza 
jobs.
Steps:
1. Deploy a job 
2. Issue a kill -9 signal to nodemanager 
3. We should see the AM and its container running without nodemanager
4. AM should die but the container still keeps running
5. Restarting nodemanager brings up new AM and container but leaves the 
orphaned container running in the background

This is effectively causing double processing of data.



> Killing NodeManager leaves orphaned containers
> --
>
> Key: YARN-4331
> URL: https://issues.apache.org/jira/browse/YARN-4331
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.7.1
>Reporter: Joseph
>Priority: Critical
>
> We are seeing a lot of orphaned containers running in our production clusters.
> I tried to simulate this locally on my machine and can replicate the issue by 
> killing nodemanager.
> I'm running Yarn 2.7.1 with RM state stored in zookeeper and deploying samza 
> jobs.
> Steps:
> {quote}1. Deploy a job 
> 2. Issue a kill -9 signal to nodemanager 
> 3. We should see the AM and its container running without nodemanager
> 4. AM should die but the container still keeps running
> 5. Restarting nodemanager brings up new AM and container but leaves the 
> orphaned container running in the background
> {quote}
> This is effectively causing double processing of data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS

2015-11-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989338#comment-14989338
 ] 

Hadoop QA commented on YARN-3432:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s 
{color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
11s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 8s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s 
{color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
27s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
11s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
18s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 1s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 59m 1s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
22s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 128m 42s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| JDK v1.7.0_79 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.7.1 Server=1.7.1 
Image:test-patch-base-hadoop-date2015-11-04 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12737159/YARN-3432-002.patch |
| JIRA Issue | YARN-3432 |
| Optional Tests |  asflicense  javac  javadoc  mvninstall  unit  findbugs  
checkstyle  compile  |
| uname | Linux 75c368b9f110 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3