[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-29 Thread q79969786 (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

q79969786 updated YARN-2198:

Description: 
YARN-1972 introduces a Secure Windows Container Executor. However, this executor 
requires the process launching the container to be LocalSystem or a member of 
the local Administrators group. Since the process in question is the NodeManager, 
this requirement means the entire NM must run as a privileged account, a very 
large surface area to review and protect.

This proposal is to move the privileged operations into a dedicated NT service. 
The NM can run as a low privilege account and communicate with the privileged 
NT service when it needs to launch a container. This would reduce the surface 
exposed to the high privileges. 

There has to exist a secure, authenticated and authorized channel of 
communication between the NM and the privileged NT service. Possible 
alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be 
to use Windows LPC (Local Procedure Calls), which is a Windows platform 
specific inter-process communication channel that satisfies all requirements 
and is easy to deploy. The privileged NT service would register and listen on 
an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with 
libwinutils which would host the LPC client code. The client would connect to 
the LPC port (NtConnectPort) and send a message requesting a container launch 
(NtRequestWaitReplyPort). LPC provides authentication and the privileged NT 
service can use authorization API (AuthZ) to validate the caller.
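
For illustration only, here is a minimal sketch of what the NM-side JNI surface could look like; the class name, method name, and library name below are assumptions, not the actual patch, and the real LPC calls (NtConnectPort, NtRequestWaitReplyPort) would live in native code inside libwinutils.
{code}
// Hypothetical sketch only: class, method and library names are illustrative
// assumptions, not the actual YARN-2198 change.
public class WinutilsLpcClient {
  static {
    // libwinutils would host the native LPC client (NtConnectPort,
    // NtRequestWaitReplyPort); the NM only sees this JNI surface.
    System.loadLibrary("winutils");
  }

  /**
   * Connects to the privileged NT service's LPC port and asks it to launch a
   * container on behalf of the low-privilege NodeManager.
   */
  public static native int launchContainer(String user, String containerId,
      String commandLine, String workDir);
}
{code}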

  was:
YARN-1972 introduces a Secure Windows Container Executor. However this executor 
requires a the process launching the container to be LocalSystem or a member of 
the a local Administrators group. Since the process in question is the 
NodeManager, the requirement translates to the entire NM to run as a privileged 
account, a very large surface area to review and protect.

This proposal is to move the privileged operations into a dedicated NT service. 
The NM can run as a low privilege account and communicate with the privileged 
NT service when it needs to launch a container. This would reduce the surface 
exposed to the high privileges. 

There has to exist a secure, authenticated and authorized channel of 
communication between the NM and the privileged NT service. Possible 
alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be 
to use Windows LPC (Local Procedure Calls), which is a Windows platform 
specific inter-process communication channel that satisfies all requirements 
and is easy to deploy. The privileged NT service would register and listen on 
an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with 
libwinutils which would host the LPC client code. The client would connect to 
the LPC port (NtConnectPort) and send a message requesting a container launch 
(NtRequestWaitReplyPort). LPC provides authentication and the privileged NT 
service can use authorization API (AuthZ) to validate the caller.


> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
> YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
> YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
> YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
> YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
> YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However, this 
> executor requires the process launching the container to be LocalSystem or a 
> member of the local Administrators group. Since the process in question is 
> the NodeManager, this requirement means the entire NM must run as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Wi

[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-29 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151466#comment-14151466
 ] 

Jun Gong commented on YARN-2617:


[~jianhe], thank you for the review!

{quote}
I think we should explicitly check if apps are at 
FINISHING_CONTAINERS_WAIT/APPLICATION_RESOURCES_CLEANINGUP/FINISHED state. 
{quote}
My concern is that we will need to modify this code whenever we add a new state 
to ApplicationImpl. That will be OK if it is not a problem. BTW, is there any 
case where an app has containers but is not in the RUNNING state?

{quote}
The code needs to be moved inside the following check {{ if 
(containerStatus.getState().equals(ContainerState.COMPLETE))}} ...
{quote}
OK. I will change it.

And I will add a unit test.
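
As a minimal sketch of the check being discussed (a hypothetical helper for illustration, not the actual patch): only containers already in COMPLETE state whose application the NM no longer tracks would be skipped.
{code}
import java.util.Set;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ContainerState;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

// Hypothetical helper, for illustration only; not the YARN-2617 patch itself.
public final class FinishedContainerFilter {
  private FinishedContainerFilter() {}

  /** True if this completed container belongs to an app the NM no longer tracks. */
  public static boolean shouldSkipReporting(ContainerStatus status,
      ApplicationId appId, Set<ApplicationId> runningApps) {
    return status.getState().equals(ContainerState.COMPLETE)
        && !runningApps.contains(appId);
  }
}
{code}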

> NM does not need to send finished container whose APP is not running to RM
> --
>
> Key: YARN-2617
> URL: https://issues.apache.org/jira/browse/YARN-2617
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Jun Gong
>Assignee: Jun Gong
> Fix For: 2.6.0
>
> Attachments: YARN-2617.patch
>
>
> We([~chenchun]) are testing RM work preserving restart and found the 
> following logs when we ran a simple MapReduce task "PI". NM continuously 
> reported completed containers whose Application had already finished while AM 
> had finished. 
> {code}
> 2014-09-26 17:00:42,228 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:42,228 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:43,230 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:43,230 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:44,233 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:44,233 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> {code}
> In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to clean 
> up already completed applications. But it only removes the appId from 
> 'app.context.getApplications()' when ApplicationImpl receives the event 
> 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM might 
> not receive this event for a long time, or might never receive it. 
> * For NonAggregatingLogHandler, it waits for 
> YarnConfiguration.NM_LOG_RETAIN_SECONDS (3 * 60 * 60 sec by default) before it 
> is scheduled to delete the application logs and send the event.
> * For LogAggregationService, it might fail (e.g. if the user does not have HDFS 
> write permission), in which case it will not send the event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-29 Thread Remus Rusanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated YARN-2198:
---
Description: 
YARN-1972 introduces a Secure Windows Container Executor. However, this executor 
requires the process launching the container to be LocalSystem or a member of 
the local Administrators group. Since the process in question is the 
NodeManager, this requirement means the entire NM must run as a privileged 
account, a very large surface area to review and protect.

This proposal is to move the privileged operations into a dedicated NT service. 
The NM can run as a low privilege account and communicate with the privileged 
NT service when it needs to launch a container. This would reduce the surface 
exposed to the high privileges. 

There has to exist a secure, authenticated and authorized channel of 
communication between the NM and the privileged NT service. Possible 
alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be 
to use Windows LPC (Local Procedure Calls), which is a Windows platform 
specific inter-process communication channel that satisfies all requirements 
and is easy to deploy. The privileged NT service would register and listen on 
an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with 
libwinutils which would host the LPC client code. The client would connect to 
the LPC port (NtConnectPort) and send a message requesting a container launch 
(NtRequestWaitReplyPort). LPC provides authentication and the privileged NT 
service can use authorization API (AuthZ) to validate the caller.

  was:
YARN-1972 introduces a Secure Windows Container Executor. However this executor 
requires a process launching the container to be LocalSystem or a member of the 
a local Administrators group. Since the process in question is the NodeManager, 
the requirement translates to the entire NM to run as a privileged account, a 
very large surface area to review and protect.

This proposal is to move the privileged operations into a dedicated NT service. 
The NM can run as a low privilege account and communicate with the privileged 
NT service when it needs to launch a container. This would reduce the surface 
exposed to the high privileges. 

There has to exist a secure, authenticated and authorized channel of 
communication between the NM and the privileged NT service. Possible 
alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be 
to use Windows LPC (Local Procedure Calls), which is a Windows platform 
specific inter-process communication channel that satisfies all requirements 
and is easy to deploy. The privileged NT service would register and listen on 
an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with 
libwinutils which would host the LPC client code. The client would connect to 
the LPC port (NtConnectPort) and send a message requesting a container launch 
(NtRequestWaitReplyPort). LPC provides authentication and the privileged NT 
service can use authorization API (AuthZ) to validate the caller.


> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
> YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
> YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
> YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
> YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
> YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However, this 
> executor requires the process launching the container to be LocalSystem or a 
> member of the local Administrators group. Since the process in question is 
> the NodeManager, this requirement means the entire NM must run as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which i

[jira] [Updated] (YARN-2493) [YARN-796] API changes for users

2014-09-29 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2493:
-
Attachment: YARN-2493.patch

Hi [~vinodkv],
Thanks for your careful review; all the comments make sense to me. Attached a 
new patch according to your suggestions.

Wangda

> [YARN-796] API changes for users
> 
>
> Key: YARN-2493
> URL: https://issues.apache.org/jira/browse/YARN-2493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-2493.patch, YARN-2493.patch, YARN-2493.patch, 
> YARN-2493.patch
>
>
> This JIRA includes API changes for users of YARN-796, like changes in 
> {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common 
> part of YARN-796.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2493) [YARN-796] API changes for users

2014-09-29 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2493:
-
Attachment: (was: YARN-2493.patch)

> [YARN-796] API changes for users
> 
>
> Key: YARN-2493
> URL: https://issues.apache.org/jira/browse/YARN-2493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-2493.patch, YARN-2493.patch, YARN-2493.patch, 
> YARN-2493.patch
>
>
> This JIRA includes API changes for users of YARN-796, like changes in 
> {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common 
> part of YARN-796.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2493) [YARN-796] API changes for users

2014-09-29 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2493:
-
Attachment: YARN-2493.patch

> [YARN-796] API changes for users
> 
>
> Key: YARN-2493
> URL: https://issues.apache.org/jira/browse/YARN-2493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-2493.patch, YARN-2493.patch, YARN-2493.patch, 
> YARN-2493.patch
>
>
> This JIRA includes API changes for users of YARN-796, like changes in 
> {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common 
> part of YARN-796.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2493) [YARN-796] API changes for users

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151528#comment-14151528
 ] 

Hadoop QA commented on YARN-2493:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671760/YARN-2493.patch
  against trunk revision b38e52b.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

  {color:red}-1 javac{color}.  The applied patch generated 1281 javac 
compiler warnings (more than the trunk's current 1265 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5169//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5169//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5169//console

This message is automatically generated.

> [YARN-796] API changes for users
> 
>
> Key: YARN-2493
> URL: https://issues.apache.org/jira/browse/YARN-2493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-2493.patch, YARN-2493.patch, YARN-2493.patch, 
> YARN-2493.patch
>
>
> This JIRA includes API changes for users of YARN-796, like changes in 
> {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common 
> part of YARN-796.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated

2014-09-29 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2312:
-
Attachment: YARN-2312.2-2.patch

Let me attach the same patch again.

> Marking ContainerId#getId as deprecated
> ---
>
> Key: YARN-2312
> URL: https://issues.apache.org/jira/browse/YARN-2312
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
> YARN-2312.2-2.patch, YARN-2312.2.patch
>
>
> After YARN-2229, {{ContainerId#getId}} only returns a partial value of the 
> container id: the sequence number without the epoch. We should mark 
> {{ContainerId#getId}} as deprecated and use 
> {{ContainerId#getContainerId}} instead.
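
As a small caller-side illustration of the migration described above (a sketch, not part of the patch):
{code}
import org.apache.hadoop.yarn.api.records.ContainerId;

// Illustrative only: shows the intended caller-side migration.
public final class ContainerIdMigration {
  private ContainerIdMigration() {}

  public static long fullContainerId(ContainerId id) {
    // getContainerId() returns the full 64-bit id (including the epoch),
    // whereas the deprecated getId() returns only the sequence number.
    return id.getContainerId();
  }
}
{code}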



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-29 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151658#comment-14151658
 ] 

Wangda Tan commented on YARN-2494:
--

Hi [~vinodkv] and [~cwelch],
Thanks for the reply! I am still working on your last comments and will upload a 
patch soon.

Regarding the method names of NodeLabelManager, I think the following suggestion 
makes sense to me:
bq. What I really want is to convey is that these are just system recognized 
nodelabels as opposed to node-lables that are actually mapped against a node. 
How about addToNodeLabelsCollection(), removeFromNodeLabelsCollection(), 
addLabelsToNode() and removeLabelsFromNode(). The point about 
addToNodeLabelsCollection() is that it clearly conveys that there is a 
NodeLabelsCollection - a set of node-labels known by the system.

And regarding
bq. Once you have the store abstraction, this will be less of a problem? 
Clearly NodeLabelsManager is not something that the client needs access to?
I think it is still a problem: even if we have the store abstraction, we still need 
some logic to guarantee that the labels being added are valid (e.g. we need to check 
whether a label exists in the collection, and whether a label exists on a node when 
we try to remove labels from that node). That means we would have to put a larger 
chunk of logic into the store abstraction -- it would no longer be a simple store 
abstraction if we did this.
I suggest keeping it in common so that the major node-label logic lives together.

Thanks,
Wangda

> [YARN-796] Node label manager API and storage implementations
> -
>
> Key: YARN-2494
> URL: https://issues.apache.org/jira/browse/YARN-2494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, 
> YARN-2494.patch, YARN-2494.patch, YARN-2494.patch
>
>
> This JIRA includes the APIs and storage implementations of the node label manager.
> NodeLabelManager is an abstract class used to manage labels of nodes in the 
> cluster; it has APIs to query/modify
> - Nodes according to a given label
> - Labels according to a given hostname
> - Add/remove labels
> - Set labels of nodes in the cluster
> - Persist/recover changes of labels/labels-on-nodes to/from storage
> And it has two implementations to store modifications
> - Memory based storage: It will not persist changes, so all labels will be 
> lost when the RM restarts
> - FileSystem based storage: It will persist/recover to/from a FileSystem (like 
> HDFS), and all labels and labels-on-nodes will be recovered upon RM restart
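
For illustration, a rough sketch of the storage split described above; all class and method names here are assumptions, not the committed YARN-2494 API.
{code}
import java.io.IOException;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; names do not match the actual YARN-2494 classes.
public abstract class NodeLabelStoreSketch {
  /** Persist newly added cluster-level labels. */
  public abstract void storeNewClusterLabels(Set<String> labels) throws IOException;

  /** Persist a change of the labels on a node (host -> labels). */
  public abstract void storeNodeToLabels(Map<String, Set<String>> nodeToLabels)
      throws IOException;

  /** Recover labels and node-to-labels mappings after an RM restart. */
  public abstract void recover() throws IOException;
}

/** Memory-only variant: nothing survives an RM restart, as in the first bullet above. */
class MemoryNodeLabelStoreSketch extends NodeLabelStoreSketch {
  @Override public void storeNewClusterLabels(Set<String> labels) {}
  @Override public void storeNodeToLabels(Map<String, Set<String>> nodeToLabels) {}
  @Override public void recover() {}
}
{code}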



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151668#comment-14151668
 ] 

Hadoop QA commented on YARN-2312:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671773/YARN-2312.2-2.patch
  against trunk revision b38e52b.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 16 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.mapred.pipes.TestPipeApplication
  org.apache.hadoop.mapreduce.lib.input.TestMRCJCFileInputFormat
  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5170//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5170//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5170//console

This message is automatically generated.

> Marking ContainerId#getId as deprecated
> ---
>
> Key: YARN-2312
> URL: https://issues.apache.org/jira/browse/YARN-2312
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
> YARN-2312.2-2.patch, YARN-2312.2.patch
>
>
> After YARN-2229, {{ContainerId#getId}} only returns a partial value of the 
> container id: the sequence number without the epoch. We should mark 
> {{ContainerId#getId}} as deprecated and use 
> {{ContainerId#getContainerId}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2617) NM does not need to send finished container whose APP is not running to RM

2014-09-29 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong updated YARN-2617:
---
Attachment: YARN-2617.2.patch

> NM does not need to send finished container whose APP is not running to RM
> --
>
> Key: YARN-2617
> URL: https://issues.apache.org/jira/browse/YARN-2617
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Jun Gong
>Assignee: Jun Gong
> Fix For: 2.6.0
>
> Attachments: YARN-2617.2.patch, YARN-2617.patch
>
>
> We([~chenchun]) are testing RM work preserving restart and found the 
> following logs when we ran a simple MapReduce task "PI". NM continuously 
> reported completed containers whose Application had already finished while AM 
> had finished. 
> {code}
> 2014-09-26 17:00:42,228 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:42,228 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:43,230 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:43,230 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:44,233 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:44,233 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> {code}
> In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to clean 
> up already completed applications. But it only removes the appId from 
> 'app.context.getApplications()' when ApplicationImpl receives the event 
> 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM might 
> not receive this event for a long time, or might never receive it. 
> * For NonAggregatingLogHandler, it waits for 
> YarnConfiguration.NM_LOG_RETAIN_SECONDS (3 * 60 * 60 sec by default) before it 
> is scheduled to delete the application logs and send the event.
> * For LogAggregationService, it might fail (e.g. if the user does not have HDFS 
> write permission), in which case it will not send the event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151706#comment-14151706
 ] 

Jason Lowe commented on YARN-1769:
--

+1 lgtm.  Committing this.

> CapacityScheduler:  Improve reservations
> 
>
> Key: YARN-1769
> URL: https://issues.apache.org/jira/browse/YARN-1769
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch
>
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers when there might not currently be enough space available 
> on a single host.
> The current algorithm for reservations is to reserve as many containers as 
> currently required, and then it will start to reserve more above that after a 
> certain number of re-reservations (currently biased against larger 
> containers).  Any time it hits the limit on the number reserved, it stops looking 
> at any other nodes. This results in potentially missing nodes that have 
> enough space to fulfill the request.   
> The other place for improvement is that reservations currently count against your 
> queue capacity.  If you have reservations you could hit the various limits, 
> which would then stop you from looking further at that node.  
> The above 2 cases can cause an application requesting a larger container to 
> take a long time to get its resources.  
> We could improve upon both of these by simply continuing to look at incoming 
> nodes to see if we could potentially swap out a reservation for an actual 
> allocation. 
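
A tiny, purely illustrative decision helper for the idea above (the names and the simplification are assumptions; the real CapacityScheduler logic is considerably more involved):
{code}
// Illustrative only: sketches the "swap a reservation for a real allocation" idea.
public final class ReservationSwapSketch {
  private ReservationSwapSketch() {}

  /**
   * Keep looking at incoming nodes: if a node can actually fit the request and
   * the app already holds a reservation elsewhere, allocate here and release
   * that reservation instead of giving up once the reservation limit is hit.
   */
  public static boolean shouldSwapReservationForAllocation(
      long nodeAvailableMB, long requestMB, boolean appHasReservation) {
    return appHasReservation && nodeAvailableMB >= requestMB;
  }
}
{code}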



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151726#comment-14151726
 ] 

Hudson commented on YARN-1769:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6135 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6135/])
YARN-1769. CapacityScheduler: Improve reservations. Contributed by Thomas 
Graves (jlowe: rev 9c22065109a77681bc2534063eabe8692fbcb3cd)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestChildQueueOrder.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java
* hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestParentQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java


> CapacityScheduler:  Improve reservations
> 
>
> Key: YARN-1769
> URL: https://issues.apache.org/jira/browse/YARN-1769
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 2.6.0
>
> Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch
>
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers when there might not currently be enough space available 
> on a single host.
> The current algorithm for reservations is to reserve as many containers as 
> currently required, and then it will start to reserve more above that after a 
> certain number of re-reservations (currently biased against larger 
> containers).  Any time it hits the limit on the number reserved, it stops looking 
> at any other nodes. This results in potentially missing nodes that have 
> enough space to fulfill the request.   
> The other place for improvement is that reservations currently count against your 
> queue capacity.  If you have reservations you could hit the various limits, 
> which would then stop you from looking further at that node.  
> The above 2 cases can cause an application requesting a larger container to 
> take a long time to get its resources.  
> We could improve upon both of these by simply continuing to look at incoming 
> nodes to see if we could potentially swap out a reservation for an actual 
> allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades

2014-09-29 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151796#comment-14151796
 ] 

Junping Du commented on YARN-2613:
--

Thanks [~jianhe] for the patch. I am reviewing it, and some initial comments are 
below. More comments may come later.
{code}
-  public static final int DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS =
+  public static final long DEFAULT_RESOURCEMANAGER_CONNECT_MAX_WAIT_MS =
   15 * 60 * 1000;
+  public static final int DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS =
+  15 * 60 * 1000;
+  public static final long DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS
+  = 10 * 1000;
{code}
I think it is better to be consistent and use either int or long for time intervals 
and waits. IMO, int should be fine, as it supports up to 2^31 milliseconds 
(about 25 days).

{code}
-//TO DO: after HADOOP-9576,  IOException can be changed to EOFException
-exceptionToPolicyMap.put(IOException.class, retryPolicy);
{code}
Do we have a plan to get HADOOP-9576 in? If yes, shall we keep the TODO comment 
here?
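
For example, keeping both defaults as long milliseconds could look like the sketch below; the constant names mirror the quoted diff, but the code is illustrative, not the committed change.
{code}
// Sketch only: shows the int/long consistency point, not the actual YARN-2613 code.
public final class NmClientRetryDefaults {
  private NmClientRetryDefaults() {}

  // Both durations use long milliseconds, so callers never mix int and long.
  public static final long DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS = 15 * 60 * 1000L;
  public static final long DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS = 10 * 1000L;
}
{code}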

> NMClient doesn't have retries for supporting rolling-upgrades
> -
>
> Key: YARN-2613
> URL: https://issues.apache.org/jira/browse/YARN-2613
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2613.1.patch, YARN-2613.2.patch
>
>
> While the NM goes through a rolling upgrade, the client should retry the NM until 
> it comes back up. This jira is to add an NMProxy (similar to RMProxy) with a retry 
> implementation to support rolling upgrade.
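
A minimal sketch of such a retry policy using the Hadoop retry utilities (an assumption about the approach, not the actual NMProxy code):
{code}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

// Illustrative sketch: derive a fixed-sleep retry policy from a max wait
// and a retry interval, both in milliseconds.
public final class NmRetryPolicySketch {
  private NmRetryPolicySketch() {}

  public static RetryPolicy clientRetryPolicy(long maxWaitMs, long retryIntervalMs) {
    int maxRetries = (int) (maxWaitMs / retryIntervalMs);
    return RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        maxRetries, retryIntervalMs, TimeUnit.MILLISECONDS);
  }
}
{code}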



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YARN-2606:

Attachment: YARN-2606.patch

Refining the patch to remove the unwanted serviceInit(), as all the work is done 
in serviceStart().
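
A hedged sketch of the ordering being described (the class name and the configuration key names below are assumptions for illustration): perform the keytab login before any HDFS access in the service lifecycle.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.service.AbstractService;

// Illustrative sketch only; the real YARN-2606 change is in the history store service.
public class SecureStartOrderSketch extends AbstractService {
  public SecureStartOrderSketch() {
    super(SecureStartOrderSketch.class.getName());
  }

  @Override
  protected void serviceStart() throws Exception {
    Configuration conf = getConfig();
    // Assumed config key names: do the Kerberos login first...
    SecurityUtil.login(conf, "yarn.timeline-service.keytab",
        "yarn.timeline-service.principal");
    // ...and only afterwards touch HDFS (create directories, open files, etc.).
    super.serviceStart();
  }
}
{code}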

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YARN-2606:

Attachment: YARN-2606.patch

Yet some more refining. Attached updated patch.

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
> YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151851#comment-14151851
 ] 

Hadoop QA commented on YARN-2606:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671803/YARN-2606.patch
  against trunk revision 4666440.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5171//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5171//console

This message is automatically generated.

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
> YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades

2014-09-29 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151914#comment-14151914
 ] 

Jian He commented on YARN-2613:
---

bq. I think it is better to keep consistent to use int or long
Good catch; I changed the one for RMProxy but missed this one.
bq. Do we have plan to get HADOOP-9576 in? If yes, shall we keep the todo 
comments here?
I forgot my initial intent in adding this comment. Since I now follow 
FailoverOnNetworkExceptionRetry for the exception-retry policy, I think maybe 
we don't need to do this for now.

> NMClient doesn't have retries for supporting rolling-upgrades
> -
>
> Key: YARN-2613
> URL: https://issues.apache.org/jira/browse/YARN-2613
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2613.1.patch, YARN-2613.2.patch
>
>
> While the NM goes through a rolling upgrade, the client should retry the NM until 
> it comes back up. This jira is to add an NMProxy (similar to RMProxy) with a retry 
> implementation to support rolling upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-29 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151977#comment-14151977
 ] 

Karthik Kambatla commented on YARN-2179:


[~vinodkv] - do you have any further comments on this? 

> Initial cache manager structure and context
> ---
>
> Key: YARN-2179
> URL: https://issues.apache.org/jira/browse/YARN-2179
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
> YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, 
> YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, 
> YARN-2179-trunk-v9.patch
>
>
> Implement the initial shared cache manager structure and context. The 
> SCMContext will be used by a number of manager services (i.e. the backing 
> store and the cleaner service). The AppChecker is used to gather the 
> currently running applications on SCM startup (necessary for an SCM that is 
> backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2618) Add API support for disk I/O resources

2014-09-29 Thread Wei Yan (JIRA)
Wei Yan created YARN-2618:
-

 Summary: Add API support for disk I/O resources
 Key: YARN-2618
 URL: https://issues.apache.org/jira/browse/YARN-2618
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wei Yan
Assignee: Wei Yan


Subtask of YARN-2139. Add API support for introducing disk I/O as a third 
resource type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2619) NodeManager: Add cgroups support for disk I/O isolation

2014-09-29 Thread Wei Yan (JIRA)
Wei Yan created YARN-2619:
-

 Summary: NodeManager: Add cgroups support for disk I/O isolation
 Key: YARN-2619
 URL: https://issues.apache.org/jira/browse/YARN-2619
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wei Yan
Assignee: Wei Yan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2620) FairScheduler: Add disk I/O resource to the DRF implementation

2014-09-29 Thread Wei Yan (JIRA)
Wei Yan created YARN-2620:
-

 Summary: FairScheduler: Add disk I/O resource to the DRF 
implementation
 Key: YARN-2620
 URL: https://issues.apache.org/jira/browse/YARN-2620
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wei Yan
Assignee: Wei Yan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2610) Hamlet should close table tags

2014-09-29 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2610:
---
Summary: Hamlet should close table tags  (was: Hamlet doesn't close table 
tags)

> Hamlet should close table tags
> --
>
> Key: YARN-2610
> URL: https://issues.apache.org/jira/browse/YARN-2610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: supportability
> Attachments: YARN-2610-01.patch, YARN-2610-02.patch
>
>
> Revisiting a subset of MAPREDUCE-2993.
> The table-related tags are not configured to close 
> properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
> table tags tend to wreak havoc with a lot of HTML processors (although not 
> usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-29 Thread Remus Rusanu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152042#comment-14152042
 ] 

Remus Rusanu commented on YARN-2198:


The last QA -1 is for delta.10.patch, which is not a trunk diff.

> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
> YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
> YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
> YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
> YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
> YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However, this 
> executor requires the process launching the container to be LocalSystem or a 
> member of the local Administrators group. Since the process in question is 
> the NodeManager, this requirement means the entire NM must run as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152075#comment-14152075
 ] 

Hadoop QA commented on YARN-2606:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671811/YARN-2606.patch
  against trunk revision b3d5d26.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5172//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5172//console

This message is automatically generated.

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
> YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-29 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152078#comment-14152078
 ] 

Craig Welch commented on YARN-2494:
---

Not to dither about names, but "Collection" is still not terribly clear to me 
(overly generic). I was previously thinking about "Cluster" as the 
differentiator, so:

addToClusterNodeLabels(), removeFromClusterNodeLabels(), addLabelsToNode() and 
removeLabelsFromNode(). 

I think this conveys the different notions of what the operations are applying 
to in a pretty clear way.  Thoughts?
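
For reference, the proposed names as an interface sketch (illustrative only; the signatures are assumptions and not necessarily what gets committed):
{code}
import java.util.Map;
import java.util.Set;

// Sketch of the naming proposal above; signatures are illustrative assumptions.
public interface ClusterNodeLabelOps {
  /** Operate on the cluster-wide collection of known labels. */
  void addToClusterNodeLabels(Set<String> labels);
  void removeFromClusterNodeLabels(Set<String> labels);

  /** Operate on the labels mapped to individual nodes (host -> labels). */
  void addLabelsToNode(Map<String, Set<String>> nodeToLabels);
  void removeLabelsFromNode(Map<String, Set<String>> nodeToLabels);
}
{code}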

> [YARN-796] Node label manager API and storage implementations
> -
>
> Key: YARN-2494
> URL: https://issues.apache.org/jira/browse/YARN-2494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, 
> YARN-2494.patch, YARN-2494.patch, YARN-2494.patch
>
>
> This JIRA includes the APIs and storage implementations of the node label manager.
> NodeLabelManager is an abstract class used to manage labels of nodes in the 
> cluster; it has APIs to query/modify
> - Nodes according to a given label
> - Labels according to a given hostname
> - Add/remove labels
> - Set labels of nodes in the cluster
> - Persist/recover changes of labels/labels-on-nodes to/from storage
> And it has two implementations to store modifications
> - Memory based storage: It will not persist changes, so all labels will be 
> lost when the RM restarts
> - FileSystem based storage: It will persist/recover to/from a FileSystem (like 
> HDFS), and all labels and labels-on-nodes will be recovered upon RM restart



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152077#comment-14152077
 ] 

Jonathan Eagles commented on YARN-2606:
---

+1. Will commit at the end of the day in case anyone else has comments.

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
> YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-29 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152084#comment-14152084
 ] 

Vinod Kumar Vavilapalli commented on YARN-2179:
---

Looking now..

> Initial cache manager structure and context
> ---
>
> Key: YARN-2179
> URL: https://issues.apache.org/jira/browse/YARN-2179
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
> YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, 
> YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, 
> YARN-2179-trunk-v9.patch
>
>
> Implement the initial shared cache manager structure and context. The 
> SCMContext will be used by a number of manager services (i.e. the backing 
> store and the cleaner service). The AppChecker is used to gather the 
> currently running applications on SCM startup (necessary for an SCM that is 
> backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-29 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152103#comment-14152103
 ] 

Vinod Kumar Vavilapalli commented on YARN-2179:
---

Looks so much better now. One minor suggestion - in the test, instead of 
overriding all of YarnClient, you could simply mock it to override behaviour of 
only those methods that you are interested in.

+1 otherwise.
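
A small sketch of that suggestion using Mockito (assuming Mockito is available in the test scope); only the method of interest is stubbed, and the helper name is illustrative.
{code}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

// Illustrative test helper, not part of the YARN-2179 patch.
public class MockYarnClientSketch {
  static YarnClient noRunningAppsClient() throws Exception {
    YarnClient client = mock(YarnClient.class);
    // Stub only what the test needs; all other methods keep Mockito defaults.
    when(client.getApplications())
        .thenReturn(Collections.<ApplicationReport>emptyList());
    return client;
  }
}
{code}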

> Initial cache manager structure and context
> ---
>
> Key: YARN-2179
> URL: https://issues.apache.org/jira/browse/YARN-2179
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
> YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, 
> YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, 
> YARN-2179-trunk-v9.patch
>
>
> Implement the initial shared cache manager structure and context. The 
> SCMContext will be used by a number of manager services (i.e. the backing 
> store and the cleaner service). The AppChecker is used to gather the 
> currently running applications on SCM startup (necessary for an SCM that is 
> backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2014-09-29 Thread Jian Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152137#comment-14152137
 ] 

Jian Fang commented on YARN-1680:
-

Hi, any update on the fix? We saw quite a few jobs fail due to this issue.

> availableResources sent to applicationMaster in heartbeat should exclude 
> blacklistedNodes free memory.
> --
>
> Key: YARN-1680
> URL: https://issues.apache.org/jira/browse/YARN-1680
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0, 2.3.0
> Environment: SuSE 11 SP2 + Hadoop-2.3 
>Reporter: Rohith
>Assignee: Chen He
> Attachments: YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch
>
>
> There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster 
> slow start is set to 1.
> A job is running whose reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) 
> became unstable (3 maps got killed), so MRAppMaster blacklisted the unstable 
> NodeManager (NM-4). All reducer tasks are running in the cluster now.
> MRAppMaster does not preempt the reducers because, for the reducer preemption 
> calculation, headRoom still includes the blacklisted node's memory. This makes 
> jobs hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes but returns an availableResource that counts the cluster's free 
> memory). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2056) Disable preemption at Queue level

2014-09-29 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152153#comment-14152153
 ] 

Eric Payne commented on YARN-2056:
--

[~leftnoteasy]. Thanks again for helping to review this patch. Have you had a 
chance to look over the updated changes?

> Disable preemption at Queue level
> -
>
> Key: YARN-2056
> URL: https://issues.apache.org/jira/browse/YARN-2056
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Mayank Bansal
>Assignee: Eric Payne
> Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, 
> YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, 
> YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, 
> YARN-2056.201409232329.txt, YARN-2056.201409242210.txt
>
>
> We need to be able to disable preemption at individual queue level



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2014-09-29 Thread Chen He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152154#comment-14152154
 ] 

Chen He commented on YARN-1680:
---

Thank you for reminding me, [~john.jian.fang]. I will post the updated patch 
before the end of tomorrow.

> availableResources sent to applicationMaster in heartbeat should exclude 
> blacklistedNodes free memory.
> --
>
> Key: YARN-1680
> URL: https://issues.apache.org/jira/browse/YARN-1680
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0, 2.3.0
> Environment: SuSE 11 SP2 + Hadoop-2.3 
>Reporter: Rohith
>Assignee: Chen He
> Attachments: YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch
>
>
> There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster 
> slow start is set to 1.
> A job is running whose reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) 
> became unstable (3 maps got killed), so MRAppMaster blacklisted the unstable 
> NodeManager (NM-4). All reducer tasks are running in the cluster now.
> MRAppMaster does not preempt the reducers because, for the reducer preemption 
> calculation, headRoom still includes the blacklisted node's memory. This makes 
> jobs hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes but returns an availableResource that counts the cluster's free 
> memory). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2014-09-29 Thread Jian Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152156#comment-14152156
 ] 

Jian Fang commented on YARN-1680:
-

Thanks. Looking forward to your patch.

> availableResources sent to applicationMaster in heartbeat should exclude 
> blacklistedNodes free memory.
> --
>
> Key: YARN-1680
> URL: https://issues.apache.org/jira/browse/YARN-1680
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.2.0, 2.3.0
> Environment: SuSE 11 SP2 + Hadoop-2.3 
>Reporter: Rohith
>Assignee: Chen He
> Attachments: YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch
>
>
> There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster 
> slow start is set to 1.
> A job is running whose reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) 
> became unstable (3 maps got killed), so MRAppMaster blacklisted the unstable 
> NodeManager (NM-4). All reducer tasks are running in the cluster now.
> MRAppMaster does not preempt the reducers because, for the reducer preemption 
> calculation, headRoom still includes the blacklisted node's memory. This makes 
> jobs hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes but returns an availableResource that counts the cluster's free 
> memory). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-09-29 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152160#comment-14152160
 ] 

Mit Desai commented on YARN-2610:
-

Why is the change specific to some tags and not the others?

> Hamlet should close table tags
> --
>
> Key: YARN-2610
> URL: https://issues.apache.org/jira/browse/YARN-2610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: supportability
> Attachments: YARN-2610-01.patch, YARN-2610-02.patch
>
>
> Revisiting a subset of MAPREDUCE-2993.
> The table-related tags are not configured to close 
> properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
> table tags tend to wreak havoc with a lot of HTML processors (although not 
> usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-09-29 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152176#comment-14152176
 ] 

Ray Chiang commented on YARN-2610:
--

I would have been fine with changing all the tags to close cleanly, except for 
the feedback from MAPREDUCE-2993. So, I limited these changes to just the 
table rendering ones, which tend to cause the most problems anyhow.

Or is there some table related tag that I missed?

> Hamlet should close table tags
> --
>
> Key: YARN-2610
> URL: https://issues.apache.org/jira/browse/YARN-2610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: supportability
> Attachments: YARN-2610-01.patch, YARN-2610-02.patch
>
>
> Revisiting a subset of MAPREDUCE-2993.
> The table-related tags are not configured to close 
> properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
> table tags tend to wreak havoc with a lot of HTML processors (although not 
> usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-09-29 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152206#comment-14152206
 ] 

Karthik Kambatla commented on YARN-2610:


I just ran all YARN tests with the latest patch to be safe. None of the test 
failures are related.

+1. I'll commit this later today if no one objects. 

> Hamlet should close table tags
> --
>
> Key: YARN-2610
> URL: https://issues.apache.org/jira/browse/YARN-2610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: supportability
> Attachments: YARN-2610-01.patch, YARN-2610-02.patch
>
>
> Revisiting a subset of MAPREDUCE-2993.
> The table-related tags are not configured to close 
> properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
> table tags tend to wreak havoc with a lot of HTML processors (although not 
> usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet should close table tags

2014-09-29 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152219#comment-14152219
 ] 

Mit Desai commented on YARN-2610:
-

[~rchiang], I did not see the comments on MAPREDUCE-2993 before. I just 
wanted to know the reason behind leaving some tags open.
The patch looks good to me.
+1 (non-binding)

> Hamlet should close table tags
> --
>
> Key: YARN-2610
> URL: https://issues.apache.org/jira/browse/YARN-2610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: supportability
> Attachments: YARN-2610-01.patch, YARN-2610-02.patch
>
>
> Revisiting a subset of MAPREDUCE-2993.
> The table-related tags are not configured to close 
> properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
> table tags tend to wreak havoc with a lot of HTML processors (although not 
> usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-29 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152270#comment-14152270
 ] 

Vinod Kumar Vavilapalli commented on YARN-2494:
---

bq. I think it still has problem: Even if we have store abstraction, we still 
need some logic to guarantee labels being added are valid (e.g. we need check 
if a label existed in collection, and label existed in node when we trying to 
remove some labels from a node).
Then that validation code needs to get pulled out into a common layer. My goal is 
not to put the entire NodelabelsManager in yarn-common - it just doesn't belong 
there.

bq. How about addToNodeLabelsCollection(), removeFromNodeLabelsCollection(), 
addLabelsToNode() and removeLabelsFromNode()
bq. addToClusterNodeLabels(), removeFromClusterNodeLabels(), addLabelsToNode() 
and removeLabelsFromNode(). 
[~leftnoteasy], [~cwelch], I'm okay with either of the above. Or should we call 
it {{ClusterNodeLabelsCollection}}? :)
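
Just to make the naming discussion concrete, roughly the shape either option gives us - a sketch only, with placeholder signatures, not the actual patch:

{code}
import java.io.IOException;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.NodeId;

// hypothetical sketch of the API surface under discussion
public interface ClusterNodeLabelsCollection {
  void addToClusterNodeLabels(Set<String> labels) throws IOException;
  void removeFromClusterNodeLabels(Set<String> labels) throws IOException;
  void addLabelsToNode(Map<NodeId, Set<String>> labelsToNodes) throws IOException;
  void removeLabelsFromNode(Map<NodeId, Set<String>> labelsToNodes) throws IOException;
}
{code}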

> [YARN-796] Node label manager API and storage implementations
> -
>
> Key: YARN-2494
> URL: https://issues.apache.org/jira/browse/YARN-2494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, 
> YARN-2494.patch, YARN-2494.patch, YARN-2494.patch
>
>
> This JIRA includes the APIs and storage implementations of the node label manager.
> NodeLabelManager is an abstract class used to manage labels of nodes in the 
> cluster. It has APIs to query/modify
> - Nodes according to a given label
> - Labels according to a given hostname
> - Add/remove labels
> - Set labels of nodes in the cluster
> - Persist/recover changes of labels/labels-on-nodes to/from storage
> And it has two implementations to store modifications:
> - Memory based storage: It will not persist changes, so all labels will be 
> lost when the RM restarts
> - FileSystem based storage: It will persist/recover to/from a FileSystem (like 
> HDFS), and all labels and labels-on-nodes will be recovered upon RM restart



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-09-29 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152301#comment-14152301
 ] 

Anubhav Dhoot commented on YARN-1879:
-

The patch needs to be updated 

> Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
> ---
>
> Key: YARN-1879
> URL: https://issues.apache.org/jira/browse/YARN-1879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Tsuyoshi OZAWA
>Priority: Critical
> Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
> YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
> YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.2-wip.patch, 
> YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
> YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152307#comment-14152307
 ] 

Jason Lowe commented on YARN-90:


Thanks for updating the patch, Varun.

bq. I've changed it to "Disk(s) health report: ". My only concern with this is 
that there might be scripts looking for the "Disk(s) failed" log line for 
monitoring. What do you think?

If that's true then the code should bother to do a diff between the old disk 
list and the new one, logging which disks turned bad using the "Disk(s) failed" 
line and which disks became healthy with some other log message.
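
Roughly like this - a sketch with made-up names (previousFailed, currentFailed, LOG), just to show the intent:

{code}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// previousFailed/currentFailed would be the failed-dir lists before and after
// the health check; LOG is the existing class logger
void logDiskChanges(List<String> previousFailed, List<String> currentFailed) {
  Set<String> newlyFailed = new HashSet<String>(currentFailed);
  newlyFailed.removeAll(previousFailed);
  Set<String> turnedGood = new HashSet<String>(previousFailed);
  turnedGood.removeAll(currentFailed);
  if (!newlyFailed.isEmpty()) {
    LOG.warn("Disk(s) failed: " + newlyFailed);   // keeps the existing line for monitoring
  }
  if (!turnedGood.isEmpty()) {
    LOG.info("Disk(s) turned good: " + turnedGood);
  }
}
{code}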

bq. Directories are only cleaned up during startup. The code tests for 
existence of the directories and the correct permissions. This does mean that 
container directories left behind for any reason won't get cleaned up until the 
NodeManager is restarted. Is that ok?

This could still be problematic for the NM work-preserving restart case, as we 
could try to delete an entire disk tree with active containers on it due to a 
hiccup when the NM restarts.  I think a better approach is a periodic cleanup 
scan that looks for directories under yarn-local and yarn-logs that shouldn't 
be there.  This could be part of the health check scan or done separately.  
That way we don't have to wait for a disk to turn good or bad to catch leaked 
entities on the disk due to some hiccup.  Sorta like an fsck for the NM state 
on disk.  That is best done as a separate JIRA, as I think this functionality 
is still an incremental improvement without it.

Other comments:

checkDirs unnecessarily calls union(errorDirs, fullDirs) twice.

isDiskFreeSpaceOverLimt is now named backwards, as the code returns true if the 
free space is under the limit.

getLocalDirsForCleanup and getLogDirsForCleanup should have javadoc comments 
like the other methods.

Nit: The union utility function doesn't technically perform a union but rather 
a concatenation, and it'd be a little clearer if the name reflected that.  Also 
the function should leverage the fact that it knows how big the ArrayList will 
be after the operations and give it the appropriate hint to its constructor to 
avoid reallocations.
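
i.e. something along these lines (name and generics just illustrative):

{code}
import java.util.ArrayList;
import java.util.List;

// concatenates rather than unions, and pre-sizes the list to avoid reallocations
static <T> List<T> concat(List<T> first, List<T> second) {
  List<T> result = new ArrayList<T>(first.size() + second.size());
  result.addAll(first);
  result.addAll(second);
  return result;
}
{code}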


> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, 
> apache-yarn-90.5.patch, apache-yarn-90.6.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2446) Using TimelineNamespace to shield the entities of a user

2014-09-29 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152313#comment-14152313
 ] 

Zhijie Shen commented on YARN-2446:
---

bq. Get domains API: "If callerUGI is not the owner or the admin of the domain, 
we need to hide the details from him, and only allow him to see the ID": Why is 
that, I think we should just not allow non-owners to see anything. Is there a 
user-case for this?

bq. Based on the above decision, 
TestTimelineWebServices.testGetDomainsYarnACLsEnabled() should be changed to 
either validate that only IDs are visible or nothing is visible.

The rationale before was to let users check whether the namespace Id is 
occupied or not before putting one. Talked to Vinod offline; since it cannot 
solve the race condition of multiple put requests anyway, let's simplify the 
behavior as suggested above. It's not related to the code in this patch. Let me 
file a separate Jira for it.

bq. Shouldn't the server completely own DEFAULT_DOMAIN_ID, instead of letting 
anyone create it with potentially arbitrary permission?

Yes, DEFAULT_DOMAIN_ID is owned by the timeline server. When 
TimelineDataManager is constructed, if the default domain has not been created 
before, the timeline server is going to create one. Users cannot create or 
modify the domain with DEFAULT_DOMAIN_ID.

bq. testGetEntitiesWithYarnACLsEnabled()

The test cases seem to be problematic. I've updated these test cases and added 
the validation of cross-domain entity relationships.

One more issue I've noticed is that, after this patch, we should make the RM put 
the application metrics into a secured domain instead of the default one. Will file 
a Jira for it as well.

> Using TimelineNamespace to shield the entities of a user
> 
>
> Key: YARN-2446
> URL: https://issues.apache.org/jira/browse/YARN-2446
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-2446.1.patch, YARN-2446.2.patch, YARN-2446.3.patch
>
>
> Given YARN-2102 adds TimelineNamespace, we can make use of it to shield the 
> entities, preventing them from being accessed or affected by other users' 
> operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2446) Using TimelineNamespace to shield the entities of a user

2014-09-29 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2446:
--
Attachment: YARN-2446.3.patch

> Using TimelineNamespace to shield the entities of a user
> 
>
> Key: YARN-2446
> URL: https://issues.apache.org/jira/browse/YARN-2446
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-2446.1.patch, YARN-2446.2.patch, YARN-2446.3.patch
>
>
> Given YARN-2102 adds TimelineNamespace, we can make use of it to shield the 
> entities, preventing them from being accessed or affected by other users' 
> operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2621) Simplify the output when the user doesn't have the access for getDomain(s)

2014-09-29 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2621:
-

 Summary: Simplify the output when the user doesn't have the access 
for getDomain(s) 
 Key: YARN-2621
 URL: https://issues.apache.org/jira/browse/YARN-2621
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: 2.6.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen


Per discussion in 
[YARN-2446|https://issues.apache.org/jira/browse/YARN-2446?focusedCommentId=14151272&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14151272],
 we should simply reject the user if it doesn't have access to the domain(s), 
instead of returning the entity without detailed information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2622) RM should put the application related timeline data into a secured domain

2014-09-29 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2622:
--
Component/s: timelineserver

> RM should put the application related timeline data into a secured domain
> -
>
> Key: YARN-2622
> URL: https://issues.apache.org/jira/browse/YARN-2622
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>
> After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the 
> application-related timeline data is put into the default domain. It is not 
> secured. We should let the RM choose a secured domain to put the system 
> metrics in.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2622) RM should put the application related timeline data into a secured domain

2014-09-29 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2622:
-

 Summary: RM should put the application related timeline data into 
a secured domain
 Key: YARN-2622
 URL: https://issues.apache.org/jira/browse/YARN-2622
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen


After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the 
application-related timeline data is put into the default domain. It is not 
secured. We should let the RM choose a secured domain to put the system metrics in.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2622) RM should put the application related timeline data into a secured domain

2014-09-29 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2622:
--
Affects Version/s: 2.6.0

> RM should put the application related timeline data into a secured domain
> -
>
> Key: YARN-2622
> URL: https://issues.apache.org/jira/browse/YARN-2622
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>
> After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the 
> application-related timeline data is put into the default domain. It is not 
> secured. We should let the RM choose a secured domain to put the system 
> metrics in.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2622) RM should put the application related timeline data into a secured domain

2014-09-29 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2622:
--
Target Version/s: 2.6.0

> RM should put the application related timeline data into a secured domain
> -
>
> Key: YARN-2622
> URL: https://issues.apache.org/jira/browse/YARN-2622
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>
> After YARN-2446, SystemMetricsPublisher doesn't specify any domain, and the 
> application-related timeline data is put into the default domain. It is not 
> secured. We should let the RM choose a secured domain to put the system 
> metrics in.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152348#comment-14152348
 ] 

Jonathan Eagles commented on YARN-2606:
---

Committed to trunk and branch-2

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Fix For: 2.6.0
>
> Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
> YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152352#comment-14152352
 ] 

Hudson commented on YARN-2606:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #6146 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6146/])
YARN-2606. Application History Server tries to access hdfs before doing secure 
login (Mit Desai via jeagles) (jeagles: rev 
e10eeaabce2a21840cfd5899493c9d2d4fe2e322)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestFileSystemApplicationHistoryStore.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java


> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Fix For: 2.6.0
>
> Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, 
> YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-29 Thread Chris Trezzo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152355#comment-14152355
 ] 

Chris Trezzo commented on YARN-2179:


[~vinodkv] Mocking YarnClient seems to be tricky due to it being an 
AbstractService. Would extending YarnClientImpl and only overriding methods I 
need to stub be a more reasonable approach? For this approach I would need to 
make the serviceStart and serviceStop methods in YarnClientImpl publicly 
visible for testing. It is still a little tricky due to the serviceStart and 
serviceStop methods of YarnClientImpl using ClientRMProxy. That is originally 
why I decided to just create a different dummy YarnClient implementation. Any 
thoughts on these alternative approaches, or am I just missing an easy way to 
mock YarnClient (which is highly possible)?
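
For reference, the second alternative would look roughly like this - a sketch only, with getApplications() standing in for whichever methods actually need stubbing:

{code}
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.impl.YarnClientImpl;

// hypothetical test-only subclass: skip the real RM proxy setup and stub
// just the calls the SCM context needs
class StubYarnClient extends YarnClientImpl {
  private final List<ApplicationReport> reports;

  StubYarnClient(List<ApplicationReport> reports) {
    this.reports = reports;
  }

  @Override
  protected void serviceStart() {
    // no ClientRMProxy - nothing to start
  }

  @Override
  protected void serviceStop() {
    // nothing to stop
  }

  @Override
  public List<ApplicationReport> getApplications() {
    return reports;
  }
}
{code}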

> Initial cache manager structure and context
> ---
>
> Key: YARN-2179
> URL: https://issues.apache.org/jira/browse/YARN-2179
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
> YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, 
> YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, 
> YARN-2179-trunk-v9.patch
>
>
> Implement the initial shared cache manager structure and context. The 
> SCMContext will be used by a number of manager services (i.e. the backing 
> store and the cleaner service). The AppChecker is used to gather the 
> currently running applications on SCM startup (necessary for an scm that is 
> backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2446) Using TimelineNamespace to shield the entities of a user

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152362#comment-14152362
 ] 

Hadoop QA commented on YARN-2446:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671870/YARN-2446.3.patch
  against trunk revision 7f0efe9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5173//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5173//console

This message is automatically generated.

> Using TimelineNamespace to shield the entities of a user
> 
>
> Key: YARN-2446
> URL: https://issues.apache.org/jira/browse/YARN-2446
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-2446.1.patch, YARN-2446.2.patch, YARN-2446.3.patch
>
>
> Given YARN-2102 adds TimelineNamespace, we can make use of it to shield the 
> entities, preventing them from being accessed or affected by other users' 
> operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.

2014-09-29 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152360#comment-14152360
 ] 

Karthik Kambatla commented on YARN-2566:


We should probably have the same mechanism of picking directories in both the 
default and linux container-executors. It appears LCE picks these at random. 
Can we do the same here? I understand picking directories at random might 
result in a skew due to not-so-random randomness or different applications 
localizing different sizes of data. 

Maybe, in the future, we could pick the directory with the most available space? 
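
For illustration, both strategies are only a few lines - a sketch only, with placeholder names:

{code}
import java.io.File;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

class LocalDirPicker {
  private final Random random = new Random();

  // (a) pick a local dir at random, similar to what LCE effectively does
  String pickRandom(List<String> localDirs) {
    return localDirs.get(random.nextInt(localDirs.size()));
  }

  // (b) pick the local dir with the most usable space
  String pickMostFree(List<String> localDirs) {
    return Collections.max(localDirs, new Comparator<String>() {
      @Override
      public int compare(String a, String b) {
        long da = new File(a).getUsableSpace();
        long db = new File(b).getUsableSpace();
        return da < db ? -1 : (da > db ? 1 : 0);
      }
    });
  }
}
{code}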

> IOException happen in startLocalizer of DefaultContainerExecutor due to not 
> enough disk space for the first localDir.
> -
>
> Key: YARN-2566
> URL: https://issues.apache.org/jira/browse/YARN-2566
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2566.000.patch, YARN-2566.001.patch
>
>
> startLocalizer in DefaultContainerExecutor will only use the first localDir 
> to copy the token file. If the copy fails for the first localDir due to not 
> enough disk space in the first localDir, the localization will fail even 
> though there is plenty of disk space in other localDirs. We see the following 
> error for this case:
> {code}
> 2014-09-13 23:33:25,171 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to 
> create app directory 
> /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
> java.io.IOException: mkdir of 
> /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
>   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
>   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,185 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Localizer failed
> java.io.FileNotFoundException: File 
> file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 
> does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
>   at 
> org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.(ChecksumFs.java:344)
>   at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
>   at 
> org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
>   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
>   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
>   at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
>   at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,186 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container

[jira] [Updated] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization

2014-09-29 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YARN-2387:

Attachment: YARN-2387.patch

Updated the patch

> Resource Manager crashes with NPE due to lack of synchronization
> 
>
> Key: YARN-2387
> URL: https://issues.apache.org/jira/browse/YARN-2387
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.5.0
>Reporter: Mit Desai
>Assignee: Mit Desai
>Priority: Blocker
> Attachments: YARN-2387.patch, YARN-2387.patch
>
>
> We recently came across a 0.23 RM crashing with an NPE. Here is the 
> stacktrace for it.
> {noformat}
> 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
> handling event type NODE_UPDATE to the scheduler
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
> at
> org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
> at java.lang.String.valueOf(String.java:2854)
> at java.lang.StringBuilder.append(StringBuilder.java:128)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
> at java.lang.String.valueOf(String.java:2854)
> at java.lang.StringBuilder.append(StringBuilder.java:128)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
> at java.lang.Thread.run(Thread.java:722)
> 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {noformat}
> On investigating the issue we found that ContainerStatusPBImpl has 
> methods that are called by different threads and are not synchronized. Even 
> the 2.X code looks the same.
> We need to make these methods synchronized so that we do not encounter this 
> problem in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2623) Linux container executor only use the first local directory to copy token file in container-executor.c.

2014-09-29 Thread zhihai xu (JIRA)
zhihai xu created YARN-2623:
---

 Summary: Linux container executor only use the first local 
directory to copy token file in container-executor.c.
 Key: YARN-2623
 URL: https://issues.apache.org/jira/browse/YARN-2623
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
 Environment: Linux container executor only use the first local 
directory to copy token file in container-executor.c.
Reporter: zhihai xu
Assignee: zhihai xu


The Linux container executor only uses the first local directory to copy the token 
file in container-executor.c. If it fails to copy the token file to the first local 
directory, a localization failure event will happen, even though it could copy the 
token file to another local directory successfully. The correct behavior would be 
to copy the token file to the next local directory if the first one failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.

2014-09-29 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152428#comment-14152428
 ] 

zhihai xu commented on YARN-2566:
-

For the Linux container executor, this is done in the C file container-executor.c.
It also picks the first directory to copy the token file;
see the following code in container-executor.c:
{code}
  char *primary_app_dir = NULL;
  for (nm_root = local_dirs; *nm_root != NULL; ++nm_root) {
    char *app_dir = get_app_directory(*nm_root, user, app_id);
    if (app_dir == NULL) {
      // try the next one
    } else if (mkdirs(app_dir, permissions) != 0) {
      free(app_dir);
    } else if (primary_app_dir == NULL) {
      primary_app_dir = app_dir;
    } else {
      free(app_dir);
    }
  }
  char *cred_file_name = concatenate("%s/%s", "cred file", 2,
                                     primary_app_dir,
                                     basename(nmPrivate_credentials_file_copy));
  if (copy_file(cred_file, nmPrivate_credentials_file,
                cred_file_name, S_IRUSR|S_IWUSR) != 0) {
    free(nmPrivate_credentials_file_copy);
    return -1;
  }
{code}

I created a new jira YARN-2623 for LCE.
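
The intent of that fix, sketched in Java just to show the control flow (the actual change belongs in container-executor.c, and all names here are placeholders):

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// hypothetical sketch: try each local dir in turn instead of failing as soon
// as the copy to the first one fails (e.g. because that disk is full)
static Path copyTokenWithFallback(List<String> localDirs, Path tokenFile)
    throws IOException {
  IOException lastError = null;
  for (String dir : localDirs) {
    Path dst = Paths.get(dir, tokenFile.getFileName().toString());
    try {
      return Files.copy(tokenFile, dst);
    } catch (IOException e) {
      lastError = e;  // remember the failure and try the next local dir
    }
  }
  throw lastError != null ? lastError
      : new IOException("no local directory available for " + tokenFile);
}
{code}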


> IOException happen in startLocalizer of DefaultContainerExecutor due to not 
> enough disk space for the first localDir.
> -
>
> Key: YARN-2566
> URL: https://issues.apache.org/jira/browse/YARN-2566
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2566.000.patch, YARN-2566.001.patch
>
>
> startLocalizer in DefaultContainerExecutor will only use the first localDir 
> to copy the token file. If the copy fails for the first localDir due to not 
> enough disk space in the first localDir, the localization will fail even 
> though there is plenty of disk space in other localDirs. We see the following 
> error for this case:
> {code}
> 2014-09-13 23:33:25,171 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to 
> create app directory 
> /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
> java.io.IOException: mkdir of 
> /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
>   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
>   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,185 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Localizer failed
> java.io.FileNotFoundException: File 
> file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 
> does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
>   at 
> org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.(ChecksumFs.java:344)
>   at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
>   at 
> org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
>   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
>   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
>   at org.apache.hadoop.fs.FileContext$Util.copy(FileC

[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152446#comment-14152446
 ] 

Hadoop QA commented on YARN-2387:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671880/YARN-2387.patch
  against trunk revision c88c6c5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5174//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5174//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5174//console

This message is automatically generated.

> Resource Manager crashes with NPE due to lack of synchronization
> 
>
> Key: YARN-2387
> URL: https://issues.apache.org/jira/browse/YARN-2387
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.5.0
>Reporter: Mit Desai
>Assignee: Mit Desai
>Priority: Blocker
> Attachments: YARN-2387.patch, YARN-2387.patch
>
>
> We recently came across a 0.23 RM crashing with an NPE. Here is the 
> stacktrace for it.
> {noformat}
> 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
> handling event type NODE_UPDATE to the scheduler
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
> at
> org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
> at java.lang.String.valueOf(String.java:2854)
> at java.lang.StringBuilder.append(StringBuilder.java:128)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
> at java.lang.String.valueOf(String.java:2854)
> at java.lang.StringBuilder.append(StringBuilder.java:128)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
> at java.lang.Thread.run(Thread.java:722)
> 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {noformat}
> On investigating the issue we found that ContainerStatusPBImpl has 
> methods that are called by different threads and are not synchronized. Even 
> the 2.X code looks the same.
> We need to make these methods synchronized so that we do not encounter this 
> problem in the future.



--
This messag

[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-09-29 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152483#comment-14152483
 ] 

Anubhav Dhoot commented on YARN-1879:
-

Nit in ProtocolHATestBase

> method will be re-entry
 method will be re-entered

>the entire logic test.
the entire logic of the test?

>APIs that added trigger flag.
APIs that added Idempotent/AtOnce annotation?

Looks good otherwise

> Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
> ---
>
> Key: YARN-1879
> URL: https://issues.apache.org/jira/browse/YARN-1879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Tsuyoshi OZAWA
>Priority: Critical
> Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
> YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
> YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.2-wip.patch, 
> YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
> YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager

2014-09-29 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152486#comment-14152486
 ] 

Zhijie Shen commented on YARN-2527:
---

The patch almost looks good to me, in particular the additional test cases for 
ApplicationACLsManager. Just one nit:

1. The logic here is a bit counter-intuitive. Can we just assign 
acls.get(applicationAccessType) to applicationACL only when it is not null?
{code}
  applicationACL = acls.get(applicationAccessType);
  if (applicationACL == null) {
if (LOG.isDebugEnabled()) {
  LOG.debug("ACL not found for access-type " + applicationAccessType
  + " for application " + applicationId + " owned by "
  + applicationOwner + ". Using default ["
  + YarnConfiguration.DEFAULT_YARN_APP_ACL + "]");
}
applicationACL = DEFAULT_YARN_APP_ACL;
{code}
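
i.e. roughly the following - just a sketch of the restructuring, assuming the surrounding declarations stay as they are (AccessControlList is my guess at the type):

{code}
AccessControlList applicationACL = DEFAULT_YARN_APP_ACL;
AccessControlList configuredACL = acls.get(applicationAccessType);
if (configuredACL != null) {
  applicationACL = configuredACL;
} else if (LOG.isDebugEnabled()) {
  LOG.debug("ACL not found for access-type " + applicationAccessType
      + " for application " + applicationId + " owned by "
      + applicationOwner + ". Using default ["
      + YarnConfiguration.DEFAULT_YARN_APP_ACL + "]");
}
{code}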

> NPE in ApplicationACLsManager
> -
>
> Key: YARN-2527
> URL: https://issues.apache.org/jira/browse/YARN-2527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Benoy Antony
>Assignee: Benoy Antony
> Attachments: YARN-2527.patch, YARN-2527.patch
>
>
> NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error.
> The relevant stacktrace snippet from the ResourceManager logs is as below
> {code}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
> {code}
> This issue was reported by [~miguenther].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2301) Improve yarn container command

2014-09-29 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152503#comment-14152503
 ] 

Naganarasimha G R commented on YARN-2301:
-

Attaching patch with corrected test cases.

> Improve yarn container command
> --
>
> Key: YARN-2301
> URL: https://issues.apache.org/jira/browse/YARN-2301
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Naganarasimha G R
>  Labels: usability
> Attachments: YARN-2301.01.patch
>
>
> While running the yarn container -list  command, some 
> observations:
> 1) the scheme (e.g. http/https  ) before LOG-URL is missing
> 2) the start-time is printed as milliseconds (e.g. 1405540544844). Better to 
> print it in a time format.
> 3) finish-time is 0 if the container is not yet finished. Maybe show "N/A"
> 4) May have an option to run as yarn container -list  OR  yarn 
> application -list-containers  also.  
> As the attempt Id is not shown on the console, it is easier for the user to just copy 
> the appId and run it; this may also be useful for container-preserving AM 
> restart. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user

2014-09-29 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152504#comment-14152504
 ] 

Craig Welch commented on YARN-1063:
---

When looking this over to pick up context for YARN-2198, I noticed a couple of things:

libwinutils.c CreateLogonForUser - confusing name, makes me think a new
account is being created - CreateLogonTokenForUser?  LogonUser?

TestWinUtils - can we add testing specific to security?

> Winutils needs ability to create task as domain user
> 
>
> Key: YARN-1063
> URL: https://issues.apache.org/jira/browse/YARN-1063
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
> Environment: Windows
>Reporter: Kyle Leckie
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, 
> YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch
>
>
> h1. Summary:
> Securing a Hadoop cluster requires constructing some form of security 
> boundary around the processes executed in YARN containers. Isolation based on 
> Windows user isolation seems most feasible. This approach is similar to the 
> approach taken by the existing LinuxContainerExecutor. The current patch to 
> winutils.exe adds the ability to create a process as a domain user. 
> h1. Alternative Methods considered:
> h2. Process rights limited by security token restriction:
> On Windows access decisions are made by examining the security token of a 
> process. It is possible to spawn a process with a restricted security token. 
> Any of the rights granted by SIDs of the default token may be restricted. It 
> is possible to see this in action by examining the security token of a 
> sandboxed process launched by a web browser. Typically the launched process 
> will have a fully restricted token and need to access machine resources 
> through a dedicated broker process that enforces a custom security policy. 
> This broker process mechanism would break compatibility with the typical 
> Hadoop container process. The Container process must be able to utilize 
> standard function calls for disk and network IO. I performed some work 
> looking at ways to ACL the local files to the specific process launched without 
> granting rights to other processes launched on the same machine, but found 
> this to be an overly complex solution. 
> h2. Relying on APP containers:
> Recent versions of Windows have the ability to launch processes within an 
> isolated container. Application containers are supported for execution of 
> WinRT based executables. This method was ruled out due to the lack of 
> official support for standard Windows APIs. At some point in the future 
> Windows may support functionality similar to BSD jails or Linux containers; 
> at that point support for containers should be added.
> h1. Create As User Feature Description:
> h2. Usage:
> A new sub command was added to the set of task commands. Here is the syntax:
> winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE]
> Some notes:
> * The username specified is in the format of "user@domain"
> * The machine executing this command must be joined to the domain of the user 
> specified
> * The domain controller must allow the account executing the command access 
> to the user information. For this, join the account to the predefined group 
> labeled "Pre-Windows 2000 Compatible Access"
> * The account running the command must have several rights on the local 
> machine. These can be managed manually using secpol.msc: 
> ** "Act as part of the operating system" - SE_TCB_NAME
> ** "Replace a process-level token" - SE_ASSIGNPRIMARYTOKEN_NAME
> ** "Adjust memory quotas for a process" - SE_INCREASE_QUOTA_NAME
> * The launched process will not have rights to the desktop, so it will not be 
> able to display any information or create a UI.
> * The launched process will have no network credentials. Any access of 
> network resources that requires domain authentication will fail.
> h2. Implementation:
> Winutils performs the following steps:
> # Enable the required privileges for the current process.
> # Register as a trusted process with the Local Security Authority (LSA).
> # Create a new logon for the user passed on the command line.
> # Load/Create a profile on the local machine for the new logon.
> # Create a new environment for the new logon.
> # Launch the new process in a job with the task name specified and using the 
> created logon.
> # Wait for the JOB to exit.
> h2. Future work:
> The following work was scoped out of this check in:
> * Support for non-domain users or machines that are not domain joined.
> * Support for privilege isolation by running the task launcher in a high 
> privilege service with access over an ACLed named pipe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor

2014-09-29 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152507#comment-14152507
 ] 

Craig Welch commented on YARN-1972:
---

ContainerLaunch
launchContainer - nit, why "userName" here, it's user everywhere else
getLocalWrapperScriptBuilder - why not an override instead of conditional (see 
below wrt WindowsContainerExecutor)

WindowsSecureContainerExecutor - I really think there should be a 
"WindowsContainerExecutor", and that we should generally move the differences to 
inheritance rather than conditionals (as far as is reasonable/related to this 
change, and incrementally as we go forward - no need to boil the ocean, but it 
would be good to set a good foundation here). Windows-specific logic, secure or 
not, should be based in this class. If the differences required for the 
security-specific logic are significant enough, by all means also have a 
WindowsSecureContainerExecutor which inherits from WindowsContainerExecutor. 
I think, as much as possible, the logic should be the same for both, with only 
the security-specific functionality as a delta (right now, it looks like 
non-secure Windows uses the default implementation, and may differ more from 
the "Windows secure" one than it should).
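
To illustrate the shape of the hierarchy being suggested (DefaultContainerExecutor 
and WindowsSecureContainerExecutor exist today; "WindowsContainerExecutor" is the 
proposed class, and the bodies are only placeholders, not the actual patch):

{code}
// Sketch only: shared Windows logic lives in one class, with the
// security-specific delta isolated in a subclass.
public class WindowsContainerExecutor extends DefaultContainerExecutor {
  // Windows-specific behaviour, secure or not (wrapper scripts,
  // classpath-jar handling, path quoting, ...) would be based here.
}

public class WindowsSecureContainerExecutor extends WindowsContainerExecutor {
  // Only the security delta: launching the container process as the job
  // submitter via winutils 'task createAsUser', secure localization, etc.
}
{code}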

> Implement secure Windows Container Executor
> ---
>
> Key: YARN-1972
> URL: https://issues.apache.org/jira/browse/YARN-1972
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, 
> YARN-1972.delta.4.patch, YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, 
> YARN-1972.trunk.5.patch
>
>
> h1. Windows Secure Container Executor (WCE)
> YARN-1063 adds the necessary infrastructure to launch a process as a domain 
> user as a solution for the problem of having a security boundary between 
> processes executed in YARN containers and the Hadoop services. The WCE is a 
> container executor that leverages the winutils capabilities introduced in 
> YARN-1063 and launches containers as an OS process running as the job 
> submitter user. A description of the S4U infrastructure used by YARN-1063 and 
> the alternatives considered can be read on that JIRA.
> The WCE is based on the DefaultContainerExecutor. It relies on the DCE to 
> drive the flow of execution, but it overrides some methods to the effect of:
> * changes the DCE-created user cache directories to be owned by the job user 
> and by the nodemanager group.
> * changes the actual container run command to use the 'createAsUser' command 
> of winutils task instead of 'create'
> * runs the localization as a standalone process instead of an in-process Java 
> method call. This in turn relies on the winutils createAsUser feature to run 
> the localization as the job user.
>  
> When compared to LinuxContainerExecutor (LCE), the WCE has some minor 
> differences:
> * it does not delegate the creation of the user cache directories to the 
> native implementation.
> * it does not require special handling to be able to delete user files
> The approach on the WCE came from a practical trial-and-error approach. I had 
> to iron out some issues around the Windows script shell limitations (command 
> line length) to get it to work, the biggest issue being the huge CLASSPATH 
> that is commonplace in Hadoop environment container executions. The job 
> container itself is already dealing with this via a so called 'classpath 
> jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch 
> as a separate container the same issue had to be resolved and I used the same 
> 'classpath jar' approach.
> h2. Deployment Requirements
> To use the WCE one needs to set the 
> `yarn.nodemanager.container-executor.class` to 
> `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` 
> and set the `yarn.nodemanager.windows-secure-container-executor.group` to a 
> Windows security group that the nodemanager service principal is a 
> member of (the equivalent of the LCE 
> `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE 
> does not require any configuration outside of Hadoop's own yarn-site.xml.
> For the WCE to work the nodemanager must run as a service principal that is a 
> member of the local Administrators group or LocalSystem. This is derived from 
> the need to invoke the LoadUserProfile API, which mentions these requirements 
> in its specification. This is in addition to the SE_TCB privilege mentioned in 
> YARN-1063, but this requirement automatically implies that the SE_TCB 
> privilege is held by the nodemanager. For the Linux speakers in the audience, 
> the requirement is basically to run the NM as root.

[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-29 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152523#comment-14152523
 ] 

Craig Welch commented on YARN-2198:
---

pom.xml - don’t see a /etc/hadoop or a wsce-site.xml, missed?

RawLocalFileSystem

Is someone from HDFS looking at this?

protected boolean mkOneDir(File p2f) throws IOException - nit, generalize arg 
name pls

return (parent == null || parent2f.exists() || mkdirs(parent)) &&
+  (mkOneDir(p2f) || p2f.isDirectory());

so, I don't get this logic, and I believe it will fail if the path exists and is 
not a directory. Why not just do mkdirs(p2f) if p2f doesn't exist? That seems much 
simpler, and drops the need for mkOneDir.
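
A minimal sketch of the simpler check, assuming plain java.io.File semantics (the 
helper name is made up; the real RawLocalFileSystem code will differ):

{code}
import java.io.File;

// Sketch: create the directory only when the path is absent, then verify that
// a directory actually exists at the path before reporting success.
class DirUtil {
  static boolean ensureDirectory(File dir) {
    if (!dir.exists()) {
      dir.mkdirs(); // may race with a concurrent creator; re-checked below
    }
    // false if the path exists but is a plain file
    return dir.isDirectory();
  }
}
{code}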

NativeIO

Elevated class - I believe this is Windows specific, "WindowsElevated" or 
"ElevatedWindows"?  Why doesn't it extend "Windows"? I don't think secure and 
insecure Windows should become "wholly dissimilar"

createTaskAsUser, killTask, ProcessStub:

These aren't really "io", I think they should be factored out to their own 
process-specific class


> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
> YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
> YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
> YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
> YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
> YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However this 
> executor requires the process launching the container to be LocalSystem or a 
> member of the local Administrators group. Since the process in question is 
> the NodeManager, the requirement translates to the entire NM to run as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS

2014-09-29 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152532#comment-14152532
 ] 

Zhijie Shen commented on YARN-2583:
---

Some thoughts about the log deletion service of LRS:

1. I'm not sure if it's good to do normal log deletion in 
AggregatedLogDeletionService, while deleting rolling logs in 
AppLogAggregatorImpl. AggregatedLogDeletionService (inside JHS) will still try 
to delete the whole log dir while the LRS is still running.

2. Usually we do retention by time instead of by size, and it's inconsistent 
between AggregatedLogDeletionService and AppLogAggregatorImpl. While 
AggregatedLogDeletionService keeps all the logs newer than T1, 
AppLogAggregatorImpl may have already deleted logs newer than T1 to limit the 
number of logs of the LRS. It becomes unpredictable after what time the 
logs will still be available for access.

3. Another problem w.r.t. NM_LOG_AGGREGATION_RETAIN_RETENTION_SIZE_PER_APP is 
that the config favors the longer rollingIntervalSeconds. For example, say 
NM_LOG_AGGREGATION_RETAIN_RETENTION_SIZE_PER_APP = 10. If an LRS sets 
rollingIntervalSeconds = 1D, after 10D it is still going to keep all the logs. 
However, if the LRS sets rollingIntervalSeconds = 0.5D, after 10D it can only 
keep the last 5D's logs, even though the amount of generated logs is the same.

4. Assuming we want to do deletion in AppLogAggregatorImpl, should we do the 
deletion first and the uploading next, to avoid the number of logs temporarily 
going beyond the cap?
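
A rough sketch of the "delete first, upload next" ordering in point 4, with 
invented names (strings stand in for the real log file handles):

{code}
import java.util.Deque;
import java.util.List;

// Sketch: enforce the per-app retention cap before uploading, so the number
// of retained files never temporarily exceeds the cap.
class RollingLogRetentionSketch {
  void runCycle(Deque<String> uploaded, List<String> pending, int cap) {
    while (!uploaded.isEmpty() && uploaded.size() + pending.size() > cap) {
      delete(uploaded.removeFirst()); // drop the oldest uploaded log first
    }
    upload(pending);                  // then upload this cycle's new files
    uploaded.addAll(pending);
  }

  void delete(String remoteLog) { /* remove from the remote app-log-dir */ }
  void upload(List<String> logs) { /* append to the aggregated log */ }
}
{code}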

> Modify the LogDeletionService to support Log aggregation for LRS
> 
>
> Key: YARN-2583
> URL: https://issues.apache.org/jira/browse/YARN-2583
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2583.1.patch
>
>
> Currently, AggregatedLogDeletionService will delete old logs from HDFS. It 
> checks the cut-off time, and if all logs for this application are older than 
> the cut-off time, the app-log-dir in HDFS will be deleted. This will not 
> work for LRS. We expect an LRS application to keep running for a long time. 
> Two different scenarios: 
> 1) If we configured the rollingIntervalSeconds, new log files will keep 
> being uploaded to HDFS. The number of log files for this application will 
> become larger and larger, and no log files will ever be deleted.
> 2) If we did not configure the rollingIntervalSeconds, the log file can only 
> be uploaded to HDFS after the application is finished. It is very possible 
> that the logs are uploaded after the cut-off time. This will cause problems 
> because at that time the app-log-dir for this application in HDFS has already 
> been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2598) GHS should show N/A instead of null for the inaccessible information

2014-09-29 Thread Mayank Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152534#comment-14152534
 ] 

Mayank Bansal commented on YARN-2598:
-

+1 LGTM.
Will run the tests and commit if they succeed.

Thanks,
Mayank

> GHS should show N/A instead of null for the inaccessible information
> 
>
> Key: YARN-2598
> URL: https://issues.apache.org/jira/browse/YARN-2598
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-2598.1.patch
>
>
> When the user doesn't have access to an application, the app attempt 
> information is not visible to the user. ClientRMService will output N/A, but 
> GHS is showing null, which is not user-friendly.
> {code}
> 14/09/24 22:07:20 INFO impl.TimelineClientImpl: Timeline service address: 
> http://nn.example.com:8188/ws/v1/timeline/
> 14/09/24 22:07:20 INFO client.RMProxy: Connecting to ResourceManager at 
> nn.example.com/240.0.0.11:8050
> 14/09/24 22:07:21 INFO client.AHSProxy: Connecting to Application History 
> server at nn.example.com/240.0.0.11:10200
> Application Report : 
>   Application-Id : application_1411586934799_0001
>   Application-Name : Sleep job
>   Application-Type : MAPREDUCE
>   User : hrt_qa
>   Queue : default
>   Start-Time : 1411586956012
>   Finish-Time : 1411586989169
>   Progress : 100%
>   State : FINISHED
>   Final-State : SUCCEEDED
>   Tracking-URL : null
>   RPC Port : -1
>   AM Host : null
>   Aggregate Resource Allocation : N/A
>   Diagnostics : null
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2301) Improve yarn container command

2014-09-29 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-2301:

Attachment: YARN-2303.patch

Attaching patch for the unit test failures.


> Improve yarn container command
> --
>
> Key: YARN-2301
> URL: https://issues.apache.org/jira/browse/YARN-2301
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Naganarasimha G R
>  Labels: usability
> Attachments: YARN-2301.01.patch, YARN-2303.patch
>
>
> While running the yarn container -list <appAttemptId> command, some 
> observations:
> 1) the scheme (e.g. http/https) before LOG-URL is missing
> 2) the start-time is printed in milliseconds (e.g. 1405540544844). Better to 
> print it in a readable time format.
> 3) finish-time is 0 if the container is not yet finished. Maybe show "N/A" instead.
> 4) May also have an option to run it as yarn container -list <appId> OR yarn 
> application -list-containers <appId>.
> As the attempt Id is not shown on the console, this makes it easier for the user 
> to just copy the appId and run it; it may also be useful for container-preserving 
> AM restart. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store

2014-09-29 Thread Mayank Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152559#comment-14152559
 ] 

Mayank Bansal commented on YARN-2320:
-

I think overall it looks OK, however I have to run.

Some small comments:

Shouldn't we use N/A in convertToApplicationAttemptReport instead of null?
Similarly for convertToApplicationReport?
Similarly for convertToContainerReport?
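
For illustration, the kind of substitution being asked for could look like this 
(the helper is hypothetical; the converter methods named above would call it when 
filling string fields):

{code}
// Sketch: report "N/A" wherever the underlying field is null.
final class ReportDefaults {
  static final String NOT_AVAILABLE = "N/A";

  static String orNA(String value) {
    return value == null ? NOT_AVAILABLE : value;
  }
}
{code}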

> Removing old application history store after we store the history data to 
> timeline store
> 
>
> Key: YARN-2320
> URL: https://issues.apache.org/jira/browse/YARN-2320
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-2320.1.patch, YARN-2320.2.patch
>
>
> After YARN-2033, we should deprecate the application history store set. There's 
> no need to maintain two sets of store interfaces. In addition, we should 
> conclude the outstanding JIRAs under YARN-321 about the application history 
> store.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2301) Improve yarn container command

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152579#comment-14152579
 ] 

Hadoop QA commented on YARN-2301:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671912/YARN-2303.patch
  against trunk revision c88c6c5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5175//console

This message is automatically generated.

> Improve yarn container command
> --
>
> Key: YARN-2301
> URL: https://issues.apache.org/jira/browse/YARN-2301
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Naganarasimha G R
>  Labels: usability
> Attachments: YARN-2301.01.patch, YARN-2303.patch
>
>
> While running the yarn container -list <appAttemptId> command, some 
> observations:
> 1) the scheme (e.g. http/https) before LOG-URL is missing
> 2) the start-time is printed in milliseconds (e.g. 1405540544844). Better to 
> print it in a readable time format.
> 3) finish-time is 0 if the container is not yet finished. Maybe show "N/A" instead.
> 4) May also have an option to run it as yarn container -list <appId> OR yarn 
> application -list-containers <appId>.
> As the attempt Id is not shown on the console, this makes it easier for the user 
> to just copy the appId and run it; it may also be useful for container-preserving 
> AM restart. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2468) Log handling for LRS

2014-09-29 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2468:

Attachment: YARN-2468.9.patch

> Log handling for LRS
> 
>
> Key: YARN-2468
> URL: https://issues.apache.org/jira/browse/YARN-2468
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager, resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
> YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
> YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, 
> YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, 
> YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, 
> YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.patch
>
>
> Currently, when an application is finished, the NM will start to do the log 
> aggregation. But for long-running service (LRS) applications, this is not ideal. 
> The problems we have are:
> 1) LRS applications are expected to run for a long time (weeks, months).
> 2) Currently, all the container logs (from one NM) will be written into a 
> single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2179) Initial cache manager structure and context

2014-09-29 Thread Chris Trezzo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo updated YARN-2179:
---
Attachment: YARN-2179-trunk-v10.patch

[~vinodkv] [~kasha]

Attached is v10.

Here is a new approach where I extend YarnClientImpl, stub out the service 
init/start/stop methods and mock the relevant methods to test. Does this seem 
like a cleaner approach to you guys?

I tried to do a straight mocking without extending the abstract class, but 
continually ran into the issue that AbstractService.stateModel is initialized 
in the constructor. This creates a problem when trying to stub 
AbstractService.getServiceState(), which is required for the AbstractService to 
work with a CompositeService.

Let me know if you don't like this approach or you know of an easier method and 
I can readjust the patch. Thanks!
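
For reference, a minimal sketch of the pattern described above, with an invented 
class name and getApplications() chosen arbitrarily as the stubbed method (the 
actual methods stubbed in the patch may differ):

{code}
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.impl.YarnClientImpl;

// Sketch: extend YarnClientImpl so the state model is initialized by the
// constructor, no-op the service lifecycle, and return canned data.
public class StubbedYarnClientImpl extends YarnClientImpl {
  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // no-op: skip real initialization
  }

  @Override
  protected void serviceStart() throws Exception {
    // no-op: do not create an RM proxy
  }

  @Override
  protected void serviceStop() throws Exception {
    // no-op
  }

  @Override
  public List<ApplicationReport> getApplications() {
    return Collections.emptyList(); // canned data for the test
  }
}
{code}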

> Initial cache manager structure and context
> ---
>
> Key: YARN-2179
> URL: https://issues.apache.org/jira/browse/YARN-2179
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, 
> YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, 
> YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, 
> YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch
>
>
> Implement the initial shared cache manager structure and context. The 
> SCMContext will be used by a number of manager services (i.e. the backing 
> store and the cleaner service). The AppChecker is used to gather the 
> currently running applications on SCM startup (necessary for an SCM that is 
> backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2468) Log handling for LRS

2014-09-29 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152599#comment-14152599
 ] 

Xuan Gong commented on YARN-2468:
-

bq. Why is the test in TestAggregatedLogsBlock ignored?

We will have YARN-2583 for the web UI related changes. This test would fail 
right now, so I added @Ignore.

bq. pendingUploadFiles is really not needed to be a class field. Rename 
getNumOfLogFilesToUpload() to be getPendingLogFilesToUploadForThisContainer() 
and return the set of pending files. LogValue.write() can then take Set 
pendingLogFilesToUpload as one of the arguments.

I would like to check how many log files we can upload this time. If the number 
is 0, we can skip this cycle. And this check also has to happen before 
LogKey.write(); otherwise, we would write the key but without a value.

bq. If deletion of previously uploaded file takes a while and the file remains 
by the time of the next cycle, we will upload it again? It seems to be, let's 
validate this via a test-case.

No, it will not. That is why I saved information such as 
allExistingFiles, alreadyUploadedFiles, etc. We will use those to check whether 
the logs have been uploaded before.
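
A compact sketch of that bookkeeping, with invented names (the real fields live 
in the aggregator/LogValue code):

{code}
import java.io.File;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: remember what was uploaded in earlier cycles so a file that is still
// on disk at the next cycle is not uploaded a second time.
class UploadTrackerSketch {
  private final Set<String> alreadyUploaded = new HashSet<String>();

  // Only the files not uploaded in a previous cycle are pending.
  List<File> pendingFiles(List<File> existingFiles) {
    List<File> pending = new ArrayList<File>();
    for (File f : existingFiles) {
      if (!alreadyUploaded.contains(f.getName())) {
        pending.add(f);
      }
    }
    return pending;
  }

  void markUploaded(List<File> uploadedNow) {
    for (File f : uploadedNow) {
      alreadyUploaded.add(f.getName());
    }
  }
}
{code}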

bq. testLogAggregationServiceWithInterval: doLogAggregationOutOfBand + 
Thread.sleep() is unreliable. Use a clock and refactor AppLogAggregatorImpl to 
have the cyclic aggregation directly callable via a method.

The Thread.sleep() is not used to trigger the log aggregation. It is used to 
make sure the logs have been uploaded into the remote directory. But I have 
deleted those Thread.sleep() calls from the test cases.

> Log handling for LRS
> 
>
> Key: YARN-2468
> URL: https://issues.apache.org/jira/browse/YARN-2468
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager, resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
> YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
> YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, 
> YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, 
> YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, 
> YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.patch
>
>
> Currently, when an application is finished, the NM will start to do the log 
> aggregation. But for long-running service (LRS) applications, this is not ideal. 
> The problems we have are:
> 1) LRS applications are expected to run for a long time (weeks, months).
> 2) Currently, all the container logs (from one NM) will be written into a 
> single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2468) Log handling for LRS

2014-09-29 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152600#comment-14152600
 ] 

Xuan Gong commented on YARN-2468:
-

New patch addressed all other comments

> Log handling for LRS
> 
>
> Key: YARN-2468
> URL: https://issues.apache.org/jira/browse/YARN-2468
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager, resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
> YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
> YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, 
> YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, 
> YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, 
> YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.patch
>
>
> Currently, when an application is finished, the NM will start to do the log 
> aggregation. But for long-running service (LRS) applications, this is not ideal. 
> The problems we have are:
> 1) LRS applications are expected to run for a long time (weeks, months).
> 2) Currently, all the container logs (from one NM) will be written into a 
> single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2624) Resource Localization fails on a secure cluster until nm are restarted

2014-09-29 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2624:
---

 Summary: Resource Localization fails on a secure cluster until nm 
are restarted
 Key: YARN-2624
 URL: https://issues.apache.org/jira/browse/YARN-2624
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


We have found that resource localization fails on a secure cluster with the 
following error in certain cases. This happens at some indeterminate point, after 
which it will keep failing until the NM is restarted.

{noformat}
INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Failed to download rsrc { { 
hdfs://:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml,
 1412027745352, FILE, null 
},pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING}
java.io.IOException: Rename cannot overwrite non empty destination directory 
/data/yarn/nm/filecache/27
at 
org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228)
at 
org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2624) Resource Localization fails on a secure cluster until nm are restarted

2014-09-29 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-2624:

Component/s: nodemanager

> Resource Localization fails on a secure cluster until nm are restarted
> --
>
> Key: YARN-2624
> URL: https://issues.apache.org/jira/browse/YARN-2624
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>
> We have found that resource localization fails on a secure cluster with the 
> following error in certain cases. This happens at some indeterminate point, 
> after which it will keep failing until the NM is restarted.
> {noformat}
> INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Failed to download rsrc { { 
> hdfs://:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml,
>  1412027745352, FILE, null 
> },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING}
> java.io.IOException: Rename cannot overwrite non empty destination directory 
> /data/yarn/nm/filecache/27
>   at 
> org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
>   at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228)
>   at 
> org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
>   at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152637#comment-14152637
 ] 

Hadoop QA commented on YARN-2179:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12671924/YARN-2179-trunk-v10.patch
  against trunk revision c88c6c5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5176//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5176//console

This message is automatically generated.

> Initial cache manager structure and context
> ---
>
> Key: YARN-2179
> URL: https://issues.apache.org/jira/browse/YARN-2179
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, 
> YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, 
> YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, 
> YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch
>
>
> Implement the initial shared cache manager structure and context. The 
> SCMContext will be used by a number of manager services (i.e. the backing 
> store and the cleaner service). The AppChecker is used to gather the 
> currently running applications on SCM startup (necessary for an SCM that is 
> backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2621) Simplify the output when the user doesn't have the access for getDomain(s)

2014-09-29 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2621:
--
Attachment: YARN-2621.1.patch

Created a patch to fix the problem.

> Simplify the output when the user doesn't have the access for getDomain(s) 
> ---
>
> Key: YARN-2621
> URL: https://issues.apache.org/jira/browse/YARN-2621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-2621.1.patch
>
>
> Per discussion in 
> [YARN-2446|https://issues.apache.org/jira/browse/YARN-2446?focusedCommentId=14151272&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14151272],
>  we should simply reject the user if it doesn't have access to the domain(s), 
> instead of returning the entity without detailed information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2621) Simplify the output when the user doesn't have the access for getDomain(s)

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152695#comment-14152695
 ] 

Hadoop QA commented on YARN-2621:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671931/YARN-2621.1.patch
  against trunk revision 0577eb3.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5177//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5177//console

This message is automatically generated.

> Simplify the output when the user doesn't have the access for getDomain(s) 
> ---
>
> Key: YARN-2621
> URL: https://issues.apache.org/jira/browse/YARN-2621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-2621.1.patch
>
>
> Per discussion in 
> [YARN-2446|https://issues.apache.org/jira/browse/YARN-2446?focusedCommentId=14151272&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14151272],
>  we should simply reject the user if it doesn't have access to the domain(s), 
> instead of returning the entity without detailed information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2468) Log handling for LRS

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152717#comment-14152717
 ] 

Hadoop QA commented on YARN-2468:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671923/YARN-2468.9.patch
  against trunk revision 0577eb3.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5178//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5178//console

This message is automatically generated.

> Log handling for LRS
> 
>
> Key: YARN-2468
> URL: https://issues.apache.org/jira/browse/YARN-2468
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager, resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
> YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
> YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, 
> YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, 
> YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, 
> YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.patch
>
>
> Currently, when an application is finished, the NM will start to do the log 
> aggregation. But for long-running service (LRS) applications, this is not ideal. 
> The problems we have are:
> 1) LRS applications are expected to run for a long time (weeks, months).
> 2) Currently, all the container logs (from one NM) will be written into a 
> single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization

2014-09-29 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YARN-2387:

Attachment: YARN-2387.patch

> Resource Manager crashes with NPE due to lack of synchronization
> 
>
> Key: YARN-2387
> URL: https://issues.apache.org/jira/browse/YARN-2387
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.5.0
>Reporter: Mit Desai
>Assignee: Mit Desai
>Priority: Blocker
> Attachments: YARN-2387.patch, YARN-2387.patch, YARN-2387.patch
>
>
> We recently came across a 0.23 RM crashing with an NPE. Here is the 
> stacktrace for it.
> {noformat}
> 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
> handling event type NODE_UPDATE to the scheduler
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
> at
> org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
> at java.lang.String.valueOf(String.java:2854)
> at java.lang.StringBuilder.append(StringBuilder.java:128)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
> at java.lang.String.valueOf(String.java:2854)
> at java.lang.StringBuilder.append(StringBuilder.java:128)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
> at java.lang.Thread.run(Thread.java:722)
> 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {noformat}
> On investigating the issue, we found that ContainerStatusPBImpl has 
> methods that are called by different threads and are not synchronized. Even 
> the 2.x code looks the same.
> We need to make these methods synchronized so that we do not encounter this 
> problem in the future.
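
As a schematic illustration only (not the real ContainerStatusPBImpl code), the 
fix amounts to making the accessors that touch the shared builder/proto state 
synchronized:

{code}
// Sketch: synchronize the methods that read and merge the shared builder state
// so a toString() from the event-processing thread cannot race with a writer.
class PBImplSyncSketch {
  private Object proto;    // stands in for the protobuf message
  private Object builder;  // stands in for the protobuf builder

  public synchronized Object getProto() {
    mergeLocalToProto();
    return proto;
  }

  private synchronized void mergeLocalToProto() {
    // merge pending local fields from 'builder' into 'proto'
  }

  public synchronized void setField(Object value) {
    // mutate 'builder'; also synchronized so writes cannot interleave
  }
}
{code}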



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2468) Log handling for LRS

2014-09-29 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2468:

Attachment: YARN-2468.9.1.patch

> Log handling for LRS
> 
>
> Key: YARN-2468
> URL: https://issues.apache.org/jira/browse/YARN-2468
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager, resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
> YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
> YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, 
> YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, 
> YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, 
> YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, 
> YARN-2468.9.1.patch, YARN-2468.9.patch
>
>
> Currently, when an application is finished, the NM will start to do the log 
> aggregation. But for long-running service (LRS) applications, this is not ideal. 
> The problems we have are:
> 1) LRS applications are expected to run for a long time (weeks, months).
> 2) Currently, all the container logs (from one NM) will be written into a 
> single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152752#comment-14152752
 ] 

Hadoop QA commented on YARN-2387:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671946/YARN-2387.patch
  against trunk revision 0577eb3.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5179//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5179//console

This message is automatically generated.

> Resource Manager crashes with NPE due to lack of synchronization
> 
>
> Key: YARN-2387
> URL: https://issues.apache.org/jira/browse/YARN-2387
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.5.0
>Reporter: Mit Desai
>Assignee: Mit Desai
>Priority: Blocker
> Attachments: YARN-2387.patch, YARN-2387.patch, YARN-2387.patch
>
>
> We recently came across a 0.23 RM crashing with an NPE. Here is the 
> stacktrace for it.
> {noformat}
> 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
> handling event type NODE_UPDATE to the scheduler
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
> at
> org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
> at java.lang.String.valueOf(String.java:2854)
> at java.lang.StringBuilder.append(StringBuilder.java:128)
> at
> org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
> at java.lang.String.valueOf(String.java:2854)
> at java.lang.StringBuilder.append(StringBuilder.java:128)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
> at java.lang.Thread.run(Thread.java:722)
> 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {noformat}
> On investigating the issue, we found that ContainerStatusPBImpl has 
> methods that are called by different threads and are not synchronized. Even 
> the 2.x code looks the same.
> We need to make these methods synchronized so that we do not encounter this 
> problem in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2545) RMApp should transit to FAILED when AM calls finishApplicationMaster with FAILED

2014-09-29 Thread Hong Zhiguo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152804#comment-14152804
 ] 

Hong Zhiguo commented on YARN-2545:
---

[~leftnoteasy], [~jianhe], [~ozawa], please have a look: should we set the state of 
the app/appAttempt to FAILED instead of FINISHED, or just count it as "Apps Failed" 
instead of "Apps Completed"?

> RMApp should transit to FAILED when AM calls finishApplicationMaster with 
> FAILED
> 
>
> Key: YARN-2545
> URL: https://issues.apache.org/jira/browse/YARN-2545
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Hong Zhiguo
>Assignee: Hong Zhiguo
>Priority: Minor
>
> If AM calls finishApplicationMaster with getFinalApplicationStatus()==FAILED, 
> and then exits, the corresponding RMApp and RMAppAttempt transition to the 
> FINISHED state.
> I think this is wrong and confusing. On the RM WebUI, this application is 
> displayed as "State=FINISHED, FinalStatus=FAILED", and is counted as "Apps 
> Completed", not as "Apps Failed".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2468) Log handling for LRS

2014-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152835#comment-14152835
 ] 

Hadoop QA commented on YARN-2468:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671951/YARN-2468.9.1.patch
  against trunk revision 0577eb3.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5181//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5181//console

This message is automatically generated.

> Log handling for LRS
> 
>
> Key: YARN-2468
> URL: https://issues.apache.org/jira/browse/YARN-2468
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager, resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
> YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
> YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, 
> YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, 
> YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, 
> YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, 
> YARN-2468.9.1.patch, YARN-2468.9.patch
>
>
> Currently, when an application is finished, the NM will start to do the log 
> aggregation. But for long-running service (LRS) applications, this is not ideal. 
> The problems we have are:
> 1) LRS applications are expected to run for a long time (weeks, months).
> 2) Currently, all the container logs (from one NM) will be written into a 
> single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.

2014-09-29 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152842#comment-14152842
 ] 

zhihai xu commented on YARN-2566:
-

Picking the directory with the most available space is a good suggestion. I will 
implement it in my new patch.
Thanks.
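
A small sketch of what that selection might look like, using 
java.io.File.getUsableSpace() (the actual patch may instead reuse the NM's 
existing local-dir handling):

{code}
import java.io.File;
import java.util.List;

// Sketch: pick the local dir with the most usable space instead of always
// copying the token file into the first one.
class LocalDirChooser {
  static File dirWithMostSpace(List<File> localDirs) {
    File best = null;
    long bestSpace = -1L;
    for (File dir : localDirs) {
      long space = dir.getUsableSpace(); // 0 if the path does not exist
      if (space > bestSpace) {
        bestSpace = space;
        best = dir;
      }
    }
    return best; // null only if localDirs is empty
  }
}
{code}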

> IOException happen in startLocalizer of DefaultContainerExecutor due to not 
> enough disk space for the first localDir.
> -
>
> Key: YARN-2566
> URL: https://issues.apache.org/jira/browse/YARN-2566
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2566.000.patch, YARN-2566.001.patch
>
>
> startLocalizer in DefaultContainerExecutor will only use the first localDir 
> to copy the token file. If the copy fails for the first localDir due to not 
> enough disk space in the first localDir, the localization will fail even if 
> there is plenty of disk space in the other localDirs. We see the following 
> error for this case:
> {code}
> 2014-09-13 23:33:25,171 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to 
> create app directory 
> /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
> java.io.IOException: mkdir of 
> /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
>   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
>   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,185 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Localizer failed
> java.io.FileNotFoundException: File 
> file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 
> does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
>   at 
> org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.(ChecksumFs.java:344)
>   at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
>   at 
> org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
>   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
>   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
>   at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
>   at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,186 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1410663092546_0004_01_01 transitioned from 
> LOCALIZING to LOCALIZATION_FAILED
> 2014-09-13 23:33:25,187 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera   
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: LOCA