[jira] [Created] (YARN-2679) add container launch prepare time metrics to NM.
zhihai xu created YARN-2679: --- Summary: add container launch prepare time metrics to NM. Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Add a metric to NodeManagerMetrics that captures the time to prepare a container launch. The prepare time is the duration between sending the ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving the ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
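For readers unfamiliar with the Hadoop metrics2 API, a minimal sketch of what such a timer metric could look like follows; the class and member names are illustrative, not the actual YARN-2679 patch.

{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableRate;

// Illustrative sketch only; the real change would extend NodeManagerMetrics.
@Metrics(about = "Container launch preparation time", context = "yarn")
public class LaunchPrepMetricsSketch {
  // MutableRate tracks the number of samples and their average value.
  @Metric("Time from LAUNCH_CONTAINER to CONTAINER_LAUNCHED in ms")
  MutableRate containerLaunchPrepTimeMs;

  public static LaunchPrepMetricsSketch create() {
    return DefaultMetricsSystem.instance().register(
        "LaunchPrepMetricsSketch", "sketch source", new LaunchPrepMetricsSketch());
  }

  // The caller records when LAUNCH_CONTAINER was dispatched and calls this
  // once CONTAINER_LAUNCHED arrives for the same container.
  public void recordPrepTime(long launchDispatchedMs, long launchedReceivedMs) {
    containerLaunchPrepTimeMs.add(launchedReceivedMs - launchDispatchedMs);
  }
}
{code}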
[jira] [Updated] (YARN-2679) add container launch prepare time metrics to NM.
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2679: Attachment: YARN-2679.000.patch add container launch prepare time metrics to NM. Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2679.000.patch Add a metric to NodeManagerMetrics that captures the time to prepare a container launch. The prepare time is the duration between sending the ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving the ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2636) Windows Secure Container Executor: add unit tests for WSCE
[ https://issues.apache.org/jira/browse/YARN-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2636: --- Attachment: YARN-2636.delta.1.patch delta.1.patch requires YARN-2198. Adds a new unit test for WindowsSecureContainerExecutor. Windows Secure Container Executor: add unit tests for WSCE -- Key: YARN-2636 URL: https://issues.apache.org/jira/browse/YARN-2636 Project: Hadoop YARN Issue Type: Test Reporter: Remus Rusanu Assignee: Remus Rusanu Priority: Critical Attachments: YARN-2636.delta.1.patch As the title says: the WSCE has no checked-in unit tests. Much of the functionality depends on the elevated hadoopwinutilsvc service and cannot be tested, but let's test what can be mocked in Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
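A rough idea of the "mock what we can" approach, assuming JUnit 4 and Mockito; the ElevatedOps interface below is a hypothetical stand-in for the elevated hadoopwinutilsvc calls, not the real WSCE API.

{code}
import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.*;

import org.junit.Test;

public class TestWsceMockingSketch {
  // Hypothetical stand-in for the elevated hadoopwinutilsvc operations.
  interface ElevatedOps {
    int createTaskAsUser(String user, String cmdLine);
  }

  @Test
  public void launchDelegatesToElevatedService() {
    ElevatedOps ops = mock(ElevatedOps.class);
    when(ops.createTaskAsUser(eq("testuser"), anyString())).thenReturn(0);

    // The executor under test would be wired with the mock instead of the
    // JNI/LPC path, so the Java-side logic can run in a plain unit test.
    int exitCode = ops.createTaskAsUser("testuser", "echo hello");

    assertEquals(0, exitCode);
    verify(ops).createTaskAsUser("testuser", "echo hello");
  }
}
{code}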
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169132#comment-14169132 ] Remus Rusanu commented on YARN-2198: [~cwelch] I have added the suggested WSCE unit test in YARN-2636 Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However, this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates into running the entire NM as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low-privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC, etc. My proposal, though, would be to use Windows LPC (Local Procedure Calls), which is a Windows platform-specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils, which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication, and the privileged NT service can use the authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2680) Node shouldn't be listed as RUNNING when NM daemon is stopped even when recovery is enabled.
Junping Du created YARN-2680: Summary: Node shouldn't be listed as RUNNING when NM daemon is stopped even when recovery is enabled. Key: YARN-2680 URL: https://issues.apache.org/jira/browse/YARN-2680 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Junping Du Assignee: Junping Du Priority: Critical After YARN-1336 (specifically YARN-1337), we now support preserving containers across NM restart. While the NM is down, the node shouldn't be listed as RUNNING by the yarn node CLI or shown on the RM website. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2680) Node shouldn't be listed as RUNNING when NM daemon is stopped even when recovery is enabled.
[ https://issues.apache.org/jira/browse/YARN-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2680: - Assignee: (was: Junping Du) Node shouldn't be listed as RUNNING when NM daemon is stopped even when recovery is enabled. -- Key: YARN-2680 URL: https://issues.apache.org/jira/browse/YARN-2680 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Junping Du Priority: Critical After YARN-1336 (specifically YARN-1337), we now support preserving containers across NM restart. While the NM is down, the node shouldn't be listed as RUNNING by the yarn node CLI or shown on the RM website. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
cntic created YARN-2681: --- Summary: Support bandwidth enforcement for containers while reading from HDFS Key: YARN-2681 URL: https://issues.apache.org/jira/browse/YARN-2681 Project: Hadoop YARN Issue Type: New Feature Components: capacityscheduler, nodemanager, resourcemanager Affects Versions: 2.4.0 Environment: Linux Reporter: cntic -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cntic updated YARN-2681: Description: To read/write data from HDFS, applications establish TCP/IP connections with the datanode. HDFS reads can be controlled by setting Linux Traffic Control (TC) rules on the data node that filter the appropriate connections. The current cgroups net_cls concept cannot be applied on the node where the container is launched, nor on the data node, since: - TC handles outgoing bandwidth only, so it cannot be set on the container's node (an HDFS read is incoming data for the container) - Since the HDFS data node is handled by only one process, it is not possible to use net_cls to separate connections from different containers to the datanode. Tasks: 1) Extend the Resource model to define a bandwidth enforcement rate 2) Monitor the TCP/IP connections established by the container-handling process and its child processes 3) Set Linux Traffic Control rules on the data node based on address:port pairs in order to enforce the bandwidth of outgoing data Support bandwidth enforcement for containers while reading from HDFS Key: YARN-2681 URL: https://issues.apache.org/jira/browse/YARN-2681 Project: Hadoop YARN Issue Type: New Feature Components: capacityscheduler, nodemanager, resourcemanager Affects Versions: 2.4.0 Environment: Linux Reporter: cntic To read/write data from HDFS, applications establish TCP/IP connections with the datanode. The HDFS read can be controlled by setting Linux Traffic Control (TC) rules on the data node that filter the appropriate connections. The current cgroups net_cls concept cannot be applied on the node where the container is launched, nor on the data node, since: - TC handles outgoing bandwidth only, so it cannot be set on the container's node (an HDFS read is incoming data for the container) - Since the HDFS data node is handled by only one process, it is not possible to use net_cls to separate connections from different containers to the datanode. Tasks: 1) Extend the Resource model to define a bandwidth enforcement rate 2) Monitor the TCP/IP connections established by the container-handling process and its child processes 3) Set Linux Traffic Control rules on the data node based on address:port pairs in order to enforce the bandwidth of outgoing data -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cntic updated YARN-2681: Description: To read/write data from HDFS, applications establish TCP/IP connections with the datanode. HDFS reads can be controlled by configuring the Linux Traffic Control (TC) subsystem on the data node to filter the appropriate connections. The current cgroups net_cls concept cannot be applied on the node where the container is launched, nor on the data node, since: - TC handles outgoing bandwidth only, so it cannot be set on the container's node (an HDFS read is incoming data for the container) - Since the HDFS data node is handled by only one process, it is not possible to use net_cls to separate connections from different containers to the datanode. Tasks: 1) Extend the Resource model to define a bandwidth enforcement rate 2) Monitor the TCP/IP connections established by the container-handling process and its child processes 3) Set Linux Traffic Control rules on the data node based on address:port pairs in order to enforce the bandwidth of outgoing data was: To read/write data from HDFS, applications establish TCP/IP connections with the datanode. HDFS reads can be controlled by setting Linux Traffic Control (TC) rules on the data node that filter the appropriate connections. The current cgroups net_cls concept cannot be applied on the node where the container is launched, nor on the data node, since: - TC handles outgoing bandwidth only, so it cannot be set on the container's node (an HDFS read is incoming data for the container) - Since the HDFS data node is handled by only one process, it is not possible to use net_cls to separate connections from different containers to the datanode. Tasks: 1) Extend the Resource model to define a bandwidth enforcement rate 2) Monitor the TCP/IP connections established by the container-handling process and its child processes 3) Set Linux Traffic Control rules on the data node based on address:port pairs in order to enforce the bandwidth of outgoing data Support bandwidth enforcement for containers while reading from HDFS Key: YARN-2681 URL: https://issues.apache.org/jira/browse/YARN-2681 Project: Hadoop YARN Issue Type: New Feature Components: capacityscheduler, nodemanager, resourcemanager Affects Versions: 2.4.0 Environment: Linux Reporter: cntic To read/write data from HDFS, applications establish TCP/IP connections with the datanode. HDFS reads can be controlled by configuring the Linux Traffic Control (TC) subsystem on the data node to filter the appropriate connections. The current cgroups net_cls concept cannot be applied on the node where the container is launched, nor on the data node, since: - TC handles outgoing bandwidth only, so it cannot be set on the container's node (an HDFS read is incoming data for the container) - Since the HDFS data node is handled by only one process, it is not possible to use net_cls to separate connections from different containers to the datanode. Tasks: 1) Extend the Resource model to define a bandwidth enforcement rate 2) Monitor the TCP/IP connections established by the container-handling process and its child processes 3) Set Linux Traffic Control rules on the data node based on address:port pairs in order to enforce the bandwidth of outgoing data -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cntic updated YARN-2681: Attachment: Traffic Control Design.png Support bandwidth enforcement for containers while reading from HDFS Key: YARN-2681 URL: https://issues.apache.org/jira/browse/YARN-2681 Project: Hadoop YARN Issue Type: New Feature Components: capacityscheduler, nodemanager, resourcemanager Affects Versions: 2.4.0 Environment: Linux Reporter: cntic Attachments: Traffic Control Design.png To read/write data from HDFS, applications establish TCP/IP connections with the datanode. HDFS reads can be controlled by configuring the Linux Traffic Control (TC) subsystem on the data node to filter the appropriate connections. The current cgroups net_cls concept cannot be applied on the node where the container is launched, nor on the data node, since: - TC handles outgoing bandwidth only, so it cannot be set on the container's node (an HDFS read is incoming data for the container) - Since the HDFS data node is handled by only one process, it is not possible to use net_cls to separate connections from different containers to the datanode. Tasks: 1) Extend the Resource model to define a bandwidth enforcement rate 2) Monitor the TCP/IP connections established by the container-handling process and its child processes 3) Set Linux Traffic Control rules on the data node based on address:port pairs in order to enforce the bandwidth of outgoing data -- This message was sent by Atlassian JIRA (v6.3.4#6332)
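To make task 3 concrete, here is a minimal sketch of installing such TC rules from Java, assuming an HTB qdisc and root privileges on the data node; the device name, class id, rate and address:port values are examples only, not part of the proposal.

{code}
import java.io.IOException;
import java.util.Arrays;

// Sketch of step 3 above: shape outgoing DataNode traffic for one container
// connection by steering it into a rate-limited HTB class.
public class TcRuleSketch {
  private static void run(String... cmd) throws IOException, InterruptedException {
    Process p = new ProcessBuilder(cmd).inheritIO().start();
    if (p.waitFor() != 0) {
      throw new IOException("command failed: " + Arrays.toString(cmd));
    }
  }

  public static void limitConnection(String dev, String containerAddr,
      int containerPort, String rate) throws Exception {
    // One-time setup: an HTB root qdisc on the DataNode's interface.
    run("tc", "qdisc", "add", "dev", dev, "root", "handle", "1:", "htb");
    // A class that caps the bandwidth for this container's reads.
    run("tc", "class", "add", "dev", dev, "parent", "1:", "classid", "1:10",
        "htb", "rate", rate);
    // A u32 filter matching the container's address:port (the destination of
    // the DataNode's outgoing packets) into the rate-limited class.
    run("tc", "filter", "add", "dev", dev, "protocol", "ip", "parent", "1:",
        "prio", "1", "u32",
        "match", "ip", "dst", containerAddr + "/32",
        "match", "ip", "dport", String.valueOf(containerPort), "0xffff",
        "flowid", "1:10");
  }

  public static void main(String[] args) throws Exception {
    limitConnection("eth0", "10.0.0.12", 39412, "80mbit");
  }
}
{code}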
[jira] [Commented] (YARN-2495) Allow admin to specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169211#comment-14169211 ] Naganarasimha G R commented on YARN-2495: - Hi [~wangda] [~aw], a few queries. Some requirements make more sense for centralized configuration than for distributed configuration: * ADD_LABELS (add valid labels which can be assigned to nodes of the cluster) * REMOVE_LABELS (remove labels from the cluster's valid set) Since the labels for each node are set by configuration in yarn-site.xml or by dynamic scripting, my opinion is not to support these cluster-level label operations in the distributed configuration. But if we do need to support them: * as the valid-label information is only available at the RM, after a node sends its heartbeat the RM retains only the valid labels for that node (and logs any invalid labels), or we can simply ignore all labels for that node if any invalid label exists (see the sketch after this message); * we need to store the valid cluster labels in the file, but need not store the labels of individual cluster nodes, since a node's labels can be obtained at every registration. Allow admin to specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Target of this JIRA is to allow admins to specify labels in each NM; this covers - User can set labels in each NM (by setting yarn-site.xml or using the script suggested by [~aw]) - NM will send labels to RM via the ResourceTracker API - RM will set labels in NodeLabelManager when the NM registers/updates labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
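A minimal sketch of the heartbeat-time validation option described above; the method and class names are illustrative, not the NodeLabelManager API.

{code}
import java.util.HashSet;
import java.util.Set;

// Sketch only: on registration/heartbeat, the RM keeps the labels it knows
// to be valid and logs the rest, as discussed in the comment above.
public class LabelValidationSketch {
  public static Set<String> retainValidLabels(Set<String> reported,
      Set<String> clusterValidLabels) {
    Set<String> accepted = new HashSet<>();
    for (String label : reported) {
      if (clusterValidLabels.contains(label)) {
        accepted.add(label);
      } else {
        System.err.println("Ignoring invalid node label: " + label);
      }
    }
    return accepted;
  }
}
{code}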
[jira] [Commented] (YARN-2680) Node shouldn't be listed as RUNNING when NM daemon is stopped even when recovery is enabled.
[ https://issues.apache.org/jira/browse/YARN-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169289#comment-14169289 ] Jason Lowe commented on YARN-2680: -- [~djp] could you elaborate more on the use-case? During NM restart the RM is unaware that the NM is being restarted because it's normally only down for a few seconds. If the RM was aware of the node being down then that could easily cause some undesired chaos (e.g.: RM informs AM of lost/removed node, AM decides to kill container and start a new one, etc.). Node shouldn't be listed as RUNNING when NM daemon is stopped even when recovery is enabled. -- Key: YARN-2680 URL: https://issues.apache.org/jira/browse/YARN-2680 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Junping Du Priority: Critical After YARN-1336 (specifically YARN-1337), we now support preserving containers across NM restart. While the NM is down, the node shouldn't be listed as RUNNING by the yarn node CLI or shown on the RM website. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2641) improve node decommission latency in RM.
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169299#comment-14169299 ] Karthik Kambatla commented on YARN-2641: bq. If NodeListManager#refreshNodes happens right after NodeListManager#isValidNode and before create a new RMNode in ResourceTrackerService#registerNodeManager. Issuing yarn rmadmin -refreshNodes on the RM and the NM registering with the RM are inherently racy, and I don't think we need to serialize them. improve node decommission latency in RM. Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch Improve node decommission latency in the RM. Currently, node decommission only happens after the RM receives a nodeHeartbeat from the NodeManager. The node heartbeat interval is configurable; the default value is 1 second. It would be better to do the decommission during the RM refresh (NodesListManager) instead of in nodeHeartbeat (ResourceTrackerService). The following is a much more serious issue: after the RM is refreshed (refreshNodes), if the NM to be decommissioned is killed before it sends a heartbeat to the RM, the RMNode will never be decommissioned in the RM; it will only expire after yarn.nm.liveness-monitor.expiry-interval-ms (default value 10 minutes). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
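For reference, the proposed refresh-time decommission might look roughly like the fragment below, reusing YARN's existing RMNode event types; the wiring is illustrative, not the attached patch.

{code}
import org.apache.hadoop.yarn.server.resourcemanager.NodesListManager;
import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEvent;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType;

public class RefreshTimeDecommissionSketch {
  // Walk the currently tracked nodes and push any newly excluded one straight
  // to DECOMMISSIONED, instead of waiting for its next heartbeat to reach
  // ResourceTrackerService.
  static void decommissionExcludedNodes(RMContext rmContext,
      NodesListManager nodesListManager) {
    for (RMNode node : rmContext.getRMNodes().values()) {
      if (!nodesListManager.isValidNode(node.getHostName())) {
        rmContext.getDispatcher().getEventHandler().handle(
            new RMNodeEvent(node.getNodeID(), RMNodeEventType.DECOMMISSION));
      }
    }
  }
}
{code}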
[jira] [Commented] (YARN-2566) IOException happens in startLocalizer of DefaultContainerExecutor due to not enough disk space in the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169302#comment-14169302 ] Karthik Kambatla commented on YARN-2566: Submitted patch to kick off Jenkins. IOException happens in startLocalizer of DefaultContainerExecutor due to not enough disk space in the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor only uses the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space there, localization fails even when there is plenty of disk space in the other localDirs. We see the following error in this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at 
org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE
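The fix direction implied by the description, as a sketch: try each localDir in turn instead of failing on the first. Method names are illustrative, not the actual patch.

{code}
import java.io.IOException;
import java.util.List;

// Sketch only: instead of copying the token file into the first local dir
// unconditionally, try each dir and fall back on failure.
public class TokenCopySketch {
  static String copyTokensToAnyDir(List<String> localDirs) throws IOException {
    IOException last = null;
    for (String dir : localDirs) {
      try {
        // createDir/copy would go here; a full disk throws IOException.
        attemptCopy(dir);
        return dir;                     // success: use this localDir
      } catch (IOException e) {
        last = e;                       // remember and try the next dir
      }
    }
    throw new IOException("All local dirs failed", last);
  }

  static void attemptCopy(String dir) throws IOException {
    // placeholder for the FileContext.Util#copy of the container tokens
  }
}
{code}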
[jira] [Commented] (YARN-2566) IOException happens in startLocalizer of DefaultContainerExecutor due to not enough disk space in the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169344#comment-14169344 ] Hadoop QA commented on YARN-2566: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674400/YARN-2566.008.patch against trunk revision e8a31f2. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5374//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5374//artifact/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5374//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5374//console This message is automatically generated. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. 
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
[jira] [Updated] (YARN-2667) Fix the release audit warning caused by hadoop-yarn-registry
[ https://issues.apache.org/jira/browse/YARN-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2667: - Target Version/s: 2.6.0 (was: 2.7.0) Affects Version/s: 2.6.0 Assignee: Yi Liu Summary: Fix the release audit warning caused by hadoop-yarn-registry (was: Fix the release auit warning caused by hadoop-yarn-registry) Hadoop Flags: Reviewed +1 lgtm. Committing this. Fix the release audit warning caused by hadoop-yarn-registry Key: YARN-2667 URL: https://issues.apache.org/jira/browse/YARN-2667 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Attachments: YARN-2667.001.patch ? /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/resources/.keep Lines that start with ? in the release audit report indicate files that do not have an Apache license header. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2665) Audit warning of registry project
[ https://issues.apache.org/jira/browse/YARN-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved YARN-2665. -- Resolution: Duplicate Closing this as a duplicate of YARN-2667, as that already has a patch. Audit warning of registry project - Key: YARN-2665 URL: https://issues.apache.org/jira/browse/YARN-2665 Project: Hadoop YARN Issue Type: Bug Components: site Reporter: Wangda Tan Assignee: Steve Loughran Priority: Minor I encountered one audit warning today. See: https://issues.apache.org/jira/browse/YARN-2544?focusedCommentId=14164515&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14164515 It seems to be caused by the recently committed registry project. {code} !? /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/resources/.keep Lines that start with ? in the release audit report indicate files that do not have an Apache license header. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2667) Fix the release audit warning caused by hadoop-yarn-registry
[ https://issues.apache.org/jira/browse/YARN-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169410#comment-14169410 ] Hudson commented on YARN-2667: -- FAILURE: Integrated in Hadoop-trunk-Commit #6247 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6247/]) YARN-2667. Fix the release audit warning caused by hadoop-yarn-registry. Contributed by Yi Liu (jlowe: rev 344a10ad5e26c25abd62eda65eec2820bb808a74) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/pom.xml * hadoop-yarn-project/CHANGES.txt Fix the release audit warning caused by hadoop-yarn-registry Key: YARN-2667 URL: https://issues.apache.org/jira/browse/YARN-2667 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Attachments: YARN-2667.001.patch ? /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/resources/.keep Lines that start with ? in the release audit report indicate files that do not have an Apache license header. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) IOException happens in startLocalizer of DefaultContainerExecutor due to not enough disk space in the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169474#comment-14169474 ] zhihai xu commented on YARN-2566: - 1. The two Findbugs warnings are not related to my change: Inconsistent synchronization of org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.delegationTokenSequenceNumber; locked 71% of time. Dereference of the result of readLine() without nullcheck in org.apache.hadoop.tracing.SpanReceiverHost.getUniqueLocalTraceFileName(). 2. The release audit warning is also not related to my change: /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/resources/.keep IOException happens in startLocalizer of DefaultContainerExecutor due to not enough disk space in the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor only uses the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space there, localization fails even when there is plenty of disk space in the other localDirs. We see the following error in this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at 
org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at
[jira] [Commented] (YARN-2566) IOException happens in startLocalizer of DefaultContainerExecutor due to not enough disk space in the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169500#comment-14169500 ] Karthik Kambatla commented on YARN-2566: +1 IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at 
org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with
[jira] [Commented] (YARN-2495) Allow admin to specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169528#comment-14169528 ] Allen Wittenauer commented on YARN-2495: I don't fully understand your question, but I'll admit I've been distracted with other JIRAs lately. Hopefully I'm guessing correctly though... :) The idea that there are 'valid' labels and that one turns them on/off seems to conflict with the goals that I want to see fulfilled with this feature. Let me give a more concrete example, since I know a lot of people are confused as to why labels should be dynamic anyway. Here's node A's configuration: {code} $ hadoop version Hadoop 2.6.0 Subversion http://github.com/apache/hadoop -r 1e6d81a8869ceeb2f0f81f2ee4b89833f2b22cd4 Compiled by aw on 2014-10-10T19:31Z Compiled with protoc 2.5.0 From source with checksum 201dad5b10939faa6e5841378c8c94 This command was run using /sw/hadoop/hadoop-2.6.0-SNAPSHOT/share/hadoop/common/hadoop-common-2.6.0-SNAPSHOT.jar $ java -version java version "1.6.0_32" OpenJDK Runtime Environment (IcedTea6 1.13.4) (rhel-7.1.13.4.el6_5-x86_64) OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode) {code} Here's node B's configuration: {code} $ ~/HADOOP/hadoop-3.0.0-SNAPSHOT/bin/hadoop version Hadoop 3.0.0 Source code repository https://git-wip-us.apache.org/repos/asf/hadoop.git -r 7b29f99ad23b2a87eac17fdcc7b5b29cd6c9b0c0 Compiled by aw on 2014-10-08T21:12Z Compiled with protoc 2.5.0 From source with checksum ae703b1a38a35d19f4584495dc31944 This command was run using /Users/aw/HADOOP/hadoop-3.0.0-SNAPSHOT/share/hadoop/common/hadoop-common-3.0.0-SNAPSHOT.jar $ java -version java version "1.7.0_67" Java(TM) SE Runtime Environment (build 1.7.0_67-b01) Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode) {code} (Ignore the fact that they probably can't be in the same cluster. That's not really relevant.) I want to provide a script similar to this one: {code} #!/usr/bin/env bash HADOOPVERSION=$(${HADOOP_PREFIX}/bin/hadoop version 2>/dev/null | /usr/bin/head -1 | /usr/bin/cut -f2 -d\ ) JAVAVERSION=$(${JAVA_HOME}/bin/java -version 2>&1 | /usr/bin/head -1 | /usr/bin/cut -f2 -d\") JAVAMMM=$(echo ${JAVAVERSION} | /usr/bin/cut -f1 -d_) echo LABELS=jdk${JAVAVERSION},jdk${JAVAMMM},hadoop_${HADOOPVERSION} {code} This script, when run, should tell the RM that: Node A has labels jdk1.6.0_32, jdk1.6.0, and hadoop_2.6.0. Node B has labels jdk1.7.0_67, jdk1.7.0, and hadoop_3.0.0. Users should be able to submit a job that specifies that it only runs with 'hadoop_3.0.0', or that it requires 'jdk1.7.0'. Making labels either valid or invalid based upon a central configuration defeats the purpose of having a living, breathing cluster able to adapt to change without operations intervention. If we are rolling out a new version of the JDK, I shouldn't have to tell the system that it's ok to broadcast that JDK version first. Allow admin to specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Target of this JIRA is to allow admins to specify labels in each NM; this covers - User can set labels in each NM (by setting yarn-site.xml or using the script suggested by [~aw]) - NM will send labels to RM via the ResourceTracker API - RM will set labels in NodeLabelManager when the NM registers/updates labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
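As a companion to the script above, the NM-side consumption might look like this sketch, assuming the script prints a single LABELS=a,b,c line on stdout as in the example; the parsing code itself is illustrative, not any proposed YARN API.

{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch only: extract the label set from the node-label script's output.
public class LabelScriptOutputSketch {
  static Set<String> parse(String scriptOutput) {
    for (String line : scriptOutput.split("\n")) {
      if (line.startsWith("LABELS=")) {
        return new HashSet<>(
            Arrays.asList(line.substring("LABELS=".length()).split(",")));
      }
    }
    return new HashSet<>();  // script reported no labels
  }
}
{code}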
[jira] [Commented] (YARN-2566) IOException happens in startLocalizer of DefaultContainerExecutor due to not enough disk space in the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169530#comment-14169530 ] Karthik Kambatla commented on YARN-2566: Can we file a follow-up JIRA to fix this for WindowsSecureContainerExecutor, and remove unused methods from DefaultContainerExecutor? IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at 
org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger:
[jira] [Commented] (YARN-2651) Spin off the LogRollingInterval from LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169564#comment-14169564 ] Zhijie Shen commented on YARN-2651: --- +1 for the patch. Before committing it, [~xgong], would you please summarize why this change is being made? Spin off the LogRollingInterval from LogAggregationContext -- Key: YARN-2651 URL: https://issues.apache.org/jira/browse/YARN-2651 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2651.1.1.patch, YARN-2651.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2651) Spin off the LogRollingInterval from LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2651: -- Hadoop Flags: Reviewed Spin off the LogRollingInterval from LogAggregationContext -- Key: YARN-2651 URL: https://issues.apache.org/jira/browse/YARN-2651 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2651.1.1.patch, YARN-2651.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2651) Spin off the LogRollingInterval from LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2651: Description: Remove per-app rolling interval completely and then have nodemanager wake up every so often and upload old files. Spin off the LogRollingInterval from LogAggregationContext -- Key: YARN-2651 URL: https://issues.apache.org/jira/browse/YARN-2651 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2651.1.1.patch, YARN-2651.1.patch Remove per-app rolling interval completely and then have nodemanager wake up every so often and upload old files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
zhihai xu created YARN-2682: --- Summary: WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu DefaultContainerExecutor won't use getFirstApplicationDir any more, but we can't delete getFirstApplicationDir from DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move the getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) IOException happens in startLocalizer of DefaultContainerExecutor due to not enough disk space in the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169594#comment-14169594 ] zhihai xu commented on YARN-2566: - [~Karthik Kambatla], Yes, it is a good point. I just created a follow-up JIRA YARN-2682 for WindowsSecureContainerExecutor. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at 
org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera
[jira] [Updated] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2682: Priority: Minor (was: Major) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Priority: Minor DefaultContainerExecutor no longer uses getFirstApplicationDir, but we can't delete it from DefaultContainerExecutor because WindowsSecureContainerExecutor still uses it. We should move the getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2651) Spin off the LogRollingInterval from LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2651: -- Description: Remove the per-app rolling interval completely and instead have the nodemanager wake up every so often and upload old log files. The wake-up time is a per-NM configuration and is decoupled from the actual app's log rolling interval. (was: Remove per-app rolling interval completely and then have nodemanager wake up every so often and upload old files.) Spin off the LogRollingInterval from LogAggregationContext -- Key: YARN-2651 URL: https://issues.apache.org/jira/browse/YARN-2651 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2651.1.1.patch, YARN-2651.1.patch Remove the per-app rolling interval completely and instead have the nodemanager wake up every so often and upload old log files. The wake-up time is a per-NM configuration and is decoupled from the actual app's log rolling interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
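As a rough illustration of the decoupling described above (one NM-wide wake-up that uploads rolled log files, independent of any per-app interval), here is a minimal sketch. The property name and default are assumptions for illustration, not necessarily what the patch adds:
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;

public class RollingLogUploadSketch {
  public static void schedule(Configuration conf, final Runnable uploadOldLogs) {
    // Assumed property name; the real key would come from yarn-default.xml.
    long intervalSecs = conf.getLong(
        "yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds", -1);
    if (intervalSecs <= 0) {
      return; // rolling upload disabled: logs go up only at app completion
    }
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    // One NM-wide wake-up, decoupled from any per-app interval.
    timer.scheduleAtFixedRate(uploadOldLogs, intervalSecs, intervalSecs,
        TimeUnit.SECONDS);
  }
}
{code}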
[jira] [Commented] (YARN-2651) Spin off the LogRollingInterval from LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169656#comment-14169656 ] Hudson commented on YARN-2651: -- FAILURE: Integrated in Hadoop-trunk-Commit #6251 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6251/]) YARN-2651. Spun off LogRollingInterval from LogAggregationContext. Contributed by Xuan Gong. (zjshen: rev 4aed2d8e91c7dccc78fbaffc409d3076c3316289) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/LogAggregationContextPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java Spin off the LogRollingInterval from LogAggregationContext -- Key: YARN-2651 URL: https://issues.apache.org/jira/browse/YARN-2651 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2651.1.1.patch, YARN-2651.1.patch Remove per-app rolling interval completely and then have nodemanager wake up every so often and upload old log files. The wake up time is per-NM configuration, and is decoupled with the actual app's log rolling interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2679) add container launch prepare time metrics to NM.
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2679: Attachment: (was: YARN-2679.000.patch) add container launch prepare time metrics to NM. Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu add metrics in NodeManagerMetrics to get prepare time to launch container. The prepare time is the duration between sending ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2679) add container launch prepare time metrics to NM.
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2679: Attachment: YARN-2679.000.patch add container launch prepare time metrics to NM. Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2679.000.patch add metrics in NodeManagerMetrics to get prepare time to launch container. The prepare time is the duration between sending ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
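For illustration, a prepare-time metric like the one described can be recorded with a {{MutableRate}} in the Hadoop metrics2 framework. The metric and method names below are assumptions, not necessarily what YARN-2679.000.patch uses:
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableRate;

@Metrics(about = "NodeManager metrics sketch", context = "yarn")
public class NodeManagerMetricsSketch {
  @Metric("Container launch prepare time, ms")
  MutableRate containerLaunchPrepareTime;

  // launchEnqueuedMs is captured when LAUNCH_CONTAINER is sent; call this
  // when CONTAINER_LAUNCHED is received to record the elapsed duration.
  public void addContainerLaunchPrepareDuration(long launchEnqueuedMs) {
    containerLaunchPrepareTime.add(System.currentTimeMillis() - launchEnqueuedMs);
  }
}
{code}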
[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169700#comment-14169700 ] Jason Lowe commented on YARN-2377: -- +1, latest patch lgtm. The audit failure is unrelated, see YARN-2667. Committing this. Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-2377.v01.patch, YARN-2377.v02.patch In the Localizer log one can only see this kind of message: {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0 {code} And then only the {{java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
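The gist of the improvement is to propagate the serialized stack trace rather than only the exception message. A minimal, self-contained sketch of that idea (not the committed patch):
{code}
import java.io.PrintWriter;
import java.io.StringWriter;

public final class Diagnostics {
  private Diagnostics() {}

  // Render the whole exception, stack trace and causes included, so it
  // can travel inside the diagnostics string instead of just getMessage().
  public static String toDiagnosticString(Throwable t) {
    StringWriter sw = new StringWriter();
    t.printStackTrace(new PrintWriter(sw));
    return sw.toString();
  }
}
{code}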
[jira] [Updated] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2308: -- Attachment: YARN-2308.1.patch Add explicit logging before the exception is thrown, in case the exception message is somehow swallowed. NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered an NPE when the RM restarted: {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And the RM then fails to restart. This is caused by a queue configuration change: I removed some queues and added new ones. So when the RM restarts, it tries to recover historical applications, and when the queue of any of these applications has been removed, an NPE is raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
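The pattern described in the comment above (log explicitly, then throw a descriptive exception instead of letting a missing queue surface as an NPE) might look roughly like this sketch; the names are hypothetical, not taken from the attached patch:
{code}
import java.util.Map;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class QueueLookupSketch {
  private static final Log LOG = LogFactory.getLog(QueueLookupSketch.class);

  static <Q> Q getQueueOrFail(Map<String, Q> queues, String name) {
    Q queue = queues.get(name);
    if (queue == null) {
      // Explicit log in case the exception message is swallowed upstream.
      LOG.error("Queue " + name + " not found; it may have been removed"
          + " from the scheduler configuration before RM restart");
      throw new IllegalStateException("Queue " + name + " does not exist");
    }
    return queue;
  }
}
{code}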
[jira] [Created] (YARN-2683) document registry config options
Steve Loughran created YARN-2683: Summary: document registry config options Key: YARN-2683 URL: https://issues.apache.org/jira/browse/YARN-2683 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Add to {{yarn-site}} a page on registry configuration parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169732#comment-14169732 ] Hudson commented on YARN-2377: -- FAILURE: Integrated in Hadoop-trunk-Commit #6252 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6252/]) YARN-2377. Localization exception stack traces are not passed as diagnostic info. Contributed by Gera Shegalov (jlowe: rev a56ea0100215ecf2e1471a18812b668658197239) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/SerializedException.java * hadoop-yarn-project/CHANGES.txt Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Fix For: 2.6.0 Attachments: YARN-2377.v01.patch, YARN-2377.v02.patch In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos tException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2679) add container launch prepare time metrics to NM.
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169771#comment-14169771 ] Hadoop QA commented on YARN-2679: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674553/YARN-2679.000.patch against trunk revision 1770bb9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5375//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5375//console This message is automatically generated. add container launch prepare time metrics to NM. Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2679.000.patch add metrics in NodeManagerMetrics to get prepare time to launch container. The prepare time is the duration between sending ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169788#comment-14169788 ] Hadoop QA commented on YARN-2582: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674019/YARN-2582.2.patch against trunk revision a56ea01. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5377//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5377//console This message is automatically generated. Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2582.1.patch, YARN-2582.2.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169826#comment-14169826 ] Hadoop QA commented on YARN-2308: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674554/YARN-2308.1.patch against trunk revision 1770bb9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5376//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5376//console This message is automatically generated. NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at 
java.lang.Thread.run(Thread.java:744) {code} And the RM then fails to restart. This is caused by a queue configuration change: I removed some queues and added new ones. So when the RM restarts, it tries to recover historical applications, and when the queue of any of these applications has been removed, an NPE is raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169843#comment-14169843 ] Craig Welch commented on YARN-2308: --- The unit test failure does not appear to be related to the changes, I was able to run org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart successfully on my box with the changes. (And, I have noticed that this test seems to fail sporadically from time to time, having nothing to do with the change at hand). [~jianhe] can you take a moment to review the patch? NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-570) Time strings are formatted in different timezones
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169849#comment-14169849 ] Ray Chiang commented on YARN-570: - I've been playing around with this patch some. I'm fine with it as is, but if there are reservations about it in its current state, is it worth combining this patch with a change in Times.java to end up with a uniform time/date format, e.g. {{EEE MMM dd HH:mm:ss Z}} (sample output: Mon Oct 13 12:54:20 -0700 2014)? Thoughts? Time strings are formatted in different timezones --- Key: YARN-570 URL: https://issues.apache.org/jira/browse/YARN-570 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.2.0 Reporter: Peng Zhang Assignee: Akira AJISAKA Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch, YARN-570.3.patch Time strings on different pages are displayed in different timezones. If a time is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as Wed, 10 Apr 2013 08:29:56 GMT. If it is formatted by format() in yarn.util.Times, it appears as 10-Apr-2013 16:29:56. Same value, but different timezones. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
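To see the mismatch concretely: the same epoch value formatted in GMT versus the JVM's default timezone yields the two strings quoted in the description. A small standalone demo (the patterns are illustrative, not copied from the YARN sources):
{code}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimezoneDemo {
  public static void main(String[] args) {
    long ts = 1365582596000L; // Wed, 10 Apr 2013 08:29:56 GMT
    SimpleDateFormat gmt = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss 'GMT'");
    gmt.setTimeZone(TimeZone.getTimeZone("GMT"));
    SimpleDateFormat local = new SimpleDateFormat("dd-MMM-yyyy HH:mm:ss");
    // local uses the JVM default timezone, so in a UTC+8 JVM the two
    // outputs disagree exactly as the issue describes.
    System.out.println(gmt.format(new Date(ts)));   // Wed, 10 Apr 2013 08:29:56 GMT
    System.out.println(local.format(new Date(ts))); // e.g. 10-Apr-2013 16:29:56
  }
}
{code}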
[jira] [Commented] (YARN-570) Time strings are formatted in different timezones
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169854#comment-14169854 ] Hadoop QA commented on YARN-570: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653110/YARN-570.3.patch against trunk revision a56ea01. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5378//console This message is automatically generated. Time strings are formatted in different timezones --- Key: YARN-570 URL: https://issues.apache.org/jira/browse/YARN-570 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.2.0 Reporter: Peng Zhang Assignee: Akira AJISAKA Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch, YARN-570.3.patch Time strings on different pages are displayed in different timezones. If a time is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as Wed, 10 Apr 2013 08:29:56 GMT. If it is formatted by format() in yarn.util.Times, it appears as 10-Apr-2013 16:29:56. Same value, but different timezones. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169857#comment-14169857 ] Varun Vasudev commented on YARN-2656: - The latest patch looks good to me. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
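For context, doAs lets an authenticated caller ask the server to execute a request on behalf of another user, subject to the hadoop.proxyuser.* rules. A minimal client-side sketch, assuming the filter wires up Hadoop's standard delegation-token authentication (the host, path, and user name are illustrative, and the three-argument openConnection overload is my reading of the API):
{code}
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL;

public class ProxyUserCall {
  public static void main(String[] args) throws Exception {
    DelegationTokenAuthenticatedURL.Token token =
        new DelegationTokenAuthenticatedURL.Token();
    DelegationTokenAuthenticatedURL authUrl = new DelegationTokenAuthenticatedURL();
    // The third argument is the proxied user: the filter authenticates the
    // real caller, then checks whether it may act as "alice".
    HttpURLConnection conn = authUrl.openConnection(
        new URL("http://rm-host:8088/ws/v1/cluster/apps"), token, "alice");
    System.out.println(conn.getResponseCode());
  }
}
{code}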
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169892#comment-14169892 ] Jian He commented on YARN-2308: --- patch looks good to me NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2683) document registry config options
[ https://issues.apache.org/jira/browse/YARN-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2683: - Attachment: YARN-2683-001.patch document registry config options Key: YARN-2683 URL: https://issues.apache.org/jira/browse/YARN-2683 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2683-001.patch Original Estimate: 1h Remaining Estimate: 1h Add to {{yarn-site}} a page on registry configuration parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169899#comment-14169899 ] Jian He commented on YARN-2308: --- [~lichangleo] , thanks for your previous work ! and thanks [~cwelch] ! NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (YARN-2483) TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry fails due to incorrect AppAttempt state
[ https://issues.apache.org/jira/browse/YARN-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reopened YARN-2483: --- Reopening this, as I see the test has started failing again recently. TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry fails due to incorrect AppAttempt state Key: YARN-2483 URL: https://issues.apache.org/jira/browse/YARN-2483 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Fix For: 2.6.0 From https://builds.apache.org/job/Hadoop-Yarn-trunk/665/console : {code} testShouldNotCountFailureToMaxAttemptRetry(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart) Time elapsed: 49.686 sec FAILURE! java.lang.AssertionError: AppAttempt state is not correct (timedout) expected:<ALLOCATED> but was:<SCHEDULED> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:84) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:417) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAM(MockRM.java:582) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAndRegisterAM(MockRM.java:589) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForNewAMToLaunchAndRegister(MockRM.java:182) at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry(TestAMRestart.java:402) {code} TestApplicationMasterLauncher#testallocateBeforeAMRegistration fails with a similar cause. These tests failed in build #664 as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Umbrella: Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169908#comment-14169908 ] Steve Loughran commented on YARN-913: - the registry uses curator; we need one that is compatible with Hadoop's own dependencies, hence HADOOP-11102 Umbrella: Add a way to register long-lived services in a YARN cluster - Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, YARN-913-013.patch, YARN-913-014.patch, YARN-913-015.patch, YARN-913-016.patch, YARN-913-017.patch, YARN-913-018.patch, YARN-913-019.patch, YARN-913-020.patch, YARN-913-021.patch, yarnregistry.pdf, yarnregistry.pdf, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169959#comment-14169959 ] Jian He commented on YARN-2308: --- just realized that we probably should do the same for FairScheduler too, [~kasha], do you think so ? NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2684) FairScheduler should tolerate queue configuration changes across RM restarts
Karthik Kambatla created YARN-2684: -- Summary: FairScheduler should tolerate queue configuration changes across RM restarts Key: YARN-2684 URL: https://issues.apache.org/jira/browse/YARN-2684 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical YARN-2308 fixes this issue for CS; this JIRA is to fix it for FS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169969#comment-14169969 ] Karthik Kambatla commented on YARN-2308: Thanks for the nudge, Jian. Filed YARN-2684 to unblock this JIRA. NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2495) Allow admins to specify labels on each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169984#comment-14169984 ] Wangda Tan commented on YARN-2495: -- Hi [~Naganarasimha], The first thing to take care of is that distributed configuration should be a way to input and store the node labels configuration. In centralized configuration, users can input labels and node-to-labels mappings via the REST API and the RM admin CLI, and they are persisted to some file system like HDFS/ZK. In distributed configuration, node labels are input by the NMs (either by setting yarn-site.xml or via some script) and reported to the RM at NM registration. We may not need to persist any of them, but the RM should know these labels exist in order to schedule against them. Another question is whether we need to check labels when NMs register; I prefer to pre-set them because this affects scheduling behavior. For example, maximum-resource and minimum-resource are set up on the RM side, and RackResolver also runs on the RM side. At the least, label checking should be kept configurable in distributed mode; ignoring all the labels for a node if any invalid label exists might be a good policy when checking is enabled. For implementation, I suggest you take a look at YARN-2494 and YARN-2496/YARN-2500: the distributed configuration should be an extension of NodeLabelsManager, and YARN-2496/YARN-2500 shouldn't need much modification to support the distributed mode in scheduling. Thanks, Wangda Allow admins to specify labels on each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R The target of this JIRA is to allow admins to specify labels on each NM; this covers: - Users can set labels on each NM (by setting yarn-site.xml or using a script suggested by [~aw]) - The NM will send labels to the RM via the ResourceTracker API - The RM will set labels in NodeLabelManager when the NM registers/updates labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2685) Resource on each label not correct when multiple NMs run on the same host and some have labels and some do not
Wangda Tan created YARN-2685: Summary: Resource on each label not correct when multiple NMs run on the same host and some have labels and some do not Key: YARN-2685 URL: https://issues.apache.org/jira/browse/YARN-2685 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan I noticed an issue: when we have multiple NMs running on the same host (say NM1-4 running on host1) and we specify that some of them have a label and some do not, the total resource on the label is not correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2641) improve node decommission latency in RM.
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2641: Attachment: YARN-2641.003.patch improve node decommission latency in RM. Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch improve node decommission latency in RM. Currently node decommission only happens after the RM receives a nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable; the default value is 1 second. It would be better to do the decommission during the RM refresh (NodesListManager) instead of on nodeHeartbeat (ResourceTrackerService). This can become a much more serious issue: after the RM is refreshed (refreshNodes), if the NM to be decommissioned is killed before it sends a heartbeat to the RM, the RMNode will never be decommissioned in the RM. The RMNode will only expire in the RM after yarn.nm.liveness-monitor.expiry-interval-ms (default value 10 minutes). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2685) Resource on each label not correct when multiple NMs run on the same host and some have labels and some do not
[ https://issues.apache.org/jira/browse/YARN-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2685: - Attachment: YARN-2685-20141013.1.patch Attached a fix for this, and also added a new test to verify that when multiple NMs are on the same host and some have labels and some do not, the total resource on each label is correct. Resource on each label not correct when multiple NMs run on the same host and some have labels and some do not --- Key: YARN-2685 URL: https://issues.apache.org/jira/browse/YARN-2685 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2685-20141013.1.patch I noticed an issue: when we have multiple NMs running on the same host (say NM1-4 running on host1) and we specify that some of them have a label and some do not, the total resource on the label is not correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
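A toy illustration of the accounting the new test verifies: with several NMs on one host where only some carry a label, per-label totals must be summed per NM, not per host. The numbers below are made up:
{code}
import java.util.HashMap;
import java.util.Map;

public class LabelResourceToy {
  public static void main(String[] args) {
    // Four NMs on host1; two labeled "x", two unlabeled; 4096 MB each.
    String[] labels = {"x", "x", "<no label>", "<no label>"};
    int memPerNmMB = 4096;
    Map<String, Integer> totalByLabel = new HashMap<String, Integer>();
    for (String label : labels) {
      Integer cur = totalByLabel.get(label);
      totalByLabel.put(label, (cur == null ? 0 : cur) + memPerNmMB);
    }
    // Correct result: {x=8192, <no label>=8192}. Keying by host instead
    // of by NM is what produces the wrong per-label totals described here.
    System.out.println(totalByLabel);
  }
}
{code}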
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170004#comment-14170004 ] Hudson commented on YARN-2308: -- FAILURE: Integrated in Hadoop-trunk-Commit #6253 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6253/]) YARN-2308. Changed CapacityScheduler to explicitly throw exception if the queue (jianhe: rev f9680d9a160ee527c8f2c1494584abf1a1f70f82) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java Missing Changes.txt for YARN-2308 (jianhe: rev 178bc505da5d06d591a19aac13c040c6a9cf28ad) * hadoop-yarn-project/CHANGES.txt NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170003#comment-14170003 ] Jian He commented on YARN-2308: --- thanks Karthik. committing this. NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2641) improve node decommission latency in RM.
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170011#comment-14170011 ] zhihai xu commented on YARN-2641: - Hi [~kasha], Yes, they are inherently racy. The user should know whether the node is already registered before decommissioning it. I attached a new patch, YARN-2641.003.patch, which removes the lock in ResourceTrackerService#registerNodeManager. improve node decommission latency in RM. Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch improve node decommission latency in RM. Currently node decommission only happens after the RM receives a nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable; the default value is 1 second. It would be better to do the decommission during the RM refresh (NodesListManager) instead of on nodeHeartbeat (ResourceTrackerService). This can become a much more serious issue: after the RM is refreshed (refreshNodes), if the NM to be decommissioned is killed before it sends a heartbeat to the RM, the RMNode will never be decommissioned in the RM. The RMNode will only expire in the RM after yarn.nm.liveness-monitor.expiry-interval-ms (default value 10 minutes). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
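A minimal sketch of the approach in the description: fire DECOMMISSION from the refresh path instead of waiting for each node's next heartbeat. The isExcluded() helper is hypothetical, and this is not necessarily what the attached patches do:
{code}
import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEvent;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType;

public class DecommissionOnRefreshSketch {
  // Called from the -refreshNodes path after the exclude list is reloaded.
  static void decommissionExcludedNodes(RMContext rmContext) {
    for (RMNode node : rmContext.getRMNodes().values()) {
      if (isExcluded(node.getHostName())) {
        rmContext.getDispatcher().getEventHandler().handle(
            new RMNodeEvent(node.getNodeID(), RMNodeEventType.DECOMMISSION));
      }
    }
  }

  private static boolean isExcluded(String host) {
    return false; // hypothetical: consult the refreshed exclude list here
  }
}
{code}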
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170026#comment-14170026 ] Wangda Tan commented on YARN-2314: -- Hi [~jlowe], I have been looking at this issue recently. I think it will not be easy to change the underlying RPC layer. One question: regarding the patch you mentioned in https://issues.apache.org/jira/browse/YARN-2314?focusedCommentId=14131901&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14131901, do you have any data on whether there is any performance degradation from this patch in your environment? I'm wondering if we can just disable the cache and set the timeout to 0 as the standard behavior. Thanks, Wangda ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However, the cache can grow far beyond the configured size when running on a large cluster and blow past AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2641) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2641: --- Summary: Decommission nodes on -refreshNodes instead of next NM-RM heartbeat (was: improve node decommission latency in RM.) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat --- Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch improve node decommission latency in RM. Currently the node decommission only happened after RM received nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable. The default value is 1 second. It will be better to do the decommission during RM Refresh(NodesListManager) instead of nodeHeartbeat(ResourceTrackerService). This will be a much more serious issue: After RM is refreshed (refreshNodes), If the NM to be decommissioned is killed before NM sent heartbeat to RM. The RMNode will never be decommissioned in RM. The RMNode will only expire in RM after yarn.nm.liveness-monitor.expiry-interval-ms(default value 10 minutes) time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2641) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170030#comment-14170030 ] Karthik Kambatla commented on YARN-2641: +1, pending Jenkins. Decommission nodes on -refreshNodes instead of next NM-RM heartbeat --- Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch improve node decommission latency in RM. Currently the node decommission only happened after RM received nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable. The default value is 1 second. It will be better to do the decommission during RM Refresh(NodesListManager) instead of nodeHeartbeat(ResourceTrackerService). This will be a much more serious issue: After RM is refreshed (refreshNodes), If the NM to be decommissioned is killed before NM sent heartbeat to RM. The RMNode will never be decommissioned in RM. The RMNode will only expire in RM after yarn.nm.liveness-monitor.expiry-interval-ms(default value 10 minutes) time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170045#comment-14170045 ] Sangjin Lee commented on YARN-2314: --- We have been running with the proposed patch (disabling the cache and setting the timeout to 0) with good results. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170054#comment-14170054 ] Wangda Tan commented on YARN-2314: -- [~sjlee0], Thanks for your reply, it's very helpful to have such experimental result :) Wangda ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2571) RM to support YARN registry
[ https://issues.apache.org/jira/browse/YARN-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2571: - Attachment: YARN-2571-003.patch patch -003 (against branch-2). Exceptions thrown in the async future actions are logged at WARN level before being rethrown. As the RM doesn't issue {{future.get()}} calls, those exceptions do not propagate to the RM (i.e. registry problems do not stop the RM working; they only affect registry setup/teardown). The logging ensures these problems remain visible. RM to support YARN registry Key: YARN-2571 URL: https://issues.apache.org/jira/browse/YARN-2571 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2571-001.patch, YARN-2571-002.patch, YARN-2571-003.patch The RM needs to (optionally) integrate with the YARN registry: # startup: create the /services and /users paths with system ACLs (yarn, hdfs principals) # app-launch: create the user directory /users/$username with the relevant permissions (CRD) for them to create subnodes. # attempt, container, app completion: remove service records with the matching persistence and ID -- This message was sent by Atlassian JIRA (v6.3.4#6332)
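The log-then-rethrow behaviour described in the comment above can be illustrated with a small standalone sketch; the class below is an assumed example for demonstration and is not the code in the YARN-2571 patch.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch of the pattern described above: registry operations run asynchronously,
// failures are logged at WARN and rethrown into the Future, and because nothing
// calls future.get(), a registry failure never takes the RM down with it.
public class AsyncRegistryActionsSketch {
  private static final Logger LOG = Logger.getLogger("registry");
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  Future<?> submit(String opName, Runnable registryOp) {
    return executor.submit(() -> {
      try {
        registryOp.run();
      } catch (RuntimeException e) {
        LOG.log(Level.WARNING, "Registry operation failed: " + opName, e);
        throw e; // visible to anyone who does call get(); invisible otherwise
      }
    });
  }

  public static void main(String[] args) {
    AsyncRegistryActionsSketch actions = new AsyncRegistryActionsSketch();
    actions.submit("create /users/alice", () -> { throw new RuntimeException("ZK unreachable"); });
    actions.executor.shutdown(); // the "RM" keeps running; only the registry op failed
  }
}
{code}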
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170063#comment-14170063 ] Craig Welch commented on YARN-1680: --- Hi [~airbots], did the newly-discovered concern with cluster-app blacklisting make sense? Any luck with a patch to address it? availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Chen He Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each.Total cluster capacity is 32GB.Cluster slow start is set to 1. Job is running reducer task occupied 29GB of cluster.One NodeManager(NM-4) is become unstable(3 Map got killed), MRAppMaster blacklisted unstable NodeManager(NM-4). All reducer task are running in cluster now. MRAppMaster does not preempt the reducers because for Reducer preemption calculation, headRoom is considering blacklisted nodes memory. This makes jobs to hang forever(ResourceManager does not assing any new containers on blacklisted nodes but returns availableResouce considers cluster free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170082#comment-14170082 ] Jason Lowe commented on YARN-2314: -- The patch effectively restores 0.23 behavior in this area, as 0.23 also had the client close the connection after each RPC call and did not attempt to cache proxies. We were happy with the performance from 0.23, and we did not measure any significant slowdown in this area for 2.x once this patch was applied. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170083#comment-14170083 ] Chen He commented on YARN-1680: --- Agreed with you and [~jlowe]. I will probably post the patch next week. Thank you for reminding me. availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Chen He Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each.Total cluster capacity is 32GB.Cluster slow start is set to 1. Job is running reducer task occupied 29GB of cluster.One NodeManager(NM-4) is become unstable(3 Map got killed), MRAppMaster blacklisted unstable NodeManager(NM-4). All reducer task are running in cluster now. MRAppMaster does not preempt the reducers because for Reducer preemption calculation, headRoom is considering blacklisted nodes memory. This makes jobs to hang forever(ResourceManager does not assing any new containers on blacklisted nodes but returns availableResouce considers cluster free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
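For readers new to this issue, the headroom concern in the description can be shown with a toy calculation; the helper below is an illustration only (assumed names, single memory dimension), not the YARN-1680 patch.
{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy illustration of the issue: headroom reported to the AM should not count
// free capacity on nodes the AM has blacklisted, otherwise the AM may wait
// forever for space it can never actually use.
public class HeadroomSketch {
  static long headroomExcludingBlacklist(Map<String, Long> freeMemPerNodeMB,
                                         Set<String> blacklistedNodes,
                                         long clusterHeadroomMB) {
    long unusableFreeMB = 0;
    for (Map.Entry<String, Long> node : freeMemPerNodeMB.entrySet()) {
      if (blacklistedNodes.contains(node.getKey())) {
        unusableFreeMB += node.getValue();
      }
    }
    return Math.max(0, clusterHeadroomMB - unusableFreeMB);
  }

  public static void main(String[] args) {
    Map<String, Long> free = new HashMap<>();
    free.put("NM-1", 0L);
    free.put("NM-2", 0L);
    free.put("NM-3", 0L);
    free.put("NM-4", 3072L); // the only free memory is on the blacklisted node
    System.out.println(headroomExcludingBlacklist(free, Collections.singleton("NM-4"), 3072L)); // 0
  }
}
{code}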
[jira] [Updated] (YARN-2502) Changes in distributed shell to support specify labels
[ https://issues.apache.org/jira/browse/YARN-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2502: - Attachment: YARN-2502-20141013.1.patch Attached a new patch that addresses the latest comments by [~vinodkv]. Changes in distributed shell to support specify labels -- Key: YARN-2502 URL: https://issues.apache.org/jira/browse/YARN-2502 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2502-20141009.1.patch, YARN-2502-20141009.2.patch, YARN-2502-20141013.1.patch, YARN-2502.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170088#comment-14170088 ] Wangda Tan commented on YARN-2314: -- Thanks [~jlowe]! ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2641) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170126#comment-14170126 ] Hadoop QA commented on YARN-2641: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674601/YARN-2641.003.patch against trunk revision 178bc50. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5381//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5381//console This message is automatically generated. Decommission nodes on -refreshNodes instead of next NM-RM heartbeat --- Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch improve node decommission latency in RM. Currently the node decommission only happened after RM received nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable. The default value is 1 second. It will be better to do the decommission during RM Refresh(NodesListManager) instead of nodeHeartbeat(ResourceTrackerService). This will be a much more serious issue: After RM is refreshed (refreshNodes), If the NM to be decommissioned is killed before NM sent heartbeat to RM. The RMNode will never be decommissioned in RM. The RMNode will only expire in RM after yarn.nm.liveness-monitor.expiry-interval-ms(default value 10 minutes) time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2641) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170127#comment-14170127 ] Jian He commented on YARN-2641: --- looks good to me too, thanks Zhihai ! Decommission nodes on -refreshNodes instead of next NM-RM heartbeat --- Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch improve node decommission latency in RM. Currently the node decommission only happened after RM received nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable. The default value is 1 second. It will be better to do the decommission during RM Refresh(NodesListManager) instead of nodeHeartbeat(ResourceTrackerService). This will be a much more serious issue: After RM is refreshed (refreshNodes), If the NM to be decommissioned is killed before NM sent heartbeat to RM. The RMNode will never be decommissioned in RM. The RMNode will only expire in RM after yarn.nm.liveness-monitor.expiry-interval-ms(default value 10 minutes) time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2685) Resource on each label not correct when multiple NMs in a same host and some has label some not
[ https://issues.apache.org/jira/browse/YARN-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170131#comment-14170131 ] Hadoop QA commented on YARN-2685: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674602/YARN-2685-20141013.1.patch against trunk revision 178bc50. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5380//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5380//console This message is automatically generated. Resource on each label not correct when multiple NMs in a same host and some has label some not --- Key: YARN-2685 URL: https://issues.apache.org/jira/browse/YARN-2685 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2685-20141013.1.patch I noticed there's one issue, when we have multiple NMs running in a same host, (say NM1-4 running in host1). And we specify some of them has label and some not, the total resource on label is not correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2566) DefaultContainerExecutor should pick a working directory randomly
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2566: --- Summary: DefaultContainerExecutor should pick a working directory randomly (was: IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.) DefaultContainerExecutor should pick a working directory randomly - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at 
org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl
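The title change reflects the fix direction: instead of always copying into the first local dir, pick the working directory across all configured local dirs. The sketch below illustrates one such selection strategy (random starting offset, first dir with enough free space); it is an assumption-laden simplification, not the committed DefaultContainerExecutor change.
{code}
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.ToLongFunction;

// Simplified sketch of picking a working directory across local dirs instead of
// always using the first one, so a single full disk does not fail every
// localization. Names and the exact selection policy are illustrative.
public class WorkingDirPicker {
  private final Random random = new Random();

  String pickWorkingDir(List<String> localDirs, long bytesNeeded, ToLongFunction<String> freeSpace) {
    int start = random.nextInt(localDirs.size()); // random starting offset spreads load
    for (int i = 0; i < localDirs.size(); i++) {
      String dir = localDirs.get((start + i) % localDirs.size());
      if (freeSpace.applyAsLong(dir) >= bytesNeeded) {
        return dir;
      }
    }
    throw new IllegalStateException("No local dir has " + bytesNeeded + " bytes free");
  }

  public static void main(String[] args) {
    WorkingDirPicker picker = new WorkingDirPicker();
    // Pretend only /hadoop/d2 has space left; d1 is full.
    ToLongFunction<String> fakeFree = dir -> dir.endsWith("d2") ? (1L << 30) : 0L;
    System.out.println(picker.pickWorkingDir(Arrays.asList("/hadoop/d1", "/hadoop/d2"), 1024, fakeFree));
  }
}
{code}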
[jira] [Updated] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2056: - Attachment: YARN-2056.201410132225.txt [~leftnoteasy], Thanks for all of your help. After looking through your suggested algorithm and thinking about it some more, I think it is important to have a model where the most underserved queues are given first chance at the unassigned resources. I think the algorithm should build up each queue, reassessing its needs on every pass. Based on this, I have rewritten the patch with the following algorithm. Please let me know what you think. {code}
- Prior to assigning the unused resources, process each queue as follows:
- If current guaranteed, idealAssigned = guaranteed + untouchable extra Else idealAssigned = current;
- Subtract idealAssigned resources from unassigned.
- If the queue has all of its needs met (that is, if idealAssigned = current + pending), remove the queue from consideration.
- Sort queues from most under-guaranteed to most over-guaranteed. Call this list orderedByNeed
- While there are unsatisfied queues and some unassigned resources exist
- calculate normalized guaranteed (as today) for all remaining queues at this hierarchical level
- Pull off the underserved queue(s) from orderedByNeed
- For each underserved queue (or set of queues if multiple are equally underserved), offer its share of the unassigned resources based on its normalized guarantee.
- After the offer, if the queue is not satisfied, place it back in the ordered list of queues (orderedByNeed), recalculating its place in the order of most under-guaranteed to most over-guaranteed. In this way, the most underserved queue(s) are always handled first.
{code} Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
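A rough, single-resource rendering of the "orderedByNeed" loop described above may help; the classes and the exact share formula below are assumptions for illustration and do not come from the attached YARN-2056 patch.
{code}
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Single-resource illustration: repeatedly offer the most under-guaranteed
// queue its normalized share of the remaining unassigned resources, and
// re-insert it (re-sorted by its new need) if it is still unsatisfied.
public class OrderedByNeedSketch {
  static class Q {
    final String name;
    final double guaranteed;
    final double wanted;      // current usage plus pending demand
    double idealAssigned;
    Q(String name, double guaranteed, double current, double pending) {
      this.name = name;
      this.guaranteed = guaranteed;
      this.wanted = current + pending;
      this.idealAssigned = current;
    }
    boolean satisfied() { return idealAssigned >= wanted - 1e-9; }
    double need() { return guaranteed - idealAssigned; } // larger = more underserved
  }

  static void assign(List<Q> queues, double unassigned) {
    double totalGuarantee = queues.stream().mapToDouble(q -> q.guaranteed).sum();
    PriorityQueue<Q> orderedByNeed =
        new PriorityQueue<>(Comparator.comparingDouble(Q::need).reversed());
    orderedByNeed.addAll(queues);
    while (!orderedByNeed.isEmpty() && unassigned > 1e-9) {
      Q q = orderedByNeed.poll();
      if (q.satisfied()) continue;                        // nothing more to give it
      double offer = unassigned * (q.guaranteed / totalGuarantee);
      double accepted = Math.min(offer, q.wanted - q.idealAssigned);
      q.idealAssigned += accepted;
      unassigned -= accepted;
      if (accepted > 1e-9 && !q.satisfied()) {
        orderedByNeed.add(q);                             // re-sorted by its new need
      }
    }
  }

  public static void main(String[] args) {
    List<Q> queues = Arrays.asList(new Q("a", 60, 10, 100), new Q("b", 40, 50, 10));
    assign(queues, 40);
    queues.forEach(q -> System.out.println(q.name + " idealAssigned=" + q.idealAssigned));
  }
}
{code}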
[jira] [Commented] (YARN-2641) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170149#comment-14170149 ] zhihai xu commented on YARN-2641: - I didn't see the failure(TestAMRestart) in my local build based on latest code base: --- T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 119.464 sec - in org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Results : Tests run: 6, Failures: 0, Errors: 0, Skipped: 0 Decommission nodes on -refreshNodes instead of next NM-RM heartbeat --- Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch improve node decommission latency in RM. Currently the node decommission only happened after RM received nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable. The default value is 1 second. It will be better to do the decommission during RM Refresh(NodesListManager) instead of nodeHeartbeat(ResourceTrackerService). This will be a much more serious issue: After RM is refreshed (refreshNodes), If the NM to be decommissioned is killed before NM sent heartbeat to RM. The RMNode will never be decommissioned in RM. The RMNode will only expire in RM after yarn.nm.liveness-monitor.expiry-interval-ms(default value 10 minutes) time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170163#comment-14170163 ] Hadoop QA commented on YARN-2056: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674625/YARN-2056.201410132225.txt against trunk revision 178bc50. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5383//console This message is automatically generated. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2571) RM to support YARN registry
[ https://issues.apache.org/jira/browse/YARN-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170179#comment-14170179 ] Hadoop QA commented on YARN-2571: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674613/YARN-2571-003.patch against trunk revision 178bc50. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 15 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5382//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5382//console This message is automatically generated. RM to support YARN registry Key: YARN-2571 URL: https://issues.apache.org/jira/browse/YARN-2571 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2571-001.patch, YARN-2571-002.patch, YARN-2571-003.patch The RM needs to (optionally) integrate with the YARN registry: # startup: create the /services and /users paths with system ACLs (yarn, hdfs principals) # app-launch: create the user directory /users/$username with the relevant permissions (CRD) for them to create subnodes. # attempt, container, app completion: remove service records with the matching persistence and ID -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2641) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170199#comment-14170199 ] Karthik Kambatla commented on YARN-2641: Committing this. Decommission nodes on -refreshNodes instead of next NM-RM heartbeat --- Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch improve node decommission latency in RM. Currently the node decommission only happened after RM received nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable. The default value is 1 second. It will be better to do the decommission during RM Refresh(NodesListManager) instead of nodeHeartbeat(ResourceTrackerService). This will be a much more serious issue: After RM is refreshed (refreshNodes), If the NM to be decommissioned is killed before NM sent heartbeat to RM. The RMNode will never be decommissioned in RM. The RMNode will only expire in RM after yarn.nm.liveness-monitor.expiry-interval-ms(default value 10 minutes) time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170224#comment-14170224 ] Wangda Tan commented on YARN-2314: -- Discussed with [~vinodkv] offline; a summary of what we discussed/confirmed: 1) There's no problem when multiple proxies to the same NM exist in an AM (tokens are simply added to the Set<Token> of the UserGroupInformation's subject). 2) There's no thread leak on the MR-AM side (MRAppMaster has {{yarn.app.mapreduce.am.containerlauncher.thread-count-limit}} to limit the maximum number of threads in ContainerLauncher). 3) Basically the cache has no functionality other than caching the connection. So it should be safe to remove the cache, but it's better to add an option for the user to enable/disable the cache. [~jlowe], could you please add an option to enable/disable the cache in your existing patch, or assign it to me if you don't have the bandwidth to do that. Thanks, Wangda ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
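As a rough picture of the enable/disable option being discussed, here is a conceptual cache-with-a-switch sketch; it is not the ContainerManagementProtocolProxy implementation, and the interface and names are assumptions. A configured size of 0 stands for "never cache: close the connection after each call", i.e. the 0.23-style behaviour mentioned earlier in this thread.
{code}
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual sketch of a proxy cache that can be switched off. maxSize == 0
// means "do not cache at all": use the proxy once and close it, so the number
// of open NM connections (and their threads) cannot grow with cluster size.
public class ProxyCacheSketch {
  interface Proxy {
    void call();
    void close();
  }

  private final int maxSize; // 0 disables caching entirely
  private final LinkedHashMap<String, Proxy> cache = new LinkedHashMap<>(16, 0.75f, true);

  ProxyCacheSketch(int maxSize) {
    this.maxSize = maxSize;
  }

  void invoke(String nmAddress, Proxy freshProxy) {
    if (maxSize == 0) {
      try {
        freshProxy.call();
      } finally {
        freshProxy.close(); // close after every RPC, nothing is retained
      }
      return;
    }
    Proxy proxy = cache.computeIfAbsent(nmAddress, k -> freshProxy);
    proxy.call();
    while (cache.size() > maxSize) { // evict least-recently-used proxies
      Iterator<Map.Entry<String, Proxy>> it = cache.entrySet().iterator();
      Map.Entry<String, Proxy> eldest = it.next();
      eldest.getValue().close();
      it.remove();
    }
  }
}
{code}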
[jira] [Commented] (YARN-2566) DefaultContainerExecutor should pick a working directory randomly
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170231#comment-14170231 ] Karthik Kambatla commented on YARN-2566: Committing this.. DefaultContainerExecutor should pick a working directory randomly - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at 
org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004
[jira] [Commented] (YARN-2641) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170235#comment-14170235 ] Hudson commented on YARN-2641: -- FAILURE: Integrated in Hadoop-trunk-Commit #6254 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6254/]) YARN-2641. Decommission nodes on -refreshNodes instead of next NM-RM heartbeat. (Zhihai Xu via kasha) (kasha: rev da709a2eac7110026169ed3fc4d0eaf85488d3ef) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java Decommission nodes on -refreshNodes instead of next NM-RM heartbeat --- Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.7.0 Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch improve node decommission latency in RM. Currently the node decommission only happened after RM received nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable. The default value is 1 second. It will be better to do the decommission during RM Refresh(NodesListManager) instead of nodeHeartbeat(ResourceTrackerService). This will be a much more serious issue: After RM is refreshed (refreshNodes), If the NM to be decommissioned is killed before NM sent heartbeat to RM. The RMNode will never be decommissioned in RM. The RMNode will only expire in RM after yarn.nm.liveness-monitor.expiry-interval-ms(default value 10 minutes) time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) DefaultContainerExecutor should pick a working directory randomly
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170246#comment-14170246 ] Hudson commented on YARN-2566: -- FAILURE: Integrated in Hadoop-trunk-Commit #6255 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6255/]) YARN-2566. DefaultContainerExecutor should pick a working directory randomly. (Zhihai Xu via kasha) (kasha: rev cc93e7e683fa74eb1a7aa2b357a36667bd21086a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * hadoop-yarn-project/CHANGES.txt * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml DefaultContainerExecutor should pick a working directory randomly - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. 
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at
[jira] [Commented] (YARN-2667) Fix the release audit warning caused by hadoop-yarn-registry
[ https://issues.apache.org/jira/browse/YARN-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170291#comment-14170291 ] Yi Liu commented on YARN-2667: -- Thanks [~jlowe] for review and commit. Fix the release audit warning caused by hadoop-yarn-registry Key: YARN-2667 URL: https://issues.apache.org/jira/browse/YARN-2667 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.6.0 Attachments: YARN-2667.001.patch ? /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/resources/.keep Lines that start with ? in the release audit report indicate files that do not have an Apache license header. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2631) Modify DistributedShell to enable LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170328#comment-14170328 ] Xuan Gong commented on YARN-2631: - The purpose of these changes was testing. But after YARN-2651, we have already spun off the per-app logAggregationInterval; instead we use a new configuration in yarn-site.xml: yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds. So, to test this feature, the DistributedShell changes are not needed. Modify DistributedShell to enable LogAggregationContext --- Key: YARN-2631 URL: https://issues.apache.org/jira/browse/YARN-2631 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2631.1.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
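If it helps readers reproduce the test setup described above, the NM-wide property named in the comment can also be set programmatically rather than in yarn-site.xml; the 3600-second value below is only an example, not a recommendation.
{code}
import org.apache.hadoop.conf.Configuration;

// Example only: sets the NM-wide rolling log-aggregation interval named in the
// comment above. Requires hadoop-common on the classpath.
public class RollingLogAggregationConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setLong("yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds", 3600);
    System.out.println(conf.get("yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds"));
  }
}
{code}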
[jira] [Resolved] (YARN-2631) Modify DistributedShell to enable LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong resolved YARN-2631. - Resolution: Invalid Modify DistributedShell to enable LogAggregationContext --- Key: YARN-2631 URL: https://issues.apache.org/jira/browse/YARN-2631 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2631.1.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2683) document registry config options
[ https://issues.apache.org/jira/browse/YARN-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170394#comment-14170394 ] Hadoop QA commented on YARN-2683: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674586/YARN-2683-001.patch against trunk revision a56ea01. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 25 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-tools/hadoop-aws hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapred.pipes.TestPipeApplication org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot org.apache.hadoop.hdfs.server.balancer.TestBalancer org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5379//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5379//artifact/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5379//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5379//console This message is automatically generated. document registry config options Key: YARN-2683 URL: https://issues.apache.org/jira/browse/YARN-2683 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2683-001.patch Original Estimate: 1h Remaining Estimate: 1h Add to {{yarn-site}} a page on registry configuration parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332)