[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138622#comment-14138622 ] Tsuyoshi OZAWA commented on YARN-2312: -- Basically we can replace {{getId()}} with {{getContainerId()}} straightforwardly. [~jianhe], [~vinodkv], how can we deal with the following exceptions? 1. {{WrappedJvmID}}. In TaskAttemptImpl, the id is created by using {{container.getId}}. If we change a constructor of {{WrappedJvmID}}, we also need to change a constructor of the {{ID}} class itself. One idea is to add an {{id}} field for the upper 32 bits of the container id to the {{ID}} class.
{code}
taskAttempt.jvmID = new WrappedJvmID(taskAttempt.remoteTask.getTaskID().getJobID(),
    taskAttempt.remoteTask.isMapTask(),
    taskAttempt.container.getId().getId());
{code}
2. {{Priority}}. Can we change the definition of the Proto? It's used widely and one concern is backward compatibility. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA After YARN-2229, {{ContainerId#getId}} only returns a partial value of the container id: the sequence number without the epoch. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
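As a hedged aside, the epoch/sequence split behind this deprecation can be illustrated with a small, self-contained Java sketch. The 40-bit split and all names below are assumptions for illustration only, not the actual ContainerId implementation or the attached patch.
{code}
// Illustration only: assumes the 64-bit value from getContainerId() carries the
// RM epoch in its upper bits and the per-application sequence number in the
// lower bits, so the int-valued getId() can only return a truncated view.
public class ContainerIdLayoutSketch {
  private static final long SEQUENCE_MASK = 0xffffffffffL; // lower 40 bits (assumed split)

  public static void main(String[] args) {
    long epoch = 3;
    long sequence = 17;
    long fullContainerId = (epoch << 40) | sequence;     // what getContainerId() would expose
    long sequenceOnly = fullContainerId & SEQUENCE_MASK; // roughly what the deprecated getId() keeps
    System.out.println("full=" + fullContainerId + ", sequence-only=" + sequenceOnly);
  }
}
{code}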
[jira] [Resolved] (YARN-2474) document the wsce-site.xml keys in hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm
[ https://issues.apache.org/jira/browse/YARN-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu resolved YARN-2474. Resolution: Implemented Included in .6.patch of YARN-2198 document the wsce-site.xml keys in hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm --- Key: YARN-2474 URL: https://issues.apache.org/jira/browse/YARN-2474 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Priority: Critical Labels: security, windows Attachments: YARN-2474.1.patch document the keys used to configure WSCE -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138633#comment-14138633 ] Hadoop QA commented on YARN-2198: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669663/YARN-2198.trunk.6.patch against trunk revision ee21b13. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5023//console This message is automatically generated. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.separation.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch YARN-1972 introduces a Secure Windows Container Executor. However, this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM running as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform-specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
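The JNI bridge described in the proposal (NM-side Java calling into libwinutils, which hosts the LPC client) could take roughly the following shape. This is a hypothetical sketch only: the class name, method signature, and library name are placeholders, not part of any attached patch.
{code}
// Hypothetical shape of the NM-side bridge; none of these names come from the patch.
public class ElevatedOperationsClient {
  static {
    System.loadLibrary("winutils"); // assumed native library hosting the LPC client code
  }

  /** Asks the privileged NT service, over LPC, to launch a container as the given user. */
  public native int launchContainerAsUser(String user, String commandLine, String workDir);
}
{code}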
[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2312: - Attachment: YARN-2312-wip.patch Attached a WIP patch as a reference for the changes. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch After YARN-2229, {{ContainerId#getId}} only returns a partial value of the container id: the sequence number without the epoch. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.consolidate.6.patch Attached ver.6 consolidated patch against latest trunk -- YARN-796.node-label.consolidate.6.patch Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2498) [YARN-796] Respect labels in preemption policy of capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2498: - Attachment: YARN-2498.patch [YARN-796] Respect labels in preemption policy of capacity scheduler Key: YARN-2498 URL: https://issues.apache.org/jira/browse/YARN-2498 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2498.patch, YARN-2498.patch, YARN-2498.patch, yarn-2498-implementation-notes.pdf There are 3 stages in ProportionalCapacityPreemptionPolicy: # Recursively calculate {{ideal_assigned}} for each queue. This depends on the available resource, the resource used/pending in each queue, and the guaranteed capacity of each queue. # Mark to-be-preempted containers: for each over-satisfied queue, mark some containers to be preempted. # Notify the scheduler about the to-be-preempted containers. We need to respect labels in the cluster for both #1 and #2: For #1, when there is some resource available in the cluster, we shouldn't assign it to a queue (by increasing {{ideal_assigned}}) if the queue cannot access the labels of that resource. For #2, when deciding whether to preempt a container, we need to make sure the resource held by this container is *possibly* usable by a queue which is under-satisfied and has pending resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
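A rough sketch of the stage #1 label constraint described above: a node's available resource contributes to a queue's {{ideal_assigned}} only when the queue can access that node's label. The names and the per-node framing are placeholders, not the policy's actual code.
{code}
// Placeholder names; illustrates only the label gate on ideal_assigned.
import java.util.Set;

class LabelAwareIdealAssignedSketch {
  /** Resource a single node may contribute to a queue's ideal_assigned. */
  static long contribution(long nodeAvailable, String nodeLabel, Set<String> queueAccessibleLabels) {
    return queueAccessibleLabels.contains(nodeLabel) ? nodeAvailable : 0L;
  }
}
{code}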
[jira] [Commented] (YARN-2498) [YARN-796] Respect labels in preemption policy of capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138712#comment-14138712 ] Hadoop QA commented on YARN-2498: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669680/YARN-2498.patch against trunk revision ee21b13. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5024//console This message is automatically generated. [YARN-796] Respect labels in preemption policy of capacity scheduler Key: YARN-2498 URL: https://issues.apache.org/jira/browse/YARN-2498 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2498.patch, YARN-2498.patch, YARN-2498.patch, yarn-2498-implementation-notes.pdf There are 3 stages in ProportionalCapacityPreemptionPolicy: # Recursively calculate {{ideal_assigned}} for each queue. This depends on the available resource, the resource used/pending in each queue, and the guaranteed capacity of each queue. # Mark to-be-preempted containers: for each over-satisfied queue, mark some containers to be preempted. # Notify the scheduler about the to-be-preempted containers. We need to respect labels in the cluster for both #1 and #2: For #1, when there is some resource available in the cluster, we shouldn't assign it to a queue (by increasing {{ideal_assigned}}) if the queue cannot access the labels of that resource. For #2, when deciding whether to preempt a container, we need to make sure the resource held by this container is *possibly* usable by a queue which is under-satisfied and has pending resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2500) [YARN-796] Miscellaneous changes in ResourceManager to support labels
[ https://issues.apache.org/jira/browse/YARN-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2500: - Description: This patch contains changes in the ResourceManager to support labels [YARN-796] Miscellaneous changes in ResourceManager to support labels - Key: YARN-2500 URL: https://issues.apache.org/jira/browse/YARN-2500 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2500.patch, YARN-2500.patch, YARN-2500.patch This patch contains changes in the ResourceManager to support labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2500) [YARN-796] Miscellaneous changes in ResourceManager to support labels
[ https://issues.apache.org/jira/browse/YARN-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138720#comment-14138720 ] Wangda Tan commented on YARN-2500: -- Highlighting the major changes for easier review: * Properly set up and check the label expression according to the ResourceRequest, the ApplicationSubmissionContext and the queue's default label expression ** Located in {{ApplicationMasterService}}, {{RMServerUtils#validateResourceRequest}} * In the past, the ResourceRequest of the AM container was created in two places (one in {{RMAppManager}}, another in {{RMAppAttemptImpl}}), and when trying to allocate resource from the YarnScheduler, the 2nd ResourceRequest wasn't normalized. This is now reduced to one place: the ResourceRequest is created in {{RMAppManager}} only, which passes it to RMAppImpl to allocate resource * Check that a queue has permission to access all labels in the label-expression of a ResourceRequest, located in {{SchedulerUtils.validateResourceRequest}} (a sketch of this check follows this message) * Added getLabels() and getDefaultLabelExpression() to Queue * Added get/set NodeLabelManager in RMContext [YARN-796] Miscellaneous changes in ResourceManager to support labels - Key: YARN-2500 URL: https://issues.apache.org/jira/browse/YARN-2500 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2500.patch, YARN-2500.patch, YARN-2500.patch This patch contains changes in the ResourceManager to support labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
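A hedged illustration of the queue-accessibility check mentioned above for {{SchedulerUtils.validateResourceRequest}}. It assumes the check simply verifies every label in the request's label-expression against the queue's accessible-label set; the method and exception are placeholders, not the patch's code.
{code}
// Placeholder names; not the actual SchedulerUtils code.
import java.util.Set;

class LabelAccessCheckSketch {
  static void checkQueueAccess(Set<String> requestLabels, Set<String> queueAccessibleLabels) {
    for (String label : requestLabels) {
      if (!queueAccessibleLabels.contains(label)) {
        throw new IllegalArgumentException("Queue cannot access label in resource request: " + label);
      }
    }
  }
}
{code}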
[jira] [Commented] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
[ https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138809#comment-14138809 ] Hudson commented on YARN-2558: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #684 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/684/]) YARN-2558. Updated ContainerTokenIdentifier#read/write to use ContainerId#getContainerId. Contributed by Tsuyoshi OZAWA. (jianhe: rev f4886111aa573ec928de69e8ca9328d480bf673e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/ContainerTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java * hadoop-yarn-project/CHANGES.txt Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId -- Key: YARN-2558 URL: https://issues.apache.org/jira/browse/YARN-2558 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2558.1.patch, YARN-2558.2.patch, YARN-2558.3.patch We should update ContainerTokenIdentifier#read/write to use {{getContainerId}} instead of {{getId}} to pass all container information correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2559) ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher
[ https://issues.apache.org/jira/browse/YARN-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138814#comment-14138814 ] Hudson commented on YARN-2559: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #684 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/684/]) YARN-2559. Fixed NPE in SystemMetricsPublisher when retrieving FinalApplicationStatus. Contributed by Zhijie Shen (jianhe: rev ee21b13cbd4654d7181306404174329f12193613) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsPublisher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisher.java ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher -- Key: YARN-2559 URL: https://issues.apache.org/jira/browse/YARN-2559 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: Generic History Service is enabled in the Timeline Server with yarn.resourcemanager.system-metrics-publisher.enabled=true, so that the ResourceManager uses the Timeline Store for recording application history information Reporter: Karam Singh Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2559.1.patch, YARN-2559.2.patch, YARN-2559.3.patch, YARN-2559.4.patch ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2552) Windows Secure Container Executor: the privileged file operations of hadoopwinutilsvc should be constrained to localdirs only
[ https://issues.apache.org/jira/browse/YARN-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2552: --- Attachment: YARN-2552.1.patch This patch adds a ValidateLocalPath function, checks all file operations for valid paths, and adds a yarn.nodemanager.windows-secure-container-executor.local-dirs key to wsce-site.xml. The SecureContainer.apt.vm is updated to describe the new config key. Windows Secure Container Executor: the privileged file operations of hadoopwinutilsvc should be constrained to localdirs only - Key: YARN-2552 URL: https://issues.apache.org/jira/browse/YARN-2552 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows, wsce Attachments: YARN-2552.1.patch YARN-2458 added file manipulation operations executed in an elevated context by hadoopwinutilsvc. W/o any constraint, the NM (or a hijacker that takes over the NM) can manipulate arbitrary OS files under highest possible privileges, an easy elevation attack vector. The service should only allow operations on files/directories that are under the configured NM localdirs. It should read this value from wsce-site.xml, as the yarn-site.xml cannot be trusted, being writable by Hadoop admins (YARN-2551 ensures wsce-site.xml is only writable by system Administrators, not Hadoop admins). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
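The path-containment check that a ValidateLocalPath function would perform can be sketched in plain Java as below. This is a hedged illustration of the idea only, not the native winutils code from the patch; all names are placeholders.
{code}
// Illustration of "is this path under one of the configured local dirs?"
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class LocalDirCheckSketch {
  static boolean isUnderLocalDirs(String candidate, List<String> localDirs) {
    Path target = Paths.get(candidate).toAbsolutePath().normalize();
    for (String dir : localDirs) {
      Path root = Paths.get(dir).toAbsolutePath().normalize();
      if (target.startsWith(root)) {
        return true;
      }
    }
    return false;
  }
}
{code}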
[jira] [Resolved] (YARN-2552) Windows Secure Container Executor: the privileged file operations of hadoopwinutilsvc should be constrained to localdirs only
[ https://issues.apache.org/jira/browse/YARN-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu resolved YARN-2552. Resolution: Implemented Fix is included in YARN-2198.delta.7.patch Windows Secure Container Executor: the privileged file operations of hadoopwinutilsvc should be constrained to localdirs only - Key: YARN-2552 URL: https://issues.apache.org/jira/browse/YARN-2552 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows, wsce Attachments: YARN-2552.1.patch YARN-2458 added file manipulation operations executed in an elevated context by hadoopwinutilsvc. W/o any constraint, the NM (or a hijacker that takes over the NM) can manipulate arbitrary OS files under highest possible privileges, an easy elevation attack vector. The service should only allow operations on files/directories that are under the configured NM localdirs. It should read this value from wsce-site.xml, as the yarn-site.xml cannot be trusted, being writable by Hadoop admins (YARN-2551 ensures wsce-site.xml is only writable by system Administrators, not Hadoop admins). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2559) ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher
[ https://issues.apache.org/jira/browse/YARN-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138942#comment-14138942 ] Hudson commented on YARN-2559: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1900 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1900/]) YARN-2559. Fixed NPE in SystemMetricsPublisher when retrieving FinalApplicationStatus. Contributed by Zhijie Shen (jianhe: rev ee21b13cbd4654d7181306404174329f12193613) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsPublisher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/CHANGES.txt ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher -- Key: YARN-2559 URL: https://issues.apache.org/jira/browse/YARN-2559 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: Generic History Service is enabled in the Timeline Server with yarn.resourcemanager.system-metrics-publisher.enabled=true, so that the ResourceManager uses the Timeline Store for recording application history information Reporter: Karam Singh Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2559.1.patch, YARN-2559.2.patch, YARN-2559.3.patch, YARN-2559.4.patch ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
[ https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138937#comment-14138937 ] Hudson commented on YARN-2558: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1900 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1900/]) YARN-2558. Updated ContainerTokenIdentifier#read/write to use ContainerId#getContainerId. Contributed by Tsuyoshi OZAWA. (jianhe: rev f4886111aa573ec928de69e8ca9328d480bf673e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/ContainerTokenIdentifier.java * hadoop-yarn-project/CHANGES.txt Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId -- Key: YARN-2558 URL: https://issues.apache.org/jira/browse/YARN-2558 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2558.1.patch, YARN-2558.2.patch, YARN-2558.3.patch We should update ContainerTokenIdentifier#read/write to use {{getContainerId}} instead of {{getId}} to pass all container information correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
[ https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138950#comment-14138950 ] Hudson commented on YARN-2558: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1875 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1875/]) YARN-2558. Updated ContainerTokenIdentifier#read/write to use ContainerId#getContainerId. Contributed by Tsuyoshi OZAWA. (jianhe: rev f4886111aa573ec928de69e8ca9328d480bf673e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/ContainerTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java * hadoop-yarn-project/CHANGES.txt Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId -- Key: YARN-2558 URL: https://issues.apache.org/jira/browse/YARN-2558 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2558.1.patch, YARN-2558.2.patch, YARN-2558.3.patch We should update ContainerTokenIdentifier#read/write to use {{getContainerId}} instead of {{getId}} to pass all container information correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2559) ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher
[ https://issues.apache.org/jira/browse/YARN-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138955#comment-14138955 ] Hudson commented on YARN-2559: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1875 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1875/]) YARN-2559. Fixed NPE in SystemMetricsPublisher when retrieving FinalApplicationStatus. Contributed by Zhijie Shen (jianhe: rev ee21b13cbd4654d7181306404174329f12193613) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsPublisher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher -- Key: YARN-2559 URL: https://issues.apache.org/jira/browse/YARN-2559 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: Generic History Service is enabled in the Timeline Server with yarn.resourcemanager.system-metrics-publisher.enabled=true, so that the ResourceManager uses the Timeline Store for recording application history information Reporter: Karam Singh Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2559.1.patch, YARN-2559.2.patch, YARN-2559.3.patch, YARN-2559.4.patch ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.
[ https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139066#comment-14139066 ] Junping Du commented on YARN-2561: -- Hi [~jlowe], can you help to review the patch again given this is a blocker bug? Thanks! MR job client cannot reconnect to AM after NM restart. -- Key: YARN-2561 URL: https://issues.apache.org/jira/browse/YARN-2561 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Tassapol Athiapinya Assignee: Junping Du Priority: Blocker Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561-v4.patch, YARN-2561.patch Work-preserving NM restart is disabled. Submit a job. Restart the only NM and found that Job will hang with connect retries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1779) Handle AMRMTokens across RM failover
[ https://issues.apache.org/jira/browse/YARN-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139186#comment-14139186 ] Vinod Kumar Vavilapalli commented on YARN-1779: --- Looks good, +1. Checking this in.. Handle AMRMTokens across RM failover Key: YARN-1779 URL: https://issues.apache.org/jira/browse/YARN-1779 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Jian He Priority: Blocker Labels: ha Attachments: YARN-1779.1.patch, YARN-1779.2.patch, YARN-1779.3.patch, YARN-1779.6.patch Verify if AMRMTokens continue to work against RM failover. If not, we will have to do something along the lines of YARN-986. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
[ https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2558: -- Hadoop Flags: Reviewed (was: Incompatible change,Reviewed) Dropping the incompatibility flag given we aren't really supporting rolling upgrades yet (till YARN-666 is in). Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId -- Key: YARN-2558 URL: https://issues.apache.org/jira/browse/YARN-2558 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2558.1.patch, YARN-2558.2.patch, YARN-2558.3.patch We should update ContainerTokenIdentifier#read/write to use {{getContainerId}} instead of {{getId}} to pass all container information correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.
[ https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139256#comment-14139256 ] Jason Lowe commented on YARN-2561: -- We should not assume that the lack of any container status means the NM doesn't support restart. There may have been no containers running on it at the time, but it could still have outstanding applications active (e.g.: still shuffling for the mapreduce_shuffle service), and we definitely don't want AMs to be notified of the node removal in that case. It would be better to check the list of applications on the node. If there are no applications then that should necessarily mean there are also no containers. If the node registers with no applications then it's probably safe to treat the node as a removal and re-add whether that node supports NM restart or not. When a node reconnects with no containers but has the same port, we aren't updating its potentially new totalCapability as we did before. MR job client cannot reconnect to AM after NM restart. -- Key: YARN-2561 URL: https://issues.apache.org/jira/browse/YARN-2561 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Tassapol Athiapinya Assignee: Junping Du Priority: Blocker Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561-v4.patch, YARN-2561.patch Work-preserving NM restart is disabled. Submit a job. Restart the only NM and found that Job will hang with connect retries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
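A minimal sketch of the reconnect check suggested above: treat the reconnect as a remove-and-re-add only when the node reports no active applications (which implies it also has no containers). Class and method names are illustrative, not the actual RM code.
{code}
// Placeholder names; captures only the "no applications => safe to replace" rule.
import java.util.List;

class NodeReconnectCheckSketch {
  static boolean safeToRemoveAndReAdd(List<String> runningApplicationsOnNode) {
    return runningApplicationsOnNode == null || runningApplicationsOnNode.isEmpty();
  }
}
{code}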
[jira] [Commented] (YARN-2363) Submitted applications occasionally lack a tracking URL
[ https://issues.apache.org/jira/browse/YARN-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139269#comment-14139269 ] Jonathan Eagles commented on YARN-2363: --- +1. This will be a big help to users, [~jlowe]. Submitted applications occasionally lack a tracking URL --- Key: YARN-2363 URL: https://issues.apache.org/jira/browse/YARN-2363 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-2363.patch Sometimes when an application is submitted the client receives no tracking URL. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139336#comment-14139336 ] Kendall Thrapp commented on YARN-415: - Thanks [~eepayne], [~aklochkov], [~jianhe], [~wangda], [~kasha], [~sandyr] and [~jlowe] for all your effort on this! Looking forward to being able to use this feature. Capture aggregate memory allocation at the app-level for chargeback --- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.0 Reporter: Kendall Thrapp Assignee: Eric Payne Fix For: 2.6.0 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.201408212033.txt, YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.201409102216.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
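The MB-seconds aggregation described in the YARN-415 summary above can be written down as a small sketch. The container type and field names below are stand-ins for illustration, not the YARN API classes or the patch's metrics code.
{code}
// Sum over containers of (reserved memory in MB) * (container lifetime in seconds).
import java.util.List;

class MemorySecondsSketch {
  static class ReservedContainer {
    final long reservedMb;
    final long lifetimeSeconds;
    ReservedContainer(long reservedMb, long lifetimeSeconds) {
      this.reservedMb = reservedMb;
      this.lifetimeSeconds = lifetimeSeconds;
    }
  }

  static long aggregateMbSeconds(List<ReservedContainer> containers) {
    long total = 0;
    for (ReservedContainer c : containers) {
      total += c.reservedMb * c.lifetimeSeconds;
    }
    return total;
  }
}
{code}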
[jira] [Commented] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139368#comment-14139368 ] Karthik Kambatla commented on YARN-1959: Approach looks good. A couple of comments: # Can we add a single method to SchedulingPolicy that takes the available cluster resources, the queue's fair share and its current usage? Maybe we can call that method computeHeadroom? # The amount of resources available in the cluster can be looked up in FairScheduler.rootMetrics. Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
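One plausible shape for the single computeHeadroom method suggested above, under the assumption that headroom is the queue's remaining fair share capped by what is actually free in the cluster. The signature and formula are guesses for illustration, not the committed API.
{code}
// Placeholder for the suggested SchedulingPolicy#computeHeadroom; assumed formula:
// headroom = min(cluster available, max(0, fair share - current usage)).
class HeadroomSketch {
  static long computeHeadroom(long clusterAvailable, long queueFairShare, long queueUsage) {
    long remainingFairShare = Math.max(0L, queueFairShare - queueUsage);
    return Math.min(clusterAvailable, remainingFairShare);
  }
}
{code}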
[jira] [Commented] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139372#comment-14139372 ] Karthik Kambatla commented on YARN-2080: Latest patch looks good to me. PS: Versioned patches are easier to refer to. Admission Control: Integrate Reservation subsystem with ResourceManager --- Key: YARN-2080 URL: https://issues.apache.org/jira/browse/YARN-2080 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Subru Krishnan Assignee: Subru Krishnan Attachments: YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch This JIRA tracks the integration of Reservation subsystem data structures introduced in YARN-1709 with the YARN RM. This is essentially end2end wiring of YARN-1051. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2563) On secure clusters call to timeline server fails with authentication errors when running a job via oozie
[ https://issues.apache.org/jira/browse/YARN-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139376#comment-14139376 ] Jian He commented on YARN-2563: --- Not related to this bug, but I think for the following we can just check the kind name for simplicity.
{code}
TokenIdentifier tokenIdentifier = token.decodeIdentifier();
if (tokenIdentifier instanceof TimelineDelegationTokenIdentifier) {
  return;
}
{code}
On secure clusters call to timeline server fails with authentication errors when running a job via oozie Key: YARN-2563 URL: https://issues.apache.org/jira/browse/YARN-2563 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2563.1.patch During our nightlies on a secure cluster we have seen oozie jobs fail with authentication error to the time line server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
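The "check the kind name" simplification can be sketched as below; it avoids decoding the token identifier entirely. This assumes the standard Token#getKind() API and the timeline delegation token kind constant, and is a hedged illustration rather than the committed patch.
{code}
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier;

class TimelineTokenKindCheckSketch {
  static boolean isTimelineToken(Token<?> token) {
    // Compare the token kind instead of decoding the identifier.
    return TimelineDelegationTokenIdentifier.KIND_NAME.equals(token.getKind());
  }
}
{code}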
[jira] [Updated] (YARN-2563) On secure clusters call to timeline server fails with authentication errors when running a job via oozie
[ https://issues.apache.org/jira/browse/YARN-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2563: -- Attachment: YARN-2563.2.patch Upload a new patch to simplify the way to check the existence of timeline DT. On secure clusters call to timeline server fails with authentication errors when running a job via oozie Key: YARN-2563 URL: https://issues.apache.org/jira/browse/YARN-2563 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2563.1.patch, YARN-2563.2.patch During our nightlies on a secure cluster we have seen oozie jobs fail with authentication error to the time line server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2056: - Attachment: YARN-2056.201409181916.txt [~leftnoteasy] and [~jlowe], Thank you both for your reviews of this jira. The previous patch did not handle all cases in hierarchical queues where the queue is over-capacity and some or all of its children are non-preemptable. This patch has been rewritten and should address all use cases. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2363) Submitted applications occasionally lack a tracking URL
[ https://issues.apache.org/jira/browse/YARN-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139395#comment-14139395 ] Jason Lowe commented on YARN-2363: -- Thanks for the reviews, [~mitdesai], [~sjlee0], and [~jeagles]! Committing this. Submitted applications occasionally lack a tracking URL --- Key: YARN-2363 URL: https://issues.apache.org/jira/browse/YARN-2363 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-2363.patch Sometimes when an application is submitted the client receives no tracking URL. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2561) MR job client cannot reconnect to AM after NM restart.
[ https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2561: - Attachment: YARN-2561-v5.patch MR job client cannot reconnect to AM after NM restart. -- Key: YARN-2561 URL: https://issues.apache.org/jira/browse/YARN-2561 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Tassapol Athiapinya Assignee: Junping Du Priority: Blocker Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561-v4.patch, YARN-2561-v5.patch, YARN-2561.patch Work-preserving NM restart is disabled. Submit a job. Restart the only NM and found that Job will hang with connect retries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2563) On secure clusters call to timeline server fails with authentication errors when running a job via oozie
[ https://issues.apache.org/jira/browse/YARN-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139440#comment-14139440 ] Hadoop QA commented on YARN-2563: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669779/YARN-2563.2.patch against trunk revision 570b8b4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5025//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5025//console This message is automatically generated. On secure clusters call to timeline server fails with authentication errors when running a job via oozie Key: YARN-2563 URL: https://issues.apache.org/jira/browse/YARN-2563 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2563.1.patch, YARN-2563.2.patch During our nightlies on a secure cluster we have seen oozie jobs fail with authentication error to the time line server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.
[ https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139450#comment-14139450 ] Jason Lowe commented on YARN-2561: -- bq. Yes. Update with adding with newNode instead of old rmNode. Ah, I see now. Before I thought the node update event was just going to the scheduler but now I see it's sending an event to itself which will eventually set totalCapability properly. +1 on the latest patch pending Jenkins. MR job client cannot reconnect to AM after NM restart. -- Key: YARN-2561 URL: https://issues.apache.org/jira/browse/YARN-2561 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Tassapol Athiapinya Assignee: Junping Du Priority: Blocker Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561-v4.patch, YARN-2561-v5.patch, YARN-2561.patch Work-preserving NM restart is disabled. Submit a job. Restart the only NM and found that Job will hang with connect retries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-913: Attachment: YARN-913-003.patch Updated patch with ACL build-up for system permissions on secure clusters, testing in-VM with miniKDC and ZK, and showing that doAs registry creation creates an authed registry. Also: patched the build process so that precommit reports can now be found Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up, or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to, and not any others in the cluster. Some kind of service registry, in the RM or in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
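Conceptually, a consumer of such a registry would look up a published service record under a well-known ZK path. The Curator-based sketch below is purely illustrative of that flow under assumed paths and a plain-string record format; it is not the YARN-913 registry API.
{code}
// Illustration only: assumed ZK path layout and payload format.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class RegistryLookupSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();
    byte[] record = zk.getData().forPath("/registry/users/alice/services/myapp/instance-1");
    System.out.println(new String(record, "UTF-8"));
    zk.close();
  }
}
{code}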
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139493#comment-14139493 ] Hadoop QA commented on YARN-2056: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669780/YARN-2056.201409181916.txt against trunk revision 570b8b4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5026//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5026//console This message is automatically generated. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139498#comment-14139498 ] Hadoop QA commented on YARN-913: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669798/YARN-913-003.patch against trunk revision 474f116. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 34 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5028//console This message is automatically generated. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up, or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to, and not any others in the cluster. Some kind of service registry, in the RM or in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2468: Attachment: YARN-2468.6.patch Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139515#comment-14139515 ] Xuan Gong commented on YARN-2468: - bq. If we want to use day as the unit, shall we allow double interval, such that I can say 0.1 day. I think it would be useful for testing purpose (not unit test, but verifying the feature is working on purpose, and the app is using the feature correctly). Thoughts? Added an extra YarnConfiguration.DEBUG_LOG_AGGREGATION_SPEED_UP_RATIO for test/debug purposes. The new patch also addresses the other comments. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.
[ https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139523#comment-14139523 ] Hadoop QA commented on YARN-2561: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669786/YARN-2561-v5.patch against trunk revision 570b8b4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5027//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5027//console This message is automatically generated. MR job client cannot reconnect to AM after NM restart. -- Key: YARN-2561 URL: https://issues.apache.org/jira/browse/YARN-2561 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Tassapol Athiapinya Assignee: Junping Du Priority: Blocker Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561-v4.patch, YARN-2561-v5.patch, YARN-2561.patch Work-preserving NM restart is disabled. Submit a job. Restart the only NM and found that Job will hang with connect retries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2468: Attachment: YARN-2468.6.1.patch Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2568) TestAMRMClientOnRMRestart test fails
Jian He created YARN-2568: - Summary: TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2563) On secure clusters call to timeline server fails with authentication errors when running a job via oozie
[ https://issues.apache.org/jira/browse/YARN-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139542#comment-14139542 ] Jian He commented on YARN-2563: --- Looks good, +1 On secure clusters call to timeline server fails with authentication errors when running a job via oozie Key: YARN-2563 URL: https://issues.apache.org/jira/browse/YARN-2563 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2563.1.patch, YARN-2563.2.patch During our nightlies on a secure cluster we have seen Oozie jobs fail with authentication errors against the timeline server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2568) TestAMRMClientOnRMRestart test fails
[ https://issues.apache.org/jira/browse/YARN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2568: -- Attachment: YARN-2568.patch TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2568.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.
[ https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139544#comment-14139544 ] Jason Lowe commented on YARN-2561: -- Committing this. MR job client cannot reconnect to AM after NM restart. -- Key: YARN-2561 URL: https://issues.apache.org/jira/browse/YARN-2561 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Tassapol Athiapinya Assignee: Junping Du Priority: Blocker Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561-v4.patch, YARN-2561-v5.patch, YARN-2561.patch Work-preserving NM restart is disabled. Submit a job. Restart the only NM and found that Job will hang with connect retries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2568) TestAMRMClientOnRMRestart test fails
[ https://issues.apache.org/jira/browse/YARN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2568: -- Description: testAMRMClientResendsRequestsOnRMRestart(org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart) Time elapsed: 10.807 sec FAILURE! java.lang.AssertionError: Number of container should be 3 expected:3 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:290) TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2568.patch testAMRMClientResendsRequestsOnRMRestart(org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart) Time elapsed: 10.807 sec FAILURE! java.lang.AssertionError: Number of container should be 3 expected:3 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139605#comment-14139605 ] Jian He commented on YARN-1857: --- thanks [~airbots] and [~cwelch], the patch looks good overall; a few comments and questions:
- Indentation of the last line seems incorrect.
{code}
Resource headroom = Resources.min(resourceCalculator, clusterResource,
    Resources.subtract(
        Resources.min(resourceCalculator, clusterResource, userLimit, queueMaxCap),
        userConsumed),
    Resources.subtract(queueMaxCap, usedResources));
{code}
- Test case 2: could you check app2's headroom as well?
- Test case 3: could you check app_1's headroom as well?
- Could you explain why in test case 4 {{assertEquals(5*GB, app_4.getHeadroom().getMemory());}}, app4 still has 5GB headroom?
CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Chen He Priority: Critical Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, YARN-1857.patch, YARN-1857.patch, YARN-1857.patch It's possible to get an application to hang forever (or for a long time) in a cluster with multiple users. The reason is that the headroom sent to the application is based on the user limit, but it doesn't account for other application masters using space in that queue. So the headroom (user limit - user consumed) can be > 0 even though the cluster is 100% full, because the remaining space is being used by application masters from other users. For instance, suppose you have a cluster with one queue, the user limit is 100%, and multiple users are submitting applications. One very large application by user 1 starts up, runs most of its maps, and starts running reducers. Other users try to start applications and get their application masters started, but not tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with its reduces, but it still needs to finish a few maps. The headroom sent to this application is based only on the user limit (which is 100% of the cluster capacity): it is using, say, 95% of the cluster for reduces, and the other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5%, so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios as well. Generally, in a large cluster with multiple queues, this shouldn't cause a hang forever, but it could make the application take much longer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
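For readers following the review discussion, a small worked example of the difference between the two headroom computations may help. The numbers are illustrative only, and scalar min/subtract stand in for the Resources.* helpers in the quoted snippet.
{code}
// Illustrative numbers only, not taken from the patch or its tests.
public class HeadroomExample {
  public static void main(String[] args) {
    long queueMaxCap   = 100;  // GB, single queue
    long userLimit     = 100;  // GB, user limit = 100% of the cluster
    long userConsumed  = 95;   // GB used by user 1's reducers
    long usedResources = 100;  // GB: 95 GB of reducers + 5 GB of other users' AMs

    // Old headroom: user limit minus user consumption -> reports 5 GB free
    // even though the cluster is completely full.
    long oldHeadroom = userLimit - userConsumed;

    // Headroom per the formula quoted in the review comment above: also bounded
    // by what the queue actually has left, so the AM sees 0 and knows to
    // preempt its own reduces to run the remaining maps.
    long newHeadroom = Math.min(Math.min(userLimit, queueMaxCap) - userConsumed,
                                queueMaxCap - usedResources);

    System.out.println("old headroom = " + oldHeadroom + " GB");  // 5
    System.out.println("new headroom = " + newHeadroom + " GB");  // 0
  }
}
{code}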
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139619#comment-14139619 ] Hadoop QA commented on YARN-2468: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669809/YARN-2468.6.1.patch against trunk revision a337f0e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.client.cli.TestLogsCLI org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5030//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5030//console This message is automatically generated. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139675#comment-14139675 ] Hadoop QA commented on YARN-913: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669814/YARN-913-004.patch against trunk revision fe2f54d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 34 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1267 javac compiler warnings (more than the trunk's current 1266 warnings). {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 10 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/5031//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 11 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-minikdc hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell org.apache.hadoop.yarn.registry.secure.TestSecureLogins org.apache.hadoop.yarn.server.TestMiniYARNClusterRegistry {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5031//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5031//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5031//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-registry.html Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5031//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5031//console This message is automatically generated. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. 
Applications need to be able to find the service instance they are to bond to - and not any others in the cluster. Some kind of service registry - in the RM or in ZK - could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
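The registry sketched in that description boils down to a publish/lookup pattern over ZooKeeper. Below is a minimal illustration of that pattern using plain Apache Curator; the paths, record format, and quorum address are assumptions for illustration, this is not the yarn-registry API added by the patch, and it shows the direct app-to-ZK variant that the description argues is less secure than routing writes through the RM.
{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

// Sketch only: publish a service endpoint under an assumed registry path and
// resolve it again from a client.
public class RegistrySketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "zk1:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();

    // AM side: publish where this service instance actually came up.
    String path = "/registry/users/alice/services/hbase-on-yarn/instance-01";
    byte[] record = "host=node17.example.com,port=8020".getBytes("UTF-8");
    zk.create().creatingParentsIfNeeded()
        .withMode(CreateMode.EPHEMERAL)     // entry disappears if the publisher dies
        .forPath(path, record);

    // Client side: resolve the one instance it should bond to.
    byte[] resolved = zk.getData().forPath(path);
    System.out.println(new String(resolved, "UTF-8"));

    zk.close();
  }
}
{code}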
[jira] [Commented] (YARN-2568) TestAMRMClientOnRMRestart test fails
[ https://issues.apache.org/jira/browse/YARN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139680#comment-14139680 ] Hadoop QA commented on YARN-2568: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669810/YARN-2568.patch against trunk revision 2c3da25. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5032//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5032//console This message is automatically generated. TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2568.patch testAMRMClientResendsRequestsOnRMRestart(org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart) Time elapsed: 10.807 sec FAILURE! java.lang.AssertionError: Number of container should be 3 expected:3 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139731#comment-14139731 ] Vinod Kumar Vavilapalli commented on YARN-668: -- One more thing. Let's version each of the tokens separately and validate the versions on the NodeManager during startContainer() to catch incompatible changes. The Token is redesigned anyway, so we can start with version 0. TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Vinod Kumar Vavilapalli Priority: Blocker Attachments: YARN-668-demo.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
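To make the description's approach concrete, here is a rough sketch of a TokenIdentifier that keeps the Writable wire format but stores all its fields in a versioned protobuf payload. The payload message (ContainerTokenPayloadProto) and its fields are assumptions for illustration, not the classes the eventual patch defines.
{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.TokenIdentifier;

// Sketch only. ContainerTokenPayloadProto is a hypothetical generated protobuf
// message carrying the real token fields plus an explicit version number.
public class VersionedTokenIdentifier extends TokenIdentifier {
  public static final Text KIND = new Text("VersionedToken");

  private ContainerTokenPayloadProto payload;  // hypothetical PB message

  @Override
  public void write(DataOutput out) throws IOException {
    byte[] bytes = payload.toByteArray();      // payload serialized by PB
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    // PB preserves unknown fields, so older and newer identifiers can round-trip.
    payload = ContainerTokenPayloadProto.parseFrom(bytes);
  }

  // The NM could compare this against the versions it supports in
  // startContainer() and reject incompatible tokens, as suggested above.
  public int getVersion() {
    return payload.getVersion();
  }

  @Override
  public Text getKind() {
    return KIND;
  }

  @Override
  public UserGroupInformation getUser() {
    return UserGroupInformation.createRemoteUser(payload.getUser());
  }
}
{code}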
[jira] [Commented] (YARN-2565) RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore
[ https://issues.apache.org/jira/browse/YARN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139746#comment-14139746 ] Jian He commented on YARN-2565: --- - The exception message still seems to use FileSystemApplicationHistoryStore: {code} "Could not instantiate ApplicationHistoryWriter: " + conf.get(YarnConfiguration.APPLICATION_HISTORY_STORE, FileSystemApplicationHistoryStore.class.getName()); {code} - testDefaultStoreSetup: missing @Test annotation RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore --- Key: YARN-2565 URL: https://issues.apache.org/jira/browse/YARN-2565 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: Secure cluster with ATS (timeline server enabled) and yarn.resourcemanager.system-metrics-publisher.enabled=true so that RM can send Application history to Timeline Store Reporter: Karam Singh Assignee: Zhijie Shen Attachments: YARN-2565.1.patch, YARN-2565.2.patch Observed that RM fails to start in Secure mode when GenericeHistoryService is enabled and ResourceManager is set to use Timeline Store -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139748#comment-14139748 ] Junping Du commented on YARN-668: - Thanks [~vinodkv] for the comments. I also tested it offline and this approach should work. Will work on this JIRA. TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Vinod Kumar Vavilapalli Priority: Blocker Attachments: YARN-668-demo.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-668: Assignee: Junping Du (was: Vinod Kumar Vavilapalli) TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Junping Du Priority: Blocker Attachments: YARN-668-demo.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2568) TestAMRMClientOnRMRestart test fails
[ https://issues.apache.org/jira/browse/YARN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2568: -- Attachment: YARN-2568.patch TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2568.patch, YARN-2568.patch testAMRMClientResendsRequestsOnRMRestart(org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart) Time elapsed: 10.807 sec FAILURE! java.lang.AssertionError: Number of container should be 3 expected:3 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2565) RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore
[ https://issues.apache.org/jira/browse/YARN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2565: -- Attachment: YARN-2565.3.patch Fix the patch according to Jian's comments RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore --- Key: YARN-2565 URL: https://issues.apache.org/jira/browse/YARN-2565 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: Secure cluster with ATS (timeline server enabled) and yarn.resourcemanager.system-metrics-publisher.enabled=true so that RM can send Application history to Timeline Store Reporter: Karam Singh Assignee: Zhijie Shen Attachments: YARN-2565.1.patch, YARN-2565.2.patch, YARN-2565.3.patch Observed that RM fails to start in Secure mode when GenericeHistoryService is enabled and ResourceManager is set to use Timeline Store -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2568) TestAMRMClientOnRMRestart test fails
[ https://issues.apache.org/jira/browse/YARN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139811#comment-14139811 ] Zhijie Shen commented on YARN-2568: --- +1, pending on jenkins TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2568.patch, YARN-2568.patch testAMRMClientResendsRequestsOnRMRestart(org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart) Time elapsed: 10.807 sec FAILURE! java.lang.AssertionError: Number of container should be 3 expected:3 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1372: Attachment: YARN-1372.006.patch Addressed the feedback except for two things (transferring all finishedContainers from the previous attempt irrespective of wp-restart being enabled for the AM, and adding AM containers to the finished containers). Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.002_NMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, YARN-1372.004.patch, YARN-1372.005.patch, YARN-1372.005.patch, YARN-1372.006.patch, YARN-1372.prelim.patch, YARN-1372.prelim2.patch Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
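The description above is essentially an acknowledgement protocol: the NM must keep each completed-container report until the RM confirms the AM has pulled it. A minimal sketch of that NM-side bookkeeping follows; the class and method names are illustrative, not the ones used by the patch, which wires this through the NM-RM heartbeat and the NM state store.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

// Sketch of the NM-side bookkeeping described above (illustrative names only).
public class CompletedContainerTracker {
  // Completed containers the AM has not yet been confirmed to have pulled.
  private final Map<ContainerId, ContainerStatus> pendingAck =
      new ConcurrentHashMap<ContainerId, ContainerStatus>();

  // Called when a container finishes: remember it until the RM acks it.
  public void containerCompleted(ContainerStatus status) {
    pendingAck.put(status.getContainerId(), status);
  }

  // Reported on every NM->RM heartbeat, and resent in full when the NM
  // re-registers after an RM restart, so no completion is lost.
  public List<ContainerStatus> statusesForHeartbeat() {
    return new ArrayList<ContainerStatus>(pendingAck.values());
  }

  // Called when the RM reports that the AM has pulled these containers; only
  // then may the NM forget them (duplicate reports to the AM are possible).
  public void containersAckedByRM(List<ContainerId> acked) {
    for (ContainerId id : acked) {
      pendingAck.remove(id);
    }
  }
}
{code}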
[jira] [Commented] (YARN-2568) TestAMRMClientOnRMRestart test fails
[ https://issues.apache.org/jira/browse/YARN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139847#comment-14139847 ] Hadoop QA commented on YARN-2568: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669871/YARN-2568.patch against trunk revision fe38d2e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5033//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5033//console This message is automatically generated. TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2568.patch, YARN-2568.patch testAMRMClientResendsRequestsOnRMRestart(org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart) Time elapsed: 10.807 sec FAILURE! java.lang.AssertionError: Number of container should be 3 expected:3 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2565) RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore
[ https://issues.apache.org/jira/browse/YARN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139873#comment-14139873 ] Hadoop QA commented on YARN-2565: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669873/YARN-2565.3.patch against trunk revision 6434572. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5035//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5035//console This message is automatically generated. RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore --- Key: YARN-2565 URL: https://issues.apache.org/jira/browse/YARN-2565 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: Secure cluster with ATS (timeline server enabled) and yarn.resourcemanager.system-metrics-publisher.enabled=true so that RM can send Application history to Timeline Store Reporter: Karam Singh Assignee: Zhijie Shen Attachments: YARN-2565.1.patch, YARN-2565.2.patch, YARN-2565.3.patch Observed that RM fails to start in Secure mode when GenericeHistoryService is enabled and ResourceManager is set to use Timeline Store -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139887#comment-14139887 ] Hadoop QA commented on YARN-1372: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669882/YARN-1372.006.patch against trunk revision 6434572. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5036//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5036//console This message is automatically generated. Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.002_NMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, YARN-1372.004.patch, YARN-1372.005.patch, YARN-1372.005.patch, YARN-1372.006.patch, YARN-1372.prelim.patch, YARN-1372.prelim2.patch Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2566: Attachment: YARN-2566.000.patch IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_01 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139894#comment-14139894 ] zhihai xu commented on YARN-2566: - I attached a patch YARN-2566.000.patch for review. I have a test case in the patch which need Mock the FileContext class, so I need remove final in FileContext class. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at 
org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139927#comment-14139927 ] Hadoop QA commented on YARN-2566: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669893/YARN-2566.000.patch against trunk revision 6434572. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5037//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5037//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5037//console This message is automatically generated. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. 
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344)
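The failure mode above comes from startLocalizer copying the token file into the single, first localDir. One natural fix, sketched roughly here rather than taken from the attached YARN-2566.000.patch, is to try each configured localDir in turn and move on when a copy fails, for example because that directory is out of space. The directory layout below is simplified for illustration.
{code}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// Sketch only: fall back to the next localDir instead of failing on the first
// one that cannot hold the token file.
public class TokenCopyFallback {
  public static Path copyTokenToAnyDir(FileContext lfs, Path nmPrivateTokens,
      String user, List<String> localDirs) throws IOException {
    IOException lastFailure = null;
    for (String localDir : localDirs) {
      try {
        Path appDir = new Path(localDir, user);            // layout simplified
        lfs.mkdir(appDir, new FsPermission((short) 0710), true);
        Path dst = new Path(appDir, nmPrivateTokens.getName());
        lfs.util().copy(nmPrivateTokens, dst);
        return dst;                      // first dir with enough space wins
      } catch (IOException e) {
        lastFailure = e;                 // e.g. "No space left on device"; try next dir
      }
    }
    throw lastFailure != null ? lastFailure
        : new IOException("No local directories available");
  }
}
{code}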
[jira] [Created] (YARN-2569) Log Handling for LRS API Changes
Xuan Gong created YARN-2569: --- Summary: Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2569: Attachment: YARN-2569.1.patch Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139973#comment-14139973 ] Xuan Gong commented on YARN-2569: - API changes required by log handling: 1. Add LogContext into the ASC. 2. Add LogContext into ContainerTokenIdentifier. When the RM schedules the containers, the LogContext will be set, which guarantees that all the containers of the same application have the same LogContext. 3. When the NM launches the containers, it will read the LogContext from the ContainerTokenIdentifier. Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.consolidate.7.patch Updated latest consolidated patch against trunk. Set patch available to Kick jenkins Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140036#comment-14140036 ] Hadoop QA commented on YARN-2569: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669912/YARN-2569.1.patch against trunk revision 6434572. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5038//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5038//console This message is automatically generated. Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)