[jira] [Commented] (YARN-2062) Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover
[ https://issues.apache.org/jira/browse/YARN-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998269#comment-13998269 ] Karthik Kambatla commented on YARN-2062: I propose having a dummy invalid transition in RMNodeImpl to capture all the invalid transitions. We can just log these at DEBUG level. Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover --- Key: YARN-2062 URL: https://issues.apache.org/jira/browse/YARN-2062 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla On busy clusters, we see several {{org.apache.hadoop.yarn.state.InvalidStateTransitonException}} for events invoked against NEW nodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
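Whatever form the catch-all takes, the end result is the same: the invalid transition is noted at DEBUG rather than surfacing as a noisy exception on every failover. The fragment below is only a sketch of that idea, using a plain try/catch around doTransition as a simpler stand-in for the dummy-transition approach Karthik proposes; the field names (stateMachine, nodeId, LOG) follow the usual YARN state-machine pattern and are assumptions, not the actual patch.
{code}
// Sketch only: demote invalid-transition reporting to DEBUG.
try {
  stateMachine.doTransition(event.getType(), event);
} catch (InvalidStateTransitonException e) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Can't handle event " + event.getType() + " at current state "
        + getState() + " for node " + nodeId, e);
  }
}
{code}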
[jira] [Commented] (YARN-1969) Fair Scheduler: Add policy for Earliest Deadline First
[ https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996444#comment-13996444 ] Maysam Yabandeh commented on YARN-1969: --- [~kkambatl], you are right. The title of the jira is misleading. The jira description talks about jobs that are about to finish and their estimated end time, but the title indicates a deadline. I guess the confusion came from the name of the earliest-deadline-first algorithm cited in the jira description. What we had in mind was a variation of the algorithm that (a) takes other parameters into account and (b) is not necessarily tied to a deadline. Fair Scheduler: Add policy for Earliest Deadline First -- Key: YARN-1969 URL: https://issues.apache.org/jira/browse/YARN-1969 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling*, however, they have a low priority since there are other jobs (usually much smaller newcomers) that are using resources well below their fair share; hence newly released containers are not offered to the big, yet close-to-finished job. Nevertheless, everybody would benefit from an unfair scheduling that offers the resources to the big job, since the sooner the big job finishes, the sooner it releases its many allocated resources for use by other jobs. In other words, what we require is a variation of *Earliest Deadline First scheduling* that takes into account the number of already-allocated resources and the estimated time to finish. http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the scheduling priority would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and supplied to the RM in the resource request messages. To be less susceptible to apps gaming the system, we can limit this scheduling to *only within a queue*: i.e., add an EarliestDeadlinePolicy that extends SchedulingPolicy and let queues use it by setting the schedulingPolicy field. -- This message was sent by Atlassian JIRA (v6.2#6252)
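To make the p(MEM, TIME) idea above concrete, here is a rough, purely illustrative comparator that could order applications inside a queue; the AppInfo holder, the field names, and the weighting formula are assumptions for the sketch, not a proposed default.
{code}
// Illustrative only: rank large, close-to-finishing apps first.
// A smaller score means a higher scheduling priority.
class EarliestFinishComparator implements java.util.Comparator<AppInfo> {
  @Override
  public int compare(AppInfo a, AppInfo b) {
    double scoreA = a.estimatedMinutesToFinish / Math.max(1.0, a.allocatedMemGb);
    double scoreB = b.estimatedMinutesToFinish / Math.max(1.0, b.allocatedMemGb);
    return Double.compare(scoreA, scoreB);
  }
}

// Hypothetical holder for the two inputs the description calls MEM and TIME.
class AppInfo {
  double allocatedMemGb;           // MEM: memory currently allocated to the app
  double estimatedMinutesToFinish; // TIME: e.g. from TaskRuntimeEstimator#estimatedRuntime
}
{code}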
[jira] [Commented] (YARN-2011) Fix typo and warning in TestLeafQueue
[ https://issues.apache.org/jira/browse/YARN-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998226#comment-13998226 ] Hudson commented on YARN-2011: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1753 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1753/]) YARN-2011. Fix typo and warning in TestLeafQueue (Contributed by Chen He) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593804) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java Fix typo and warning in TestLeafQueue - Key: YARN-2011 URL: https://issues.apache.org/jira/browse/YARN-2011 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Chen He Assignee: Chen He Priority: Trivial Fix For: 2.5.0 Attachments: YARN-2011-v2.patch, YARN-2011.patch a.assignContainers(clusterResource, node_0); assertEquals(2*GB, a.getUsedResources().getMemory()); assertEquals(2*GB, app_0.getCurrentConsumption().getMemory()); assertEquals(0*GB, app_1.getCurrentConsumption().getMemory()); assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G // Again one to user_0 since he hasn't exceeded user limit yet a.assignContainers(clusterResource, node_0); assertEquals(3*GB, a.getUsedResources().getMemory()); assertEquals(2*GB, app_0.getCurrentConsumption().getMemory()); assertEquals(1*GB, app_1.getCurrentConsumption().getMemory()); assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998231#comment-13998231 ] Hudson commented on YARN-2016: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1753 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1753/]) YARN-2016. Fix a bug in GetApplicationsRequestPBImpl to add the missed fields to proto. Contributed by Junping Du (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594085) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/GetApplicationsRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestGetApplicationsRequest.java Yarn getApplicationRequest start time range is not honored -- Key: YARN-2016 URL: https://issues.apache.org/jira/browse/YARN-2016 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Venkat Ranganathan Assignee: Junping Du Fix For: 2.4.1 Attachments: YARN-2016.patch, YarnTest.java When we query for the previous applications by creating an instance of GetApplicationsRequest and setting the start time range and application tag, we see that the start range provided is not honored and all applications with the tag are returned Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2053: - Attachment: YARN-2053.patch Attached a new patch with UT according to [~jianhe]'s suggestion. Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Attachments: YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-766) TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk
[ https://issues.apache.org/jira/browse/YARN-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-766: Summary: TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk (was: TestNodeManagerShutdown should use Shell to form the output path) TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk Key: YARN-766 URL: https://issues.apache.org/jira/browse/YARN-766 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.1.0-beta Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Minor Attachments: YARN-766.branch-2.txt, YARN-766.trunk.txt, YARN-766.txt File scriptFile = new File(tmpDir, "scriptFile.sh"); should be replaced with File scriptFile = Shell.appendScriptExtension(tmpDir, "scriptFile"); to match trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
[ https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993020#comment-13993020 ] Jason Lowe commented on YARN-2034: -- While updating it we may also want to clarify that it is a target retention size that only includes resources with PUBLIC and PRIVATE visibility and excludes resources with APPLICATION visibility. Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect Key: YARN-2034 URL: https://issues.apache.org/jira/browse/YARN-2034 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Priority: Minor The description in yarn-default.xml for yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per local directory, but according to the code it's a setting for the entire node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-893) Capacity scheduler allocates vcores to containers but does not report it in headroom
[ https://issues.apache.org/jira/browse/YARN-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998657#comment-13998657 ] Tsuyoshi OZAWA commented on YARN-893: - Thanks for updating the patch, [~kj-ki]. It looks much cleaner. Great job. One additional point: can we add unit tests for the utility methods in DefaultResourceCalculator? Capacity scheduler allocates vcores to containers but does not report it in headroom Key: YARN-893 URL: https://issues.apache.org/jira/browse/YARN-893 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta, 2.3.0 Reporter: Bikas Saha Assignee: Kenji Kikushima Attachments: YARN-893-2.patch, YARN-893.patch In non-DRF mode, it reports 0 vcores in the headroom but it allocates 1 vcore to containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2061) Revisit logging levels in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998524#comment-13998524 ] Karthik Kambatla commented on YARN-2061: We assume that the log level is at least INFO, so we add the is*Enabled() guards only for TRACE- and DEBUG-level messages. Revisit logging levels in ZKRMStateStore - Key: YARN-2061 URL: https://issues.apache.org/jira/browse/YARN-2061 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Labels: newbie ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
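Concretely, the guarded-logging pattern would look something like the following; this is only a sketch (commons-logging style), and the message text and variable names are placeholders rather than lines from the actual patch.
{code}
// Sketch: demote per-znode chatter to DEBUG and guard it so the string
// concatenation is skipped entirely when DEBUG is off.
if (LOG.isDebugEnabled()) {
  LOG.debug("Storing RMDelegationToken_" + sequenceNumber + " at " + nodePath);
}
{code}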
[jira] [Commented] (YARN-1957) ProportionalCapacitPreemptionPolicy handling of corner cases...
[ https://issues.apache.org/jira/browse/YARN-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999005#comment-13999005 ] Hudson commented on YARN-1957: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1957. Consider the max capacity of the queue when computing the ideal capacity for preemption. Contributed by Carlo Curino (cdouglas: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594414) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java ProportionalCapacitPreemptionPolicy handling of corner cases... --- Key: YARN-1957 URL: https://issues.apache.org/jira/browse/YARN-1957 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler, preemption Fix For: 3.0.0, 2.5.0, 2.4.1 Attachments: YARN-1957.patch, YARN-1957.patch, YARN-1957_test.patch The current version of ProportionalCapacityPreemptionPolicy should be improved to deal with the following two scenarios: 1) when rebalancing over-capacity allocations, it potentially preempts without considering the maxCapacity constraints of a queue (i.e., preempting possibly more than strictly necessary) 2) a zero capacity queue is preempted even if there is no demand (coherent with old use of zero-capacity to disabled queues) The proposed patch fixes both issues, and introduce few new test cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2027) YARN ignores host-specific resource requests
[ https://issues.apache.org/jira/browse/YARN-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998912#comment-13998912 ] Bikas Saha commented on YARN-2027: -- Yes. If strict node locality is needed then the rack should not be specified. If the rack is specified then it will allow relaxing locality up to the rack but no further. YARN ignores host-specific resource requests Key: YARN-2027 URL: https://issues.apache.org/jira/browse/YARN-2027 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.0 Environment: RHEL 6.1 YARN 2.4 Reporter: Chris Riccomini YARN appears to be ignoring host-level ContainerRequests. I am creating a container request with code that pretty closely mirrors the DistributedShell code:
{code}
protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) {
  info("Requesting %d container(s) with %dmb of memory" format (containers, memMb))
  val capability = Records.newRecord(classOf[Resource])
  val priority = Records.newRecord(classOf[Priority])
  priority.setPriority(0)
  capability.setMemory(memMb)
  capability.setVirtualCores(cpuCores)
  // Specifying a host in the String[] host parameter here seems to do nothing. Setting relaxLocality to false also doesn't help.
  (0 until containers).foreach(idx => amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority)))
}
{code}
When I run this code with a specific host in the ContainerRequest, YARN does not honor the request. Instead, it puts the container on an arbitrary host. This appears to be true for both the FifoScheduler and the CapacityScheduler. Currently, we are running the CapacityScheduler with the following settings:
{noformat}
<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>1</value>
    <description>Maximum number of applications that can be pending and running.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
    <description>Maximum percent of resources in the cluster which can be used to run application masters i.e. controls number of concurrent running applications.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default</value>
    <description>The queues at the this level (root is the root queue).</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>100</value>
    <description>Samza queue target capacity.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>1</value>
    <description>Default queue user limit a percentage from 0.0 to 1.0.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>100</value>
    <description>The maximum capacity of the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.state</name>
    <value>RUNNING</value>
    <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <value>*</value>
    <description>The ACL of who can submit jobs to the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
    <value>*</value>
    <description>The ACL of who can administer jobs on the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>40</value>
    <description>Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. Typically this should be set to number of nodes in the cluster, By default is setting approximately number of nodes in one rack which is 40.</description>
  </property>
</configuration>
{noformat}
Digging into the code a bit (props to [~jghoman] for finding this), we have a theory as to
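Reading Bikas's explanation above, a node-strict request would name only the host and turn off locality relaxation. The sketch below is a hedged illustration against the AMRMClient.ContainerRequest constructor as I understand it in 2.4; the host name and sizes are placeholders.
{code}
// Sketch: request a container on exactly one host and forbid the scheduler
// from relaxing locality to the rack or to an arbitrary node.
Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
Priority priority = Priority.newInstance(0);

AMRMClient.ContainerRequest request = new AMRMClient.ContainerRequest(
    capability,
    new String[] { "node17.example.com" }, // nodes: the only acceptable host (placeholder)
    null,                                  // racks: deliberately not specified
    priority,
    false);                                // relaxLocality = false -> node-strict
amClient.addContainerRequest(request);
{code}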
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998825#comment-13998825 ] Wangda Tan commented on YARN-2017: -- LGTM, +1 (non-binding). Please kick off a Jenkins build. Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch A bunch of the same code is repeated among schedulers, e.g. between FicaSchedulerNode and FSSchedulerNode. It would be good to merge and share it in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2053: - Attachment: YARN-2053.patch Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Attachments: YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1612) Change Fair Scheduler to not disable delay scheduling by default
[ https://issues.apache.org/jira/browse/YARN-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998887#comment-13998887 ] Chen He commented on YARN-1612: --- ping Change Fair Scheduler to not disable delay scheduling by default Key: YARN-1612 URL: https://issues.apache.org/jira/browse/YARN-1612 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Sandy Ryza Assignee: Chen He Attachments: YARN-1612-v2.patch, YARN-1612.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2061) Revisit logging levels in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998623#comment-13998623 ] Tsuyoshi OZAWA commented on YARN-2061: -- The logging in removeRMDelegationTokenState()/updateRMDelegationTokenAndSequenceNumberInternal()/removeRMDTMasterKeyState() can be at TRACE and DEBUG levels.
{code}
LOG.info("Done Loading applications from ZK state store");
{code}
About this log, how about moving it to the tail of loadRMAppState()? Revisit logging levels in ZKRMStateStore - Key: YARN-2061 URL: https://issues.apache.org/jira/browse/YARN-2061 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Labels: newbie ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2036) Document yarn.resourcemanager.hostname in ClusterSetup
[ https://issues.apache.org/jira/browse/YARN-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994001#comment-13994001 ] Hadoop QA commented on YARN-2036: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644162/YARN2036-02.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+0 tests included{color}. The patch appears to be a documentation patch that doesn't require tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3729//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3729//console This message is automatically generated. Document yarn.resourcemanager.hostname in ClusterSetup -- Key: YARN-2036 URL: https://issues.apache.org/jira/browse/YARN-2036 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Fix For: 2.5.0 Attachments: YARN2036-01.patch, YARN2036-02.patch ClusterSetup doesn't talk about yarn.resourcemanager.hostname - most people should just be able to use that directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999139#comment-13999139 ] Hadoop QA commented on YARN-1365: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645051/YARN-1365.001.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3746//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3746//console This message is automatically generated. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2053: -- Attachment: YARN-2053.patch Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Fix For: 2.4.1 Attachments: YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999094#comment-13999094 ] Jian He commented on YARN-2065: --- Looking at the exception posted in SLIDER-34, the problem is that the AM can get new containers from the RM, but cannot launch them on the NM because of the following method. The token is generated with the previous attempt's id (taken from the container) instead of the current attemptId, and the NM checks the attemptId from the NMToken against the attemptId from the container.
{code}
public NMToken createAndGetNMToken(String applicationSubmitter,
    ApplicationAttemptId appAttemptId, Container container) {
  try {
    this.readLock.lock();
    HashSet<NodeId> nodeSet = this.appAttemptToNodeKeyMap.get(appAttemptId);
    NMToken nmToken = null;
    if (nodeSet != null) {
      if (!nodeSet.contains(container.getNodeId())) {
        LOG.info("Sending NMToken for nodeId : " + container.getNodeId()
            + " for container : " + container.getId());
        Token token = createNMToken(
            container.getId().getApplicationAttemptId(), // <-- attempt id comes from the container
            container.getNodeId(), applicationSubmitter);
        nmToken = NMToken.newInstance(container.getNodeId(), token);
        nodeSet.add(container.getNodeId());
      }
    }
    return nmToken;
  } finally {
    this.readLock.unlock();
  }
}
{code}
Changing this method will fix this problem. But another problem is that ContainerManagerImpl#authorizeGetAndStopContainerRequest also requires the previous NMToken to talk to the previous container and the current NMToken to talk to the current container. Luckily, it currently does not throw an exception but just logs error messages. We also need to change the NM side to check against the applicationId rather than the attemptId. AM cannot create new containers after restart-NM token from previous attempt used - Key: YARN-2065 URL: https://issues.apache.org/jira/browse/YARN-2065 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Steve Loughran Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers. The Slider minicluster test {{TestKilledAM}} can replicate this reliably - it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
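Per the analysis above, the fix would be to mint the token for the attempt that is actually registering rather than the attempt id embedded in the container. A hedged sketch of the changed call (not the committed patch):
{code}
// Sketch: use the current attempt's id (the appAttemptId parameter) instead of
// container.getId().getApplicationAttemptId(), which may point at the previous,
// now-finished attempt.
Token token = createNMToken(appAttemptId,
    container.getNodeId(), applicationSubmitter);
nmToken = NMToken.newInstance(container.getNodeId(), token);
{code}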
[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999258#comment-13999258 ] Tsuyoshi OZAWA commented on YARN-1514: -- I'll make these parameters configurable: 1. number of applications 2. number of application attempts 3. ZK connection configuration (host:port) A result message with the WIP patch is as follows: {quote} ZKRMStateStore takes 12644 msec to loadState. {quote} Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.5.0 Attachments: YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations, as discussed in YARN-1307, YARN-1378 and so on. In particular, ZKRMStateStore#loadState is called when an RM-HA cluster does a failover, so its execution time impacts the failover time of RM-HA. We need a utility to benchmark the execution time of ZKRMStateStore#loadState as a development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
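A rough sketch of what such a benchmark driver could look like, timing RMStateStore#loadState against a pre-populated store; createZKStore() and populate() are hypothetical helpers standing in for whatever the WIP patch actually does.
{code}
// Sketch only: time loadState() after writing numApps * attemptsPerApp records.
public static void main(String[] args) throws Exception {
  int numApps = Integer.parseInt(args[0]);        // 1. number of applications
  int attemptsPerApp = Integer.parseInt(args[1]); // 2. number of application attempts
  String zkHostPort = args[2];                    // 3. ZK connection (host:port)

  RMStateStore store = createZKStore(zkHostPort); // hypothetical: build and start a ZKRMStateStore
  populate(store, numApps, attemptsPerApp);       // hypothetical: write synthetic app/attempt state

  long start = System.currentTimeMillis();
  store.loadState();                              // the call being benchmarked
  long elapsedMs = System.currentTimeMillis() - start;
  System.out.println("ZKRMStateStore takes " + elapsedMs + " msec to loadState.");
}
{code}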
[jira] [Commented] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times
[ https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998941#comment-13998941 ] Mayank Bansal commented on YARN-2055: - YARN-2022 is about avoiding killing the AM; however, this issue is more about how we launch the AM after preemption. There can be situations where a queue gets some capacity for one heartbeat, that capacity is then reclaimed by the other queue, the AM is killed again, and the job ends up failing. Based on the comments on YARN-2022, I don't see that case being handled there. Thanks, Mayank Preemption: Jobs are failing due to AMs are getting launched and killed multiple times -- Key: YARN-2055 URL: https://issues.apache.org/jira/browse/YARN-2055 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal If Queue A does not have enough capacity to run the AM, the AM will borrow capacity from Queue B. In that case the AM will be killed when Queue B reclaims its capacity; the AM will then be launched and killed again, and the job will fail. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1569) For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting
[ https://issues.apache.org/jira/browse/YARN-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998992#comment-13998992 ] zhihai xu commented on YARN-1569: - Hi, I want to work on this issue (YARN-1569). Can someone assign it to me? Thanks, zhihai For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting - Key: YARN-1569 URL: https://issues.apache.org/jira/browse/YARN-1569 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Junping Du Priority: Minor Labels: newbie As per http://wiki.apache.org/hadoop/CodeReviewChecklist, we should always check for the appropriate type before casting. handle(SchedulerEvent) in FifoScheduler and CapacityScheduler doesn't check so far (no bug there now) but should be improved to match FairScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1987) Wrapper for leveldb DBIterator to aid in handling database exceptions
[ https://issues.apache.org/jira/browse/YARN-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998195#comment-13998195 ] Hudson commented on YARN-1987: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1779 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1779/]) YARN-1987. Wrapper for leveldb DBIterator to aid in handling database exceptions. (Jason Lowe via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593757) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/LeveldbIterator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/utils * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/utils/TestLeveldbIterator.java Wrapper for leveldb DBIterator to aid in handling database exceptions - Key: YARN-1987 URL: https://issues.apache.org/jira/browse/YARN-1987 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 2.5.0 Attachments: YARN-1987.patch, YARN-1987v2.patch Per discussions in YARN-1984 and MAPREDUCE-5652, it would be nice to have a utility wrapper around leveldb's DBIterator to translate the raw RuntimeExceptions it can throw into DBExceptions to make it easier to handle database errors while iterating. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-941) RM Should have a way to update the tokens it has for a running application
[ https://issues.apache.org/jira/browse/YARN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998704#comment-13998704 ] Steve Loughran commented on YARN-941: - We've been doing AM restart and have already seen some token renewal problems - these may be worth fixing first: SLIDER-46 and SLIDER-34, with YARN-side issues YARN-2065 and YARN-2053. Fixing those probably comes before working on a patch here. RM Should have a way to update the tokens it has for a running application -- Key: YARN-941 URL: https://issues.apache.org/jira/browse/YARN-941 Project: Hadoop YARN Issue Type: Sub-task Reporter: Robert Joseph Evans Assignee: Xuan Gong When an application is submitted to the RM it includes with it a set of tokens that the RM will renew on behalf of the application, that will be passed to the AM when the application is launched, and will be used when launching the application to access HDFS to download files on behalf of the application. For long lived applications/services these tokens can expire, and then the tokens that the AM has will be invalid, and the tokens that the RM had will also not work to launch a new AM. We need to provide an API that will allow the RM to replace the current tokens for this application with a new set. To avoid any real race issues, I think this API should be something that the AM calls, so that the client can connect to the AM with a new set of tokens it got using kerberos, then the AM can inform the RM of the new set of tokens and quickly update its tokens internally to use these new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999451#comment-13999451 ] Anubhav Dhoot commented on YARN-1366: - Seems like we are going with no resync API for now, as per the current patch. I think it's a good idea to hold off on the new API unless we see a need; I don't feel there is a strong case for it yet. There are a few issues I see which will need a little more work. Pending releases - the AM forgets about a release request once it is made. We will have to reissue a release request after RM restart to be safe (and also make sure the RM can handle a duplicate of that). Otherwise we have a resource leak if the RM had not issued the release before it restarted. One way is to remember all releases in a new Set<ContainerId> pendingReleases in RMContainerRequestor and remove them by processing getCompletedContainersStatuses in makeRemoteRequest or a new function that it calls.
{code}
+    blacklistAdditions.addAll(blacklistedNodes);
{code}
Blacklisting has logic in ignoreBlacklisting to ignore it if we cross a threshold. So we can do:
{code}
if (!ignoreBlacklisting.get()) {
  blacklistAdditions.addAll(blacklistedNodes);
}
{code}
There are a few places where the line exceeds 80 chars. Otherwise it looks good. Let's add some tests and validate this. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0, and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
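A hedged sketch of the pendingReleases bookkeeping described above; the field and method placement in RMContainerRequestor, and the resendReleases() helper, are assumptions for illustration only.
{code}
// Sketch: remember every container we have asked the RM to release, and only
// forget it once the RM reports that container as completed. This lets the AM
// re-send outstanding releases after an RM restart/resync.
private final Set<ContainerId> pendingReleases = new HashSet<ContainerId>();

void release(ContainerId containerId) {
  pendingReleases.add(containerId);
  // ... also add it to the release list of the next allocate request ...
}

void onContainersCompleted(List<ContainerStatus> completed) {
  for (ContainerStatus status : completed) {
    // Safe even if this completion was never one of our release requests.
    pendingReleases.remove(status.getContainerId());
  }
}

void onResync() {
  // After RM restart, re-issue everything still pending; the RM must treat
  // duplicate release requests as idempotent.
  resendReleases(pendingReleases); // hypothetical helper
}
{code}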
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999689#comment-13999689 ] Tsuyoshi OZAWA commented on YARN-1474: -- [~kkambatl], can you check a latest patch and kick the Jenkins? I have no permission to kick the Jenkins. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2055) Preemption: Jobs are failing due to AMs are getting launched and killed multiple times
[ https://issues.apache.org/jira/browse/YARN-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999724#comment-13999724 ] Sunil G commented on YARN-2055: --- Thank you Mayank for the clarification. I have a small doubt here: in such scenarios, should the scheduler stop assigning any more containers to Queue A? Assuming Queue B has demand, only Queue B's requests should be served first. Is that correct? Preemption: Jobs are failing due to AMs are getting launched and killed multiple times -- Key: YARN-2055 URL: https://issues.apache.org/jira/browse/YARN-2055 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal If Queue A does not have enough capacity to run the AM, the AM will borrow capacity from Queue B. In that case the AM will be killed when Queue B reclaims its capacity; the AM will then be launched and killed again, and the job will fail. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999771#comment-13999771 ] Karthik Kambatla commented on YARN-2054: bq. If we want these configs to match up with yarn.resourcemanager.zk-timeout-ms and (as YARN-1878 is trying) if that can change, we need to somehow make them linked dynamically? These configs need not match, but in an HA setting, it might not make a lot of sense to have them significantly different. bq. Does it make sense to link with the config HA enabled also ? If we have another RM sitting standby, we may want to failover quickly. But if we have only one RM, and somehow ZK is unavailable, RM will only retry for 10 seconds and shuts down. Good point. Maybe we can come up with a good value for the retry interval based on whether HA is enabled and yarn.resourcemanager.zk-timeout-ms. Poor defaults for YARN ZK configs for retries and retry-inteval --- Key: YARN-2054 URL: https://issues.apache.org/jira/browse/YARN-2054 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2054-1.patch Currently, we have the following default values: # yarn.resourcemanager.zk-num-retries - 500 # yarn.resourcemanager.zk-retry-interval-ms - 2000 This leads to a cumulative 1000 seconds before the RM gives up trying to connect to ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
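One hedged way of realizing that suggestion is sketched below; the property names are the ones already discussed on this jira, the fallback literals are placeholders, and the HA-based derivation is only the idea from the comment above, not committed behavior.
{code}
// Sketch: with HA, spread the retries over roughly one ZK session timeout so a
// standby RM can take over quickly; a lone RM keeps the configured interval.
long zkSessionTimeoutMs = conf.getLong("yarn.resourcemanager.zk-timeout-ms", 10000);
int numRetries = conf.getInt("yarn.resourcemanager.zk-num-retries", 500);

long retryIntervalMs;
if (HAUtil.isHAEnabled(conf)) {
  retryIntervalMs = Math.max(1, zkSessionTimeoutMs / numRetries);
} else {
  retryIntervalMs = conf.getLong("yarn.resourcemanager.zk-retry-interval-ms", 2000);
}
{code}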
[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999558#comment-13999558 ] Junping Du commented on YARN-1338: -- Hi [~jlowe], thanks for contributing a patch here. It looks like the latest patch includes some code from YARN-1987, which is already committed. Would you mind updating it so that I can start reviewing and commenting? Thanks! Recover localized resource cache state upon nodemanager restart --- Key: YARN-1338 URL: https://issues.apache.org/jira/browse/YARN-1338 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1338.patch, YARN-1338v2.patch, YARN-1338v3-and-YARN-1987.patch Today when the node manager restarts we clean up all the distributed cache files from disk. This is definitely not ideal from 2 aspects. * For work-preserving restart we definitely want them, as running containers are using them * Even for non-work-preserving restart this will be useful in the sense that we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2068) FairScheduler uses the same ResourceCalculator for all policies
Karthik Kambatla created YARN-2068: -- Summary: FairScheduler uses the same ResourceCalculator for all policies Key: YARN-2068 URL: https://issues.apache.org/jira/browse/YARN-2068 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla FairScheduler uses the same ResourceCalculator for all policies including DRF. Need to fix that. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999173#comment-13999173 ] Hadoop QA commented on YARN-2054: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644753/yarn-2054-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3747//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3747//console This message is automatically generated. Poor defaults for YARN ZK configs for retries and retry-inteval --- Key: YARN-2054 URL: https://issues.apache.org/jira/browse/YARN-2054 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2054-1.patch Currenly, we have the following default values: # yarn.resourcemanager.zk-num-retries - 500 # yarn.resourcemanager.zk-retry-interval-ms - 2000 This leads to a cumulate 1000 seconds before the RM gives up trying to connect to the ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998625#comment-13998625 ] Tsuyoshi OZAWA commented on YARN-1365: -- Oops, this comment is for YARN-1367. I'll comment it on YARN-1367. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999233#comment-13999233 ] Hadoop QA commented on YARN-2017: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645077/YARN-2017.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1277 javac compiler warnings (more than the trunk's current 1276 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3748//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3748//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3748//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3748//console This message is automatically generated. Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch A bunch of same code is repeated among schedulers, e.g: between FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1751) Improve MiniYarnCluster for log aggregation testing
[ https://issues.apache.org/jira/browse/YARN-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999003#comment-13999003 ] Hudson commented on YARN-1751: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1751. Improve MiniYarnCluster for log aggregation testing. Contributed by Ming Ma (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1594275) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java Improve MiniYarnCluster for log aggregation testing --- Key: YARN-1751 URL: https://issues.apache.org/jira/browse/YARN-1751 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Ming Ma Assignee: Ming Ma Fix For: 3.0.0, 2.5.0 Attachments: YARN-1751-trunk.patch, YARN-1751.patch MiniYarnCluster specifies an individual remote log aggregation root dir for each NM. Test code that uses MiniYarnCluster won't be able to get the value of the log aggregation root dir. The following code isn't necessary in MiniYarnCluster:
{code}
File remoteLogDir = new File(testWorkDir,
    MiniYARNCluster.this.getName() + "-remoteLogDir-nm-" + index);
remoteLogDir.mkdir();
config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR, remoteLogDir.getAbsolutePath());
{code}
In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to the FileContext.getFileContext() call. -- This message was sent by Atlassian JIRA (v6.2#6252)
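For the LogCLIHelpers point, the change amounts to passing the helper's configuration through; a minimal, hedged sketch (how LogCLIHelpers actually obtains its conf is elided here):
{code}
// Sketch: use this tool's Configuration so remote-log-dir settings are honored,
// instead of the default picked up by the no-arg overload.
FileContext fc = FileContext.getFileContext(getConf()); // was: FileContext.getFileContext()
{code}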
[jira] [Commented] (YARN-2027) YARN ignores host-specific resource requests
[ https://issues.apache.org/jira/browse/YARN-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1343#comment-1343 ] Chris Riccomini commented on YARN-2027: --- K, feel free to close. I'm fairly sure that I tried a host with a null rack during testing and it didn't work, but it might have been on the FIFO scheduler. Either way, we've figured out a workaround to our problem, and [~zhiguohong] has verified functionality on a real cluster, so I'm OK with closing this ticket out. YARN ignores host-specific resource requests Key: YARN-2027 URL: https://issues.apache.org/jira/browse/YARN-2027 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.0 Environment: RHEL 6.1 YARN 2.4 Reporter: Chris Riccomini YARN appears to be ignoring host-level ContainerRequests. I am creating a container request with code that pretty closely mirrors the DistributedShell code:
{code}
protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) {
  info("Requesting %d container(s) with %dmb of memory" format (containers, memMb))
  val capability = Records.newRecord(classOf[Resource])
  val priority = Records.newRecord(classOf[Priority])
  priority.setPriority(0)
  capability.setMemory(memMb)
  capability.setVirtualCores(cpuCores)
  // Specifying a host in the String[] host parameter here seems to do nothing. Setting relaxLocality to false also doesn't help.
  (0 until containers).foreach(idx => amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority)))
}
{code}
When I run this code with a specific host in the ContainerRequest, YARN does not honor the request. Instead, it puts the container on an arbitrary host. This appears to be true for both the FifoScheduler and the CapacityScheduler. Currently, we are running the CapacityScheduler with the following settings:
{noformat}
<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>1</value>
    <description>Maximum number of applications that can be pending and running.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
    <description>Maximum percent of resources in the cluster which can be used to run application masters i.e. controls number of concurrent running applications.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default</value>
    <description>The queues at the this level (root is the root queue).</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>100</value>
    <description>Samza queue target capacity.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>1</value>
    <description>Default queue user limit a percentage from 0.0 to 1.0.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>100</value>
    <description>The maximum capacity of the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.state</name>
    <value>RUNNING</value>
    <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <value>*</value>
    <description>The ACL of who can submit jobs to the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
    <value>*</value>
    <description>The ACL of who can administer jobs on the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>40</value>
    <description>Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. Typically this should be set to number of nodes in the cluster, By default is setting approximately number of nodes in one rack which
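For context, a minimal sketch (not from this ticket's patches; the class and method names below are made up) of pinning a request to one host with the AMRMClient API discussed above: supply both the host and its rack, and set relaxLocality to false so the scheduler cannot fall back to an arbitrary node. Passing the rack explicitly also avoids the null-rack case mentioned in the comment.
{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class HostLocalRequestSketch {
  public static void addHostLocalRequest(AMRMClient<ContainerRequest> amClient,
      String host, String rack, int memMb, int cpuCores) {
    Resource capability = Resource.newInstance(memMb, cpuCores);
    Priority priority = Priority.newInstance(0);
    // Ask for this specific host, name the rack that owns it, and keep the request
    // pinned to the host by disabling locality relaxation.
    ContainerRequest request = new ContainerRequest(capability,
        new String[] { host }, new String[] { rack }, priority, false);
    amClient.addContainerRequest(request);
  }
}
{code}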
[jira] [Updated] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2049: -- Attachment: YARN-2049.2.patch Fix a bug in the previous patch: when creating the delegation token, we shouldn't use the current user to serve as the owner of the DT, because the current user is going to be the user of the timeline server itself. On the other hand, we also cannot use the remote user from AuthenticationFilter, because before passing through AuthenticationFilter the user is still not logged in, and the remote user from HttpServletRequest is going to be dr.who by default, given that the static user filter is applied before it. The right way is to get the user name from the authentication token, because at this point Kerberos authentication has already passed, and the authentication token's user name is actually the client's Kerberos principal, which is the right one to use. Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998948#comment-13998948 ] Jian He commented on YARN-2053: --- LGTM, +1, submit the same patch to kick jenkins Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Fix For: 2.4.1 Attachments: YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1514: - Attachment: YARN-1514.wip.patch Attached a WIP patch. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.5.0 Attachments: YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. In particular, ZKRMStateStore#loadState is called when an RM-HA cluster does a failover; therefore, its execution time impacts the failover time of RM-HA. We need a utility to benchmark the execution time of ZKRMStateStore#loadState as a development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
Steve Loughran created YARN-2065: Summary: AM cannot create new containers after restart-NM token from previous attempt used Key: YARN-2065 URL: https://issues.apache.org/jira/browse/YARN-2065 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Steve Loughran Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers. The Slider minicluster test {{TestKilledAM}} can replicate this reliably -it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1424) RMAppAttemptImpl should precompute a zeroed ApplicationResourceUsageReport to return when attempt not active
[ https://issues.apache.org/jira/browse/YARN-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang reassigned YARN-1424: Assignee: Ray Chiang RMAppAttemptImpl should precompute a zeroed ApplicationResourceUsageReport to return when attempt not active Key: YARN-1424 URL: https://issues.apache.org/jira/browse/YARN-1424 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Sandy Ryza Assignee: Ray Chiang Priority: Minor Labels: newbie RMAppImpl has a DUMMY_APPLICATION_RESOURCE_USAGE_REPORT to return when the caller of createAndGetApplicationReport doesn't have access. RMAppAttemptImpl should have something similar for getApplicationResourceUsageReport. It also might make sense to put the dummy report into ApplicationResourceUsageReport and allow both to use it. A test would also be useful to verify that RMAppAttemptImpl#getApplicationResourceUsageReport doesn't return null if the scheduler doesn't have a report to return. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1569) For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting
[ https://issues.apache.org/jira/browse/YARN-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1569: Assignee: (was: Anubhav Dhoot) For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting - Key: YARN-1569 URL: https://issues.apache.org/jira/browse/YARN-1569 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Junping Du Priority: Minor Labels: newbie As noted in http://wiki.apache.org/hadoop/CodeReviewChecklist, we should always check for the appropriate type before casting. handle(SchedulerEvent) in FifoScheduler and CapacityScheduler doesn't do this check so far (no bug there now), but it should be improved to match FairScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
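For illustration, a minimal sketch of the instanceof-before-cast pattern the checklist asks for, roughly mirroring what FairScheduler#handle already does; the class name and the addNode() helper below are hypothetical.
{code}
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeAddedSchedulerEvent;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEvent;

public class TypeCheckedHandleSketch {
  public void handle(SchedulerEvent event) {
    switch (event.getType()) {
    case NODE_ADDED:
      // verify the concrete event type before casting
      if (!(event instanceof NodeAddedSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      addNode(((NodeAddedSchedulerEvent) event).getAddedRMNode());
      break;
    default:
      // the remaining event types would get the same instanceof guard
      break;
    }
  }

  private void addNode(RMNode node) {
    // scheduler-specific bookkeeping would go here
  }
}
{code}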
[jira] [Created] (YARN-2067) FairScheduler update/continuous-scheduling threads should start only after the scheduler is started
Karthik Kambatla created YARN-2067: -- Summary: FairScheduler update/continuous-scheduling threads should start only after the scheduler is started Key: YARN-2067 URL: https://issues.apache.org/jira/browse/YARN-2067 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
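The ticket has no description yet; as a hedged sketch of what the title suggests, the background threads would be started from serviceStart() rather than from the constructor or reinitialize(). Everything below (class name, fields, thread bodies) is hypothetical.
{code}
import org.apache.hadoop.service.AbstractService;

public class SchedulerThreadStartSketch extends AbstractService {
  private final boolean continuousSchedulingEnabled = true;
  private final Thread updateThread = new Thread(new Runnable() {
    @Override public void run() { /* periodic fair-share update loop would go here */ }
  }, "FairSchedulerUpdateThread");
  private final Thread schedulingThread = new Thread(new Runnable() {
    @Override public void run() { /* continuous-scheduling loop would go here */ }
  }, "ContinuousSchedulingThread");

  public SchedulerThreadStartSketch() {
    super(SchedulerThreadStartSketch.class.getName());
  }

  @Override
  protected void serviceStart() throws Exception {
    // Start the background threads only once the scheduler service itself starts,
    // instead of in the constructor/reinitialize path.
    updateThread.start();
    if (continuousSchedulingEnabled) {
      schedulingThread.start();
    }
    super.serviceStart();
  }
}
{code}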
[jira] [Updated] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2056: -- Fix Version/s: (was: 2.1.0-beta) Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1550: Attachment: YARN-1550.001.patch Updated caolong's patch NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Priority: Critical Fix For: 2.2.1 Attachments: YARN-1550.001.patch, YARN-1550.patch Three steps: 1. debug at RMAppManager#submitApplication after the code if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = "Application with id " + applicationId + " is already present! Cannot add a duplicate!"; LOG.warn(message); throw RPCUtil.getRemoteException(message); } 2. submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3. go to page http://ip:50030/cluster/scheduler and find a 500 ERROR! the log: {noformat} 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1338: - Attachment: YARN-1338v4.patch Updating patch to trunk. Recover localized resource cache state upon nodemanager restart --- Key: YARN-1338 URL: https://issues.apache.org/jira/browse/YARN-1338 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1338.patch, YARN-1338v2.patch, YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch Today when node manager restarts we clean up all the distributed cache files from disk. This is definitely not ideal from 2 aspects. * For work preserving restart we definitely want them as running containers are using them * For even non work preserving restart this will be useful in the sense that we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2066) Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder()
[ https://issues.apache.org/jira/browse/YARN-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo reassigned YARN-2066: - Assignee: Hong Zhiguo Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder() --- Key: YARN-2066 URL: https://issues.apache.org/jira/browse/YARN-2066 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Hong Zhiguo Priority: Minor {code} if (this.finish != null) { builder.setFinishBegin(start.getMinimumLong()); builder.setFinishEnd(start.getMaximumLong()); } {code} this.finish should be referenced in the if block. -- This message was sent by Atlassian JIRA (v6.2#6252)
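Presumably the fix is simply to read this.finish instead of start when setting the finish-time bounds; a sketch of that change, mirroring the snippet above (hedged: the attached patch itself is not reproduced here):
{code}
if (this.finish != null) {
  builder.setFinishBegin(this.finish.getMinimumLong());
  builder.setFinishEnd(this.finish.getMaximumLong());
}
{code}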
[jira] [Commented] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart
[ https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999018#comment-13999018 ] Hudson commented on YARN-1362: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1362. Distinguish between nodemanager shutdown for decommission vs shutdown for restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594421) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/Context.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java Distinguish between nodemanager shutdown for decommission vs shutdown for restart - Key: YARN-1362 URL: https://issues.apache.org/jira/browse/YARN-1362 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 2.5.0 Attachments: YARN-1362.patch When a nodemanager shuts down it needs to determine if it is likely to be restarted. If a restart is likely then it needs to preserve container directories, logs, distributed cache entries, etc. If it is being shutdown more permanently (e.g.: like a decommission) then the nodemanager should cleanup directories and logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1936: -- Attachment: YARN-1936.2.patch Uploaded a new patch: we shouldn't request the timeline DT when the timeline service is not enabled. Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1936.1.patch, YARN-1936.2.patch TimelineClient should be able to talk to the timeline server with Kerberos authentication or a delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
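As a rough sketch of the guard described above (the class and helper method below are hypothetical; only the YarnConfiguration flag names are real), the client would check the standard enable switch before asking for a timeline delegation token:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelineDtGuardSketch {
  public void maybeFetchTimelineToken(Configuration conf) {
    // Only request the timeline DT if the timeline service is actually enabled.
    if (conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED,
        YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED)) {
      getTimelineDelegationToken(conf);
    }
  }

  private void getTimelineDelegationToken(Configuration conf) {
    // would call the timeline client to obtain the delegation token
  }
}
{code}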
[jira] [Commented] (YARN-2061) Revisit logging levels in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998624#comment-13998624 ] Tsuyoshi OZAWA commented on YARN-2061: -- s/RACE/TRACE/ Revisit logging levels in ZKRMStateStore - Key: YARN-2061 URL: https://issues.apache.org/jira/browse/YARN-2061 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Labels: newbie ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2061) Revisit logging levels in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2061: - Attachment: YARN2061-01.patch Patch to move several LOG.info messages to LOG.debug. Cleans up messages a bit and adds some consistency to messages from the same method. Revisit logging levels in ZKRMStateStore - Key: YARN-2061 URL: https://issues.apache.org/jira/browse/YARN-2061 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Labels: newbie Attachments: YARN2061-01.patch ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
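For illustration, a small sketch of the downgrade being applied (the message text and class below are made up): a message moved from LOG.info to LOG.debug and guarded so the string concatenation is skipped when DEBUG is off.
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ZkStoreLoggingSketch {
  private static final Log LOG = LogFactory.getLog(ZkStoreLoggingSketch.class);

  void logStoredPath(String appIdPath) {
    // formerly LOG.info(...); now only emitted at DEBUG level
    if (LOG.isDebugEnabled()) {
      LOG.debug("Storing info for app at: " + appIdPath);
    }
  }
}
{code}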
[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2017: -- Attachment: YARN-2017.4.patch Same patch to kick Jenkins Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch A bunch of the same code is repeated among the schedulers, e.g. between FiCaSchedulerNode and FSSchedulerNode. It would be good to merge and share it in a common base class. -- This message was sent by Atlassian JIRA (v6.2#6252)
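As a rough sketch of the idea, not the attached patches: fields and bookkeeping duplicated between FiCaSchedulerNode and FSSchedulerNode could be hoisted into one abstract base. The class and method names below are illustrative only.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public abstract class CommonSchedulerNodeSketch {
  private final Resource availableResource = Resources.createResource(0, 0);
  private final Resource usedResource = Resources.createResource(0, 0);

  // shared bookkeeping that today lives in both FiCaSchedulerNode and FSSchedulerNode
  public synchronized void deductAvailableResource(Resource resource) {
    Resources.subtractFrom(availableResource, resource);
    Resources.addTo(usedResource, resource);
  }

  public synchronized void addAvailableResource(Resource resource) {
    Resources.addTo(availableResource, resource);
    Resources.subtractFrom(usedResource, resource);
  }
}
{code}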
[jira] [Assigned] (YARN-1799) Enhance LocalDirAllocator in NM to consider DiskMaxUtilization cutoff
[ https://issues.apache.org/jira/browse/YARN-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-1799: - Assignee: Sunil G Enhance LocalDirAllocator in NM to consider DiskMaxUtilization cutoff - Key: YARN-1799 URL: https://issues.apache.org/jira/browse/YARN-1799 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Sunil G Assignee: Sunil G LocalDirAllocator provides paths for all tasks for their local writes. It considers the good list of directories which are selected by the health-check mechanism in LocalDirsHandlerService. getLocalPathForWrite() considers whether the requested size can fit in the capacity of the last-accessed directory. If more tasks ask for a path from LocalDirAllocator, then it is possible that the allocation is done based on the current disk availability at that given time. But this path could have earlier been given to some other tasks to write to, and they may still be writing sequentially. It is better to also check for an upper cutoff on disk utilization. -- This message was sent by Atlassian JIRA (v6.2#6252)
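A hypothetical illustration of such a cutoff (the 90% threshold and all names below are made up, not from any patch): a local directory is skipped once its utilization crosses an upper bound, even if the requested size would still fit.
{code}
import java.io.File;

public class DirUtilizationCutoffSketch {
  static final double MAX_UTILIZATION = 0.90;

  static boolean underCutoff(String localDir) {
    File dir = new File(localDir);
    long total = dir.getTotalSpace();
    long usable = dir.getUsableSpace();
    if (total == 0) {
      return false;
    }
    // fraction of the disk already in use
    double utilization = (double) (total - usable) / total;
    return utilization < MAX_UTILIZATION;
  }
}
{code}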
[jira] [Commented] (YARN-1981) Nodemanager version is not updated when a node reconnects
[ https://issues.apache.org/jira/browse/YARN-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999021#comment-13999021 ] Hudson commented on YARN-1981: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1981. Nodemanager version is not updated when a node reconnects (Jason Lowe via jeagles) (jeagles: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594358) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java Nodemanager version is not updated when a node reconnects - Key: YARN-1981 URL: https://issues.apache.org/jira/browse/YARN-1981 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 3.0.0, 2.5.0 Attachments: YARN-1981.patch When a nodemanager is quickly restarted and happens to change versions during the restart (e.g.: rolling upgrade scenario) the NM version as reported by the RM is not updated. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2070) DistributedShell publish unfriendly user information to the timeline server
Zhijie Shen created YARN-2070: - Summary: DistributedShell publish unfriendly user information to the timeline server Key: YARN-2070 URL: https://issues.apache.org/jira/browse/YARN-2070 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Priority: Minor Below is the code that uses the string form of the current user object as the user value. {code} entity.addPrimaryFilter("user", UserGroupInformation.getCurrentUser().toString()); {code} When we use Kerberos authentication, it's going to output the full name, such as zjshen/localhost@LOCALHOST (auth.KERBEROS). It is not user-friendly for searching by the primary filters. It's better to use shortUserName instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2070) DistributedShell publishes unfriendly user information to the timeline server
[ https://issues.apache.org/jira/browse/YARN-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2070: -- Summary: DistributedShell publishes unfriendly user information to the timeline server (was: DistributedShell publish unfriendly user information to the timeline server) DistributedShell publishes unfriendly user information to the timeline server - Key: YARN-2070 URL: https://issues.apache.org/jira/browse/YARN-2070 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Priority: Minor Below is the code that uses the string form of the current user object as the user value. {code} entity.addPrimaryFilter("user", UserGroupInformation.getCurrentUser().toString()); {code} When we use Kerberos authentication, it's going to output the full name, such as zjshen/localhost@LOCALHOST (auth.KERBEROS). It is not user-friendly for searching by the primary filters. It's better to use shortUserName instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
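The suggested change, sketched against the snippet above (hedged: this is not a committed patch; entity is the TimelineEntity from that snippet):
{code}
entity.addPrimaryFilter("user",
    UserGroupInformation.getCurrentUser().getShortUserName());
{code}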
[jira] [Updated] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.
[ https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1996: --- Description: Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs as demonstrated by MAPREDUCE-5817, and downgrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster because the current node is declared unusable and all its containers are killed and rescheduled on different nodes. To mitigate this, we experiment with a patch that allows containers already running on a node turning UNHEALTHY to complete (drain) whereas no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of NM. To this end, we have to write a health script such that it can deterministically report UNHEALTHY. For example with {code} if [ -e $1 ] ; then echo ERROR Node decommmissioning via health script hack fi {code} In the current version patch, the behavior is controlled by a boolean property {{yarn.nodemanager.unhealthy.drain.containers}}. More versatile policies are possible in the future work. Currently, the health state of a node is binary determined based on the disk checker and the health script ERROR outputs. However, we can as well interpret health script output similar to java logging levels (one of which is ERROR) such as WARN, FATAL. Each level can then be treated differently. E.g., - FATAL: unusable like today - ERROR: drain - WARN: halve the node capacity. complimented with some equivalence rules such as 3 WARN messages == ERROR, 2*ERROR == FATAL, etc. was: Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs as demonstrated by MAPREDUCE-5817, and downgrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster because the current node is declared unusable and all its containers are killed and rescheduled on different nodes. To mitigate this, we experiment with a patch that allows containers already running on a node turning UNHEALTHY to complete (drain) whereas no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of NM. To this end, we have to write a health script such that it can deterministically report UNHEALTHY. For example with {code} if [ -e $1 ] ; then echo ERROR Node decommmissioning via health script hack fi {code} In the current version patch, the behavior is controlled by a boolean property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile policies are possible in the future work. Currently, the health state of a node is binary determined based on the disk checker and the health script ERROR outputs. However, we can as well interpret health script output similar to java logging levels (one of which is ERROR) such as WARN, FATAL. Each level can then be treated differently. E.g., - FATAL: unusable like today - ERROR: drain - WARN: halve the node capacity. complimented with some equivalence rules such as 3 WARN messages == ERROR, 2*ERROR == FATAL, etc. Provide alternative policies for UNHEALTHY nodes. 
- Key: YARN-1996 URL: https://issues.apache.org/jira/browse/YARN-1996 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, scheduler Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1996.v01.patch Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs as demonstrated by MAPREDUCE-5817, and downgrade the cluster health even further due to [positive feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have deemed the node unhealthy in the first place starts spreading across the cluster because the current node is declared unusable and all its containers are killed and rescheduled on different nodes. To mitigate this, we experiment with a patch that allows containers already running on a node turning UNHEALTHY to complete (drain) whereas no new container can be assigned to it until it turns healthy again. This mechanism can also be used for graceful decommissioning of NM. To this end, we have to write a health
[jira] [Updated] (YARN-1354) Recover applications upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1354: - Attachment: YARN-1354-v3.patch Updated patch now that YARN-1987 and YARN-1362 have been committed. Recover applications upon nodemanager restart - Key: YARN-1354 URL: https://issues.apache.org/jira/browse/YARN-1354 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1354-v1.patch, YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch The set of active applications in the nodemanager context needs to be recovered for a work-preserving nodemanager restart -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1969) Fair Scheduler: Add policy for Earliest Endtime First
[ https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1969: --- Summary: Fair Scheduler: Add policy for Earliest Endtime First (was: Fair Scheduler: Add policy for Earliest Deadline First) Fair Scheduler: Add policy for Earliest Endtime First - Key: YARN-1969 URL: https://issues.apache.org/jira/browse/YARN-1969 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling* however they have a low priority since there are other jobs (usually much smaller, new comers) that are using resources way below their fair share, hence new released containers are not offered to the big, yet close-to-be-finished job. Nevertheless, everybody would benefit from an unfair scheduling that offers the resource to the big job since the sooner the big job finishes, the sooner it releases its many allocated resources to be used by other jobs.In other words, what we require is a kind of variation of *Earliest Deadline First scheduling*, that takes into account the number of already-allocated resources and estimated time to finish. http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the priority in scheduling would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and be supplied to RM in the resource request messages. To be less susceptible to the issue of apps gaming the system, we can have this scheduling limited to *only within a queue*: i.e., adding a EarliestDeadlinePolicy extends SchedulingPolicy and let the queues to use it by setting the schedulingPolicy field. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1424) RMAppAttemptImpl should precompute a zeroed ApplicationResourceUsageReport to return when attempt not active
[ https://issues.apache.org/jira/browse/YARN-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-1424: - Attachment: YARN1424-01.patch First version of a potential patch. - Moves DUMMY_APPLICATION_RESOURCE_USAGE_REPORT RMAppImpl to RMServerUtils. Cannot move this to ApplicationResourceUsageReport, since it exists in the hadoop-yarn-api module as opposed to everything else being in the hadoop-yarn-server module. - Uses the reference in RMAppImpl and RMAppAttemptImpl. - No unit tests in this particular patch file. RMAppAttemptImpl should precompute a zeroed ApplicationResourceUsageReport to return when attempt not active Key: YARN-1424 URL: https://issues.apache.org/jira/browse/YARN-1424 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Sandy Ryza Assignee: Ray Chiang Priority: Minor Labels: newbie Attachments: YARN1424-01.patch RMAppImpl has a DUMMY_APPLICATION_RESOURCE_USAGE_REPORT to return when the caller of createAndGetApplicationReport doesn't have access. RMAppAttemptImpl should have something similar for getApplicationResourceUsageReport. It also might make sense to put the dummy report into ApplicationResourceUsageReport and allow both to use it. A test would also be useful to verify that RMAppAttemptImpl#getApplicationResourceUsageReport doesn't return null if the scheduler doesn't have a report to return. -- This message was sent by Atlassian JIRA (v6.2#6252)
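As a hedged sketch of the precomputed zeroed report described above (field values and the class name are illustrative; the attached YARN1424-01.patch is not reproduced here):
{code}
import org.apache.hadoop.yarn.api.records.ApplicationResourceUsageReport;
import org.apache.hadoop.yarn.util.Records;
import org.apache.hadoop.yarn.util.resource.Resources;

public class DummyUsageReportSketch {
  // computed once, returned whenever the attempt is not active
  public static final ApplicationResourceUsageReport DUMMY_REPORT = createDummy();

  private static ApplicationResourceUsageReport createDummy() {
    ApplicationResourceUsageReport report =
        Records.newRecord(ApplicationResourceUsageReport.class);
    report.setNumUsedContainers(0);
    report.setNumReservedContainers(0);
    report.setUsedResources(Resources.createResource(0, 0));
    report.setReservedResources(Resources.createResource(0, 0));
    report.setNeededResources(Resources.createResource(0, 0));
    return report;
  }
}
{code}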
[jira] [Updated] (YARN-1969) Fair Scheduler: Add policy for Earliest Endtime First
[ https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1969: --- Description: What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling* however they have a low priority since there are other jobs (usually much smaller, new comers) that are using resources way below their fair share, hence new released containers are not offered to the big, yet close-to-be-finished job. Nevertheless, everybody would benefit from an unfair scheduling that offers the resource to the big job since the sooner the big job finishes, the sooner it releases its many allocated resources to be used by other jobs.In other words, we need a relaxed version of *Earliest Endtime First scheduling*, that takes into account the number of already-allocated resources and estimated time to finish. For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the priority in scheduling would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and be supplied to RM in the resource request messages. To be less susceptible to the issue of apps gaming the system, we can have this scheduling limited to leaf queues which have applications. was: What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling* however they have a low priority since there are other jobs (usually much smaller, new comers) that are using resources way below their fair share, hence new released containers are not offered to the big, yet close-to-be-finished job. Nevertheless, everybody would benefit from an unfair scheduling that offers the resource to the big job since the sooner the big job finishes, the sooner it releases its many allocated resources to be used by other jobs.In other words, what we require is a kind of variation of *Earliest Deadline First scheduling*, that takes into account the number of already-allocated resources and estimated time to finish. http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the priority in scheduling would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and be supplied to RM in the resource request messages. To be less susceptible to the issue of apps gaming the system, we can have this scheduling limited to *only within a queue*: i.e., adding a EarliestDeadlinePolicy extends SchedulingPolicy and let the queues to use it by setting the schedulingPolicy field. Fair Scheduler: Add policy for Earliest Endtime First - Key: YARN-1969 URL: https://issues.apache.org/jira/browse/YARN-1969 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling* however they have a low priority since there are other jobs (usually much smaller, new comers) that are using resources way below their fair share, hence new released containers are not offered to the big, yet close-to-be-finished job. 
Nevertheless, everybody would benefit from an unfair scheduling that offers the resource to the big job since the sooner the big job finishes, the sooner it releases its many allocated resources to be used by other jobs. In other words, we need a relaxed version of *Earliest Endtime First scheduling*, that takes into account the number of already-allocated resources and estimated time to finish. For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the priority in scheduling would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and be supplied to the RM in the resource request messages. To be less susceptible to the issue of apps gaming the system, we can have this scheduling limited to leaf queues which have applications. -- This message was sent by Atlassian JIRA (v6.2#6252)
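To make the p(MEM, TIME) idea concrete, here is a toy, self-contained sketch (every name and the particular formula are hypothetical, not from any patch): apps holding more resources and closer to finishing sort first.
{code}
import java.util.Comparator;

public class EarliestEndtimeFirstSketch {
  public static class AppInfo {
    final long allocatedMb;          // MEM: resources currently held
    final long estimatedMinutesLeft; // TIME: AM-supplied estimate to finish

    public AppInfo(long allocatedMb, long estimatedMinutesLeft) {
      this.allocatedMb = allocatedMb;
      this.estimatedMinutesLeft = estimatedMinutesLeft;
    }

    // one possible p(MEM, TIME): bigger means schedule sooner
    double priority() {
      return (double) allocatedMb / (estimatedMinutesLeft + 1);
    }
  }

  public static final Comparator<AppInfo> EARLIEST_ENDTIME_FIRST =
      new Comparator<AppInfo>() {
        @Override
        public int compare(AppInfo a, AppInfo b) {
          // descending priority: large, nearly finished apps come first
          return Double.compare(b.priority(), a.priority());
        }
      };
}
{code}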
[jira] [Commented] (YARN-2061) Revisit logging levels in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998911#comment-13998911 ] Ray Chiang commented on YARN-2061: -- One other observation. For the various LOG.info() statements in a catch block, should those be LOG.error() or does it make sense for those to stay LOG.info()? Revisit logging levels in ZKRMStateStore - Key: YARN-2061 URL: https://issues.apache.org/jira/browse/YARN-2061 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Labels: newbie Attachments: YARN2061-01.patch ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2066) Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder()
Ted Yu created YARN-2066: Summary: Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder() Key: YARN-2066 URL: https://issues.apache.org/jira/browse/YARN-2066 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Priority: Minor {code} if (this.finish != null) { builder.setFinishBegin(start.getMinimumLong()); builder.setFinishEnd(start.getMaximumLong()); } {code} this.finish should be referenced in the if block. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2066) Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder()
[ https://issues.apache.org/jira/browse/YARN-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-2066: -- Attachment: YARN-2066.patch Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder() --- Key: YARN-2066 URL: https://issues.apache.org/jira/browse/YARN-2066 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2066.patch {code} if (this.finish != null) { builder.setFinishBegin(start.getMinimumLong()); builder.setFinishEnd(start.getMaximumLong()); } {code} this.finish should be referenced in the if block. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2017: -- Attachment: YARN-2017.5.patch Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch A bunch of the same code is repeated among the schedulers, e.g. between FiCaSchedulerNode and FSSchedulerNode. It would be good to merge and share it in a common base class. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1962) Timeline server is enabled by default
[ https://issues.apache.org/jira/browse/YARN-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994006#comment-13994006 ] Jason Lowe commented on YARN-1962: -- +1 lgtm. Will commit this early next week to give [~zjshen] a chance to comment. Timeline server is enabled by default - Key: YARN-1962 URL: https://issues.apache.org/jira/browse/YARN-1962 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Mohammad Kamrul Islam Assignee: Mohammad Kamrul Islam Attachments: YARN-1962.1.patch, YARN-1962.2.patch Since Timeline server is not matured and secured yet, enabling it by default might create some confusion. We were playing with 2.4.0 and found a lot of exceptions for distributed shell example related to connection refused error. Btw, we didn't run TS because it is not secured yet. Although it is possible to explicitly turn it off through yarn-site config. In my opinion, this extra change for this new service is not worthy at this point,. This JIRA is to turn it off by default. If there is an agreement, i can put a simple patch about this. {noformat} 14/04/17 23:24:33 ERROR impl.TimelineClientImpl: Failed to get the response from the timeline server. com.sun.jersey.api.client.ClientHandlerException: java.net.ConnectException: Connection refused at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingEntities(TimelineClientImpl.java:131) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:104) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1072) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:515) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:281) Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at sun.net.NetworkClient.doConnect(NetworkClient.java:180) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.in14/04/17 23:24:33 ERROR impl.TimelineClientImpl: Failed to get the response from the timeline server. 
com.sun.jersey.api.client.ClientHandlerException: java.net.ConnectException: Connection refused at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingEntities(TimelineClientImpl.java:131) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:104) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1072) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:515) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:281) Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198) at
[jira] [Commented] (YARN-2061) Revisit logging levels in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999111#comment-13999111 ] Jian He commented on YARN-2061: --- Hi Ray, thanks for cleaning it up. I think a reasonable way is to put info level in unusual condition which helps debugging in most cases, and debug level in usual condition which avoids excessive loggings. Revisit logging levels in ZKRMStateStore - Key: YARN-2061 URL: https://issues.apache.org/jira/browse/YARN-2061 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Labels: newbie Attachments: YARN2061-01.patch ZKRMStateStore has a few places where it is logging at the INFO level. We should change these to DEBUG or TRACE level messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999020#comment-13999020 ] Hudson commented on YARN-1861: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1861. Fixed a bug in RM to reset leader-election on fencing that was causing both RMs to be stuck in standby mode when automatic failover is enabled. Contributed by Karthik Kambatla and Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594356) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/EmbeddedElectorService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Fix For: 2.4.1 Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998660#comment-13998660 ] Tsuyoshi OZAWA commented on YARN-1514: -- Rough design: 1. Launch ZKRMStateStore and initialize ZooKeeper. 2. Create znodes based on the given options. 3. Run loadState() and show how much time it takes. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.5.0 ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. In particular, ZKRMStateStore#loadState is called when an RM-HA cluster does a failover; therefore, its execution time impacts the failover time of RM-HA. We need a utility to benchmark the execution time of ZKRMStateStore#loadState as a development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
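A rough sketch of step 3 of this design (hedged: the option parsing and znode population from steps 1-2 are omitted, the ZooKeeper address is assumed to be a locally running ensemble, and the class name is made up):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore;

public class ZKStoreLoadBenchmarkSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // assumes ZooKeeper is already running and znodes were populated in step 2
    conf.set(YarnConfiguration.RM_ZK_ADDRESS, "localhost:2181");
    ZKRMStateStore store = new ZKRMStateStore();
    store.init(conf);
    store.start();
    long begin = System.nanoTime();
    store.loadState(); // the call being benchmarked
    long elapsedMs = (System.nanoTime() - begin) / 1000000L;
    System.out.println("ZKRMStateStore#loadState took " + elapsedMs + " ms");
    store.stop();
  }
}
{code}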
[jira] [Updated] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1861: --- Attachment: yarn-1861-6.patch Updated new patch (yarn-1861-6.patch) to fix the nits. Also, the RM could transition to Standby and immediately transition back to Active - reduced the sleep between retries to 1 ms, and changed the assert after the loop to use the number of attempts. Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-766) TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk
[ https://issues.apache.org/jira/browse/YARN-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994129#comment-13994129 ] Junping Du commented on YARN-766: - Hi [~sseth], the patch against trunk makes sense to me. So I updated the name of the JIRA to mention the format inconsistency here. Will commit it shortly. TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk Key: YARN-766 URL: https://issues.apache.org/jira/browse/YARN-766 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.1.0-beta Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Minor Attachments: YARN-766.branch-2.txt, YARN-766.trunk.txt, YARN-766.txt File scriptFile = new File(tmpDir, "scriptFile.sh"); should be replaced with File scriptFile = Shell.appendScriptExtension(tmpDir, "scriptFile"); to match trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1937) Add entity-level access control of the timeline data for owners only
[ https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1937: -- Attachment: YARN-1937.3.patch I've tested the patch on a single node cluster, which seems to work fine generally. Fix one bug I've found in the new patch. Add entity-level access control of the timeline data for owners only Key: YARN-1937 URL: https://issues.apache.org/jira/browse/YARN-1937 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1937.1.patch, YARN-1937.2.patch, YARN-1937.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2070) DistributedShell publishes unfriendly user information to the timeline server
[ https://issues.apache.org/jira/browse/YARN-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2070: -- Labels: newbie (was: ) DistributedShell publishes unfriendly user information to the timeline server - Key: YARN-2070 URL: https://issues.apache.org/jira/browse/YARN-2070 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Priority: Minor Labels: newbie Below is the code that uses the string form of the current user object as the user value. {code} entity.addPrimaryFilter("user", UserGroupInformation.getCurrentUser().toString()); {code} When we use Kerberos authentication, it's going to output the full name, such as zjshen/localhost@LOCALHOST (auth.KERBEROS). It is not user-friendly for searching by the primary filters. It's better to use shortUserName instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999719#comment-13999719 ] Tsuyoshi OZAWA commented on YARN-1365: -- Sure! I'll check it. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1935) Security for timeline server
[ https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1935: -- Attachment: Timeline_Kerberos_DT_ACLs.patch I created an uber patch which integrates the pieces I've done so far. With this patch the timeline server can work in a secure mode (except the generic history service part): 1. The timeline server can start and log in with a Kerberos principal and keytab; 2. A user who has either passed Kerberos authentication or obtained a timeline delegation token can get access to the timeline data; 3. With ACLs enabled, only the owner who published the timeline data can access the data. Folks who are interested in timeline security can play with the patch. Security for timeline server Key: YARN-1935 URL: https://issues.apache.org/jira/browse/YARN-1935 Project: Hadoop YARN Issue Type: New Feature Reporter: Arun C Murthy Assignee: Zhijie Shen Attachments: Timeline_Kerberos_DT_ACLs.patch Jira to track work to secure the ATS -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
[ https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2034: -- Attachment: YARN-2034.patch Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect Key: YARN-2034 URL: https://issues.apache.org/jira/browse/YARN-2034 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Attachments: YARN-2034.patch The description in yarn-default.xml for yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per local directory, but according to the code it's a setting for the entire node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998592#comment-13998592 ] Tsuyoshi OZAWA commented on YARN-1365: -- I've read your code. The prototype includes the following changes: 1. Changed NodeManager's RegisterNodeManagerRequest to send a ContainerReport. 2. Added configuration for RM_WORK_PRESERVING_RECOVERY_ENABLED. 3. Added the cluster timestamp to the ContainerId. I think we should focus on having the NM resync with the RM when RM_WORK_PRESERVING_RECOVERY_ENABLED is set to true. Can you add the resync code (the ResourceManager-side code) to the patch? Also, in regard to the ContainerId format, let's discuss it on YARN-2052. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1986) In Fifo Scheduler, node heartbeat in between creating app and attempt causes NPE
[ https://issues.apache.org/jira/browse/YARN-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999022#comment-13999022 ] Hudson commented on YARN-1986: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1986. In Fifo Scheduler, node heartbeat in between creating app and attempt causes NPE (Hong Zhiguo via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594476) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java In Fifo Scheduler, node heartbeat in between creating app and attempt causes NPE Key: YARN-1986 URL: https://issues.apache.org/jira/browse/YARN-1986 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Jon Bringhurst Assignee: Hong Zhiguo Priority: Critical Fix For: 2.4.1 Attachments: YARN-1986-2.patch, YARN-1986-3.patch, YARN-1986-testcase.patch, YARN-1986.patch After upgrade from 2.2.0 to 2.4.0, NPE on first job start. -After RM was restarted, the job runs without a problem.- {noformat} 19:11:13,441 FATAL ResourceManager:600 - Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:462) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:714) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:743) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591) at java.lang.Thread.run(Thread.java:744) 19:11:13,443 INFO ResourceManager:604 - Exiting, bbye.. {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000434#comment-14000434 ] Hadoop QA commented on YARN-1550: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645277/YARN-1550.001.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3756//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3756//console This message is automatically generated. NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Priority: Critical Fix For: 2.2.1 Attachments: YARN-1550.001.patch, YARN-1550.patch Three steps: 1. Debug at RMAppManager#submitApplication after the code
{code}
if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) {
  String message = "Application with id " + applicationId + " is already present! Cannot add a duplicate!";
  LOG.warn(message);
  throw RPCUtil.getRemoteException(message);
}
{code}
2. Submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3. Go to the page http://ip:50030/cluster/scheduler and find a 500 ERROR! The log:
{noformat}
2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
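A minimal hypothetical sketch (not the attached YARN-1550 patch) of the kind of null guard that would avoid the render NPE above: skip applications whose attempt data is not available yet when the scheduler page is rendered.
{code}
// Hypothetical sketch only, not YARN-1550.001.patch: guard against apps whose
// current attempt is not available yet when the fair scheduler page renders.
for (RMApp app : apps.values()) {
  RMAppAttempt attempt = app.getCurrentAppAttempt();
  if (attempt == null) {
    // app is registered but has no attempt yet; skip it rather than NPE
    continue;
  }
  // ... render the table row for this attempt ...
}
{code}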
[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers
[ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998631#comment-13998631 ] Tsuyoshi OZAWA commented on YARN-1367: -- Some comments on the patch: 1. Can you fix the indent?
{code}
+ public boolean isWorkPreservingRestartEnabled() { return
+ isWorkPreservingRestartEnabled;
+ }
{code}
{code}
+ if (!rmWorkPreservingRestartEnbaled)
+ {
+containerManager.cleanupContainersOnNMResync();
+ }
{code}
2. IMO, recovery.work-preserving-restart.enabled is more appropriate because this is one of the options under the RECOVERY_ENABLED namespace.
{code}
public static final String RM_WORK_PRESERVING_RECOVERY_ENABLED =
    RM_PREFIX + "work-preserving.recovery.enabled";
{code}
After restart NM should resync with the RM without killing containers - Key: YARN-1367 URL: https://issues.apache.org/jira/browse/YARN-1367 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1367.prototype.patch After RM restart, the RM sends a resync response to NMs that heartbeat to it. Upon receiving the resync response, the NM kills all containers and re-registers with the RM. The NM should be changed to not kill the containers and instead inform the RM about all currently running containers, including their allocations etc. After the re-register, the NM should send all pending container completions to the RM as usual. -- This message was sent by Atlassian JIRA (v6.2#6252)
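For illustration, the quoted hunks reformatted as the comment asks, with the configuration key renamed along the lines suggested in point 2; this is only a sketch of the suggestion, not the committed naming, and the misspelled variable from the patch is corrected here for readability.
{code}
// Illustrative only: the quoted snippets with the indentation fixed and the
// key placed under the recovery namespace as suggested; not the final patch.
public boolean isWorkPreservingRestartEnabled() {
  return isWorkPreservingRestartEnabled;
}

if (!rmWorkPreservingRestartEnabled) {
  containerManager.cleanupContainersOnNMResync();
}

public static final String RM_WORK_PRESERVING_RECOVERY_ENABLED =
    RM_PREFIX + "recovery.work-preserving-restart.enabled";
{code}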
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998823#comment-13998823 ] Wangda Tan commented on YARN-1368: -- Sorry, I went to the wrong JIRA; please ignore the above comment :-/ Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2030) Use StateMachine to simplify handleStoreEvent() in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993246#comment-13993246 ] Wangda Tan commented on YARN-2030: -- +1 for this idea; I think we should handle this neatly to avoid possible bugs. Use StateMachine to simplify handleStoreEvent() in RMStateStore --- Key: YARN-2030 URL: https://issues.apache.org/jira/browse/YARN-2030 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Currently, the logic to handle different store events in handleStoreEvent() is as follows:
{code}
if (event.getType().equals(RMStateStoreEventType.STORE_APP)
    || event.getType().equals(RMStateStoreEventType.UPDATE_APP)) {
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
    ...
  } else {
    ...
  }
  ...
  try {
    if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
      ...
    } else {
      ...
    }
  }
  ...
} else if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)
    || event.getType().equals(RMStateStoreEventType.UPDATE_APP_ATTEMPT)) {
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
    ...
  } else {
    ...
  }
  ...
  if (event.getType().equals(RMStateStoreEventType.STORE_APP_ATTEMPT)) {
    ...
  } else {
    ...
  }
}
...
} else if (event.getType().equals(RMStateStoreEventType.REMOVE_APP)) {
  ...
} else {
  ...
}
}
{code}
This not only confuses people but also easily leads to mistakes. We may leverage a state machine to simplify this, even though there are no real state transitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
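As a rough sketch of the YARN-2030 proposal above (state names and transition classes are assumptions for illustration, not a committed design), the event dispatch could be registered with YARN's StateMachineFactory so each event type gets its own transition instead of a chain of if/else blocks.
{code}
// Rough illustrative sketch: one transition per event type. State and
// transition class names are assumptions, not the committed implementation.
private static final StateMachineFactory<RMStateStore,
    RMStateStoreState, RMStateStoreEventType, RMStateStoreEvent>
  stateMachineFactory =
    new StateMachineFactory<RMStateStore, RMStateStoreState,
        RMStateStoreEventType, RMStateStoreEvent>(RMStateStoreState.ACTIVE)
      .addTransition(RMStateStoreState.ACTIVE, RMStateStoreState.ACTIVE,
          RMStateStoreEventType.STORE_APP, new StoreAppTransition())
      .addTransition(RMStateStoreState.ACTIVE, RMStateStoreState.ACTIVE,
          RMStateStoreEventType.UPDATE_APP, new UpdateAppTransition())
      .addTransition(RMStateStoreState.ACTIVE, RMStateStoreState.ACTIVE,
          RMStateStoreEventType.REMOVE_APP, new RemoveAppTransition());
{code}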
[jira] [Updated] (YARN-1918) Typo in description and error message for 'yarn.resourcemanager.cluster-id'
[ https://issues.apache.org/jira/browse/YARN-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anandha L Ranganathan updated YARN-1918: Attachment: YARN-1918.1.patch Typo in description and error message for 'yarn.resourcemanager.cluster-id' --- Key: YARN-1918 URL: https://issues.apache.org/jira/browse/YARN-1918 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Devaraj K Assignee: Anandha L Ranganathan Priority: Trivial Labels: newbie Attachments: YARN-1918.1.patch 1. In yarn-default.xml
{code:xml}
<property>
  <description>Name of the cluster. In a HA setting, this is used to ensure the RM participates in leader election fo this cluster and ensures it does not affect other clusters</description>
  <name>yarn.resourcemanager.cluster-id</name>
  <!--value>yarn-cluster</value-->
</property>
{code}
Here the line 'election fo this cluster and ensures it does not affect' should be replaced with 'election for this cluster and ensures it does not affect'. 2.
{code:xml}
org.apache.hadoop.HadoopIllegalArgumentException: Configuration doesn't specifyyarn.resourcemanager.cluster-id
    at org.apache.hadoop.yarn.conf.YarnConfiguration.getClusterId(YarnConfiguration.java:1336)
{code}
The above exception message is missing a space between the message text and the configuration name. -- This message was sent by Atlassian JIRA (v6.2#6252)
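For convenience, the corrected property exactly as the YARN-1918 report above asks for it (only the 'fo' to 'for' change in the description text):
{code:xml}
<property>
  <description>Name of the cluster. In a HA setting, this is used to ensure the RM participates in leader election for this cluster and ensures it does not affect other clusters</description>
  <name>yarn.resourcemanager.cluster-id</name>
  <!--value>yarn-cluster</value-->
</property>
{code}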
[jira] [Updated] (YARN-2012) Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2012: - Attachment: YARN-2012-v2.txt Patch refreshed. Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute - Key: YARN-2012 URL: https://issues.apache.org/jira/browse/YARN-2012 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt Currently the 'default' rule in the queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue. This queue should be an existing queue; if not, we fall back to the root.default queue, hence keeping this rule terminal. This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
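To make the YARN-2012 proposal above concrete, a hypothetical allocation-file snippet; the attribute name "queue" is an assumption of this sketch, not a committed interface.
{code:xml}
<!-- Hypothetical sketch: apps matching no earlier rule land in root.adhoc if
     it exists, otherwise in root.default, keeping the rule terminal. The
     "queue" attribute name is an assumption, not the committed syntax. -->
<queuePlacementPolicy>
  <rule name="specified"/>
  <rule name="default" queue="root.adhoc"/>
</queuePlacementPolicy>
{code}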
[jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000435#comment-14000435 ] Hadoop QA commented on YARN-1354: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645314/YARN-1354-v3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 12 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3757//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3757//console This message is automatically generated. Recover applications upon nodemanager restart - Key: YARN-1354 URL: https://issues.apache.org/jira/browse/YARN-1354 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1354-v1.patch, YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch The set of active applications in the nodemanager context need to be recovered for work-preserving nodemanager restart -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2049) Delegation token stuff for the timeline sever
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000471#comment-14000471 ] Hadoop QA commented on YARN-2049: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645260/YARN-2049.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3760//console This message is automatically generated. Delegation token stuff for the timeline sever - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch, YARN-2049.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2053) Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts
[ https://issues.apache.org/jira/browse/YARN-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999533#comment-13999533 ] Hadoop QA commented on YARN-2053: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645047/YARN-2053.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3751//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3751//console This message is automatically generated. Slider AM fails to restart: NPE in RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts Key: YARN-2053 URL: https://issues.apache.org/jira/browse/YARN-2053 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sumit Mohanty Assignee: Wangda Tan Fix For: 2.4.1 Attachments: YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, YARN-2053.patch, yarn-yarn-nodemanager-c6403.ambari.apache.org.log.bak, yarn-yarn-resourcemanager-c6403.ambari.apache.org.log.bak Slider AppMaster restart fails with the following: {code} org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterResponseProto$Builder.addAllNmTokensFromPreviousAttempts(YarnServiceProtos.java:2700) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
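Purely as an illustration of the class of fix for the NPE reported in YARN-2053 above (not the attached patch; the setter and helper names are assumptions of this sketch): make sure the response builder never sees a null token list.
{code}
// Hypothetical illustration, not the attached YARN-2053 patch; the setter and
// helper names are assumed. Never hand the protobuf builder a null list.
List<NMToken> transferredTokens = getNMTokensFromPreviousAttempts(appAttemptId);
if (transferredTokens == null) {
  transferredTokens = Collections.emptyList();
}
response.setNMTokensFromPreviousAttempts(transferredTokens);
{code}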
[jira] [Updated] (YARN-1339) Recover DeletionService state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1339: - Attachment: YARN-1339v4.patch Updated patch now that YARN-1987 has been committed. Recover DeletionService state upon nodemanager restart -- Key: YARN-1339 URL: https://issues.apache.org/jira/browse/YARN-1339 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1339.patch, YARN-1339v2.patch, YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2069) Add cross-user preemption within CapacityScheduler's leaf-queue
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2069: -- Fix Version/s: (was: 2.1.0-beta) Add cross-user preemption within CapacityScheduler's leaf-queue --- Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000284#comment-14000284 ] Bikas Saha commented on YARN-1366: -- bq. Seems like we are going with no resync api for now as per the current patch. I think it's a good idea to hold off on the new API unless we see a need. I feel there isn't a strong case for it yet. I don't think we can summarily make such a choice without a proper discussion. Again, I am not advocating either choice. But we should understand the approaches and their effects on the system (users + back-end implementation) before we make a call on the API. My last comment opened the discussion with some questions and it would be great if the assignee ([~rohithsharma]) and other committers/contributors expressed their understanding and insight on those questions. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0, and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM, then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
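A hypothetical AM-side sketch of the resync semantics described in YARN-1366 above (field and collection names are assumptions; whether this lives behind a new API or inside allocate() is exactly the open question in the comment):
{code}
// Hypothetical sketch of the resync semantics, not a committed API:
// reset the allocate sequence number and re-send the entire outstanding ask.
lastResponseId = 0;
List<ResourceRequest> askList = new ArrayList<ResourceRequest>(outstandingAsk);
List<ContainerId> releaseList = new ArrayList<ContainerId>(pendingRelease);
AllocateRequest request = AllocateRequest.newInstance(
    lastResponseId, progress, askList, releaseList, null);
AllocateResponse response = rmClient.allocate(request);
// After a resync the RM may report some container completions more than once,
// so the AM must tolerate duplicates here.
{code}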
[jira] [Commented] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000448#comment-14000448 ] Hadoop QA commented on YARN-1936: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645288/YARN-1936.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3758//console This message is automatically generated. Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1936.1.patch, YARN-1936.2.patch TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998979#comment-13998979 ] Anubhav Dhoot commented on YARN-1365: - Hi [~ozawa], I just saw your comment after I had it ready. Can you please help review the tests I added? Thanks. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.initial.patch For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1976) Tracking url missing http protocol for FAILED application
[ https://issues.apache.org/jira/browse/YARN-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999004#comment-13999004 ] Hudson commented on YARN-1976: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/]) YARN-1976. Fix CHANGES.txt for YARN-1976. (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594123) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt Tracking url missing http protocol for FAILED application - Key: YARN-1976 URL: https://issues.apache.org/jira/browse/YARN-1976 Project: Hadoop YARN Issue Type: Bug Reporter: Yesha Vora Assignee: Junping Du Fix For: 2.4.1 Attachments: YARN-1976-v2.patch, YARN-1976.patch Running yarn application -list -appStates FAILED does not print the http protocol name the way it does for FINISHED apps.
{noformat}
-bash-4.1$ yarn application -list -appStates FINISHED,FAILED,KILLED
14/04/15 23:55:07 INFO client.RMProxy: Connecting to ResourceManager at host
Total number of applications (application-types: [] and states: [FINISHED, FAILED, KILLED]):4
Application-Id                  Application-Name  Application-Type  User    Queue    State     Final-State  Progress  Tracking-URL
application_1397598467870_0004  Sleep job         MAPREDUCE         hrt_qa  default  FINISHED  SUCCEEDED    100%      http://host:19888/jobhistory/job/job_1397598467870_0004
application_1397598467870_0003  Sleep job         MAPREDUCE         hrt_qa  default  FINISHED  SUCCEEDED    100%      http://host:19888/jobhistory/job/job_1397598467870_0003
application_1397598467870_0002  Sleep job         MAPREDUCE         hrt_qa  default  FAILED    FAILED       100%      host:8088/cluster/app/application_1397598467870_0002
application_1397598467870_0001  word count        MAPREDUCE         hrt_qa  default  FINISHED  SUCCEEDED    100%      http://host:19888/jobhistory/job/job_1397598467870_0001
{noformat}
It only prints 'host:8088/cluster/app/application_1397598467870_0002' instead of 'http://host:8088/cluster/app/application_1397598467870_0002' -- This message was sent by Atlassian JIRA (v6.2#6252)
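A minimal sketch of the kind of normalization that addresses the YARN-1976 report above (not the attached patch; the helper name is made up for illustration):
{code}
// Minimal illustrative helper, not the attached YARN-1976 patch: make sure a
// tracking URL carries a scheme before it is printed.
static String withScheme(String trackingUrl) {
  if (trackingUrl == null || trackingUrl.isEmpty() || trackingUrl.contains("://")) {
    return trackingUrl;
  }
  return "http://" + trackingUrl;
}
{code}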
[jira] [Assigned] (YARN-1569) For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting
[ https://issues.apache.org/jira/browse/YARN-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu reassigned YARN-1569: --- Assignee: zhihai xu For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting - Key: YARN-1569 URL: https://issues.apache.org/jira/browse/YARN-1569 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Junping Du Assignee: zhihai xu Priority: Minor Labels: newbie As per http://wiki.apache.org/hadoop/CodeReviewChecklist, we should always check for the appropriate type before casting. handle(SchedulerEvent) in FifoScheduler and CapacityScheduler doesn't do this check so far (no bug there now) but should be improved as in FairScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
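For illustration, the pattern YARN-1569 asks for, following what FairScheduler already does; only one event case is shown, and it is a sketch rather than the eventual patch.
{code}
// Illustrative pattern only: verify the concrete event type before casting
// inside handle(SchedulerEvent), as FairScheduler already does.
case NODE_UPDATE:
  if (!(event instanceof NodeUpdateSchedulerEvent)) {
    throw new RuntimeException("Unexpected event type: " + event);
  }
  NodeUpdateSchedulerEvent nodeUpdatedEvent = (NodeUpdateSchedulerEvent) event;
  nodeUpdate(nodeUpdatedEvent.getRMNode());
  break;
{code}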
[jira] [Commented] (YARN-2066) Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder()
[ https://issues.apache.org/jira/browse/YARN-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000393#comment-14000393 ] Hadoop QA commented on YARN-2066: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645233/YARN-2066.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3754//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3754//console This message is automatically generated. Wrong field is referenced in GetApplicationsRequestPBImpl#mergeLocalToBuilder() --- Key: YARN-2066 URL: https://issues.apache.org/jira/browse/YARN-2066 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2066.patch {code} if (this.finish != null) { builder.setFinishBegin(start.getMinimumLong()); builder.setFinishEnd(start.getMaximumLong()); } {code} this.finish should be referenced in the if block. -- This message was sent by Atlassian JIRA (v6.2#6252)
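Reading the YARN-2066 description above directly, the corrected block would reference this.finish rather than start; see the attached patch for the authoritative change.
{code}
// Corrected form implied by the description: use the finish range, not start.
if (this.finish != null) {
  builder.setFinishBegin(finish.getMinimumLong());
  builder.setFinishEnd(finish.getMaximumLong());
}
{code}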
[jira] [Assigned] (YARN-1569) For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting
[ https://issues.apache.org/jira/browse/YARN-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-1569: --- Assignee: Anubhav Dhoot For handle(SchedulerEvent) in FifoScheduler and CapacityScheduler, SchedulerEvent should get checked (instanceof) for appropriate type before casting - Key: YARN-1569 URL: https://issues.apache.org/jira/browse/YARN-1569 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Junping Du Assignee: Anubhav Dhoot Priority: Minor Labels: newbie As per http://wiki.apache.org/hadoop/CodeReviewChecklist, we should always check for the appropriate type before casting. handle(SchedulerEvent) in FifoScheduler and CapacityScheduler doesn't do this check so far (no bug there now) but should be improved as in FairScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998822#comment-13998822 ] Wangda Tan commented on YARN-1368: -- LGTM, +1 (non-binding). Please kick off a Jenkins build. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.combined.001.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
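As an illustration of the YARN-1368 description above (the constructor shape and accessor name are assumptions of this sketch, not the attached patch), the node-added scheduler event would carry the containers the NM reported at registration so the scheduler can rebuild its allocation state.
{code}
// Hypothetical sketch, not the attached patch: pass the containers the NM
// reported at registration to the scheduler with the NODE_ADDED event, so the
// scheduler can re-populate its allocation state for that node.
List<ContainerStatus> reportedContainers = rmNode.getContainersToRecover();
scheduler.handle(new NodeAddedSchedulerEvent(rmNode, reportedContainers));
{code}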
[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000398#comment-14000398 ] Hadoop QA commented on YARN-1338: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645279/YARN-1338v4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 15 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3753//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3753//console This message is automatically generated. Recover localized resource cache state upon nodemanager restart --- Key: YARN-1338 URL: https://issues.apache.org/jira/browse/YARN-1338 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1338.patch, YARN-1338v2.patch, YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch Today when node manager restarts we clean up all the distributed cache files from disk. This is definitely not ideal from 2 aspects. * For work preserving restart we definitely want them as running containers are using them * For even non work preserving restart this will be useful in the sense that we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000528#comment-14000528 ] Karthik Kambatla commented on YARN-1474: Looks like it did run, but couldn't apply the patch. Mind updating? Also, I was wondering whether we should change the signature of {{reinitialize()}}. FWIW, I am +0 to changing it. # I understand that passing the RMContext is not required anymore, and it is better to change it so we don't accumulate more code calling it. # However, that is an incompatible change to ResourceScheduler, which is Private. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0, 2.4.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize method but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
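A rough sketch of the YARN-1474 idea above: folding a scheduler into the YARN service model so init/start/stop are explicit lifecycle steps. This is illustrative only and assumes the usual AbstractService hooks; it is not any of the attached patches.
{code}
// Illustrative sketch, not the attached patches: a scheduler as a YARN service.
public class FifoScheduler extends AbstractService implements ResourceScheduler {

  public FifoScheduler() {
    super(FifoScheduler.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // one-time configuration, previously done in reinitialize()
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    // start any update threads here
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    // stop threads and release resources here
    super.serviceStop();
  }
}
{code}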
[jira] [Assigned] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-2065: - Assignee: Jian He AM cannot create new containers after restart-NM token from previous attempt used - Key: YARN-2065 URL: https://issues.apache.org/jira/browse/YARN-2065 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Steve Loughran Assignee: Jian He Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers. The Slider minicluster test {{TestKilledAM}} can replicate this reliably -it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000477#comment-14000477 ] Hadoop QA commented on YARN-1339: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12645284/YARN-1339v4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3759//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3759//console This message is automatically generated. Recover DeletionService state upon nodemanager restart -- Key: YARN-1339 URL: https://issues.apache.org/jira/browse/YARN-1339 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1339.patch, YARN-1339v2.patch, YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)