[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994837#comment-13994837
 ] 

Karthik Kambatla commented on YARN-1474:


# Correct me if I am wrong, but changes to AllocationFileLoaderService look 
unrelated. Can we do it in a different JIRA?
# Nothing to do with this patch, but maybe we can add spaces between the 
interfaces ResourceSchedulerWrapper implements?
{code}
public class ResourceSchedulerWrapper extends AbstractYarnScheduler
implements SchedulerWrapper,ResourceScheduler,Configurable {
{code}
# Correct me if I am wrong, but we need to set the rmContext only once. Can we 
update the comment to say this needs to be called immediately after 
instantiating a scheduler?
# Do we need the changes to the {{reinitialize()}} implementations in each 
scheduler? Also, I don't think we need a separate serviceInitInternal. Why not 
just have serviceInit call reinitialize?
# FairScheduler: we can do without these variables.
{code}
  private volatile boolean isUpdateThreadRunning = false;
  private volatile boolean isSchedulingThreadRunning = false;
{code}
# FairScheduler: serviceStartInternal and serviceStopInternal are fairly small 
methods - do we need these separate methods? 
# Can we call join(timeout) after interrupt, maybe using a constant 
THREAD_JOIN_TIMEOUT = 1000? Also, set updateThread to null after the join (see 
the sketch after this list).
{code}
if (updateThread != null) {
  updateThread.interrupt();
}
{code}
# Should we also check whether schedulingThread is null? Also, set the thread to null after join().
{code}
if (continuousSchedulingEnabled) {
  isSchedulingThreadRunning = false;
  schedulingThread.interrupt();
}
{code}
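
To make items 7 and 8 concrete, here is a minimal sketch of the suggested 
shutdown sequence, assuming the updateThread/schedulingThread fields from the 
patch; the method shape and the THREAD_JOIN_TIMEOUT name are illustrative 
suggestions, not existing code:
{code}
// Illustrative only: interrupt, then join with a bounded timeout, then null
// out the thread references so a stopped scheduler holds no stale threads.
private static final long THREAD_JOIN_TIMEOUT = 1000;

@Override
protected void serviceStop() throws Exception {
  if (updateThread != null) {
    updateThread.interrupt();
    updateThread.join(THREAD_JOIN_TIMEOUT);
    updateThread = null;
  }
  if (continuousSchedulingEnabled && schedulingThread != null) {
    schedulingThread.interrupt();
    schedulingThread.join(THREAD_JOIN_TIMEOUT);
    schedulingThread = null;
  }
  super.serviceStop();
}
{code}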

 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, 
 YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, 
 YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-2044) thrift interface for YARN?

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-2044.
---

Resolution: Not a Problem

We have protocol buffer based interfaces that you can look at.

Also, please post questions on the mailing lists instead of opening tickets on 
the issue-tracker. Thanks.

 thrift interface for YARN?
 --

 Key: YARN-2044
 URL: https://issues.apache.org/jira/browse/YARN-2044
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Nikhil Mulley

 Hi,
 I was searching for the thrift interface definitions for YARN but could not 
 come across any. Is there any plan to have a thrift interface to YARN ? If 
 there is already one, could some one please redirect me to the appropriate 
 place?
 thanks,
 Nikhil



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994865#comment-13994865
 ] 

Vinod Kumar Vavilapalli commented on YARN-1515:
---

bq. Vinod Kumar Vavilapalli, I am interested in your feedback in the context of 
your comment on MAPREDUCE-5044 .
Yeah, sorry. This was in my blind spot. I understand this patch has been up for 
a while, and is likely also being run in production, but I have some comments.

As I mentioned on MAPREDUCE-5044, this feature should be done via YARN-445. 
Dumping threads is strictly a Java construct, and so far we have avoided 
exposing language-specific features in the YARN APIs (not willingly, anyway). 
Can we instead implement this via YARN-445 by having clients/AMs send a SIGQUIT 
or some such signal command? I looked at the patch, and that is indeed what it 
ends up doing in the NM. We need to keep the API clean. Thoughts?
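
For background, a JVM prints a thread dump to stderr when it receives SIGQUIT, 
so the thread-dump piece reduces to delivering that signal to the container's 
process. A minimal illustrative sketch (not the actual NM code; a real 
implementation would go through the ContainerExecutor rather than exec'ing kill 
directly):
{code}
import java.io.IOException;

// Illustrative sketch only: deliver SIGQUIT to a container pid so its JVM
// writes a thread dump to the container's stderr log.
public class ThreadDumpSignaller {
  public void requestThreadDump(String containerPid)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder("kill", "-QUIT", containerPid).start();
    p.waitFor();
  }
}
{code}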



 Ability to dump the container threads and stop the containers in a single RPC
 -

 Key: YARN-1515
 URL: https://issues.apache.org/jira/browse/YARN-1515
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, 
 YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, 
 YARN-1515.v06.patch, YARN-1515.v07.patch


 This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for 
 timed-out task attempts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1515:
--

Issue Type: Sub-task  (was: New Feature)
Parent: YARN-445

 Ability to dump the container threads and stop the containers in a single RPC
 -

 Key: YARN-1515
 URL: https://issues.apache.org/jira/browse/YARN-1515
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, 
 YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, 
 YARN-1515.v06.patch, YARN-1515.v07.patch


 This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for 
 timed-out task attempts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-445) Ability to signal containers

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994878#comment-13994878
 ] 

Vinod Kumar Vavilapalli commented on YARN-445:
--

Folks, I just made YARN-1515 a sub-task of this.

This JIRA is today focusing on exposing a signalling interface on the 
ResourceManager. It seems like we can simply expose the same API as part of 
ContainerManagement and get most of the thread-dump functionality with minimal 
changes.

 Ability to signal containers
 

 Key: YARN-445
 URL: https://issues.apache.org/jira/browse/YARN-445
 Project: Hadoop YARN
  Issue Type: Task
  Components: nodemanager
Reporter: Jason Lowe
Assignee: Andrey Klochkov
 Attachments: MRJob.png, MRTasks.png, YARN-445--n2.patch, 
 YARN-445--n3.patch, YARN-445--n4.patch, 
 YARN-445-signal-container-via-rm.patch, YARN-445.patch, YARNContainers.png


 It would be nice if an ApplicationMaster could send signals to contaniers 
 such as SIGQUIT, SIGUSR1, etc.
 For example, in order to replicate the jstack-on-task-timeout feature 
 implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an 
 interface for sending SIGQUIT to a container.  For that specific feature we 
 could implement it as an additional field in the StopContainerRequest.  
 However that would not address other potential features like the ability for 
 an AM to trigger jstacks on arbitrary tasks *without* killing them.  The 
 latter feature would be a very useful debugging tool for users who do not 
 have shell access to the nodes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2017:
--

Summary: Merge some of the common lib code in schedulers  (was: Merge 
common code in schedulers)

Edited the title to reflect what is actually being done.

 Merge some of the common lib code in schedulers
 ---

 Key: YARN-2017
 URL: https://issues.apache.org/jira/browse/YARN-2017
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He

 A bunch of the same code is repeated among the schedulers, e.g. between 
 FiCaSchedulerNode and FSSchedulerNode. It would be good to merge and share it 
 in a common base.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1803) Signal container support in nodemanager

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994879#comment-13994879
 ] 

Vinod Kumar Vavilapalli commented on YARN-1803:
---

Tx for working on this, Ming. A few comments, in line with [my comment on 
YARN-445|https://issues.apache.org/jira/browse/YARN-445?focusedCommentId=13994878page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13994878]
 about combining this functionality with the thread-dump feature:
 - We need to consolidate the stopContainer* APIs into the signalContainer 
APIs; logically, stopping is a subset of signalling.
 - To make that happen, we will need bulk signalling APIs to signal multiple 
containers simultaneously.
 - One other requirement as part of that is to be able to send an ordered list 
of signals so that the NM can, for example, do things like SIGTERM+SIGKILL or 
thread-dump+SIGTERM+SIGKILL (see the hypothetical sketch below).
 - SignalContainerCommand defines a bunch of commands that aren't going to be 
implemented today - let's only add those that are required and are going to be 
implemented as part of this set of patches.

Still navigating the entire arena w.r.t. the signalling work being done across 
several JIRAs.
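
To illustrate the ordered-signal idea mentioned above, a hypothetical request 
shape could look like the following (class and field names are made up for 
illustration; the real API would be defined as protobuf-backed records):
{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerId;

// Hypothetical illustration of a bulk, ordered signal request: signal several
// containers with a sequence such as [SIGQUIT (thread dump), SIGTERM, SIGKILL],
// with a delay between steps.
public class SignalContainersRequestSketch {
  private final List<ContainerId> containerIds;
  private final List<String> orderedSignals;   // e.g. "SIGQUIT", "SIGTERM", "SIGKILL"
  private final long delayBetweenSignalsMs;

  public SignalContainersRequestSketch(List<ContainerId> containerIds,
      List<String> orderedSignals, long delayBetweenSignalsMs) {
    this.containerIds = containerIds;
    this.orderedSignals = orderedSignals;
    this.delayBetweenSignalsMs = delayBetweenSignalsMs;
  }
  // getters omitted for brevity
}
{code}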

 Signal container support in nodemanager
 ---

 Key: YARN-1803
 URL: https://issues.apache.org/jira/browse/YARN-1803
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Ming Ma
Assignee: Ming Ma
 Attachments: YARN-1803.patch


 It could include the followings.
 1. ContainerManager is able to process a new event type 
 ContainerManagerEventType.SIGNAL_CONTAINERS coming from NodeStatusUpdater and 
 deliver the request to ContainerExecutor.
 2. Translate the platform independent signal command to Linux specific 
 signals. Windows support will be tracked by another task.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reassigned YARN-1368:
-

Assignee: Jian He  (was: Anubhav Dhoot)

Assigning to Jian as he has started putting up patches. [~adhoot]/[~wangda], 
please help with reviews.

 Common work to re-populate containers’ state into scheduler
 ---

 Key: YARN-1368
 URL: https://issues.apache.org/jira/browse/YARN-1368
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Jian He
 Attachments: YARN-1368.1.patch, YARN-1368.preliminary.patch


 YARN-1367 adds support for the NM to tell the RM about all currently running 
 containers upon registration. The RM needs to send this information to the 
 schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
 the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-12 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith reassigned YARN-1366:


Assignee: Rohith

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Rohith
 Attachments: YARN-1366.patch, YARN-1366.prototype.patch, 
 YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.
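
As an aside, a minimal sketch of what this resync behaviour could look like on 
the AM side (class and helper names are hypothetical; real AMs would do this 
inside their AMRMClient handling):
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

// Hypothetical sketch of AM-side resync: on a resync signal from the RM,
// reset the allocate sequence number and re-send every outstanding request.
// Duplicate completed-container reports must be tolerated by the caller.
public class AmResyncSketch {
  private int responseId = 0;                       // allocate RPC sequence number
  private final List<ResourceRequest> outstanding = new ArrayList<ResourceRequest>();

  void onResync() {
    responseId = 0;                                  // restart the sequence at 0
    resendAll(outstanding);                          // re-send the full outstanding ask
  }

  private void resendAll(List<ResourceRequest> asks) {
    // hypothetical helper: fold all asks back into the next allocate() call
  }
}
{code}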



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-12 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994920#comment-13994920
 ] 

Rohith commented on YARN-1366:
--

Thank you for offering! I was just waiting for Anubhav Dhoot to finish the 
prototype. I'll assign it to myself :-)

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
 Attachments: YARN-1366.patch, YARN-1366.prototype.patch, 
 YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-05-12 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-796:
---

Assignee: Wangda Tan  (was: Arun C Murthy)

Working on this JIRA; assigned it to myself. I will post a design doc in a 
day or two.

 Allow for (admin) labels on nodes and resource-requests
 ---

 Key: YARN-796
 URL: https://issues.apache.org/jira/browse/YARN-796
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Wangda Tan
 Attachments: YARN-796.patch


 It will be useful for admins to specify labels for nodes. Examples of labels 
 are OS, processor architecture etc.
 We should expose these labels and allow applications to specify labels on 
 resource-requests.
 Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994891#comment-13994891
 ] 

Vinod Kumar Vavilapalli commented on YARN-1366:
---

[~rohithsharma], are you interested in taking this patch further? If so, assign 
it to yourself, and [~adhoot] can provide review comments and help. Otherwise, 
from what I can see, he will take it over.

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
 Attachments: YARN-1366.patch, YARN-1366.prototype.patch, 
 YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994893#comment-13994893
 ] 

Vinod Kumar Vavilapalli commented on YARN-556:
--

Also, if there is general agreement on the order in which the patches should go 
in, please capture that ordering through JIRA dependencies. Thanks.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios

2014-05-12 Thread Ashwin Shankar (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Shankar reassigned YARN-2026:


Assignee: Ashwin Shankar

 Fair scheduler : Fair share for inactive queues causes unfair allocation in 
 some scenarios
 --

 Key: YARN-2026
 URL: https://issues.apache.org/jira/browse/YARN-2026
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
  Labels: scheduler

 While using hierarchical queues in the fair scheduler, there are a few 
 scenarios where we have seen a leaf queue with the least fair share take the 
 majority of the cluster and starve a sibling parent queue that has a greater 
 weight/fair share, and preemption doesn't kick in to reclaim resources.
 The root cause seems to be that the fair share of a parent queue is 
 distributed to all its children irrespective of whether a child is active or 
 inactive (no apps running). Preemption based on fair share kicks in only if 
 the usage of a queue is less than 50% of its fair share and it has demand 
 greater than that. When there are many queues under a parent queue (with a 
 high fair share), each child queue's fair share becomes really low. As a 
 result, when only a few of these child queues have apps running, they reach 
 their *tiny* fair share quickly and preemption doesn't happen even if other 
 leaf queues (non-siblings) are hogging the cluster.
 This can be solved by dividing the parent queue's fair share only among its 
 active child queues.
 Here is an example describing the problem and the proposed solution:
 root.lowPriorityQueue is a leaf queue with weight 2
 root.HighPriorityQueue is a parent queue with weight 8
 root.HighPriorityQueue has 10 child leaf queues: 
 root.HighPriorityQueue.childQ(1..10)
 The above config results in root.HighPriorityQueue having an 80% fair share, 
 and each of its ten child queues would have an 8% fair share. Preemption 
 would kick in only if a child queue's usage is below 4% (0.5 * 8 = 4).
 Let's say at the moment no apps are running in any of 
 root.HighPriorityQueue.childQ(1..10) and a few apps are running in 
 root.lowPriorityQueue, which is taking up 95% of the cluster.
 Up to this point, the behavior of FS is correct.
 Now, let's say root.HighPriorityQueue.childQ1 gets a big job that requires 
 30% of the cluster. It would get only the available 5% of the cluster, and 
 preemption wouldn't kick in since it is above 4% (half its fair share). This 
 is bad considering childQ1 is under a high-priority parent queue that has an 
 *80% fair share*.
 Until root.lowPriorityQueue starts relinquishing containers, we would see the 
 following allocation on the scheduler page:
 *root.lowPriorityQueue = 95%*
 *root.HighPriorityQueue.childQ1 = 5%*
 This can be solved by distributing a parent's fair share only to active 
 queues.
 So in the example above, since childQ1 is the only active queue under 
 root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 
 80%.
 This would cause preemption to reclaim the 30% needed by childQ1 from 
 root.lowPriorityQueue after fairSharePreemptionTimeout seconds.
 Also note that a similar situation can happen between 
 root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 
 hogs the cluster. childQ2 can take up 95% of the cluster and childQ1 would be 
 stuck at 5% until childQ2 starts relinquishing containers. We would like each 
 of childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, 
 i.e. 40%, which would ensure childQ1 gets up to 40% of the resources if 
 needed, through preemption.
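
A small illustrative calculation of the proposed change, using the numbers from 
the example above (plain Java arithmetic, not the FairScheduler implementation):
{code}
// Today: 80% parent share / 10 children = 8% each, so preemption triggers only
// below 4%. Proposed: divide only among *active* children, so 1 active child
// gets 80% (and 2 active children would get 40% each).
public class ActiveFairShareExample {
  public static void main(String[] args) {
    double parentFairShare = 0.80;      // root.HighPriorityQueue
    int activeChildren = 1;             // only childQ1 has running apps
    double childFairShare = parentFairShare / activeChildren;   // 0.80 instead of 0.08
    double preemptionThreshold = 0.5 * childFairShare;          // 0.40 instead of 0.04
    System.out.println(childFairShare + " " + preemptionThreshold);
  }
}
{code}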



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1989) Adding shell scripts to launch multiple servers on localhost

2014-05-12 Thread Masatake Iwasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated YARN-1989:
---

Attachment: YARN-1989-0.patch

Attaching a patch. 
local-resourcemanagers-ha.sh starts multiple ResourceManagers in HA mode on 
localhost; local-nodemanagers.sh starts multiple NodeManagers on localhost.

 Adding shell scripts to launch multiple servers on localhost
 

 Key: YARN-1989
 URL: https://issues.apache.org/jira/browse/YARN-1989
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Masatake Iwasaki
Priority: Minor
 Attachments: YARN-1989-0.patch


 Adding shell scripts to launch multiple servers on localhost for test and 
 debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1986) After upgrade from 2.2.0 to 2.4.0, NPE on first job start.

2014-05-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992950#comment-13992950
 ] 

Tsuyoshi OZAWA commented on YARN-1986:
--

+1 (non-binding). Let's wait for [~sandyr]'s comment.

 After upgrade from 2.2.0 to 2.4.0, NPE on first job start.
 --

 Key: YARN-1986
 URL: https://issues.apache.org/jira/browse/YARN-1986
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Jon Bringhurst
Assignee: Hong Zhiguo
 Attachments: YARN-1986-2.patch, YARN-1986-3.patch, 
 YARN-1986-testcase.patch, YARN-1986.patch


 After upgrade from 2.2.0 to 2.4.0, NPE on first job start.
 After RM was restarted, the job runs without a problem.
 {noformat}
 19:11:13,441 FATAL ResourceManager:600 - Error in handling event type 
 NODE_UPDATE to the scheduler
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:462)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:714)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:743)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:104)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591)
   at java.lang.Thread.run(Thread.java:744)
 19:11:13,443  INFO ResourceManager:604 - Exiting, bbye..
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-667) Data persisted in RM should be versioned

2014-05-12 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-667:


Summary: Data persisted in RM should be versioned  (was: Data persisted by 
YARN daemons should be versioned)

 Data persisted in RM should be versioned
 

 Key: YARN-667
 URL: https://issues.apache.org/jira/browse/YARN-667
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.0.4-alpha
Reporter: Siddharth Seth
Assignee: Junping Du

 Includes data persisted for RM restart, NodeManager directory structure and 
 the Aggregated Log Format.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994886#comment-13994886
 ] 

Vinod Kumar Vavilapalli commented on YARN-556:
--

Tx for the community update, Karthik.

Also, Jian/Anubhav, can you both please file all the known sub-tasks and assign 
to yourselves only the ones you are working on right away? Other folks like 
[~ozawa] and [~rohithsharma] have repeatedly expressed interest in working on 
this feature. It'll be great to find stuff for everyone instead of creating all 
the tickets and assigning them to the two of you. Thanks.

[~ozawa] and [~rohithsharma], let others know what you specifically want to 
work on, if you have something in mind.

bq.  6. clustertimestamp is added to containerId so that containerId after RM 
restart do not clash with containerId before (as the containerId counter resets 
to zero in memory)
I totally missed this line item. Can you provide more detail on what the 
problem is and what the proposal is? What is done in the prototype patch is a 
major compatibility issue; I'd like to avoid it if we can.
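
For context, a hypothetical illustration of the clash described in item 6: the 
in-memory container counter restarts at zero after an RM restart, so the same 
id can be minted twice unless the id also carries a restart-specific component. 
The format helper below is made up for illustration and is not the real 
ContainerId layout decision:
{code}
// Hypothetical sketch: including a restart-specific component (e.g. a cluster
// timestamp or epoch) keeps post-restart ids distinct even though the counter
// resets to zero in memory.
public class ContainerIdClashSketch {
  static String containerId(long clusterTs, int appId, int attempt, int counter) {
    return String.format("container_%d_%04d_%02d_%06d", clusterTs, appId, attempt, counter);
  }

  public static void main(String[] args) {
    String before = containerId(1399912169384L, 1, 1, 3); // minted before restart
    String after  = containerId(1399999999999L, 1, 1, 3); // counter reset, new timestamp
    System.out.println("clash: " + before.equals(after));  // false with the new component
  }
}
{code}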

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2045) Data persisted in NM should be versioned

2014-05-12 Thread Junping Du (JIRA)
Junping Du created YARN-2045:


 Summary: Data persisted in NM should be versioned
 Key: YARN-2045
 URL: https://issues.apache.org/jira/browse/YARN-2045
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Junping Du
Assignee: Junping Du


As a task split out from YARN-667, we want to add version info to NM-related 
data, including:
- NodeManager local LevelDB state
- NodeManager directory structure




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-667) Data persisted in RM should be versioned

2014-05-12 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994984#comment-13994984
 ] 

Junping Du commented on YARN-667:
-

Filed YARN-2045 to address NM part data.

 Data persisted in RM should be versioned
 

 Key: YARN-667
 URL: https://issues.apache.org/jira/browse/YARN-667
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.0.4-alpha
Reporter: Siddharth Seth
Assignee: Junping Du

 Includes data persisted for RM restart, NodeManager directory structure and 
 the Aggregated Log Format.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-182) Unnecessary Container killed by the ApplicationMaster message for successful containers

2014-05-12 Thread Deepak Kumar V (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994967#comment-13994967
 ] 

Deepak Kumar V commented on YARN-182:
-

Hello,

I am seeing this in the Note section of each map task. Each map task's state is 
SUCCEEDED.

Message (task attempt row from the web UI):
attempt_1399912169384_0001_m_13_0 | 100.00 | SUCCEEDED | map |
datanode-9-281920.slc01.dev.company.com:8042 (logs) |
Mon, 12 May 2014 16:33:33 GMT | Mon, 12 May 2014 16:34:26 GMT | 52sec |
Container killed by the ApplicationMaster. Container killed on request.
Exit code is 143. Container exited with a non-zero exit code 143.

Hadoop Details:
NameNode 'namenode-284133.slc01.dev.company.com:8020' (active)

Started:Tue May 06 16:18:04 GMT-07:00 2014
Version:2.4.0.2.1.1.0-385, 68ceccf06a4441273e81a5ec856d41fc7e11c792
Compiled:   2014-04-16T21:24Z by jenkins from (no branch)
Cluster ID: CID-fb86b3cf-7787-4c67-998f-24f00e43c137
Block Pool ID:  BP-1163369527-10.65.216.196-1399412949036

 Unnecessary Container killed by the ApplicationMaster message for 
 successful containers
 -

 Key: YARN-182
 URL: https://issues.apache.org/jira/browse/YARN-182
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.1-alpha
Reporter: zhengqiu cai
Assignee: Omkar Vinit Joshi
  Labels: hadoop, usability
 Attachments: Log.txt


 I was running wordcount and the resourcemanager web UI shown the status as 
 FINISHED SUCCEEDED, but the log shown Container killed by the 
 ApplicationMaster



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-667) Data persisted in RM should be versioned

2014-05-12 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994975#comment-13994975
 ] 

Junping Du commented on YARN-667:
-

Limiting the scope of this JIRA to data persisted by the RM, i.e. the 
RMStateStore. Will file a separate JIRA to address data persisted by the NM.

 Data persisted in RM should be versioned
 

 Key: YARN-667
 URL: https://issues.apache.org/jira/browse/YARN-667
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.0.4-alpha
Reporter: Siddharth Seth
Assignee: Junping Du

 Includes data persisted for RM restart, NodeManager directory structure and 
 the Aggregated Log Format.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy

2014-05-12 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994957#comment-13994957
 ] 

Carlo Curino commented on YARN-2022:


Sunil, I am travelling abroad until the 26th (please forgive delays)... I could 
only skim the patch from a mobile device. It looks reasonable; a concern I have 
is that we rely on a user-set Priority to choose whether to preempt or not. 
Unless there are checks in place preventing the user from abusing this value, 
this is egregiously gameable (set all my containers to AM priority and get away 
with murder).

Also, I thought more about the possible corner cases after conversations with 
Chris Douglas and Mayank: we should keep an eye out for the max percentage of 
resources dedicated to AMs... we should protect the AMs of earlier 
(higher-priority) applications only up to the max % of AM resources we can 
allocate in the queue, and at the very least not protect the AMs past that 
point. A similar check should be in place for userLimitFactor. Without this it 
is entirely possible that a queue is wedged with 100% AMs, or that a user has 
more resources in its AMs than they deserve (and is systematically skipped, 
even if the cluster is empty). We have seen some of this in particularly 
extreme test cases (epsilon-size queues, many apps moved to a queue, etc.).
Please share your thoughts on this... 

 Preempting an Application Master container can be kept as least priority when 
 multiple applications are marked for preemption by 
 ProportionalCapacityPreemptionPolicy
 -

 Key: YARN-2022
 URL: https://issues.apache.org/jira/browse/YARN-2022
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Sunil G
Assignee: Sunil G
 Attachments: Yarn-2022.1.patch


 Cluster Size = 16GB [2NM's]
 Queue A Capacity = 50%
 Queue B Capacity = 50%
 Consider there are 3 applications running in Queue A which has taken the full 
 cluster capacity. 
 J1 = 2GB AM + 1GB * 4 Maps
 J2 = 2GB AM + 1GB * 4 Maps
 J3 = 2GB AM + 1GB * 2 Maps
 Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
 Currently in this scenario, job J3 will get killed, including its AM.
 It would be better if AM containers were given the least preemption priority 
 among multiple applications. In this same scenario, map tasks from J3 and J2 
 could be preempted instead. Later, when the cluster is free, maps can be 
 allocated to these jobs.
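
A hedged sketch of the requested ordering, with a made-up Candidate type 
standing in for the scheduler's container representation (illustrative only, 
not the attached patch):
{code}
import java.util.Comparator;

// Hypothetical illustration: when choosing preemption victims, order
// candidates so that AM containers are taken last.
public class PreferNonAmVictims implements Comparator<PreferNonAmVictims.Candidate> {

  public static class Candidate {
    final String containerId;
    final boolean isAmContainer;
    Candidate(String containerId, boolean isAmContainer) {
      this.containerId = containerId;
      this.isAmContainer = isAmContainer;
    }
  }

  @Override
  public int compare(Candidate a, Candidate b) {
    // non-AM containers sort first, AM containers last
    return Boolean.compare(a.isAmContainer, b.isAmContainer);
  }
}
{code}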



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994877#comment-13994877
 ] 

Vinod Kumar Vavilapalli commented on YARN-1515:
---

YARN-445 BTW is today focusing on exposing a signalling interface on the 
ResourceManager. It seems like we can simply expose the same API as part of 
ContainerManagement and get most of this functionality with minimal changes.

 Ability to dump the container threads and stop the containers in a single RPC
 -

 Key: YARN-1515
 URL: https://issues.apache.org/jira/browse/YARN-1515
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, 
 YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, 
 YARN-1515.v06.patch, YARN-1515.v07.patch


 This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for 
 timed-out task attempts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994866#comment-13994866
 ] 

Vinod Kumar Vavilapalli commented on YARN-1515:
---

We can still implement the 'single RPC' functionality you wanted, by making the 
signal API take in a list of signals and optional time-intervals in between.

 Ability to dump the container threads and stop the containers in a single RPC
 -

 Key: YARN-1515
 URL: https://issues.apache.org/jira/browse/YARN-1515
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, 
 YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, 
 YARN-1515.v06.patch, YARN-1515.v07.patch


 This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for 
 timed-out task attempts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-05-12 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2010:
---

Attachment: yarn-2010-2.patch

New patch with the following changes: 
# Noticed that RMAppManager#recoverApplication wasn't failing running 
applications in all the code paths corresponding to failed recovery. Fixed that 
and cleaned it up further.
# Changed the config name to be shorter. 
# Added comments to make sure we document why we are doing what we are doing.

 RM can't transition to active if it can't recover an app attempt
 

 Key: YARN-2010
 URL: https://issues.apache.org/jira/browse/YARN-2010
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: Rohith
Priority: Critical
 Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch


 If the RM fails to recover an app attempt, it won't come up. We should make 
 it more resilient.
 Specifically, the underlying error is that the app was submitted before 
 Kerberos security got turned on. Makes sense for the app to fail in this 
 case. But YARN should still start.
 {noformat}
 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election 
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
 Active 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
  
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
 at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
 transitioning to Active mode 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
  
 ... 4 more 
 Caused by: org.apache.hadoop.service.ServiceStateException: 
 org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
  
 ... 5 more 
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
 ... 8 more 
 Caused by: java.lang.IllegalArgumentException: Missing argument 
 at javax.crypto.spec.SecretKeySpec.init(SecretKeySpec.java:93) 
 at 
 org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
  
 ... 13 more 
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2042) String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp()

2014-05-12 Thread Ted Yu (JIRA)
Ted Yu created YARN-2042:


 Summary: String shouldn't be compared using == in 
QueuePlacementRule#NestedUserQueue#getQueueForApp()
 Key: YARN-2042
 URL: https://issues.apache.org/jira/browse/YARN-2042
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ted Yu
Priority: Minor


{code}
  if (queueName != null && queueName != "") {
{code}
queueName.isEmpty() should be used instead of comparing against "".
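
A hedged sketch of the suggested fix (illustrative, not the committed change):
{code}
// Use isEmpty() instead of reference comparison with ==, which only checks
// whether the two String objects are the same instance.
if (queueName != null && !queueName.isEmpty()) {
  // ... use queueName ...
}
{code}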



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1515:
--

Target Version/s: 2.5.0  (was: 2.4.0)

 Ability to dump the container threads and stop the containers in a single RPC
 -

 Key: YARN-1515
 URL: https://issues.apache.org/jira/browse/YARN-1515
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, 
 YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, 
 YARN-1515.v06.patch, YARN-1515.v07.patch


 This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for 
 timed-out task attempts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2046) Out of band heartbeats are sent only on container kill and possibly too early

2014-05-12 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-2046:


 Summary: Out of band heartbeats are sent only on container kill 
and possibly too early
 Key: YARN-2046
 URL: https://issues.apache.org/jira/browse/YARN-2046
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0, 0.23.10
Reporter: Jason Lowe


[~mingma] pointed out in the review discussion for MAPREDUCE-5465 that the NM 
is currently sending out of band heartbeats only when stopContainer is called.  
In addition those heartbeats might be sent too early because the container kill 
event is asynchronously posted then the heartbeat monitor is notified.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995165#comment-13995165
 ] 

Karthik Kambatla commented on YARN-556:
---

Oh. Forgot to mention that. [~adhoot] offered to split up the prototype into 
multiple patches, one for each of the sub-tasks. If I understand right, his 
prototype covers almost all the sub-tasks already created.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1918) Typo in description and error message for 'yarn.resourcemanager.cluster-id'

2014-05-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995173#comment-13995173
 ] 

Tsuyoshi OZAWA commented on YARN-1918:
--

Thanks for your contribution, [~analog.sony]. It looks good to me (non-binding). 
Please wait for a review by committers.

 Typo in description and error message for 'yarn.resourcemanager.cluster-id'
 ---

 Key: YARN-1918
 URL: https://issues.apache.org/jira/browse/YARN-1918
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.3.0
Reporter: Devaraj K
Assignee: Anandha L Ranganathan
Priority: Trivial
  Labels: newbie
 Attachments: YARN-1918.1.patch


 1.  In yarn-default.xml
 {code:xml}
 <property>
   <description>Name of the cluster. In a HA setting,
     this is used to ensure the RM participates in leader
     election fo this cluster and ensures it does not affect
     other clusters</description>
   <name>yarn.resourcemanager.cluster-id</name>
   <!--<value>yarn-cluster</value>-->
 </property>
 {code}
 Here the line 'election fo this cluster and ensures it does not affect' 
 should be replaced with  'election for this cluster and ensures it does not 
 affect'.
 2. 
 {code:xml}
 org.apache.hadoop.HadoopIllegalArgumentException: Configuration doesn't 
 specifyyarn.resourcemanager.cluster-id
   at 
 org.apache.hadoop.yarn.conf.YarnConfiguration.getClusterId(YarnConfiguration.java:1336)
 {code}
 In the above exception message, a space is missing between the message and 
 the configuration name.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1701) Improve default paths of timeline store and generic history store

2014-05-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994348#comment-13994348
 ] 

Hudson commented on YARN-1701:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #560 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/560/])
YARN-1701. Improved default paths of the timeline store and the generic history 
store. Contributed by Tsuyoshi Ozawa. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593481)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml


 Improve default paths of timeline store and generic history store
 -

 Key: YARN-1701
 URL: https://issues.apache.org/jira/browse/YARN-1701
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.0
Reporter: Gera Shegalov
Assignee: Tsuyoshi OZAWA
 Fix For: 2.4.1

 Attachments: YARN-1701.3.patch, YARN-1701.v01.patch, 
 YARN-1701.v02.patch


 When I enable AHS via yarn.ahs.enabled, the app history is still not visible 
 in AHS webUI. This is due to NullApplicationHistoryStore as 
 yarn.resourcemanager.history-writer.class. It would be good to have just one 
 key to enable basic functionality.
 yarn.ahs.fs-history-store.uri uses {code}${hadoop.log.dir}{code}, which is a 
 local file system location. However, FileSystemApplicationHistoryStore uses 
 DFS by default.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1474) Make schedulers services

2014-05-12 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1474:
-

Attachment: YARN-1474.11.patch

Added a missing file so that the patch compiles.

 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, 
 YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, 
 YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2027) YARN ignores host-specific resource requests

2014-05-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995262#comment-13995262
 ] 

Bikas Saha commented on YARN-2027:
--

Was the relaxLocality flag set to false in order to make a hard constraint for 
the node? 
Or is the JIRA stating that even soft locality constraints (where YARN is 
allowed to relax the locality from node to rack to *) are also not working? Soft 
locality would need delay scheduling to be enabled, and that needs the configs 
that Sandy mentioned.
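
For reference, a hedged sketch of a node-specific request with relaxLocality 
disabled via AMRMClient (class name and values are illustrative):
{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.util.Records;

// Illustrative only: ask for a container on a specific host and disable
// falling back to rack/any by setting relaxLocality to false.
public class NodeLocalRequestExample {
  public static ContainerRequest newRequest(String host, int memMb, int vcores) {
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(memMb);
    capability.setVirtualCores(vcores);
    Priority priority = Records.newRecord(Priority.class);
    priority.setPriority(0);
    return new ContainerRequest(capability,
        new String[] { host },   // nodes
        null,                    // racks
        priority,
        false);                  // relaxLocality = false: hard node constraint
  }
}
{code}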

 YARN ignores host-specific resource requests
 

 Key: YARN-2027
 URL: https://issues.apache.org/jira/browse/YARN-2027
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.4.0
 Environment: RHEL 6.1
 YARN 2.4
Reporter: Chris Riccomini

 YARN appears to be ignoring host-level ContainerRequests.
 I am creating a container request with code that pretty closely mirrors the 
 DistributedShell code:
 {code}
   protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) {
     info("Requesting %d container(s) with %dmb of memory" format (containers, memMb))
     val capability = Records.newRecord(classOf[Resource])
     val priority = Records.newRecord(classOf[Priority])
     priority.setPriority(0)
     capability.setMemory(memMb)
     capability.setVirtualCores(cpuCores)
     // Specifying a host in the String[] host parameter here seems to do
     // nothing. Setting relaxLocality to false also doesn't help.
     (0 until containers).foreach(idx => amClient.addContainerRequest(new
       ContainerRequest(capability, null, null, priority)))
   }
 {code}
 When I run this code with a specific host in the ContainerRequest, YARN does 
 not honor the request. Instead, it puts the container on an arbitrary host. 
 This appears to be true for both the FifoScheduler and the CapacityScheduler.
 Currently, we are running the CapacityScheduler with the following settings:
 {noformat}
 <configuration>
   <property>
     <name>yarn.scheduler.capacity.maximum-applications</name>
     <value>1</value>
     <description>
       Maximum number of applications that can be pending and running.
     </description>
   </property>
   <property>
     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
     <value>0.1</value>
     <description>
       Maximum percent of resources in the cluster which can be used to run
       application masters i.e. controls number of concurrent running
       applications.
     </description>
   </property>
   <property>
     <name>yarn.scheduler.capacity.resource-calculator</name>
     <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
     <description>
       The ResourceCalculator implementation to be used to compare
       Resources in the scheduler.
       The default i.e. DefaultResourceCalculator only uses Memory while
       DominantResourceCalculator uses dominant-resource to compare
       multi-dimensional resources such as Memory, CPU etc.
     </description>
   </property>
   <property>
     <name>yarn.scheduler.capacity.root.queues</name>
     <value>default</value>
     <description>
       The queues at the this level (root is the root queue).
     </description>
   </property>
   <property>
     <name>yarn.scheduler.capacity.root.default.capacity</name>
     <value>100</value>
     <description>Samza queue target capacity.</description>
   </property>
   <property>
     <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
     <value>1</value>
     <description>
       Default queue user limit a percentage from 0.0 to 1.0.
     </description>
   </property>
   <property>
     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
     <value>100</value>
     <description>
       The maximum capacity of the default queue.
     </description>
   </property>
   <property>
     <name>yarn.scheduler.capacity.root.default.state</name>
     <value>RUNNING</value>
     <description>
       The state of the default queue. State can be one of RUNNING or STOPPED.
     </description>
   </property>
   <property>
     <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
     <value>*</value>
     <description>
       The ACL of who can submit jobs to the default queue.
     </description>
   </property>
   <property>
     <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
     <value>*</value>
     <description>
       The ACL of who can administer jobs on the default queue.
     </description>
   </property>
   <property>
     <name>yarn.scheduler.capacity.node-locality-delay</name>
     <value>40</value>
     <description>
       Number of missed scheduling opportunities after which the CapacityScheduler
       attempts to schedule rack-local containers.
       Typically this should be set to number of nodes in the cluster, By
       default is setting approximately number of nodes in 

[jira] [Created] (YARN-2036) Document yarn.resourcemanager.hostname in ClusterSetup

2014-05-12 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created YARN-2036:
--

 Summary: Document yarn.resourcemanager.hostname in ClusterSetup
 Key: YARN-2036
 URL: https://issues.apache.org/jira/browse/YARN-2036
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Ray Chiang
Priority: Minor
 Attachments: YARN2036-01.patch, YARN2036-02.patch

ClusterSetup doesn't talk about yarn.resourcemanager.hostname - most people 
should just be able to use that directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-12 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995290#comment-13995290
 ] 

Anubhav Dhoot commented on YARN-2001:
-

Won't killing the containers on RM restart/failover defeat the purpose of the 
work-preserving effort?

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He

 After failover, the RM may require a certain threshold to determine whether 
 it's safe to make scheduling decisions and start accepting new container 
 requests from AMs. The threshold could be a certain number of nodes, i.e. the 
 RM waits until a certain number of nodes have joined before accepting new 
 container requests. Or it could simply be a timeout; only after the timeout 
 does the RM accept new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2046) Out of band heartbeats are sent only on container kill and possibly too early

2014-05-12 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995147#comment-13995147
 ] 

Jason Lowe commented on YARN-2046:
--

We should consider sending out of band heartbeats after a container completes 
rather than when a container is killed.  For a cluster running MapReduce this 
should be almost equivalent in terms of number of OOB heartbeats sent since the 
MR AM always kills completed task attempts until MAPREDUCE-5465 is addressed.

 Out of band heartbeats are sent only on container kill and possibly too early
 -

 Key: YARN-2046
 URL: https://issues.apache.org/jira/browse/YARN-2046
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe

 [~mingma] pointed out in the review discussion for MAPREDUCE-5465 that the NM 
 is currently sending out of band heartbeats only when stopContainer is 
 called.  In addition those heartbeats might be sent too early because the 
 container kill event is asynchronously posted and then the heartbeat monitor is 
 notified.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-2039) Better reporting of finished containers to AMs

2014-05-12 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla resolved YARN-2039.


Resolution: Duplicate

Thanks for pointing that out, Bikas. Resolving as duplicate.

 Better reporting of finished containers to AMs
 --

 Key: YARN-2039
 URL: https://issues.apache.org/jira/browse/YARN-2039
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Priority: Critical

 On RM restart, we shouldn't lose information about finished containers. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored

2014-05-12 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995361#comment-13995361
 ] 

Jian He commented on YARN-2016:
---

Adding unit tests for all records would be another big effort. Junping, you can 
open a new jira to discuss this if needed.

Committing this.

 Yarn getApplicationRequest start time range is not honored
 --

 Key: YARN-2016
 URL: https://issues.apache.org/jira/browse/YARN-2016
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Venkat Ranganathan
Assignee: Junping Du
 Attachments: YARN-2016.patch, YarnTest.java


 When we query for the previous applications by creating an instance of 
 GetApplicationsRequest and setting the start time range and application tag, 
 we see that the start range provided is not honored and all applications with 
 the tag are returned
 Attaching a reproducer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995316#comment-13995316
 ] 

Bikas Saha commented on YARN-2001:
--

I think the offline discussion agreement was that there would be a threshold 
for NMs to resync. After that threshold the scheduler would be started. After 
that, the NMs have until the NM heartbeat expiry interval to resync. After the 
NM expiry interval, the NMs are considered lost (consistent with current 
behavior).

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He

 After failover, RM may require a certain threshold to determine whether it’s 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does RM accept 
 new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1982) Rename the daemon name to timelineserver

2014-05-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995404#comment-13995404
 ] 

Hudson commented on YARN-1982:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5603/])
YARN-1982. Renamed the daemon name to be TimelineServer instead of History 
Server and deprecated the old usage. Contributed by Zhijie Shen. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593748)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn.cmd
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/conf/yarn-env.sh


 Rename the daemon name to timelineserver
 

 Key: YARN-1982
 URL: https://issues.apache.org/jira/browse/YARN-1982
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 3.0.0, 2.4.0
Reporter: Zhijie Shen
Assignee: Zhijie Shen
  Labels: cli
 Fix For: 2.5.0

 Attachments: YARN-1982.1.patch


 Nowadays, it's confusing that we call the new component timeline server, but 
 we use
 {code}
 yarn historyserver
 yarn-daemon.sh start historyserver
 {code}
 to start the daemon.
 Before the confusion keeps being propagated, we'd better modify the command 
 line asap.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1962) Timeline server is enabled by default

2014-05-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995407#comment-13995407
 ] 

Hudson commented on YARN-1962:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5603/])
YARN-1962. Changed Timeline Service client configuration to be off by default 
given the non-readiness of the feature yet. Contributed by Mohammad Kamrul 
Islam. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593750)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml


 Timeline server is enabled by default
 -

 Key: YARN-1962
 URL: https://issues.apache.org/jira/browse/YARN-1962
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.0
Reporter: Mohammad Kamrul Islam
Assignee: Mohammad Kamrul Islam
 Fix For: 2.4.1

 Attachments: YARN-1962.1.patch, YARN-1962.2.patch


 Since the Timeline server is not mature and secured yet, enabling it by default 
 might create some confusion.
 We were playing with 2.4.0 and found a lot of exceptions in the distributed 
 shell example related to connection refused errors. Btw, we didn't run the TS 
 because it is not secured yet.
 Although it is possible to explicitly turn it off through yarn-site config, in 
 my opinion this extra change for this new service is not worth it at this 
 point.
 This JIRA is to turn it off by default.
 If there is an agreement, I can put a simple patch for this.
 {noformat}
 14/04/17 23:24:33 ERROR impl.TimelineClientImpl: Failed to get the response 
 from the timeline server.
 com.sun.jersey.api.client.ClientHandlerException: java.net.ConnectException: 
 Connection refused
   at 
 com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
   at com.sun.jersey.api.client.Client.handle(Client.java:648)
   at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
   at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
   at 
 com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
   at 
 org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingEntities(TimelineClientImpl.java:131)
   at 
 org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:104)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1072)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:515)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:281)
 Caused by: java.net.ConnectException: Connection refused
   at java.net.PlainSocketImpl.socketConnect(Native Method)
   at 
 java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
   at 
 java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198)
   at 
 java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
   at java.net.Socket.connect(Socket.java:579)
   at java.net.Socket.connect(Socket.java:528)
   at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
   at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
   at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
   at sun.net.www.http.HttpClient.in14/04/17 23:24:33 ERROR 
 impl.TimelineClientImpl: Failed to get the response from the timeline server.
 com.sun.jersey.api.client.ClientHandlerException: java.net.ConnectException: 
 Connection refused
   at 
 com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
   at com.sun.jersey.api.client.Client.handle(Client.java:648)
   at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
   at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
   at 
 com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
   at 
 org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingEntities(TimelineClientImpl.java:131)
   at 
 org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:104)
   at 
 

[jira] [Commented] (YARN-2036) Document yarn.resourcemanager.hostname in ClusterSetup

2014-05-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995405#comment-13995405
 ] 

Hudson commented on YARN-2036:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5603/])
YARN-2036. Document yarn.resourcemanager.hostname in ClusterSetup (Ray Chiang 
via Sandy Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593631)
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/site/apt/ClusterSetup.apt.vm
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt


 Document yarn.resourcemanager.hostname in ClusterSetup
 --

 Key: YARN-2036
 URL: https://issues.apache.org/jira/browse/YARN-2036
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Ray Chiang
Priority: Minor
 Fix For: 2.5.0

 Attachments: YARN2036-01.patch, YARN2036-02.patch


 ClusterSetup doesn't talk about yarn.resourcemanager.hostname - most people 
 should just be able to use that directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1975) Used resources shows escaped html in CapacityScheduler and FairScheduler page

2014-05-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995408#comment-13995408
 ] 

Hudson commented on YARN-1975:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5603/])
YARN-1975. Fix yarn application CLI to print the scheme of the tracking url of 
failed/killed applications. Contributed by Junping Du (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593874)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java


 Used resources shows escaped html in CapacityScheduler and FairScheduler page
 -

 Key: YARN-1975
 URL: https://issues.apache.org/jira/browse/YARN-1975
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0, 2.4.0
Reporter: Nathan Roberts
Assignee: Mit Desai
 Fix For: 3.0.0, 2.4.1

 Attachments: YARN-1975.patch, screenshot-1975.png


 Used resources displays as &lt;memory:, vCores&gt; with capacity 
 scheduler



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2032) Implement a scalable, available TimelineStore using HBase

2014-05-12 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-2032:


Attachment: YARN-2032-branch-2-1.patch

Updating patch for branch-2

Thanks,
Mayank

 Implement a scalable, available TimelineStore using HBase
 -

 Key: YARN-2032
 URL: https://issues.apache.org/jira/browse/YARN-2032
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Mayank Bansal
 Attachments: YARN-2032-branch-2-1.patch


 As discussed on YARN-1530, we should pursue implementing a scalable, 
 available Timeline store using HBase.
 One goal is to reuse most of the code from the levelDB Based store - 
 YARN-1635.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-05-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995438#comment-13995438
 ] 

Karthik Kambatla commented on YARN-2033:


Thanks Vinod. Would like to hear your thoughts on the following:
What are the perceived scalability requirements of the history store and 
timeline store? I would think the timeline store might have to support 
storing a lot more information than the history store. In that case, one might 
want to keep them separate? 

 Investigate merging generic-history into the Timeline Store
 ---

 Key: YARN-2033
 URL: https://issues.apache.org/jira/browse/YARN-2033
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli

 Having two different stores isn't amenable to generic insights on what's 
 happening with applications. This is to investigate porting generic-history 
 into the Timeline Store.
 One goal is to try and retain most of the client side interfaces as close to 
 what we have today.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995326#comment-13995326
 ] 

Tsuyoshi OZAWA commented on YARN-2001:
--

[~adhoot], you're basically correct. I meant that if the epoch gap between RM and 
NM is too large for the RM to handle, it can be killed. That saves memory usage on the RM.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He

 After failover, RM may require a certain threshold to determine whether it’s 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does RM accept 
 new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995328#comment-13995328
 ] 

Anubhav Dhoot commented on YARN-556:


bq. clustertimestamp is added to containerId so that containerId after RM 
restart do not clash with containerId before (as the containerId counter resets 
to zero in memory). 

The problem is that the containerId is currently composed of ApplicationAttemptId + 
int. The int part comes from an in-memory containerIdCounter in 
AppSchedulingInfo. This gets reset after an RM restart. Without any changes, the 
containerIds for containers allocated after restart would clash with existing 
containerIds. 
The prototype proposal is to make it ApplicationAttemptId + uniqueid + int, 
where the uniqueid can be a timestamp set by the RM. I feel containerId should be 
an opaque string that YARN app developers don't take a dependency on. Also, if 
we used protobuf serialization/deserialization rules everywhere, we could deal 
with compatibility changes across different YARN code versions. 
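
To make the proposal concrete, here is a rough, purely illustrative sketch of 
the ApplicationAttemptId + uniqueId + int composition; the class and field 
names are invented and are not the real ContainerId records:
{code}
// Rough sketch only: illustrates ApplicationAttemptId + uniqueId + int,
// where uniqueId is an RM-chosen value (e.g. a timestamp) that changes on restart.
// Field and method names here are hypothetical, not the actual YARN records.
public class ContainerIdSketch {
  private final String appAttemptId; // e.g. "appattempt_1399922020132_0001_000001"
  private final long rmUniqueId;     // e.g. RM start timestamp, bumps on restart
  private final int counter;         // in-memory counter, resets to zero on restart

  public ContainerIdSketch(String appAttemptId, long rmUniqueId, int counter) {
    this.appAttemptId = appAttemptId;
    this.rmUniqueId = rmUniqueId;
    this.counter = counter;
  }

  // Apps should treat the result as an opaque string and never parse it.
  @Override
  public String toString() {
    return appAttemptId + "_" + rmUniqueId + "_" + counter;
  }

  public static void main(String[] args) {
    System.out.println(new ContainerIdSketch(
        "appattempt_1399922020132_0001_000001", System.currentTimeMillis(), 1));
  }
}
{code}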

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs

2014-05-12 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla reassigned YARN-1913:
--

Assignee: Karthik Kambatla

 With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
 --

 Key: YARN-1913
 URL: https://issues.apache.org/jira/browse/YARN-1913
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: Karthik Kambatla

 It's possible to deadlock a cluster by submitting many applications at once, 
 and have all cluster resources taken up by AMs.
 One solution is for the scheduler to limit resources taken up by AMs, as a 
 percentage of total cluster resources, via a maxApplicationMasterShare 
 config.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart

2014-05-12 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995354#comment-13995354
 ] 

Jian He commented on YARN-1372:
---

An alternative would be to make NM remember the current containers in memory 
until the application is completed. On each re-register NM sends across the 
whole list of container statuses. Typically, each NM holds tens of containers 
in memory which shouldn't be much memory overhead, as compared to RM which 
holds all the active containers in the cluster.  This also avoids protocol 
changes.
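
A small hypothetical sketch of that alternative, with plain strings standing in 
for the real container status records (not actual NM code):
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the alternative above: the NM keeps container statuses
// in memory per application and replays them all on re-register.
public class NmStatusCache {
  private final Map<String, List<String>> statusesByApp = new HashMap<>();

  void recordStatus(String appId, String containerStatus) {
    statusesByApp.computeIfAbsent(appId, k -> new ArrayList<>()).add(containerStatus);
  }

  void applicationCompleted(String appId) {   // only now is the info dropped
    statusesByApp.remove(appId);
  }

  List<String> statusesForReRegister() {      // sent in whole on every re-register
    List<String> all = new ArrayList<>();
    statusesByApp.values().forEach(all::addAll);
    return all;
  }

  public static void main(String[] args) {
    NmStatusCache cache = new NmStatusCache();
    cache.recordStatus("application_1_0001", "container_1_0001_01_000002:COMPLETE");
    System.out.println(cache.statusesForReRegister());
  }
}
{code}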

 Ensure all completed containers are reported to the AMs across RM restart
 -

 Key: YARN-1372
 URL: https://issues.apache.org/jira/browse/YARN-1372
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot

 Currently the NM informs the RM about completed containers and then removes 
 those containers from the RM notification list. The RM passes on that 
 completed container information to the AM and the AM pulls this data. If the 
 RM dies before the AM pulls this data then the AM may not be able to get this 
 information again. To fix this, NM should maintain a separate list of such 
 completed container notifications sent to the RM. After the AM has pulled the 
 containers from the RM then the RM will inform the NM about it and the NM can 
 remove the completed container from the new list. Upon re-register with the 
 RM (after RM restart) the NM should send the entire list of completed 
 containers to the RM along with any other containers that completed while the 
 RM was dead. This ensures that the RM can inform the AMs about all completed 
 containers. Some container completions may be reported more than once since 
 the AM may have pulled the container but the RM may die before notifying the 
 NM about the pull.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2042) String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp()

2014-05-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995485#comment-13995485
 ] 

Hadoop QA commented on YARN-2042:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12644357/YARN-2042.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3732//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3732//console

This message is automatically generated.

 String shouldn't be compared using == in 
 QueuePlacementRule#NestedUserQueue#getQueueForApp()
 

 Key: YARN-2042
 URL: https://issues.apache.org/jira/browse/YARN-2042
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ted Yu
Assignee: Chen He
Priority: Minor
 Attachments: YARN-2042.patch


 {code}
   if (queueName != null && queueName != "") {
 {code}
 queueName.isEmpty() should be used instead of comparing against ""



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995213#comment-13995213
 ] 

Tsuyoshi OZAWA commented on YARN-2001:
--

[~leftnoteasy], my idea is creating a ClusterId-space under the 
epoch (cluster-timestamp), like {{Map<Epoch, List<ClusterID>>}}.

* Epoch (saved in ZKRMStateStore and RM's memory), just an integer value.
* ClusterID (saved in RM's memory), same as the current code.

A rough sketch is as follows:

* When a new active RM starts up, Epoch in RMStateStore is incremented and RM 
sets the Epoch. ClusterID is reset to zero. 
* Heartbeats between NM and RM include Epoch: RM can distinguish old 
cluster-timestamps from the new one when NM is registered. If the Epoch is 
older than RM expects, RM can kill the containers via NM.

Please correct me if I'm wrong.
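
For discussion's sake, a tiny hypothetical sketch of that bookkeeping (none of 
these types exist in YARN; the store and heartbeat plumbing are elided):
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the epoch idea above; all names are invented.
public class EpochSketch {
  static class NodeHeartbeat {
    final String nodeId;
    final int epoch; // epoch the NM last registered under
    NodeHeartbeat(String nodeId, int epoch) { this.nodeId = nodeId; this.epoch = epoch; }
  }

  private int currentEpoch;                                        // persisted (e.g. ZKRMStateStore)
  private final Map<String, Integer> clusterIds = new HashMap<>(); // in-memory, per node

  // On a new active RM: increment the persisted epoch and reset the in-memory ids.
  void becomeActive(int storedEpoch) {
    currentEpoch = storedEpoch + 1;
    clusterIds.clear();
  }

  // On heartbeat: an older epoch means the NM still carries pre-failover containers.
  boolean shouldKillContainers(NodeHeartbeat hb) {
    return hb.epoch < currentEpoch;
  }

  public static void main(String[] args) {
    EpochSketch rm = new EpochSketch();
    rm.becomeActive(41); // epoch read from the store
    System.out.println(rm.shouldKillContainers(new NodeHeartbeat("nm1:8042", 41))); // true
  }
}
{code}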

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He

 After failover, RM may require a certain threshold to determine whether it’s 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does RM accept 
 new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1969) Fair Scheduler: Add policy for Earliest Deadline First

2014-05-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995153#comment-13995153
 ] 

Karthik Kambatla commented on YARN-1969:


Just stating the obvious: we need to add a way to specify per-job deadlines 
too. 

 Fair Scheduler: Add policy for Earliest Deadline First
 --

 Key: YARN-1969
 URL: https://issues.apache.org/jira/browse/YARN-1969
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Maysam Yabandeh
Assignee: Maysam Yabandeh

 What we are observing is that some big jobs with many allocated containers 
 are waiting for a few containers to finish. Under *fair-share scheduling* 
 however they have a low priority since there are other jobs (usually much 
 smaller, new comers) that are using resources way below their fair share, 
 hence new released containers are not offered to the big, yet 
 close-to-be-finished job. Nevertheless, everybody would benefit from an 
 unfair scheduling that offers the resource to the big job since the sooner 
 the big job finishes, the sooner it releases its many allocated resources 
 to be used by other jobs. In other words, what we require is a kind of 
 variation of *Earliest Deadline First scheduling* that takes into account 
 the number of already-allocated resources and the estimated time to finish.
 http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling
 For example, if a job is using MEM GB of memory and is expected to finish in 
 TIME minutes, the priority in scheduling would be a function p of (MEM, 
 TIME). The expected time to finish can be estimated by the AppMaster using 
 TaskRuntimeEstimator#estimatedRuntime and be supplied to the RM in the resource 
 request messages. To be less susceptible to the issue of apps gaming the 
 system, we can have this scheduling limited to *only within a queue*: i.e., 
 adding an EarliestDeadlinePolicy extends SchedulingPolicy and letting the queues 
 use it by setting the schedulingPolicy field.
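
A toy sketch of the p(MEM, TIME) ordering described in the issue above; the 
scoring function and field names are invented for illustration and are not part 
of any proposed patch:
{code}
import java.util.Comparator;

// Toy sketch of an earliest-deadline-first-like ordering; fields and the scoring
// function are invented for illustration, not the proposed SchedulingPolicy.
public class EdfSketch {
  static class App {
    final long allocatedMemMb;        // MEM: already-allocated memory
    final long estimatedMinutesLeft;  // TIME: AM-estimated time to finish
    App(long mem, long time) { allocatedMemMb = mem; estimatedMinutesLeft = time; }
  }

  // Apps that hold a lot and are about to finish sort first.
  static final Comparator<App> EDF_LIKE = Comparator.comparingDouble(
      a -> a.estimatedMinutesLeft / (double) (a.allocatedMemMb + 1));

  public static void main(String[] args) {
    App big = new App(512_000, 5);    // large, close to done
    App small = new App(4_096, 60);   // small newcomer
    System.out.println(EDF_LIKE.compare(big, small) < 0); // true: big goes first
  }
}
{code}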



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2011) Fix typo and warning in TestLeafQueue

2014-05-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995403#comment-13995403
 ] 

Hudson commented on YARN-2011:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5603/])
YARN-2011. Fix typo and warning in TestLeafQueue (Contributed by Chen He) 
(junping_du: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593804)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java


 Fix typo and warning in TestLeafQueue
 -

 Key: YARN-2011
 URL: https://issues.apache.org/jira/browse/YARN-2011
 Project: Hadoop YARN
  Issue Type: Test
Affects Versions: 2.4.0
Reporter: Chen He
Assignee: Chen He
Priority: Trivial
 Fix For: 2.5.0

 Attachments: YARN-2011-v2.patch, YARN-2011.patch


 a.assignContainers(clusterResource, node_0);
 assertEquals(2*GB, a.getUsedResources().getMemory());
 assertEquals(2*GB, app_0.getCurrentConsumption().getMemory());
 assertEquals(0*GB, app_1.getCurrentConsumption().getMemory());
 assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G
 assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G
 // Again one to user_0 since he hasn't exceeded user limit yet
 a.assignContainers(clusterResource, node_0);
 assertEquals(3*GB, a.getUsedResources().getMemory());
 assertEquals(2*GB, app_0.getCurrentConsumption().getMemory());
 assertEquals(1*GB, app_1.getCurrentConsumption().getMemory());
 assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G
 assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart

2014-05-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995516#comment-13995516
 ] 

Bikas Saha commented on YARN-1372:
--

What happens when there are 100's of long jobs running 100's of containers per 
node? Do we hold onto info about all those containers that completed long 
ago?

 Ensure all completed containers are reported to the AMs across RM restart
 -

 Key: YARN-1372
 URL: https://issues.apache.org/jira/browse/YARN-1372
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot

 Currently the NM informs the RM about completed containers and then removes 
 those containers from the RM notification list. The RM passes on that 
 completed container information to the AM and the AM pulls this data. If the 
 RM dies before the AM pulls this data then the AM may not be able to get this 
 information again. To fix this, NM should maintain a separate list of such 
 completed container notifications sent to the RM. After the AM has pulled the 
 containers from the RM then the RM will inform the NM about it and the NM can 
 remove the completed container from the new list. Upon re-register with the 
 RM (after RM restart) the NM should send the entire list of completed 
 containers to the RM along with any other containers that completed while the 
 RM was dead. This ensures that the RM can inform the AMs about all completed 
 containers. Some container completions may be reported more than once since 
 the AM may have pulled the container but the RM may die before notifying the 
 NM about the pull.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1987) Wrapper for leveldb DBIterator to aid in handling database exceptions

2014-05-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995398#comment-13995398
 ] 

Hudson commented on YARN-1987:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5603/])
YARN-1987. Wrapper for leveldb DBIterator to aid in handling database 
exceptions. (Jason Lowe via kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593757)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/pom.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/LeveldbIterator.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/utils
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/utils/TestLeveldbIterator.java


 Wrapper for leveldb DBIterator to aid in handling database exceptions
 -

 Key: YARN-1987
 URL: https://issues.apache.org/jira/browse/YARN-1987
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Fix For: 2.5.0

 Attachments: YARN-1987.patch, YARN-1987v2.patch


 Per discussions in YARN-1984 and MAPREDUCE-5652, it would be nice to have a 
 utility wrapper around leveldb's DBIterator to translate the raw 
 RuntimeExceptions it can throw into DBExceptions to make it easier to handle 
 database errors while iterating.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored

2014-05-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995498#comment-13995498
 ] 

Hadoop QA commented on YARN-2016:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12644248/YARN-2016.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3733//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3733//console

This message is automatically generated.

 Yarn getApplicationRequest start time range is not honored
 --

 Key: YARN-2016
 URL: https://issues.apache.org/jira/browse/YARN-2016
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Venkat Ranganathan
Assignee: Junping Du
 Attachments: YARN-2016.patch, YarnTest.java


 When we query for the previous applications by creating an instance of 
 GetApplicationsRequest and setting the start time range and application tag, 
 we see that the start range provided is not honored and all applications with 
 the tag are returned
 Attaching a reproducer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2048) List all of the containers of an application from the yarn web

2014-05-12 Thread Min Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Zhou updated YARN-2048:
---

Attachment: YARN-2048-trunk-v1.patch

Submit a patch on trunk

 List all of the containers of an application from the yarn web
 --

 Key: YARN-2048
 URL: https://issues.apache.org/jira/browse/YARN-2048
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, webapp
Reporter: Min Zhou
 Attachments: YARN-2048-trunk-v1.patch


 Currently, Yarn doesn't provide a way to list all of the containers of an 
 application from its web UI. This kind of information is needed by the 
 application user: they can conveniently see how many containers their 
 applications have already acquired as well as which nodes those containers were 
 launched on. They also want to view the logs of each container of an 
 application.
 One approach is to maintain a container list in RMAppImpl and expose this info 
 to the Application page. I will submit a patch soon



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2048) List all of the containers of an application from the yarn web

2014-05-12 Thread Min Zhou (JIRA)
Min Zhou created YARN-2048:
--

 Summary: List all of the containers of an application from the 
yarn web
 Key: YARN-2048
 URL: https://issues.apache.org/jira/browse/YARN-2048
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, webapp
Reporter: Min Zhou


Currently, Yarn doesn't provide a way to list all of the containers of an 
application from its web UI. This kind of information is needed by the application 
user: they can conveniently see how many containers their applications have 
already acquired as well as which node those containers were launched on. They 
also want to view the logs of each container of an application.

One approach is to maintain a container list in RMAppImpl and expose this info to 
the Application page. I will submit a patch soon



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-2040) Recover information about finished containers

2014-05-12 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-2040.
--

Resolution: Duplicate

This will be covered by YARN-1337.

 Recover information about finished containers
 -

 Key: YARN-2040
 URL: https://issues.apache.org/jira/browse/YARN-2040
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla

 The NM should store and recover information about finished containers as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2042) String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp()

2014-05-12 Thread Chen He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995505#comment-13995505
 ] 

Chen He commented on YARN-2042:
---

The change in this patch does not need to include a test.
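
For readers skimming the thread, a minimal illustration of why the == comparison 
above is wrong (queueName here is just a local string, not the actual 
QueuePlacementRule code):
{code}
// Illustration only: why == is the wrong empty-string check.
public class QueueNameCheck {
  public static void main(String[] args) {
    String queueName = new String("");          // distinct object, equal content
    System.out.println(queueName != null && queueName != "");        // true  (bug)
    System.out.println(queueName != null && !queueName.isEmpty());   // false (intended)
  }
}
{code}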

 String shouldn't be compared using == in 
 QueuePlacementRule#NestedUserQueue#getQueueForApp()
 

 Key: YARN-2042
 URL: https://issues.apache.org/jira/browse/YARN-2042
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ted Yu
Assignee: Chen He
Priority: Minor
 Attachments: YARN-2042.patch


 {code}
   if (queueName != null && queueName != "") {
 {code}
 queueName.isEmpty() should be used instead of comparing against ""



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1702) Expose kill app functionality as part of RM web services

2014-05-12 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-1702:


Attachment: apache-yarn-1702.9.patch

New patch with the following fixes -
1. Use the right call to get queue and acl managers to check permissions
2. Fix kill api to use PUT on /apps/{appid}/state to kill an app (a rough 
client-side example follows below)
3. Added documentation for the REST call.
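
The promised client-side example, hedged: the /ws/v1/cluster prefix, host/port, 
and JSON body are assumptions based on the existing RM web services layout, not 
taken from the patch:
{code}
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side example of the kill call described above.
public class KillAppExample {
  public static void main(String[] args) throws Exception {
    String appId = "application_1399497883806_0003";           // example id
    URL url = new URL("http://rmhost:8088/ws/v1/cluster/apps/" + appId + "/state");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    byte[] body = "{\"state\":\"KILLED\"}".getBytes(StandardCharsets.UTF_8);
    try (OutputStream os = conn.getOutputStream()) {
      os.write(body);                                           // request the KILLED state
    }
    System.out.println("HTTP " + conn.getResponseCode());
  }
}
{code}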

 Expose kill app functionality as part of RM web services
 

 Key: YARN-1702
 URL: https://issues.apache.org/jira/browse/YARN-1702
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, 
 apache-yarn-1702.4.patch, apache-yarn-1702.5.patch, apache-yarn-1702.7.patch, 
 apache-yarn-1702.8.patch, apache-yarn-1702.9.patch


 Expose functionality to kill an app via the ResourceManager web services API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1938) Kerberos authentication for the timeline server

2014-05-12 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1938:
--

Attachment: YARN-1938.2.patch

Made some minor touches to the last patch.

 Kerberos authentication for the timeline server
 ---

 Key: YARN-1938
 URL: https://issues.apache.org/jira/browse/YARN-1938
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-1938.1.patch, YARN-1938.2.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992962#comment-13992962
 ] 

Tsuyoshi OZAWA commented on YARN-2001:
--

{quote}
If possible, I think we should avoid changing container Id format. 
{quote}

+1, if possible. Can we add epoch (cluster timestamp) to 
ResourceTrackerService's state via heartbeat?

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He

 After failover, RM may require a certain threshold to determine whether it’s 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does RM accept 
 new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1457) YARN single node install issues on mvn clean install assembly:assembly on mapreduce project

2014-05-12 Thread Shaun Gittens (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992948#comment-13992948
 ] 

Shaun Gittens commented on YARN-1457:
-

I get virtually the same error when executing:

 mvn clean install assembly:assembly -DskipTests

except I'm running it on a CentOS VM compiling Hadoop 2.2.0 and proto 2.5.0 ...

 YARN single node install issues on mvn clean install assembly:assembly on 
 mapreduce project
 ---

 Key: YARN-1457
 URL: https://issues.apache.org/jira/browse/YARN-1457
 Project: Hadoop YARN
  Issue Type: Bug
  Components: site
Affects Versions: 2.0.5-alpha
Reporter: Rekha Joshi
Priority: Minor
  Labels: mvn
 Attachments: yarn-mvn-mapreduce.txt


 YARN single node install - 
 http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html
 On Mac OSX 10.7.3, Java 1.6, Protobuf 2.5.0 and hadoop-2.0.5-alpha.tar, mvn 
 clean install -DskipTests succeeds after a YARN fix on pom.xml (using 2.5.0 
 protobuf).
 But on hadoop-mapreduce-project, mvn install fails for tests with the below errors:
 $ mvn clean install assembly:assembly -Pnative
 errors as in the attached yarn-mvn-mapreduce.txt
 On $mvn clean install assembly:assembly  -DskipTests
 Reactor Summary:
 [INFO] 
 [INFO] hadoop-mapreduce-client ... SUCCESS [2.410s]
 [INFO] hadoop-mapreduce-client-core .. SUCCESS [13.781s]
 [INFO] hadoop-mapreduce-client-common  SUCCESS [8.486s]
 [INFO] hadoop-mapreduce-client-shuffle ... SUCCESS [0.774s]
 [INFO] hadoop-mapreduce-client-app ... SUCCESS [4.409s]
 [INFO] hadoop-mapreduce-client-hs  SUCCESS [1.618s]
 [INFO] hadoop-mapreduce-client-jobclient . SUCCESS [4.470s]
 [INFO] hadoop-mapreduce-client-hs-plugins  SUCCESS [0.561s]
 [INFO] Apache Hadoop MapReduce Examples .. SUCCESS [1.620s]
 [INFO] hadoop-mapreduce .. FAILURE [10.107s]
 [INFO] 
 
 [INFO] BUILD FAILURE
 [INFO] 
 
 [INFO] Total time: 49.606s
 [INFO] Finished at: Thu Nov 28 16:20:52 GMT+05:30 2013
 [INFO] Final Memory: 34M/118M
 [INFO] 
 
 [ERROR] Failed to execute goal 
 org.apache.maven.plugins:maven-assembly-plugin:2.3:assembly (default-cli) on 
 project hadoop-mapreduce: Error reading assemblies: No assembly descriptors 
 found. - [Help 1]
 $mvn package -Pdist -DskipTests=true -Dtar
 works
 The documentation needs to be updated for possible issues and resolutions.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995511#comment-13995511
 ] 

Tsuyoshi OZAWA commented on YARN-556:
-

If we can break the compatibility about the container id, I think Anubhav's 
approach has no problem.
If we cannot do this as [~jianhe] mentioned on YARN-2001, I think epoch idea 
[described 
here|https://issues.apache.org/jira/browse/YARN-2001?focusedCommentId=13995213page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13995213]
 might be used.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext

2014-05-12 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated YARN-2050:
--

Attachment: YARN-2050.patch

Mapreduce CLI and yarn CLI pass the configuration to LogCLIHelpers. 
LogCLIHelpers uses the same configuration to create remoteRootLogDir and 
remoteAppLogDir, etc., in dumpAllContainersLogs. The fix is to use that same 
configuration to create the FileContext (a small sketch follows the points below).

To follow up on [~jlowe]'s comments,

1. remoteAppLogDir.toUri().getScheme() returns null and 
AbstractFileSystem.createFileSystem doesn't like it if dumpAllContainersLogs 
calls FileContext.getFileContext(remoteAppLogDir.toUri()).

2. If caller of LogCLIHelpers doesn't setConf ahead of time, 
dumpAllContainersLogs will get null pointer exception when it tries to get 
remoteRootLogDir.
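
As a minimal sketch of the fix described above, assuming the caller has already 
supplied the right Configuration (the class name here is illustrative only):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative only: the FileContext must come from the same Configuration
// that produced the remote log dir, not from the defaults.
public class LogDirSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    Path remoteRootLogDir = new Path(conf.get(
        YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
        YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR));

    // Problematic: uses the default configuration, which may point at a
    // different file system than the remote log dir.
    FileContext defaultFc = FileContext.getFileContext();

    // Fix described above: build the FileContext from the same configuration.
    FileContext fc = FileContext.getFileContext(conf);
    System.out.println(fc.makeQualified(remoteRootLogDir));
  }
}
{code}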

 Fix LogCLIHelpers to create the correct FileContext
 ---

 Key: YARN-2050
 URL: https://issues.apache.org/jira/browse/YARN-2050
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ming Ma
Assignee: Ming Ma
 Attachments: YARN-2050.patch


 LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus 
 the FileContext created isn't necessarily the FileContext for remote log.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2039) Better reporting of finished containers to AMs

2014-05-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995241#comment-13995241
 ] 

Bikas Saha commented on YARN-2039:
--

Dupe of YARN-1372?

 Better reporting of finished containers to AMs
 --

 Key: YARN-2039
 URL: https://issues.apache.org/jira/browse/YARN-2039
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Priority: Critical

 On RM restart, we shouldn't lose information about finished containers. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995243#comment-13995243
 ] 

Karthik Kambatla commented on YARN-2001:


I think the epoch idea might work very nicely with the versioning work we plan 
to do as part of YARN-667.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He

 After failover, RM may require a certain threshold to determine whether it’s 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does RM accept 
 new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-766) TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk

2014-05-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995402#comment-13995402
 ] 

Hudson commented on YARN-766:
-

SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5603/])
YARN-766. TestNodeManagerShutdown in branch-2 should use Shell to form the 
output path and a format issue in trunk. (Contributed by Siddharth Seth) 
(junping_du: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593660)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java


 TestNodeManagerShutdown in branch-2 should use Shell to form the output path 
 and a format issue in trunk
 

 Key: YARN-766
 URL: https://issues.apache.org/jira/browse/YARN-766
 Project: Hadoop YARN
  Issue Type: Test
Affects Versions: 2.1.0-beta
Reporter: Siddharth Seth
Assignee: Siddharth Seth
Priority: Minor
 Attachments: YARN-766.branch-2.txt, YARN-766.trunk.txt, YARN-766.txt


 File scriptFile = new File(tmpDir, "scriptFile.sh");
 should be replaced with
 File scriptFile = Shell.appendScriptExtension(tmpDir, "scriptFile");
 to match trunk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart

2014-05-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995247#comment-13995247
 ] 

Karthik Kambatla commented on YARN-1372:


Based on offline discussion with Anubhav, Bikas, Jian and Vinod, the control 
flow for notifying the AM of finished containers should be as follows (a toy 
sketch follows the list):
# NM informs RM and holds on to the information (YARN-1336 should handle this 
as well)
# RM informs AM
# AM acks RM
# RM acks NM
# NM deletes the information
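
A toy sketch of that handshake, with invented names and plain strings standing 
in for the real container records:
{code}
import java.util.HashSet;
import java.util.Set;

// Toy illustration of the 5-step ack flow above; all names are hypothetical.
public class FinishedContainerAckFlow {
  private final Set<String> nmPending = new HashSet<>(); // NM holds on to the info
  private final Set<String> rmPending = new HashSet<>();

  void nmReportsToRm(String containerId) {        // 1. NM informs RM, keeps a copy
    nmPending.add(containerId);
    rmPending.add(containerId);
  }

  Set<String> rmInformsAm() {                     // 2. RM informs AM
    return new HashSet<>(rmPending);
  }

  void amAcksRm(Set<String> acked) {              // 3. AM acks RM, 4. RM acks NM,
    rmPending.removeAll(acked);                   // 5. NM deletes the information
    nmPending.removeAll(acked);
  }

  public static void main(String[] args) {
    FinishedContainerAckFlow flow = new FinishedContainerAckFlow();
    flow.nmReportsToRm("container_1_0001_01_000002");
    flow.amAcksRm(flow.rmInformsAm());
    System.out.println(flow.nmPending.isEmpty()); // true: safe to forget
  }
}
{code}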

 Ensure all completed containers are reported to the AMs across RM restart
 -

 Key: YARN-1372
 URL: https://issues.apache.org/jira/browse/YARN-1372
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot

 Currently the NM informs the RM about completed containers and then removes 
 those containers from the RM notification list. The RM passes on that 
 completed container information to the AM and the AM pulls this data. If the 
 RM dies before the AM pulls this data then the AM may not be able to get this 
 information again. To fix this, NM should maintain a separate list of such 
 completed container notifications sent to the RM. After the AM has pulled the 
 containers from the RM then the RM will inform the NM about it and the NM can 
 remove the completed container from the new list. Upon re-register with the 
 RM (after RM restart) the NM should send the entire list of completed 
 containers to the RM along with any other containers that completed while the 
 RM was dead. This ensures that the RM can inform the AMs about all completed 
 containers. Some container completions may be reported more than once since 
 the AM may have pulled the container but the RM may die before notifying the 
 NM about the pull.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1412) Allocating Containers on a particular Node in Yarn

2014-05-12 Thread Chris Riccomini (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993616#comment-13993616
 ] 

Chris Riccomini commented on YARN-1412:
---

Seeing this as well (YARN-2027).

 Allocating Containers on a particular Node in Yarn
 --

 Key: YARN-1412
 URL: https://issues.apache.org/jira/browse/YARN-1412
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.2.0
 Environment: centos, Hadoop 2.2.0
Reporter: gaurav gupta

 Summary of the problem: 
  If I pass the node on which I want a container and leave relax locality at its 
 default, which is true, I don't get back a container on the specified node even 
 if resources are available on that node. It doesn't matter whether I set the 
 rack or not.
 Here is the snippet of the code that I am using
 AMRMClient<ContainerRequest> amRmClient = AMRMClient.createAMRMClient();
 String host = "h1";
 Resource capability = Records.newRecord(Resource.class);
 capability.setMemory(memory);
 nodes = new String[] {host};
 // in order to request a host, we also have to request the rack
 racks = new String[] {"/default-rack"};
 List<ContainerRequest> containerRequests = new ArrayList<ContainerRequest>();
 List<ContainerId> releasedContainers = new ArrayList<ContainerId>();
 containerRequests.add(new ContainerRequest(capability, nodes, racks,
     Priority.newInstance(priority)));
 if (containerRequests.size() > 0) {
   LOG.info("Asking RM for containers: " + containerRequests);
   for (ContainerRequest cr : containerRequests) {
     LOG.info("Requested container: {}", cr.toString());
     amRmClient.addContainerRequest(cr);
   }
 }
 for (ContainerId containerId : releasedContainers) {
   LOG.info("Released container, id={}", containerId.getId());
   amRmClient.releaseAssignedContainer(containerId);
 }
 return amRmClient.allocate(0);



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1936) Secured timeline client

2014-05-12 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1936:
--

Issue Type: Bug  (was: Sub-task)
Parent: (was: YARN-1530)

 Secured timeline client
 ---

 Key: YARN-1936
 URL: https://issues.apache.org/jira/browse/YARN-1936
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Zhijie Shen





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1936) Secured timeline client

2014-05-12 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1936:
--

Description: TimelineClient should be able to talk to the timeline server 
with kerberos authentication or delegation token

 Secured timeline client
 ---

 Key: YARN-1936
 URL: https://issues.apache.org/jira/browse/YARN-1936
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 TimelineClient should be able to talk to the timeline server with kerberos 
 authentication or delegation token



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2032) Implement a scalable, available TimelineStore using HBase

2014-05-12 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995381#comment-13995381
 ] 

Ted Yu commented on YARN-2032:
--

{code}
+  <version>0.98.0-hadoop2</version>
{code}
0.98.2 is the latest release for 0.98.
Since SingleColumnValueFilter is used, you need HBASE-10850, which is in 
0.98.2.
{code}
+  protected void serviceInit(Configuration conf) throws Exception {
+    HBaseAdmin hbase = initHBase(conf);
{code}
Looks like the HBaseAdmin instance is not closed upon leaving serviceInit().
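A minimal sketch of one way to address that, assuming initHBase() returns an 
HBaseAdmin that is only needed during initialization (names taken from the 
snippet above):
{code}
protected void serviceInit(Configuration conf) throws Exception {
  HBaseAdmin hbase = initHBase(conf);
  try {
    // ... schema checks / table creation done during init ...
  } finally {
    hbase.close();  // HBaseAdmin is Closeable; release the connection here
  }
  super.serviceInit(conf);
}
{code}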



 Implement a scalable, available TimelineStore using HBase
 -

 Key: YARN-2032
 URL: https://issues.apache.org/jira/browse/YARN-2032
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Mayank Bansal
 Attachments: YARN-2032-branch-2-1.patch


 As discussed on YARN-1530, we should pursue implementing a scalable, 
 available Timeline store using HBase.
 One goal is to reuse most of the code from the levelDB Based store - 
 YARN-1635.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-12 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995812#comment-13995812
 ] 

Xuan Gong commented on YARN-1861:
-

Uploaded a new patch that explicitly throws an exception saying "Can not find 
the active RM", instead of an NPE.
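Roughly the shape of that change, as a sketch only (the variable and exception 
names may differ from the actual patch):
{code}
// Instead of dereferencing a possibly-null active RM and surfacing an NPE:
if (activeRMHAId == null) {
  throw new YarnRuntimeException("Can not find the active RM.");
}
{code}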

 Both RM stuck in standby mode when automatic failover is enabled
 

 Key: YARN-1861
 URL: https://issues.apache.org/jira/browse/YARN-1861
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
 YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch


 In our HA tests we noticed that the tests got stuck because both RMs got 
 into the standby state and neither became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1751) Improve MiniYarnCluster for log aggregation testing

2014-05-12 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated YARN-1751:
--

Attachment: YARN-1751.patch

Thanks, Jason. Here is the patch for MiniYarnCluster.

I have opened https://issues.apache.org/jira/browse/YARN-2050 for the 
LogCLIHelpers issue and will post more comments there.

 Improve MiniYarnCluster for log aggregation testing
 ---

 Key: YARN-1751
 URL: https://issues.apache.org/jira/browse/YARN-1751
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Ming Ma
Assignee: Ming Ma
 Attachments: YARN-1751-trunk.patch, YARN-1751.patch


 MiniYarnCluster specifies an individual remote log aggregation root dir for each 
 NM. Test code that uses MiniYarnCluster won't be able to get the value of the log 
 aggregation root dir. The following code isn't necessary in MiniYarnCluster.
   File remoteLogDir =
       new File(testWorkDir, MiniYARNCluster.this.getName()
           + "-remoteLogDir-nm-" + index);
   remoteLogDir.mkdir();
   config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
       remoteLogDir.getAbsolutePath());
 In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to the 
 FileContext.getFileContext() call.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1936) Secured timeline client

2014-05-12 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995851#comment-13995851
 ] 

Zhijie Shen commented on YARN-1936:
---

BTW, the patch depends on YARN-2049 to compile.

 Secured timeline client
 ---

 Key: YARN-1936
 URL: https://issues.apache.org/jira/browse/YARN-1936
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-1936.1.patch


 TimelineClient should be able to talk to the timeline server with kerberos 
 authentication or delegation token



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995878#comment-13995878
 ] 

Bikas Saha commented on YARN-556:
-

Folks, please take the container-id discussion to its own JIRA. Spreading it 
across the main JIRA will make it harder to track.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995829#comment-13995829
 ] 

Bikas Saha commented on YARN-556:
-

bq. After the configurable wait-time, the RM starts accepting RPCs from both 
new AMs and already existing AMs.
This is not needed. The AM can be allowed to re-sync after state is recovered 
from the store. Allocations to the AM may not occur until the threshold 
elapses. In fact, we want to re-sync the AMs asap so that they don't give up on 
the RM.

bq. Existing AMs are expected to resync with the RM, which essentially 
translates to register followed by an allocate call
We should keep the option open to use a new API called resync that does exactly 
that. It may help to make this operation atomic.





 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995536#comment-13995536
 ] 

Tsuyoshi OZAWA commented on YARN-2001:
--

Bikas and Karthik, thanks for sharing. I'll check YARN-667.

 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He

 After failover, the RM may require a certain threshold to determine whether it's 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. the RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does the RM 
 accept new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-12 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995796#comment-13995796
 ] 

Xuan Gong commented on YARN-1861:
-

bq. Can we make this explicit, instead of being an NPE? Like doing a client 
call to find the current active RM or something like that?

Yes, we can do that. DONE

bq. That is what I was thinking, but I am concerned about locking etc. This 
code has become a little convoluted. Per Xuan, we seem to be safe for now, so 
maybe look at this separately?

Yes. But I will make a note about it. 


 Both RM stuck in standby mode when automatic failover is enabled
 

 Key: YARN-1861
 URL: https://issues.apache.org/jira/browse/YARN-1861
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
 YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch


 In our HA tests we noticed that the tests got stuck because both RMs got 
 into the standby state and neither became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995743#comment-13995743
 ] 

Karthik Kambatla commented on YARN-1861:


bq. Also, we need to make sure that when automatic failover is enabled, all 
external interventions like a fence like this bug (and forced-manual failover 
from CLI?) do a similar reset into the leader election. There may not be cases 
like this today though.
One way to future-proof this is to call resetLeaderElection in 
ResourceManager#transitionToStandby itself. That looks hacky, but doesn't 
require new external interventions to explicitly handle it. [~vinodkv] - do you 
think that would be a better approach?
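As a rough sketch of that idea (the method and field names below are taken from 
this discussion and may not match the actual RM code):
{code}
// Sketch only: inside ResourceManager#transitionToStandby, after the active
// services have been stopped, rejoin leader election so the RM does not get
// stuck in standby when automatic failover is enabled.
synchronized void transitionToStandby(boolean initialize) throws Exception {
  // ... existing logic that moves active services to standby ...
  if (rmContext.isHAEnabled() && elector != null) {
    elector.resetLeaderElection();  // hypothetical helper on the embedded elector
  }
}
{code}
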

 Both RM stuck in standby mode when automatic failover is enabled
 

 Key: YARN-1861
 URL: https://issues.apache.org/jira/browse/YARN-1861
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
 YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch


 In our HA tests we noticed that the tests got stuck because both RMs got 
 into the standby state and neither became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995760#comment-13995760
 ] 

Vinod Kumar Vavilapalli commented on YARN-1861:
---

bq. Without the core code change, this testcase will fail. Because NM is trying 
to connect the active RM, but neither of two RMs are active. So, the NPE is 
expected.
Can we make this explicit, instead of being an NPE? Like doing a client call to 
find the current active RM or something like that?

Tx for the explanation of all the cases, Xuan.

bq. That looks hacky, but doesn't require new external interventions to 
explicitly handle it. Vinod Kumar Vavilapalli - do you think that would be a 
better approach?
That is what I was thinking, but I am concerned about locking etc. This code 
has become a little convoluted. Per Xuan, we seem to be safe for now, so maybe 
look at this separately?

 Both RM stuck in standby mode when automatic failover is enabled
 

 Key: YARN-1861
 URL: https://issues.apache.org/jira/browse/YARN-1861
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
 YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch


 In our HA tests we noticed that the tests got stuck because both RMs got 
 into the standby state and neither became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-1227) Update Single Cluster doc to use yarn.resourcemanager.hostname

2014-05-12 Thread Akira AJISAKA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira AJISAKA resolved YARN-1227.
-

Resolution: Invalid

Closing this issue. Feel free to reopen if you disagree.

 Update Single Cluster doc to use yarn.resourcemanager.hostname
 --

 Key: YARN-1227
 URL: https://issues.apache.org/jira/browse/YARN-1227
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.1.0-beta
Reporter: Sandy Ryza
Assignee: Ray Chiang
  Labels: newbie

 Now that yarn.resourcemanager.hostname can be used in place of 
 yarn.resourcemanager.address, yarn.resourcemanager.scheduler.address, etc., 
 we should update the doc to use it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1936) Secured timeline client

2014-05-12 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1936:
--

Issue Type: Sub-task  (was: Bug)
Parent: YARN-1935

 Secured timeline client
 ---

 Key: YARN-1936
 URL: https://issues.apache.org/jira/browse/YARN-1936
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2012) Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute

2014-05-12 Thread Ashwin Shankar (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Shankar reassigned YARN-2012:


Assignee: Ashwin Shankar

 Fair Scheduler : Default rule in queue placement policy can take a queue as 
 an optional attribute
 -

 Key: YARN-2012
 URL: https://issues.apache.org/jira/browse/YARN-2012
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
  Labels: scheduler
 Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt


 Currently the 'default' rule in the queue placement policy, if applied, puts the app in 
 the root.default queue. It would be great if we could make the 'default' rule 
 optionally point to a different queue as the default queue. This queue should be 
 an existing queue; if not, we fall back to the root.default queue, hence keeping 
 this rule terminal.
 This default queue can be a leaf queue, or it can also be a parent queue if 
 the 'default' rule is nested inside a nestedUserQueue rule (YARN-1864).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext

2014-05-12 Thread Ming Ma (JIRA)
Ming Ma created YARN-2050:
-

 Summary: Fix LogCLIHelpers to create the correct FileContext
 Key: YARN-2050
 URL: https://issues.apache.org/jira/browse/YARN-2050
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ming Ma
Assignee: Ming Ma


LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus 
the FileContext created isn't necessarily the FileContext for the remote log directory.
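A minimal sketch of the kind of change implied here (illustrative only; the actual 
patch may differ):
{code}
// Before: the default FileContext ignores the configuration passed to the CLI,
// so the remote log path may resolve against the wrong file system.
// FileContext fc = FileContext.getFileContext();

// After: build the FileContext from the conf so the remote log dir resolves
// against the configured file system (e.g. HDFS).
FileContext fc = FileContext.getFileContext(conf);
{code}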



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext

2014-05-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995780#comment-13995780
 ] 

Hadoop QA commented on YARN-2050:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12644494/YARN-2050.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3739//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3739//console

This message is automatically generated.

 Fix LogCLIHelpers to create the correct FileContext
 ---

 Key: YARN-2050
 URL: https://issues.apache.org/jira/browse/YARN-2050
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ming Ma
Assignee: Ming Ma
 Attachments: YARN-2050.patch


 LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus 
 the FileContext created isn't necessarily the FileContext for the remote log directory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995781#comment-13995781
 ] 

Karthik Kambatla commented on YARN-1861:


bq. That is what I was thinking, but I am concerned about locking etc. This 
code has become a little convoluted.
Agree. I did consider going that route, but was worried about the 
maintainability. 

 Both RM stuck in standby mode when automatic failover is enabled
 

 Key: YARN-1861
 URL: https://issues.apache.org/jira/browse/YARN-1861
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
 YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch


 In our HA tests we noticed that the tests got stuck because both RMs got 
 into the standby state and neither became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC

2014-05-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995569#comment-13995569
 ] 

Hadoop QA commented on YARN-1515:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12644314/YARN-1515.v07.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3734//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/3734//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3734//console

This message is automatically generated.

 Ability to dump the container threads and stop the containers in a single RPC
 -

 Key: YARN-1515
 URL: https://issues.apache.org/jira/browse/YARN-1515
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, 
 YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, 
 YARN-1515.v06.patch, YARN-1515.v07.patch


 This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for 
 timed-out task attempts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-12 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995782#comment-13995782
 ] 

Wangda Tan commented on YARN-2001:
--

[~ozawa], as Bikas said, we should keep at least the early-reported containers 
from being killed by giving NMs a threshold window to resync. This is why we do 
work-preserving restart.



 Threshold for RM to accept requests from AM after failover
 --

 Key: YARN-2001
 URL: https://issues.apache.org/jira/browse/YARN-2001
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He

 After failover, the RM may require a certain threshold to determine whether it's 
 safe to make scheduling decisions and start accepting new container requests 
 from AMs. The threshold could be a certain number of nodes, i.e. the RM waits 
 until a certain number of nodes have joined before accepting new container 
 requests. Or it could simply be a timeout; only after the timeout does the RM 
 accept new requests. 
 NMs that join after the threshold can be treated as new NMs and instructed to 
 kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1969) Fair Scheduler: Add policy for Earliest Deadline First

2014-05-12 Thread Maysam Yabandeh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995993#comment-13995993
 ] 

Maysam Yabandeh commented on YARN-1969:
---

Thanks for the comment, [~kasha]. I think this is a good point to distinguish 
between the terms "deadline" and "endtime".

"deadline" would be the user-specified SLA and, as you correctly mentioned, in 
many cases it is quite likely to be missed due to failures, limited resources, 
etc. Still, the user can express the level of urgency through the desired deadline, 
but they could also do that via priorities, so the user-specified deadline 
would be a complementary (and perhaps more expressive) way for users to specify 
the priorities of their jobs.

"endtime", on the other hand, is the estimated end time of the job based on its 
current progress, assuming that the RM will give it the rest of the required 
resources immediately. The endtime is computed automatically by the AppMaster, so 
there is no need for user involvement. When scheduling resources, the advantage 
of taking endtime into consideration is that giant jobs that are close to 
being finished can be prioritized. In general we want such jobs to finish 
sooner since (i) they would release the resources they have occupied, such 
as the disk space for the mappers' output, and (ii) a large job is more susceptible 
to failures, and the longer it hangs around, the higher the likelihood 
of it being affected by the loss of a mapper node.

The added subtasks follow the agenda of (i) estimating the end time, (ii) 
sending it over to the RM, and (iii) letting the RM take it into consideration. We 
can also extend the API to allow users to specify their desired deadline. As for 
how the RM takes the specified deadline or estimated endtime into consideration, 
I think once we have the endtime field available in the RM, there will be many new 
opportunities to take advantage of it. One way, as you mentioned, is to translate 
them into weights to be used by the current fair scheduler. Any other 
scheduling algorithm, including EDF, can also be plugged in and do the 
scheduling based on a function of the endtime and other variables. The other 
variables could include the size of the job, as discussed above.
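To make the EDF idea concrete, here is a purely illustrative comparator over a 
hypothetical per-app view; the real FairScheduler Schedulable/SchedulingPolicy 
APIs differ, and getEstimatedEndTime is an assumed accessor for the endtime 
discussed above:
{code}
/** Hypothetical per-app view used only for this sketch. */
interface AppView {
  long getEstimatedEndTime();  // estimated by the AM, as described above
  long getStartTime();
}

/** Orders apps by earliest estimated end time; start time breaks ties. */
class EarliestEndTimeComparator implements java.util.Comparator<AppView> {
  @Override
  public int compare(AppView a, AppView b) {
    int cmp = Long.compare(a.getEstimatedEndTime(), b.getEstimatedEndTime());
    return cmp != 0 ? cmp : Long.compare(a.getStartTime(), b.getStartTime());
  }
}
{code}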

 Fair Scheduler: Add policy for Earliest Deadline First
 --

 Key: YARN-1969
 URL: https://issues.apache.org/jira/browse/YARN-1969
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Maysam Yabandeh
Assignee: Maysam Yabandeh

 What we are observing is that some big jobs with many allocated containers 
 are waiting for a few containers to finish. Under *fair-share scheduling* 
 however they have a low priority since there are other jobs (usually much 
 smaller newcomers) that are using resources way below their fair share, 
 hence newly released containers are not offered to the big, yet 
 close-to-be-finished job. Nevertheless, everybody would benefit from an 
 unfair scheduling that offers the resource to the big job since the sooner 
 the big job finishes, the sooner it releases its many allocated resources 
 to be used by other jobs. In other words, what we require is a kind of 
 variation of *Earliest Deadline First scheduling* that takes into account 
 the number of already-allocated resources and estimated time to finish.
 http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling
 For example, if a job is using MEM GB of memory and is expected to finish in 
 TIME minutes, the priority in scheduling would be a function p of (MEM, 
 TIME). The expected time to finish can be estimated by the AppMaster using 
 TaskRuntimeEstimator#estimatedRuntime and be supplied to RM in the resource 
 request messages. To be less susceptible to the issue of apps gaming the 
 system, we can have this scheduling limited to *only within a queue*: i.e., 
 adding an EarliestDeadlinePolicy extends SchedulingPolicy and letting the queues 
 use it by setting the schedulingPolicy field.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored

2014-05-12 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995925#comment-13995925
 ] 

Junping Du commented on YARN-2016:
--

Thanks [~jianhe] for review and comments! Filed YARN-2051 to address more tests 
for PBImpl.

 Yarn getApplicationRequest start time range is not honored
 --

 Key: YARN-2016
 URL: https://issues.apache.org/jira/browse/YARN-2016
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Venkat Ranganathan
Assignee: Junping Du
 Fix For: 2.4.1

 Attachments: YARN-2016.patch, YarnTest.java


 When we query for the previous applications by creating an instance of 
 GetApplicationsRequest and setting the start time range and application tag, 
 we see that the start range provided is not honored and all applications with 
 the tag are returned.
 Attaching a reproducer.
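For reference, the query described above is built roughly like this (a sketch; the 
tag and the time-range values are placeholders):
{code}
GetApplicationsRequest request = GetApplicationsRequest.newInstance();
request.setStartRange(rangeBegin, rangeEnd);                      // start-time range in ms
request.setApplicationTags(Collections.singleton("sample-tag"));  // application tag filter
// The bug: the start-time range above was dropped on the wire, so the RM
// returned every application carrying the tag.
{code}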



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored

2014-05-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995981#comment-13995981
 ] 

Hudson commented on YARN-2016:
--

FAILURE: Integrated in Hadoop-trunk-Commit #5604 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5604/])
YARN-2016. Fix a bug in GetApplicationsRequestPBImpl to add the missed fields 
to proto. Contributed by Junping Du (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1594085)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/GetApplicationsRequestPBImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestGetApplicationsRequest.java


 Yarn getApplicationRequest start time range is not honored
 --

 Key: YARN-2016
 URL: https://issues.apache.org/jira/browse/YARN-2016
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Venkat Ranganathan
Assignee: Junping Du
 Fix For: 2.4.1

 Attachments: YARN-2016.patch, YarnTest.java


 When we query for the previous applications by creating an instance of 
 GetApplicationsRequest and setting the start time range and application tag, 
 we see that the start range provided is not honored and all applications with 
 the tag are returned.
 Attaching a reproducer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2049) Delegation token stuff for the timeline sever

2014-05-12 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2049:
--

Attachment: YARN-2049.1.patch

In this patch, I implemented the delegation token service over HTTP by leveraging 
the hadoop-auth modules, and I largely followed the design of the delegation 
token service of HttpFS.

1. Add the TimelineDelegationTokenIdentifier and its secret manager in the usual 
way (a sketch of the identifier follows below).
2. Extend KerberosAuthenticationFilter and KerberosAuthenticationHandler to 
accept authentication based on either the Kerberos principal or the delegation 
token.
3. Extend KerberosAuthenticator to encapsulate DT-based communication, and add 
APIs to get/renew/cancel the DT.
4. Modify the web stack to enable SPNEGO for the timeline server, and make the 
secret manager service callable from the filter.
5. Fix the test cases accordingly.

This patch only compiles on top of YARN-1938 and HADOOP-10596.
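As a rough sketch of what item 1 could look like, following the usual Hadoop 
delegation-token pattern (the kind name and exact shape here are assumptions, not 
necessarily what the attached patch does):
{code}
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier;

public class TimelineDelegationTokenIdentifier extends AbstractDelegationTokenIdentifier {

  // The kind name is illustrative only.
  public static final Text KIND_NAME = new Text("TIMELINE_DELEGATION_TOKEN");

  public TimelineDelegationTokenIdentifier() {
  }

  public TimelineDelegationTokenIdentifier(Text owner, Text renewer, Text realUser) {
    super(owner, renewer, realUser);
  }

  @Override
  public Text getKind() {
    return KIND_NAME;
  }
}
{code}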

 Delegation token stuff for the timeline sever
 -

 Key: YARN-2049
 URL: https://issues.apache.org/jira/browse/YARN-2049
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2049.1.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-12 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-1861:


Attachment: YARN-1861.7.patch

 Both RM stuck in standby mode when automatic failover is enabled
 

 Key: YARN-1861
 URL: https://issues.apache.org/jira/browse/YARN-1861
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
 YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch


 In our HA tests we noticed that the tests got stuck because both RMs got 
 into the standby state and neither became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1936) Secured timeline client

2014-05-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995919#comment-13995919
 ] 

Hadoop QA commented on YARN-1936:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12644522/YARN-1936.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3742//console

This message is automatically generated.

 Secured timeline client
 ---

 Key: YARN-1936
 URL: https://issues.apache.org/jira/browse/YARN-1936
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-1936.1.patch


 TimelineClient should be able to talk to the timeline server with kerberos 
 authentication or delegation token



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1751) Improve MiniYarnCluster for log aggregation testing

2014-05-12 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated YARN-1751:
--

Summary: Improve MiniYarnCluster for log aggregation testing  (was: Improve 
MiniYarnCluster and LogCLIHelpers for log aggregation testing)

 Improve MiniYarnCluster for log aggregation testing
 ---

 Key: YARN-1751
 URL: https://issues.apache.org/jira/browse/YARN-1751
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Ming Ma
Assignee: Ming Ma
 Attachments: YARN-1751-trunk.patch


 MiniYarnCluster specifies an individual remote log aggregation root dir for each 
 NM. Test code that uses MiniYarnCluster won't be able to get the value of the log 
 aggregation root dir. The following code isn't necessary in MiniYarnCluster.
   File remoteLogDir =
       new File(testWorkDir, MiniYARNCluster.this.getName()
           + "-remoteLogDir-nm-" + index);
   remoteLogDir.mkdir();
   config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
       remoteLogDir.getAbsolutePath());
 In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to the 
 FileContext.getFileContext() call.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1937) Access control of per-framework data

2014-05-12 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1937:
--

Issue Type: Bug  (was: Sub-task)
Parent: (was: YARN-1530)

 Access control of per-framework data
 

 Key: YARN-1937
 URL: https://issues.apache.org/jira/browse/YARN-1937
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Zhijie Shen





--
This message was sent by Atlassian JIRA
(v6.2#6252)

