[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994837#comment-13994837 ] Karthik Kambatla commented on YARN-1474:
# Correct me if I am wrong, but the changes to AllocationFileLoaderService look unrelated. Can we do them in a different JIRA?
# Nothing to do with this patch, but maybe we can add spaces between each interface ResourceSchedulerWrapper implements? {code} public class ResourceSchedulerWrapper extends AbstractYarnScheduler implements SchedulerWrapper,ResourceScheduler,Configurable { {code}
# Correct me if I am wrong, but we need to set the rmContext only once. Can we update the comment to say it needs to be called immediately after instantiating a scheduler?
# Do we need the changes to {{reinitialize()}} implementations in each scheduler? Also, I don't think we need a separate serviceInitInternal. Why not just have serviceInit call reinitialize?
# FairScheduler: we can do without these variables. {code} private volatile boolean isUpdateThreadRunning = false; private volatile boolean isSchedulingThreadRunning = false; {code}
# FairScheduler: serviceStartInternal and serviceStopInternal are fairly small methods - do we need these separate methods?
# Can we call join(timeout) after interrupt, maybe using a constant THREAD_JOIN_TIMEOUT = 1000? Also, set updateThread to null after join. {code} if (updateThread != null) { updateThread.interrupt(); } {code}
# Check whether schedulingThread is null? Also, set the thread to null after join(). {code} if (continuousSchedulingEnabled) { isSchedulingThreadRunning = false; schedulingThread.interrupt(); } {code}
Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
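As an illustration of the last two review items above, here is a minimal, self-contained sketch of the interrupt, join(timeout), and null-out pattern being suggested. The service and thread names are invented for the example and this is not the actual FairScheduler patch.
{code:java}
import org.apache.hadoop.service.AbstractService;

// Illustrative only: a service that owns a background thread and stops it
// with interrupt + bounded join, as suggested in the review comments above.
public class UpdateThreadService extends AbstractService {
  private static final long THREAD_JOIN_TIMEOUT = 1000; // ms

  private volatile Thread updateThread;

  public UpdateThreadService() {
    super("UpdateThreadService");
  }

  @Override
  protected void serviceStart() throws Exception {
    updateThread = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        // periodic work would go here
        try {
          Thread.sleep(500);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // restore interrupt flag and exit the loop
        }
      }
    }, "UpdateThread");
    updateThread.start();
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    Thread t = updateThread;
    if (t != null) {
      t.interrupt();
      t.join(THREAD_JOIN_TIMEOUT); // bounded wait so stop() cannot hang forever
      updateThread = null;         // drop the reference after join
    }
    super.serviceStop();
  }
}
{code}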
[jira] [Resolved] (YARN-2044) thrift interface for YARN?
[ https://issues.apache.org/jira/browse/YARN-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-2044. --- Resolution: Not a Problem We have protocol buffer based interfaces that you can look at. Also, please post questions on the mailing lists instead of opening tickets on the issue-tracker. Thanks. thrift interface for YARN? -- Key: YARN-2044 URL: https://issues.apache.org/jira/browse/YARN-2044 Project: Hadoop YARN Issue Type: Bug Reporter: Nikhil Mulley Hi, I was searching for the thrift interface definitions for YARN but could not come across any. Is there any plan to have a thrift interface to YARN ? If there is already one, could some one please redirect me to the appropriate place? thanks, Nikhil -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994865#comment-13994865 ] Vinod Kumar Vavilapalli commented on YARN-1515: --- bq. Vinod Kumar Vavilapalli, I am interested in your feedback in the context of your comment on MAPREDUCE-5044 . Yeah, sorry. This was in my blind spot. I understand this patch was online for a while, and is likely also being run in production, but I have some comments. As I mentioned on MAPREDUCE-5044, this feature is and should be done via YARN-445. Dumping threads is strictly a java construct and so far we have avoided any language feature in the YARN APIs (not willingly anyways). Can we instead implement this feature using YARN-445 and getting clients/AMs to send a SIGQUIT signal or some such signal command instead? I looked at the patch and that is indeed what it is doing eventually in the NM. We need to keep the API clean. Thoughts? Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: New Feature Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1515: -- Issue Type: Sub-task (was: New Feature) Parent: YARN-445 Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: Sub-task Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994878#comment-13994878 ] Vinod Kumar Vavilapalli commented on YARN-445: -- Folks, I just made YARN-1515 a sub-tasks of this. This JIRA is today focusing on exposing a signalling interface on the ResourceManager. It seems like we can simply expose the same API as part of ContainerManagement and get most of the thread-dump functionality with minimal changes. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Task Components: nodemanager Reporter: Jason Lowe Assignee: Andrey Klochkov Attachments: MRJob.png, MRTasks.png, YARN-445--n2.patch, YARN-445--n3.patch, YARN-445--n4.patch, YARN-445-signal-container-via-rm.patch, YARN-445.patch, YARNContainers.png It would be nice if an ApplicationMaster could send signals to contaniers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2017: -- Summary: Merge some of the common lib code in schedulers (was: Merge common code in schedulers) Edited the title to reflect what is being actually done. Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He A bunch of same code is repeated among schedulers, e.g: between FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1803) Signal container support in nodemanager
[ https://issues.apache.org/jira/browse/YARN-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994879#comment-13994879 ] Vinod Kumar Vavilapalli commented on YARN-1803: --- Tx for working on this Ming. A few comments, in line with [my comment on YARN-445|https://issues.apache.org/jira/browse/YARN-445?focusedCommentId=13994878&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13994878] about combining this functionality with the thread-dump feature,
- We need to consolidate the stopContainer* APIs into the signalContainer APIs. Logically, they are a subset of signalling.
- To make that happen, we will need bulk signalling APIs to signal multiple containers simultaneously.
- One other requirement as part of that is to be able to send an ordered list of signals, so that the NM can, for example, do things like sigterm+sigkill or thread-dump+sigterm+sigkill etc.
- SignalContainerCommand defines a bunch of commands that aren't going to be implemented today - let's only add those that are required and are going to be implemented as part of this set of patches.
Still navigating the entire arena w.r.t. the signalling work being done across several JIRAs. Signal container support in nodemanager --- Key: YARN-1803 URL: https://issues.apache.org/jira/browse/YARN-1803 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1803.patch It could include the following. 1. ContainerManager is able to process a new event type ContainerManagerEventType.SIGNAL_CONTAINERS coming from NodeStatusUpdater and deliver the request to ContainerExecutor. 2. Translate the platform independent signal command to Linux specific signals. Windows support will be tracked by another task. -- This message was sent by Atlassian JIRA (v6.2#6252)
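A rough, purely illustrative sketch of what such a bulk, ordered signalling request could look like. All class, enum, and field names here are invented for the example; this is not the API from any of the patches under discussion.
{code:java}
import java.util.List;

// Illustration only: one request that signals several containers with an
// ordered list of signals (e.g. SIGQUIT for a thread dump, then SIGTERM,
// then SIGKILL), each optionally delayed. Names here are hypothetical.
public final class SignalContainersRequestSketch {

  public enum Signal { SIGQUIT, SIGTERM, SIGKILL }

  /** One step in the ordered signal sequence. */
  public static final class SignalStep {
    public final Signal signal;
    public final long delayMillis; // wait this long before sending the signal

    public SignalStep(Signal signal, long delayMillis) {
      this.signal = signal;
      this.delayMillis = delayMillis;
    }
  }

  private final List<String> containerIds;  // containers to signal in bulk
  private final List<SignalStep> sequence;  // ordered signals applied to each

  public SignalContainersRequestSketch(List<String> containerIds,
                                       List<SignalStep> sequence) {
    this.containerIds = containerIds;
    this.sequence = sequence;
  }

  public List<String> getContainerIds() { return containerIds; }
  public List<SignalStep> getSequence() { return sequence; }
}
{code}
Under this shape of API, stopping containers is simply the special case of a sequence that ends in SIGKILL.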
[jira] [Assigned] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli reassigned YARN-1368: - Assignee: Jian He (was: Anubhav Dhoot) Assigning to Jian as he started putting up patches.. [~adhoot]/[~wangda], please help with reviews. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith reassigned YARN-1366: Assignee: Rohith ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994920#comment-13994920 ] Rohith commented on YARN-1366: -- Thank you for offering! I was just waiting for Anubhav Dhoot to finish the prototype. I'll assign it to myself :-) ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Attachments: YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-796: --- Assignee: Wangda Tan (was: Arun C Murthy) Working on this JIRA, assigned it to myself. And will post a design doc in a day or two. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994891#comment-13994891 ] Vinod Kumar Vavilapalli commented on YARN-1366: --- [~rohithsharma], are you interested in taking this patch further? If so, assign it to yourselves and [~adhoot] can provide review comments and help. Otherwise, he will take it over from what I can see. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Attachments: YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994893#comment-13994893 ] Vinod Kumar Vavilapalli commented on YARN-556: -- Also, if there is a general agreement on how patches should go in which order, please create that ordering through JIRA dependencies. Thanks. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar reassigned YARN-2026: Assignee: Ashwin Shankar Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler While using hierarchical queues in the fair scheduler, there are a few scenarios where we have seen a leaf queue with the least fair share take up a majority of the cluster and starve a sibling parent queue which has greater weight/fair share, and preemption doesn't kick in to reclaim resources. The root cause seems to be that the fair share of a parent queue is distributed to all its children irrespective of whether they are active or inactive (no apps running) queues. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and it has demands greater than that. When there are many queues under a parent queue (with high fair share), each child queue's fair share becomes really low. As a result, when only a few of these child queues have apps running, they reach their *tiny* fair share quickly and preemption doesn't happen even if other leaf queues (non-sibling) are hogging the cluster. This can be solved by dividing the fair share of a parent queue only among its active child queues. Here is an example describing the problem and proposed solution:
root.lowPriorityQueue is a leaf queue with weight 2
root.HighPriorityQueue is a parent queue with weight 8
root.HighPriorityQueue has 10 child leaf queues: root.HighPriorityQueue.childQ(1..10)
The above config results in root.HighPriorityQueue having an 80% fair share, and each of its ten child queues would have an 8% fair share. Preemption would happen only if a child queue's usage falls below 4% (0.5*8=4). Let's say at the moment no apps are running in any of root.HighPriorityQueue.childQ(1..10) and a few apps are running in root.lowPriorityQueue, which is taking up 95% of the cluster. Up to this point, the behavior of FS is correct. Now let's say root.HighPriorityQueue.childQ1 gets a big job which requires 30% of the cluster. It would get only the available 5% of the cluster, and preemption wouldn't kick in since it's above 4% (half its fair share). This is bad considering childQ1 is under a high-priority parent queue which has *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers, we would see the following allocation on the scheduler page:
*root.lowPriorityQueue = 95%*
*root.HighPriorityQueue.childQ1 = 5%*
This can be solved by distributing a parent's fair share only to active queues. So in the example above, since childQ1 is the only active queue under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds. Also note that a similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 hogs the cluster. childQ2 can take up 95% of the cluster and childQ1 would be stuck at 5% until childQ2 starts relinquishing containers. We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. 40%, which would ensure childQ1 gets up to 40% of resources if needed through preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
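The proposal above amounts to redistributing a parent's fair share across its active children only. A toy sketch of that computation follows, using invented classes and the example numbers from the description (one active child under a parent holding 80%); it is not the FairScheduler implementation.
{code:java}
import java.util.Arrays;
import java.util.List;

// Toy illustration of the proposal: a parent's fair share is split by
// weight across *active* children only.
public class ActiveQueueFairShareSketch {

  public static class Queue {
    final String name;
    final double weight;
    final int runningApps;
    double fairShare; // filled in by computeShares

    Queue(String name, double weight, int runningApps) {
      this.name = name;
      this.weight = weight;
      this.runningApps = runningApps;
    }

    boolean isActive() { return runningApps > 0; }
  }

  /** Split parentShare across active children in proportion to their weights. */
  static void computeShares(double parentShare, List<Queue> children) {
    double activeWeight = 0;
    for (Queue q : children) {
      if (q.isActive()) {
        activeWeight += q.weight;
      }
    }
    for (Queue q : children) {
      q.fairShare = (q.isActive() && activeWeight > 0)
          ? parentShare * q.weight / activeWeight
          : 0;
    }
  }

  public static void main(String[] args) {
    // childQ1 is the only active child of a parent holding 80% of the cluster,
    // so it receives the full 80% instead of a 10-way split (8%).
    List<Queue> children = Arrays.asList(
        new Queue("childQ1", 1, 3),
        new Queue("childQ2", 1, 0),
        new Queue("childQ3", 1, 0));
    computeShares(0.80, children);
    children.forEach(q -> System.out.println(q.name + " -> " + q.fairShare));
  }
}
{code}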
[jira] [Updated] (YARN-1989) Adding shell scripts to launch multiple servers on localhost
[ https://issues.apache.org/jira/browse/YARN-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-1989: --- Attachment: YARN-1989-0.patch attaching patch. local-resourcemanagers-ha.sh starts multiple resourcemanagers in HA mode on localhost. local-nodemanagers.sh starts multiple nodemanagers on localhost. Adding shell scripts to launch multiple servers on localhost Key: YARN-1989 URL: https://issues.apache.org/jira/browse/YARN-1989 Project: Hadoop YARN Issue Type: New Feature Reporter: Masatake Iwasaki Priority: Minor Attachments: YARN-1989-0.patch Adding shell scripts to launch multiple servers on localhost for test and debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1986) After upgrade from 2.2.0 to 2.4.0, NPE on first job start.
[ https://issues.apache.org/jira/browse/YARN-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992950#comment-13992950 ] Tsuyoshi OZAWA commented on YARN-1986: -- +1(non-binding). Let's wait for [~sandyr]'s comment. After upgrade from 2.2.0 to 2.4.0, NPE on first job start. -- Key: YARN-1986 URL: https://issues.apache.org/jira/browse/YARN-1986 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Jon Bringhurst Assignee: Hong Zhiguo Attachments: YARN-1986-2.patch, YARN-1986-3.patch, YARN-1986-testcase.patch, YARN-1986.patch After upgrade from 2.2.0 to 2.4.0, NPE on first job start. After RM was restarted, the job runs without a problem. {noformat} 19:11:13,441 FATAL ResourceManager:600 - Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:462) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:714) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:743) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591) at java.lang.Thread.run(Thread.java:744) 19:11:13,443 INFO ResourceManager:604 - Exiting, bbye.. {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-667) Data persisted in RM should be versioned
[ https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-667: Summary: Data persisted in RM should be versioned (was: Data persisted by YARN daemons should be versioned) Data persisted in RM should be versioned Key: YARN-667 URL: https://issues.apache.org/jira/browse/YARN-667 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.4-alpha Reporter: Siddharth Seth Assignee: Junping Du Includes data persisted for RM restart, NodeManager directory structure and the Aggregated Log Format. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994886#comment-13994886 ] Vinod Kumar Vavilapalli commented on YARN-556: -- Tx for the community update, Karthik. Also, Jian/Abhinav, can you both please file all the known sub-tasks right away and assign to yourselves the ones you are already working on? Other folks like [~ozawa] and [~rohithsharma] have repeatedly expressed interest in working on this feature. It'll be great to find stuff for everyone instead of creating all the tickets and assigning them to the two of you. Thanks. [~ozawa] and [~rohithsharma], let others know what you specifically want to work on, if you have something in mind. bq. 6. clustertimestamp is added to containerId so that containerId after RM restart do not clash with containerId before (as the containerId counter resets to zero in memory) I totally missed this line item. Can you give more detail on what the problem is and what the proposal is? What is done in the prototype patch is a major compatibility issue - I'd like to avoid it if we can. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2045) Data persisted in NM should be versioned
Junping Du created YARN-2045: Summary: Data persisted in NM should be versioned Key: YARN-2045 URL: https://issues.apache.org/jira/browse/YARN-2045 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Junping Du Assignee: Junping Du As a split task from YARN-667, we want to add version info to NM related data, include: - NodeManager local LevelDB state - NodeManager directory structure -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-667) Data persisted in RM should be versioned
[ https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994984#comment-13994984 ] Junping Du commented on YARN-667: - Filed YARN-2045 to address NM part data. Data persisted in RM should be versioned Key: YARN-667 URL: https://issues.apache.org/jira/browse/YARN-667 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.4-alpha Reporter: Siddharth Seth Assignee: Junping Du Includes data persisted for RM restart, NodeManager directory structure and the Aggregated Log Format. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-182) Unnecessary Container killed by the ApplicationMaster message for successful containers
[ https://issues.apache.org/jira/browse/YARN-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994967#comment-13994967 ] Deepak Kumar V commented on YARN-182: - Hello, I am seeing this in the Note section of each map task. Each map task's state is SUCCEEDED. Message: attempt_1399912169384_0001_m_13_0 100.00 SUCCEEDED map datanode-9-281920.slc01.dev.company.com:8042 logs Mon, 12 May 2014 16:33:33 GMT Mon, 12 May 2014 16:34:26 GMT 52sec Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143 Hadoop Details: NameNode 'namenode-284133.slc01.dev.company.com:8020' (active) Started: Tue May 06 16:18:04 GMT-07:00 2014 Version: 2.4.0.2.1.1.0-385, 68ceccf06a4441273e81a5ec856d41fc7e11c792 Compiled: 2014-04-16T21:24Z by jenkins from (no branch) Cluster ID: CID-fb86b3cf-7787-4c67-998f-24f00e43c137 Block Pool ID: BP-1163369527-10.65.216.196-1399412949036 Unnecessary Container killed by the ApplicationMaster message for successful containers - Key: YARN-182 URL: https://issues.apache.org/jira/browse/YARN-182 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.1-alpha Reporter: zhengqiu cai Assignee: Omkar Vinit Joshi Labels: hadoop, usability Attachments: Log.txt I was running wordcount and the resourcemanager web UI showed the status as FINISHED SUCCEEDED, but the log showed Container killed by the ApplicationMaster -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-667) Data persisted in RM should be versioned
[ https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994975#comment-13994975 ] Junping Du commented on YARN-667: - Limit the scope of this JIRA to persistent data in RM - RMStateStore. Will file separate JIRA to address persistent data in NM. Data persisted in RM should be versioned Key: YARN-667 URL: https://issues.apache.org/jira/browse/YARN-667 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.4-alpha Reporter: Siddharth Seth Assignee: Junping Du Includes data persisted for RM restart, NodeManager directory structure and the Aggregated Log Format. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994957#comment-13994957 ] Carlo Curino commented on YARN-2022: Sunil, I am travelling abroad till the 26th (please forgive delays)... I could only skim the patch from a mobile device. It looks reasonable; a concern I have is that we rely on a user-set Priority to choose whether to preempt or not. Unless there are checks in place preventing the user from abusing this value, this is egregiously gameable (set my containers all to AM priority and get away with murder). Also, I thought more about the possible corner cases, after conversations with Chris Douglas and Mayank: we should keep an eye out for the max percentage of resources dedicated to AMs... we should save the AMs from earlier (higher-pri) applications up till the max % of AMs we can allocate in the Queue, and at the very least not protect the AMs past that point. A similar check should be in place for userLimitFactor. Without this, it is entirely possible that a queue is wedged with 100% AMs, or that a user has in its AMs more resources than it deserves (and is systematically skipped, even if the cluster is empty). We have seen some of this in particularly extreme test cases (epsilon-size queues, many apps moved to a queue, etc.). Please share your thoughts on this... Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-2022.1.patch Cluster Size = 16GB [2 NMs] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which have taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps]. Currently in this scenario, job J3 will get killed including its AM. It would be better if the AM could be given the least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when the cluster is free, maps can be allocated to these jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
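A rough sketch of the AM-budget safeguard suggested in the comment above: AM containers are protected from preemption only until the queue's max AM-resource share is exhausted. All names and the memory-only accounting are assumptions for illustration; this is not the ProportionalCapacityPreemptionPolicy code.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustration only: protect AMs up to a budget, then treat them like any
// other preemptable container.
public class AmPreemptionGuardSketch {

  public static class ContainerInfo {
    final String id;
    final boolean isAm;
    final long memoryMb;

    ContainerInfo(String id, boolean isAm, long memoryMb) {
      this.id = id;
      this.isAm = isAm;
      this.memoryMb = memoryMb;
    }
  }

  /**
   * Returns containers that may be preempted. AM containers are skipped while
   * the protected AM total stays within maxAmPercent of queueCapacityMb; once
   * that budget is exhausted, further AMs are no longer protected.
   */
  static List<ContainerInfo> selectPreemptable(List<ContainerInfo> running,
                                               long queueCapacityMb,
                                               double maxAmPercent) {
    long amBudgetMb = (long) (queueCapacityMb * maxAmPercent);
    long protectedAmMb = 0;
    List<ContainerInfo> preemptable = new ArrayList<>();
    for (ContainerInfo c : running) {     // assume earlier/higher-pri apps come first
      if (c.isAm && protectedAmMb + c.memoryMb <= amBudgetMb) {
        protectedAmMb += c.memoryMb;      // still within the AM budget: protect it
      } else {
        preemptable.add(c);               // task container, or AM past the budget
      }
    }
    return preemptable;
  }
}
{code}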
[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994877#comment-13994877 ] Vinod Kumar Vavilapalli commented on YARN-1515: --- YARN-445 BTW is today focusing on exposing a signalling interface on the ResourceManager. It seems like we can simply expose the same API as part of ContainerManagement and get most of this functionality with minimal changes. Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: Sub-task Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994866#comment-13994866 ] Vinod Kumar Vavilapalli commented on YARN-1515: --- We can still implement the 'single RPC' functionality you wanted, by making the signal API take in a list of signals and optional time-intervals in between. Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: Sub-task Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2010: --- Attachment: yarn-2010-2.patch New patch with following changes - # Noticed that RMAppManager#recoverApplication wasn't failing running applications in all the code-paths corresponding to failed recovery. Fixed that and cleaned it up futher. # Changed the config name to be shorter. # Added comments to make sure we document why we are doing what we are doing. RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Rohith Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 
8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.init(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) ... 13 more {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2042) String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp()
Ted Yu created YARN-2042: Summary: String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp() Key: YARN-2042 URL: https://issues.apache.org/jira/browse/YARN-2042 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Priority: Minor {code} if (queueName != null && queueName != "") { {code} queueName.isEmpty() should be used instead of comparing against "" with == -- This message was sent by Atlassian JIRA (v6.2#6252)
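For illustration, the suggested fix amounts to an explicit null-and-empty check rather than a reference comparison against the empty-string literal; the helper name below is made up for the example.
{code:java}
// Illustrative only: the null/empty check expressed without reference
// comparison against "".
static boolean hasQueueName(String queueName) {
  return queueName != null && !queueName.isEmpty();
}
{code}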
[jira] [Updated] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1515: -- Target Version/s: 2.5.0 (was: 2.4.0) Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: Sub-task Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2046) Out of band heartbeats are sent only on container kill and possibly too early
Jason Lowe created YARN-2046: Summary: Out of band heartbeats are sent only on container kill and possibly too early Key: YARN-2046 URL: https://issues.apache.org/jira/browse/YARN-2046 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0, 0.23.10 Reporter: Jason Lowe [~mingma] pointed out in the review discussion for MAPREDUCE-5465 that the NM is currently sending out of band heartbeats only when stopContainer is called. In addition those heartbeats might be sent too early because the container kill event is asynchronously posted then the heartbeat monitor is notified. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995165#comment-13995165 ] Karthik Kambatla commented on YARN-556: --- Oh. Forgot to mention that. [~adhoot] offered to split up the prototype into multiple patches, one for each of the sub-tasks. If I understand right, his prototype covers almost all the sub-tasks already created. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1918) Typo in description and error message for 'yarn.resourcemanager.cluster-id'
[ https://issues.apache.org/jira/browse/YARN-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995173#comment-13995173 ] Tsuyoshi OZAWA commented on YARN-1918: -- Thanks for your contribution, [~analog.sony]. It looks good to me (non-binding). Please wait for a review by committers. Typo in description and error message for 'yarn.resourcemanager.cluster-id' --- Key: YARN-1918 URL: https://issues.apache.org/jira/browse/YARN-1918 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Devaraj K Assignee: Anandha L Ranganathan Priority: Trivial Labels: newbie Attachments: YARN-1918.1.patch 1. In yarn-default.xml
{code:xml}
<property>
  <description>Name of the cluster. In a HA setting, this is used to ensure the RM participates in leader election fo this cluster and ensures it does not affect other clusters</description>
  <name>yarn.resourcemanager.cluster-id</name>
  <!--<value>yarn-cluster</value>-->
</property>
{code}
Here the line 'election fo this cluster and ensures it does not affect' should be replaced with 'election for this cluster and ensures it does not affect'. 2.
{code:xml}
org.apache.hadoop.HadoopIllegalArgumentException: Configuration doesn't specifyyarn.resourcemanager.cluster-id at org.apache.hadoop.yarn.conf.YarnConfiguration.getClusterId(YarnConfiguration.java:1336)
{code}
In the above exception message, a space is missing between the message and the configuration name. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1701) Improve default paths of timeline store and generic history store
[ https://issues.apache.org/jira/browse/YARN-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994348#comment-13994348 ] Hudson commented on YARN-1701: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #560 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/560/]) YARN-1701. Improved default paths of the timeline store and the generic history store. Contributed by Tsuyoshi Ozawa. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593481) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Improve default paths of timeline store and generic history store - Key: YARN-1701 URL: https://issues.apache.org/jira/browse/YARN-1701 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Tsuyoshi OZAWA Fix For: 2.4.1 Attachments: YARN-1701.3.patch, YARN-1701.v01.patch, YARN-1701.v02.patch When I enable AHS via yarn.ahs.enabled, the app history is still not visible in AHS webUI. This is due to NullApplicationHistoryStore as yarn.resourcemanager.history-writer.class. It would be good to have just one key to enable basic functionality. yarn.ahs.fs-history-store.uri uses {code}${hadoop.log.dir}{code}, which is local file system location. However, FileSystemApplicationHistoryStore uses DFS by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1474: - Attachment: YARN-1474.11.patch Added missing file to pass compile. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2027) YARN ignores host-specific resource requests
[ https://issues.apache.org/jira/browse/YARN-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995262#comment-13995262 ] Bikas Saha commented on YARN-2027: -- Was the relaxLocality flag set to false in order to make a hard constraint for the node? Or is the jira stating that even soft locality constraints (where YARN is allowed to relax the locality from node to rack to *) are also not working? Soft locality would need delay scheduling to be enabled and that needs the configs that Sandy mentioned. YARN ignores host-specific resource requests Key: YARN-2027 URL: https://issues.apache.org/jira/browse/YARN-2027 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.0 Environment: RHEL 6.1 YARN 2.4 Reporter: Chris Riccomini YARN appears to be ignoring host-level ContainerRequests. I am creating a container request with code that pretty closely mirrors the DistributedShell code:
{code}
protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) {
  info("Requesting %d container(s) with %dmb of memory" format (containers, memMb))
  val capability = Records.newRecord(classOf[Resource])
  val priority = Records.newRecord(classOf[Priority])
  priority.setPriority(0)
  capability.setMemory(memMb)
  capability.setVirtualCores(cpuCores)
  // Specifying a host in the String[] host parameter here seems to do nothing. Setting relaxLocality to false also doesn't help.
  (0 until containers).foreach(idx => amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority)))
}
{code}
When I run this code with a specific host in the ContainerRequest, YARN does not honor the request. Instead, it puts the container on an arbitrary host. This appears to be true for both the FifoScheduler and the CapacityScheduler. Currently, we are running the CapacityScheduler with the following settings:
{noformat}
<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>1</value>
    <description>Maximum number of applications that can be pending and running.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
    <description>Maximum percent of resources in the cluster which can be used to run application masters i.e. controls number of concurrent running applications.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default</value>
    <description>The queues at the this level (root is the root queue).</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>100</value>
    <description>Samza queue target capacity.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>1</value>
    <description>Default queue user limit a percentage from 0.0 to 1.0.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>100</value>
    <description>The maximum capacity of the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.state</name>
    <value>RUNNING</value>
    <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <value>*</value>
    <description>The ACL of who can submit jobs to the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
    <value>*</value>
    <description>The ACL of who can administer jobs on the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>40</value>
    <description>Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. Typically this should be set to number of nodes in the cluster, By default is setting approximately number of nodes in
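For reference, a node-specific request with a hard locality constraint would look roughly like the sketch below, using the AMRMClient.ContainerRequest constructor that takes a relaxLocality flag. The host name and resource sizes are placeholders, and whether the schedulers honor such a request is exactly what this JIRA questions.
{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// Sketch only: pass the target hosts explicitly and set relaxLocality to
// false so the request is not allowed to fall back to rack or ANY.
public class NodeLocalRequestSketch {
  static void requestOnHost(AMRMClient<ContainerRequest> amClient) {
    Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
    Priority priority = Priority.newInstance(0);
    ContainerRequest req = new ContainerRequest(
        capability,
        new String[] { "somehost.example.com" }, // nodes (placeholder host)
        null,                                    // racks
        priority,
        false);                                  // relaxLocality: hard constraint
    amClient.addContainerRequest(req);
  }
}
{code}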
[jira] [Created] (YARN-2036) Document yarn.resourcemanager.hostname in ClusterSetup
Karthik Kambatla created YARN-2036: -- Summary: Document yarn.resourcemanager.hostname in ClusterSetup Key: YARN-2036 URL: https://issues.apache.org/jira/browse/YARN-2036 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Attachments: YARN2036-01.patch, YARN2036-02.patch ClusterSetup doesn't talk about yarn.resourcemanager.hostname - most people should just be able to use that directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995290#comment-13995290 ] Anubhav Dhoot commented on YARN-2001: - Won't killing the containers on RM restart/failover defeat the purpose of the work-preserving effort? Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2046) Out of band heartbeats are sent only on container kill and possibly too early
[ https://issues.apache.org/jira/browse/YARN-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995147#comment-13995147 ] Jason Lowe commented on YARN-2046: -- We should consider sending out of band heartbeats after a container completes rather than when a container is killed. For a cluster running MapReduce this should be almost equivalent in terms of number of OOB heartbeats sent since the MR AM always kills completed task attempts until MAPREDUCE-5465 is addressed. Out of band heartbeats are sent only on container kill and possibly too early - Key: YARN-2046 URL: https://issues.apache.org/jira/browse/YARN-2046 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe [~mingma] pointed out in the review discussion for MAPREDUCE-5465 that the NM is currently sending out of band heartbeats only when stopContainer is called. In addition those heartbeats might be sent too early because the container kill event is asynchronously posted then the heartbeat monitor is notified. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2039) Better reporting of finished containers to AMs
[ https://issues.apache.org/jira/browse/YARN-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2039. Resolution: Duplicate Thanks for pointing that out, Bikas. Resolving as duplicate. Better reporting of finished containers to AMs -- Key: YARN-2039 URL: https://issues.apache.org/jira/browse/YARN-2039 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Priority: Critical On RM restart, we shouldn't lose information about finished containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995361#comment-13995361 ] Jian He commented on YARN-2016: --- Adding unit tests for all records would be another big effort. Junping, you can open a new jira to discuss about this if needed. Committing this. Yarn getApplicationRequest start time range is not honored -- Key: YARN-2016 URL: https://issues.apache.org/jira/browse/YARN-2016 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Venkat Ranganathan Assignee: Junping Du Attachments: YARN-2016.patch, YarnTest.java When we query for the previous applications by creating an instance of GetApplicationsRequest and setting the start time range and application tag, we see that the start range provided is not honored and all applications with the tag are returned Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995316#comment-13995316 ] Bikas Saha commented on YARN-2001: -- I think the offline discussion agreement was that there would be a threshold for NM's to resync. After that threshold the scheduler would be started. After that the NM's have until the NM heartbeat expire interval to resync. After the NM expiry interval, the NM's are considered lost (consistent with current behavior). Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
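A hypothetical sketch of the timing rule described in the comment above: scheduling resumes only after a resync threshold, and NMs that have not re-registered within the existing expiry interval are treated as lost. The class, fields, and values are invented for illustration and are not RM code or configuration.
{code:java}
import java.util.concurrent.TimeUnit;

// Illustration only: two time windows measured from the moment the RM
// becomes active after failover.
public class ResyncPolicySketch {
  private final long schedulerStartThresholdMs; // wait before scheduling resumes
  private final long nmExpiryIntervalMs;        // existing NM liveness expiry
  private final long becameActiveAtMs;          // when this RM became active

  public ResyncPolicySketch(long schedulerStartThresholdMs,
                            long nmExpiryIntervalMs,
                            long becameActiveAtMs) {
    this.schedulerStartThresholdMs = schedulerStartThresholdMs;
    this.nmExpiryIntervalMs = nmExpiryIntervalMs;
    this.becameActiveAtMs = becameActiveAtMs;
  }

  /** Scheduling (and new AM requests) resume only after the threshold. */
  public boolean canSchedule(long nowMs) {
    return nowMs - becameActiveAtMs >= schedulerStartThresholdMs;
  }

  /** An NM that has not re-registered within the expiry interval is lost. */
  public boolean isNodeLost(long nowMs, boolean reRegistered) {
    return !reRegistered && nowMs - becameActiveAtMs >= nmExpiryIntervalMs;
  }

  public static void main(String[] args) {
    ResyncPolicySketch p = new ResyncPolicySketch(
        TimeUnit.MINUTES.toMillis(2), TimeUnit.MINUTES.toMillis(10), 0L);
    System.out.println(p.canSchedule(TimeUnit.MINUTES.toMillis(3)));       // true
    System.out.println(p.isNodeLost(TimeUnit.MINUTES.toMillis(3), false)); // false
  }
}
{code}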
[jira] [Commented] (YARN-1982) Rename the daemon name to timelineserver
[ https://issues.apache.org/jira/browse/YARN-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995404#comment-13995404 ] Hudson commented on YARN-1982: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-1982. Renamed the daemon name to be TimelineServer instead of History Server and deprecated the old usage. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593748) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn.cmd * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/conf/yarn-env.sh Rename the daemon name to timelineserver Key: YARN-1982 URL: https://issues.apache.org/jira/browse/YARN-1982 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Labels: cli Fix For: 2.5.0 Attachments: YARN-1982.1.patch Nowadays, it's confusing that we call the new component timeline server, but we use {code} yarn historyserver yarn-daemon.sh start historyserver {code} to start the daemon. Before the confusion keeps being propagated, we'd better to modify command line asap. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1962) Timeline server is enabled by default
[ https://issues.apache.org/jira/browse/YARN-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995407#comment-13995407 ] Hudson commented on YARN-1962: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-1962. Changed Timeline Service client configuration to be off by default given the non-readiness of the feature yet. Contributed by Mohammad Kamrul Islam. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593750) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Timeline server is enabled by default - Key: YARN-1962 URL: https://issues.apache.org/jira/browse/YARN-1962 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Mohammad Kamrul Islam Assignee: Mohammad Kamrul Islam Fix For: 2.4.1 Attachments: YARN-1962.1.patch, YARN-1962.2.patch Since Timeline server is not matured and secured yet, enabling it by default might create some confusion. We were playing with 2.4.0 and found a lot of exceptions for distributed shell example related to connection refused error. Btw, we didn't run TS because it is not secured yet. Although it is possible to explicitly turn it off through yarn-site config. In my opinion, this extra change for this new service is not worthy at this point,. This JIRA is to turn it off by default. If there is an agreement, i can put a simple patch about this. {noformat} 14/04/17 23:24:33 ERROR impl.TimelineClientImpl: Failed to get the response from the timeline server. 
com.sun.jersey.api.client.ClientHandlerException: java.net.ConnectException: Connection refused at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingEntities(TimelineClientImpl.java:131) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:104) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1072) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:515) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:281) Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at sun.net.NetworkClient.doConnect(NetworkClient.java:180) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.in14/04/17 23:24:33 ERROR impl.TimelineClientImpl: Failed to get the response from the timeline server. com.sun.jersey.api.client.ClientHandlerException: java.net.ConnectException: Connection refused at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingEntities(TimelineClientImpl.java:131) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:104) at
[jira] [Commented] (YARN-2036) Document yarn.resourcemanager.hostname in ClusterSetup
[ https://issues.apache.org/jira/browse/YARN-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995405#comment-13995405 ] Hudson commented on YARN-2036: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-2036. Document yarn.resourcemanager.hostname in ClusterSetup (Ray Chiang via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593631) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/site/apt/ClusterSetup.apt.vm * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt Document yarn.resourcemanager.hostname in ClusterSetup -- Key: YARN-2036 URL: https://issues.apache.org/jira/browse/YARN-2036 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Fix For: 2.5.0 Attachments: YARN2036-01.patch, YARN2036-02.patch ClusterSetup doesn't talk about yarn.resourcemanager.hostname - most people should just be able to use that directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1975) Used resources shows escaped html in CapacityScheduler and FairScheduler page
[ https://issues.apache.org/jira/browse/YARN-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995408#comment-13995408 ] Hudson commented on YARN-1975: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-1975. Fix yarn application CLI to print the scheme of the tracking url of failed/killed applications. Contributed by Junping Du (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593874) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java Used resources shows escaped html in CapacityScheduler and FairScheduler page - Key: YARN-1975 URL: https://issues.apache.org/jira/browse/YARN-1975 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 2.4.0 Reporter: Nathan Roberts Assignee: Mit Desai Fix For: 3.0.0, 2.4.1 Attachments: YARN-1975.patch, screenshot-1975.png Used resources displays as amp;lt;memory:, vCores;amp;gt; with capacity scheduler -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2032: Attachment: YARN-2032-branch-2-1.patch Updating patch for branch-2 Thanks, Mayank Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2032-branch-2-1.patch As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995438#comment-13995438 ] Karthik Kambatla commented on YARN-2033: Thanks Vinod. Would like to hear your thoughts on the following: What are the perceived scalability requirements of the history store and timeline store? I would think the timeline store might have to support storing a lot more information than the history store. In that case, one might want to keep them separate? Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Having two different stores isn't amenable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995326#comment-13995326 ] Tsuyoshi OZAWA commented on YARN-2001: -- [~adhoot], you're correct basically. I meant that if epoch gap between RM and NM is too large to handle for RM, it can be killed. It saves memory usage of RM. Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995328#comment-13995328 ] Anubhav Dhoot commented on YARN-556: bq. clustertimestamp is added to containerId so that containerId after RM restart do not clash with containerId before (as the containerId counter resets to zero in memory). The problem is the containerId currently is composed of ApplicationAttemptId + int. The int part comes from a in memory containerIdCounter from AppSchedulingInfo. This gets reset after a RM restart. Without any changes the containerIds for containers allocated after restart would clash with existing containerIds. The prototype proposal is to make it ApplicationAttemptId + uniqueid + int where the uniqueid can be a timestamp set by RM. I feel containerId should be an opaque string that YARN app developers don't take a dependency on. Also if we used protobuf serialization/deserialization rules everywhere we could deal with compatibility changes of different YARN code versions. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
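To illustrate the clash and the proposed fix described above, a rough sketch; the {{newContainerId}} helper and the bit layout are hypothetical, not existing YARN API:
{code}
// Today (simplified): a container id is <ApplicationAttemptId, int counter>,
// and the counter restarts at 0 after an RM restart, so new ids can collide
// with ids handed out before the restart.
//
// Proposal (sketch): fold a uniqueid chosen by the RM at startup (e.g. a
// timestamp/epoch) into the numeric part so post-restart ids stay unique.
long newContainerId(long rmUniqueId, int counter) {
  // hypothetical packing: high bits carry the per-restart unique id,
  // low bits carry the in-memory counter from AppSchedulingInfo
  return (rmUniqueId << 32) | (counter & 0xFFFFFFFFL);
}
{code}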
[jira] [Assigned] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-1913: -- Assignee: Karthik Kambatla With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Karthik Kambatla It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995354#comment-13995354 ] Jian He commented on YARN-1372: --- An alternative would be to make NM remember the current containers in memory until the application is completed. On each re-register NM sends across the whole list of container statuses. Typically, each NM holds tens of containers in memory which shouldn't be much memory overhead, as compared to RM which holds all the active containers in the cluster. This also avoids protocol changes. Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.2#6252)
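A small sketch of the alternative described above, with hypothetical class and method names (not the real NodeStatusUpdater code): the NM keeps completed-container statuses per application and replays all of them when it re-registers.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

// Illustrative NM-side bookkeeping only.
public class CompletedContainerCache {
  private final Map<ApplicationId, List<ContainerStatus>> completedByApp =
      new HashMap<ApplicationId, List<ContainerStatus>>();

  // Remember every completed container until its application finishes.
  public void onContainerCompleted(ApplicationId appId, ContainerStatus status) {
    List<ContainerStatus> list = completedByApp.get(appId);
    if (list == null) {
      list = new ArrayList<ContainerStatus>();
      completedByApp.put(appId, list);
    }
    list.add(status);
  }

  // Sent along with re-registration so a restarted RM can re-report the
  // completed containers to the AMs.
  public List<ContainerStatus> statusesForReRegister() {
    List<ContainerStatus> all = new ArrayList<ContainerStatus>();
    for (List<ContainerStatus> l : completedByApp.values()) {
      all.addAll(l);
    }
    return all;
  }

  // Dropped only once the application itself completes.
  public void onApplicationFinished(ApplicationId appId) {
    completedByApp.remove(appId);
  }
}
{code}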
[jira] [Commented] (YARN-2042) String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp()
[ https://issues.apache.org/jira/browse/YARN-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995485#comment-13995485 ] Hadoop QA commented on YARN-2042: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644357/YARN-2042.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3732//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3732//console This message is automatically generated. String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp() Key: YARN-2042 URL: https://issues.apache.org/jira/browse/YARN-2042 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Chen He Priority: Minor Attachments: YARN-2042.patch {code} if (queueName != null queueName != ) { {code} queueName.isEmpty() should be used instead of comparing against -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995213#comment-13995213 ] Tsuyoshi OZAWA commented on YARN-2001: -- [~leftnoteasy], my idea is creating a ClusterId-space under the epoch (cluster-timestamp) like {{Map<Epoch, List<ClusterID>>}}. * Epoch (saved in ZKRMStateStore and RM's memory), just an integer value. * ClusterID (saved in RM's memory), same as current code. A rough sketch is as follows: * When a new active RM starts up, Epoch in RMStateStore is incremented and RM sets the Epoch. ClusterID is reset to zero. * Heartbeats between NM and RM include Epoch: RM can distinguish old cluster-timestamps from the new one when NM is registered. If the Epoch is older than RM expects, RM can kill the containers via NM. Please correct me if I'm wrong. Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
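A rough illustration of the epoch bookkeeping sketched above; the {{incrementEpoch()}} store call and the class names are assumptions, not existing RMStateStore API:
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only; incrementEpoch() is a hypothetical state-store operation.
public class EpochTracker {

  // Hypothetical stand-in for the real RMStateStore.
  public interface StateStore {
    long incrementEpoch() throws Exception;
  }

  private long epoch;   // persisted, e.g. in ZKRMStateStore
  private final Map<Long, List<Integer>> clusterIdsByEpoch =
      new HashMap<Long, List<Integer>>();

  // On every transition to active: bump the persisted epoch and reset the
  // in-memory ClusterID space under the new epoch.
  public void onBecomeActive(StateStore stateStore) throws Exception {
    epoch = stateStore.incrementEpoch();
    clusterIdsByEpoch.put(epoch, new ArrayList<Integer>());
  }

  // NM heartbeats carry the epoch they registered under; an older epoch means
  // its containers belong to a previous RM instance and can be told to exit.
  public boolean isFromOldEpoch(long nmReportedEpoch) {
    return nmReportedEpoch < epoch;
  }
}
{code}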
[jira] [Commented] (YARN-1969) Fair Scheduler: Add policy for Earliest Deadline First
[ https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995153#comment-13995153 ] Karthik Kambatla commented on YARN-1969: Just stating the obvious: we need to add a way to specify per-job deadlines too. Fair Scheduler: Add policy for Earliest Deadline First -- Key: YARN-1969 URL: https://issues.apache.org/jira/browse/YARN-1969 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling* however they have a low priority since there are other jobs (usually much smaller, newcomers) that are using resources way below their fair share, hence newly released containers are not offered to the big, yet close-to-be-finished job. Nevertheless, everybody would benefit from an unfair scheduling that offers the resource to the big job since the sooner the big job finishes, the sooner it releases its many allocated resources to be used by other jobs. In other words, what we require is a kind of variation of *Earliest Deadline First scheduling* that takes into account the number of already-allocated resources and estimated time to finish. http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the priority in scheduling would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and be supplied to RM in the resource request messages. To be less susceptible to the issue of apps gaming the system, we can have this scheduling limited to *only within a queue*: i.e., adding an EarliestDeadlinePolicy extends SchedulingPolicy and letting the queues use it by setting the schedulingPolicy field. -- This message was sent by Atlassian JIRA (v6.2#6252)
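To make the proposed ordering concrete, a toy comparator for an EDF-style policy; {{SchedulableApp}}, its accessors, and the particular p(MEM, TIME) shown here are illustrative assumptions, standing in for the AM-supplied estimates, not actual FairScheduler API:
{code}
import java.util.Comparator;

// Hypothetical stand-in for a FairScheduler Schedulable.
interface SchedulableApp {
  double getAllocatedMemoryGB();          // MEM
  double getEstimatedMinutesToFinish();   // TIME, estimated by the AM
}

// Toy EDF-style ordering: favor apps that hold a lot of resources and are
// close to finishing, so they release those resources sooner.
class EarliestDeadlineComparator implements Comparator<SchedulableApp> {
  @Override
  public int compare(SchedulableApp a, SchedulableApp b) {
    // smaller score = schedule first; one possible choice of p(MEM, TIME)
    double scoreA = a.getEstimatedMinutesToFinish() / (1.0 + a.getAllocatedMemoryGB());
    double scoreB = b.getEstimatedMinutesToFinish() / (1.0 + b.getAllocatedMemoryGB());
    return Double.compare(scoreA, scoreB);
  }
}
{code}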
[jira] [Commented] (YARN-2011) Fix typo and warning in TestLeafQueue
[ https://issues.apache.org/jira/browse/YARN-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995403#comment-13995403 ] Hudson commented on YARN-2011: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-2011. Fix typo and warning in TestLeafQueue (Contributed by Chen He) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593804) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java Fix typo and warning in TestLeafQueue - Key: YARN-2011 URL: https://issues.apache.org/jira/browse/YARN-2011 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Chen He Assignee: Chen He Priority: Trivial Fix For: 2.5.0 Attachments: YARN-2011-v2.patch, YARN-2011.patch a.assignContainers(clusterResource, node_0); assertEquals(2*GB, a.getUsedResources().getMemory()); assertEquals(2*GB, app_0.getCurrentConsumption().getMemory()); assertEquals(0*GB, app_1.getCurrentConsumption().getMemory()); assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G // Again one to user_0 since he hasn't exceeded user limit yet a.assignContainers(clusterResource, node_0); assertEquals(3*GB, a.getUsedResources().getMemory()); assertEquals(2*GB, app_0.getCurrentConsumption().getMemory()); assertEquals(1*GB, app_1.getCurrentConsumption().getMemory()); assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995516#comment-13995516 ] Bikas Saha commented on YARN-1372: -- what happens when they are 100's of long jobs running 100's of containers per nodes. Do we hold onto info about all those containers that have completed long ago? Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1987) Wrapper for leveldb DBIterator to aid in handling database exceptions
[ https://issues.apache.org/jira/browse/YARN-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995398#comment-13995398 ] Hudson commented on YARN-1987: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-1987. Wrapper for leveldb DBIterator to aid in handling database exceptions. (Jason Lowe via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593757) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/LeveldbIterator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/utils * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/utils/TestLeveldbIterator.java Wrapper for leveldb DBIterator to aid in handling database exceptions - Key: YARN-1987 URL: https://issues.apache.org/jira/browse/YARN-1987 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 2.5.0 Attachments: YARN-1987.patch, YARN-1987v2.patch Per discussions in YARN-1984 and MAPREDUCE-5652, it would be nice to have a utility wrapper around leveldb's DBIterator to translate the raw RuntimeExceptions it can throw into DBExceptions to make it easier to handle database errors while iterating. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995498#comment-13995498 ] Hadoop QA commented on YARN-2016: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644248/YARN-2016.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3733//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3733//console This message is automatically generated. Yarn getApplicationRequest start time range is not honored -- Key: YARN-2016 URL: https://issues.apache.org/jira/browse/YARN-2016 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Venkat Ranganathan Assignee: Junping Du Attachments: YARN-2016.patch, YarnTest.java When we query for the previous applications by creating an instance of GetApplicationsRequest and setting the start time range and application tag, we see that the start range provided is not honored and all applications with the tag are returned Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2048) List all of the containers of an application from the yarn web
[ https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Zhou updated YARN-2048: --- Attachment: YARN-2048-trunk-v1.patch Submit a patch on trunk List all of the containers of an application from the yarn web -- Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Reporter: Min Zhou Attachments: YARN-2048-trunk-v1.patch Currently, YARN doesn't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications have already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is to maintain a container list in RMAppImpl and expose this info on the Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2048) List all of the containers of an application from the yarn web
Min Zhou created YARN-2048: -- Summary: List all of the containers of an application from the yarn web Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Reporter: Min Zhou Currently, YARN doesn't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications have already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is to maintain a container list in RMAppImpl and expose this info on the Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2040) Recover information about finished containers
[ https://issues.apache.org/jira/browse/YARN-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved YARN-2040. -- Resolution: Duplicate This will be covered by YARN-1337. Recover information about finished containers - Key: YARN-2040 URL: https://issues.apache.org/jira/browse/YARN-2040 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla The NM should store and recover information about finished containers as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2042) String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp()
[ https://issues.apache.org/jira/browse/YARN-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995505#comment-13995505 ] Chen He commented on YARN-2042: --- The change in this patch does not need to include a test. String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp() Key: YARN-2042 URL: https://issues.apache.org/jira/browse/YARN-2042 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Chen He Priority: Minor Attachments: YARN-2042.patch {code} if (queueName != null && queueName != "") { {code} queueName.isEmpty() should be used instead of comparing against "" -- This message was sent by Atlassian JIRA (v6.2#6252)
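A minimal sketch of the check YARN-2042 asks for, using value comparison instead of the reference comparison shown in the snippet above; the helper method name is just for illustration:
{code}
// Reference comparison (queueName != "") only detects the exact same String
// object; isEmpty() checks the value.
static boolean hasQueueName(String queueName) {
  return queueName != null && !queueName.isEmpty();
}
{code}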
[jira] [Updated] (YARN-1702) Expose kill app functionality as part of RM web services
[ https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-1702: Attachment: apache-yarn-1702.9.patch New patch with the following fixes - 1. Use right call to get queue and acl managers to check permissions 2. Fix kill api to use PUT on /apps/{appid}/state to kill an app 3. Added documentation for the REST call. Expose kill app functionality as part of RM web services Key: YARN-1702 URL: https://issues.apache.org/jira/browse/YARN-1702 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, apache-yarn-1702.4.patch, apache-yarn-1702.5.patch, apache-yarn-1702.7.patch, apache-yarn-1702.8.patch, apache-yarn-1702.9.patch Expose functionality to kill an app via the ResourceManager web services API. -- This message was sent by Atlassian JIRA (v6.2#6252)
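A hedged example of what the call from item 2 could look like from a client; the host, port, application id, and the standard {{/ws/v1/cluster}} prefix are assumptions made for illustration, not taken from the patch:
{code}
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// PUT /apps/{appid}/state with {"state":"KILLED"} to ask the RM to kill the app.
public class KillAppViaRest {
  public static void main(String[] args) throws Exception {
    URL url = new URL(
        "http://rm-host:8088/ws/v1/cluster/apps/application_1400000000000_0001/state");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    byte[] body = "{\"state\":\"KILLED\"}".getBytes(StandardCharsets.UTF_8);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body);
    }
    System.out.println("HTTP " + conn.getResponseCode());
  }
}
{code}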
[jira] [Updated] (YARN-1938) Kerberos authentication for the timeline server
[ https://issues.apache.org/jira/browse/YARN-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1938: -- Attachment: YARN-1938.2.patch Made some minor touch on the last patch. Kerberos authentication for the timeline server --- Key: YARN-1938 URL: https://issues.apache.org/jira/browse/YARN-1938 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1938.1.patch, YARN-1938.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992962#comment-13992962 ] Tsuyoshi OZAWA commented on YARN-2001: -- {quote} If possible, I think we should avoid changing container Id format. {quote} +1, if possible. Can we add epoch (cluster timestamp) to ResourceTrackerService's state via heartbeat? Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1457) YARN single node install issues on mvn clean install assembly:assembly on mapreduce project
[ https://issues.apache.org/jira/browse/YARN-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992948#comment-13992948 ] Shaun Gittens commented on YARN-1457: - I get virtually the same error when executing : mvn clean install assembly:assembly -DskipTests except I'm running it on a Centos VM compiling Hadoop 2.2.0 and proto 2.5.0 ... YARN single node install issues on mvn clean install assembly:assembly on mapreduce project --- Key: YARN-1457 URL: https://issues.apache.org/jira/browse/YARN-1457 Project: Hadoop YARN Issue Type: Bug Components: site Affects Versions: 2.0.5-alpha Reporter: Rekha Joshi Priority: Minor Labels: mvn Attachments: yarn-mvn-mapreduce.txt YARN single node install - http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html On Mac OSX 10.7.3, Java 1.6, Protobuf 2.5.0 and hadoop-2.0.5-alpha.tar, mvn clean install -DskipTests succeds after a YARN fix on pom.xml(using 2.5.0 protobuf) But on hadoop-mapreduce-project mvn install fails for tests with below errors $ mvn clean install assembly:assembly -Pnative errors as in atatched yarn-mvn-mapreduce,txt On $mvn clean install assembly:assembly -DskipTests Reactor Summary: [INFO] [INFO] hadoop-mapreduce-client ... SUCCESS [2.410s] [INFO] hadoop-mapreduce-client-core .. SUCCESS [13.781s] [INFO] hadoop-mapreduce-client-common SUCCESS [8.486s] [INFO] hadoop-mapreduce-client-shuffle ... SUCCESS [0.774s] [INFO] hadoop-mapreduce-client-app ... SUCCESS [4.409s] [INFO] hadoop-mapreduce-client-hs SUCCESS [1.618s] [INFO] hadoop-mapreduce-client-jobclient . SUCCESS [4.470s] [INFO] hadoop-mapreduce-client-hs-plugins SUCCESS [0.561s] [INFO] Apache Hadoop MapReduce Examples .. SUCCESS [1.620s] [INFO] hadoop-mapreduce .. FAILURE [10.107s] [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 49.606s [INFO] Finished at: Thu Nov 28 16:20:52 GMT+05:30 2013 [INFO] Final Memory: 34M/118M [INFO] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.3:assembly (default-cli) on project hadoop-mapreduce: Error reading assemblies: No assembly descriptors found. - [Help 1] $mvn package -Pdist -DskipTests=true -Dtar works The documentation needs to be updated for possible issues and resolutions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995511#comment-13995511 ] Tsuyoshi OZAWA commented on YARN-556: - If we can break the compatibility about the container id, I think Anubhav's approach has no problem. If we cannot do this as [~jianhe] mentioned on YARN-2001, I think epoch idea [described here|https://issues.apache.org/jira/browse/YARN-2001?focusedCommentId=13995213page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13995213] might be used. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext
[ https://issues.apache.org/jira/browse/YARN-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-2050: -- Attachment: YARN-2050.patch Mapreduce CLI and yarn CLI pass the configuration to LogCLIHelpers. LogCLIHelpers uses the same configuration to create remoteRootLogDir and remoteAppLogDir, etc. in dumpAllContainersLogs. The fix is to use the same configuration to create FileContext. To follow up on [~jlowe]'s comments, 1. remoteAppLogDir.toUri().getScheme() returns null and AbstractFileSystem.createFileSystem doesn't like it if dumpAllContainersLogs calls FileContext.getFileContext(remoteAppLogDir.toUri()). 2. If the caller of LogCLIHelpers doesn't setConf ahead of time, dumpAllContainersLogs will get a null pointer exception when it tries to get remoteRootLogDir. Fix LogCLIHelpers to create the correct FileContext --- Key: YARN-2050 URL: https://issues.apache.org/jira/browse/YARN-2050 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-2050.patch LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus the FileContext created isn't necessarily the FileContext for the remote log. -- This message was sent by Atlassian JIRA (v6.2#6252)
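A simplified illustration of the fix being described: resolve the remote log root and create the FileContext from the same caller-supplied Configuration. This is a sketch, not the actual LogCLIHelpers code; the class and method names are hypothetical.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RemoteLogDirExample {
  public static void printQualifiedLogRoot(Configuration conf) throws Exception {
    Path remoteRootLogDir = new Path(conf.get(
        YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
        YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR));
    // getFileContext(conf) instead of the no-arg variant, so the context and
    // the log-dir paths resolve against the same default file system.
    FileContext fc = FileContext.getFileContext(conf);
    System.out.println(fc.makeQualified(remoteRootLogDir));
  }
}
{code}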
[jira] [Commented] (YARN-2039) Better reporting of finished containers to AMs
[ https://issues.apache.org/jira/browse/YARN-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995241#comment-13995241 ] Bikas Saha commented on YARN-2039: -- Dupe of YARN-1372? Better reporting of finished containers to AMs -- Key: YARN-2039 URL: https://issues.apache.org/jira/browse/YARN-2039 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Priority: Critical On RM restart, we shouldn't lose information about finished containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995243#comment-13995243 ] Karthik Kambatla commented on YARN-2001: I think the epoch idea might work very nicely with the versioning work we plan to do as part of YARN-667. Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-766) TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk
[ https://issues.apache.org/jira/browse/YARN-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995402#comment-13995402 ] Hudson commented on YARN-766: - SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-766. TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk. (Contributed by Siddharth Seth) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593660) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk Key: YARN-766 URL: https://issues.apache.org/jira/browse/YARN-766 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.1.0-beta Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Minor Attachments: YARN-766.branch-2.txt, YARN-766.trunk.txt, YARN-766.txt File scriptFile = new File(tmpDir, scriptFile.sh); should be replaced with File scriptFile = Shell.appendScriptExtension(tmpDir, scriptFile); to match trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995247#comment-13995247 ] Karthik Kambatla commented on YARN-1372: Based on offline discussion with Anubhav, Bikas, Jian and Vinod, the control flow for notifying the AM of finished containers should be as follows: # NM informs RM and holds on to the information (YARN-1336 should handle this as well) # RM informs AM # AM acks RM # RM acks NM # NM deletes the information Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1412) Allocating Containers on a particular Node in Yarn
[ https://issues.apache.org/jira/browse/YARN-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993616#comment-13993616 ] Chris Riccomini commented on YARN-1412: --- Seeing this as well (YARN-2027). Allocating Containers on a particular Node in Yarn -- Key: YARN-1412 URL: https://issues.apache.org/jira/browse/YARN-1412 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: centos, Hadoop 2.2.0 Reporter: gaurav gupta Summary of the problem: If I pass the node on which I want container and set relax locality default which is true, I don't get back the container on the node specified even if the resources are available on the node. It doesn't matter if I set rack or not. Here is the snippet of the code that I am using AMRMClient<ContainerRequest> amRmClient = AMRMClient.createAMRMClient(); String host = "h1"; Resource capability = Records.newRecord(Resource.class); capability.setMemory(memory); nodes = new String[] {host}; // in order to request a host, we also have to request the rack racks = new String[] {"/default-rack"}; List<ContainerRequest> containerRequests = new ArrayList<ContainerRequest>(); List<ContainerId> releasedContainers = new ArrayList<ContainerId>(); containerRequests.add(new ContainerRequest(capability, nodes, racks, Priority.newInstance(priority))); if (containerRequests.size() > 0) { LOG.info("Asking RM for containers: " + containerRequests); for (ContainerRequest cr : containerRequests) { LOG.info("Requested container: {}", cr.toString()); amRmClient.addContainerRequest(cr); } } for (ContainerId containerId : releasedContainers) { LOG.info("Released container, id={}", containerId.getId()); amRmClient.releaseAssignedContainer(containerId); } return amRmClient.allocate(0); -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1936: -- Issue Type: Bug (was: Sub-task) Parent: (was: YARN-1530) Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1936: -- Description: TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995381#comment-13995381 ] Ted Yu commented on YARN-2032: -- {code} + <version>0.98.0-hadoop2</version> {code} 0.98.2 is the latest release for 0.98 Since SingleColumnValueFilter is used, you need to have HBASE-10850 which is in 0.98.2 {code} + protected void serviceInit(Configuration conf) throws Exception { +HBaseAdmin hbase = initHBase(conf); {code} Looks like the HBaseAdmin instance is not closed upon leaving serviceInit(). Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2032-branch-2-1.patch As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.2#6252)
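To illustrate the review comment about closing the admin handle, a minimal sketch; {{initHBase()}} is the hypothetical helper from the patch, and the table-setup body is elided. The point is only that the HBaseAdmin opened in serviceInit() gets closed once setup is done.
{code}
// Sketch only, not the patch itself.
protected void serviceInit(Configuration conf) throws Exception {
  HBaseAdmin hbase = initHBase(conf);   // hypothetical helper from the patch
  try {
    // create/validate the timeline tables here (elided)
  } finally {
    hbase.close();   // release the admin connection even if setup fails
  }
  super.serviceInit(conf);
}
{code}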
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995812#comment-13995812 ] Xuan Gong commented on YARN-1861: - Uploaded a new patch, Explicitly throwing the exception, saying Can not find the active RM, instead of NPE. Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1751) Improve MiniYarnCluster for log aggregation testing
[ https://issues.apache.org/jira/browse/YARN-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-1751: -- Attachment: YARN-1751.patch Thanks, Jason. Here is the patch for MiniYarnCluster. I have opened https://issues.apache.org/jira/browse/YARN-2050 for the LogCLIHelpers issue and will post more comments there. Improve MiniYarnCluster for log aggregation testing --- Key: YARN-1751 URL: https://issues.apache.org/jira/browse/YARN-1751 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1751-trunk.patch, YARN-1751.patch MiniYarnCluster specifies an individual remote log aggregation root dir for each NM. Test code that uses MiniYarnCluster won't be able to get the value of the log aggregation root dir. The following code isn't necessary in MiniYarnCluster. File remoteLogDir = new File(testWorkDir, MiniYARNCluster.this.getName() + "-remoteLogDir-nm-" + index); remoteLogDir.mkdir(); config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR, remoteLogDir.getAbsolutePath()); In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to the FileContext.getFileContext() call. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995851#comment-13995851 ] Zhijie Shen commented on YARN-1936: --- BTW, the patch depends on YARN-2049 for compiling Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1936.1.patch TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995878#comment-13995878 ] Bikas Saha commented on YARN-556: - Folks please take the discussion for container id to its own jira. Spreading it in the main jira will make it harder to track. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995829#comment-13995829 ] Bikas Saha commented on YARN-556: - bq. After the configurable wait-time, the RM starts accepting RPCs from both new AMs and already existing AMs. This is not needed. The AM can be allowed to re-sync after state is recovered from the store. Allocations to the AM may not occur until the threshold elapses. In fact, we want to re-sync the AM's asap so that they dont give up on the RM. bq. Existing AMs are expected to resync with the RM, which essentially translates to register followed by an allocate call We should keep the option open to use a new API called resync that does exactly that. It may help to make this operation atomic RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995536#comment-13995536 ] Tsuyoshi OZAWA commented on YARN-2001: -- Bikas and Karthik, thanks for sharing. I'll check YARN-667. Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain number of nodes, i.e., the RM waits until a certain number of nodes have joined before accepting new container requests. Or it could simply be a timeout; only after the timeout does the RM accept new requests. NMs that join after the threshold can be treated as new NMs and instructed to kill all their containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
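The two threshold variants described in the YARN-2001 summary could be sketched as below. Every name here is hypothetical and only illustrates the idea of gating new scheduling on either a timeout or a node-count; none of these classes exist in YARN.
{code}
// A hypothetical sketch of the proposed thresholds, for illustration only.
public class SchedulingThresholdSketch {
  private final long rmStartTimeMs;
  private final long waitTimeoutMs;        // variant 1: plain timeout
  private final int expectedNodes;         // variant 2: node-count threshold
  private final double minNodeFraction;

  public SchedulingThresholdSketch(long rmStartTimeMs, long waitTimeoutMs,
      int expectedNodes, double minNodeFraction) {
    this.rmStartTimeMs = rmStartTimeMs;
    this.waitTimeoutMs = waitTimeoutMs;
    this.expectedNodes = expectedNodes;
    this.minNodeFraction = minNodeFraction;
  }

  /** True once the RM should start honoring new container requests. */
  public boolean safeToSchedule(int nodesRejoined, long nowMs) {
    boolean timeoutElapsed = nowMs - rmStartTimeMs >= waitTimeoutMs;
    boolean enoughNodes =
        nodesRejoined >= Math.ceil(expectedNodes * minNodeFraction);
    return timeoutElapsed || enoughNodes;
  }
}
{code}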
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995796#comment-13995796 ] Xuan Gong commented on YARN-1861: - bq. Can we make this explicit, instead of being an NPE? Like doing a client call to find the current active RM or something like that? Yes, we can do that. DONE bq. That is what I was thinking, but I am concerned about locking etc. This code has become a little convoluted. Per Xuan, we seem to be safe for now, so maybe look at this separately? Yes. But I will make a note about it. Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995743#comment-13995743 ] Karthik Kambatla commented on YARN-1861: bq. Also, we need to make sure that when automatic failover is enabled, all external interventions like a fence like this bug (and forced-manual failover from CLI?) do a similar reset into the leader election. There may not be cases like this today though. One way to future-proof this is to call resetLeaderElection in ResourceManager#transitionToStandby itself. That looks hacky, but doesn't require new external interventions to explicitly handle it. [~vinodkv] - do you think that would be a better approach? Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
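A rough sketch of the approach floated above (calling resetLeaderElection from transitionToStandby when automatic failover is enabled) is shown below. The elector interface and class here are simplified stand-ins invented for illustration, not the actual ResourceManager or EmbeddedElectorService code.
{code}
// Simplified stand-in for the idea; not the real ResourceManager code.
public class TransitionToStandbySketch {
  interface Elector { void resetLeaderElection(); }

  private final Elector elector;
  private final boolean automaticFailoverEnabled;

  TransitionToStandbySketch(Elector elector, boolean automaticFailoverEnabled) {
    this.elector = elector;
    this.automaticFailoverEnabled = automaticFailoverEnabled;
  }

  synchronized void transitionToStandby(boolean initialize) {
    // ... stop active services, reset the RM context, etc. ...
    if (automaticFailoverEnabled) {
      // Re-enter leader election so an externally forced standby (fencing,
      // manual failover) cannot leave both RMs permanently in standby.
      elector.resetLeaderElection();
    }
  }
}
{code}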
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995760#comment-13995760 ] Vinod Kumar Vavilapalli commented on YARN-1861: --- bq. Without the core code change, this testcase will fail. Because NM is trying to connect the active RM, but neither of two RMs are active. So, the NPE is expected. Can we make this explicit, instead of being an NPE? Like doing a client call to find the current active RM or something like that? Tx for the explanation of all the cases, Xuan. bq. That looks hacky, but doesn't require new external interventions to explicitly handle it. Vinod Kumar Vavilapalli - do you think that would be a better approach? That is what I was thinking, but I am concerned about locking etc. This code has become a little convoluted. Per Xuan, we seem to be safe for now, so maybe look at this separately? Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1227) Update Single Cluster doc to use yarn.resourcemanager.hostname
[ https://issues.apache.org/jira/browse/YARN-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA resolved YARN-1227. - Resolution: Invalid Closing this issue. Feel free to reopen if you disagree. Update Single Cluster doc to use yarn.resourcemanager.hostname -- Key: YARN-1227 URL: https://issues.apache.org/jira/browse/YARN-1227 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.1.0-beta Reporter: Sandy Ryza Assignee: Ray Chiang Labels: newbie Now that yarn.resourcemanager.hostname can be used in place of yarn.resourcemanager.address, yarn.resourcemanager.scheduler.address, etc., we should update the doc to use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1936: -- Issue Type: Sub-task (was: Bug) Parent: YARN-1935 Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2012) Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar reassigned YARN-2012: Assignee: Ashwin Shankar Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute - Key: YARN-2012 URL: https://issues.apache.org/jira/browse/YARN-2012 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt Currently the 'default' rule in the queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue. This queue should be an existing queue; if not, we fall back to the root.default queue, hence keeping this rule terminal. This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext
Ming Ma created YARN-2050: - Summary: Fix LogCLIHelpers to create the correct FileContext Key: YARN-2050 URL: https://issues.apache.org/jira/browse/YARN-2050 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Assignee: Ming Ma LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus the FileContext created isn't necessarily the FileContext for remote log. -- This message was sent by Atlassian JIRA (v6.2#6252)
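To illustrate the direction of the YARN-2050 fix, the sketch below contrasts the parameterless FileContext.getFileContext() with a call that respects the caller's Configuration. The listing loop and class name are made-up stand-ins for illustration, not the actual LogCLIHelpers code.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class RemoteLogFileContextSketch {
  public static void listRemoteAppDir(Configuration conf, Path remoteAppLogDir)
      throws Exception {
    // Problem: FileContext.getFileContext() ignores the caller's conf, so it
    // may resolve against the wrong default filesystem for the remote log dir.
    // Fix direction: hand it the conf (or the path's URI) explicitly.
    FileContext fc = FileContext.getFileContext(remoteAppLogDir.toUri(), conf);
    RemoteIterator<FileStatus> it = fc.listStatus(remoteAppLogDir);
    while (it.hasNext()) {
      System.out.println(it.next().getPath());
    }
  }
}
{code}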
[jira] [Commented] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext
[ https://issues.apache.org/jira/browse/YARN-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995780#comment-13995780 ] Hadoop QA commented on YARN-2050: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644494/YARN-2050.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3739//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3739//console This message is automatically generated. Fix LogCLIHelpers to create the correct FileContext --- Key: YARN-2050 URL: https://issues.apache.org/jira/browse/YARN-2050 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-2050.patch LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus the FileContext created isn't necessarily the FileContext for remote log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995781#comment-13995781 ] Karthik Kambatla commented on YARN-1861: bq. That is what I was thinking, but I am concerned about locking etc. This code has become a little convoluted. Agree. I did consider going that route, but was worried about the maintainability. Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995569#comment-13995569 ] Hadoop QA commented on YARN-1515: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644314/YARN-1515.v07.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3734//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3734//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3734//console This message is automatically generated. Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: Sub-task Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995782#comment-13995782 ] Wangda Tan commented on YARN-2001: -- [~ozawa], as Bikas said, we should keep at least some early-reported containers from being killed, via a threshold that gives NMs time to resync. This is why we do work-preserving restart. Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain number of nodes, i.e., the RM waits until a certain number of nodes have joined before accepting new container requests. Or it could simply be a timeout; only after the timeout does the RM accept new requests. NMs that join after the threshold can be treated as new NMs and instructed to kill all their containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1969) Fair Scheduler: Add policy for Earliest Deadline First
[ https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995993#comment-13995993 ] Maysam Yabandeh commented on YARN-1969: --- Thanks for the comment [~kasha]. I think this is a good point to distinguish between the terms deadline and endtime. deadline would be the user-specified SLA and, as you correctly mentioned, in many cases it is quite likely to be missed due to failures, limited resources, etc. Still, the user can express the level of urgency by the desired deadline, but they could also do that via priorities, so the user-specified deadline would be a complementary (and perhaps more expressive) way for users to specify the priorities of their jobs. endtime, on the other hand, is the estimated end time of the job based on the current progress and assuming that the RM will give the rest of the required resources immediately. endtime is automatically computed by the AppMaster and there is no need for user involvement. When scheduling resources, the advantage of taking endtime into consideration is that giant jobs that are close to being finished could be prioritized. We generally want such jobs to finish sooner since (i) they would release the resources they have occupied, such as the disk space for the mappers' output, and (ii) a large job is more susceptible to failures, and the longer it hangs around, the greater the likelihood of it being affected by the loss of a mapper node. The added subtasks are based on the agenda of (i) estimating the end time, (ii) sending it over to the RM, and (iii) letting the RM take it into consideration. We can also extend the API to allow the users to specify their desired deadline. As for how the RM takes the specified deadline or estimated endtime into consideration, I think once we have the endtime field available in the RM, there will be many new opportunities to take advantage of it. One way, as you mentioned, is to translate them into weights to be used by the current fair scheduler. Any other scheduling algorithm, including EDF, can also be plugged in and do the scheduling based on a function of the endtime and other variables. The other variables could include the size of the job, as discussed above. Fair Scheduler: Add policy for Earliest Deadline First -- Key: YARN-1969 URL: https://issues.apache.org/jira/browse/YARN-1969 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling*, however, they have a low priority since there are other jobs (usually much smaller newcomers) that are using resources way below their fair share; hence newly released containers are not offered to the big, yet close-to-be-finished job. Nevertheless, everybody would benefit from an unfair scheduling that offers the resources to the big job, since the sooner the big job finishes, the sooner it releases its many allocated resources to be used by other jobs. In other words, what we require is a kind of variation of *Earliest Deadline First scheduling* that takes into account the number of already-allocated resources and the estimated time to finish. http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the priority in scheduling would be a function p of (MEM, TIME). 
The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and supplied to the RM in the resource request messages. To be less susceptible to apps gaming the system, we can limit this scheduling to *only within a queue*: i.e., add an EarliestDeadlinePolicy that extends SchedulingPolicy and let the queues use it by setting the schedulingPolicy field. -- This message was sent by Atlassian JIRA (v6.2#6252)
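To make the p(MEM, TIME) idea above concrete, here is a toy comparator an EDF-style policy might use. AppSnapshot and its fields are invented for illustration; the real FairScheduler SchedulingPolicy API is not shown here.
{code}
import java.util.Comparator;

// Toy sketch only: favors apps holding lots of memory that are nearly done.
public class EarliestEndtimeFirstSketch {
  static final class AppSnapshot {
    final long allocatedMemoryMb;     // MEM: resources currently held
    final long estimatedRemainingMs;  // TIME: AM-estimated time to finish
    AppSnapshot(long allocatedMemoryMb, long estimatedRemainingMs) {
      this.allocatedMemoryMb = allocatedMemoryMb;
      this.estimatedRemainingMs = estimatedRemainingMs;
    }
    // p(MEM, TIME): the more memory held and the closer to finishing,
    // the sooner the app should be offered containers.
    double urgency() {
      return (double) allocatedMemoryMb / Math.max(1, estimatedRemainingMs);
    }
  }

  static final Comparator<AppSnapshot> EARLIEST_ENDTIME_FIRST =
      new Comparator<AppSnapshot>() {
        @Override
        public int compare(AppSnapshot a, AppSnapshot b) {
          // Higher urgency schedules first.
          return Double.compare(b.urgency(), a.urgency());
        }
      };
}
{code}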
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995925#comment-13995925 ] Junping Du commented on YARN-2016: -- Thanks [~jianhe] for review and comments! Filed YARN-2051 to address more tests for PBImpl. Yarn getApplicationRequest start time range is not honored -- Key: YARN-2016 URL: https://issues.apache.org/jira/browse/YARN-2016 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Venkat Ranganathan Assignee: Junping Du Fix For: 2.4.1 Attachments: YARN-2016.patch, YarnTest.java When we query for the previous applications by creating an instance of GetApplicationsRequest and setting the start time range and application tag, we see that the start range provided is not honored and all applications with the tag are returned. Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
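A minimal sketch of the kind of query described in YARN-2016 (not the attached YarnTest.java reproducer). It assumes the setStartRange(long, long) and setApplicationTags(Set) setters on GetApplicationsRequest; the tag value is made up.
{code}
import java.util.Collections;
import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationsRequest;

public class GetApplicationsRangeSketch {
  public static GetApplicationsRequest buildRequest(long submittedAfterMs,
      long submittedBeforeMs) {
    GetApplicationsRequest request = GetApplicationsRequest.newInstance();
    request.setApplicationTags(Collections.singleton("my-workflow-tag"));
    // Before the fix, this range was dropped when the request was converted
    // to protobuf, so the RM returned every application carrying the tag.
    request.setStartRange(submittedAfterMs, submittedBeforeMs);
    return request;
  }
}
{code}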
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995981#comment-13995981 ] Hudson commented on YARN-2016: -- FAILURE: Integrated in Hadoop-trunk-Commit #5604 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5604/]) YARN-2016. Fix a bug in GetApplicationsRequestPBImpl to add the missed fields to proto. Contributed by Junping Du (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594085) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/GetApplicationsRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestGetApplicationsRequest.java Yarn getApplicationRequest start time range is not honored -- Key: YARN-2016 URL: https://issues.apache.org/jira/browse/YARN-2016 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Venkat Ranganathan Assignee: Junping Du Fix For: 2.4.1 Attachments: YARN-2016.patch, YarnTest.java When we query for the previous applications by creating an instance of GetApplicationsRequest and setting the start time range and application tag, we see that the start range provided is not honored and all applications with the tag are returned Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2049: -- Attachment: YARN-2049.1.patch In this patch, I implemented the delegation token service over HTTP by leveraging the hadoop-auth modules, closely following the design of the delegation token service of HttpFS. 1. Make the TimelineDelegationTokenIdentifier and secretManager as usual. 2. Extend KerberosAuthenticationFilter and KerberosAuthenticationHandler to accept authentication based on either the Kerberos principal or the delegation token. 3. Extend KerberosAuthenticator to encapsulate DT-based communication, and add the APIs to get/renew/cancel the DT. 4. Modify the web stack to enable SPNEGO for the timeline server, and make the secret manager service callable from the filter. 5. Fix the test cases accordingly. This patch only compiles on top of YARN-1938 and HADOOP-10596. Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
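For context, the hadoop-auth SPNEGO client flow that the patch extends looks roughly like the sketch below. It does not show the new get/renew/cancel delegation-token APIs added by the patch, and the timeline URL path is a placeholder.
{code}
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.security.authentication.client.AuthenticatedURL;
import org.apache.hadoop.security.authentication.client.KerberosAuthenticator;

public class TimelineSpnegoClientSketch {
  public static int ping(String timelineWebAppUrl) throws Exception {
    URL url = new URL(timelineWebAppUrl + "/ws/v1/timeline");
    AuthenticatedURL.Token token = new AuthenticatedURL.Token();
    // KerberosAuthenticator performs the SPNEGO handshake and fills in the
    // signed auth token that later requests (or a DT exchange) can reuse.
    HttpURLConnection conn =
        new AuthenticatedURL(new KerberosAuthenticator()).openConnection(url, token);
    return conn.getResponseCode();
  }
}
{code}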
[jira] [Updated] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1861: Attachment: YARN-1861.7.patch Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995919#comment-13995919 ] Hadoop QA commented on YARN-1936: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644522/YARN-1936.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3742//console This message is automatically generated. Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1936.1.patch TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1751) Improve MiniYarnCluster for log aggregation testing
[ https://issues.apache.org/jira/browse/YARN-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-1751: -- Summary: Improve MiniYarnCluster for log aggregation testing (was: Improve MiniYarnCluster and LogCLIHelpers for log aggregation testing) Improve MiniYarnCluster for log aggregation testing --- Key: YARN-1751 URL: https://issues.apache.org/jira/browse/YARN-1751 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1751-trunk.patch MiniYarnCluster specifies an individual remote log aggregation root dir for each NM. Test code that uses MiniYarnCluster won't be able to get the value of the log aggregation root dir. The following code isn't necessary in MiniYarnCluster. File remoteLogDir = new File(testWorkDir, MiniYARNCluster.this.getName() + "-remoteLogDir-nm-" + index); remoteLogDir.mkdir(); config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR, remoteLogDir.getAbsolutePath()); In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to the FileContext.getFileContext() call. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1937) Access control of per-framework data
[ https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1937: -- Issue Type: Bug (was: Sub-task) Parent: (was: YARN-1530) Access control of per-framework data Key: YARN-1937 URL: https://issues.apache.org/jira/browse/YARN-1937 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)