[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994837#comment-13994837 ] Karthik Kambatla commented on YARN-1474:
# Correct me if I am wrong, but the changes to AllocationFileLoaderService look unrelated. Can we do them in a different JIRA?
# Nothing to do with this patch, but maybe we can add spaces between each interface ResourceSchedulerWrapper implements? {code} public class ResourceSchedulerWrapper extends AbstractYarnScheduler implements SchedulerWrapper,ResourceScheduler,Configurable { {code}
# Correct me if I am wrong, but we need to set the rmContext only once. Can we update the comment to say it needs to be called immediately after instantiating a scheduler?
# Do we need the changes to {{reinitialize()}} implementations in each scheduler? Also, I don't think we need a separate serviceInitInternal. Why not just have serviceInit call reinitialize?
# FairScheduler: we can do without these variables. {code} private volatile boolean isUpdateThreadRunning = false; private volatile boolean isSchedulingThreadRunning = false; {code}
# FairScheduler: serviceStartInternal and serviceStopInternal are fairly small methods - do we need these separate methods?
# Can we call join(timeout) after interrupt, maybe using a constant THREAD_JOIN_TIMEOUT = 1000? Also, set updateThread to null after join. {code} if (updateThread != null) { updateThread.interrupt(); } {code}
# Check whether schedulingThread is null? Also, set the thread to null after join(). {code} if (continuousSchedulingEnabled) { isSchedulingThreadRunning = false; schedulingThread.interrupt(); } {code}
Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
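As an illustration of the last two review items above, here is a minimal, self-contained sketch of the interrupt, join(timeout), and null-out pattern being suggested. The service and thread names are invented for the example and this is not the actual FairScheduler patch.
{code:java}
import org.apache.hadoop.service.AbstractService;

// Illustrative only: a service that owns a background thread and stops it
// with interrupt + bounded join, as suggested in the review comments above.
public class UpdateThreadService extends AbstractService {
  private static final long THREAD_JOIN_TIMEOUT = 1000; // ms

  private volatile Thread updateThread;

  public UpdateThreadService() {
    super("UpdateThreadService");
  }

  @Override
  protected void serviceStart() throws Exception {
    updateThread = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        // periodic work would go here
        try {
          Thread.sleep(500);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // restore interrupt flag and exit the loop
        }
      }
    }, "UpdateThread");
    updateThread.start();
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    Thread t = updateThread;
    if (t != null) {
      t.interrupt();
      t.join(THREAD_JOIN_TIMEOUT); // bounded wait so stop() cannot hang forever
      updateThread = null;         // drop the reference after join
    }
    super.serviceStop();
  }
}
{code}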
[jira] [Resolved] (YARN-2044) thrift interface for YARN?
[ https://issues.apache.org/jira/browse/YARN-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-2044. --- Resolution: Not a Problem We have protocol buffer based interfaces that you can look at. Also, please post questions on the mailing lists instead of opening tickets on the issue-tracker. Thanks. thrift interface for YARN? -- Key: YARN-2044 URL: https://issues.apache.org/jira/browse/YARN-2044 Project: Hadoop YARN Issue Type: Bug Reporter: Nikhil Mulley Hi, I was searching for the thrift interface definitions for YARN but could not come across any. Is there any plan to have a thrift interface to YARN ? If there is already one, could some one please redirect me to the appropriate place? thanks, Nikhil -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994865#comment-13994865 ] Vinod Kumar Vavilapalli commented on YARN-1515: --- bq. Vinod Kumar Vavilapalli, I am interested in your feedback in the context of your comment on MAPREDUCE-5044 . Yeah, sorry. This was in my blind spot. I understand this patch was online for a while, and is likely also being run in production, but I have some comments. As I mentioned on MAPREDUCE-5044, this feature is and should be done via YARN-445. Dumping threads is strictly a java construct and so far we have avoided any language feature in the YARN APIs (not willingly anyways). Can we instead implement this feature using YARN-445 and getting clients/AMs to send a SIGQUIT signal or some such signal command instead? I looked at the patch and that is indeed what it is doing eventually in the NM. We need to keep the API clean. Thoughts? Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: New Feature Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1515: -- Issue Type: Sub-task (was: New Feature) Parent: YARN-445 Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: Sub-task Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994878#comment-13994878 ] Vinod Kumar Vavilapalli commented on YARN-445: -- Folks, I just made YARN-1515 a sub-tasks of this. This JIRA is today focusing on exposing a signalling interface on the ResourceManager. It seems like we can simply expose the same API as part of ContainerManagement and get most of the thread-dump functionality with minimal changes. Ability to signal containers Key: YARN-445 URL: https://issues.apache.org/jira/browse/YARN-445 Project: Hadoop YARN Issue Type: Task Components: nodemanager Reporter: Jason Lowe Assignee: Andrey Klochkov Attachments: MRJob.png, MRTasks.png, YARN-445--n2.patch, YARN-445--n3.patch, YARN-445--n4.patch, YARN-445-signal-container-via-rm.patch, YARN-445.patch, YARNContainers.png It would be nice if an ApplicationMaster could send signals to contaniers such as SIGQUIT, SIGUSR1, etc. For example, in order to replicate the jstack-on-task-timeout feature implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an interface for sending SIGQUIT to a container. For that specific feature we could implement it as an additional field in the StopContainerRequest. However that would not address other potential features like the ability for an AM to trigger jstacks on arbitrary tasks *without* killing them. The latter feature would be a very useful debugging tool for users who do not have shell access to the nodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2017: -- Summary: Merge some of the common lib code in schedulers (was: Merge common code in schedulers) Edited the title to reflect what is being actually done. Merge some of the common lib code in schedulers --- Key: YARN-2017 URL: https://issues.apache.org/jira/browse/YARN-2017 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He A bunch of same code is repeated among schedulers, e.g: between FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1803) Signal container support in nodemanager
[ https://issues.apache.org/jira/browse/YARN-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994879#comment-13994879 ] Vinod Kumar Vavilapalli commented on YARN-1803: --- Tx for working on this Ming. A few comments, in line with [my comment on YARN-445|https://issues.apache.org/jira/browse/YARN-445?focusedCommentId=13994878&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13994878] about combining this functionality with the thread-dump feature,
- We need to consolidate the stopContainer* APIs into the signalContainer APIs. Logically, they are a subset of signalling.
- To make that happen, we will need bulk signalling APIs to signal multiple containers simultaneously.
- One other requirement as part of that is to be able to send an ordered list of signals, so that the NM can, for example, do things like sigterm+sigkill or thread-dump+sigterm+sigkill etc.
- SignalContainerCommand defines a bunch of commands that aren't going to be implemented today - let's only add those that are required and are going to be implemented as part of this set of patches.
Still navigating the entire arena w.r.t. the signalling work being done across several JIRAs. Signal container support in nodemanager --- Key: YARN-1803 URL: https://issues.apache.org/jira/browse/YARN-1803 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1803.patch It could include the following. 1. ContainerManager is able to process a new event type ContainerManagerEventType.SIGNAL_CONTAINERS coming from NodeStatusUpdater and deliver the request to ContainerExecutor. 2. Translate the platform independent signal command to Linux specific signals. Windows support will be tracked by another task. -- This message was sent by Atlassian JIRA (v6.2#6252)
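A rough, purely illustrative sketch of what such a bulk, ordered signalling request could look like. All class, enum, and field names here are invented for the example; this is not the API from any of the patches under discussion.
{code:java}
import java.util.List;

// Illustration only: one request that signals several containers with an
// ordered list of signals (e.g. SIGQUIT for a thread dump, then SIGTERM,
// then SIGKILL), each optionally delayed. Names here are hypothetical.
public final class SignalContainersRequestSketch {

  public enum Signal { SIGQUIT, SIGTERM, SIGKILL }

  /** One step in the ordered signal sequence. */
  public static final class SignalStep {
    public final Signal signal;
    public final long delayMillis; // wait this long before sending the signal

    public SignalStep(Signal signal, long delayMillis) {
      this.signal = signal;
      this.delayMillis = delayMillis;
    }
  }

  private final List<String> containerIds;  // containers to signal in bulk
  private final List<SignalStep> sequence;  // ordered signals applied to each

  public SignalContainersRequestSketch(List<String> containerIds,
                                       List<SignalStep> sequence) {
    this.containerIds = containerIds;
    this.sequence = sequence;
  }

  public List<String> getContainerIds() { return containerIds; }
  public List<SignalStep> getSequence() { return sequence; }
}
{code}
Under this shape of API, stopping containers is simply the special case of a sequence that ends in SIGKILL.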
[jira] [Assigned] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli reassigned YARN-1368: - Assignee: Jian He (was: Anubhav Dhoot) Assigning to Jian as he started putting up patches.. [~adhoot]/[~wangda], please help with reviews. Common work to re-populate containers’ state into scheduler --- Key: YARN-1368 URL: https://issues.apache.org/jira/browse/YARN-1368 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Jian He Attachments: YARN-1368.1.patch, YARN-1368.preliminary.patch YARN-1367 adds support for the NM to tell the RM about all currently running containers upon registration. The RM needs to send this information to the schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith reassigned YARN-1366: Assignee: Rohith ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994920#comment-13994920 ] Rohith commented on YARN-1366: -- Thank you for offering! I was just waiting for Anubhav Dhoot to finish the prototype. I'll assign it to myself :-) ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Attachments: YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-796: --- Assignee: Wangda Tan (was: Arun C Murthy) Working on this JIRA, assigned it to myself. And will post a design doc in a day or two. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: YARN-796.patch It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994891#comment-13994891 ] Vinod Kumar Vavilapalli commented on YARN-1366: --- [~rohithsharma], are you interested in taking this patch further? If so, assign it to yourselves and [~adhoot] can provide review comments and help. Otherwise, he will take it over from what I can see. ApplicationMasterService should Resync with the AM upon allocate call after restart --- Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Attachments: YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994893#comment-13994893 ] Vinod Kumar Vavilapalli commented on YARN-556: -- Also, if there is a general agreement on how patches should go in which order, please create that ordering through JIRA dependencies. Thanks. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar reassigned YARN-2026: Assignee: Ashwin Shankar Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler While using hierarchical queues in the fair scheduler, there are a few scenarios where we have seen a leaf queue with the least fair share take up a majority of the cluster and starve a sibling parent queue which has greater weight/fair share, and preemption doesn't kick in to reclaim resources. The root cause seems to be that the fair share of a parent queue is distributed to all its children irrespective of whether they are active or inactive (no apps running) queues. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and it has demands greater than that. When there are many queues under a parent queue (with high fair share), each child queue's fair share becomes really low. As a result, when only a few of these child queues have apps running, they reach their *tiny* fair share quickly and preemption doesn't happen even if other leaf queues (non-sibling) are hogging the cluster. This can be solved by dividing the fair share of a parent queue only among its active child queues. Here is an example describing the problem and proposed solution:
root.lowPriorityQueue is a leaf queue with weight 2
root.HighPriorityQueue is a parent queue with weight 8
root.HighPriorityQueue has 10 child leaf queues: root.HighPriorityQueue.childQ(1..10)
The above config results in root.HighPriorityQueue having an 80% fair share, and each of its ten child queues would have an 8% fair share. Preemption would happen only if a child queue's usage falls below 4% (0.5*8=4). Let's say at the moment no apps are running in any of root.HighPriorityQueue.childQ(1..10) and a few apps are running in root.lowPriorityQueue, which is taking up 95% of the cluster. Up to this point, the behavior of FS is correct. Now let's say root.HighPriorityQueue.childQ1 gets a big job which requires 30% of the cluster. It would get only the available 5% of the cluster, and preemption wouldn't kick in since it's above 4% (half its fair share). This is bad considering childQ1 is under a high-priority parent queue which has *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers, we would see the following allocation on the scheduler page:
*root.lowPriorityQueue = 95%*
*root.HighPriorityQueue.childQ1 = 5%*
This can be solved by distributing a parent's fair share only to active queues. So in the example above, since childQ1 is the only active queue under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds. Also note that a similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 hogs the cluster. childQ2 can take up 95% of the cluster and childQ1 would be stuck at 5% until childQ2 starts relinquishing containers. We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. 40%, which would ensure childQ1 gets up to 40% of resources if needed through preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
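The proposal above amounts to redistributing a parent's fair share across its active children only. A toy sketch of that computation follows, using invented classes and the example numbers from the description (one active child under a parent holding 80%); it is not the FairScheduler implementation.
{code:java}
import java.util.Arrays;
import java.util.List;

// Toy illustration of the proposal: a parent's fair share is split by
// weight across *active* children only.
public class ActiveQueueFairShareSketch {

  public static class Queue {
    final String name;
    final double weight;
    final int runningApps;
    double fairShare; // filled in by computeShares

    Queue(String name, double weight, int runningApps) {
      this.name = name;
      this.weight = weight;
      this.runningApps = runningApps;
    }

    boolean isActive() { return runningApps > 0; }
  }

  /** Split parentShare across active children in proportion to their weights. */
  static void computeShares(double parentShare, List<Queue> children) {
    double activeWeight = 0;
    for (Queue q : children) {
      if (q.isActive()) {
        activeWeight += q.weight;
      }
    }
    for (Queue q : children) {
      q.fairShare = (q.isActive() && activeWeight > 0)
          ? parentShare * q.weight / activeWeight
          : 0;
    }
  }

  public static void main(String[] args) {
    // childQ1 is the only active child of a parent holding 80% of the cluster,
    // so it receives the full 80% instead of a 10-way split (8%).
    List<Queue> children = Arrays.asList(
        new Queue("childQ1", 1, 3),
        new Queue("childQ2", 1, 0),
        new Queue("childQ3", 1, 0));
    computeShares(0.80, children);
    children.forEach(q -> System.out.println(q.name + " -> " + q.fairShare));
  }
}
{code}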
[jira] [Updated] (YARN-1989) Adding shell scripts to launch multiple servers on localhost
[ https://issues.apache.org/jira/browse/YARN-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-1989: --- Attachment: YARN-1989-0.patch attaching patch. local-resourcemanagers-ha.sh starts multiple resourcemanagers in HA mode on localhost. local-nodemanagers.sh starts multiple nodemanagers on localhost. Adding shell scripts to launch multiple servers on localhost Key: YARN-1989 URL: https://issues.apache.org/jira/browse/YARN-1989 Project: Hadoop YARN Issue Type: New Feature Reporter: Masatake Iwasaki Priority: Minor Attachments: YARN-1989-0.patch Adding shell scripts to launch multiple servers on localhost for test and debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1986) After upgrade from 2.2.0 to 2.4.0, NPE on first job start.
[ https://issues.apache.org/jira/browse/YARN-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992950#comment-13992950 ] Tsuyoshi OZAWA commented on YARN-1986: -- +1(non-binding). Let's wait for [~sandyr]'s comment. After upgrade from 2.2.0 to 2.4.0, NPE on first job start. -- Key: YARN-1986 URL: https://issues.apache.org/jira/browse/YARN-1986 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Jon Bringhurst Assignee: Hong Zhiguo Attachments: YARN-1986-2.patch, YARN-1986-3.patch, YARN-1986-testcase.patch, YARN-1986.patch After upgrade from 2.2.0 to 2.4.0, NPE on first job start. After RM was restarted, the job runs without a problem. {noformat} 19:11:13,441 FATAL ResourceManager:600 - Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:462) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:714) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:743) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:104) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591) at java.lang.Thread.run(Thread.java:744) 19:11:13,443 INFO ResourceManager:604 - Exiting, bbye.. {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-667) Data persisted in RM should be versioned
[ https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-667: Summary: Data persisted in RM should be versioned (was: Data persisted by YARN daemons should be versioned) Data persisted in RM should be versioned Key: YARN-667 URL: https://issues.apache.org/jira/browse/YARN-667 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.4-alpha Reporter: Siddharth Seth Assignee: Junping Du Includes data persisted for RM restart, NodeManager directory structure and the Aggregated Log Format. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994886#comment-13994886 ] Vinod Kumar Vavilapalli commented on YARN-556: -- Tx for the community update, Karthik. Also, Jian/Abhinav, can you both please file all the known sub-tasks right away and assign to yourselves the ones you are already working on? Other folks like [~ozawa] and [~rohithsharma] have repeatedly expressed interest in working on this feature. It'll be great to find stuff for everyone instead of creating all the tickets and assigning them to the two of you. Thanks. [~ozawa] and [~rohithsharma], let others know what you specifically want to work on, if you have something in mind. bq. 6. clustertimestamp is added to containerId so that containerId after RM restart do not clash with containerId before (as the containerId counter resets to zero in memory) I totally missed this line item. Can you give more detail on what the problem is and what the proposal is? What is done in the prototype patch is a major compatibility issue - I'd like to avoid it if we can. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2045) Data persisted in NM should be versioned
Junping Du created YARN-2045: Summary: Data persisted in NM should be versioned Key: YARN-2045 URL: https://issues.apache.org/jira/browse/YARN-2045 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Junping Du Assignee: Junping Du As a split task from YARN-667, we want to add version info to NM related data, include: - NodeManager local LevelDB state - NodeManager directory structure -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-667) Data persisted in RM should be versioned
[ https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994984#comment-13994984 ] Junping Du commented on YARN-667: - Filed YARN-2045 to address NM part data. Data persisted in RM should be versioned Key: YARN-667 URL: https://issues.apache.org/jira/browse/YARN-667 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.4-alpha Reporter: Siddharth Seth Assignee: Junping Du Includes data persisted for RM restart, NodeManager directory structure and the Aggregated Log Format. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-182) Unnecessary Container killed by the ApplicationMaster message for successful containers
[ https://issues.apache.org/jira/browse/YARN-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994967#comment-13994967 ] Deepak Kumar V commented on YARN-182: - Hello, I am seeing this in the Note section of each map task. Each map task's state is SUCCEEDED. Message: attempt_1399912169384_0001_m_13_0 100.00 SUCCEEDED map datanode-9-281920.slc01.dev.company.com:8042 logs Mon, 12 May 2014 16:33:33 GMT Mon, 12 May 2014 16:34:26 GMT 52sec Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143 Hadoop Details: NameNode 'namenode-284133.slc01.dev.company.com:8020' (active) Started: Tue May 06 16:18:04 GMT-07:00 2014 Version: 2.4.0.2.1.1.0-385, 68ceccf06a4441273e81a5ec856d41fc7e11c792 Compiled: 2014-04-16T21:24Z by jenkins from (no branch) Cluster ID: CID-fb86b3cf-7787-4c67-998f-24f00e43c137 Block Pool ID: BP-1163369527-10.65.216.196-1399412949036 Unnecessary Container killed by the ApplicationMaster message for successful containers - Key: YARN-182 URL: https://issues.apache.org/jira/browse/YARN-182 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.1-alpha Reporter: zhengqiu cai Assignee: Omkar Vinit Joshi Labels: hadoop, usability Attachments: Log.txt I was running wordcount and the resourcemanager web UI showed the status as FINISHED SUCCEEDED, but the log showed Container killed by the ApplicationMaster -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-667) Data persisted in RM should be versioned
[ https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994975#comment-13994975 ] Junping Du commented on YARN-667: - Limit the scope of this JIRA to persistent data in RM - RMStateStore. Will file separate JIRA to address persistent data in NM. Data persisted in RM should be versioned Key: YARN-667 URL: https://issues.apache.org/jira/browse/YARN-667 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.4-alpha Reporter: Siddharth Seth Assignee: Junping Du Includes data persisted for RM restart, NodeManager directory structure and the Aggregated Log Format. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994957#comment-13994957 ] Carlo Curino commented on YARN-2022: Sunil, I am travelling abroad till the 26th (please forgive delays)... I could only skim the patch from a mobile device. It looks reasonable; a concern I have is that we rely on a user-set Priority to choose whether to preempt or not. Unless there are checks in place preventing the user from abusing this value, this is egregiously gameable (set my containers all to AM priority and get away with murder). Also, I thought more about the possible corner cases, after conversations with Chris Douglas and Mayank: we should keep an eye out for the max percentage of resources dedicated to AMs... we should save the AMs from earlier (higher-pri) applications up till the max % of AMs we can allocate in the Queue, and at the very least not protect the AMs past that point. A similar check should be in place for userLimitFactor. Without this, it is entirely possible that a queue is wedged with 100% AMs, or that a user has in its AMs more resources than it deserves (and is systematically skipped, even if the cluster is empty). We have seen some of this in particularly extreme test cases (epsilon-size queues, many apps moved to a queue, etc.). Please share your thoughts on this... Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-2022.1.patch Cluster Size = 16GB [2 NMs] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which have taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps]. Currently in this scenario, job J3 will get killed including its AM. It would be better if the AM could be given the least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when the cluster is free, maps can be allocated to these jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
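A rough sketch of the AM-budget safeguard suggested in the comment above: AM containers are protected from preemption only until the queue's max AM-resource share is exhausted. All names and the memory-only accounting are assumptions for illustration; this is not the ProportionalCapacityPreemptionPolicy code.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustration only: protect AMs up to a budget, then treat them like any
// other preemptable container.
public class AmPreemptionGuardSketch {

  public static class ContainerInfo {
    final String id;
    final boolean isAm;
    final long memoryMb;

    ContainerInfo(String id, boolean isAm, long memoryMb) {
      this.id = id;
      this.isAm = isAm;
      this.memoryMb = memoryMb;
    }
  }

  /**
   * Returns containers that may be preempted. AM containers are skipped while
   * the protected AM total stays within maxAmPercent of queueCapacityMb; once
   * that budget is exhausted, further AMs are no longer protected.
   */
  static List<ContainerInfo> selectPreemptable(List<ContainerInfo> running,
                                               long queueCapacityMb,
                                               double maxAmPercent) {
    long amBudgetMb = (long) (queueCapacityMb * maxAmPercent);
    long protectedAmMb = 0;
    List<ContainerInfo> preemptable = new ArrayList<>();
    for (ContainerInfo c : running) {     // assume earlier/higher-pri apps come first
      if (c.isAm && protectedAmMb + c.memoryMb <= amBudgetMb) {
        protectedAmMb += c.memoryMb;      // still within the AM budget: protect it
      } else {
        preemptable.add(c);               // task container, or AM past the budget
      }
    }
    return preemptable;
  }
}
{code}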
[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994877#comment-13994877 ] Vinod Kumar Vavilapalli commented on YARN-1515: --- YARN-445 BTW is today focusing on exposing a signalling interface on the ResourceManager. It seems like we can simply expose the same API as part of ContainerManagement and get most of this functionality with minimal changes. Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: Sub-task Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994866#comment-13994866 ] Vinod Kumar Vavilapalli commented on YARN-1515: --- We can still implement the 'single RPC' functionality you wanted, by making the signal API take in a list of signals and optional time-intervals in between. Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: Sub-task Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2010: --- Attachment: yarn-2010-2.patch New patch with following changes - # Noticed that RMAppManager#recoverApplication wasn't failing running applications in all the code-paths corresponding to failed recovery. Fixed that and cleaned it up futher. # Changed the config name to be shorter. # Added comments to make sure we document why we are doing what we are doing. RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Rohith Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 
8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.init(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) ... 13 more {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2042) String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp()
Ted Yu created YARN-2042: Summary: String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp() Key: YARN-2042 URL: https://issues.apache.org/jira/browse/YARN-2042 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Priority: Minor {code} if (queueName != null && queueName != "") { {code} queueName.isEmpty() should be used instead of comparing against "" with == -- This message was sent by Atlassian JIRA (v6.2#6252)
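For illustration, the suggested fix amounts to an explicit null-and-empty check rather than a reference comparison against the empty-string literal; the helper name below is made up for the example.
{code:java}
// Illustrative only: the null/empty check expressed without reference
// comparison against "".
static boolean hasQueueName(String queueName) {
  return queueName != null && !queueName.isEmpty();
}
{code}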
[jira] [Updated] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1515: -- Target Version/s: 2.5.0 (was: 2.4.0) Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: Sub-task Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2046) Out of band heartbeats are sent only on container kill and possibly too early
Jason Lowe created YARN-2046: Summary: Out of band heartbeats are sent only on container kill and possibly too early Key: YARN-2046 URL: https://issues.apache.org/jira/browse/YARN-2046 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0, 0.23.10 Reporter: Jason Lowe [~mingma] pointed out in the review discussion for MAPREDUCE-5465 that the NM is currently sending out of band heartbeats only when stopContainer is called. In addition those heartbeats might be sent too early because the container kill event is asynchronously posted then the heartbeat monitor is notified. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995165#comment-13995165 ] Karthik Kambatla commented on YARN-556: --- Oh. Forgot to mention that. [~adhoot] offered to split up the prototype into multiple patches, one for each of the sub-tasks. If I understand right, his prototype covers almost all the sub-tasks already created. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1918) Typo in description and error message for 'yarn.resourcemanager.cluster-id'
[ https://issues.apache.org/jira/browse/YARN-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995173#comment-13995173 ] Tsuyoshi OZAWA commented on YARN-1918: -- Thanks for your contribution, [~analog.sony]. It looks good to me (non-binding). Please wait for a review by committers. Typo in description and error message for 'yarn.resourcemanager.cluster-id' --- Key: YARN-1918 URL: https://issues.apache.org/jira/browse/YARN-1918 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Devaraj K Assignee: Anandha L Ranganathan Priority: Trivial Labels: newbie Attachments: YARN-1918.1.patch 1. In yarn-default.xml
{code:xml}
<property>
  <description>Name of the cluster. In a HA setting, this is used to ensure the RM participates in leader election fo this cluster and ensures it does not affect other clusters</description>
  <name>yarn.resourcemanager.cluster-id</name>
  <!--<value>yarn-cluster</value>-->
</property>
{code}
Here the line 'election fo this cluster and ensures it does not affect' should be replaced with 'election for this cluster and ensures it does not affect'. 2.
{code:xml}
org.apache.hadoop.HadoopIllegalArgumentException: Configuration doesn't specifyyarn.resourcemanager.cluster-id at org.apache.hadoop.yarn.conf.YarnConfiguration.getClusterId(YarnConfiguration.java:1336)
{code}
In the above exception message, a space is missing between the message and the configuration name. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1701) Improve default paths of timeline store and generic history store
[ https://issues.apache.org/jira/browse/YARN-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994348#comment-13994348 ] Hudson commented on YARN-1701: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #560 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/560/]) YARN-1701. Improved default paths of the timeline store and the generic history store. Contributed by Tsuyoshi Ozawa. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593481) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Improve default paths of timeline store and generic history store - Key: YARN-1701 URL: https://issues.apache.org/jira/browse/YARN-1701 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Tsuyoshi OZAWA Fix For: 2.4.1 Attachments: YARN-1701.3.patch, YARN-1701.v01.patch, YARN-1701.v02.patch When I enable AHS via yarn.ahs.enabled, the app history is still not visible in AHS webUI. This is due to NullApplicationHistoryStore as yarn.resourcemanager.history-writer.class. It would be good to have just one key to enable basic functionality. yarn.ahs.fs-history-store.uri uses {code}${hadoop.log.dir}{code}, which is local file system location. However, FileSystemApplicationHistoryStore uses DFS by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1474: - Attachment: YARN-1474.11.patch Added missing file to pass compile. Make schedulers services Key: YARN-1474 URL: https://issues.apache.org/jira/browse/YARN-1474 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.3.0 Reporter: Sandy Ryza Assignee: Tsuyoshi OZAWA Attachments: YARN-1474.1.patch, YARN-1474.10.patch, YARN-1474.11.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, YARN-1474.9.patch Schedulers currently have a reinitialize but no start and stop. Fitting them into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2027) YARN ignores host-specific resource requests
[ https://issues.apache.org/jira/browse/YARN-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995262#comment-13995262 ] Bikas Saha commented on YARN-2027: -- Was the relaxLocality flag set to false in order to make a hard constraint for the node? Or is the jira stating that even soft locality constraints (where YARN is allowed to relax the locality from node to rack to *) are also not working? Soft locality would need delay scheduling to be enabled and that needs the configs that Sandy mentioned. YARN ignores host-specific resource requests Key: YARN-2027 URL: https://issues.apache.org/jira/browse/YARN-2027 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.0 Environment: RHEL 6.1 YARN 2.4 Reporter: Chris Riccomini YARN appears to be ignoring host-level ContainerRequests. I am creating a container request with code that pretty closely mirrors the DistributedShell code:
{code}
protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) {
  info("Requesting %d container(s) with %dmb of memory" format (containers, memMb))
  val capability = Records.newRecord(classOf[Resource])
  val priority = Records.newRecord(classOf[Priority])
  priority.setPriority(0)
  capability.setMemory(memMb)
  capability.setVirtualCores(cpuCores)
  // Specifying a host in the String[] host parameter here seems to do nothing. Setting relaxLocality to false also doesn't help.
  (0 until containers).foreach(idx => amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority)))
}
{code}
When I run this code with a specific host in the ContainerRequest, YARN does not honor the request. Instead, it puts the container on an arbitrary host. This appears to be true for both the FifoScheduler and the CapacityScheduler. Currently, we are running the CapacityScheduler with the following settings:
{noformat}
<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>1</value>
    <description>Maximum number of applications that can be pending and running.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
    <description>Maximum percent of resources in the cluster which can be used to run application masters i.e. controls number of concurrent running applications.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default</value>
    <description>The queues at the this level (root is the root queue).</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>100</value>
    <description>Samza queue target capacity.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>1</value>
    <description>Default queue user limit a percentage from 0.0 to 1.0.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>100</value>
    <description>The maximum capacity of the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.state</name>
    <value>RUNNING</value>
    <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <value>*</value>
    <description>The ACL of who can submit jobs to the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
    <value>*</value>
    <description>The ACL of who can administer jobs on the default queue.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>40</value>
    <description>Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. Typically this should be set to number of nodes in the cluster, By default is setting approximately number of nodes in
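For reference, a node-specific request with a hard locality constraint would look roughly like the sketch below, using the AMRMClient.ContainerRequest constructor that takes a relaxLocality flag. The host name and resource sizes are placeholders, and whether the schedulers honor such a request is exactly what this JIRA questions.
{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// Sketch only: pass the target hosts explicitly and set relaxLocality to
// false so the request is not allowed to fall back to rack or ANY.
public class NodeLocalRequestSketch {
  static void requestOnHost(AMRMClient<ContainerRequest> amClient) {
    Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
    Priority priority = Priority.newInstance(0);
    ContainerRequest req = new ContainerRequest(
        capability,
        new String[] { "somehost.example.com" }, // nodes (placeholder host)
        null,                                    // racks
        priority,
        false);                                  // relaxLocality: hard constraint
    amClient.addContainerRequest(req);
  }
}
{code}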
[jira] [Created] (YARN-2036) Document yarn.resourcemanager.hostname in ClusterSetup
Karthik Kambatla created YARN-2036: -- Summary: Document yarn.resourcemanager.hostname in ClusterSetup Key: YARN-2036 URL: https://issues.apache.org/jira/browse/YARN-2036 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Attachments: YARN2036-01.patch, YARN2036-02.patch ClusterSetup doesn't talk about yarn.resourcemanager.hostname - most people should just be able to use that directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995290#comment-13995290 ] Anubhav Dhoot commented on YARN-2001: - Won't killing the containers on RM restart/failover defeat the purpose of the work-preserving effort? Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2046) Out of band heartbeats are sent only on container kill and possibly too early
[ https://issues.apache.org/jira/browse/YARN-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995147#comment-13995147 ] Jason Lowe commented on YARN-2046: -- We should consider sending out of band heartbeats after a container completes rather than when a container is killed. For a cluster running MapReduce this should be almost equivalent in terms of number of OOB heartbeats sent since the MR AM always kills completed task attempts until MAPREDUCE-5465 is addressed. Out of band heartbeats are sent only on container kill and possibly too early - Key: YARN-2046 URL: https://issues.apache.org/jira/browse/YARN-2046 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe [~mingma] pointed out in the review discussion for MAPREDUCE-5465 that the NM is currently sending out of band heartbeats only when stopContainer is called. In addition those heartbeats might be sent too early because the container kill event is asynchronously posted then the heartbeat monitor is notified. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2039) Better reporting of finished containers to AMs
[ https://issues.apache.org/jira/browse/YARN-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2039. Resolution: Duplicate Thanks for pointing that out, Bikas. Resolving as duplicate. Better reporting of finished containers to AMs -- Key: YARN-2039 URL: https://issues.apache.org/jira/browse/YARN-2039 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Priority: Critical On RM restart, we shouldn't lose information about finished containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995361#comment-13995361 ] Jian He commented on YARN-2016: --- Adding unit tests for all records would be another big effort. Junping, you can open a new jira to discuss about this if needed. Committing this. Yarn getApplicationRequest start time range is not honored -- Key: YARN-2016 URL: https://issues.apache.org/jira/browse/YARN-2016 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Venkat Ranganathan Assignee: Junping Du Attachments: YARN-2016.patch, YarnTest.java When we query for the previous applications by creating an instance of GetApplicationsRequest and setting the start time range and application tag, we see that the start range provided is not honored and all applications with the tag are returned Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995316#comment-13995316 ] Bikas Saha commented on YARN-2001: -- I think the offline discussion agreement was that there would be a threshold for NM's to resync. After that threshold the scheduler would be started. After that the NM's have until the NM heartbeat expire interval to resync. After the NM expiry interval, the NM's are considered lost (consistent with current behavior). Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
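A hypothetical sketch of the timing rule described in the comment above: scheduling resumes only after a resync threshold, and NMs that have not re-registered within the existing expiry interval are treated as lost. The class, fields, and values are invented for illustration and are not RM code or configuration.
{code:java}
import java.util.concurrent.TimeUnit;

// Illustration only: two time windows measured from the moment the RM
// becomes active after failover.
public class ResyncPolicySketch {
  private final long schedulerStartThresholdMs; // wait before scheduling resumes
  private final long nmExpiryIntervalMs;        // existing NM liveness expiry
  private final long becameActiveAtMs;          // when this RM became active

  public ResyncPolicySketch(long schedulerStartThresholdMs,
                            long nmExpiryIntervalMs,
                            long becameActiveAtMs) {
    this.schedulerStartThresholdMs = schedulerStartThresholdMs;
    this.nmExpiryIntervalMs = nmExpiryIntervalMs;
    this.becameActiveAtMs = becameActiveAtMs;
  }

  /** Scheduling (and new AM requests) resume only after the threshold. */
  public boolean canSchedule(long nowMs) {
    return nowMs - becameActiveAtMs >= schedulerStartThresholdMs;
  }

  /** An NM that has not re-registered within the expiry interval is lost. */
  public boolean isNodeLost(long nowMs, boolean reRegistered) {
    return !reRegistered && nowMs - becameActiveAtMs >= nmExpiryIntervalMs;
  }

  public static void main(String[] args) {
    ResyncPolicySketch p = new ResyncPolicySketch(
        TimeUnit.MINUTES.toMillis(2), TimeUnit.MINUTES.toMillis(10), 0L);
    System.out.println(p.canSchedule(TimeUnit.MINUTES.toMillis(3)));       // true
    System.out.println(p.isNodeLost(TimeUnit.MINUTES.toMillis(3), false)); // false
  }
}
{code}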
[jira] [Commented] (YARN-1982) Rename the daemon name to timelineserver
[ https://issues.apache.org/jira/browse/YARN-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995404#comment-13995404 ] Hudson commented on YARN-1982: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-1982. Renamed the daemon name to be TimelineServer instead of History Server and deprecated the old usage. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593748) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn.cmd * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/conf/yarn-env.sh Rename the daemon name to timelineserver Key: YARN-1982 URL: https://issues.apache.org/jira/browse/YARN-1982 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Labels: cli Fix For: 2.5.0 Attachments: YARN-1982.1.patch Nowadays, it's confusing that we call the new component timeline server, but we use {code} yarn historyserver yarn-daemon.sh start historyserver {code} to start the daemon. Before the confusion keeps being propagated, we'd better to modify command line asap. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1962) Timeline server is enabled by default
[ https://issues.apache.org/jira/browse/YARN-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995407#comment-13995407 ] Hudson commented on YARN-1962: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-1962. Changed Timeline Service client configuration to be off by default given the non-readiness of the feature yet. Contributed by Mohammad Kamrul Islam. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593750) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Timeline server is enabled by default - Key: YARN-1962 URL: https://issues.apache.org/jira/browse/YARN-1962 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Reporter: Mohammad Kamrul Islam Assignee: Mohammad Kamrul Islam Fix For: 2.4.1 Attachments: YARN-1962.1.patch, YARN-1962.2.patch Since Timeline server is not matured and secured yet, enabling it by default might create some confusion. We were playing with 2.4.0 and found a lot of exceptions for distributed shell example related to connection refused error. Btw, we didn't run TS because it is not secured yet. Although it is possible to explicitly turn it off through yarn-site config. In my opinion, this extra change for this new service is not worthy at this point,. This JIRA is to turn it off by default. If there is an agreement, i can put a simple patch about this. {noformat} 14/04/17 23:24:33 ERROR impl.TimelineClientImpl: Failed to get the response from the timeline server. 
com.sun.jersey.api.client.ClientHandlerException: java.net.ConnectException: Connection refused at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingEntities(TimelineClientImpl.java:131) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:104) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1072) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:515) at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:281) Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at sun.net.NetworkClient.doConnect(NetworkClient.java:180) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.in14/04/17 23:24:33 ERROR impl.TimelineClientImpl: Failed to get the response from the timeline server. com.sun.jersey.api.client.ClientHandlerException: java.net.ConnectException: Connection refused at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingEntities(TimelineClientImpl.java:131) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:104) at
[jira] [Commented] (YARN-2036) Document yarn.resourcemanager.hostname in ClusterSetup
[ https://issues.apache.org/jira/browse/YARN-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995405#comment-13995405 ] Hudson commented on YARN-2036: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-2036. Document yarn.resourcemanager.hostname in ClusterSetup (Ray Chiang via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593631) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/site/apt/ClusterSetup.apt.vm * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt Document yarn.resourcemanager.hostname in ClusterSetup -- Key: YARN-2036 URL: https://issues.apache.org/jira/browse/YARN-2036 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Ray Chiang Priority: Minor Fix For: 2.5.0 Attachments: YARN2036-01.patch, YARN2036-02.patch ClusterSetup doesn't talk about yarn.resourcemanager.hostname - most people should just be able to use that directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1975) Used resources shows escaped html in CapacityScheduler and FairScheduler page
[ https://issues.apache.org/jira/browse/YARN-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995408#comment-13995408 ] Hudson commented on YARN-1975: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-1975. Fix yarn application CLI to print the scheme of the tracking url of failed/killed applications. Contributed by Junping Du (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593874) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java Used resources shows escaped html in CapacityScheduler and FairScheduler page - Key: YARN-1975 URL: https://issues.apache.org/jira/browse/YARN-1975 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 2.4.0 Reporter: Nathan Roberts Assignee: Mit Desai Fix For: 3.0.0, 2.4.1 Attachments: YARN-1975.patch, screenshot-1975.png Used resources displays as amp;lt;memory:, vCores;amp;gt; with capacity scheduler -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2032: Attachment: YARN-2032-branch-2-1.patch Updating patch for branch-2 Thanks, Mayank Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2032-branch-2-1.patch As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995438#comment-13995438 ] Karthik Kambatla commented on YARN-2033: Thanks Vinod. Would like to hear your thoughts on the following: What are the perceived scalability requirements of the history store and timeline store? I would think the timeline store might have to support storing a lot more information than the history store. In that case, one might want to keep them separate? Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Having two different stores isn't amenable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995326#comment-13995326 ] Tsuyoshi OZAWA commented on YARN-2001: -- [~adhoot], you're correct basically. I meant that if epoch gap between RM and NM is too large to handle for RM, it can be killed. It saves memory usage of RM. Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995328#comment-13995328 ] Anubhav Dhoot commented on YARN-556: bq. clustertimestamp is added to containerId so that containerId after RM restart do not clash with containerId before (as the containerId counter resets to zero in memory). The problem is the containerId currently is composed of ApplicationAttemptId + int. The int part comes from a in memory containerIdCounter from AppSchedulingInfo. This gets reset after a RM restart. Without any changes the containerIds for containers allocated after restart would clash with existing containerIds. The prototype proposal is to make it ApplicationAttemptId + uniqueid + int where the uniqueid can be a timestamp set by RM. I feel containerId should be an opaque string that YARN app developers don't take a dependency on. Also if we used protobuf serialization/deserialization rules everywhere we could deal with compatibility changes of different YARN code versions. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
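To illustrate the clash and the proposed fix described above, a rough sketch; the {{newContainerId}} helper and the bit layout are hypothetical, not existing YARN API:
{code}
// Today (simplified): a container id is <ApplicationAttemptId, int counter>,
// and the counter restarts at 0 after an RM restart, so new ids can collide
// with ids handed out before the restart.
//
// Proposal (sketch): fold a uniqueid chosen by the RM at startup (e.g. a
// timestamp/epoch) into the numeric part so post-restart ids stay unique.
long newContainerId(long rmUniqueId, int counter) {
  // hypothetical packing: high bits carry the per-restart unique id,
  // low bits carry the in-memory counter from AppSchedulingInfo
  return (rmUniqueId << 32) | (counter & 0xFFFFFFFFL);
}
{code}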
[jira] [Assigned] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-1913: -- Assignee: Karthik Kambatla With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Karthik Kambatla It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995354#comment-13995354 ] Jian He commented on YARN-1372: --- An alternative would be to make NM remember the current containers in memory until the application is completed. On each re-register NM sends across the whole list of container statuses. Typically, each NM holds tens of containers in memory which shouldn't be much memory overhead, as compared to RM which holds all the active containers in the cluster. This also avoids protocol changes. Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.2#6252)
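A small sketch of the alternative described above, with hypothetical class and method names (not the real NodeStatusUpdater code): the NM keeps completed-container statuses per application and replays all of them when it re-registers.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

// Illustrative NM-side bookkeeping only.
public class CompletedContainerCache {
  private final Map<ApplicationId, List<ContainerStatus>> completedByApp =
      new HashMap<ApplicationId, List<ContainerStatus>>();

  // Remember every completed container until its application finishes.
  public void onContainerCompleted(ApplicationId appId, ContainerStatus status) {
    List<ContainerStatus> list = completedByApp.get(appId);
    if (list == null) {
      list = new ArrayList<ContainerStatus>();
      completedByApp.put(appId, list);
    }
    list.add(status);
  }

  // Sent along with re-registration so a restarted RM can re-report the
  // completed containers to the AMs.
  public List<ContainerStatus> statusesForReRegister() {
    List<ContainerStatus> all = new ArrayList<ContainerStatus>();
    for (List<ContainerStatus> l : completedByApp.values()) {
      all.addAll(l);
    }
    return all;
  }

  // Dropped only once the application itself completes.
  public void onApplicationFinished(ApplicationId appId) {
    completedByApp.remove(appId);
  }
}
{code}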
[jira] [Commented] (YARN-2042) String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp()
[ https://issues.apache.org/jira/browse/YARN-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995485#comment-13995485 ] Hadoop QA commented on YARN-2042: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644357/YARN-2042.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3732//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3732//console This message is automatically generated. String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp() Key: YARN-2042 URL: https://issues.apache.org/jira/browse/YARN-2042 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Chen He Priority: Minor Attachments: YARN-2042.patch {code} if (queueName != null queueName != ) { {code} queueName.isEmpty() should be used instead of comparing against -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995213#comment-13995213 ] Tsuyoshi OZAWA commented on YARN-2001: -- [~leftnoteasy], my idea is creating a ClusterId-space under the epoch (cluster-timestamp) like {{Map<Epoch, List<ClusterID>>}}. * Epoch (saved in ZKRMStateStore and RM's memory), just an integer value. * ClusterID (saved in RM's memory), same as current code. A rough sketch is as follows: * When a new active RM starts up, Epoch in RMStateStore is incremented and RM sets the Epoch. ClusterID is reset to zero. * Heartbeats between NM and RM include Epoch: RM can distinguish old cluster-timestamps from the new one when NM is registered. If the Epoch is older than RM expects, RM can kill the containers via NM. Please correct me if I'm wrong. Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
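A rough illustration of the epoch bookkeeping sketched above; the {{incrementEpoch()}} store call and the class names are assumptions, not existing RMStateStore API:
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only; incrementEpoch() is a hypothetical state-store operation.
public class EpochTracker {

  // Hypothetical stand-in for the real RMStateStore.
  public interface StateStore {
    long incrementEpoch() throws Exception;
  }

  private long epoch;   // persisted, e.g. in ZKRMStateStore
  private final Map<Long, List<Integer>> clusterIdsByEpoch =
      new HashMap<Long, List<Integer>>();

  // On every transition to active: bump the persisted epoch and reset the
  // in-memory ClusterID space under the new epoch.
  public void onBecomeActive(StateStore stateStore) throws Exception {
    epoch = stateStore.incrementEpoch();
    clusterIdsByEpoch.put(epoch, new ArrayList<Integer>());
  }

  // NM heartbeats carry the epoch they registered under; an older epoch means
  // its containers belong to a previous RM instance and can be told to exit.
  public boolean isFromOldEpoch(long nmReportedEpoch) {
    return nmReportedEpoch < epoch;
  }
}
{code}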
[jira] [Commented] (YARN-1969) Fair Scheduler: Add policy for Earliest Deadline First
[ https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995153#comment-13995153 ] Karthik Kambatla commented on YARN-1969: Just stating the obvious: we need to add a way to specify per-job deadlines too. Fair Scheduler: Add policy for Earliest Deadline First -- Key: YARN-1969 URL: https://issues.apache.org/jira/browse/YARN-1969 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling* however they have a low priority since there are other jobs (usually much smaller, newcomers) that are using resources way below their fair share, hence newly released containers are not offered to the big, yet close-to-be-finished job. Nevertheless, everybody would benefit from an unfair scheduling that offers the resource to the big job since the sooner the big job finishes, the sooner it releases its many allocated resources to be used by other jobs. In other words, what we require is a kind of variation of *Earliest Deadline First scheduling* that takes into account the number of already-allocated resources and estimated time to finish. http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the priority in scheduling would be a function p of (MEM, TIME). The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and be supplied to RM in the resource request messages. To be less susceptible to the issue of apps gaming the system, we can have this scheduling limited to *only within a queue*: i.e., adding an EarliestDeadlinePolicy extends SchedulingPolicy and letting the queues use it by setting the schedulingPolicy field. -- This message was sent by Atlassian JIRA (v6.2#6252)
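To make the proposed ordering concrete, a toy comparator for an EDF-style policy; {{SchedulableApp}}, its accessors, and the particular p(MEM, TIME) shown here are illustrative assumptions, standing in for the AM-supplied estimates, not actual FairScheduler API:
{code}
import java.util.Comparator;

// Hypothetical stand-in for a FairScheduler Schedulable.
interface SchedulableApp {
  double getAllocatedMemoryGB();          // MEM
  double getEstimatedMinutesToFinish();   // TIME, estimated by the AM
}

// Toy EDF-style ordering: favor apps that hold a lot of resources and are
// close to finishing, so they release those resources sooner.
class EarliestDeadlineComparator implements Comparator<SchedulableApp> {
  @Override
  public int compare(SchedulableApp a, SchedulableApp b) {
    // smaller score = schedule first; one possible choice of p(MEM, TIME)
    double scoreA = a.getEstimatedMinutesToFinish() / (1.0 + a.getAllocatedMemoryGB());
    double scoreB = b.getEstimatedMinutesToFinish() / (1.0 + b.getAllocatedMemoryGB());
    return Double.compare(scoreA, scoreB);
  }
}
{code}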
[jira] [Commented] (YARN-2011) Fix typo and warning in TestLeafQueue
[ https://issues.apache.org/jira/browse/YARN-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995403#comment-13995403 ] Hudson commented on YARN-2011: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-2011. Fix typo and warning in TestLeafQueue (Contributed by Chen He) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593804) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java Fix typo and warning in TestLeafQueue - Key: YARN-2011 URL: https://issues.apache.org/jira/browse/YARN-2011 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Chen He Assignee: Chen He Priority: Trivial Fix For: 2.5.0 Attachments: YARN-2011-v2.patch, YARN-2011.patch a.assignContainers(clusterResource, node_0); assertEquals(2*GB, a.getUsedResources().getMemory()); assertEquals(2*GB, app_0.getCurrentConsumption().getMemory()); assertEquals(0*GB, app_1.getCurrentConsumption().getMemory()); assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G assertEquals(0*GB, app_0.getHeadroom().getMemory()); // User limit = 2G // Again one to user_0 since he hasn't exceeded user limit yet a.assignContainers(clusterResource, node_0); assertEquals(3*GB, a.getUsedResources().getMemory()); assertEquals(2*GB, app_0.getCurrentConsumption().getMemory()); assertEquals(1*GB, app_1.getCurrentConsumption().getMemory()); assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G assertEquals(0*GB, app_0.getHeadroom().getMemory()); // 3G - 2G -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995516#comment-13995516 ] Bikas Saha commented on YARN-1372: -- what happens when they are 100's of long jobs running 100's of containers per nodes. Do we hold onto info about all those containers that have completed long ago? Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1987) Wrapper for leveldb DBIterator to aid in handling database exceptions
[ https://issues.apache.org/jira/browse/YARN-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995398#comment-13995398 ] Hudson commented on YARN-1987: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-1987. Wrapper for leveldb DBIterator to aid in handling database exceptions. (Jason Lowe via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593757) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/LeveldbIterator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/utils * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/utils/TestLeveldbIterator.java Wrapper for leveldb DBIterator to aid in handling database exceptions - Key: YARN-1987 URL: https://issues.apache.org/jira/browse/YARN-1987 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 2.5.0 Attachments: YARN-1987.patch, YARN-1987v2.patch Per discussions in YARN-1984 and MAPREDUCE-5652, it would be nice to have a utility wrapper around leveldb's DBIterator to translate the raw RuntimeExceptions it can throw into DBExceptions to make it easier to handle database errors while iterating. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995498#comment-13995498 ] Hadoop QA commented on YARN-2016: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644248/YARN-2016.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3733//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3733//console This message is automatically generated. Yarn getApplicationRequest start time range is not honored -- Key: YARN-2016 URL: https://issues.apache.org/jira/browse/YARN-2016 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Venkat Ranganathan Assignee: Junping Du Attachments: YARN-2016.patch, YarnTest.java When we query for the previous applications by creating an instance of GetApplicationsRequest and setting the start time range and application tag, we see that the start range provided is not honored and all applications with the tag are returned Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2048) List all of the containers of an application from the yarn web
[ https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Zhou updated YARN-2048: --- Attachment: YARN-2048-trunk-v1.patch Submit a patch on trunk List all of the containers of an application from the yarn web -- Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Reporter: Min Zhou Attachments: YARN-2048-trunk-v1.patch Currently, YARN doesn't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications have already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is to maintain a container list in RMAppImpl and expose this info on the Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2048) List all of the containers of an application from the yarn web
Min Zhou created YARN-2048: -- Summary: List all of the containers of an application from the yarn web Key: YARN-2048 URL: https://issues.apache.org/jira/browse/YARN-2048 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, webapp Reporter: Min Zhou Currently, YARN doesn't provide a way to list all of the containers of an application from its web. This kind of information is needed by the application user. They can conveniently know how many containers their applications have already acquired as well as which nodes those containers were launched on. They also want to view the logs of each container of an application. One approach is to maintain a container list in RMAppImpl and expose this info on the Application page. I will submit a patch soon -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2040) Recover information about finished containers
[ https://issues.apache.org/jira/browse/YARN-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved YARN-2040. -- Resolution: Duplicate This will be covered by YARN-1337. Recover information about finished containers - Key: YARN-2040 URL: https://issues.apache.org/jira/browse/YARN-2040 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla The NM should store and recover information about finished containers as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2042) String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp()
[ https://issues.apache.org/jira/browse/YARN-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995505#comment-13995505 ] Chen He commented on YARN-2042: --- The change in this patch does not need to include a test. String shouldn't be compared using == in QueuePlacementRule#NestedUserQueue#getQueueForApp() Key: YARN-2042 URL: https://issues.apache.org/jira/browse/YARN-2042 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Chen He Priority: Minor Attachments: YARN-2042.patch {code} if (queueName != null && queueName != "") { {code} queueName.isEmpty() should be used instead of comparing against "" -- This message was sent by Atlassian JIRA (v6.2#6252)
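A minimal sketch of the check YARN-2042 asks for, using value comparison instead of the reference comparison shown in the snippet above; the helper method name is just for illustration:
{code}
// Reference comparison (queueName != "") only detects the exact same String
// object; isEmpty() checks the value.
static boolean hasQueueName(String queueName) {
  return queueName != null && !queueName.isEmpty();
}
{code}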
[jira] [Updated] (YARN-1702) Expose kill app functionality as part of RM web services
[ https://issues.apache.org/jira/browse/YARN-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-1702: Attachment: apache-yarn-1702.9.patch New patch with the following fixes - 1. Use right call to get queue and acl managers to check permissions 2. Fix kill api to use PUT on /apps/{appid}/state to kill an app 3. Added documentation for the REST call. Expose kill app functionality as part of RM web services Key: YARN-1702 URL: https://issues.apache.org/jira/browse/YARN-1702 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-1702.2.patch, apache-yarn-1702.3.patch, apache-yarn-1702.4.patch, apache-yarn-1702.5.patch, apache-yarn-1702.7.patch, apache-yarn-1702.8.patch, apache-yarn-1702.9.patch Expose functionality to kill an app via the ResourceManager web services API. -- This message was sent by Atlassian JIRA (v6.2#6252)
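A hedged example of what the call from item 2 could look like from a client; the host, port, application id, and the standard {{/ws/v1/cluster}} prefix are assumptions made for illustration, not taken from the patch:
{code}
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// PUT /apps/{appid}/state with {"state":"KILLED"} to ask the RM to kill the app.
public class KillAppViaRest {
  public static void main(String[] args) throws Exception {
    URL url = new URL(
        "http://rm-host:8088/ws/v1/cluster/apps/application_1400000000000_0001/state");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    byte[] body = "{\"state\":\"KILLED\"}".getBytes(StandardCharsets.UTF_8);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body);
    }
    System.out.println("HTTP " + conn.getResponseCode());
  }
}
{code}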
[jira] [Updated] (YARN-1938) Kerberos authentication for the timeline server
[ https://issues.apache.org/jira/browse/YARN-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1938: -- Attachment: YARN-1938.2.patch Made some minor touch on the last patch. Kerberos authentication for the timeline server --- Key: YARN-1938 URL: https://issues.apache.org/jira/browse/YARN-1938 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1938.1.patch, YARN-1938.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992962#comment-13992962 ] Tsuyoshi OZAWA commented on YARN-2001: -- {quote} If possible, I think we should avoid changing container Id format. {quote} +1, if possible. Can we add epoch (cluster timestamp) to ResourceTrackerService's state via heartbeat? Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1457) YARN single node install issues on mvn clean install assembly:assembly on mapreduce project
[ https://issues.apache.org/jira/browse/YARN-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992948#comment-13992948 ] Shaun Gittens commented on YARN-1457: - I get virtually the same error when executing : mvn clean install assembly:assembly -DskipTests except I'm running it on a Centos VM compiling Hadoop 2.2.0 and proto 2.5.0 ... YARN single node install issues on mvn clean install assembly:assembly on mapreduce project --- Key: YARN-1457 URL: https://issues.apache.org/jira/browse/YARN-1457 Project: Hadoop YARN Issue Type: Bug Components: site Affects Versions: 2.0.5-alpha Reporter: Rekha Joshi Priority: Minor Labels: mvn Attachments: yarn-mvn-mapreduce.txt YARN single node install - http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html On Mac OSX 10.7.3, Java 1.6, Protobuf 2.5.0 and hadoop-2.0.5-alpha.tar, mvn clean install -DskipTests succeds after a YARN fix on pom.xml(using 2.5.0 protobuf) But on hadoop-mapreduce-project mvn install fails for tests with below errors $ mvn clean install assembly:assembly -Pnative errors as in atatched yarn-mvn-mapreduce,txt On $mvn clean install assembly:assembly -DskipTests Reactor Summary: [INFO] [INFO] hadoop-mapreduce-client ... SUCCESS [2.410s] [INFO] hadoop-mapreduce-client-core .. SUCCESS [13.781s] [INFO] hadoop-mapreduce-client-common SUCCESS [8.486s] [INFO] hadoop-mapreduce-client-shuffle ... SUCCESS [0.774s] [INFO] hadoop-mapreduce-client-app ... SUCCESS [4.409s] [INFO] hadoop-mapreduce-client-hs SUCCESS [1.618s] [INFO] hadoop-mapreduce-client-jobclient . SUCCESS [4.470s] [INFO] hadoop-mapreduce-client-hs-plugins SUCCESS [0.561s] [INFO] Apache Hadoop MapReduce Examples .. SUCCESS [1.620s] [INFO] hadoop-mapreduce .. FAILURE [10.107s] [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 49.606s [INFO] Finished at: Thu Nov 28 16:20:52 GMT+05:30 2013 [INFO] Final Memory: 34M/118M [INFO] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.3:assembly (default-cli) on project hadoop-mapreduce: Error reading assemblies: No assembly descriptors found. - [Help 1] $mvn package -Pdist -DskipTests=true -Dtar works The documentation needs to be updated for possible issues and resolutions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995511#comment-13995511 ] Tsuyoshi OZAWA commented on YARN-556: - If we can break the compatibility about the container id, I think Anubhav's approach has no problem. If we cannot do this as [~jianhe] mentioned on YARN-2001, I think epoch idea [described here|https://issues.apache.org/jira/browse/YARN-2001?focusedCommentId=13995213page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13995213] might be used. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext
[ https://issues.apache.org/jira/browse/YARN-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-2050: -- Attachment: YARN-2050.patch Mapreduce CLI and yarn CLI pass the configuration to LogCLIHelpers. LogCLIHelpers uses the same configuration to create remoteRootLogDir and remoteAppLogDir, etc. in dumpAllContainersLogs. The fix is to use the same configuration to create FileContext. To follow up on [~jlowe]'s comments, 1. remoteAppLogDir.toUri().getScheme() returns null and AbstractFileSystem.createFileSystem doesn't like it if dumpAllContainersLogs calls FileContext.getFileContext(remoteAppLogDir.toUri()). 2. If the caller of LogCLIHelpers doesn't setConf ahead of time, dumpAllContainersLogs will get a null pointer exception when it tries to get remoteRootLogDir. Fix LogCLIHelpers to create the correct FileContext --- Key: YARN-2050 URL: https://issues.apache.org/jira/browse/YARN-2050 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-2050.patch LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus the FileContext created isn't necessarily the FileContext for the remote log. -- This message was sent by Atlassian JIRA (v6.2#6252)
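A simplified illustration of the fix being described: resolve the remote log root and create the FileContext from the same caller-supplied Configuration. This is a sketch, not the actual LogCLIHelpers code; the class and method names are hypothetical.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RemoteLogDirExample {
  public static void printQualifiedLogRoot(Configuration conf) throws Exception {
    Path remoteRootLogDir = new Path(conf.get(
        YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
        YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR));
    // getFileContext(conf) instead of the no-arg variant, so the context and
    // the log-dir paths resolve against the same default file system.
    FileContext fc = FileContext.getFileContext(conf);
    System.out.println(fc.makeQualified(remoteRootLogDir));
  }
}
{code}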
[jira] [Commented] (YARN-2039) Better reporting of finished containers to AMs
[ https://issues.apache.org/jira/browse/YARN-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995241#comment-13995241 ] Bikas Saha commented on YARN-2039: -- Dupe of YARN-1372? Better reporting of finished containers to AMs -- Key: YARN-2039 URL: https://issues.apache.org/jira/browse/YARN-2039 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Priority: Critical On RM restart, we shouldn't lose information about finished containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995243#comment-13995243 ] Karthik Kambatla commented on YARN-2001: I think the epoch idea might work very nicely with the versioning work we plan to do as part of YARN-667. Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. NMs joined after the threshold can be treated as new NMs and instructed to kill all its containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-766) TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk
[ https://issues.apache.org/jira/browse/YARN-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995402#comment-13995402 ] Hudson commented on YARN-766: - SUCCESS: Integrated in Hadoop-trunk-Commit #5603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5603/]) YARN-766. TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk. (Contributed by Siddharth Seth) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1593660) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java TestNodeManagerShutdown in branch-2 should use Shell to form the output path and a format issue in trunk Key: YARN-766 URL: https://issues.apache.org/jira/browse/YARN-766 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.1.0-beta Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Minor Attachments: YARN-766.branch-2.txt, YARN-766.trunk.txt, YARN-766.txt File scriptFile = new File(tmpDir, scriptFile.sh); should be replaced with File scriptFile = Shell.appendScriptExtension(tmpDir, scriptFile); to match trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995247#comment-13995247 ] Karthik Kambatla commented on YARN-1372: Based on offline discussion with Anubhav, Bikas, Jian and Vinod, the control flow for notifying the AM of finished containers should be as follows: # NM informs RM and holds on to the information (YARN-1336 should handle this as well) # RM informs AM # AM acks RM # RM acks NM # NM deletes the information Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1412) Allocating Containers on a particular Node in Yarn
[ https://issues.apache.org/jira/browse/YARN-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993616#comment-13993616 ] Chris Riccomini commented on YARN-1412: --- Seeing this as well (YARN-2027). Allocating Containers on a particular Node in Yarn -- Key: YARN-1412 URL: https://issues.apache.org/jira/browse/YARN-1412 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: centos, Hadoop 2.2.0 Reporter: gaurav gupta Summary of the problem: If I pass the node on which I want container and set relax locality default which is true, I don't get back the container on the node specified even if the resources are available on the node. It doesn't matter if I set rack or not. Here is the snippet of the code that I am using AMRMClient<ContainerRequest> amRmClient = AMRMClient.createAMRMClient(); String host = "h1"; Resource capability = Records.newRecord(Resource.class); capability.setMemory(memory); nodes = new String[] {host}; // in order to request a host, we also have to request the rack racks = new String[] {"/default-rack"}; List<ContainerRequest> containerRequests = new ArrayList<ContainerRequest>(); List<ContainerId> releasedContainers = new ArrayList<ContainerId>(); containerRequests.add(new ContainerRequest(capability, nodes, racks, Priority.newInstance(priority))); if (containerRequests.size() > 0) { LOG.info("Asking RM for containers: " + containerRequests); for (ContainerRequest cr : containerRequests) { LOG.info("Requested container: {}", cr.toString()); amRmClient.addContainerRequest(cr); } } for (ContainerId containerId : releasedContainers) { LOG.info("Released container, id={}", containerId.getId()); amRmClient.releaseAssignedContainer(containerId); } return amRmClient.allocate(0); -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1936: -- Issue Type: Bug (was: Sub-task) Parent: (was: YARN-1530) Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1936: -- Description: TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2032) Implement a scalable, available TimelineStore using HBase
[ https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995381#comment-13995381 ] Ted Yu commented on YARN-2032: -- {code} + <version>0.98.0-hadoop2</version> {code} 0.98.2 is the latest release for 0.98 Since SingleColumnValueFilter is used, you need to have HBASE-10850 which is in 0.98.2 {code} + protected void serviceInit(Configuration conf) throws Exception { +HBaseAdmin hbase = initHBase(conf); {code} Looks like the HBaseAdmin instance is not closed upon leaving serviceInit(). Implement a scalable, available TimelineStore using HBase - Key: YARN-2032 URL: https://issues.apache.org/jira/browse/YARN-2032 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2032-branch-2-1.patch As discussed on YARN-1530, we should pursue implementing a scalable, available Timeline store using HBase. One goal is to reuse most of the code from the levelDB Based store - YARN-1635. -- This message was sent by Atlassian JIRA (v6.2#6252)
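To illustrate the review comment about closing the admin handle, a minimal sketch; {{initHBase()}} is the hypothetical helper from the patch, and the table-setup body is elided. The point is only that the HBaseAdmin opened in serviceInit() gets closed once setup is done.
{code}
// Sketch only, not the patch itself.
protected void serviceInit(Configuration conf) throws Exception {
  HBaseAdmin hbase = initHBase(conf);   // hypothetical helper from the patch
  try {
    // create/validate the timeline tables here (elided)
  } finally {
    hbase.close();   // release the admin connection even if setup fails
  }
  super.serviceInit(conf);
}
{code}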
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995812#comment-13995812 ] Xuan Gong commented on YARN-1861: - Uploaded a new patch, Explicitly throwing the exception, saying Can not find the active RM, instead of NPE. Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1751) Improve MiniYarnCluster for log aggregation testing
[ https://issues.apache.org/jira/browse/YARN-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-1751: -- Attachment: YARN-1751.patch Thanks, Jason. Here is the patch for MiniYarnCluster. I have opened https://issues.apache.org/jira/browse/YARN-2050 for the LogCLIHelpers issue and will post more comments there. Improve MiniYarnCluster for log aggregation testing --- Key: YARN-1751 URL: https://issues.apache.org/jira/browse/YARN-1751 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1751-trunk.patch, YARN-1751.patch MiniYarnCluster specifies an individual remote log aggregation root dir for each NM. Test code that uses MiniYarnCluster won't be able to get the value of the log aggregation root dir. The following code isn't necessary in MiniYarnCluster. File remoteLogDir = new File(testWorkDir, MiniYARNCluster.this.getName() + "-remoteLogDir-nm-" + index); remoteLogDir.mkdir(); config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR, remoteLogDir.getAbsolutePath()); In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to the FileContext.getFileContext() call. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995851#comment-13995851 ] Zhijie Shen commented on YARN-1936: --- BTW, the patch depends on YARN-2049 for compiling Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1936.1.patch TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995878#comment-13995878 ] Bikas Saha commented on YARN-556: - Folks please take the discussion for container id to its own jira. Spreading it in the main jira will make it harder to track. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995829#comment-13995829 ] Bikas Saha commented on YARN-556: - bq. After the configurable wait-time, the RM starts accepting RPCs from both new AMs and already existing AMs. This is not needed. The AM can be allowed to re-sync after state is recovered from the store. Allocations to the AM may not occur until the threshold elapses. In fact, we want to re-sync the AM's asap so that they dont give up on the RM. bq. Existing AMs are expected to resync with the RM, which essentially translates to register followed by an allocate call We should keep the option open to use a new API called resync that does exactly that. It may help to make this operation atomic RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995536#comment-13995536 ] Tsuyoshi OZAWA commented on YARN-2001: -- Bikas and Karthik, thanks for sharing. I'll check YARN-667. Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain number of nodes, i.e., the RM waits until a certain number of nodes have joined before accepting new container requests. Or it could simply be a timeout; only after the timeout does the RM accept new requests. NMs that join after the threshold can be treated as new NMs and instructed to kill all their containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
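The two threshold variants described in the YARN-2001 summary could be sketched as below. Every name here is hypothetical and only illustrates the idea of gating new scheduling on either a timeout or a node-count; none of these classes exist in YARN.
{code}
// A hypothetical sketch of the proposed thresholds, for illustration only.
public class SchedulingThresholdSketch {
  private final long rmStartTimeMs;
  private final long waitTimeoutMs;        // variant 1: plain timeout
  private final int expectedNodes;         // variant 2: node-count threshold
  private final double minNodeFraction;

  public SchedulingThresholdSketch(long rmStartTimeMs, long waitTimeoutMs,
      int expectedNodes, double minNodeFraction) {
    this.rmStartTimeMs = rmStartTimeMs;
    this.waitTimeoutMs = waitTimeoutMs;
    this.expectedNodes = expectedNodes;
    this.minNodeFraction = minNodeFraction;
  }

  /** True once the RM should start honoring new container requests. */
  public boolean safeToSchedule(int nodesRejoined, long nowMs) {
    boolean timeoutElapsed = nowMs - rmStartTimeMs >= waitTimeoutMs;
    boolean enoughNodes =
        nodesRejoined >= Math.ceil(expectedNodes * minNodeFraction);
    return timeoutElapsed || enoughNodes;
  }
}
{code}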
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995796#comment-13995796 ] Xuan Gong commented on YARN-1861: - bq. Can we make this explicit, instead of being an NPE? Like doing a client call to find the current active RM or something like that? Yes, we can do that. DONE bq. That is what I was thinking, but I am concerned about locking etc. This code has become a little convoluted. Per Xuan, we seem to be safe for now, so maybe look at this separately? Yes. But I will make a note about it. Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995743#comment-13995743 ] Karthik Kambatla commented on YARN-1861: bq. Also, we need to make sure that when automatic failover is enabled, all external interventions like a fence like this bug (and forced-manual failover from CLI?) do a similar reset into the leader election. There may not be cases like this today though. One way to future-proof this is to call resetLeaderElection in ResourceManager#transitionToStandby itself. That looks hacky, but doesn't require new external interventions to explicitly handle it. [~vinodkv] - do you think that would be a better approach? Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
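A rough sketch of the approach floated above (calling resetLeaderElection from transitionToStandby when automatic failover is enabled) is shown below. The elector interface and class here are simplified stand-ins invented for illustration, not the actual ResourceManager or EmbeddedElectorService code.
{code}
// Simplified stand-in for the idea; not the real ResourceManager code.
public class TransitionToStandbySketch {
  interface Elector { void resetLeaderElection(); }

  private final Elector elector;
  private final boolean automaticFailoverEnabled;

  TransitionToStandbySketch(Elector elector, boolean automaticFailoverEnabled) {
    this.elector = elector;
    this.automaticFailoverEnabled = automaticFailoverEnabled;
  }

  synchronized void transitionToStandby(boolean initialize) {
    // ... stop active services, reset the RM context, etc. ...
    if (automaticFailoverEnabled) {
      // Re-enter leader election so an externally forced standby (fencing,
      // manual failover) cannot leave both RMs permanently in standby.
      elector.resetLeaderElection();
    }
  }
}
{code}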
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995760#comment-13995760 ] Vinod Kumar Vavilapalli commented on YARN-1861: --- bq. Without the core code change, this testcase will fail. Because NM is trying to connect the active RM, but neither of two RMs are active. So, the NPE is expected. Can we make this explicit, instead of being an NPE? Like doing a client call to find the current active RM or something like that? Tx for the explanation of all the cases, Xuan. bq. That looks hacky, but doesn't require new external interventions to explicitly handle it. Vinod Kumar Vavilapalli - do you think that would be a better approach? That is what I was thinking, but I am concerned about locking etc. This code has become a little convoluted. Per Xuan, we seem to be safe for now, so maybe look at this separately? Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1227) Update Single Cluster doc to use yarn.resourcemanager.hostname
[ https://issues.apache.org/jira/browse/YARN-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA resolved YARN-1227. - Resolution: Invalid Closing this issue. Feel free to reopen if you disagree. Update Single Cluster doc to use yarn.resourcemanager.hostname -- Key: YARN-1227 URL: https://issues.apache.org/jira/browse/YARN-1227 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.1.0-beta Reporter: Sandy Ryza Assignee: Ray Chiang Labels: newbie Now that yarn.resourcemanager.hostname can be used in place of yarn.resourcemanager.address, yarn.resourcemanager.scheduler.address, etc., we should update the doc to use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1936: -- Issue Type: Sub-task (was: Bug) Parent: YARN-1935 Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2012) Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar reassigned YARN-2012: Assignee: Ashwin Shankar Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute - Key: YARN-2012 URL: https://issues.apache.org/jira/browse/YARN-2012 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt Currently the 'default' rule in the queue placement policy, if applied, puts the app in the root.default queue. It would be great if we could make the 'default' rule optionally point to a different queue as the default queue. This queue should be an existing queue; if not, we fall back to the root.default queue, hence keeping this rule terminal. This default queue can be a leaf queue, or it can also be a parent queue if the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext
Ming Ma created YARN-2050: - Summary: Fix LogCLIHelpers to create the correct FileContext Key: YARN-2050 URL: https://issues.apache.org/jira/browse/YARN-2050 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Assignee: Ming Ma LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus the FileContext created isn't necessarily the FileContext for remote log. -- This message was sent by Atlassian JIRA (v6.2#6252)
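To illustrate the direction of the YARN-2050 fix, the sketch below contrasts the parameterless FileContext.getFileContext() with a call that respects the caller's Configuration. The listing loop and class name are made-up stand-ins for illustration, not the actual LogCLIHelpers code.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class RemoteLogFileContextSketch {
  public static void listRemoteAppDir(Configuration conf, Path remoteAppLogDir)
      throws Exception {
    // Problem: FileContext.getFileContext() ignores the caller's conf, so it
    // may resolve against the wrong default filesystem for the remote log dir.
    // Fix direction: hand it the conf (or the path's URI) explicitly.
    FileContext fc = FileContext.getFileContext(remoteAppLogDir.toUri(), conf);
    RemoteIterator<FileStatus> it = fc.listStatus(remoteAppLogDir);
    while (it.hasNext()) {
      System.out.println(it.next().getPath());
    }
  }
}
{code}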
[jira] [Commented] (YARN-2050) Fix LogCLIHelpers to create the correct FileContext
[ https://issues.apache.org/jira/browse/YARN-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995780#comment-13995780 ] Hadoop QA commented on YARN-2050: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644494/YARN-2050.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3739//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3739//console This message is automatically generated. Fix LogCLIHelpers to create the correct FileContext --- Key: YARN-2050 URL: https://issues.apache.org/jira/browse/YARN-2050 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-2050.patch LogCLIHelpers calls FileContext.getFileContext() without any parameters. Thus the FileContext created isn't necessarily the FileContext for remote log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995781#comment-13995781 ] Karthik Kambatla commented on YARN-1861: bq. That is what I was thinking, but I am concerned about locking etc. This code has become a little convoluted. Agree. I did consider going that route, but was worried about the maintainability. Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995569#comment-13995569 ] Hadoop QA commented on YARN-1515: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644314/YARN-1515.v07.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3734//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3734//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3734//console This message is automatically generated. Ability to dump the container threads and stop the containers in a single RPC - Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: Sub-task Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995782#comment-13995782 ] Wangda Tan commented on YARN-2001: -- [~ozawa], as Bikas said, we should keep at least some early-reported containers from being killed, via a threshold that gives NMs time to resync. This is why we do work-preserving restart. Threshold for RM to accept requests from AM after failover -- Key: YARN-2001 URL: https://issues.apache.org/jira/browse/YARN-2001 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain number of nodes, i.e., the RM waits until a certain number of nodes have joined before accepting new container requests. Or it could simply be a timeout; only after the timeout does the RM accept new requests. NMs that join after the threshold can be treated as new NMs and instructed to kill all their containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1969) Fair Scheduler: Add policy for Earliest Deadline First
[ https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995993#comment-13995993 ] Maysam Yabandeh commented on YARN-1969: --- Thanks for the comment [~kasha]. I think this is a good point to distinguish between the terms deadline and endtime. deadline would be the user-specified SLA and, as you correctly mentioned, in many cases it is quite likely to be missed due to failures, limited resources, etc. Still, the user can express the level of urgency by the desired deadline, but they could also do that via priorities, so the user-specified deadline would be a complementary (and perhaps more expressive) way for users to specify the priorities of their jobs. endtime, on the other hand, is the estimated end time of the job based on the current progress and assuming that the RM will give the rest of the required resources immediately. endtime is automatically computed by the AppMaster and there is no need for user involvement. When scheduling resources, the advantage of taking endtime into consideration is that giant jobs that are close to being finished could be prioritized. We generally want such jobs to finish sooner since (i) they would release the resources they have occupied, such as the disk space for the mappers' output, and (ii) a large job is more susceptible to failures, and the longer it hangs around, the greater the likelihood of it being affected by the loss of a mapper node. The added subtasks are based on the agenda of (i) estimating the end time, (ii) sending it over to the RM, and (iii) letting the RM take it into consideration. We can also extend the API to allow the users to specify their desired deadline. As for how the RM takes the specified deadline or estimated endtime into consideration, I think once we have the endtime field available in the RM, there will be many new opportunities to take advantage of it. One way, as you mentioned, is to translate them into weights to be used by the current fair scheduler. Any other scheduling algorithm, including EDF, can also be plugged in and do the scheduling based on a function of the endtime and other variables. The other variables could include the size of the job, as discussed above. Fair Scheduler: Add policy for Earliest Deadline First -- Key: YARN-1969 URL: https://issues.apache.org/jira/browse/YARN-1969 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh What we are observing is that some big jobs with many allocated containers are waiting for a few containers to finish. Under *fair-share scheduling*, however, they have a low priority since there are other jobs (usually much smaller newcomers) that are using resources way below their fair share; hence newly released containers are not offered to the big, yet close-to-be-finished job. Nevertheless, everybody would benefit from an unfair scheduling that offers the resources to the big job, since the sooner the big job finishes, the sooner it releases its many allocated resources to be used by other jobs. In other words, what we require is a kind of variation of *Earliest Deadline First scheduling* that takes into account the number of already-allocated resources and the estimated time to finish. http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling For example, if a job is using MEM GB of memory and is expected to finish in TIME minutes, the priority in scheduling would be a function p of (MEM, TIME). 
The expected time to finish can be estimated by the AppMaster using TaskRuntimeEstimator#estimatedRuntime and supplied to the RM in the resource request messages. To be less susceptible to apps gaming the system, we can limit this scheduling to *only within a queue*: i.e., add an EarliestDeadlinePolicy that extends SchedulingPolicy and let the queues use it by setting the schedulingPolicy field. -- This message was sent by Atlassian JIRA (v6.2#6252)
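To make the p(MEM, TIME) idea above concrete, here is a toy comparator an EDF-style policy might use. AppSnapshot and its fields are invented for illustration; the real FairScheduler SchedulingPolicy API is not shown here.
{code}
import java.util.Comparator;

// Toy sketch only: favors apps holding lots of memory that are nearly done.
public class EarliestEndtimeFirstSketch {
  static final class AppSnapshot {
    final long allocatedMemoryMb;     // MEM: resources currently held
    final long estimatedRemainingMs;  // TIME: AM-estimated time to finish
    AppSnapshot(long allocatedMemoryMb, long estimatedRemainingMs) {
      this.allocatedMemoryMb = allocatedMemoryMb;
      this.estimatedRemainingMs = estimatedRemainingMs;
    }
    // p(MEM, TIME): the more memory held and the closer to finishing,
    // the sooner the app should be offered containers.
    double urgency() {
      return (double) allocatedMemoryMb / Math.max(1, estimatedRemainingMs);
    }
  }

  static final Comparator<AppSnapshot> EARLIEST_ENDTIME_FIRST =
      new Comparator<AppSnapshot>() {
        @Override
        public int compare(AppSnapshot a, AppSnapshot b) {
          // Higher urgency schedules first.
          return Double.compare(b.urgency(), a.urgency());
        }
      };
}
{code}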
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995925#comment-13995925 ] Junping Du commented on YARN-2016: -- Thanks [~jianhe] for review and comments! Filed YARN-2051 to address more tests for PBImpl. Yarn getApplicationRequest start time range is not honored -- Key: YARN-2016 URL: https://issues.apache.org/jira/browse/YARN-2016 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Venkat Ranganathan Assignee: Junping Du Fix For: 2.4.1 Attachments: YARN-2016.patch, YarnTest.java When we query for the previous applications by creating an instance of GetApplicationsRequest and setting the start time range and application tag, we see that the start range provided is not honored and all applications with the tag are returned. Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
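A minimal sketch of the kind of query described in YARN-2016 (not the attached YarnTest.java reproducer). It assumes the setStartRange(long, long) and setApplicationTags(Set) setters on GetApplicationsRequest; the tag value is made up.
{code}
import java.util.Collections;
import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationsRequest;

public class GetApplicationsRangeSketch {
  public static GetApplicationsRequest buildRequest(long submittedAfterMs,
      long submittedBeforeMs) {
    GetApplicationsRequest request = GetApplicationsRequest.newInstance();
    request.setApplicationTags(Collections.singleton("my-workflow-tag"));
    // Before the fix, this range was dropped when the request was converted
    // to protobuf, so the RM returned every application carrying the tag.
    request.setStartRange(submittedAfterMs, submittedBeforeMs);
    return request;
  }
}
{code}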
[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored
[ https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995981#comment-13995981 ] Hudson commented on YARN-2016: -- FAILURE: Integrated in Hadoop-trunk-Commit #5604 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5604/]) YARN-2016. Fix a bug in GetApplicationsRequestPBImpl to add the missed fields to proto. Contributed by Junping Du (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1594085) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/GetApplicationsRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestGetApplicationsRequest.java Yarn getApplicationRequest start time range is not honored -- Key: YARN-2016 URL: https://issues.apache.org/jira/browse/YARN-2016 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Venkat Ranganathan Assignee: Junping Du Fix For: 2.4.1 Attachments: YARN-2016.patch, YarnTest.java When we query for the previous applications by creating an instance of GetApplicationsRequest and setting the start time range and application tag, we see that the start range provided is not honored and all applications with the tag are returned Attaching a reproducer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2049) Delegation token stuff for the timeline server
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2049: -- Attachment: YARN-2049.1.patch In this patch, I implemented the delegation token service over HTTP by leveraging the hadoop-auth modules, closely following the design of the delegation token service of HttpFS. 1. Make the TimelineDelegationTokenIdentifier and secretManager as usual. 2. Extend KerberosAuthenticationFilter and KerberosAuthenticationHandler to accept authentication based on either the Kerberos principal or the delegation token. 3. Extend KerberosAuthenticator to encapsulate DT-based communication, and add the APIs to get/renew/cancel the DT. 4. Modify the web stack to enable SPNEGO for the timeline server, and make the secret manager service callable from the filter. 5. Fix the test cases accordingly. This patch only compiles on top of YARN-1938 and HADOOP-10596. Delegation token stuff for the timeline server - Key: YARN-2049 URL: https://issues.apache.org/jira/browse/YARN-2049 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2049.1.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
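For context, the hadoop-auth SPNEGO client flow that the patch extends looks roughly like the sketch below. It does not show the new get/renew/cancel delegation-token APIs added by the patch, and the timeline URL path is a placeholder.
{code}
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.security.authentication.client.AuthenticatedURL;
import org.apache.hadoop.security.authentication.client.KerberosAuthenticator;

public class TimelineSpnegoClientSketch {
  public static int ping(String timelineWebAppUrl) throws Exception {
    URL url = new URL(timelineWebAppUrl + "/ws/v1/timeline");
    AuthenticatedURL.Token token = new AuthenticatedURL.Token();
    // KerberosAuthenticator performs the SPNEGO handshake and fills in the
    // signed auth token that later requests (or a DT exchange) can reuse.
    HttpURLConnection conn =
        new AuthenticatedURL(new KerberosAuthenticator()).openConnection(url, token);
    return conn.getResponseCode();
  }
}
{code}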
[jira] [Updated] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1861: Attachment: YARN-1861.7.patch Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Karthik Kambatla Priority: Blocker Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1936) Secured timeline client
[ https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995919#comment-13995919 ] Hadoop QA commented on YARN-1936: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12644522/YARN-1936.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3742//console This message is automatically generated. Secured timeline client --- Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1936.1.patch TimelineClient should be able to talk to the timeline server with kerberos authentication or delegation token -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1751) Improve MiniYarnCluster for log aggregation testing
[ https://issues.apache.org/jira/browse/YARN-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-1751: -- Summary: Improve MiniYarnCluster for log aggregation testing (was: Improve MiniYarnCluster and LogCLIHelpers for log aggregation testing) Improve MiniYarnCluster for log aggregation testing --- Key: YARN-1751 URL: https://issues.apache.org/jira/browse/YARN-1751 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Ming Ma Assignee: Ming Ma Attachments: YARN-1751-trunk.patch MiniYarnCluster specifies an individual remote log aggregation root dir for each NM. Test code that uses MiniYarnCluster won't be able to get the value of the log aggregation root dir. The following code isn't necessary in MiniYarnCluster. File remoteLogDir = new File(testWorkDir, MiniYARNCluster.this.getName() + "-remoteLogDir-nm-" + index); remoteLogDir.mkdir(); config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR, remoteLogDir.getAbsolutePath()); In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to the FileContext.getFileContext() call. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1937) Access control of per-framework data
[ https://issues.apache.org/jira/browse/YARN-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1937: -- Issue Type: Bug (was: Sub-task) Parent: (was: YARN-1530) Access control of per-framework data Key: YARN-1937 URL: https://issues.apache.org/jira/browse/YARN-1937 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)