[jira] [Updated] (YARN-1874) Cleanup: Move RMActiveServices out of ResourceManager into its own file
[ https://issues.apache.org/jira/browse/YARN-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1874: - Attachment: YARN-1874.2.patch > Cleanup: Move RMActiveServices out of ResourceManager into its own file > --- > > Key: YARN-1874 > URL: https://issues.apache.org/jira/browse/YARN-1874 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Karthik Kambatla >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1874.1.patch, YARN-1874.2.patch > > > As [~vinodkv] noticed on YARN-1867, ResourceManager is hard to maintain. We > should move RMActiveServices out to make it more manageable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2022: -- Description: Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Job J3 will get killed, including its AM. It is better if the AM can be given the least priority among multiple applications. In this same scenario, map tasks from J3 and J4 can be preempted. Later, when the cluster is free, maps can be allocated to these jobs. was: Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priroity among multiple applications. In this same scenario, map tasks from J3 and J4 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. > Preempting an Application Master container can be kept as least priority when > multiple applications are marked for preemption by > ProportionalCapacityPreemptionPolicy > - > > Key: YARN-2022 > URL: https://issues.apache.org/jira/browse/YARN-2022 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Sunil G >Assignee: Sunil G > > Cluster Size = 16GB [2NM's] > Queue A Capacity = 50% > Queue B Capacity = 50% > Consider there are 3 applications running in Queue A which has taken the full > cluster capacity. > J1 = 2GB AM + 1GB * 4 Maps > J2 = 2GB AM + 1GB * 4 Maps > J3 = 2GB AM + 1GB * 2 Maps > Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. > Currently in this scenario, Job J3 will get killed, including its AM. > It is better if the AM can be given the least priority among multiple applications. > In this same scenario, map tasks from J3 and J4 can be preempted. > Later, when the cluster is free, maps can be allocated to these jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
Sunil G created YARN-2022: - Summary: Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Job J3 will get killed, including its AM. It is better if the AM can be given the least priority among multiple applications. In this same scenario, map tasks from J3 and J4 can be preempted. Later, when the cluster is free, maps can be allocated to these jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
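To make the proposed ordering concrete, here is a minimal, hypothetical sketch (not code from any YARN patch) of sorting preemption candidates so that AM containers are only selected after all task containers. The Candidate class and its fields are invented for illustration.
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: order preemption candidates so AM containers go last.
class PreemptionOrderingSketch {
  static class Candidate {
    final String containerId;
    final boolean isAM;      // true for an Application Master container
    final long startTime;    // newer containers are cheaper to preempt
    Candidate(String id, boolean isAM, long startTime) {
      this.containerId = id;
      this.isAM = isAM;
      this.startTime = startTime;
    }
  }

  // Non-AM containers sort first (false orders before true); within each
  // group, prefer preempting the most recently started container.
  static List<Candidate> orderForPreemption(List<Candidate> candidates) {
    List<Candidate> sorted = new ArrayList<>(candidates);
    sorted.sort(Comparator.comparing((Candidate c) -> c.isAM)
        .thenComparing(c -> -c.startTime));
    return sorted;
  }
}
{code}
With this ordering, in the scenario above the policy would preempt map containers before ever selecting J3's AM.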
[jira] [Assigned] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-2003: - Assignee: Sunil G > Support to process Job priority from Submission Context in > AppAttemptAddedSchedulerEvent [RM side] > -- > > Key: YARN-2003 > URL: https://issues.apache.org/jira/browse/YARN-2003 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > > AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from > Submission Context and store it. > Later this can be used by the Scheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990304#comment-13990304 ] Hadoop QA commented on YARN-1857: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12643410/YARN-1857.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3697//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3697//console This message is automatically generated. > CapacityScheduler headroom doesn't account for other AM's running > - > > Key: YARN-1857 > URL: https://issues.apache.org/jira/browse/YARN-1857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Chen He > Attachments: YARN-1857.patch, YARN-1857.patch > > > It's possible to get an application to hang forever (or a long time) in a > cluster with multiple users. The reason is that the headroom sent to the > application is based on the user limit, but it doesn't account for other > Application Masters using space in that queue. So the headroom (user limit - > user consumed) can be > 0 even though the cluster is 100% full, because the > other space is being used by application masters from other users. > For instance, if you have a cluster with 1 queue, user limit is 100%, and you have > multiple users submitting applications: one very large application by user 1 > starts up, runs most of its maps and starts running reducers. Other users try > to start applications and get their application masters started, but no > tasks. The very large application then gets to the point where it has > consumed the rest of the cluster resources with all reduces, but at this > point it still needs to finish a few maps. The headroom being sent to this > application is only based on the user limit (which is 100% of the cluster > capacity): it's using, let's say, 95% of the cluster for reduces, and the other 5% > is being used by other users running application masters. The MRAppMaster > thinks it still has 5%, so it doesn't know that it should kill a reduce in > order to run a map. > This can happen in other scenarios also. Generally in a large cluster with > multiple queues this shouldn't cause a hang forever, but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
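The arithmetic behind the hang fits in a tiny sketch. This is not the CapacityScheduler code; it is a hypothetical illustration of capping the user-limit headroom by what is actually unallocated, which is the behavior the report asks for.
{code}
// Hypothetical sketch of the headroom fix: never report more headroom than
// the space that is actually free, regardless of the per-user limit.
class HeadroomSketch {
  static int headroom(int userLimit, int userConsumed,
                      int clusterTotal, int clusterAllocated) {
    int byUserLimit = Math.max(0, userLimit - userConsumed);
    // Space held by other users' AMs shows up here as allocated resources.
    int actuallyFree = Math.max(0, clusterTotal - clusterAllocated);
    return Math.min(byUserLimit, actuallyFree);
  }

  public static void main(String[] args) {
    // Numbers from the scenario above: user limit 100, user consumed 95,
    // other users' AMs hold the remaining 5, so the cluster is 100% full.
    System.out.println(headroom(100, 95, 100, 100)); // prints 0, not 5
  }
}
{code}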
[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2001: -- Description: After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain number of nodes, i.e. RM waits until a certain number of nodes have joined before accepting new container requests. Or it could simply be a timeout; only after the timeout does the RM accept new requests. NMs that join after the threshold can be treated as new NMs and instructed to kill all their containers. was:After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain amount of nodes. i.e. RM waits until a certain amount of nodes joining before accepting new container requests. Or it could simply be a timeout, only after the timeout RM accepts new requests. > Threshold for RM to accept requests from AM after failover > -- > > Key: YARN-2001 > URL: https://issues.apache.org/jira/browse/YARN-2001 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > After failover, RM may require a certain threshold to determine whether it’s > safe to make scheduling decisions and start accepting new container requests > from AMs. The threshold could be a certain number of nodes, i.e. RM waits > until a certain number of nodes have joined before accepting new container > requests. Or it could simply be a timeout; only after the timeout does the RM accept > new requests. > NMs that join after the threshold can be treated as new NMs and instructed to > kill all their containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
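A sketch of what the two proposed thresholds could look like in RM code. Both property names are invented for illustration; they are not existing YARN configuration keys.
{code}
import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch of the node-count and timeout thresholds proposed above.
class SchedulingGateSketch {
  private final int minNodes;
  private final long waitMs;
  private final long becameActiveAt = System.currentTimeMillis();

  SchedulingGateSketch(Configuration conf) {
    // Invented property names, for illustration only.
    minNodes = conf.getInt("yarn.resourcemanager.recovery.min-registered-nodes", 0);
    waitMs = conf.getLong("yarn.resourcemanager.recovery.scheduling-wait-ms", 0L);
  }

  // Accept new container requests once enough NMs have re-registered, or
  // once the timeout since becoming active has elapsed, whichever is first.
  boolean acceptAllocateRequests(int registeredNodes) {
    return registeredNodes >= minNodes
        || System.currentTimeMillis() - becameActiveAt >= waitMs;
  }
}
{code}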
[jira] [Assigned] (YARN-1986) After upgrade from 2.2.0 to 2.4.0, NPE on first job start.
[ https://issues.apache.org/jira/browse/YARN-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo reassigned YARN-1986: - Assignee: Hong Zhiguo > After upgrade from 2.2.0 to 2.4.0, NPE on first job start. > -- > > Key: YARN-1986 > URL: https://issues.apache.org/jira/browse/YARN-1986 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Jon Bringhurst >Assignee: Hong Zhiguo > > After upgrade from 2.2.0 to 2.4.0, NPE on first job start. > After RM was restarted, the job runs without a problem. > {noformat} > 19:11:13,441 FATAL ResourceManager:600 - Error in handling event type > NODE_UPDATE to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:462) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:714) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:743) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:104) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591) > at java.lang.Thread.run(Thread.java:744) > 19:11:13,443 INFO ResourceManager:604 - Exiting, bbye.. > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2001: -- Description: After failover, RM may require a certain threshold to determine whether it’s safe to make scheduling decisions and start accepting new container requests from AMs. The threshold could be a certain number of nodes, i.e. RM waits until a certain number of nodes have joined before accepting new container requests. Or it could simply be a timeout; only after the timeout does the RM accept new requests. (was: RM may not accept allocate requests from AMs until all the NMs have re-synced back with RM. This is to eliminate some race conditions like containerIds overlapping between ) > Threshold for RM to accept requests from AM after failover > -- > > Key: YARN-2001 > URL: https://issues.apache.org/jira/browse/YARN-2001 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > After failover, RM may require a certain threshold to determine whether it’s > safe to make scheduling decisions and start accepting new container requests > from AMs. The threshold could be a certain number of nodes, i.e. RM waits > until a certain number of nodes have joined before accepting new container > requests. Or it could simply be a timeout; only after the timeout does the RM accept > new requests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2001: -- Description: RM may not accept allocate requests from AMs until all the NMs have re-synced back with RM. This is to eliminate some race conditions like containerIds overlapping between was: RM should not accept allocate requests from AMs until all the NMs have registered with RM. For that, RM needs to remember the previous NMs and wait for all the NMs to register. This is also useful for remembering decommissioned nodes across restarts. > Threshold for RM to accept requests from AM after failover > -- > > Key: YARN-2001 > URL: https://issues.apache.org/jira/browse/YARN-2001 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > RM may not accept allocate requests from AMs until all the NMs have re-synced > back with RM. This is to eliminate some race conditions like containerIds > overlapping between -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990254#comment-13990254 ] Jian He commented on YARN-2001: --- bq. Then node1 comes back to RM, RM recovers all containers on node1. On second thought, this can be changed to not recover those containers and instead kill them to meet the resource limit. That's another decision to make. > Threshold for RM to accept requests from AM after failover > -- > > Key: YARN-2001 > URL: https://issues.apache.org/jira/browse/YARN-2001 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > RM should not accept allocate requests from AMs until all the NMs have > registered with RM. For that, RM needs to remember the previous NMs and wait > for all the NMs to register. > This is also useful for remembering decommissioned nodes across restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2001: -- Summary: Threshold for RM to accept requests from AM after failover (was: Persist NMs info for RM restart) > Threshold for RM to accept requests from AM after failover > -- > > Key: YARN-2001 > URL: https://issues.apache.org/jira/browse/YARN-2001 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > RM should not accept allocate requests from AMs until all the NMs have > registered with RM. For that, RM needs to remember the previous NMs and wait > for all the NMs to register. > This is also useful for remembering decommissioned nodes across restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Persist NMs info for RM restart
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990252#comment-13990252 ] Jian He commented on YARN-2001: --- Consider a simple case where an application is granted 50% of the cluster resource. The cluster has 2 nodes. The application used up all its resource quota and launched all containers on node1. RM fails over and node2 first re-syncs back with RM. Since node2 has no containers running for this application, AM asks for more containers and RM will think this AM hasn’t used any resources and will grant it more resources on node2. Then node1 comes back to RM, RM recovers all containers on node1. The application ends up with more than its 50% resource limit. Another example: the RM needs to generate new container IDs for the new containers requested by the AM. If RM accepts new requests from the AM before nodes sync back, the new container IDs may overlap with the IDs of the recovered containers. > Persist NMs info for RM restart > --- > > Key: YARN-2001 > URL: https://issues.apache.org/jira/browse/YARN-2001 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > RM should not accept allocate requests from AMs until all the NMs have > registered with RM. For that, RM needs to remember the previous NMs and wait > for all the NMs to register. > This is also useful for remembering decommissioned nodes across restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Persist NMs info for RM restart
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990241#comment-13990241 ] Karthik Kambatla commented on YARN-2001: bq. we may run into condition like the resource usage, capacity limit (e.g. headroom, queue capacity etc. ) in scheduler is not yet correct until all the nodes sync back all the running containers belong to the app, applications/queues can potentially go beyond its limit. My understanding has been that the RM's scheduler starts from scratch on restart/failover and rebuilds its state as nodes heartbeat. At any point in time, the cluster's resources correspond only to the NMs that have registered with the "new" RM. IOW, this should be no different from a new cluster. Given this, I am not sure how the scheduler can have incorrect information. > Persist NMs info for RM restart > --- > > Key: YARN-2001 > URL: https://issues.apache.org/jira/browse/YARN-2001 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > RM should not accept allocate requests from AMs until all the NMs have > registered with RM. For that, RM needs to remember the previous NMs and wait > for all the NMs to register. > This is also useful for remembering decommissioned nodes across restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1708) Add a public API to reserve resources (part of YARN-1051)
[ https://issues.apache.org/jira/browse/YARN-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990239#comment-13990239 ] Subramaniam Krishnan commented on YARN-1708: Attaching the patch > Add a public API to reserve resources (part of YARN-1051) > - > > Key: YARN-1708 > URL: https://issues.apache.org/jira/browse/YARN-1708 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Carlo Curino >Assignee: Subramaniam Krishnan > Attachments: YARN-1708.patch > > > This JIRA tracks the definition of a new public API for YARN, which allows > users to reserve resources (think of time-bounded queues). This is part of > the admission control enhancement proposed in YARN-1051. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1708) Add a public API to reserve resources (part of YARN-1051)
[ https://issues.apache.org/jira/browse/YARN-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1708: --- Attachment: YARN-1708.patch > Add a public API to reserve resources (part of YARN-1051) > - > > Key: YARN-1708 > URL: https://issues.apache.org/jira/browse/YARN-1708 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Carlo Curino >Assignee: Subramaniam Krishnan > Attachments: YARN-1708.patch > > > This JIRA tracks the definition of a new public API for YARN, which allows > users to reserve resources (think of time-bounded queues). This is part of > the admission control enhancement proposed in YARN-1051. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1708) Add a public API to reserve resources (part of YARN-1051)
[ https://issues.apache.org/jira/browse/YARN-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990240#comment-13990240 ] Carlo Curino commented on YARN-1708: The attached patch represents a proposal for an extension of YARN's APIs. This is the externally visible portion of the umbrella JIRA YARN-1051, and provides users with the opportunity to create/update/delete time-varying resource reservations within a queue (if the queue allows it). Reservations are expressed leveraging existing ResourceRequest objects and extending them with temporal semantics (e.g., I need 1h of 20 containers of size <2GB,1vcore> some time between 2pm and 6pm). We also allow users to express minimum concurrency constraints, and dependencies among different stages of a pipeline. The reservationID token obtained by the user during the reservation process is passed during application submission, and instructs the RM to use the reserved resources to satisfy this application's needs. The patch posted here is not submitted, since it depends on many other patches that are part of the umbrella JIRA; the separation is only for ease of review. A broader discussion of this idea, and some experimental results, are provided in the tech-report attached to the umbrella JIRA YARN-1051. We have a complete solution backing this API, which we are testing/hardening, and will be posting the rest of it in the upcoming days/weeks. > Add a public API to reserve resources (part of YARN-1051) > - > > Key: YARN-1708 > URL: https://issues.apache.org/jira/browse/YARN-1708 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Carlo Curino >Assignee: Subramaniam Krishnan > Attachments: YARN-1708.patch > > > This JIRA tracks the definition of a new public API for YARN, which allows > users to reserve resources (think of time-bounded queues). This is part of > the admission control enhancement proposed in YARN-1051. -- This message was sent by Atlassian JIRA (v6.2#6252)
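A sketch of how the flow described above might look to a client. Every name here is an illustrative guess; the actual classes and signatures are defined by the patch and its follow-ups, which are not shown in this thread.
{code}
// Hypothetical client-side sketch of the reservation flow described above.
class ReservationFlowSketch {
  interface ReservationClient {
    // "I need 1h of 20 containers of <2GB, 1 vcore> between 2pm and 6pm."
    String submitReservation(long arrivalMs, long deadlineMs, long durationMs,
                             int numContainers, int memoryMb, int vcores);
  }

  static void run(ReservationClient client) {
    long twoPm = 1399298400000L;   // illustrative epoch millis
    long sixPm = 1399312800000L;
    String reservationId =
        client.submitReservation(twoPm, sixPm, 3600000L, 20, 2048, 1);
    // The reservation ID is later set on the ApplicationSubmissionContext,
    // instructing the RM to satisfy the app from the reserved resources.
    System.out.println("Submit application against " + reservationId);
  }
}
{code}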
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990233#comment-13990233 ] Wangda Tan commented on YARN-1368: -- Hi [~jianhe], thanks for this patch. I agree with the major strategies, but I have some comments and questions. In AbstractYarnScheduler:recoverContainersOnNode:
{code}
+      if (rmApp.getApplicationSubmissionContext().getUnmanagedAM()) {
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("Skip recovering container " + status
+              + " for unmanaged AM." + rmApp.getApplicationId());
+        }
+        continue;
+      }
{code}
Why don't we recover containers in the unmanaged AM case? In my understanding, no matter whether the AM is managed or unmanaged, the recovery process should be the same. Is there any difference between them?
Should this be included in schedulerAttempt.recoverContainer(...)?
{code}
+    // recover app scheduling info
+    schedulerAttempt.appSchedulingInfo.recoverContainer(rmContainer);
{code}
In AppSchedulingInfo.recoverContainer(...):
{code}
+    QueueMetrics metrics = queue.getMetrics();
+    if (pending) {
+      // If there was any running containers, the application was
+      // running from scheduler's POV.
+      pending = false;
+      metrics.runAppAttempt(applicationId, user);
+    }
+    if (rmContainer.getState().equals(RMContainerState.COMPLETED)) {
+      return;
+    }
+    metrics.allocateResources(user, 1, Resource.newInstance(1024, 1), false);
{code}
Should this be a part of queue.recoverContainer(...)? Is it better to create QueueMetrics.recoverContainer(...)?
In CapacityScheduler:
{code}
-    Collection<FiCaSchedulerNode> nodes = cs.getAllNodes().values();
+    Collection<SchedulerNode> nodes = cs.getAllNodes().values();
{code}
Could you elaborate on why this, and the series of changes between SchedulerNode and FiCaSchedulerNode, is needed? I don't really understand it.
For recoverContainer in a queue, should we do it top-down (recover from the root queue) or bottom-up (recover from the leaf queues)? I found that in the patch it's bottom-up; should this be decided by the scheduler implementation? > Common work to re-populate containers’ state into scheduler > --- > > Key: YARN-1368 > URL: https://issues.apache.org/jira/browse/YARN-1368 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1368.1.patch, YARN-1368.preliminary.patch > > > YARN-1367 adds support for the NM to tell the RM about all currently running > containers upon registration. The RM needs to send this information to the > schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover > the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
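For the QueueMetrics.recoverContainer(...) idea raised in the review, a minimal hedged sketch of what such a helper might do, assuming the same bookkeeping as the quoted snippet. It is not part of the posted patch.
{code}
// Hypothetical sketch of a QueueMetrics.recoverContainer(...) helper that
// re-applies allocation metrics for one recovered container.
class QueueMetricsSketch {
  private int allocatedContainers;
  private long allocatedMb;
  private int allocatedVcores;

  // Completed containers contribute nothing; live containers are re-counted
  // exactly as if they had just been allocated.
  void recoverContainer(boolean completed, long memoryMb, int vcores) {
    if (completed) {
      return;
    }
    allocatedContainers++;
    allocatedMb += memoryMb;
    allocatedVcores += vcores;
  }
}
{code}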
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990221#comment-13990221 ] Rohith commented on YARN-2010: -- Thank you [~kasha] for reviewing the patch. I updated the patch to continue on recovery failure for finished applications. Correct me if I am wrong: continuing on recovery failure for a running application may cause the application to hang, so we need to consider the final state of the application. > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Rohith >Priority: Critical > Attachments: YARN-2010.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ...
5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ... 13 more > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Persist NMs info for RM restart
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990220#comment-13990220 ] Bikas Saha commented on YARN-2001: -- What if users want to have multiple standbys for fault tolerance? In a large cluster of 1000s of nodes there could be 3-4 distinct fault domains where more than 1 standby may be good to guarantee availability. Until now, in the design we have not restricted the number of standbys. Having all NMs ping all RMs will cause a lot of communication overhead in a healthy cluster. The design already encompasses NMs discovering and syncing with the new active RM. So that is not the problem. The problem is restart during an upgrade, where it may be common that a bunch of NMs don't come back up. The RM needs to be resilient to that while maintaining availability. Having a threshold of NMs sounds like a reasonable solution. The threshold can be calculated based on the scheduling margin of error wrt queue capacity. At this point my suggestion would be to clarify the problem being addressed in this jira. Is the problem that after RM failover, the new RM needs to have a certain minimum number of machines join it before it can safely make scheduling decisions? If that's the case, then please update the title to reflect that problem and not the solution. > Persist NMs info for RM restart > --- > > Key: YARN-2001 > URL: https://issues.apache.org/jira/browse/YARN-2001 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > RM should not accept allocate requests from AMs until all the NMs have > registered with RM. For that, RM needs to remember the previous NMs and wait > for all the NMs to register. > This is also useful for remembering decommissioned nodes across restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Persist NMs info for RM restart
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990215#comment-13990215 ] Ming Ma commented on YARN-2001: --- 1. In the HA setup, could we make the standby RM hot by having NMs send heartbeats to all RMs? NMs would ignore the heartbeat response's commands from standby RMs. That way, the new active will have the most recent NM state right after the failover. 2. Decommission handling. If decommission state can be reconstructed via include and exclude files, maybe we can ask admins to update include and exclude files on all RM nodes during the decommission process. > Persist NMs info for RM restart > --- > > Key: YARN-2001 > URL: https://issues.apache.org/jira/browse/YARN-2001 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > RM should not accept allocate requests from AMs until all the NMs have > registered with RM. For that, RM needs to remember the previous NMs and wait > for all the NMs to register. > This is also useful for remembering decommissioned nodes across restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2020) observeOnly should be checked before any preemption computation started inside containerBasedPreemptOrKill() of ProportionalCapacityPreemptionPolicy.java
[ https://issues.apache.org/jira/browse/YARN-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990210#comment-13990210 ] yeqi commented on YARN-2020: Carlo Curino, thanks for your explanation. You are right, observeOnly is useful for debugging, and it's meaningless to change it as I proposed. I will close this ticket. > observeOnly should be checked before any preemption computation started > inside containerBasedPreemptOrKill() of > ProportionalCapacityPreemptionPolicy.java > - > > Key: YARN-2020 > URL: https://issues.apache.org/jira/browse/YARN-2020 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.4.0 > Environment: all >Reporter: yeqi >Priority: Trivial > Fix For: 2.5.0 > > Attachments: YARN-2020.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > observeOnly should be checked in the very beginning of > ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(), so as > to avoid unnecessary work. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2020) observeOnly should be checked before any preemption computation started inside containerBasedPreemptOrKill() of ProportionalCapacityPreemptionPolicy.java
[ https://issues.apache.org/jira/browse/YARN-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990203#comment-13990203 ] Carlo Curino commented on YARN-2020: I might be missing your point, but it seems to me that observeOnly is used to compute the ideal allocation, and log it without affecting the actual scheduler allocation... This is useful for debugging, and for operators to gain insight into what would happen if they turn on preemption in their cluster, before actually doing so. By hoisting the observeOnly check as you propose, you prevent the computation and logging from happening; you are right that this will save some computation, but it also makes the invocation pointless altogether. The effect you desire can be obtained by turning off the preemption policy entirely. Is there anything else I am missing? > observeOnly should be checked before any preemption computation started > inside containerBasedPreemptOrKill() of > ProportionalCapacityPreemptionPolicy.java > - > > Key: YARN-2020 > URL: https://issues.apache.org/jira/browse/YARN-2020 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.4.0 > Environment: all >Reporter: yeqi >Priority: Trivial > Fix For: 2.5.0 > > Attachments: YARN-2020.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > observeOnly should be checked in the very beginning of > ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(), so as > to avoid unnecessary work. -- This message was sent by Atlassian JIRA (v6.2#6252)
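The control flow Carlo describes can be summarized in a short sketch (not the actual ProportionalCapacityPreemptionPolicy code): the ideal-allocation computation and logging run in either mode, and observeOnly only gates the destructive step.
{code}
// Hypothetical sketch of the observeOnly control flow discussed above.
class ObserveOnlySketch {
  private final boolean observeOnly;

  ObserveOnlySketch(boolean observeOnly) {
    this.observeOnly = observeOnly;
  }

  void containerBasedPreemptOrKill() {
    // Runs in both modes: this is what makes observe-only mode informative.
    String plan = computeIdealAllocation();
    System.out.println("Ideal allocation: " + plan);
    if (observeOnly) {
      return; // log-only mode: never preempt or kill anything
    }
    preemptOrKill(plan);
  }

  private String computeIdealAllocation() {
    return "queueA=50%, queueB=50%"; // placeholder for the real computation
  }

  private void preemptOrKill(String plan) {
    // act on the plan: mark containers for preemption, then kill
  }
}
{code}
Hoisting the observeOnly check above computeIdealAllocation() would save the computation, but it would also suppress the log line that makes the mode useful, which is the point of the comment above.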
[jira] [Commented] (YARN-2021) Allow AM to set failed final status
[ https://issues.apache.org/jira/browse/YARN-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990196#comment-13990196 ] David Chen commented on YARN-2021: -- I would like to learn more about YARN and would like to pick this up. > Allow AM to set failed final status > --- > > Key: YARN-2021 > URL: https://issues.apache.org/jira/browse/YARN-2021 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jakob Homan > > Background: SAMZA-117. It would be good if an AM were able to signal via its > final status that the job itself has failed, even if the AM itself has finished up > in a tidy fashion. It would be good if either (a) the AM can signal a final > status of failed and exit cleanly, or (b) we had another status, say, > Application Failed, to indicate that the AM itself gave up. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1868) YARN status web ui does not show correctly in IE 11
[ https://issues.apache.org/jira/browse/YARN-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990145#comment-13990145 ] Vinod Kumar Vavilapalli commented on YARN-1868: --- Seems reasonable. From what I understand, this header is only interpreted by IE. Right? Can you leave a code comment as to why the header is set and add a test? > YARN status web ui does not show correctly in IE 11 > --- > > Key: YARN-1868 > URL: https://issues.apache.org/jira/browse/YARN-1868 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 3.0.0 >Reporter: Chuan Liu >Assignee: Chuan Liu > Attachments: YARN-1868.patch, YARN_status.png > > > The YARN status web ui does not show correctly in IE 11. The drop down menu > for app entries are not shown. Also the navigation menu displays incorrectly. -- This message was sent by Atlassian JIRA (v6.2#6252)
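For context, the kind of change under discussion likely looks like the sketch below: setting an X-UA-Compatible response header, which only Internet Explorer interprets. The exact header value is an assumption here, not confirmed by this thread.
{code}
import javax.servlet.http.HttpServletResponse;

// Hypothetical sketch of the IE-only rendering-mode header discussed above.
class IeCompatSketch {
  static void addIeCompatHeader(HttpServletResponse response) {
    // Only Internet Explorer honors this header; other browsers ignore it.
    // The value "IE=8" is an assumption for illustration.
    response.setHeader("X-UA-Compatible", "IE=8");
  }
}
{code}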
[jira] [Commented] (YARN-2001) Persist NMs info for RM restart
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990120#comment-13990120 ] Jian He commented on YARN-2001: --- If the RM starts accepting application requests before NMs sync back, for example, we may run into a condition where the resource usage and capacity limits (e.g. headroom, queue capacity, etc.) in the scheduler are not yet correct until all the nodes sync back all the running containers belonging to the app; applications/queues can potentially go beyond their limits. It would definitely be good if we can think of a way to not make the RM wait without hitting race conditions. > Persist NMs info for RM restart > --- > > Key: YARN-2001 > URL: https://issues.apache.org/jira/browse/YARN-2001 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > RM should not accept allocate requests from AMs until all the NMs have > registered with RM. For that, RM needs to remember the previous NMs and wait > for all the NMs to register. > This is also useful for remembering decommissioned nodes across restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge common code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990110#comment-13990110 ] Jian He commented on YARN-2017: --- Yup. Put common code in an abstract class that the specific node classes can extend to implement their own logic. > Merge common code in schedulers > --- > > Key: YARN-2017 > URL: https://issues.apache.org/jira/browse/YARN-2017 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > A bunch of the same code is repeated among schedulers, e.g. between > FiCaSchedulerNode and FSSchedulerNode. It would be good to merge and share it in a > common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
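A hedged sketch of the abstract-base-class idea, using invented names; the eventual refactoring may carve up the code differently.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Hypothetical sketch: shared node bookkeeping written once, with
// scheduler-specific behavior left to FiCa/FS subclasses.
abstract class AbstractSchedulerNodeSketch {
  private final Resource available;
  private final Resource used;

  protected AbstractSchedulerNodeSketch(Resource total) {
    this.available = Resources.clone(total);
    this.used = Resource.newInstance(0, 0);
  }

  // Common to both FiCaSchedulerNode and FSSchedulerNode today.
  protected void allocate(Resource r) {
    Resources.subtractFrom(available, r);
    Resources.addTo(used, r);
  }

  // Scheduler-specific logic stays in the subclass.
  public abstract void reserveResource(Resource r);
}
{code}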
[jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990108#comment-13990108 ] Karthik Kambatla commented on YARN-2019: [~djp] - any particular ideas on how this should behave? > Retrospect on decision of making RM crashed if any exception throw in > ZKRMStateStore > > > Key: YARN-2019 > URL: https://issues.apache.org/jira/browse/YARN-2019 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Priority: Critical > Labels: ha > Attachments: YARN-2019.1-wip.patch > > > Currently, if anything abnormal happens in ZKRMStateStore, it will throw a fatal > exception to crash the RM. As shown in YARN-1924, it could be due to an RM HA > internal bug itself, not a fatal condition. We should revisit this > decision, as the HA feature is designed to protect a key component, not > disturb it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge common code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990104#comment-13990104 ] Karthik Kambatla commented on YARN-2017: s/too/two/ > Merge common code in schedulers > --- > > Key: YARN-2017 > URL: https://issues.apache.org/jira/browse/YARN-2017 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > A bunch of the same code is repeated among schedulers, e.g. between > FiCaSchedulerNode and FSSchedulerNode. It would be good to merge and share it in a > common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge common code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990100#comment-13990100 ] Karthik Kambatla commented on YARN-2017: Instead of removing these too, can we make them extend AbstractSchedulerNode so we can store any other information that is scheduler-specific? > Merge common code in schedulers > --- > > Key: YARN-2017 > URL: https://issues.apache.org/jira/browse/YARN-2017 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > A bunch of the same code is repeated among schedulers, e.g. between > FiCaSchedulerNode and FSSchedulerNode. It would be good to merge and share it in a > common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2001) Persist NMs info for RM restart
[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990091#comment-13990091 ] Karthik Kambatla commented on YARN-2001: I am not quite sure if this is a good idea. The RM should start doing what it can do irrespective of the NMs coming up - accepting applications, serving information etc. > Persist NMs info for RM restart > --- > > Key: YARN-2001 > URL: https://issues.apache.org/jira/browse/YARN-2001 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > RM should not accept allocate requests from AMs until all the NMs have > registered with RM. For that, RM needs to remember the previous NMs and wait > for all the NMs to register. > This is also useful for remembering decommissioned nodes across restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990092#comment-13990092 ] Tsuyoshi OZAWA commented on YARN-1474: -- Yes, it's OK :-) > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.2.patch, YARN-1474.3.patch, > YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, > YARN-1474.8.patch, YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990088#comment-13990088 ] Karthik Kambatla commented on YARN-1474: Just got back and catching up on a number of things. Is it okay if I take a look later this week? > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.2.patch, YARN-1474.3.patch, > YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, > YARN-1474.8.patch, YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
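For readers following along, a minimal sketch of what fitting a scheduler into the YARN service model means: the reinitialize-style setup moves into the AbstractService lifecycle hooks. Illustrative only, not the patch itself.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

// Hypothetical sketch of a scheduler as a YARN service.
class SchedulerAsServiceSketch extends AbstractService {
  SchedulerAsServiceSketch() {
    super(SchedulerAsServiceSketch.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // one-time configuration formerly done in reinitialize()
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    // start background threads (e.g., a continuous-scheduling thread)
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    // stop threads and release resources
    super.serviceStop();
  }
}
{code}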
[jira] [Commented] (YARN-1987) Wrapper for leveldb DBIterator to aid in handling database exceptions
[ https://issues.apache.org/jira/browse/YARN-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990077#comment-13990077 ] Karthik Kambatla commented on YARN-1987: Looks good to me, except for one nit - LevelDBIterator is marked Public. We should probably add Evolving to say it is not completely stable yet. > Wrapper for leveldb DBIterator to aid in handling database exceptions > - > > Key: YARN-1987 > URL: https://issues.apache.org/jira/browse/YARN-1987 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1987.patch > > > Per discussions in YARN-1984 and MAPREDUCE-5652, it would be nice to have a > utility wrapper around leveldb's DBIterator to translate the raw > RuntimeExceptions it can throw into DBExceptions to make it easier to handle > database errors while iterating. -- This message was sent by Atlassian JIRA (v6.2#6252)
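For readers of this thread, a sketch of the wrapper pattern being reviewed: each DBIterator call is funneled through a catch that rethrows raw RuntimeExceptions as DBException. Method coverage here is partial and the class name is illustrative.
{code}
import java.util.Map;
import org.iq80.leveldb.DBException;
import org.iq80.leveldb.DBIterator;

// Hypothetical sketch of a DBIterator wrapper translating raw
// RuntimeExceptions into DBException for uniform error handling.
class LeveldbIteratorSketch {
  private final DBIterator iter;

  LeveldbIteratorSketch(DBIterator iter) {
    this.iter = iter;
  }

  public boolean hasNext() throws DBException {
    try {
      return iter.hasNext();
    } catch (DBException e) {
      throw e; // already the type callers expect
    } catch (RuntimeException e) {
      throw new DBException(e.getMessage(), e);
    }
  }

  public Map.Entry<byte[], byte[]> next() throws DBException {
    try {
      return iter.next();
    } catch (DBException e) {
      throw e;
    } catch (RuntimeException e) {
      throw new DBException(e.getMessage(), e);
    }
  }
}
{code}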
[jira] [Created] (YARN-2021) Allow AM to set failed final status
Jakob Homan created YARN-2021: - Summary: Allow AM to set failed final status Key: YARN-2021 URL: https://issues.apache.org/jira/browse/YARN-2021 Project: Hadoop YARN Issue Type: Improvement Reporter: Jakob Homan Background: SAMZA-117. It would be good if an AM were able to signal via its final status that the job itself has failed, even if the AM itself has finished up in a tidy fashion. It would be good if either (a) the AM can signal a final status of failed and exit cleanly, or (b) we had another status, say, Application Failed, to indicate that the AM itself gave up. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990050#comment-13990050 ] Tsuyoshi OZAWA commented on YARN-2019: -- This means that all RMs can terminate when ZK cannot be accessed from the RMs. If we should retry until ZK comes up, one solution is to handle STATE_STORE_OP_FAILED in RMFatalEventDispatcher and go into standby state. Please see the attached patch. > Retrospect on decision of making RM crashed if any exception throw in > ZKRMStateStore > > > Key: YARN-2019 > URL: https://issues.apache.org/jira/browse/YARN-2019 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Priority: Critical > Labels: ha > Attachments: YARN-2019.1-wip.patch > > > Currently, if anything abnormal happens in ZKRMStateStore, it will throw a fatal > exception to crash the RM. As shown in YARN-1924, it could be due to an RM HA > internal bug itself, not a fatal condition. We should revisit this > decision, as the HA feature is designed to protect a key component, not > disturb it. -- This message was sent by Atlassian JIRA (v6.2#6252)
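The proposal above amounts to treating STATE_STORE_OP_FAILED like STATE_STORE_FENCED when HA is enabled, as in this hedged decision sketch (the enum mirrors the event types named in this thread; the actual WIP patch may differ).
{code}
// Hypothetical sketch of the standby-instead-of-terminate decision.
class FatalEventPolicySketch {
  enum EventType { STATE_STORE_FENCED, STATE_STORE_OP_FAILED, OTHER }

  // Returns true when the RM should transition to standby rather than exit.
  static boolean shouldGoStandby(EventType type, boolean haEnabled) {
    return haEnabled
        && (type == EventType.STATE_STORE_FENCED
            || type == EventType.STATE_STORE_OP_FAILED);
  }
}
{code}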
[jira] [Updated] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2019: - Attachment: YARN-2019.1-wip.patch > Retrospect on decision of making RM crashed if any exception throw in > ZKRMStateStore > > > Key: YARN-2019 > URL: https://issues.apache.org/jira/browse/YARN-2019 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Priority: Critical > Labels: ha > Attachments: YARN-2019.1-wip.patch > > > Currently, if anything abnormal happens in ZKRMStateStore, it will throw a fatal > exception to crash the RM. As shown in YARN-1924, it could be due to an RM HA > internal bug itself, not a fatal condition. We should revisit this > decision, as the HA feature is designed to protect a key component, not > disturb it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990037#comment-13990037 ] Tsuyoshi OZAWA commented on YARN-2019: -- RMStateStore handles the exceptions in ZKRMStateStore like this:
{code}
try {
  // ZK related operations
  removeRMDTMasterKeyState(delegationKey);
} catch (Exception e) {
  notifyStoreOperationFailed(e);
}
{code}
If it's fenced, RMFatalEventDispatcher handles the exceptions and the RM goes into standby state. However, if STATE_STORE_OP_FAILED occurs, the active RM terminates. After failing over to the standby RM, the exception could repeat on the new active RM. Maybe this is the case [~djp] mentioned. Please correct me if I am wrong.
{code}
@Private
public static class RMFatalEventDispatcher implements
    EventHandler<RMFatalEvent> {

  @Override
  public void handle(RMFatalEvent event) {
    LOG.fatal("Received a " + RMFatalEvent.class.getName() + " of type " +
        event.getType().name() + ". Cause:\n" + event.getCause());

    if (event.getType() == RMFatalEventType.STATE_STORE_FENCED) {
      LOG.info("RMStateStore has been fenced");
      if (rmContext.isHAEnabled()) {
        try {
          // Transition to standby and reinit active services
          LOG.info("Transitioning RM to Standby mode");
          rm.transitionToStandby(true);
          return;
        } catch (Exception e) {
          LOG.fatal("Failed to transition RM to Standby mode.");
        }
      }
    }
    ExitUtil.terminate(1, event.getCause());
  }
}
{code}
> Retrospect on decision of making RM crashed if any exception throw in > ZKRMStateStore > > > Key: YARN-2019 > URL: https://issues.apache.org/jira/browse/YARN-2019 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Priority: Critical > Labels: ha > > Currently, if anything abnormal happens in ZKRMStateStore, it will throw a fatal > exception to crash the RM. As shown in YARN-1924, it could be due to an RM HA > internal bug itself, not a fatal condition. We should revisit this > decision, as the HA feature is designed to protect a key component, not > disturb it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990026#comment-13990026 ] Karthik Kambatla commented on YARN-2010: I do think this is a critical issue, but don't believe it is essentially a blocker for 2.4.1. In terms of the fix, I believe the fix shouldn't be security-specific. We should probably add a recovery-specific config to say it is okay to continue with starting the RM even if we fail to recover some applications. > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Rohith >Priority: Critical > Attachments: YARN-2010.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 
5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ... 13 more > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
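A sketch of the recovery-specific config suggested above. The property name and placement are invented for illustration; nothing in this thread fixes them.
{code}
import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch: optionally skip unrecoverable apps instead of
// failing the whole transition to active.
class AppRecoveryPolicySketch {
  // Invented property name, for illustration only.
  static final String CONTINUE_ON_APP_RECOVERY_FAILURE =
      "yarn.resourcemanager.recovery.continue-on-app-failure";

  static void recoverApplication(Configuration conf, Runnable recover) {
    try {
      recover.run();
    } catch (RuntimeException e) {
      if (!conf.getBoolean(CONTINUE_ON_APP_RECOVERY_FAILURE, false)) {
        throw e; // current behavior: a single bad app stops the RM
      }
      // Otherwise log and move on so the RM can still become active.
      System.err.println("Skipping unrecoverable app: " + e);
    }
  }
}
{code}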
[jira] [Updated] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2010: --- Priority: Critical (was: Major) Target Version/s: 2.5.0 > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Rohith >Priority: Critical > Attachments: YARN-2010.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 
8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ... 13 more > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1701) Improve default paths of timeline store and generic history store
[ https://issues.apache.org/jira/browse/YARN-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990009#comment-13990009 ] Tsuyoshi OZAWA commented on YARN-1701: -- Sure. > Improve default paths of timeline store and generic history store > - > > Key: YARN-1701 > URL: https://issues.apache.org/jira/browse/YARN-1701 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.0 >Reporter: Gera Shegalov >Assignee: Gera Shegalov >Priority: Blocker > Attachments: YARN-1701.3.patch, YARN-1701.v01.patch, > YARN-1701.v02.patch > > > When I enable AHS via yarn.ahs.enabled, the app history is still not visible > in the AHS webUI. This is due to NullApplicationHistoryStore as > yarn.resourcemanager.history-writer.class. It would be good to have just one > key to enable basic functionality. > yarn.ahs.fs-history-store.uri uses {code}${hadoop.log.dir}{code}, which is > a local file system location. However, FileSystemApplicationHistoryStore uses > DFS by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2020) observeOnly should be checked before any preemption computation started inside containerBasedPreemptOrKill() of ProportionalCapacityPreemptionPolicy.java
[ https://issues.apache.org/jira/browse/YARN-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989928#comment-13989928 ] Tsuyoshi OZAWA commented on YARN-2020: -- [~yeqi], thank you for taking this JIRA. Your patch doesn't include path information. The {{git diff --no-prefix}} command can help you create it. Thanks. > observeOnly should be checked before any preemption computation started > inside containerBasedPreemptOrKill() of > ProportionalCapacityPreemptionPolicy.java > - > > Key: YARN-2020 > URL: https://issues.apache.org/jira/browse/YARN-2020 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.4.0 > Environment: all >Reporter: yeqi >Priority: Trivial > Fix For: 2.5.0 > > Attachments: YARN-2020.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > observeOnly should be checked at the very beginning of > ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(), so as > to avoid unnecessary work. -- This message was sent by Atlassian JIRA (v6.2#6252)
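To make the proposal concrete, here is a minimal sketch of the early exit; the class shape is illustrative and is not the attached patch.

{code}
// Illustrative sketch of the proposed early exit, not the attached patch.
class PreemptionPolicySketch {
  private final boolean observeOnly;

  PreemptionPolicySketch(boolean observeOnly) {
    this.observeOnly = observeOnly;
  }

  void containerBasedPreemptOrKill() {
    if (observeOnly) {
      // Bail out before computing ideal allocations or selecting candidate
      // containers: none of that work matters in observe-only mode.
      return;
    }
    // ... the existing preemption computation would follow here ...
  }
}
{code}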
[jira] [Commented] (YARN-1906) TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2
[ https://issues.apache.org/jira/browse/YARN-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989915#comment-13989915 ] Mit Desai commented on YARN-1906: - [~zjshen] and [~wangda], Totally agree. We should add a message to the assertion as part of the fix (a sketch follows this message). [~ashwinshankar77]: Thanks for pointing this out. I also got the failure on this line last time. Additionally, I ran the test a couple of times and found that it was failing randomly on any of the asserts. Still trying to figure out where the race is. > TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and > branch2 > --- > > Key: YARN-1906 > URL: https://issues.apache.org/jira/browse/YARN-1906 > Project: Hadoop YARN > Issue Type: Test >Affects Versions: 2.4.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-1906.patch, YARN-1906.patch > > > Here is the output of the failure: > {noformat} > testQueueMetricsOnRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) > Time elapsed: 9.757 sec <<< FAILURE! > java.lang.AssertionError: expected:<2> but was:<1> > at org.junit.Assert.fail(Assert.java:93) > at org.junit.Assert.failNotEquals(Assert.java:647) > at org.junit.Assert.assertEquals(Assert.java:128) > at org.junit.Assert.assertEquals(Assert.java:472) > at org.junit.Assert.assertEquals(Assert.java:456) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.assertQueueMetrics(TestRMRestart.java:1735) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart(TestRMRestart.java:1706) > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
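One concrete way to act on the agreement above is to give each assertion in {{assertQueueMetrics}} a descriptive message, so an intermittent failure reports which metric raced rather than a bare {{expected:<2> but was:<1>}}. The metric names below are illustrative, not the actual test code.

{code}
import static org.junit.Assert.assertEquals;

// Sketch: messages make the flaky assertion self-describing. The accessor
// interface is a stand-in for the real QueueMetrics object.
class QueueMetricsAsserts {
  interface Metrics {
    int appsSubmitted(); int appsPending(); int appsRunning(); int appsCompleted();
  }

  static void assertQueueMetrics(Metrics m, int submitted, int pending,
                                 int running, int completed) {
    assertEquals("appsSubmitted", submitted, m.appsSubmitted());
    assertEquals("appsPending", pending, m.appsPending());
    assertEquals("appsRunning", running, m.appsRunning());
    assertEquals("appsCompleted", completed, m.appsCompleted());
  }
}
{code}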
[jira] [Commented] (YARN-1864) Fair Scheduler Dynamic Hierarchical User Queues
[ https://issues.apache.org/jira/browse/YARN-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989888#comment-13989888 ] Ashwin Shankar commented on YARN-1864: -- FYI, there is some work going on to fix these two test failures: YARN-1906 and YARN-2018. > Fair Scheduler Dynamic Hierarchical User Queues > --- > > Key: YARN-1864 > URL: https://issues.apache.org/jira/browse/YARN-1864 > Project: Hadoop YARN > Issue Type: New Feature > Components: scheduler >Reporter: Ashwin Shankar > Labels: scheduler > Attachments: YARN-1864-v1.txt, YARN-1864-v2.txt, YARN-1864-v3.txt, > YARN-1864-v4.txt, YARN-1864-v5.txt > > > In Fair Scheduler, we want to be able to create user queues under any parent > queue in the hierarchy. For example, say user1 submits a job to a parent queue > called root.allUserQueues; we want to be able to create a new queue called > root.allUserQueues.user1 and run user1's job in it. Any further jobs submitted > by this user to root.allUserQueues will be run in this newly created > root.allUserQueues.user1. > This is very similar to the 'user-as-default' feature in Fair Scheduler, which > creates user queues under the root queue. But we want the ability to create user > queues under ANY parent queue. > Why do we want this? > 1. Preemption: these dynamically created user queues can preempt each other > if their fair share is not met, so there is fairness among users. > User queues can also preempt other non-user leaf queues if below their fair share. > 2. Allocation to user queues: we want all the user queries (ad hoc) to consume > only a fraction of resources in the shared cluster. With this feature, we could > do that by giving a fair share to the parent user queue > which is then redistributed to all the dynamically created user queues. -- This message was sent by Atlassian JIRA (v6.2#6252)
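At its core, the placement rule described above derives a per-user leaf queue under whatever parent the job was submitted to; a toy sketch, not the FairScheduler implementation:

{code}
// Toy sketch of the proposed placement rule, not FairScheduler code.
class UserQueuePlacement {
  // e.g. ("root.allUserQueues", "user1") -> "root.allUserQueues.user1"
  static String resolveUserQueue(String submittedQueue, String user) {
    return submittedQueue + "." + user;
  }
}
{code}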
[jira] [Commented] (YARN-1906) TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2
[ https://issues.apache.org/jira/browse/YARN-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989883#comment-13989883 ] Ashwin Shankar commented on YARN-1906: -- Hey [~mitdesai], I encountered this issue in my pre-commit build, but it seemed to have happened at a different place in this test. Here is the link: https://builds.apache.org/job/PreCommit-YARN-Build/3686//testReport/org.apache.hadoop.yarn.server.resourcemanager/TestRMRestart/testQueueMetricsOnRMRestart/ > TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and > branch2 > --- > > Key: YARN-1906 > URL: https://issues.apache.org/jira/browse/YARN-1906 > Project: Hadoop YARN > Issue Type: Test >Affects Versions: 2.4.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-1906.patch, YARN-1906.patch > > > Here is the output of the failure: > {noformat} > testQueueMetricsOnRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) > Time elapsed: 9.757 sec <<< FAILURE! > java.lang.AssertionError: expected:<2> but was:<1> > at org.junit.Assert.fail(Assert.java:93) > at org.junit.Assert.failNotEquals(Assert.java:647) > at org.junit.Assert.assertEquals(Assert.java:128) > at org.junit.Assert.assertEquals(Assert.java:472) > at org.junit.Assert.assertEquals(Assert.java:456) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.assertQueueMetrics(TestRMRestart.java:1735) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart(TestRMRestart.java:1706) > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1368: -- Attachment: YARN-1368.1.patch Uploaded a new patch. - AbstractYarnScheduler#recoverContainersOnNode() does the majority of the recovery work, recovering RMContainer, SchedulerNode, Queue, SchedulerApplicationAttempt, and appSchedulingInfo accordingly. - ResourceTrackerService#handleContainerStatus is not needed anymore; that’s handled in the common recovery flow. - Changed RMAppRecoveredTransition to add the current attempt to the scheduler. - Changed a few RMAppAttempt transitions to capture the completed containers that are recovered. - Some modifications in CapacityScheduler to avoid sending unnecessary app_accepted/attempt_added events to the recovered apps/attempts. Todo: - Replace the containerStatus sent across via NM registration with a new object which captures the resource capability of the container. - FSQueue needs to implement its own recoverContainer method > Common work to re-populate containers’ state into scheduler > --- > > Key: YARN-1368 > URL: https://issues.apache.org/jira/browse/YARN-1368 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1368.1.patch, YARN-1368.preliminary.patch > > > YARN-1367 adds support for the NM to tell the RM about all currently running > containers upon registration. The RM needs to send this information to the > schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover > the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
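A rough shape of the recovery flow described in the first bullet above; every type here is a simplified placeholder, not an actual YARN class.

{code}
import java.util.List;

// Placeholder types only -- a sketch of recoverContainersOnNode()'s flow.
class NodeRecoverySketch {
  interface Container {}
  interface Node { void recoverContainer(Container c); }
  interface Queue { void recoverContainer(Container c); }
  interface Attempt { void recoverContainer(Container c); }

  void recoverContainersOnNode(List<Container> reported, Node node,
                               Queue queue, Attempt attempt) {
    for (Container c : reported) {
      node.recoverContainer(c);    // re-charge resources on the node
      queue.recoverContainer(c);   // restore queue usage accounting
      attempt.recoverContainer(c); // re-attach to the attempt's scheduling info
    }
  }
}
{code}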
[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1857: -- Attachment: YARN-1857.patch > CapacityScheduler headroom doesn't account for other AM's running > - > > Key: YARN-1857 > URL: https://issues.apache.org/jira/browse/YARN-1857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Chen He > Attachments: YARN-1857.patch, YARN-1857.patch > > > It's possible to get an application to hang forever (or a long time) in a > cluster with multiple users. The reason is that the headroom sent to the > application is based on the user limit but doesn't account for other > Application Masters using space in that queue. So the headroom (user limit - > user consumed) can be > 0 even though the cluster is 100% full, because the > other space is being used by application masters from other users. > For instance, if you have a cluster with 1 queue, the user limit is 100%, and you have > multiple users submitting applications. One very large application by user 1 > starts up, runs most of its maps and starts running reducers. Other users try > to start applications and get their application masters started but not > tasks. The very large application then gets to the point where it has > consumed the rest of the cluster resources with all reduces. But at this > point it still needs to finish a few maps. The headroom being sent to this > application is only based on the user limit (which is 100% of the cluster > capacity). It's using, let's say, 95% of the cluster for reduces, and the other 5% > is being used by other users running application masters. The MRAppMaster > thinks it still has 5%, so it doesn't know that it should kill a reduce in > order to run a map. > This can happen in other scenarios also. Generally in a large cluster with > multiple queues this shouldn't cause a hang forever, but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
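The bug reduces to the headroom formula ignoring AM usage by other users. The arithmetic below uses the numbers from the scenario in the description; it is an illustration, not scheduler code.

{code}
// Illustrative arithmetic using the scenario above (values in % of cluster).
public class HeadroomSketch {
  public static void main(String[] args) {
    long clusterCapacity = 100; // whole cluster
    long userLimit = 100;       // user limit: 100% of the queue
    long userConsumed = 95;     // the large app's maps + reduces
    long otherAmUsage = 5;      // other users' ApplicationMaster containers

    long naiveHeadroom = userLimit - userConsumed;                   // 5
    long actualFree = clusterCapacity - userConsumed - otherAmUsage; // 0
    long saneHeadroom = Math.min(naiveHeadroom, actualFree);         // 0

    // With the naive value the MRAppMaster believes 5% is still available,
    // so it never kills a reduce to run the remaining maps -> possible hang.
    System.out.println("naive=" + naiveHeadroom + " sane=" + saneHeadroom);
  }
}
{code}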
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989865#comment-13989865 ] Chen He commented on YARN-1857: --- This failure is related to YARN-1906. > CapacityScheduler headroom doesn't account for other AM's running > - > > Key: YARN-1857 > URL: https://issues.apache.org/jira/browse/YARN-1857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Chen He > Attachments: YARN-1857.patch > > > It's possible to get an application to hang forever (or a long time) in a > cluster with multiple users. The reason is that the headroom sent to the > application is based on the user limit but doesn't account for other > Application Masters using space in that queue. So the headroom (user limit - > user consumed) can be > 0 even though the cluster is 100% full, because the > other space is being used by application masters from other users. > For instance, if you have a cluster with 1 queue, the user limit is 100%, and you have > multiple users submitting applications. One very large application by user 1 > starts up, runs most of its maps and starts running reducers. Other users try > to start applications and get their application masters started but not > tasks. The very large application then gets to the point where it has > consumed the rest of the cluster resources with all reduces. But at this > point it still needs to finish a few maps. The headroom being sent to this > application is only based on the user limit (which is 100% of the cluster > capacity). It's using, let's say, 95% of the cluster for reduces, and the other 5% > is being used by other users running application masters. The MRAppMaster > thinks it still has 5%, so it doesn't know that it should kill a reduce in > order to run a map. > This can happen in other scenarios also. Generally in a large cluster with > multiple queues this shouldn't cause a hang forever, but it could cause the > application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1896) For FairScheduler expose MinimumQueueResource of each queue in QueueMetrics
[ https://issues.apache.org/jira/browse/YARN-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-1896: -- Attachment: YARN-1896.v2.patch > For FairScheduler expose MinimumQueueResource of each queue in QueueMetrics > --- > > Key: YARN-1896 > URL: https://issues.apache.org/jira/browse/YARN-1896 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siqi Li > Attachments: YARN-1896.v1.patch, YARN-1896.v2.patch > > > For FairScheduler, it's very useful to expose MinimumQueueResource and > MaximumQueueResource of each queue in QueueMetrics. Therefore, people can use > monitoring graphs to see their current usage and their limits. -- This message was sent by Atlassian JIRA (v6.2#6252)
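The exposure could look like a pair of gauges registered on the queue's metrics object. The snippet below uses Hadoop's metrics2 library, but the gauge names and the update hook are assumptions, not the attached patch.

{code}
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableGaugeLong;

// Sketch only: gauge names and the update hook are assumptions.
class FSQueueMetricsSketch {
  private final MetricsRegistry registry = new MetricsRegistry("FSQueueMetrics");
  private final MutableGaugeLong minShareMB =
      registry.newGauge("MinShareMB", "Configured min share in MB", 0L);
  private final MutableGaugeLong maxShareMB =
      registry.newGauge("MaxShareMB", "Configured max share in MB", 0L);

  // Would be called whenever the allocation file is (re)loaded.
  void setShares(long minMB, long maxMB) {
    minShareMB.set(minMB);
    maxShareMB.set(maxMB);
  }
}
{code}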
[jira] [Commented] (YARN-1805) Signal container request delivery from resourcemanager to nodemanager
[ https://issues.apache.org/jira/browse/YARN-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989672#comment-13989672 ] Hadoop QA commented on YARN-1805: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12643371/YARN-1805.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3695//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3695//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3695//console This message is automatically generated. > Signal container request delivery from resourcemanager to nodemanager > - > > Key: YARN-1805 > URL: https://issues.apache.org/jira/browse/YARN-1805 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Ming Ma >Assignee: Ming Ma > Attachments: YARN-1805.patch > > > 1. Update ResourceTracker's HeartbeatResponse to include the list of > SignalContainerRequest. > 2. Upon receiving the request, NM's NodeStatusUpdater will deliver the > request to ContainerManager. -- This message was sent by Atlassian JIRA (v6.2#6252)
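The two steps in the YARN-1805 description sketch out as below; every type is a simplified placeholder for the real protobuf records and NM services, not the attached patch.

{code}
import java.util.List;

// Placeholder types; a sketch of the RM -> NM signal delivery path.
class SignalDeliverySketch {
  interface SignalContainerRequest { String containerId(); String signal(); }
  // Step 1: the heartbeat response carries the RM-issued signal requests.
  interface HeartbeatResponse { List<SignalContainerRequest> signalRequests(); }
  interface ContainerManager { void signal(SignalContainerRequest req); }

  // Step 2: NodeStatusUpdater forwards each request to the ContainerManager.
  void onHeartbeatResponse(HeartbeatResponse response, ContainerManager cm) {
    for (SignalContainerRequest req : response.signalRequests()) {
      cm.signal(req);
    }
  }
}
{code}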
[jira] [Commented] (YARN-2020) observeOnly should be checked before any preemption computation started inside containerBasedPreemptOrKill() of ProportionalCapacityPreemptionPolicy.java
[ https://issues.apache.org/jira/browse/YARN-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989643#comment-13989643 ] Hadoop QA commented on YARN-2020: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12643377/YARN-2020.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3696//console This message is automatically generated. > observeOnly should be checked before any preemption computation started > inside containerBasedPreemptOrKill() of > ProportionalCapacityPreemptionPolicy.java > - > > Key: YARN-2020 > URL: https://issues.apache.org/jira/browse/YARN-2020 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.4.0 > Environment: all >Reporter: yeqi >Priority: Trivial > Fix For: 2.5.0 > > Attachments: YARN-2020.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > observeOnly should be checked at the very beginning of > ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(), so as > to avoid unnecessary work. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2020) observeOnly should be checked before any preemption computation started inside containerBasedPreemptOrKill() of ProportionalCapacityPreemptionPolicy.java
[ https://issues.apache.org/jira/browse/YARN-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yeqi updated YARN-2020: --- Attachment: YARN-2020.patch Patch submitted. > observeOnly should be checked before any preemption computation started > inside containerBasedPreemptOrKill() of > ProportionalCapacityPreemptionPolicy.java > - > > Key: YARN-2020 > URL: https://issues.apache.org/jira/browse/YARN-2020 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.4.0 > Environment: all >Reporter: yeqi >Priority: Trivial > Fix For: 2.5.0 > > Attachments: YARN-2020.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > observeOnly should be checked at the very beginning of > ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(), so as > to avoid unnecessary work. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2020) observeOnly should be checked before any preemption computation started inside containerBasedPreemptOrKill() of ProportionalCapacityPreemptionPolicy.java
yeqi created YARN-2020: -- Summary: observeOnly should be checked before any preemption computation started inside containerBasedPreemptOrKill() of ProportionalCapacityPreemptionPolicy.java Key: YARN-2020 URL: https://issues.apache.org/jira/browse/YARN-2020 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Environment: all Reporter: yeqi Priority: Trivial Fix For: 2.5.0 observeOnly should be checked at the very beginning of ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(), so as to avoid unnecessary work. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1805) Signal container request delivery from resourcemanager to nodemanager
[ https://issues.apache.org/jira/browse/YARN-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-1805: -- Attachment: YARN-1805.patch The patch includes YARN-1803 and YARN-1897 for Jenkins to build. > Signal container request delivery from resourcemanager to nodemanager > - > > Key: YARN-1805 > URL: https://issues.apache.org/jira/browse/YARN-1805 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Ming Ma > Attachments: YARN-1805.patch > > > 1. Update ResourceTracker's HeartbeatResponse to include the list of > SignalContainerRequest. > 2. Upon receiving the request, NM's NodeStatusUpdater will deliver the > request to ContainerManager. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1805) Signal container request delivery from resourcemanager to nodemanager
[ https://issues.apache.org/jira/browse/YARN-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma reassigned YARN-1805: - Assignee: Ming Ma > Signal container request delivery from resourcemanager to nodemanager > - > > Key: YARN-1805 > URL: https://issues.apache.org/jira/browse/YARN-1805 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Ming Ma >Assignee: Ming Ma > Attachments: YARN-1805.patch > > > 1. Update ResourceTracker's HeartbeatResponse to include the list of > SignalContainerRequest. > 2. Upon receiving the request, NM's NodeStatusUpdater will deliver the > request to ContainerManager. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1201) TestAMAuthorization fails with local hostname cannot be resolved
[ https://issues.apache.org/jira/browse/YARN-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989571#comment-13989571 ] Hudson commented on YARN-1201: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1775 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1775/]) YARN-1201. TestAMAuthorization fails with local hostname cannot be resolved. (Wangda Tan via junping_du) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1592197) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestAMAuthorization.java > TestAMAuthorization fails with local hostname cannot be resolved > > > Key: YARN-1201 > URL: https://issues.apache.org/jira/browse/YARN-1201 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.1.0-beta > Environment: SUSE Linux Enterprise Server 11 (x86_64) >Reporter: Nemon Lou >Assignee: Wangda Tan >Priority: Minor > Fix For: 2.4.1 > > Attachments: YARN-1201.patch, YARN-1201.patch, YARN-1201.patch, > YARN-1201.patch, YARN-1201.patch, YARN-1201.patch, YARN-1201.patch > > > When the hostname is 158-1-131-10, TestAMAuthorization fails. > {code} > Running org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization > Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 14.034 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization > testUnauthorizedAccess[0](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization) > Time elapsed: 3.952 sec <<< ERROR! > java.lang.NullPointerException: null > at > org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:284) > testUnauthorizedAccess[1](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization) > Time elapsed: 3.116 sec <<< ERROR! > java.lang.NullPointerException: null > at > org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:284) > Results : > Tests in error: > TestAMAuthorization.testUnauthorizedAccess:284 NullPointer > TestAMAuthorization.testUnauthorizedAccess:284 NullPointer > Tests run: 4, Failures: 0, Errors: 2, Skipped: 0 > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989487#comment-13989487 ] Junping Du commented on YARN-2019: -- The bad news could be: the exception could be repeated on the new active RM, as the ZKRMStateStore is shared. Am I missing anything here? > Retrospect on decision of making RM crashed if any exception throw in > ZKRMStateStore > > > Key: YARN-2019 > URL: https://issues.apache.org/jira/browse/YARN-2019 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junping Du >Priority: Critical > Labels: ha > > Currently, if anything abnormal happens in ZKRMStateStore, it throws a fatal > exception that crashes the RM. As shown in YARN-1924, this could be due to an > internal RM HA bug itself rather than a truly fatal condition. We should revisit > this decision, as the HA feature is designed to protect key components, not > disrupt them. -- This message was sent by Atlassian JIRA (v6.2#6252)
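One direction the retrospective could take is to isolate store failures and step the RM down instead of aborting the process. This is purely illustrative, with invented names, and as the comment above notes it is only a mitigation: if the bad state is shared, the new active RM may hit the same exception.

{code}
// Purely illustrative, names invented: step down to standby on store errors
// rather than crashing the RM process outright.
class StoreFailureSketch {
  interface StateStoreOp { void run() throws Exception; }
  interface HaControl { void transitionToStandby(); }

  void runStoreOp(StateStoreOp op, HaControl ha) {
    try {
      op.run();
    } catch (Exception e) {
      // Mitigation, not a fix: with a shared ZKRMStateStore the failover
      // target can repeat the same failure.
      ha.transitionToStandby();
    }
  }
}
{code}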
[jira] [Commented] (YARN-1201) TestAMAuthorization fails with local hostname cannot be resolved
[ https://issues.apache.org/jira/browse/YARN-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989451#comment-13989451 ] Hudson commented on YARN-1201: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1749 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1749/]) YARN-1201. TestAMAuthorization fails with local hostname cannot be resolved. (Wangda Tan via junping_du) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1592197) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestAMAuthorization.java > TestAMAuthorization fails with local hostname cannot be resolved > > > Key: YARN-1201 > URL: https://issues.apache.org/jira/browse/YARN-1201 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.1.0-beta > Environment: SUSE Linux Enterprise Server 11 (x86_64) >Reporter: Nemon Lou >Assignee: Wangda Tan >Priority: Minor > Fix For: 2.4.1 > > Attachments: YARN-1201.patch, YARN-1201.patch, YARN-1201.patch, > YARN-1201.patch, YARN-1201.patch, YARN-1201.patch, YARN-1201.patch > > > When the hostname is 158-1-131-10, TestAMAuthorization fails. > {code} > Running org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization > Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 14.034 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization > testUnauthorizedAccess[0](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization) > Time elapsed: 3.952 sec <<< ERROR! > java.lang.NullPointerException: null > at > org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:284) > testUnauthorizedAccess[1](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization) > Time elapsed: 3.116 sec <<< ERROR! > java.lang.NullPointerException: null > at > org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:284) > Results : > Tests in error: > TestAMAuthorization.testUnauthorizedAccess:284 NullPointer > TestAMAuthorization.testUnauthorizedAccess:284 NullPointer > Tests run: 4, Failures: 0, Errors: 2, Skipped: 0 > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1201) TestAMAuthorization fails with local hostname cannot be resolved
[ https://issues.apache.org/jira/browse/YARN-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989424#comment-13989424 ] Hudson commented on YARN-1201: -- FAILURE: Integrated in Hadoop-Yarn-trunk #558 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/558/]) YARN-1201. TestAMAuthorization fails with local hostname cannot be resolved. (Wangda Tan via junping_du) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1592197) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestAMAuthorization.java > TestAMAuthorization fails with local hostname cannot be resolved > > > Key: YARN-1201 > URL: https://issues.apache.org/jira/browse/YARN-1201 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.1.0-beta > Environment: SUSE Linux Enterprise Server 11 (x86_64) >Reporter: Nemon Lou >Assignee: Wangda Tan >Priority: Minor > Fix For: 2.4.1 > > Attachments: YARN-1201.patch, YARN-1201.patch, YARN-1201.patch, > YARN-1201.patch, YARN-1201.patch, YARN-1201.patch, YARN-1201.patch > > > When the hostname is 158-1-131-10, TestAMAuthorization fails. > {code} > Running org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization > Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 14.034 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization > testUnauthorizedAccess[0](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization) > Time elapsed: 3.952 sec <<< ERROR! > java.lang.NullPointerException: null > at > org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:284) > testUnauthorizedAccess[1](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization) > Time elapsed: 3.116 sec <<< ERROR! > java.lang.NullPointerException: null > at > org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:284) > Results : > Tests in error: > TestAMAuthorization.testUnauthorizedAccess:284 NullPointer > TestAMAuthorization.testUnauthorizedAccess:284 NullPointer > Tests run: 4, Failures: 0, Errors: 2, Skipped: 0 > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)