[jira] [Commented] (YARN-2041) Hard to co-locate MR2 and Spark jobs on the same cluster in YARN
[ https://issues.apache.org/jira/browse/YARN-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012134#comment-14012134 ] Nishkam Ravi commented on YARN-2041: Sure. There seem to be multiple issues. Two seem to clearly stand out: 1. Performance degrades with FIFO for large values of memory-mb in both single-job and multi-job mode. Observed for multiple benchmarks including TeraSort, TeraValidate, TeraGen, WordCount, ShuffleText. Issue: FIFO seems to be allocating too many jobs at once on a single node. 2. Performance with Capacity scheduler suffers for large values of memory-mb only for TeraValidate (in single-job mode). Issue: why does Capacity scheduler regress for TeraValidate? > Hard to co-locate MR2 and Spark jobs on the same cluster in YARN > > > Key: YARN-2041 > URL: https://issues.apache.org/jira/browse/YARN-2041 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Nishkam Ravi > > Performance of MR2 jobs falls drastically as YARN config parameter > yarn.nodemanager.resource.memory-mb is increased beyond a certain value. > Performance of Spark falls drastically as the value of > yarn.nodemanager.resource.memory-mb is decreased beyond a certain value for a > large data set. > This makes it hard to co-locate MR2 and Spark jobs in YARN. > The experiments are being conducted on a 6-node cluster. The following > workloads are being run: TeraGen, TeraSort, TeraValidate, WordCount, > ShuffleText and PageRank. > Will add more details to this JIRA over time. -- This message was sent by Atlassian JIRA (v6.2#6252)
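The comment above attributes the FIFO regression to the scheduler packing too many tasks onto a single node once yarn.nodemanager.resource.memory-mb is raised. A minimal, self-contained sketch of why that parameter directly bounds per-node concurrency; the container and node sizes below are hypothetical, not taken from the benchmark runs:

{code:java}
// Illustrative only: yarn.nodemanager.resource.memory-mb caps how many containers a
// scheduler may pack onto one NodeManager. All numbers here are hypothetical.
public class NodePackingSketch {
  static int maxContainersPerNode(int nodeMemoryMb, int containerMemoryMb) {
    return nodeMemoryMb / containerMemoryMb; // integer division: whole containers only
  }

  public static void main(String[] args) {
    int containerMb = 1024; // e.g. a 1 GB MR2 task container
    System.out.println(maxContainersPerNode(16 * 1024, containerMb)); // 16 concurrent tasks
    System.out.println(maxContainersPerNode(40 * 1024, containerMb)); // 40 concurrent tasks
  }
}
{code}

Under that reading, raising the setting from 16 GB to 40 GB lets FIFO legally place 2.5x as many concurrent tasks per node, which is consistent with the reported slowdown if disk or CPU on the node becomes the bottleneck.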
[jira] [Created] (YARN-2113) CS Preemption should respect user-limits
Vinod Kumar Vavilapalli created YARN-2113: - Summary: CS Preemption should respect user-limits Key: YARN-2113 URL: https://issues.apache.org/jira/browse/YARN-2113 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Fix For: 2.5.0 This is different from (even if related to, and likely shares code with) YARN-2069. YARN-2069 focuses on making sure that even if a queue has its guaranteed capacity, its individual users are treated in line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) Use scheduling policies throughout the queue hierarchy to decide which containers to preempt
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012052#comment-14012052 ] Hudson commented on YARN-596: - SUCCESS: Integrated in Hadoop-trunk-Commit #5619 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5619/]) YARN-596. Use scheduling policies throughout the queue hierarchy to decide which containers to preempt (Wei Yan via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1598197) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AppSchedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSParentQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSSchedulerApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/Schedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/SchedulingPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FakeSchedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java > Use scheduling policies throughout the queue hierarchy to decide which > 
containers to preempt > > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Fix For: 2.5.0 > > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
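For readers new to this JIRA, the quoted description refers to the pre-YARN-596 selection logic: containers from every app in an over-fair-share queue are pooled and ordered purely by requested priority. A minimal sketch of that (now replaced) ordering, using hypothetical types, to make the "shielding" problem concrete:

{code:java}
// Illustrative pseudocode of the selection this change replaces (hypothetical types).
// All containers from over-fair-share queues are pooled and sorted by the priority they
// were requested at, so an app can shield itself by requesting at high priority.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CandidateContainer {
  final String app;
  final int requestPriority; // in YARN, a lower value means a higher priority
  CandidateContainer(String app, int requestPriority) {
    this.app = app;
    this.requestPriority = requestPriority;
  }
}

public class OldPreemptionOrder {
  static List<CandidateContainer> preemptionOrder(List<CandidateContainer> fromOverShareQueues) {
    List<CandidateContainer> ordered = new ArrayList<>(fromOverShareQueues);
    // Lowest-priority requests are preempted first; per-app fairness is never consulted.
    ordered.sort(Comparator.comparingInt((CandidateContainer c) -> c.requestPriority).reversed());
    return ordered;
  }
}
{code}

The committed change instead walks the queue hierarchy and delegates the choice to each queue's SchedulingPolicy, as the touched files (FSParentQueue, FSLeafQueue, SchedulingPolicy and the policies package) indicate.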
[jira] [Commented] (YARN-596) Use scheduling policies throughout the queue hierarchy to decide which containers to preempt
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012038#comment-14012038 ] Sandy Ryza commented on YARN-596: - I just committed this to trunk and branch-2. Thanks Wei for the patch and Ashwin for taking a look. > Use scheduling policies throughout the queue hierarchy to decide which > containers to preempt > > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Fix For: 2.5.0 > > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-596) Use scheduling policies throughout the queue hierarchy to decide which containers to preempt
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-596: Summary: Use scheduling policies throughout the queue hierarchy to decide which containers to preempt (was: Use scheduling policies throughout the hierarchy to decide which containers to preempt) > Use scheduling policies throughout the queue hierarchy to decide which > containers to preempt > > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-596) Use scheduling policies throughout the hierarchy to decide which containers to preempt
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-596: Summary: Use scheduling policies throughout the hierarchy to decide which containers to preempt (was: In fair scheduler, intra-application container priorities affect inter-application preemption decisions) > Use scheduling policies throughout the hierarchy to decide which containers > to preempt > -- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012030#comment-14012030 ] Sandy Ryza commented on YARN-596: - +1 > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2041) Hard to co-locate MR2 and Spark jobs on the same cluster in YARN
[ https://issues.apache.org/jira/browse/YARN-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011982#comment-14011982 ] Vinod Kumar Vavilapalli commented on YARN-2041: --- Thanks for all the updates, [~nravi], but can you please make clear the issues that you think need to be fixed? > Hard to co-locate MR2 and Spark jobs on the same cluster in YARN > > > Key: YARN-2041 > URL: https://issues.apache.org/jira/browse/YARN-2041 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Nishkam Ravi > > Performance of MR2 jobs falls drastically as YARN config parameter > yarn.nodemanager.resource.memory-mb is increased beyond a certain value. > Performance of Spark falls drastically as the value of > yarn.nodemanager.resource.memory-mb is decreased beyond a certain value for a > large data set. > This makes it hard to co-locate MR2 and Spark jobs in YARN. > The experiments are being conducted on a 6-node cluster. The following > workloads are being run: TeraGen, TeraSort, TeraValidate, WordCount, > ShuffleText and PageRank. > Will add more details to this JIRA over time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011943#comment-14011943 ] Hadoop QA commented on YARN-2010: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647268/yarn-2010-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3854//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3854//console This message is automatically generated. > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Rohith >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, > yarn-2010-3.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 
4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ...
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011918#comment-14011918 ] Hadoop QA commented on YARN-2091: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647261/YARN-2091.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainersMonitor {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3853//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3853//console This message is automatically generated. > Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters > --- > > Key: YARN-2091 > URL: https://issues.apache.org/jira/browse/YARN-2091 > Project: Hadoop YARN > Issue Type: Task >Reporter: Bikas Saha >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2091.1.patch > > > Currently, the AM cannot programmatically determine if the task was killed > due to using excessive memory. The NM kills it without passing this > information in the container status back to the RM. So the AM cannot take any > action here. The jira tracks adding this exit status and passing it from the > NM to the RM and then the AM. In general, there may be other such actions > taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
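Once the new exit status reaches the AM, the AM's container-completion handling can branch on it. A hedged sketch of that consumer side; the constant name and value below follow the JIRA title and are placeholders, since the committed patch may name it differently (for example, splitting physical vs. virtual memory kills):

{code:java}
// Sketch of AM-side handling once the NM/RM propagate a memory-kill exit status.
// KILL_EXCEEDED_MEMORY and its value are placeholders taken from the JIRA title.
import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public class MemoryKillHandling {
  // Hypothetical; a real AM would reference the constant added to ContainerExitStatus.
  static final int KILL_EXCEEDED_MEMORY = -104;

  static void onContainersCompleted(List<ContainerStatus> statuses) {
    for (ContainerStatus status : statuses) {
      if (status.getExitStatus() == KILL_EXCEEDED_MEMORY) {
        // The AM can now react programmatically, e.g. retry the task in a larger container.
        System.out.println("Container " + status.getContainerId()
            + " exceeded its memory limit: " + status.getDiagnostics());
      }
    }
  }
}
{code}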
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011907#comment-14011907 ] Ashwin Shankar commented on YARN-2026: -- Hi [~sandyr], bq. We would see that parentA is below its minShare, so we would preempt resources on its behalf. minShare preemption at parent queue is not yet implemented ,FairScheduler.resToPreempt() is not recursive(YARN-596 doesn't address this). I had created YARN-1961 for this purpose,which I plan to work on. But yes you are right,if YARN-1961 is in place, we can set minShare and minShareTimeout at parentA,which would reclaim resource from parentB. This solves problem-1 in the description,but what about problem-2 ? When we have many leaf queues under a parent,say using NestedUserQueue rule. Eg. - parentA has 100 user queues under it - fair share of each user queue is 1% of parentA(assuming weight=1) - Say user queue parentA.user1 is taking up 100% of cluster since its the only active queue. - parentA.user2 which was inactive till now ,submits a job and needs say 20%. - parentA.user2 would get only 1% through preemption and parentA.user1 would have 99%. This seems unfair considering users have equal weight. Eventually,as user1 releases its containers, it would go to user2,but until that happens user1 can hog the cluster. In our cluster we have about 200 users(so 200 user queues),but only about 20%(avg) are active at a point in time. Fair share for each user becomes really low (1/200)*parent and can causes this 'unfairness' mentioned in above example. This can be solved by dividing fair share only to active queues. How about this,can we have a new property say 'fairShareForActiveQueues' which turns on/off this feature,that way people who need it can use it and other's can turn it off and would get the usual static fair share behavior. Thoughts ? > Fair scheduler : Fair share for inactive queues causes unfair allocation in > some scenarios > -- > > Key: YARN-2026 > URL: https://issues.apache.org/jira/browse/YARN-2026 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Attachments: YARN-2026-v1.txt > > > Problem1- While using hierarchical queues in fair scheduler,there are few > scenarios where we have seen a leaf queue with least fair share can take > majority of the cluster and starve a sibling parent queue which has greater > weight/fair share and preemption doesn’t kick in to reclaim resources. > The root cause seems to be that fair share of a parent queue is distributed > to all its children irrespective of whether its an active or an inactive(no > apps running) queue. Preemption based on fair share kicks in only if the > usage of a queue is less than 50% of its fair share and if it has demands > greater than that. When there are many queues under a parent queue(with high > fair share),the child queue’s fair share becomes really low. As a result when > only few of these child queues have apps running,they reach their *tiny* fair > share quickly and preemption doesn’t happen even if other leaf > queues(non-sibling) are hogging the cluster. > This can be solved by dividing fair share of parent queue only to active > child queues. 
> Here is an example describing the problem and proposed solution: > root.lowPriorityQueue is a leaf queue with weight 2 > root.HighPriorityQueue is parent queue with weight 8 > root.HighPriorityQueue has 10 child leaf queues : > root.HighPriorityQueue.childQ(1..10) > Above config,results in root.HighPriorityQueue having 80% fair share > and each of its ten child queue would have 8% fair share. Preemption would > happen only if the child queue is <4% (0.5*8=4). > Lets say at the moment no apps are running in any of the > root.HighPriorityQueue.childQ(1..10) and few apps are running in > root.lowPriorityQueue which is taking up 95% of the cluster. > Up till this point,the behavior of FS is correct. > Now,lets say root.HighPriorityQueue.childQ1 got a big job which requires 30% > of the cluster. It would get only the available 5% in the cluster and > preemption wouldn't kick in since its above 4%(half fair share).This is bad > considering childQ1 is under a highPriority parent queue which has *80% fair > share*. > Until root.lowPriorityQueue starts relinquishing containers,we would see the > following allocation on the scheduler page: > *root.lowPriorityQueue = 95%* > *root.HighPriorityQueue.childQ1=5%* > This can be solved by distributing a parent’s fair share only to active > queues. > So in the example above,since childQ1 is the only active queue > under root.HighPriorityQueue, it would get all its parent
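To make the active-queue proposal concrete, a minimal, self-contained sketch contrasting today's static division with the proposed division over active children only, using the 100-user-queue example from the comment and equal child weights:

{code:java}
// Illustrative only: static fair share vs. the proposed "active queues only" division,
// using the 100-user-queue example from the comment above and equal child weights.
public class ActiveFairShareSketch {
  static double fairSharePerChild(double parentShare, int childCount) {
    return parentShare / childCount;
  }

  public static void main(String[] args) {
    double parentA = 1.0;   // parentA's share of the cluster
    int allChildren = 100;  // user queues configured under parentA
    int activeChildren = 2; // only user1 and user2 actually have apps running

    // Today: every child holds 1% regardless of activity, so preemption targets stay tiny.
    System.out.println(fairSharePerChild(parentA, allChildren));    // 0.01
    // Proposed: only active children split the parent's share, so user2 can reclaim 50%.
    System.out.println(fairSharePerChild(parentA, activeChildren)); // 0.5
  }
}
{code}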
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011906#comment-14011906 ] Hadoop QA commented on YARN-596: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647256/YARN-596.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3852//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3852//console This message is automatically generated. > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2010: --- Attachment: yarn-2010-3.patch New patch that gets rid of the config and addresses the issue where the masterKey is null. > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Rohith >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, > yarn-2010-3.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 
8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ... 13 more > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
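The trace bottoms out in ClientToAMTokenSecretManagerInRM.registerMasterKey with a missing master key for an attempt stored before security was enabled. A hedged sketch of the kind of guard the latest patch is described as adding ("addresses the issue where the masterKey is null"); the types and names here are illustrative stand-ins, not code from the patch:

{code:java}
// Sketch of guarding attempt recovery against a missing client-to-AM master key, so one
// unrecoverable attempt cannot block the RM's transition to active. Names are illustrative;
// the real change would live around RMAppAttemptImpl#recoverAppAttemptCredentials.
public class RecoveryGuardSketch {
  interface ClientToAMSecretManager {
    void registerMasterKey(String attemptId, byte[] key);
  }

  static void recoverCredentials(String attemptId, byte[] masterKey, ClientToAMSecretManager mgr) {
    if (masterKey == null || masterKey.length == 0) {
      // Nothing was persisted (e.g. the app predates enabling Kerberos): warn and continue
      // rather than throwing and aborting the whole transition to active.
      System.err.println("No client-to-AM master key recovered for attempt " + attemptId);
      return;
    }
    mgr.registerMasterKey(attemptId, masterKey);
  }
}
{code}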
[jira] [Updated] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2091: - Attachment: YARN-2091.1.patch Added ContainerExitStatus.KILL_EXCEEDED_MEMORY and test to pass the exit status from NM to RM correctly. > Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters > --- > > Key: YARN-2091 > URL: https://issues.apache.org/jira/browse/YARN-2091 > Project: Hadoop YARN > Issue Type: Task >Reporter: Bikas Saha >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2091.1.patch > > > Currently, the AM cannot programmatically determine if the task was killed > due to using excessive memory. The NM kills it without passing this > information in the container status back to the RM. So the AM cannot take any > action here. The jira tracks adding this exit status and passing it from the > NM to the RM and then the AM. In general, there may be other such actions > taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011879#comment-14011879 ] Hadoop QA commented on YARN-1474: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647249/YARN-1474.18.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 9 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/3851//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3851//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3851//console This message is automatically generated. > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, > YARN-1474.17.patch, YARN-1474.18.patch, YARN-1474.2.patch, YARN-1474.3.patch, > YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, > YARN-1474.8.patch, YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-596: - Attachment: YARN-596.patch > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011847#comment-14011847 ] Hadoop QA commented on YARN-2110: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647243/YARN-2110.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3850//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3850//console This message is automatically generated. > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Chen He > Labels: test > Attachments: YARN-2110.patch > > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
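One common way to keep such a test independent of the cluster-wide default scheduler is to pin the scheduler class in the test's own configuration before constructing MockRM; a sketch of that pattern (the committed patch may instead remove the CapacityScheduler casts altogether):

{code:java}
// Sketch: pin the scheduler the test assumes instead of relying on the configured default.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;

public class SchedulerPinningSketch {
  static Configuration capacitySchedulerConf() {
    Configuration conf = new YarnConfiguration();
    conf.setClass(YarnConfiguration.RM_SCHEDULER,
        CapacityScheduler.class, ResourceScheduler.class);
    return conf; // pass this to MockRM so the casts in the test cannot fail
  }
}
{code}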
[jira] [Updated] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2026: - Description: Problem1- While using hierarchical queues in fair scheduler,there are few scenarios where we have seen a leaf queue with least fair share can take majority of the cluster and starve a sibling parent queue which has greater weight/fair share and preemption doesn’t kick in to reclaim resources. The root cause seems to be that fair share of a parent queue is distributed to all its children irrespective of whether its an active or an inactive(no apps running) queue. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and if it has demands greater than that. When there are many queues under a parent queue(with high fair share),the child queue’s fair share becomes really low. As a result when only few of these child queues have apps running,they reach their *tiny* fair share quickly and preemption doesn’t happen even if other leaf queues(non-sibling) are hogging the cluster. This can be solved by dividing fair share of parent queue only to active child queues. Here is an example describing the problem and proposed solution: root.lowPriorityQueue is a leaf queue with weight 2 root.HighPriorityQueue is parent queue with weight 8 root.HighPriorityQueue has 10 child leaf queues : root.HighPriorityQueue.childQ(1..10) Above config,results in root.HighPriorityQueue having 80% fair share and each of its ten child queue would have 8% fair share. Preemption would happen only if the child queue is <4% (0.5*8=4). Lets say at the moment no apps are running in any of the root.HighPriorityQueue.childQ(1..10) and few apps are running in root.lowPriorityQueue which is taking up 95% of the cluster. Up till this point,the behavior of FS is correct. Now,lets say root.HighPriorityQueue.childQ1 got a big job which requires 30% of the cluster. It would get only the available 5% in the cluster and preemption wouldn't kick in since its above 4%(half fair share).This is bad considering childQ1 is under a highPriority parent queue which has *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers,we would see the following allocation on the scheduler page: *root.lowPriorityQueue = 95%* *root.HighPriorityQueue.childQ1=5%* This can be solved by distributing a parent’s fair share only to active queues. So in the example above,since childQ1 is the only active queue under root.HighPriorityQueue, it would get all its parent’s fair share i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds. Problem2 - Also note that similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2,if childQ2 hogs the cluster. childQ2 can take up 95% cluster and childQ1 would be stuck at 5%,until childQ2 starts relinquishing containers. We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue fair share ie 40%,which would ensure childQ1 gets upto 40% resource if needed through preemption. was: While using hierarchical queues in fair scheduler,there are few scenarios where we have seen a leaf queue with least fair share can take majority of the cluster and starve a sibling parent queue which has greater weight/fair share and preemption doesn’t kick in to reclaim resources. 
The root cause seems to be that fair share of a parent queue is distributed to all its children irrespective of whether its an active or an inactive(no apps running) queue. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and if it has demands greater than that. When there are many queues under a parent queue(with high fair share),the child queue’s fair share becomes really low. As a result when only few of these child queues have apps running,they reach their *tiny* fair share quickly and preemption doesn’t happen even if other leaf queues(non-sibling) are hogging the cluster. This can be solved by dividing fair share of parent queue only to active child queues. Here is an example describing the problem and proposed solution: root.lowPriorityQueue is a leaf queue with weight 2 root.HighPriorityQueue is parent queue with weight 8 root.HighPriorityQueue has 10 child leaf queues : root.HighPriorityQueue.childQ(1..10) Above config,results in root.HighPriorityQueue having 80% fair share and each of its ten child queue would have 8% fair share. Preemption would happen only if the child queue is <4% (0.5*8=4). Lets say at the moment no apps are running in any of the root.HighPriorityQueue.childQ(1..10) and few apps are running in root.lowPriorityQueue which is taking up 95% of the cluster. Up till this point,the behavior of FS is correct. Now,lets say root
[jira] [Commented] (YARN-2041) Hard to co-locate MR2 and Spark jobs on the same cluster in YARN
[ https://issues.apache.org/jira/browse/YARN-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011844#comment-14011844 ] Nishkam Ravi commented on YARN-2041: Unlike FIFO, whose performance deteriorates consistently across multiple benchmarks as value of yarn.nodemanager.resource.memory-mb is increased from 16GB to 40GB, Capacity scheduler performs well for all benchmarks except for TeraValidate. For TeraValidate in single-job mode: Exec. time with Fair: 38 sec (yarn.nodemanager.resource.memory-mb = 16GB) Exec. time with Fair: 38 sec (yarn.nodemanager.resource.memory-mb = 40GB) Exec. time with Capacity: 51 sec (yarn.nodemanager.resource.memory-mb = 16GB) Exec. time with Capacity: 100 sec (yarn.nodemanager.resource.memory-mb = 40GB) Also, in multi-job mode, Capacity seems to be behaving like FIFO. Scheduling one job at a time for execution. > Hard to co-locate MR2 and Spark jobs on the same cluster in YARN > > > Key: YARN-2041 > URL: https://issues.apache.org/jira/browse/YARN-2041 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Nishkam Ravi > > Performance of MR2 jobs falls drastically as YARN config parameter > yarn.nodemanager.resource.memory-mb is increased beyond a certain value. > Performance of Spark falls drastically as the value of > yarn.nodemanager.resource.memory-mb is decreased beyond a certain value for a > large data set. > This makes it hard to co-locate MR2 and Spark jobs in YARN. > The experiments are being conducted on a 6-node cluster. The following > workloads are being run: TeraGen, TeraSort, TeraValidate, WordCount, > ShuffleText and PageRank. > Will add more details to this JIRA over time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1474: - Attachment: YARN-1474.18.patch Thanks for sharing the opinions, Sandy and Karthik. I also think it's OK to change internal APIs' semantics because its interface is Evolving one. [~vinodkv], please let us know if you have additional comments. Updated patch with following changes to address Karthik's comments: 1. Removed {{initialized}} flag from *Schedulers. All initialization is done in {{serviceInit}} and {{serviceStart}}, instead of {{reinitialize()}}. 2. Changed ResourceSchedulerWrapper to override {{serviceInit}}, {{serviceStart}}, {{serviceStop}}. 3. Updated some tests to call scheduler.init() right after scheduler.setRMContext() without ResourceManager/MockRM. > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, > YARN-1474.17.patch, YARN-1474.18.patch, YARN-1474.2.patch, YARN-1474.3.patch, > YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, > YARN-1474.8.patch, YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
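For context, the service-model shape the patch moves the schedulers toward looks roughly like the following; this is a simplified sketch, not the patch itself (the real schedulers do far more in each lifecycle method):

{code:java}
// Simplified sketch of the lifecycle split this JIRA introduces: one-time setup in
// serviceInit/serviceStart, cleanup in serviceStop, reinitialize() only for config refresh.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

public class SchedulerAsServiceSketch extends AbstractService {
  public SchedulerAsServiceSketch() {
    super(SchedulerAsServiceSketch.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // One-time initialization that previously hid behind an "initialized" flag.
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    // Start background threads (e.g. the update/preemption thread) here.
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    // Stop those threads so tests can tear the scheduler down cleanly.
    super.serviceStop();
  }

  public void reinitialize(Configuration conf) {
    // After this change, reinitialize() is limited to refreshing queue configuration.
  }
}
{code}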
[jira] [Commented] (YARN-1708) Add a public API to reserve resources (part of YARN-1051)
[ https://issues.apache.org/jira/browse/YARN-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011820#comment-14011820 ] Vinod Kumar Vavilapalli commented on YARN-1708: --- Thanks for the patch [~subru]! I started looking at this. Few comments: h4. Misc - I think we should create a ReservationID or ReservationHandle and use it instead of strings - ReservationResponse.message -> errorMesage? Or Errors? h4. ApplicationClientProtocol - createReservation -> submitReservation? - Let's have separate request/response records for submission, update and deletion of reservations. Deletion of reservations, for e.g only needs to supplied a reservationID. See submit/kill app for analogy. Similarly, ReservationRequest.reservationID doesn't need to be part of the request for the reservation-submission. h4. ReservationDefinition - Seems like there is a notion of absolute time. We should make it clear what the arrival/deadline long's really represent. Particularly given the possibility of different timezones between the RM and the client. - It may be also very useful to let users specify time in relative terms - 6hrs from now, etc. - It let's you specify a list of ResourceRequests. Not sure how we can specify things like RR1 for the first 5 mins, RR2 for the next 15 etc. h4. ReservationDefinitionType - It seems like if we instead have a list of records of type (arrival, ResourceRequest, deadline), we will cover all the cases in the definition-type and then some more? Thoughts? - Also any examples of where R_ANY is useful? Similarly as to how R_ORDER is not enough and instead we have a need for R_ORDER_NO_GAP? Focusing mainly on use-cases here. h4. ResourceRequest - concurrency is really a request for a gang of containers? - Meaning of leaseDuration? Is it indicating the scheduler as to how long the container will run for? I have suggestions for configuration props renames follow. We follow a component.sub-component.sub-component.property-name convention. (OT: I wish I looked at preemption related config names :) ) IAC, I need to see the bigger picture with the rest of the patches before I can suggest correct naming, let's drop the YarnConfiguration changes from this patch. Will look more carefully at the PB impls in the next cycle. bq. The patch posted here is not submitted, since it depends on many other patches part of the umbrella JIRA, the separation is designed only for ease of reviewing. I see this patch to be fairly independent and committable in isolation. Though we should wait till we have the entire set to make sure the changes here are all sufficient and necessary. > Add a public API to reserve resources (part of YARN-1051) > - > > Key: YARN-1708 > URL: https://issues.apache.org/jira/browse/YARN-1708 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Carlo Curino >Assignee: Subramaniam Krishnan > Attachments: YARN-1708.patch > > > This JIRA tracks the definition of a new public API for YARN, which allows > users to reserve resources (think of time-bounded queues). This is part of > the admission control enhancement proposed in YARN-1051. -- This message was sent by Atlassian JIRA (v6.2#6252)
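To make the record-shape discussion easier to follow: the suggestion of "a list of records of type (arrival, ResourceRequest, deadline)" keyed by a typed reservation id could look roughly like the sketch below. Every name here is an assumption drawn from the comment, not the API that was eventually committed:

{code:java}
// Sketch of the record shape discussed above: a typed reservation id plus a list of
// (arrival, request, deadline) stages. Names and fields are assumptions from the comment.
import java.util.List;

public class ReservationSketch {
  static final class ReservationId {
    final long clusterTimestamp;
    final long id;
    ReservationId(long clusterTimestamp, long id) {
      this.clusterTimestamp = clusterTimestamp;
      this.id = id;
    }
  }

  static final class ReservationStage {
    final long arrivalMillis;  // absolute epoch time; relative forms ("6 hrs from now") are open
    final long deadlineMillis; // absolute epoch time
    final int gangSize;        // "concurrency": containers that must start together
    final int memoryMbEach;    // stand-in for a full ResourceRequest
    ReservationStage(long arrivalMillis, long deadlineMillis, int gangSize, int memoryMbEach) {
      this.arrivalMillis = arrivalMillis;
      this.deadlineMillis = deadlineMillis;
      this.gangSize = gangSize;
      this.memoryMbEach = memoryMbEach;
    }
  }

  static final class ReservationDefinition {
    final ReservationId id;
    final List<ReservationStage> stages; // e.g. RR1 for the first 5 minutes, RR2 for the next 15
    ReservationDefinition(ReservationId id, List<ReservationStage> stages) {
      this.id = id;
      this.stages = stages;
    }
  }
}
{code}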
[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-interval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011818#comment-14011818 ] Hadoop QA commented on YARN-2054: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647233/yarn-2054-4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3849//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3849//console This message is automatically generated. > Poor defaults for YARN ZK configs for retries and retry-inteval > --- > > Key: YARN-2054 > URL: https://issues.apache.org/jira/browse/YARN-2054 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch, > yarn-2054-4.patch > > > Currenly, we have the following default values: > # yarn.resourcemanager.zk-num-retries - 500 > # yarn.resourcemanager.zk-retry-interval-ms - 2000 > This leads to a cumulate 1000 seconds before the RM gives up trying to > connect to the ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
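The arithmetic behind the complaint, using the property names quoted in the issue and its default values: 500 retries at a 2000 ms interval is 1,000,000 ms, i.e. roughly 1,000 seconds (about 17 minutes) of retrying before the RM gives up.

{code:java}
// Worst-case ZK retry budget with the defaults cited in the issue:
// 500 retries x 2000 ms = 1,000,000 ms, i.e. about 1,000 seconds.
import org.apache.hadoop.conf.Configuration;

public class ZkRetryBudget {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    int retries = conf.getInt("yarn.resourcemanager.zk-num-retries", 500);
    long intervalMs = conf.getLong("yarn.resourcemanager.zk-retry-interval-ms", 2000);
    System.out.println("Worst-case wait: " + (retries * intervalMs) / 1000 + " seconds");
  }
}
{code}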
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011814#comment-14011814 ] Wei Yan commented on YARN-596: -- Thanks, [~ashwinshankar77]. I'll update a patch quickly. > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011807#comment-14011807 ] Ashwin Shankar commented on YARN-596: - [~ywskycn], minor comment: can you please update the javadoc comment for {code:title=FairScheduler.java} protected void preemptResources(Resource toPreempt) {code} It still talks about the previous preemption algorithm. > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2110: -- Labels: test (was: ) > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Chen He > Labels: test > Attachments: YARN-2110.patch > > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2109: -- Labels: test (was: ) > TestRM fails some tests when some tests run with CapacityScheduler and some > with FairScheduler > -- > > Key: YARN-2109 > URL: https://issues.apache.org/jira/browse/YARN-2109 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Anubhav Dhoot >Assignee: Chen He > Labels: test > > testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in > [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set > it to be CapacityScheduler. But if the default scheduler is set to > FairScheduler then the rest of the tests that execute after this will fail > with invalid cast exceptions when getting queuemetrics. This is based on test > execution order as only the tests that execute after this test will fail. > This is because the queuemetrics will be initialized by this test to > QueueMetrics and shared by the subsequent tests. > We can explicitly clear the metrics at the end of this test to fix this. > For example > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot > be cast to > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:90) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:85) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232) -- This message was sent by Atlassian JIRA (v6.2#6252)
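The description above suggests clearing the scheduler metrics at the end of testNMTokenSentForNormalContainer so later tests re-register FSQueueMetrics instead of hitting the statically cached QueueMetrics. A minimal sketch of such a teardown, assuming JUnit 4 and the test-visible QueueMetrics.clearQueueMetrics() helper; whether the eventual fix uses exactly these calls is not confirmed here.
{code}
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;
import org.junit.After;

public class TestRMMetricsCleanupSketch {
  @After
  public void clearSchedulerMetrics() {
    // Drop the cached QueueMetrics so the next test's scheduler
    // (CapacityScheduler or FairScheduler) registers its own metrics class.
    QueueMetrics.clearQueueMetrics();
    DefaultMetricsSystem.shutdown();
  }
}
{code}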
[jira] [Commented] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011788#comment-14011788 ] Chen He commented on YARN-2110: --- Changed the cast from CapacityScheduler to AbstractYarnScheduler, which is the parent of both FairScheduler and CapacityScheduler. > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Chen He > Attachments: YARN-2110.patch > > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
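A minimal sketch of the direction Chen He describes: cast the scheduler returned by the RM to AbstractYarnScheduler, the common parent, so the test passes regardless of which scheduler is configured as the default. Which concrete scheduler methods the test then calls through that reference is not shown here.
{code}
import org.apache.hadoop.yarn.server.resourcemanager.ResourceManager;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler;

public class SchedulerCastSketch {
  // Old test code, ((CapacityScheduler) rm.getResourceScheduler()), throws
  // ClassCastException when FairScheduler is the configured default.
  static AbstractYarnScheduler asAbstractScheduler(ResourceManager rm) {
    // Both CapacityScheduler and FairScheduler extend AbstractYarnScheduler.
    return (AbstractYarnScheduler) rm.getResourceScheduler();
  }
}
{code}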
[jira] [Updated] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2110: -- Attachment: YARN-2110.patch > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Chen He > Attachments: YARN-2110.patch > > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2054: --- Attachment: yarn-2054-4.patch Sorry for the bulky patch - forgot to rebase against trunk before generating a diff against it :) Here is the right one. > Poor defaults for YARN ZK configs for retries and retry-inteval > --- > > Key: YARN-2054 > URL: https://issues.apache.org/jira/browse/YARN-2054 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch, > yarn-2054-4.patch > > > Currenly, we have the following default values: > # yarn.resourcemanager.zk-num-retries - 500 > # yarn.resourcemanager.zk-retry-interval-ms - 2000 > This leads to a cumulate 1000 seconds before the RM gives up trying to > connect to the ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
[ https://issues.apache.org/jira/browse/YARN-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011706#comment-14011706 ] Hadoop QA commented on YARN-2112: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647222/YARN-2112.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3848//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3848//console This message is automatically generated. > Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml > - > > Key: YARN-2112 > URL: https://issues.apache.org/jira/browse/YARN-2112 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2112.1.patch > > > Now YarnClient is using TimelineClient, which has dependency on jackson libs. 
> However, the current dependency configurations make the hadoop-client > artifect miss 2 jackson libs, such that the applications which have > hadoop-client dependency will see the following exception > {code} > java.lang.NoClassDefFoundError: > org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637) > at java.lang.ClassLoader.defineClass(ClassLoader.java:621) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) > at java.net.URLClassLoader.access$000(URLClassLoader.java:58) > at java.net.URLClassLoader$1.run(URLClassLoader.java:197) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:190) > at java.lang.ClassLoader.loadClass(ClassLoader.java:306) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) > at java.lang.ClassLoader.loadClass(ClassLoader.java:247) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.(TimelineClientImpl.java:92) > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.mapred.ResourceMgrDelegate.(ResourceMgrDelegate.java:88) > at org.apache.hadoop.mapred.YARNRunner.(YARNRunner.java:111) > at > org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34) > at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95) > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:82) > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:75) > at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255) > at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:394) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250) > at or
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011686#comment-14011686 ] Jian He commented on YARN-2010: --- I agree that failing to recover an app shouldn’t fail the RM. I think for cases where the failure will be simply resolved by launching a new attempt like this, we should not fail the app. We can fail the app for cases where starting a new attempt can’t resolve the issue like failing to renew DT on recovery. > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Rohith >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 
5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ... 13 more > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
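Jian He's point is that one unrecoverable application should not block the RM from becoming active. A generic sketch of that pattern, not the actual RMAppManager code: catch the per-application recovery failure, fail (or relaunch) just that application, and keep going. The AppState and Recoverer types here are hypothetical stand-ins.
{code}
import java.util.List;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class PerAppRecoverySketch {
  private static final Log LOG = LogFactory.getLog(PerAppRecoverySketch.class);

  interface AppState { String getAppId(); }                            // hypothetical
  interface Recoverer { void recover(AppState app) throws Exception; } // hypothetical

  /** Recover each stored app; a failure marks that app failed instead of failing the RM. */
  static void recoverAll(List<AppState> storedApps, Recoverer recoverer) {
    for (AppState app : storedApps) {
      try {
        recoverer.recover(app);
      } catch (Exception e) {
        // Previously a failure here propagated up and the RM never transitioned
        // to active; instead, record the failure against this application only.
        LOG.error("Failed to recover " + app.getAppId() + ", marking it failed", e);
      }
    }
  }
}
{code}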
[jira] [Updated] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
[ https://issues.apache.org/jira/browse/YARN-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2112: -- Target Version/s: 2.5.0 > Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml > - > > Key: YARN-2112 > URL: https://issues.apache.org/jira/browse/YARN-2112 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2112.1.patch > > > Now YarnClient is using TimelineClient, which has dependency on jackson libs. > However, the current dependency configurations make the hadoop-client > artifect miss 2 jackson libs, such that the applications which have > hadoop-client dependency will see the following exception > {code} > java.lang.NoClassDefFoundError: > org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637) > at java.lang.ClassLoader.defineClass(ClassLoader.java:621) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) > at java.net.URLClassLoader.access$000(URLClassLoader.java:58) > at java.net.URLClassLoader$1.run(URLClassLoader.java:197) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:190) > at java.lang.ClassLoader.loadClass(ClassLoader.java:306) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) > at java.lang.ClassLoader.loadClass(ClassLoader.java:247) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.(TimelineClientImpl.java:92) > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.mapred.ResourceMgrDelegate.(ResourceMgrDelegate.java:88) > at org.apache.hadoop.mapred.YARNRunner.(YARNRunner.java:111) > at > org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34) > at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95) > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:82) > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:75) > at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255) > at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:394) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303) > at > org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) > at > org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at > org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) > at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145) > at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.util.RunJar.main(RunJar.java:212) > Caused by: java.lang.ClassNotFoundException: > org.codehaus.jackson.jaxrs.JacksonJaxbJsonProvider > at java.net.URLClassLoader$1.run(URLClassLoader.java:202) > at java.securit
[jira] [Updated] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
[ https://issues.apache.org/jira/browse/YARN-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2112: -- Attachment: YARN-2112.1.patch Create a patch to correct the configs in pom.xml, and make sure all 4 jackson libs are available in hadoop-client > Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml > - > > Key: YARN-2112 > URL: https://issues.apache.org/jira/browse/YARN-2112 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2112.1.patch > > > Now YarnClient is using TimelineClient, which has dependency on jackson libs. > However, the current dependency configurations make the hadoop-client > artifect miss 2 jackson libs, such that the applications which have > hadoop-client dependency will see the following exception > {code} > java.lang.NoClassDefFoundError: > org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637) > at java.lang.ClassLoader.defineClass(ClassLoader.java:621) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) > at java.net.URLClassLoader.access$000(URLClassLoader.java:58) > at java.net.URLClassLoader$1.run(URLClassLoader.java:197) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:190) > at java.lang.ClassLoader.loadClass(ClassLoader.java:306) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) > at java.lang.ClassLoader.loadClass(ClassLoader.java:247) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.(TimelineClientImpl.java:92) > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.mapred.ResourceMgrDelegate.(ResourceMgrDelegate.java:88) > at org.apache.hadoop.mapred.YARNRunner.(YARNRunner.java:111) > at > org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34) > at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95) > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:82) > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:75) > at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255) > at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:394) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303) > at > org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) > at > org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at > org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) > at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145) > at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.util.RunJar.main(RunJar.java:212) > Caused by: java.lang.ClassNotFoundException: > org.codehaus.jacks
[jira] [Created] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
Zhijie Shen created YARN-2112: - Summary: Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml Key: YARN-2112 URL: https://issues.apache.org/jira/browse/YARN-2112 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Now YarnClient is using TimelineClient, which has dependency on jackson libs. However, the current dependency configurations make the hadoop-client artifect miss 2 jackson libs, such that the applications which have hadoop-client dependency will see the following exception {code} java.lang.NoClassDefFoundError: org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637) at java.lang.ClassLoader.defineClass(ClassLoader.java:621) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) at java.net.URLClassLoader.access$000(URLClassLoader.java:58) at java.net.URLClassLoader$1.run(URLClassLoader.java:197) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.(TimelineClientImpl.java:92) at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.mapred.ResourceMgrDelegate.(ResourceMgrDelegate.java:88) at org.apache.hadoop.mapred.YARNRunner.(YARNRunner.java:111) at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34) at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95) at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:82) at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:75) at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255) at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:394) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303) at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145) at 
org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.jaxrs.JacksonJaxbJsonProvider at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:24
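A quick way to check whether a hadoop-client application actually has the JAX-RS Jackson provider on its classpath, which is the class the stack traces above fail to load. This is only a diagnostic sketch; the fix itself is the pom.xml dependency change described in the issue.
{code}
public class JacksonProviderProbe {
  public static void main(String[] args) {
    // The class TimelineClientImpl needs at construction time; its absence
    // produces the NoClassDefFoundError quoted above.
    String provider = "org.codehaus.jackson.jaxrs.JacksonJaxbJsonProvider";
    try {
      Class<?> c = Class.forName(provider);
      System.out.println(provider + " found in "
          + c.getProtectionDomain().getCodeSource().getLocation());
    } catch (ClassNotFoundException e) {
      System.out.println(provider + " is missing: the jackson-jaxrs (and"
          + " jackson-xc) jars are not on the client classpath");
    }
  }
}
{code}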
[jira] [Commented] (YARN-2098) App priority support in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011643#comment-14011643 ] Hadoop QA commented on YARN-2098: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647212/YARN-2098.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/3847//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3847//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3847//console This message is automatically generated. > App priority support in Fair Scheduler > -- > > Key: YARN-2098 > URL: https://issues.apache.org/jira/browse/YARN-2098 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.5.0 >Reporter: Ashwin Shankar >Assignee: Wei Yan > Attachments: YARN-2098.patch > > > This jira is created for supporting app priorities in fair scheduler. > AppSchedulable hard codes priority of apps to 1,we should > change this to get priority from ApplicationSubmissionContext. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2111) In FairScheduler.attemptScheduling, we don't count containers as assigned if they have 0 memory but non-zero cores
[ https://issues.apache.org/jira/browse/YARN-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-2111: - Summary: In FairScheduler.attemptScheduling, we don't count containers as assigned if they have 0 memory but non-zero cores (was: In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores) > In FairScheduler.attemptScheduling, we don't count containers as assigned if > they have 0 memory but non-zero cores > -- > > Key: YARN-2111 > URL: https://issues.apache.org/jira/browse/YARN-2111 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-2111.patch > > > {code} > if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource, > queueMgr.getRootQueue().assignContainer(node), > Resources.none())) { > {code} > As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores > here into account. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2111) In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores
[ https://issues.apache.org/jira/browse/YARN-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011625#comment-14011625 ] Hadoop QA commented on YARN-2111: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647204/YARN-2111.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3846//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3846//console This message is automatically generated. > In FairScheduler.attemptScheduling, we won't count containers as assigned if > they have 0 memory but non-zero cores > -- > > Key: YARN-2111 > URL: https://issues.apache.org/jira/browse/YARN-2111 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-2111.patch > > > {code} > if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource, > queueMgr.getRootQueue().assignContainer(node), > Resources.none())) { > {code} > As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores > here into account. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1801) NPE in public localizer
[ https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011601#comment-14011601 ] Jason Lowe commented on YARN-1801: -- Strictly speaking, the patch does prevent the NPE. However the public localizer is still effectively doomed if this condition occurs because it returns from the run() method. That will shutdown the localizer thread and public local resource requests will stop being processed. In that sense we've traded an NPE with a traceback for a one-line log message. I'm not sure this is an improvement, since at least the traceback is easier to notice in the NM log and we get a corresponding fatal log when someone goes hunting for what went wrong with the public localizer. The real issue is we need to understand what happened to cause pending.remove(completed) to return null. This should never happen, and if it does then it means we have a bug. Trying to recover from this condition is patching a symptom rather than a root cause. The problem that lead to the null request event _might_ have been fixed by YARN-1575 which wasn't present in 2.2 where the original bug occurred. It would be interesting to know if this has reoccurred since 2.3.0. Assuming this is still a potential issue, we should either find a way to prevent it from ever occurring or recover in a way that keeps the public localizer working as much as possible. It'd be great if we could just pull from the queue and receive a structure that has both the request event and the Future so we don't have to worry about a Future with no associated event. If we're going to try to recover instead, we'd have to log an error and try to cleanup. With no associated request event and no path if we got an execution error, it's going to be particularly difficult to recover properly. > NPE in public localizer > --- > > Key: YARN-1801 > URL: https://issues.apache.org/jira/browse/YARN-1801 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Jason Lowe >Assignee: Hong Zhiguo >Priority: Critical > Attachments: YARN-1801.patch > > > While investigating YARN-1800 found this in the NM logs that caused the > public localizer to shutdown: > {noformat} > 2014-01-23 01:26:38,655 INFO localizer.ResourceLocalizationService > (ResourceLocalizationService.java:addResource(651)) - Downloading public > rsrc:{ > hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, > 1390440382009, FILE, null } > 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService > (ResourceLocalizationService.java:run(726)) - Error: Shutting down > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712) > 2014-01-23 01:26:38,656 INFO localizer.ResourceLocalizationService > (ResourceLocalizationService.java:run(728)) - Public cache exiting > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
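A generic sketch of the structure Jason suggests at the end of his comment: keep each request and its Future together in one object, so the completion-handling loop can never observe a Future whose request is missing. The names below (Download, submitDownload) are illustrative and are not the actual ResourceLocalizationService fields.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.LinkedBlockingQueue;

public class PairedLocalizationSketch {
  /** Holds a request and its Future together, so neither can be orphaned. */
  static final class Download<R, V> {
    final R request;
    final FutureTask<V> result;
    Download(R request, FutureTask<V> result) {
      this.request = request;
      this.result = result;
    }
  }

  private final ExecutorService pool = Executors.newFixedThreadPool(4);
  private final BlockingQueue<Download<String, String>> completed =
      new LinkedBlockingQueue<Download<String, String>>();

  /** Run the download and enqueue the finished pair together with its request. */
  void submitDownload(final String request, final Callable<String> work) {
    pool.submit(new Runnable() {
      @Override public void run() {
        FutureTask<String> task = new FutureTask<String>(work);
        task.run();                              // executes the download
        completed.add(new Download<String, String>(request, task));
      }
    });
  }

  /** The handler thread always receives both pieces together. */
  Download<String, String> takeCompleted() throws InterruptedException {
    return completed.take();
  }
}
{code}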
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011587#comment-14011587 ] Sandy Ryza commented on YARN-1913: -- I think we should avoid doing approximate calculation through the minimum allocation. We need to handle situations where AM resources are much larger than the min, and situations where the minimum allocation will be 0 (common on Llama-enabled clusters). This would have the added benefit of avoiding touching the "runnability" machinery, which is already bordering on over-complicated. > With Fair Scheduler, cluster can logjam when all resources are consumed by AMs > -- > > Key: YARN-1913 > URL: https://issues.apache.org/jira/browse/YARN-1913 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Wei Yan > Labels: easyfix > Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, > YARN-1913.patch > > > It's possible to deadlock a cluster by submitting many applications at once, > and have all cluster resources taken up by AMs. > One solution is for the scheduler to limit resources taken up by AMs, as a > percentage of total cluster resources, via a "maxApplicationMasterShare" > config. -- This message was sent by Atlassian JIRA (v6.2#6252)
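A sketch of the kind of check Sandy is arguing for: compare the queue's actual aggregate AM resource usage, plus the new AM's real request, against a configured fraction of the queue's fair share, rather than estimating AM counts from the minimum allocation. The method and parameter names only loosely mirror whatever the final FairScheduler change does.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class AmShareCheckSketch {
  /**
   * True if starting an AM that needs amRequest keeps the queue's total
   * AM usage within maxAMShare (e.g. 0.5f) of its fair share.
   */
  static boolean canRunAm(Resource fairShare, Resource amResourceUsage,
      Resource amRequest, float maxAMShare) {
    Resource limit = Resources.multiply(fairShare, maxAMShare);
    Resource ifStarted = Resources.add(amResourceUsage, amRequest);
    return Resources.fitsIn(ifStarted, limit);
  }

  public static void main(String[] args) {
    Resource fairShare = Resource.newInstance(8192, 8);
    Resource amUsage = Resource.newInstance(2048, 2);
    Resource amRequest = Resource.newInstance(1024, 1);
    System.out.println(canRunAm(fairShare, amUsage, amRequest, 0.5f)); // true
  }
}
{code}
This covers both cases called out above: an AM larger than the minimum allocation is charged its real size, and a minimum allocation of 0 changes nothing because the minimum is never consulted.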
[jira] [Assigned] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He reassigned YARN-2109: - Assignee: Chen He > TestRM fails some tests when some tests run with CapacityScheduler and some > with FairScheduler > -- > > Key: YARN-2109 > URL: https://issues.apache.org/jira/browse/YARN-2109 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Anubhav Dhoot >Assignee: Chen He > > testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in > [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set > it to be CapacityScheduler. But if the default scheduler is set to > FairScheduler then the rest of the tests that execute after this will fail > with invalid cast exceptions when getting queuemetrics. This is based on test > execution order as only the tests that execute after this test will fail. > This is because the queuemetrics will be initialized by this test to > QueueMetrics and shared by the subsequent tests. > We can explicitly clear the metrics at the end of this test to fix this. > For example > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot > be cast to > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:90) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:85) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011582#comment-14011582 ] Chen He commented on YARN-2109: --- This is interesting and I will work on it. > TestRM fails some tests when some tests run with CapacityScheduler and some > with FairScheduler > -- > > Key: YARN-2109 > URL: https://issues.apache.org/jira/browse/YARN-2109 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Anubhav Dhoot > > testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in > [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set > it to be CapacityScheduler. But if the default scheduler is set to > FairScheduler then the rest of the tests that execute after this will fail > with invalid cast exceptions when getting queuemetrics. This is based on test > execution order as only the tests that execute after this test will fail. > This is because the queuemetrics will be initialized by this test to > QueueMetrics and shared by the subsequent tests. > We can explicitly clear the metrics at the end of this test to fix this. > For example > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot > be cast to > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:90) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:85) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011571#comment-14011571 ] Chen He commented on YARN-2110: --- I will take this. > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Chen He > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He reassigned YARN-2110: - Assignee: Chen He > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Chen He > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2098) App priority support in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2098: -- Attachment: YARN-2098.patch > App priority support in Fair Scheduler > -- > > Key: YARN-2098 > URL: https://issues.apache.org/jira/browse/YARN-2098 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.5.0 >Reporter: Ashwin Shankar >Assignee: Wei Yan > Attachments: YARN-2098.patch > > > This jira is created for supporting app priorities in fair scheduler. > AppSchedulable hard codes priority of apps to 1,we should > change this to get priority from ApplicationSubmissionContext. -- This message was sent by Atlassian JIRA (v6.2#6252)
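For reference, a small sketch of where the per-app priority would come from: ApplicationSubmissionContext already carries a Priority record set at submission time, so the scheduler can read it and fall back to the old hard-coded value of 1 when it is absent. How the attached patch actually wires this into AppSchedulable is not shown here.
{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Priority;

public class AppPrioritySketch {
  /** Use the submitted priority if present; otherwise keep the previous hard-coded 1. */
  static Priority priorityOf(ApplicationSubmissionContext context) {
    Priority submitted = context.getPriority();
    return (submitted != null) ? submitted : Priority.newInstance(1);
  }
}
{code}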
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011542#comment-14011542 ] Sandy Ryza commented on YARN-596: - (pending Jenkins) > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011540#comment-14011540 ] Sandy Ryza commented on YARN-596: - +1 > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2111) In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores
[ https://issues.apache.org/jira/browse/YARN-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-2111: - Attachment: YARN-2111.patch > In FairScheduler.attemptScheduling, we won't count containers as assigned if > they have 0 memory but non-zero cores > -- > > Key: YARN-2111 > URL: https://issues.apache.org/jira/browse/YARN-2111 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-2111.patch > > > {code} > if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource, > queueMgr.getRootQueue().assignContainer(node), > Resources.none())) { > {code} > As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores > here into account. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2111) In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores
[ https://issues.apache.org/jira/browse/YARN-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reassigned YARN-2111: Assignee: Sandy Ryza > In FairScheduler.attemptScheduling, we won't count containers as assigned if > they have 0 memory but non-zero cores > -- > > Key: YARN-2111 > URL: https://issues.apache.org/jira/browse/YARN-2111 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > {code} > if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource, > queueMgr.getRootQueue().assignContainer(node), > Resources.none())) { > {code} > As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores > here into account. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2111) In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores
Sandy Ryza created YARN-2111: Summary: In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores Key: YARN-2111 URL: https://issues.apache.org/jira/browse/YARN-2111 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Sandy Ryza {code} if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource, queueMgr.getRootQueue().assignContainer(node), Resources.none())) { {code} As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores here into account. -- This message was sent by Atlassian JIRA (v6.2#6252)
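A self-contained illustration of the behavior described above: DefaultResourceCalculator compares only memory, so an assignment of 0 MB but 1 vcore is not "greater than" Resources.none() and the scheduling attempt goes uncounted. The second check is one possible calculator-free comparison; whether the attached patch does exactly that is not shown here.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class ZeroMemoryAssignmentDemo {
  public static void main(String[] args) {
    ResourceCalculator calc = new DefaultResourceCalculator();
    Resource cluster = Resource.newInstance(8192, 8);
    Resource assigned = Resource.newInstance(0, 1);   // 0 MB, 1 vcore

    // Prints false: DefaultResourceCalculator looks at memory only.
    System.out.println(
        Resources.greaterThan(calc, cluster, assigned, Resources.none()));

    // Prints true: compares both memory and vcores, without the calculator.
    System.out.println(!assigned.equals(Resources.none()));
  }
}
{code}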
[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user
[ https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011524#comment-14011524 ] Vinod Kumar Vavilapalli commented on YARN-1063: --- Scanned through the patch. It's dense and full of windows related stuff which I am not entirely familiar with. Looked at the code from YARN container localization and launch POV. I have posted some comments on YARN-1972 which may cause some changes here too. > Winutils needs ability to create task as domain user > > > Key: YARN-1063 > URL: https://issues.apache.org/jira/browse/YARN-1063 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager > Environment: Windows >Reporter: Kyle Leckie >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, > YARN-1063.patch > > > h1. Summary: > Securing a Hadoop cluster requires constructing some form of security > boundary around the processes executed in YARN containers. Isolation based on > Windows user isolation seems most feasible. This approach is similar to the > approach taken by the existing LinuxContainerExecutor. The current patch to > winutils.exe adds the ability to create a process as a domain user. > h1. Alternative Methods considered: > h2. Process rights limited by security token restriction: > On Windows access decisions are made by examining the security token of a > process. It is possible to spawn a process with a restricted security token. > Any of the rights granted by SIDs of the default token may be restricted. It > is possible to see this in action by examining the security tone of a > sandboxed process launch be a web browser. Typically the launched process > will have a fully restricted token and need to access machine resources > through a dedicated broker process that enforces a custom security policy. > This broker process mechanism would break compatibility with the typical > Hadoop container process. The Container process must be able to utilize > standard function calls for disk and network IO. I performed some work > looking at ways to ACL the local files to the specific launched without > granting rights to other processes launched on the same machine but found > this to be an overly complex solution. > h2. Relying on APP containers: > Recent versions of windows have the ability to launch processes within an > isolated container. Application containers are supported for execution of > WinRT based executables. This method was ruled out due to the lack of > official support for standard windows APIs. At some point in the future > windows may support functionality similar to BSD jails or Linux containers, > at that point support for containers should be added. > h1. Create As User Feature Description: > h2. Usage: > A new sub command was added to the set of task commands. Here is the syntax: > winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE] > Some notes: > * The username specified is in the format of "user@domain" > * The machine executing this command must be joined to the domain of the user > specified > * The domain controller must allow the account executing the command access > to the user information. For this join the account to the predefined group > labeled "Pre-Windows 2000 Compatible Access" > * The account running the command must have several rights on the local > machine. 
These can be managed manually using secpol.msc: > ** "Act as part of the operating system" - SE_TCB_NAME > ** "Replace a process-level token" - SE_ASSIGNPRIMARYTOKEN_NAME > ** "Adjust memory quotas for a process" - SE_INCREASE_QUOTA_NAME > * The launched process will not have rights to the desktop so will not be > able to display any information or create UI. > * The launched process will have no network credentials. Any access of > network resources that requires domain authentication will fail. > h2. Implementation: > Winutils performs the following steps: > # Enable the required privileges for the current process. > # Register as a trusted process with the Local Security Authority (LSA). > # Create a new logon for the user passed on the command line. > # Load/Create a profile on the local machine for the new logon. > # Create a new environment for the new logon. > # Launch the new process in a job with the task name specified and using the > created logon. > # Wait for the JOB to exit. > h2. Future work: > The following work was scoped out of this check-in: > * Support for non-domain users or machines that are not domain joined. > * Support for privilege isolation by running the task launcher in a high > privilege service with access over an ACLed named pipe. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011523#comment-14011523 ] Sandy Ryza commented on YARN-2026: -- The nice thing about fair share currently is that it's interpretable as an amount of resources that, as long as you stay under, you won't get preempted. Changing it to depend on the running apps in the cluster severely complicates this. It used to be that each app and queue's fair share was min'd with its resource usage+demand, which is sort of a continuous analog to what you're suggesting, but we moved to the current definition when we added multi-resource scheduling. I'm wondering if the right way to solve this problem is to allow preemption to be triggered at higher levels in the queue hierarchy. I.e. suppose we have the following situation: * root has two children - parentA and parentB * each of root's children has two children - childA1, childA2, childB1, and childB2 * the parent queues' minShares are each set to half of the cluster resources * the child queue' minShares are each set to a quarter of the cluster resources * childA1 has a third of the cluster resources * childB1 and childB2 each have a third of the cluster resources Even though childA1 is above its fair/minShare, We would see that parentA is below its minShare, so we would preempt resources on its behalf. Once we have YARN-596 in, these resources would end up coming from parentB, and end up going to childA1. > Fair scheduler : Fair share for inactive queues causes unfair allocation in > some scenarios > -- > > Key: YARN-2026 > URL: https://issues.apache.org/jira/browse/YARN-2026 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Attachments: YARN-2026-v1.txt > > > While using hierarchical queues in fair scheduler,there are few scenarios > where we have seen a leaf queue with least fair share can take majority of > the cluster and starve a sibling parent queue which has greater weight/fair > share and preemption doesn’t kick in to reclaim resources. > The root cause seems to be that fair share of a parent queue is distributed > to all its children irrespective of whether its an active or an inactive(no > apps running) queue. Preemption based on fair share kicks in only if the > usage of a queue is less than 50% of its fair share and if it has demands > greater than that. When there are many queues under a parent queue(with high > fair share),the child queue’s fair share becomes really low. As a result when > only few of these child queues have apps running,they reach their *tiny* fair > share quickly and preemption doesn’t happen even if other leaf > queues(non-sibling) are hogging the cluster. > This can be solved by dividing fair share of parent queue only to active > child queues. > Here is an example describing the problem and proposed solution: > root.lowPriorityQueue is a leaf queue with weight 2 > root.HighPriorityQueue is parent queue with weight 8 > root.HighPriorityQueue has 10 child leaf queues : > root.HighPriorityQueue.childQ(1..10) > Above config,results in root.HighPriorityQueue having 80% fair share > and each of its ten child queue would have 8% fair share. Preemption would > happen only if the child queue is <4% (0.5*8=4). > Lets say at the moment no apps are running in any of the > root.HighPriorityQueue.childQ(1..10) and few apps are running in > root.lowPriorityQueue which is taking up 95% of the cluster. 
> Up till this point,the behavior of FS is correct. > Now,lets say root.HighPriorityQueue.childQ1 got a big job which requires 30% > of the cluster. It would get only the available 5% in the cluster and > preemption wouldn't kick in since its above 4%(half fair share).This is bad > considering childQ1 is under a highPriority parent queue which has *80% fair > share*. > Until root.lowPriorityQueue starts relinquishing containers,we would see the > following allocation on the scheduler page: > *root.lowPriorityQueue = 95%* > *root.HighPriorityQueue.childQ1=5%* > This can be solved by distributing a parent’s fair share only to active > queues. > So in the example above,since childQ1 is the only active queue > under root.HighPriorityQueue, it would get all its parent’s fair share i.e. > 80%. > This would cause preemption to reclaim the 30% needed by childQ1 from > root.lowPriorityQueue after fairSharePreemptionTimeout seconds. > Also note that similar situation can happen between > root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2,if childQ2 > hogs the cluster. childQ2 can take up 95% cluster and childQ1 would be stuck > at 5%,until chil
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011515#comment-14011515 ] Vinod Kumar Vavilapalli commented on YARN-1972: --- Thanks for working on this Remus. Can you upload the "short design"? Questions/comments on the approach and the patch in the mean-while h4. Approach - What are the requirements on the NodeManager user? Can it run as a regular 'yarn' user, spawn the winutils shell and automatically launch task as some other user? Is there any admin setup that is needed for this to say grant such privileges to 'yarn' user? - One reason why we resorted to duplicate most of the code in DefaultContainerExecutor in container-executor.c for linux is performance. You are launching so many commands for every container - to chown files, to copy files etc. You should measure the performance impact of this to figure out if what the patch does is fine or if we should imitate what the linux-executor does. h4. Patch WindowsSecureContainerExecutor - The overridden getRunCommand skips things like the setting niceness feature (YARN-443) in linux. Arguably this isn't working in non-secure mode before anyways. Is there a way we can bump process-priority in windows? If so, when we add that feature, we'll need to be careful to change both the default and the secure Executor. - namenodeGroup -> nodeManagerGroup - The division of responsibility between launching multiple commands before starting the localizer and the stuff that happens inside the localizer: Localizer already does createUserLocalDirs etc. So you don't need to do them explicitly in the java code inside NodeManager process. - In the minimum we should definitely move exec.localizeClasspathJar() related stuff into the winutils start-process code. - Why is appLocalizationCounter needed? Once we tackle container-preserving NM-restart (YARN-1336), this will be an issue. Why cannot we simply use the localizerId? That is unique enough if we want uniqueness. - Also the startLocalizer() method is a near clone of what exists in LinuxContainerExecutor. We should refactor and reuse, otherwise it will be a maintenance headache. > Implement secure Windows Container Executor > --- > > Key: YARN-1972 > URL: https://issues.apache.org/jira/browse/YARN-1972 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1972.1.patch > > > This work item represents the Java side changes required to implement a > secure windows container executor, based on the YARN-1063 changes on > native/winutils side. > Necessary changes include leveraging the winutils task createas to launch the > container process as the required user and a secure localizer (launch > localization as a separate process running as the container user). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2068) FairScheduler uses the same ResourceCalculator for all policies
[ https://issues.apache.org/jira/browse/YARN-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved YARN-2068. -- Resolution: Invalid Closing this as invalid. Obviously feel free to reopen if I'm missing something. > FairScheduler uses the same ResourceCalculator for all policies > --- > > Key: YARN-2068 > URL: https://issues.apache.org/jira/browse/YARN-2068 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > FairScheduler uses the same ResourceCalculator for all policies including > DRF. Need to fix that. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011507#comment-14011507 ] Sandy Ryza commented on YARN-1474: -- My opinion is that it's ok to change these semantics, as ResourceScheduler is marked Evolving. Given the complexity of writing a YARN scheduler, I also seriously doubt that there are custom ones out there outside of academic contexts, so I'm comfortable erring on the opposite side of caution. > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, > YARN-1474.17.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, > YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, > YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
Anubhav Dhoot created YARN-2110: --- Summary: TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler Key: YARN-2110 URL: https://issues.apache.org/jira/browse/YARN-2110 Project: Hadoop YARN Issue Type: Bug Environment: The TestAMRestart#testAMRestartWithExistingContainers does a cast to CapacityScheduler in a couple of places {code} ((CapacityScheduler) rm1.getResourceScheduler()) {code} If run with FairScheduler as default scheduler the test throws {code} java.lang.ClassCastException {code}. Reporter: Anubhav Dhoot -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2110: Description: The TestAMRestart#testAMRestartWithExistingContainers does a cast to CapacityScheduler in a couple of places {code} ((CapacityScheduler) rm1.getResourceScheduler()) {code} If run with FairScheduler as default scheduler the test throws {code} java.lang.ClassCastException {code}. Environment: (was: The TestAMRestart#testAMRestartWithExistingContainers does a cast to CapacityScheduler in a couple of places {code} ((CapacityScheduler) rm1.getResourceScheduler()) {code} If run with FairScheduler as default scheduler the test throws {code} java.lang.ClassCastException {code}.) > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
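A hedged sketch of one way TestAMRestart could avoid the hard cast, assuming the test only needs methods inherited from AbstractYarnScheduler (which both CapacityScheduler and FairScheduler extend); the eventual patch may take a different approach, and the variable names below are placeholders from a typical MockRM test:
{code}
// Instead of:
//   ((CapacityScheduler) rm1.getResourceScheduler())
// cast to the base class shared by both schedulers:
AbstractYarnScheduler scheduler =
    (AbstractYarnScheduler) rm1.getResourceScheduler();
// e.g. look up the running attempt without assuming a concrete scheduler type
// (am1 is the MockAM handle used elsewhere in the test -- placeholder name).
SchedulerApplicationAttempt attempt =
    scheduler.getApplicationAttempt(am1.getApplicationAttemptId());
{code}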
[jira] [Assigned] (YARN-2098) App priority support in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan reassigned YARN-2098: - Assignee: Wei Yan > App priority support in Fair Scheduler > -- > > Key: YARN-2098 > URL: https://issues.apache.org/jira/browse/YARN-2098 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.5.0 >Reporter: Ashwin Shankar >Assignee: Wei Yan > > This jira is created for supporting app priorities in fair scheduler. > AppSchedulable hard codes priority of apps to 1,we should > change this to get priority from ApplicationSubmissionContext. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2107) Refactor timeline classes into server.timeline package
[ https://issues.apache.org/jira/browse/YARN-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011429#comment-14011429 ] Hudson commented on YARN-2107: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5616 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5616/]) YARN-2107. Refactored timeline classes into o.a.h.y.s.timeline package. Contributed by Vinod Kumar Vavilapalli. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1598094) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/timeline * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AHSWebApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TimelineWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/EntityIdentifier.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/GenericObjectMapper.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/NameValuePair.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineReader.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineWriter.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/package-info.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineACLsManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilter.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilterInitializer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineClientAuthenticationService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineDelegationTokenSecretManagerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp * /hadoop/common/trunk/hadoop-yarn-project
[jira] [Commented] (YARN-800) Clicking on an AM link for a running app leads to a HTTP 500
[ https://issues.apache.org/jira/browse/YARN-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011420#comment-14011420 ] Dave Disser commented on YARN-800: -- I'm seeing this issue regardless of the status of yarn.resourcemanager.hostname and yarn.web-proxy.address. I notice the following in my nodemanager log file: 2014-05-28 14:06:20,478 INFO webproxy.WebAppProxyServlet (WebAppProxyServlet.java:doGet(330)) - dr.who is accessing unchecked http://hdp003-3:59959/ which is the app master GUI of application_1401300304842_0001 owned by hdfs I can try to retrieve this URL directly: hdp003-2:~ # wget -O - http://hdp003-3:59959/ --2014-05-28 14:06:47-- http://hdp003-3:59959/ Resolving hdp003-3... 39.64.24.3 Connecting to hdp003-3|39.64.24.3|:59959... connected. HTTP request sent, awaiting response... 302 Found Location: http://hdp003-3:59959/mapreduce [following] --2014-05-28 14:06:47-- http://hdp003-3:59959/mapreduce Reusing existing connection to hdp003-3:59959. HTTP request sent, awaiting response... 302 Found Location: http://hdp003-3:8088/proxy/application_1401300304842_0001/mapreduce [following] --2014-05-28 14:06:47-- http://hdp003-3:8088/proxy/application_1401300304842_0001/mapreduce Connecting to hdp003-3|39.64.24.3|:8088... failed: Connection refused. Resolving hdp003-3... 39.64.24.3 Connecting to hdp003-3|39.64.24.3|:8088... failed: Connection refused. The node running the AM is proxying the request to itself, where there is no proxy running. If I do the same on the node where AM is running, I get the proper result: hdp003-3:~ # wget -O - http://hdp003-3:59959/ --2014-05-28 14:07:25-- http://hdp003-3:59959/ Resolving hdp003-3... 39.64.24.3 Connecting to hdp003-3|39.64.24.3|:59959... connected. HTTP request sent, awaiting response... 302 Found Location: http://hdp003-3:59959/mapreduce [following] --2014-05-28 14:07:25-- http://hdp003-3:59959/mapreduce Reusing existing connection to hdp003-3:59959. HTTP request sent, awaiting response... 200 OK Length: 6224 (6.1K) [text/html] Saving to: `STDOUT' 0% [ ] 0 --.-K/s < !DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd";>
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011362#comment-14011362 ] Karthik Kambatla commented on YARN-2010: I see. Thanks for the input. Let me check if that is indeed the case, and attempt recovering the app even if the key is null Regardless, do we agree that we still need to address the case where the app recovery fails for potentially other reasons? > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Rohith >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 
5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ... 13 more > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011355#comment-14011355 ] Jian He commented on YARN-2010: --- bq. The stack trace corresponds to non-work-preserving restart. I am not sure I understand the concern. What I meant is, in this scenario, it shouldn't matter whether the old attempt has the master key or not, since the old attempt will be anyways killed by NM on RM restart. The newly started attempt will have the proper master key generated. If we just check whether the key is null and move on, the next attempt should be able to succeed. So we don't need to explicitly fail the app ? > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Rohith >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 
5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ...
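A hedged sketch of the null-guard idea discussed in the two YARN-2010 comments above (accessor and constant names are hypothetical; the real recovery path in RMAppAttemptImpl may look different). The point is to skip registering a client-to-AM master key that was never persisted, e.g. for apps submitted before security was enabled, rather than failing the RM's transition to active:
{code}
byte[] clientTokenMasterKey =
    credentials.getSecretKey(CLIENT_TO_AM_MASTER_KEY_NAME); // assumed lookup
if (clientTokenMasterKey != null && clientTokenMasterKey.length > 0) {
  rmContext.getClientToAMTokenSecretManager()
      .registerMasterKey(getAppAttemptId(), clientTokenMasterKey);
} else {
  // The old attempt predates security (or the key was never stored); the next
  // attempt gets a freshly generated key, so recovery need not fail here.
  LOG.warn("No client-to-AM master key recovered for " + getAppAttemptId());
}
{code}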
[jira] [Updated] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-596: - Attachment: YARN-596.patch > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011335#comment-14011335 ] Xuan Gong commented on YARN-2054: - [~kasha] Looks like you need to update the patch. There are lots of unrelated changes. > Poor defaults for YARN ZK configs for retries and retry-inteval > --- > > Key: YARN-2054 > URL: https://issues.apache.org/jira/browse/YARN-2054 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch > > > Currently, we have the following default values: > # yarn.resourcemanager.zk-num-retries - 500 > # yarn.resourcemanager.zk-retry-interval-ms - 2000 > This leads to a cumulative 1000 seconds before the RM gives up trying to > connect to the ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
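Until better defaults ship, a hedged example of overriding the two properties programmatically (property names are taken from the description above; the values are arbitrary illustrations, not recommendations):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ZkRetryOverride {
  public static Configuration withTighterZkRetries() {
    Configuration conf = new YarnConfiguration();
    // 30 retries x 1000 ms is roughly 30 seconds before the RM gives up on ZK,
    // versus the current defaults' 500 x 2000 ms = 1000 seconds.
    conf.setInt("yarn.resourcemanager.zk-num-retries", 30);
    conf.setLong("yarn.resourcemanager.zk-retry-interval-ms", 1000);
    return conf;
  }
}
{code}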
[jira] [Updated] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-596: - Attachment: YARN-596.patch Thanks, Sandy. Upload a new patch to fix your comments. > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011331#comment-14011331 ] Karthik Kambatla commented on YARN-1474: I think that is the step in the right direction. I agree it is a change in semantics. Might be a good idea to see what others think. [~sandyr], [~vinodkv] - do you guys think it is okay to change the semantics on how a scheduler is used: - Before this patch, we create a scheduler and call reinitialize(). - After this patch, I am proposing scheduler.setRMContext(), scheduler.init(), and then scheduler.reinitialize() for later updates to allocation-files etc. Scheduler initialization is within the RM, and we haven't exposed the scheduler API for users to write custom schedulers yet. > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, > YARN-1474.17.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, > YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, > YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
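A hedged sketch of the calling order proposed above (the method names are the ones mentioned in the comment; exact signatures in the patch may differ):
{code}
FairScheduler scheduler = new FairScheduler();
scheduler.setRMContext(rmContext);        // 1. wire in the RM context first
scheduler.init(conf);                     // 2. one-time init via the service API
scheduler.start();                        // 3. start as a service
// ... later, e.g. when the allocation file changes:
scheduler.reinitialize(conf, rmContext);  // 4. reinitialize only for updates
{code}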
[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011325#comment-14011325 ] Hadoop QA commented on YARN-1338: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647161/YARN-1338v6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 16 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3844//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3844//console This message is automatically generated. > Recover localized resource cache state upon nodemanager restart > --- > > Key: YARN-1338 > URL: https://issues.apache.org/jira/browse/YARN-1338 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1338.patch, YARN-1338v2.patch, > YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch, YARN-1338v5.patch, > YARN-1338v6.patch > > > Today when node manager restarts we clean up all the distributed cache files > from disk. This is definitely not ideal from 2 aspects. > * For work preserving restart we definitely want them as running containers > are using them > * For even non work preserving restart this will be useful in the sense that > we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011320#comment-14011320 ] Tsuyoshi OZAWA commented on YARN-1474: -- Thanks Karthik for the comments. I'd like to make sure one point: {quote} 1. In each of the schedulers, I don't think we need the following snippet or for that matter the variable initialized at all. reinitialize() would have just the contents of else-block. {quote} If we change {{reinitialize()}} so that it has just the contents of the else-block, we need to change lots of scheduler-related test cases that don't use ResourceManager/MockRM so that they call {{scheduler.init()}} right after {{scheduler.setRMContext()}}. Is that acceptable for us? > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, > YARN-1474.17.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, > YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, > YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler
Anubhav Dhoot created YARN-2109: --- Summary: TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler Key: YARN-2109 URL: https://issues.apache.org/jira/browse/YARN-2109 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Anubhav Dhoot testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set it to be CapacityScheduler. But if the default scheduler is set to FairScheduler then the rest of the tests that execute after this will fail with invalid cast exceptions when getting queuemetrics. This is based on test execution order as only the tests that execute after this test will fail. This is because the queuemetrics will be initialized by this test to QueueMetrics and shared by the subsequent tests. We can explicitly clear the metrics at the end of this test to fix this. For example java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot be cast to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:90) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:85) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:81) at org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232) -- This message was sent by Atlassian JIRA (v6.2#6252)
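A hedged sketch of the "explicitly clear the metrics" idea from YARN-2109 above (assuming QueueMetrics.clearQueueMetrics() and a metrics-system shutdown are the right reset hooks; the actual fix may differ):
{code}
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;
import org.junit.After;

public class TestRmCleanupSketch {
  @After
  public void tearDown() {
    // Drop the statically cached QueueMetrics so a later test that runs with
    // FairScheduler re-creates FSQueueMetrics instead of hitting the
    // ClassCastException shown above.
    QueueMetrics.clearQueueMetrics();
    DefaultMetricsSystem.shutdown();
  }
}
{code}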
[jira] [Updated] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1338: - Attachment: YARN-1338v6.patch Thanks for the additional comments, Junping. bq. Do we have any code to destroy DB items for NMState when NM is decommissioned (not expecting short-term restart)? Good point. I added shutdown code that removes the recovery directory if the shutdown is due to a decommission. I also added a unit test for this scenario. {quote} In LocalResourcesTrackerImpl#recoverResource() +incrementFileCountForLocalCacheDirectory(localDir.getParent()); Given localDir is already the parent of localPath, may be we should just increment locaDir rather than its parent? I didn't see we have unit test to check file count for resource directory after recovery. May be we should add some? {quote} The last component of localDir is the unique resource ID and not a directory managed by the local cache directory manager. The directory allocated by the local cache directory manager has an additional directory added by the localization process which is named after the unique ID for the local resource. For example, the localPath might be something like /local/root/0/1/52/resource.jar and localDir is /local/root/0/1/52. The '52' is the unique resource ID (always >= 10 so it can't conflict with single-character cache mgr subdirs) and /local/root/0/1 is the directory managed by the local dir cache manager. If we passed localDir to the local dir cache manager it would get confused since it would try to parse the last component as a subdirectory it created but it isn't that. I did add a unit test to verify local cache directory counts are incremented properly when resources are recovered. This required exposing a couple of methods as package-private to get the necessary information for the test. > Recover localized resource cache state upon nodemanager restart > --- > > Key: YARN-1338 > URL: https://issues.apache.org/jira/browse/YARN-1338 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1338.patch, YARN-1338v2.patch, > YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch, YARN-1338v5.patch, > YARN-1338v6.patch > > > Today when node manager restarts we clean up all the distributed cache files > from disk. This is definitely not ideal from 2 aspects. > * For work preserving restart we definitely want them as running containers > are using them > * For even non work preserving restart this will be useful in the sense that > we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
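A small illustration of the directory layout described in the YARN-1338 comment above (the concrete paths are the example values from the comment, not real cluster paths):
{code}
import org.apache.hadoop.fs.Path;

// The localized file itself:
Path localPath = new Path("/local/root/0/1/52/resource.jar");
// Its parent's last component ("52") is the unique resource ID, not a
// directory managed by LocalCacheDirectoryManager:
Path localDir = localPath.getParent();        // /local/root/0/1/52
// The directory the cache manager actually tracks is one level further up,
// hence incrementFileCountForLocalCacheDirectory(localDir.getParent()):
Path cacheManagedDir = localDir.getParent();  // /local/root/0/1
{code}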
[jira] [Commented] (YARN-2012) Fair Scheduler: allow default queue placement rule to take an arbitrary queue
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011183#comment-14011183 ] Hudson commented on YARN-2012: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1784 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1784/]) YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > Fair Scheduler: allow default queue placement rule to take an arbitrary queue > - > > Key: YARN-2012 > URL: https://issues.apache.org/jira/browse/YARN-2012 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Fix For: 2.5.0 > > Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt, YARN-2012-v3.txt > > > Currently 'default' rule in queue placement policy,if applied,puts the app in > root.default queue. It would be great if we can make 'default' rule > optionally point to a different queue as default queue . > This default queue can be a leaf queue or it can also be an parent queue if > the 'default' rule is nested inside nestedUserQueue rule(YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2105) Fix TestFairScheduler after YARN-2012
[ https://issues.apache.org/jira/browse/YARN-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011182#comment-14011182 ] Hudson commented on YARN-2105: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1784 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1784/]) YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > Fix TestFairScheduler after YARN-2012 > - > > Key: YARN-2105 > URL: https://issues.apache.org/jira/browse/YARN-2105 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: Ashwin Shankar > Fix For: 2.5.0 > > Attachments: YARN-2105-v1.txt > > > The following tests fail in trunk: > {code} > Failed tests: > TestFairScheduler.testDontAllowUndeclaredPools:2412 expected:<1> but was:<0> > Tests in error: > TestFairScheduler.testQueuePlacementWithPolicy:624 NullPointer > TestFairScheduler.testNotUserAsDefaultQueue:530 » NullPointer > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2105) Fix TestFairScheduler after YARN-2012
[ https://issues.apache.org/jira/browse/YARN-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011101#comment-14011101 ] Hudson commented on YARN-2105: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1757 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1757/]) YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > Fix TestFairScheduler after YARN-2012 > - > > Key: YARN-2105 > URL: https://issues.apache.org/jira/browse/YARN-2105 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: Ashwin Shankar > Fix For: 2.5.0 > > Attachments: YARN-2105-v1.txt > > > The following tests fail in trunk: > {code} > Failed tests: > TestFairScheduler.testDontAllowUndeclaredPools:2412 expected:<1> but was:<0> > Tests in error: > TestFairScheduler.testQueuePlacementWithPolicy:624 NullPointer > TestFairScheduler.testNotUserAsDefaultQueue:530 » NullPointer > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2012) Fair Scheduler: allow default queue placement rule to take an arbitrary queue
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011103#comment-14011103 ] Hudson commented on YARN-2012: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1757 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1757/]) YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > Fair Scheduler: allow default queue placement rule to take an arbitrary queue > - > > Key: YARN-2012 > URL: https://issues.apache.org/jira/browse/YARN-2012 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Fix For: 2.5.0 > > Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt, YARN-2012-v3.txt > > > Currently 'default' rule in queue placement policy,if applied,puts the app in > root.default queue. It would be great if we can make 'default' rule > optionally point to a different queue as default queue . > This default queue can be a leaf queue or it can also be an parent queue if > the 'default' rule is nested inside nestedUserQueue rule(YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2012) Fair Scheduler: allow default queue placement rule to take an arbitrary queue
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011022#comment-14011022 ] Hudson commented on YARN-2012: -- FAILURE: Integrated in Hadoop-Yarn-trunk #566 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/566/]) YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > Fair Scheduler: allow default queue placement rule to take an arbitrary queue > - > > Key: YARN-2012 > URL: https://issues.apache.org/jira/browse/YARN-2012 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Fix For: 2.5.0 > > Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt, YARN-2012-v3.txt > > > Currently 'default' rule in queue placement policy,if applied,puts the app in > root.default queue. It would be great if we can make 'default' rule > optionally point to a different queue as default queue . > This default queue can be a leaf queue or it can also be an parent queue if > the 'default' rule is nested inside nestedUserQueue rule(YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2105) Fix TestFairScheduler after YARN-2012
[ https://issues.apache.org/jira/browse/YARN-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011021#comment-14011021 ] Hudson commented on YARN-2105: -- FAILURE: Integrated in Hadoop-Yarn-trunk #566 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/566/]) YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > Fix TestFairScheduler after YARN-2012 > - > > Key: YARN-2105 > URL: https://issues.apache.org/jira/browse/YARN-2105 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: Ashwin Shankar > Fix For: 2.5.0 > > Attachments: YARN-2105-v1.txt > > > The following tests fail in trunk: > {code} > Failed tests: > TestFairScheduler.testDontAllowUndeclaredPools:2412 expected:<1> but was:<0> > Tests in error: > TestFairScheduler.testQueuePlacementWithPolicy:624 NullPointer > TestFairScheduler.testNotUserAsDefaultQueue:530 » NullPointer > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010998#comment-14010998 ] Hadoop QA commented on YARN-2054: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647073/yarn-2054-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 27 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-common-project/hadoop-nfs hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs-nfs hadoop-tools/hadoop-distcp hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell org.apache.hadoop.yarn.client.TestRMAdminCLI org.apache.hadoop.hdfs.server.namenode.TestSecondaryNameNodeUpgrade {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3843//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3843//console This message is automatically generated. > Poor defaults for YARN ZK configs for retries and retry-inteval > --- > > Key: YARN-2054 > URL: https://issues.apache.org/jira/browse/YARN-2054 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch > > > Currenly, we have the following default values: > # yarn.resourcemanager.zk-num-retries - 500 > # yarn.resourcemanager.zk-retry-interval-ms - 2000 > This leads to a cumulate 1000 seconds before the RM gives up trying to > connect to the ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2092) Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/YARN-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010906#comment-14010906 ] Steve Loughran commented on YARN-2092: -- # code using Apache Curator was one -in YARN and on the client if you picked up the locally installed hadoop CP. And we couldn't upload a newer version of Jackson, because again, what's on the CP is what you get. # If, client-side you abandon that CP and use/redist your entire hadoop binary set then you can reduce the risk here, at the expense of taking away from ops any control of the versions of things you run on the cluster, but then you now have to deal with older files in the cluster. # or you ignore yarn.lib.classpath entirely, *somehow* work out the values of yarn-site.xml &c, and re-upload every single hadoop-*.jar and its chosen binaries into every single container. having a YARN artifact repo will reduce the cost of that, but add a new one: bug fixes in hadoop will only propagate when the apps are rebuilt. # ..if you look at the HADOOP-9991 issue you can see links to some places where the outdated JARs in Hadoop cause problems for other ASF projects. # Tez appears to have broken because it was explicity putting the 1.8.x JARs on its list of binaries to upload. It only worked because it was using exactly the same version. # if you adopt a policy of change no dependencies that break apps that upload duplicate JARs to the CP -then this goes beyond Jackson, it says "hadoop cannot update any of its dependencies". That would go for 2.x and no doubt even if we did update things for 3.x, then we'll still get "you broke my code that uploaded jackson 1.8" issues. # ...and we haven't gone near Guava yet, which is frozen because it really is so brittle, but that means we can't pick up the guava 16.x-only fixes needed to work with the latest JVMs. If you do want to revoke jackson, I'm not going to veto it -but it goes beyond YARN, and we may as well revert every single HADOOP-9991-related upgrade. > Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to > 2.5.0-SNAPSHOT > > > Key: YARN-2092 > URL: https://issues.apache.org/jira/browse/YARN-2092 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Hitesh Shah > > Came across this when trying to integrate with the timeline server. Using a > 1.8.8 dependency of jackson works fine against 2.4.0 but fails against > 2.5.0-SNAPSHOT which needs 1.9.13. This is in the scenario where the user > jars are first in the classpath. -- This message was sent by Atlassian JIRA (v6.2#6252)