[jira] [Commented] (YARN-2041) Hard to co-locate MR2 and Spark jobs on the same cluster in YARN
[ https://issues.apache.org/jira/browse/YARN-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012134#comment-14012134 ] Nishkam Ravi commented on YARN-2041: Sure. There seem to be multiple issues. Two seem to clearly stand out: 1. Performance degrades with FIFO for large values of memory-mb in both single-job and multi-job mode. Observed for multiple benchmarks including TeraSort, TeraValidate, TeraGen, WordCount, ShuffleText. Issue: FIFO seems to be allocating too many jobs at once on a single node. 2. Performance with Capacity scheduler suffers for large values of memory-mb only for TeraValidate (in single-job mode). Issue: why does Capacity scheduler regress for TeraValidate? > Hard to co-locate MR2 and Spark jobs on the same cluster in YARN > > > Key: YARN-2041 > URL: https://issues.apache.org/jira/browse/YARN-2041 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Nishkam Ravi > > Performance of MR2 jobs falls drastically as YARN config parameter > yarn.nodemanager.resource.memory-mb is increased beyond a certain value. > Performance of Spark falls drastically as the value of > yarn.nodemanager.resource.memory-mb is decreased beyond a certain value for a > large data set. > This makes it hard to co-locate MR2 and Spark jobs in YARN. > The experiments are being conducted on a 6-node cluster. The following > workloads are being run: TeraGen, TeraSort, TeraValidate, WordCount, > ShuffleText and PageRank. > Will add more details to this JIRA over time. -- This message was sent by Atlassian JIRA (v6.2#6252)
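The comment above attributes the FIFO regression to the scheduler packing too many tasks onto a single node once yarn.nodemanager.resource.memory-mb is raised. A minimal, self-contained sketch of why that parameter directly bounds per-node concurrency; the container and node sizes below are hypothetical, not taken from the benchmark runs:

{code:java}
// Illustrative only: yarn.nodemanager.resource.memory-mb caps how many containers a
// scheduler may pack onto one NodeManager. All numbers here are hypothetical.
public class NodePackingSketch {
  static int maxContainersPerNode(int nodeMemoryMb, int containerMemoryMb) {
    return nodeMemoryMb / containerMemoryMb; // integer division: whole containers only
  }

  public static void main(String[] args) {
    int containerMb = 1024; // e.g. a 1 GB MR2 task container
    System.out.println(maxContainersPerNode(16 * 1024, containerMb)); // 16 concurrent tasks
    System.out.println(maxContainersPerNode(40 * 1024, containerMb)); // 40 concurrent tasks
  }
}
{code}

Under that reading, raising the setting from 16 GB to 40 GB lets FIFO legally place 2.5x as many concurrent tasks per node, which is consistent with the reported slowdown if disk or CPU on the node becomes the bottleneck.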
[jira] [Created] (YARN-2113) CS Preemption should respect user-limits
Vinod Kumar Vavilapalli created YARN-2113: - Summary: CS Preemption should respect user-limits Key: YARN-2113 URL: https://issues.apache.org/jira/browse/YARN-2113 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Fix For: 2.5.0 This is different from (even if related to, and likely shares code with) YARN-2069. YARN-2069 focuses on making sure that even if a queue has its guaranteed capacity, its individual users are treated in line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) Use scheduling policies throughout the queue hierarchy to decide which containers to preempt
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012052#comment-14012052 ] Hudson commented on YARN-596: - SUCCESS: Integrated in Hadoop-trunk-Commit #5619 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5619/]) YARN-596. Use scheduling policies throughout the queue hierarchy to decide which containers to preempt (Wei Yan via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1598197) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AppSchedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSParentQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSSchedulerApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/Schedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/SchedulingPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FakeSchedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java > Use scheduling policies throughout the queue hierarchy to decide which > 
containers to preempt > > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Fix For: 2.5.0 > > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
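For readers new to this JIRA, the quoted description refers to the pre-YARN-596 selection logic: containers from every app in an over-fair-share queue are pooled and ordered purely by requested priority. A minimal sketch of that (now replaced) ordering, using hypothetical types, to make the "shielding" problem concrete:

{code:java}
// Illustrative pseudocode of the selection this change replaces (hypothetical types).
// All containers from over-fair-share queues are pooled and sorted by the priority they
// were requested at, so an app can shield itself by requesting at high priority.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CandidateContainer {
  final String app;
  final int requestPriority; // in YARN, a lower value means a higher priority
  CandidateContainer(String app, int requestPriority) {
    this.app = app;
    this.requestPriority = requestPriority;
  }
}

public class OldPreemptionOrder {
  static List<CandidateContainer> preemptionOrder(List<CandidateContainer> fromOverShareQueues) {
    List<CandidateContainer> ordered = new ArrayList<>(fromOverShareQueues);
    // Lowest-priority requests are preempted first; per-app fairness is never consulted.
    ordered.sort(Comparator.comparingInt((CandidateContainer c) -> c.requestPriority).reversed());
    return ordered;
  }
}
{code}

The committed change instead walks the queue hierarchy and delegates the choice to each queue's SchedulingPolicy, as the touched files (FSParentQueue, FSLeafQueue, SchedulingPolicy and the policies package) indicate.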
[jira] [Commented] (YARN-596) Use scheduling policies throughout the queue hierarchy to decide which containers to preempt
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012038#comment-14012038 ] Sandy Ryza commented on YARN-596: - I just committed this to trunk and branch-2. Thanks Wei for the patch and Ashwin for taking a look. > Use scheduling policies throughout the queue hierarchy to decide which > containers to preempt > > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Fix For: 2.5.0 > > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-596) Use scheduling policies throughout the queue hierarchy to decide which containers to preempt
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-596: Summary: Use scheduling policies throughout the queue hierarchy to decide which containers to preempt (was: Use scheduling policies throughout the hierarchy to decide which containers to preempt) > Use scheduling policies throughout the queue hierarchy to decide which > containers to preempt > > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-596) Use scheduling policies throughout the hierarchy to decide which containers to preempt
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-596: Summary: Use scheduling policies throughout the hierarchy to decide which containers to preempt (was: In fair scheduler, intra-application container priorities affect inter-application preemption decisions) > Use scheduling policies throughout the hierarchy to decide which containers > to preempt > -- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012030#comment-14012030 ] Sandy Ryza commented on YARN-596: - +1 > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2041) Hard to co-locate MR2 and Spark jobs on the same cluster in YARN
[ https://issues.apache.org/jira/browse/YARN-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011982#comment-14011982 ] Vinod Kumar Vavilapalli commented on YARN-2041: --- Thanks for all the updates, [~nravi], but can you please make clear the issues that you think need to be fixed? > Hard to co-locate MR2 and Spark jobs on the same cluster in YARN > > > Key: YARN-2041 > URL: https://issues.apache.org/jira/browse/YARN-2041 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Nishkam Ravi > > Performance of MR2 jobs falls drastically as YARN config parameter > yarn.nodemanager.resource.memory-mb is increased beyond a certain value. > Performance of Spark falls drastically as the value of > yarn.nodemanager.resource.memory-mb is decreased beyond a certain value for a > large data set. > This makes it hard to co-locate MR2 and Spark jobs in YARN. > The experiments are being conducted on a 6-node cluster. The following > workloads are being run: TeraGen, TeraSort, TeraValidate, WordCount, > ShuffleText and PageRank. > Will add more details to this JIRA over time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011943#comment-14011943 ] Hadoop QA commented on YARN-2010: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647268/yarn-2010-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3854//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3854//console This message is automatically generated. > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Rohith >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, > yarn-2010-3.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 
4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ...
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011918#comment-14011918 ] Hadoop QA commented on YARN-2091: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647261/YARN-2091.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainersMonitor {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3853//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3853//console This message is automatically generated. > Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters > --- > > Key: YARN-2091 > URL: https://issues.apache.org/jira/browse/YARN-2091 > Project: Hadoop YARN > Issue Type: Task >Reporter: Bikas Saha >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2091.1.patch > > > Currently, the AM cannot programmatically determine if the task was killed > due to using excessive memory. The NM kills it without passing this > information in the container status back to the RM. So the AM cannot take any > action here. The jira tracks adding this exit status and passing it from the > NM to the RM and then the AM. In general, there may be other such actions > taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
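Once the new exit status reaches the AM, the AM's container-completion handling can branch on it. A hedged sketch of that consumer side; the constant name and value below follow the JIRA title and are placeholders, since the committed patch may name it differently (for example, splitting physical vs. virtual memory kills):

{code:java}
// Sketch of AM-side handling once the NM/RM propagate a memory-kill exit status.
// KILL_EXCEEDED_MEMORY and its value are placeholders taken from the JIRA title.
import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public class MemoryKillHandling {
  // Hypothetical; a real AM would reference the constant added to ContainerExitStatus.
  static final int KILL_EXCEEDED_MEMORY = -104;

  static void onContainersCompleted(List<ContainerStatus> statuses) {
    for (ContainerStatus status : statuses) {
      if (status.getExitStatus() == KILL_EXCEEDED_MEMORY) {
        // The AM can now react programmatically, e.g. retry the task in a larger container.
        System.out.println("Container " + status.getContainerId()
            + " exceeded its memory limit: " + status.getDiagnostics());
      }
    }
  }
}
{code}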
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011907#comment-14011907 ] Ashwin Shankar commented on YARN-2026: -- Hi [~sandyr], bq. We would see that parentA is below its minShare, so we would preempt resources on its behalf. minShare preemption at parent queue is not yet implemented ,FairScheduler.resToPreempt() is not recursive(YARN-596 doesn't address this). I had created YARN-1961 for this purpose,which I plan to work on. But yes you are right,if YARN-1961 is in place, we can set minShare and minShareTimeout at parentA,which would reclaim resource from parentB. This solves problem-1 in the description,but what about problem-2 ? When we have many leaf queues under a parent,say using NestedUserQueue rule. Eg. - parentA has 100 user queues under it - fair share of each user queue is 1% of parentA(assuming weight=1) - Say user queue parentA.user1 is taking up 100% of cluster since its the only active queue. - parentA.user2 which was inactive till now ,submits a job and needs say 20%. - parentA.user2 would get only 1% through preemption and parentA.user1 would have 99%. This seems unfair considering users have equal weight. Eventually,as user1 releases its containers, it would go to user2,but until that happens user1 can hog the cluster. In our cluster we have about 200 users(so 200 user queues),but only about 20%(avg) are active at a point in time. Fair share for each user becomes really low (1/200)*parent and can causes this 'unfairness' mentioned in above example. This can be solved by dividing fair share only to active queues. How about this,can we have a new property say 'fairShareForActiveQueues' which turns on/off this feature,that way people who need it can use it and other's can turn it off and would get the usual static fair share behavior. Thoughts ? > Fair scheduler : Fair share for inactive queues causes unfair allocation in > some scenarios > -- > > Key: YARN-2026 > URL: https://issues.apache.org/jira/browse/YARN-2026 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Attachments: YARN-2026-v1.txt > > > Problem1- While using hierarchical queues in fair scheduler,there are few > scenarios where we have seen a leaf queue with least fair share can take > majority of the cluster and starve a sibling parent queue which has greater > weight/fair share and preemption doesn’t kick in to reclaim resources. > The root cause seems to be that fair share of a parent queue is distributed > to all its children irrespective of whether its an active or an inactive(no > apps running) queue. Preemption based on fair share kicks in only if the > usage of a queue is less than 50% of its fair share and if it has demands > greater than that. When there are many queues under a parent queue(with high > fair share),the child queue’s fair share becomes really low. As a result when > only few of these child queues have apps running,they reach their *tiny* fair > share quickly and preemption doesn’t happen even if other leaf > queues(non-sibling) are hogging the cluster. > This can be solved by dividing fair share of parent queue only to active > child queues. 
> Here is an example describing the problem and proposed solution: > root.lowPriorityQueue is a leaf queue with weight 2 > root.HighPriorityQueue is parent queue with weight 8 > root.HighPriorityQueue has 10 child leaf queues : > root.HighPriorityQueue.childQ(1..10) > Above config,results in root.HighPriorityQueue having 80% fair share > and each of its ten child queue would have 8% fair share. Preemption would > happen only if the child queue is <4% (0.5*8=4). > Lets say at the moment no apps are running in any of the > root.HighPriorityQueue.childQ(1..10) and few apps are running in > root.lowPriorityQueue which is taking up 95% of the cluster. > Up till this point,the behavior of FS is correct. > Now,lets say root.HighPriorityQueue.childQ1 got a big job which requires 30% > of the cluster. It would get only the available 5% in the cluster and > preemption wouldn't kick in since its above 4%(half fair share).This is bad > considering childQ1 is under a highPriority parent queue which has *80% fair > share*. > Until root.lowPriorityQueue starts relinquishing containers,we would see the > following allocation on the scheduler page: > *root.lowPriorityQueue = 95%* > *root.HighPriorityQueue.childQ1=5%* > This can be solved by distributing a parent’s fair share only to active > queues. > So in the example above,since childQ1 is the only active queue > under root.HighPriorityQueue, it would get all its parent
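To make the active-queue proposal concrete, a minimal, self-contained sketch contrasting today's static division with the proposed division over active children only, using the 100-user-queue example from the comment and equal child weights:

{code:java}
// Illustrative only: static fair share vs. the proposed "active queues only" division,
// using the 100-user-queue example from the comment above and equal child weights.
public class ActiveFairShareSketch {
  static double fairSharePerChild(double parentShare, int childCount) {
    return parentShare / childCount;
  }

  public static void main(String[] args) {
    double parentA = 1.0;   // parentA's share of the cluster
    int allChildren = 100;  // user queues configured under parentA
    int activeChildren = 2; // only user1 and user2 actually have apps running

    // Today: every child holds 1% regardless of activity, so preemption targets stay tiny.
    System.out.println(fairSharePerChild(parentA, allChildren));    // 0.01
    // Proposed: only active children split the parent's share, so user2 can reclaim 50%.
    System.out.println(fairSharePerChild(parentA, activeChildren)); // 0.5
  }
}
{code}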
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011906#comment-14011906 ] Hadoop QA commented on YARN-596: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647256/YARN-596.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3852//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3852//console This message is automatically generated. > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2010: --- Attachment: yarn-2010-3.patch New patch that gets rid of the config and addresses the issue where the masterKey is null. > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Rohith >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, > yarn-2010-3.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 
8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ... 13 more > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
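The trace bottoms out in ClientToAMTokenSecretManagerInRM.registerMasterKey with a missing master key for an attempt stored before security was enabled. A hedged sketch of the kind of guard the latest patch is described as adding ("addresses the issue where the masterKey is null"); the types and names here are illustrative stand-ins, not code from the patch:

{code:java}
// Sketch of guarding attempt recovery against a missing client-to-AM master key, so one
// unrecoverable attempt cannot block the RM's transition to active. Names are illustrative;
// the real change would live around RMAppAttemptImpl#recoverAppAttemptCredentials.
public class RecoveryGuardSketch {
  interface ClientToAMSecretManager {
    void registerMasterKey(String attemptId, byte[] key);
  }

  static void recoverCredentials(String attemptId, byte[] masterKey, ClientToAMSecretManager mgr) {
    if (masterKey == null || masterKey.length == 0) {
      // Nothing was persisted (e.g. the app predates enabling Kerberos): warn and continue
      // rather than throwing and aborting the whole transition to active.
      System.err.println("No client-to-AM master key recovered for attempt " + attemptId);
      return;
    }
    mgr.registerMasterKey(attemptId, masterKey);
  }
}
{code}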
[jira] [Updated] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2091: - Attachment: YARN-2091.1.patch Added ContainerExitStatus.KILL_EXCEEDED_MEMORY and test to pass the exit status from NM to RM correctly. > Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters > --- > > Key: YARN-2091 > URL: https://issues.apache.org/jira/browse/YARN-2091 > Project: Hadoop YARN > Issue Type: Task >Reporter: Bikas Saha >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2091.1.patch > > > Currently, the AM cannot programmatically determine if the task was killed > due to using excessive memory. The NM kills it without passing this > information in the container status back to the RM. So the AM cannot take any > action here. The jira tracks adding this exit status and passing it from the > NM to the RM and then the AM. In general, there may be other such actions > taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011879#comment-14011879 ] Hadoop QA commented on YARN-1474: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647249/YARN-1474.18.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 9 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/3851//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3851//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3851//console This message is automatically generated. > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, > YARN-1474.17.patch, YARN-1474.18.patch, YARN-1474.2.patch, YARN-1474.3.patch, > YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, > YARN-1474.8.patch, YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-596: - Attachment: YARN-596.patch > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011847#comment-14011847 ] Hadoop QA commented on YARN-2110: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647243/YARN-2110.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3850//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3850//console This message is automatically generated. > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Chen He > Labels: test > Attachments: YARN-2110.patch > > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
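One common way to keep such a test independent of the cluster-wide default scheduler is to pin the scheduler class in the test's own configuration before constructing MockRM; a sketch of that pattern (the committed patch may instead remove the CapacityScheduler casts altogether):

{code:java}
// Sketch: pin the scheduler the test assumes instead of relying on the configured default.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;

public class SchedulerPinningSketch {
  static Configuration capacitySchedulerConf() {
    Configuration conf = new YarnConfiguration();
    conf.setClass(YarnConfiguration.RM_SCHEDULER,
        CapacityScheduler.class, ResourceScheduler.class);
    return conf; // pass this to MockRM so the casts in the test cannot fail
  }
}
{code}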
[jira] [Updated] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2026: - Description: Problem1- While using hierarchical queues in fair scheduler,there are few scenarios where we have seen a leaf queue with least fair share can take majority of the cluster and starve a sibling parent queue which has greater weight/fair share and preemption doesn’t kick in to reclaim resources. The root cause seems to be that fair share of a parent queue is distributed to all its children irrespective of whether its an active or an inactive(no apps running) queue. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and if it has demands greater than that. When there are many queues under a parent queue(with high fair share),the child queue’s fair share becomes really low. As a result when only few of these child queues have apps running,they reach their *tiny* fair share quickly and preemption doesn’t happen even if other leaf queues(non-sibling) are hogging the cluster. This can be solved by dividing fair share of parent queue only to active child queues. Here is an example describing the problem and proposed solution: root.lowPriorityQueue is a leaf queue with weight 2 root.HighPriorityQueue is parent queue with weight 8 root.HighPriorityQueue has 10 child leaf queues : root.HighPriorityQueue.childQ(1..10) Above config,results in root.HighPriorityQueue having 80% fair share and each of its ten child queue would have 8% fair share. Preemption would happen only if the child queue is <4% (0.5*8=4). Lets say at the moment no apps are running in any of the root.HighPriorityQueue.childQ(1..10) and few apps are running in root.lowPriorityQueue which is taking up 95% of the cluster. Up till this point,the behavior of FS is correct. Now,lets say root.HighPriorityQueue.childQ1 got a big job which requires 30% of the cluster. It would get only the available 5% in the cluster and preemption wouldn't kick in since its above 4%(half fair share).This is bad considering childQ1 is under a highPriority parent queue which has *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers,we would see the following allocation on the scheduler page: *root.lowPriorityQueue = 95%* *root.HighPriorityQueue.childQ1=5%* This can be solved by distributing a parent’s fair share only to active queues. So in the example above,since childQ1 is the only active queue under root.HighPriorityQueue, it would get all its parent’s fair share i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds. Problem2 - Also note that similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2,if childQ2 hogs the cluster. childQ2 can take up 95% cluster and childQ1 would be stuck at 5%,until childQ2 starts relinquishing containers. We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue fair share ie 40%,which would ensure childQ1 gets upto 40% resource if needed through preemption. was: While using hierarchical queues in fair scheduler,there are few scenarios where we have seen a leaf queue with least fair share can take majority of the cluster and starve a sibling parent queue which has greater weight/fair share and preemption doesn’t kick in to reclaim resources. 
The root cause seems to be that fair share of a parent queue is distributed to all its children irrespective of whether its an active or an inactive(no apps running) queue. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and if it has demands greater than that. When there are many queues under a parent queue(with high fair share),the child queue’s fair share becomes really low. As a result when only few of these child queues have apps running,they reach their *tiny* fair share quickly and preemption doesn’t happen even if other leaf queues(non-sibling) are hogging the cluster. This can be solved by dividing fair share of parent queue only to active child queues. Here is an example describing the problem and proposed solution: root.lowPriorityQueue is a leaf queue with weight 2 root.HighPriorityQueue is parent queue with weight 8 root.HighPriorityQueue has 10 child leaf queues : root.HighPriorityQueue.childQ(1..10) Above config,results in root.HighPriorityQueue having 80% fair share and each of its ten child queue would have 8% fair share. Preemption would happen only if the child queue is <4% (0.5*8=4). Lets say at the moment no apps are running in any of the root.HighPriorityQueue.childQ(1..10) and few apps are running in root.lowPriorityQueue which is taking up 95% of the cluster. Up till this point,the behavior of FS is correct. Now,lets say root
[jira] [Commented] (YARN-2041) Hard to co-locate MR2 and Spark jobs on the same cluster in YARN
[ https://issues.apache.org/jira/browse/YARN-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011844#comment-14011844 ] Nishkam Ravi commented on YARN-2041: Unlike FIFO, whose performance deteriorates consistently across multiple benchmarks as value of yarn.nodemanager.resource.memory-mb is increased from 16GB to 40GB, Capacity scheduler performs well for all benchmarks except for TeraValidate. For TeraValidate in single-job mode: Exec. time with Fair: 38 sec (yarn.nodemanager.resource.memory-mb = 16GB) Exec. time with Fair: 38 sec (yarn.nodemanager.resource.memory-mb = 40GB) Exec. time with Capacity: 51 sec (yarn.nodemanager.resource.memory-mb = 16GB) Exec. time with Capacity: 100 sec (yarn.nodemanager.resource.memory-mb = 40GB) Also, in multi-job mode, Capacity seems to be behaving like FIFO. Scheduling one job at a time for execution. > Hard to co-locate MR2 and Spark jobs on the same cluster in YARN > > > Key: YARN-2041 > URL: https://issues.apache.org/jira/browse/YARN-2041 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Nishkam Ravi > > Performance of MR2 jobs falls drastically as YARN config parameter > yarn.nodemanager.resource.memory-mb is increased beyond a certain value. > Performance of Spark falls drastically as the value of > yarn.nodemanager.resource.memory-mb is decreased beyond a certain value for a > large data set. > This makes it hard to co-locate MR2 and Spark jobs in YARN. > The experiments are being conducted on a 6-node cluster. The following > workloads are being run: TeraGen, TeraSort, TeraValidate, WordCount, > ShuffleText and PageRank. > Will add more details to this JIRA over time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1474: - Attachment: YARN-1474.18.patch Thanks for sharing the opinions, Sandy and Karthik. I also think it's OK to change internal APIs' semantics because its interface is Evolving one. [~vinodkv], please let us know if you have additional comments. Updated patch with following changes to address Karthik's comments: 1. Removed {{initialized}} flag from *Schedulers. All initialization is done in {{serviceInit}} and {{serviceStart}}, instead of {{reinitialize()}}. 2. Changed ResourceSchedulerWrapper to override {{serviceInit}}, {{serviceStart}}, {{serviceStop}}. 3. Updated some tests to call scheduler.init() right after scheduler.setRMContext() without ResourceManager/MockRM. > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, > YARN-1474.17.patch, YARN-1474.18.patch, YARN-1474.2.patch, YARN-1474.3.patch, > YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, > YARN-1474.8.patch, YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
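For context, the service-model shape the patch moves the schedulers toward looks roughly like the following; this is a simplified sketch, not the patch itself (the real schedulers do far more in each lifecycle method):

{code:java}
// Simplified sketch of the lifecycle split this JIRA introduces: one-time setup in
// serviceInit/serviceStart, cleanup in serviceStop, reinitialize() only for config refresh.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

public class SchedulerAsServiceSketch extends AbstractService {
  public SchedulerAsServiceSketch() {
    super(SchedulerAsServiceSketch.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // One-time initialization that previously hid behind an "initialized" flag.
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    // Start background threads (e.g. the update/preemption thread) here.
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    // Stop those threads so tests can tear the scheduler down cleanly.
    super.serviceStop();
  }

  public void reinitialize(Configuration conf) {
    // After this change, reinitialize() is limited to refreshing queue configuration.
  }
}
{code}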
[jira] [Commented] (YARN-1708) Add a public API to reserve resources (part of YARN-1051)
[ https://issues.apache.org/jira/browse/YARN-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011820#comment-14011820 ] Vinod Kumar Vavilapalli commented on YARN-1708: --- Thanks for the patch [~subru]! I started looking at this. Few comments: h4. Misc - I think we should create a ReservationID or ReservationHandle and use it instead of strings - ReservationResponse.message -> errorMesage? Or Errors? h4. ApplicationClientProtocol - createReservation -> submitReservation? - Let's have separate request/response records for submission, update and deletion of reservations. Deletion of reservations, for e.g only needs to supplied a reservationID. See submit/kill app for analogy. Similarly, ReservationRequest.reservationID doesn't need to be part of the request for the reservation-submission. h4. ReservationDefinition - Seems like there is a notion of absolute time. We should make it clear what the arrival/deadline long's really represent. Particularly given the possibility of different timezones between the RM and the client. - It may be also very useful to let users specify time in relative terms - 6hrs from now, etc. - It let's you specify a list of ResourceRequests. Not sure how we can specify things like RR1 for the first 5 mins, RR2 for the next 15 etc. h4. ReservationDefinitionType - It seems like if we instead have a list of records of type (arrival, ResourceRequest, deadline), we will cover all the cases in the definition-type and then some more? Thoughts? - Also any examples of where R_ANY is useful? Similarly as to how R_ORDER is not enough and instead we have a need for R_ORDER_NO_GAP? Focusing mainly on use-cases here. h4. ResourceRequest - concurrency is really a request for a gang of containers? - Meaning of leaseDuration? Is it indicating the scheduler as to how long the container will run for? I have suggestions for configuration props renames follow. We follow a component.sub-component.sub-component.property-name convention. (OT: I wish I looked at preemption related config names :) ) IAC, I need to see the bigger picture with the rest of the patches before I can suggest correct naming, let's drop the YarnConfiguration changes from this patch. Will look more carefully at the PB impls in the next cycle. bq. The patch posted here is not submitted, since it depends on many other patches part of the umbrella JIRA, the separation is designed only for ease of reviewing. I see this patch to be fairly independent and committable in isolation. Though we should wait till we have the entire set to make sure the changes here are all sufficient and necessary. > Add a public API to reserve resources (part of YARN-1051) > - > > Key: YARN-1708 > URL: https://issues.apache.org/jira/browse/YARN-1708 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Carlo Curino >Assignee: Subramaniam Krishnan > Attachments: YARN-1708.patch > > > This JIRA tracks the definition of a new public API for YARN, which allows > users to reserve resources (think of time-bounded queues). This is part of > the admission control enhancement proposed in YARN-1051. -- This message was sent by Atlassian JIRA (v6.2#6252)
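To make the record-shape discussion easier to follow: the suggestion of "a list of records of type (arrival, ResourceRequest, deadline)" keyed by a typed reservation id could look roughly like the sketch below. Every name here is an assumption drawn from the comment, not the API that was eventually committed:

{code:java}
// Sketch of the record shape discussed above: a typed reservation id plus a list of
// (arrival, request, deadline) stages. Names and fields are assumptions from the comment.
import java.util.List;

public class ReservationSketch {
  static final class ReservationId {
    final long clusterTimestamp;
    final long id;
    ReservationId(long clusterTimestamp, long id) {
      this.clusterTimestamp = clusterTimestamp;
      this.id = id;
    }
  }

  static final class ReservationStage {
    final long arrivalMillis;  // absolute epoch time; relative forms ("6 hrs from now") are open
    final long deadlineMillis; // absolute epoch time
    final int gangSize;        // "concurrency": containers that must start together
    final int memoryMbEach;    // stand-in for a full ResourceRequest
    ReservationStage(long arrivalMillis, long deadlineMillis, int gangSize, int memoryMbEach) {
      this.arrivalMillis = arrivalMillis;
      this.deadlineMillis = deadlineMillis;
      this.gangSize = gangSize;
      this.memoryMbEach = memoryMbEach;
    }
  }

  static final class ReservationDefinition {
    final ReservationId id;
    final List<ReservationStage> stages; // e.g. RR1 for the first 5 minutes, RR2 for the next 15
    ReservationDefinition(ReservationId id, List<ReservationStage> stages) {
      this.id = id;
      this.stages = stages;
    }
  }
}
{code}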
[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-interval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011818#comment-14011818 ] Hadoop QA commented on YARN-2054: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647233/yarn-2054-4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3849//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3849//console This message is automatically generated. > Poor defaults for YARN ZK configs for retries and retry-inteval > --- > > Key: YARN-2054 > URL: https://issues.apache.org/jira/browse/YARN-2054 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch, > yarn-2054-4.patch > > > Currenly, we have the following default values: > # yarn.resourcemanager.zk-num-retries - 500 > # yarn.resourcemanager.zk-retry-interval-ms - 2000 > This leads to a cumulate 1000 seconds before the RM gives up trying to > connect to the ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
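The arithmetic behind the complaint, using the property names quoted in the issue and its default values: 500 retries at a 2000 ms interval is 1,000,000 ms, i.e. roughly 1,000 seconds (about 17 minutes) of retrying before the RM gives up.

{code:java}
// Worst-case ZK retry budget with the defaults cited in the issue:
// 500 retries x 2000 ms = 1,000,000 ms, i.e. about 1,000 seconds.
import org.apache.hadoop.conf.Configuration;

public class ZkRetryBudget {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    int retries = conf.getInt("yarn.resourcemanager.zk-num-retries", 500);
    long intervalMs = conf.getLong("yarn.resourcemanager.zk-retry-interval-ms", 2000);
    System.out.println("Worst-case wait: " + (retries * intervalMs) / 1000 + " seconds");
  }
}
{code}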
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011814#comment-14011814 ] Wei Yan commented on YARN-596: -- Thanks, [~ashwinshankar77]. I'll update a patch quickly. > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011807#comment-14011807 ] Ashwin Shankar commented on YARN-596: - [~ywskycn], minor comment: can you please update the javadoc comment for {code:title=FairScheduler.java} protected void preemptResources(Resource toPreempt) {code} It still talks about the previous preemption algorithm. > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2110: -- Labels: test (was: ) > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Chen He > Labels: test > Attachments: YARN-2110.patch > > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2109: -- Labels: test (was: ) > TestRM fails some tests when some tests run with CapacityScheduler and some > with FairScheduler > -- > > Key: YARN-2109 > URL: https://issues.apache.org/jira/browse/YARN-2109 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Anubhav Dhoot >Assignee: Chen He > Labels: test > > testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in > [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set > it to be CapacityScheduler. But if the default scheduler is set to > FairScheduler then the rest of the tests that execute after this will fail > with invalid cast exceptions when getting queuemetrics. This is based on test > execution order as only the tests that execute after this test will fail. > This is because the queuemetrics will be initialized by this test to > QueueMetrics and shared by the subsequent tests. > We can explicitly clear the metrics at the end of this test to fix this. > For example > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot > be cast to > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:90) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:85) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232) -- This message was sent by Atlassian JIRA (v6.2#6252)
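The description above suggests clearing the scheduler metrics at the end of testNMTokenSentForNormalContainer so later tests re-register FSQueueMetrics instead of hitting the statically cached QueueMetrics. A minimal sketch of such a teardown, assuming JUnit 4 and the test-visible QueueMetrics.clearQueueMetrics() helper; whether the eventual fix uses exactly these calls is not confirmed here.
{code}
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;
import org.junit.After;

public class TestRMMetricsCleanupSketch {
  @After
  public void clearSchedulerMetrics() {
    // Drop the cached QueueMetrics so the next test's scheduler
    // (CapacityScheduler or FairScheduler) registers its own metrics class.
    QueueMetrics.clearQueueMetrics();
    DefaultMetricsSystem.shutdown();
  }
}
{code}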
[jira] [Commented] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011788#comment-14011788 ] Chen He commented on YARN-2110: --- Changed the cast from CapacityScheduler to AbstractYarnScheduler, which is the parent of both FairScheduler and CapacityScheduler. > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Chen He > Attachments: YARN-2110.patch > > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
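A minimal sketch of the direction Chen He describes: cast the scheduler returned by the RM to AbstractYarnScheduler, the common parent, so the test passes regardless of which scheduler is configured as the default. Which concrete scheduler methods the test then calls through that reference is not shown here.
{code}
import org.apache.hadoop.yarn.server.resourcemanager.ResourceManager;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler;

public class SchedulerCastSketch {
  // Old test code, ((CapacityScheduler) rm.getResourceScheduler()), throws
  // ClassCastException when FairScheduler is the configured default.
  static AbstractYarnScheduler asAbstractScheduler(ResourceManager rm) {
    // Both CapacityScheduler and FairScheduler extend AbstractYarnScheduler.
    return (AbstractYarnScheduler) rm.getResourceScheduler();
  }
}
{code}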
[jira] [Updated] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2110: -- Attachment: YARN-2110.patch > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Chen He > Attachments: YARN-2110.patch > > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2054: --- Attachment: yarn-2054-4.patch Sorry for the bulky patch - forgot to rebase against trunk before generating a diff against it :) Here is the right one. > Poor defaults for YARN ZK configs for retries and retry-inteval > --- > > Key: YARN-2054 > URL: https://issues.apache.org/jira/browse/YARN-2054 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch, > yarn-2054-4.patch > > > Currenly, we have the following default values: > # yarn.resourcemanager.zk-num-retries - 500 > # yarn.resourcemanager.zk-retry-interval-ms - 2000 > This leads to a cumulate 1000 seconds before the RM gives up trying to > connect to the ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
[ https://issues.apache.org/jira/browse/YARN-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011706#comment-14011706 ] Hadoop QA commented on YARN-2112: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647222/YARN-2112.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3848//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3848//console This message is automatically generated. > Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml > - > > Key: YARN-2112 > URL: https://issues.apache.org/jira/browse/YARN-2112 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2112.1.patch > > > Now YarnClient is using TimelineClient, which has dependency on jackson libs. 
> However, the current dependency configurations make the hadoop-client > artifect miss 2 jackson libs, such that the applications which have > hadoop-client dependency will see the following exception > {code} > java.lang.NoClassDefFoundError: > org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637) > at java.lang.ClassLoader.defineClass(ClassLoader.java:621) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) > at java.net.URLClassLoader.access$000(URLClassLoader.java:58) > at java.net.URLClassLoader$1.run(URLClassLoader.java:197) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:190) > at java.lang.ClassLoader.loadClass(ClassLoader.java:306) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) > at java.lang.ClassLoader.loadClass(ClassLoader.java:247) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.(TimelineClientImpl.java:92) > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.mapred.ResourceMgrDelegate.(ResourceMgrDelegate.java:88) > at org.apache.hadoop.mapred.YARNRunner.(YARNRunner.java:111) > at > org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34) > at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95) > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:82) > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:75) > at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255) > at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:394) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250) > at or
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011686#comment-14011686 ] Jian He commented on YARN-2010: --- I agree that failing to recover an app shouldn’t fail the RM. I think for cases where the failure will be simply resolved by launching a new attempt like this, we should not fail the app. We can fail the app for cases where starting a new attempt can’t resolve the issue like failing to renew DT on recovery. > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Rohith >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 
5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ... 13 more > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
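Jian He's point is that one unrecoverable application should not block the RM from becoming active. A generic sketch of that pattern, not the actual RMAppManager code: catch the per-application recovery failure, fail (or relaunch) just that application, and keep going. The AppState and Recoverer types here are hypothetical stand-ins.
{code}
import java.util.List;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class PerAppRecoverySketch {
  private static final Log LOG = LogFactory.getLog(PerAppRecoverySketch.class);

  interface AppState { String getAppId(); }                            // hypothetical
  interface Recoverer { void recover(AppState app) throws Exception; } // hypothetical

  /** Recover each stored app; a failure marks that app failed instead of failing the RM. */
  static void recoverAll(List<AppState> storedApps, Recoverer recoverer) {
    for (AppState app : storedApps) {
      try {
        recoverer.recover(app);
      } catch (Exception e) {
        // Previously a failure here propagated up and the RM never transitioned
        // to active; instead, record the failure against this application only.
        LOG.error("Failed to recover " + app.getAppId() + ", marking it failed", e);
      }
    }
  }
}
{code}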
[jira] [Updated] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
[ https://issues.apache.org/jira/browse/YARN-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2112: -- Target Version/s: 2.5.0 > Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml > - > > Key: YARN-2112 > URL: https://issues.apache.org/jira/browse/YARN-2112 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2112.1.patch > > > Now YarnClient is using TimelineClient, which has dependency on jackson libs. > However, the current dependency configurations make the hadoop-client > artifect miss 2 jackson libs, such that the applications which have > hadoop-client dependency will see the following exception > {code} > java.lang.NoClassDefFoundError: > org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637) > at java.lang.ClassLoader.defineClass(ClassLoader.java:621) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) > at java.net.URLClassLoader.access$000(URLClassLoader.java:58) > at java.net.URLClassLoader$1.run(URLClassLoader.java:197) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:190) > at java.lang.ClassLoader.loadClass(ClassLoader.java:306) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) > at java.lang.ClassLoader.loadClass(ClassLoader.java:247) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.(TimelineClientImpl.java:92) > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.mapred.ResourceMgrDelegate.(ResourceMgrDelegate.java:88) > at org.apache.hadoop.mapred.YARNRunner.(YARNRunner.java:111) > at > org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34) > at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95) > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:82) > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:75) > at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255) > at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:394) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303) > at > org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) > at > org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at > org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) > at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145) > at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.util.RunJar.main(RunJar.java:212) > Caused by: java.lang.ClassNotFoundException: > org.codehaus.jackson.jaxrs.JacksonJaxbJsonProvider > at java.net.URLClassLoader$1.run(URLClassLoader.java:202) > at java.securit
[jira] [Updated] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
[ https://issues.apache.org/jira/browse/YARN-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2112: -- Attachment: YARN-2112.1.patch Create a patch to correct the configs in pom.xml, and make sure all 4 jackson libs are available in hadoop-client > Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml > - > > Key: YARN-2112 > URL: https://issues.apache.org/jira/browse/YARN-2112 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2112.1.patch > > > Now YarnClient is using TimelineClient, which has dependency on jackson libs. > However, the current dependency configurations make the hadoop-client > artifect miss 2 jackson libs, such that the applications which have > hadoop-client dependency will see the following exception > {code} > java.lang.NoClassDefFoundError: > org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637) > at java.lang.ClassLoader.defineClass(ClassLoader.java:621) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) > at java.net.URLClassLoader.access$000(URLClassLoader.java:58) > at java.net.URLClassLoader$1.run(URLClassLoader.java:197) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:190) > at java.lang.ClassLoader.loadClass(ClassLoader.java:306) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) > at java.lang.ClassLoader.loadClass(ClassLoader.java:247) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.(TimelineClientImpl.java:92) > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.mapred.ResourceMgrDelegate.(ResourceMgrDelegate.java:88) > at org.apache.hadoop.mapred.YARNRunner.(YARNRunner.java:111) > at > org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34) > at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95) > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:82) > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:75) > at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255) > at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:394) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303) > at > org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) > at > org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at > org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) > at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145) > at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.util.RunJar.main(RunJar.java:212) > Caused by: java.lang.ClassNotFoundException: > org.codehaus.jacks
[jira] [Created] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
Zhijie Shen created YARN-2112: - Summary: Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml Key: YARN-2112 URL: https://issues.apache.org/jira/browse/YARN-2112 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Now YarnClient is using TimelineClient, which has dependency on jackson libs. However, the current dependency configurations make the hadoop-client artifect miss 2 jackson libs, such that the applications which have hadoop-client dependency will see the following exception {code} java.lang.NoClassDefFoundError: org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637) at java.lang.ClassLoader.defineClass(ClassLoader.java:621) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) at java.net.URLClassLoader.access$000(URLClassLoader.java:58) at java.net.URLClassLoader$1.run(URLClassLoader.java:197) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.(TimelineClientImpl.java:92) at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.mapred.ResourceMgrDelegate.(ResourceMgrDelegate.java:88) at org.apache.hadoop.mapred.YARNRunner.(YARNRunner.java:111) at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34) at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95) at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:82) at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:75) at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255) at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:394) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303) at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145) at 
org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.jaxrs.JacksonJaxbJsonProvider at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:24
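A quick way to check whether a hadoop-client application actually has the JAX-RS Jackson provider on its classpath, which is the class the stack traces above fail to load. This is only a diagnostic sketch; the fix itself is the pom.xml dependency change described in the issue.
{code}
public class JacksonProviderProbe {
  public static void main(String[] args) {
    // The class TimelineClientImpl needs at construction time; its absence
    // produces the NoClassDefFoundError quoted above.
    String provider = "org.codehaus.jackson.jaxrs.JacksonJaxbJsonProvider";
    try {
      Class<?> c = Class.forName(provider);
      System.out.println(provider + " found in "
          + c.getProtectionDomain().getCodeSource().getLocation());
    } catch (ClassNotFoundException e) {
      System.out.println(provider + " is missing: the jackson-jaxrs (and"
          + " jackson-xc) jars are not on the client classpath");
    }
  }
}
{code}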
[jira] [Commented] (YARN-2098) App priority support in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011643#comment-14011643 ] Hadoop QA commented on YARN-2098: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647212/YARN-2098.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/3847//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3847//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3847//console This message is automatically generated. > App priority support in Fair Scheduler > -- > > Key: YARN-2098 > URL: https://issues.apache.org/jira/browse/YARN-2098 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.5.0 >Reporter: Ashwin Shankar >Assignee: Wei Yan > Attachments: YARN-2098.patch > > > This jira is created for supporting app priorities in fair scheduler. > AppSchedulable hard codes priority of apps to 1,we should > change this to get priority from ApplicationSubmissionContext. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2111) In FairScheduler.attemptScheduling, we don't count containers as assigned if they have 0 memory but non-zero cores
[ https://issues.apache.org/jira/browse/YARN-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-2111: - Summary: In FairScheduler.attemptScheduling, we don't count containers as assigned if they have 0 memory but non-zero cores (was: In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores) > In FairScheduler.attemptScheduling, we don't count containers as assigned if > they have 0 memory but non-zero cores > -- > > Key: YARN-2111 > URL: https://issues.apache.org/jira/browse/YARN-2111 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-2111.patch > > > {code} > if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource, > queueMgr.getRootQueue().assignContainer(node), > Resources.none())) { > {code} > As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores > here into account. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2111) In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores
[ https://issues.apache.org/jira/browse/YARN-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011625#comment-14011625 ] Hadoop QA commented on YARN-2111: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647204/YARN-2111.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3846//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3846//console This message is automatically generated. > In FairScheduler.attemptScheduling, we won't count containers as assigned if > they have 0 memory but non-zero cores > -- > > Key: YARN-2111 > URL: https://issues.apache.org/jira/browse/YARN-2111 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-2111.patch > > > {code} > if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource, > queueMgr.getRootQueue().assignContainer(node), > Resources.none())) { > {code} > As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores > here into account. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1801) NPE in public localizer
[ https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011601#comment-14011601 ] Jason Lowe commented on YARN-1801: -- Strictly speaking, the patch does prevent the NPE. However the public localizer is still effectively doomed if this condition occurs because it returns from the run() method. That will shutdown the localizer thread and public local resource requests will stop being processed. In that sense we've traded an NPE with a traceback for a one-line log message. I'm not sure this is an improvement, since at least the traceback is easier to notice in the NM log and we get a corresponding fatal log when someone goes hunting for what went wrong with the public localizer. The real issue is we need to understand what happened to cause pending.remove(completed) to return null. This should never happen, and if it does then it means we have a bug. Trying to recover from this condition is patching a symptom rather than a root cause. The problem that lead to the null request event _might_ have been fixed by YARN-1575 which wasn't present in 2.2 where the original bug occurred. It would be interesting to know if this has reoccurred since 2.3.0. Assuming this is still a potential issue, we should either find a way to prevent it from ever occurring or recover in a way that keeps the public localizer working as much as possible. It'd be great if we could just pull from the queue and receive a structure that has both the request event and the Future so we don't have to worry about a Future with no associated event. If we're going to try to recover instead, we'd have to log an error and try to cleanup. With no associated request event and no path if we got an execution error, it's going to be particularly difficult to recover properly. > NPE in public localizer > --- > > Key: YARN-1801 > URL: https://issues.apache.org/jira/browse/YARN-1801 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Jason Lowe >Assignee: Hong Zhiguo >Priority: Critical > Attachments: YARN-1801.patch > > > While investigating YARN-1800 found this in the NM logs that caused the > public localizer to shutdown: > {noformat} > 2014-01-23 01:26:38,655 INFO localizer.ResourceLocalizationService > (ResourceLocalizationService.java:addResource(651)) - Downloading public > rsrc:{ > hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, > 1390440382009, FILE, null } > 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService > (ResourceLocalizationService.java:run(726)) - Error: Shutting down > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712) > 2014-01-23 01:26:38,656 INFO localizer.ResourceLocalizationService > (ResourceLocalizationService.java:run(728)) - Public cache exiting > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
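A generic sketch of the structure Jason suggests at the end of his comment: keep each request and its Future together in one object, so the completion-handling loop can never observe a Future whose request is missing. The names below (Download, submitDownload) are illustrative and are not the actual ResourceLocalizationService fields.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.LinkedBlockingQueue;

public class PairedLocalizationSketch {
  /** Holds a request and its Future together, so neither can be orphaned. */
  static final class Download<R, V> {
    final R request;
    final FutureTask<V> result;
    Download(R request, FutureTask<V> result) {
      this.request = request;
      this.result = result;
    }
  }

  private final ExecutorService pool = Executors.newFixedThreadPool(4);
  private final BlockingQueue<Download<String, String>> completed =
      new LinkedBlockingQueue<Download<String, String>>();

  /** Run the download and enqueue the finished pair together with its request. */
  void submitDownload(final String request, final Callable<String> work) {
    pool.submit(new Runnable() {
      @Override public void run() {
        FutureTask<String> task = new FutureTask<String>(work);
        task.run();                              // executes the download
        completed.add(new Download<String, String>(request, task));
      }
    });
  }

  /** The handler thread always receives both pieces together. */
  Download<String, String> takeCompleted() throws InterruptedException {
    return completed.take();
  }
}
{code}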
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011587#comment-14011587 ] Sandy Ryza commented on YARN-1913: -- I think we should avoid doing approximate calculation through the minimum allocation. We need to handle situations where AM resources are much larger than the min, and situations where the minimum allocation will be 0 (common on Llama-enabled clusters). This would have the added benefit of avoiding touching the "runnability" machinery, which is already bordering on over-complicated. > With Fair Scheduler, cluster can logjam when all resources are consumed by AMs > -- > > Key: YARN-1913 > URL: https://issues.apache.org/jira/browse/YARN-1913 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Wei Yan > Labels: easyfix > Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, > YARN-1913.patch > > > It's possible to deadlock a cluster by submitting many applications at once, > and have all cluster resources taken up by AMs. > One solution is for the scheduler to limit resources taken up by AMs, as a > percentage of total cluster resources, via a "maxApplicationMasterShare" > config. -- This message was sent by Atlassian JIRA (v6.2#6252)
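A sketch of the kind of check Sandy is arguing for: compare the queue's actual aggregate AM resource usage, plus the new AM's real request, against a configured fraction of the queue's fair share, rather than estimating AM counts from the minimum allocation. The method and parameter names only loosely mirror whatever the final FairScheduler change does.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class AmShareCheckSketch {
  /**
   * True if starting an AM that needs amRequest keeps the queue's total
   * AM usage within maxAMShare (e.g. 0.5f) of its fair share.
   */
  static boolean canRunAm(Resource fairShare, Resource amResourceUsage,
      Resource amRequest, float maxAMShare) {
    Resource limit = Resources.multiply(fairShare, maxAMShare);
    Resource ifStarted = Resources.add(amResourceUsage, amRequest);
    return Resources.fitsIn(ifStarted, limit);
  }

  public static void main(String[] args) {
    Resource fairShare = Resource.newInstance(8192, 8);
    Resource amUsage = Resource.newInstance(2048, 2);
    Resource amRequest = Resource.newInstance(1024, 1);
    System.out.println(canRunAm(fairShare, amUsage, amRequest, 0.5f)); // true
  }
}
{code}
This covers both cases called out above: an AM larger than the minimum allocation is charged its real size, and a minimum allocation of 0 changes nothing because the minimum is never consulted.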
[jira] [Assigned] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He reassigned YARN-2109: - Assignee: Chen He > TestRM fails some tests when some tests run with CapacityScheduler and some > with FairScheduler > -- > > Key: YARN-2109 > URL: https://issues.apache.org/jira/browse/YARN-2109 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Anubhav Dhoot >Assignee: Chen He > > testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in > [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set > it to be CapacityScheduler. But if the default scheduler is set to > FairScheduler then the rest of the tests that execute after this will fail > with invalid cast exceptions when getting queuemetrics. This is based on test > execution order as only the tests that execute after this test will fail. > This is because the queuemetrics will be initialized by this test to > QueueMetrics and shared by the subsequent tests. > We can explicitly clear the metrics at the end of this test to fix this. > For example > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot > be cast to > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:90) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:85) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011582#comment-14011582 ] Chen He commented on YARN-2109: --- This is interesting and I will work on it. > TestRM fails some tests when some tests run with CapacityScheduler and some > with FairScheduler > -- > > Key: YARN-2109 > URL: https://issues.apache.org/jira/browse/YARN-2109 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Anubhav Dhoot > > testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in > [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set > it to be CapacityScheduler. But if the default scheduler is set to > FairScheduler then the rest of the tests that execute after this will fail > with invalid cast exceptions when getting queuemetrics. This is based on test > execution order as only the tests that execute after this test will fail. > This is because the queuemetrics will be initialized by this test to > QueueMetrics and shared by the subsequent tests. > We can explicitly clear the metrics at the end of this test to fix this. > For example > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot > be cast to > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:90) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:85) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011571#comment-14011571 ] Chen He commented on YARN-2110: --- I will take this. > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Chen He > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He reassigned YARN-2110: - Assignee: Chen He > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot >Assignee: Chen He > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2098) App priority support in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2098: -- Attachment: YARN-2098.patch > App priority support in Fair Scheduler > -- > > Key: YARN-2098 > URL: https://issues.apache.org/jira/browse/YARN-2098 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.5.0 >Reporter: Ashwin Shankar >Assignee: Wei Yan > Attachments: YARN-2098.patch > > > This jira is created for supporting app priorities in fair scheduler. > AppSchedulable hard codes priority of apps to 1,we should > change this to get priority from ApplicationSubmissionContext. -- This message was sent by Atlassian JIRA (v6.2#6252)
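For reference, a small sketch of where the per-app priority would come from: ApplicationSubmissionContext already carries a Priority record set at submission time, so the scheduler can read it and fall back to the old hard-coded value of 1 when it is absent. How the attached patch actually wires this into AppSchedulable is not shown here.
{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Priority;

public class AppPrioritySketch {
  /** Use the submitted priority if present; otherwise keep the previous hard-coded 1. */
  static Priority priorityOf(ApplicationSubmissionContext context) {
    Priority submitted = context.getPriority();
    return (submitted != null) ? submitted : Priority.newInstance(1);
  }
}
{code}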
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011542#comment-14011542 ] Sandy Ryza commented on YARN-596: - (pending Jenkins) > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011540#comment-14011540 ] Sandy Ryza commented on YARN-596: - +1 > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2111) In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores
[ https://issues.apache.org/jira/browse/YARN-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-2111: - Attachment: YARN-2111.patch > In FairScheduler.attemptScheduling, we won't count containers as assigned if > they have 0 memory but non-zero cores > -- > > Key: YARN-2111 > URL: https://issues.apache.org/jira/browse/YARN-2111 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-2111.patch > > > {code} > if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource, > queueMgr.getRootQueue().assignContainer(node), > Resources.none())) { > {code} > As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores > here into account. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2111) In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores
[ https://issues.apache.org/jira/browse/YARN-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reassigned YARN-2111: Assignee: Sandy Ryza > In FairScheduler.attemptScheduling, we won't count containers as assigned if > they have 0 memory but non-zero cores > -- > > Key: YARN-2111 > URL: https://issues.apache.org/jira/browse/YARN-2111 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > {code} > if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource, > queueMgr.getRootQueue().assignContainer(node), > Resources.none())) { > {code} > As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores > here into account. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2111) In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores
Sandy Ryza created YARN-2111: Summary: In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores Key: YARN-2111 URL: https://issues.apache.org/jira/browse/YARN-2111 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Sandy Ryza {code} if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource, queueMgr.getRootQueue().assignContainer(node), Resources.none())) { {code} As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores here into account. -- This message was sent by Atlassian JIRA (v6.2#6252)
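A self-contained illustration of the behavior described above: DefaultResourceCalculator compares only memory, so an assignment of 0 MB but 1 vcore is not "greater than" Resources.none() and the scheduling attempt goes uncounted. The second check is one possible calculator-free comparison; whether the attached patch does exactly that is not shown here.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class ZeroMemoryAssignmentDemo {
  public static void main(String[] args) {
    ResourceCalculator calc = new DefaultResourceCalculator();
    Resource cluster = Resource.newInstance(8192, 8);
    Resource assigned = Resource.newInstance(0, 1);   // 0 MB, 1 vcore

    // Prints false: DefaultResourceCalculator looks at memory only.
    System.out.println(
        Resources.greaterThan(calc, cluster, assigned, Resources.none()));

    // Prints true: compares both memory and vcores, without the calculator.
    System.out.println(!assigned.equals(Resources.none()));
  }
}
{code}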
[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user
[ https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011524#comment-14011524 ] Vinod Kumar Vavilapalli commented on YARN-1063: --- Scanned through the patch. It's dense and full of windows related stuff which I am not entirely familiar with. Looked at the code from YARN container localization and launch POV. I have posted some comments on YARN-1972 which may cause some changes here too. > Winutils needs ability to create task as domain user > > > Key: YARN-1063 > URL: https://issues.apache.org/jira/browse/YARN-1063 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager > Environment: Windows >Reporter: Kyle Leckie >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, > YARN-1063.patch > > > h1. Summary: > Securing a Hadoop cluster requires constructing some form of security > boundary around the processes executed in YARN containers. Isolation based on > Windows user isolation seems most feasible. This approach is similar to the > approach taken by the existing LinuxContainerExecutor. The current patch to > winutils.exe adds the ability to create a process as a domain user. > h1. Alternative Methods considered: > h2. Process rights limited by security token restriction: > On Windows access decisions are made by examining the security token of a > process. It is possible to spawn a process with a restricted security token. > Any of the rights granted by SIDs of the default token may be restricted. It > is possible to see this in action by examining the security tone of a > sandboxed process launch be a web browser. Typically the launched process > will have a fully restricted token and need to access machine resources > through a dedicated broker process that enforces a custom security policy. > This broker process mechanism would break compatibility with the typical > Hadoop container process. The Container process must be able to utilize > standard function calls for disk and network IO. I performed some work > looking at ways to ACL the local files to the specific launched without > granting rights to other processes launched on the same machine but found > this to be an overly complex solution. > h2. Relying on APP containers: > Recent versions of windows have the ability to launch processes within an > isolated container. Application containers are supported for execution of > WinRT based executables. This method was ruled out due to the lack of > official support for standard windows APIs. At some point in the future > windows may support functionality similar to BSD jails or Linux containers, > at that point support for containers should be added. > h1. Create As User Feature Description: > h2. Usage: > A new sub command was added to the set of task commands. Here is the syntax: > winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE] > Some notes: > * The username specified is in the format of "user@domain" > * The machine executing this command must be joined to the domain of the user > specified > * The domain controller must allow the account executing the command access > to the user information. For this join the account to the predefined group > labeled "Pre-Windows 2000 Compatible Access" > * The account running the command must have several rights on the local > machine. 
These can be managed manually using secpol.msc: > ** "Act as part of the operating system" - SE_TCB_NAME > ** "Replace a process-level token" - SE_ASSIGNPRIMARYTOKEN_NAME > ** "Adjust memory quotas for a process" - SE_INCREASE_QUOTA_NAME > * The launched process will not have rights to the desktop so will not be > able to display any information or create UI. > * The launched process will have no network credentials. Any access of > network resources that requires domain authentication will fail. > h2. Implementation: > Winutils performs the following steps: > # Enable the required privileges for the current process. > # Register as a trusted process with the Local Security Authority (LSA). > # Create a new logon for the user passed on the command line. > # Load/Create a profile on the local machine for the new logon. > # Create a new environment for the new logon. > # Launch the new process in a job with the task name specified and using the > created logon. > # Wait for the JOB to exit. > h2. Future work: > The following work was scoped out of this check-in: > * Support for non-domain users or machines that are not domain joined. > * Support for privilege isolation by running the task launcher in a high > privilege service with access over an ACLed named pipe. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011523#comment-14011523 ] Sandy Ryza commented on YARN-2026: -- The nice thing about fair share currently is that it's interpretable as an amount of resources that, as long as you stay under, you won't get preempted. Changing it to depend on the running apps in the cluster severely complicates this. It used to be that each app and queue's fair share was min'd with its resource usage+demand, which is sort of a continuous analog to what you're suggesting, but we moved to the current definition when we added multi-resource scheduling. I'm wondering if the right way to solve this problem is to allow preemption to be triggered at higher levels in the queue hierarchy. I.e. suppose we have the following situation: * root has two children - parentA and parentB * each of root's children has two children - childA1, childA2, childB1, and childB2 * the parent queues' minShares are each set to half of the cluster resources * the child queue' minShares are each set to a quarter of the cluster resources * childA1 has a third of the cluster resources * childB1 and childB2 each have a third of the cluster resources Even though childA1 is above its fair/minShare, We would see that parentA is below its minShare, so we would preempt resources on its behalf. Once we have YARN-596 in, these resources would end up coming from parentB, and end up going to childA1. > Fair scheduler : Fair share for inactive queues causes unfair allocation in > some scenarios > -- > > Key: YARN-2026 > URL: https://issues.apache.org/jira/browse/YARN-2026 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Attachments: YARN-2026-v1.txt > > > While using hierarchical queues in fair scheduler,there are few scenarios > where we have seen a leaf queue with least fair share can take majority of > the cluster and starve a sibling parent queue which has greater weight/fair > share and preemption doesn’t kick in to reclaim resources. > The root cause seems to be that fair share of a parent queue is distributed > to all its children irrespective of whether its an active or an inactive(no > apps running) queue. Preemption based on fair share kicks in only if the > usage of a queue is less than 50% of its fair share and if it has demands > greater than that. When there are many queues under a parent queue(with high > fair share),the child queue’s fair share becomes really low. As a result when > only few of these child queues have apps running,they reach their *tiny* fair > share quickly and preemption doesn’t happen even if other leaf > queues(non-sibling) are hogging the cluster. > This can be solved by dividing fair share of parent queue only to active > child queues. > Here is an example describing the problem and proposed solution: > root.lowPriorityQueue is a leaf queue with weight 2 > root.HighPriorityQueue is parent queue with weight 8 > root.HighPriorityQueue has 10 child leaf queues : > root.HighPriorityQueue.childQ(1..10) > Above config,results in root.HighPriorityQueue having 80% fair share > and each of its ten child queue would have 8% fair share. Preemption would > happen only if the child queue is <4% (0.5*8=4). > Lets say at the moment no apps are running in any of the > root.HighPriorityQueue.childQ(1..10) and few apps are running in > root.lowPriorityQueue which is taking up 95% of the cluster. 
> Up till this point,the behavior of FS is correct. > Now,lets say root.HighPriorityQueue.childQ1 got a big job which requires 30% > of the cluster. It would get only the available 5% in the cluster and > preemption wouldn't kick in since its above 4%(half fair share).This is bad > considering childQ1 is under a highPriority parent queue which has *80% fair > share*. > Until root.lowPriorityQueue starts relinquishing containers,we would see the > following allocation on the scheduler page: > *root.lowPriorityQueue = 95%* > *root.HighPriorityQueue.childQ1=5%* > This can be solved by distributing a parent’s fair share only to active > queues. > So in the example above,since childQ1 is the only active queue > under root.HighPriorityQueue, it would get all its parent’s fair share i.e. > 80%. > This would cause preemption to reclaim the 30% needed by childQ1 from > root.lowPriorityQueue after fairSharePreemptionTimeout seconds. > Also note that similar situation can happen between > root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2,if childQ2 > hogs the cluster. childQ2 can take up 95% cluster and childQ1 would be stuck > at 5%,until chil
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011515#comment-14011515 ] Vinod Kumar Vavilapalli commented on YARN-1972: --- Thanks for working on this Remus. Can you upload the "short design"? Questions/comments on the approach and the patch in the mean-while h4. Approach - What are the requirements on the NodeManager user? Can it run as a regular 'yarn' user, spawn the winutils shell and automatically launch task as some other user? Is there any admin setup that is needed for this to say grant such privileges to 'yarn' user? - One reason why we resorted to duplicate most of the code in DefaultContainerExecutor in container-executor.c for linux is performance. You are launching so many commands for every container - to chown files, to copy files etc. You should measure the performance impact of this to figure out if what the patch does is fine or if we should imitate what the linux-executor does. h4. Patch WindowsSecureContainerExecutor - The overridden getRunCommand skips things like the setting niceness feature (YARN-443) in linux. Arguably this isn't working in non-secure mode before anyways. Is there a way we can bump process-priority in windows? If so, when we add that feature, we'll need to be careful to change both the default and the secure Executor. - namenodeGroup -> nodeManagerGroup - The division of responsibility between launching multiple commands before starting the localizer and the stuff that happens inside the localizer: Localizer already does createUserLocalDirs etc. So you don't need to do them explicitly in the java code inside NodeManager process. - In the minimum we should definitely move exec.localizeClasspathJar() related stuff into the winutils start-process code. - Why is appLocalizationCounter needed? Once we tackle container-preserving NM-restart (YARN-1336), this will be an issue. Why cannot we simply use the localizerId? That is unique enough if we want uniqueness. - Also the startLocalizer() method is a near clone of what exists in LinuxContainerExecutor. We should refactor and reuse, otherwise it will be a maintenance headache. > Implement secure Windows Container Executor > --- > > Key: YARN-1972 > URL: https://issues.apache.org/jira/browse/YARN-1972 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1972.1.patch > > > This work item represents the Java side changes required to implement a > secure windows container executor, based on the YARN-1063 changes on > native/winutils side. > Necessary changes include leveraging the winutils task createas to launch the > container process as the required user and a secure localizer (launch > localization as a separate process running as the container user). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2068) FairScheduler uses the same ResourceCalculator for all policies
[ https://issues.apache.org/jira/browse/YARN-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved YARN-2068. -- Resolution: Invalid Closing this as invalid. Obviously feel free to reopen if I'm missing something. > FairScheduler uses the same ResourceCalculator for all policies > --- > > Key: YARN-2068 > URL: https://issues.apache.org/jira/browse/YARN-2068 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > FairScheduler uses the same ResourceCalculator for all policies including > DRF. Need to fix that. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011507#comment-14011507 ] Sandy Ryza commented on YARN-1474: -- My opinion is that it's ok to change these semantics, as ResourceScheduler is marked Evolving. Given the complexity of writing a YARN scheduler, I also seriously doubt that there are custom ones out there outside of academic contexts, so I'm comfortable erring on the opposite side of caution. > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, > YARN-1474.17.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, > YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, > YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
Anubhav Dhoot created YARN-2110: --- Summary: TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler Key: YARN-2110 URL: https://issues.apache.org/jira/browse/YARN-2110 Project: Hadoop YARN Issue Type: Bug Environment: The TestAMRestart#testAMRestartWithExistingContainers does a cast to CapacityScheduler in a couple of places {code} ((CapacityScheduler) rm1.getResourceScheduler()) {code} If run with FairScheduler as default scheduler the test throws {code} java.lang.ClassCastException {code}. Reporter: Anubhav Dhoot -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2110: Description: The TestAMRestart#testAMRestartWithExistingContainers does a cast to CapacityScheduler in a couple of places {code} ((CapacityScheduler) rm1.getResourceScheduler()) {code} If run with FairScheduler as default scheduler the test throws {code} java.lang.ClassCastException {code}. Environment: (was: The TestAMRestart#testAMRestartWithExistingContainers does a cast to CapacityScheduler in a couple of places {code} ((CapacityScheduler) rm1.getResourceScheduler()) {code} If run with FairScheduler as default scheduler the test throws {code} java.lang.ClassCastException {code}.) > TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler > --- > > Key: YARN-2110 > URL: https://issues.apache.org/jira/browse/YARN-2110 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Anubhav Dhoot > > The TestAMRestart#testAMRestartWithExistingContainers does a cast to > CapacityScheduler in a couple of places > {code} > ((CapacityScheduler) rm1.getResourceScheduler()) > {code} > If run with FairScheduler as default scheduler the test throws > {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
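A hedged sketch of one way TestAMRestart could avoid the hard cast, assuming the test only needs methods inherited from AbstractYarnScheduler (which both CapacityScheduler and FairScheduler extend); the eventual patch may take a different approach, and the variable names below are placeholders from a typical MockRM test:
{code}
// Instead of:
//   ((CapacityScheduler) rm1.getResourceScheduler())
// cast to the base class shared by both schedulers:
AbstractYarnScheduler scheduler =
    (AbstractYarnScheduler) rm1.getResourceScheduler();
// e.g. look up the running attempt without assuming a concrete scheduler type
// (am1 is the MockAM handle used elsewhere in the test -- placeholder name).
SchedulerApplicationAttempt attempt =
    scheduler.getApplicationAttempt(am1.getApplicationAttemptId());
{code}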
[jira] [Assigned] (YARN-2098) App priority support in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan reassigned YARN-2098: - Assignee: Wei Yan > App priority support in Fair Scheduler > -- > > Key: YARN-2098 > URL: https://issues.apache.org/jira/browse/YARN-2098 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.5.0 >Reporter: Ashwin Shankar >Assignee: Wei Yan > > This jira is created for supporting app priorities in fair scheduler. > AppSchedulable hard codes priority of apps to 1,we should > change this to get priority from ApplicationSubmissionContext. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2107) Refactor timeline classes into server.timeline package
[ https://issues.apache.org/jira/browse/YARN-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011429#comment-14011429 ] Hudson commented on YARN-2107: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5616 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5616/]) YARN-2107. Refactored timeline classes into o.a.h.y.s.timeline package. Contributed by Vinod Kumar Vavilapalli. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1598094) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/timeline * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AHSWebApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TimelineWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/EntityIdentifier.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/GenericObjectMapper.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/NameValuePair.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineReader.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineWriter.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/package-info.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineACLsManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilter.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilterInitializer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineClientAuthenticationService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineDelegationTokenSecretManagerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp * /hadoop/common/trunk/hadoop-yarn-project
[jira] [Commented] (YARN-800) Clicking on an AM link for a running app leads to a HTTP 500
[ https://issues.apache.org/jira/browse/YARN-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011420#comment-14011420 ] Dave Disser commented on YARN-800: -- I'm seeing this issue regardless of the status of yarn.resourcemanager.hostname and yarn.web-proxy.address. I notice the following in my nodemanager log file: 2014-05-28 14:06:20,478 INFO webproxy.WebAppProxyServlet (WebAppProxyServlet.java:doGet(330)) - dr.who is accessing unchecked http://hdp003-3:59959/ which is the app master GUI of application_1401300304842_0001 owned by hdfs I can try to retrieve this URL directly: hdp003-2:~ # wget -O - http://hdp003-3:59959/ --2014-05-28 14:06:47-- http://hdp003-3:59959/ Resolving hdp003-3... 39.64.24.3 Connecting to hdp003-3|39.64.24.3|:59959... connected. HTTP request sent, awaiting response... 302 Found Location: http://hdp003-3:59959/mapreduce [following] --2014-05-28 14:06:47-- http://hdp003-3:59959/mapreduce Reusing existing connection to hdp003-3:59959. HTTP request sent, awaiting response... 302 Found Location: http://hdp003-3:8088/proxy/application_1401300304842_0001/mapreduce [following] --2014-05-28 14:06:47-- http://hdp003-3:8088/proxy/application_1401300304842_0001/mapreduce Connecting to hdp003-3|39.64.24.3|:8088... failed: Connection refused. Resolving hdp003-3... 39.64.24.3 Connecting to hdp003-3|39.64.24.3|:8088... failed: Connection refused. The node running the AM is proxying the request to itself, where there is no proxy running. If I do the same on the node where AM is running, I get the proper result: hdp003-3:~ # wget -O - http://hdp003-3:59959/ --2014-05-28 14:07:25-- http://hdp003-3:59959/ Resolving hdp003-3... 39.64.24.3 Connecting to hdp003-3|39.64.24.3|:59959... connected. HTTP request sent, awaiting response... 302 Found Location: http://hdp003-3:59959/mapreduce [following] --2014-05-28 14:07:25-- http://hdp003-3:59959/mapreduce Reusing existing connection to hdp003-3:59959. HTTP request sent, awaiting response... 200 OK Length: 6224 (6.1K) [text/html] Saving to: `STDOUT' 0% [ ] 0 --.-K/s < !DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd";>
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011362#comment-14011362 ] Karthik Kambatla commented on YARN-2010: I see. Thanks for the input. Let me check if that is indeed the case, and attempt recovering the app even if the key is null Regardless, do we agree that we still need to address the case where the app recovery fails for potentially other reasons? > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Rohith >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 
5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ... 13 more > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011355#comment-14011355 ] Jian He commented on YARN-2010: --- bq. The stack trace corresponds to non-work-preserving restart. I am not sure I understand the concern. What I meant is, in this scenario, it shouldn't matter whether the old attempt has the master key or not, since the old attempt will be anyways killed by NM on RM restart. The newly started attempt will have the proper master key generated. If we just check whether the key is null and move on, the next attempt should be able to succeed. So we don't need to explicitly fail the app ? > RM can't transition to active if it can't recover an app attempt > > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Rohith >Priority: Critical > Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch > > > If the RM fails to recover an app attempt, it won't come up. We should make > it more resilient. > Specifically, the underlying error is that the app was submitted before > Kerberos security got turned on. Makes sense for the app to fail in this > case. But YARN should still start. > {noformat} > 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to > Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) > > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) > > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) > > ... 
5 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnException: > java.lang.IllegalArgumentException: Missing argument > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) > > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 8 more > Caused by: java.lang.IllegalArgumentException: Missing argument > at javax.crypto.spec.SecretKeySpec.(SecretKeySpec.java:93) > at > org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) > > at > org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) > > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) > > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) > > ...
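A hedged sketch of the null-guard idea discussed in the two YARN-2010 comments above (accessor and constant names are hypothetical; the real recovery path in RMAppAttemptImpl may look different). The point is to skip registering a client-to-AM master key that was never persisted, e.g. for apps submitted before security was enabled, rather than failing the RM's transition to active:
{code}
byte[] clientTokenMasterKey =
    credentials.getSecretKey(CLIENT_TO_AM_MASTER_KEY_NAME); // assumed lookup
if (clientTokenMasterKey != null && clientTokenMasterKey.length > 0) {
  rmContext.getClientToAMTokenSecretManager()
      .registerMasterKey(getAppAttemptId(), clientTokenMasterKey);
} else {
  // The old attempt predates security (or the key was never stored); the next
  // attempt gets a freshly generated key, so recovery need not fail here.
  LOG.warn("No client-to-AM master key recovered for " + getAppAttemptId());
}
{code}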
[jira] [Updated] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-596: - Attachment: YARN-596.patch > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011335#comment-14011335 ] Xuan Gong commented on YARN-2054: - [~kasha] Looks like you need to update the patch. There are lots of unrelated changes. > Poor defaults for YARN ZK configs for retries and retry-inteval > --- > > Key: YARN-2054 > URL: https://issues.apache.org/jira/browse/YARN-2054 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch > > > Currently, we have the following default values: > # yarn.resourcemanager.zk-num-retries - 500 > # yarn.resourcemanager.zk-retry-interval-ms - 2000 > This leads to a cumulative 1000 seconds before the RM gives up trying to > connect to the ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
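Until better defaults ship, a hedged example of overriding the two properties programmatically (property names are taken from the description above; the values are arbitrary illustrations, not recommendations):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ZkRetryOverride {
  public static Configuration withTighterZkRetries() {
    Configuration conf = new YarnConfiguration();
    // 30 retries x 1000 ms is roughly 30 seconds before the RM gives up on ZK,
    // versus the current defaults' 500 x 2000 ms = 1000 seconds.
    conf.setInt("yarn.resourcemanager.zk-num-retries", 30);
    conf.setLong("yarn.resourcemanager.zk-retry-interval-ms", 1000);
    return conf;
  }
}
{code}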
[jira] [Updated] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions
[ https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-596: - Attachment: YARN-596.patch Thanks, Sandy. Upload a new patch to fix your comments. > In fair scheduler, intra-application container priorities affect > inter-application preemption decisions > --- > > Key: YARN-596 > URL: https://issues.apache.org/jira/browse/YARN-596 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, > YARN-596.patch, YARN-596.patch, YARN-596.patch > > > In the fair scheduler, containers are chosen for preemption in the following > way: > All containers for all apps that are in queues that are over their fair share > are put in a list. > The list is sorted in order of the priority that the container was requested > in. > This means that an application can shield itself from preemption by > requesting it's containers at higher priorities, which doesn't really make > sense. > Also, an application that is not over its fair share, but that is in a queue > that is over it's fair share is just as likely to have containers preempted > as an application that is over its fair share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011331#comment-14011331 ] Karthik Kambatla commented on YARN-1474: I think that is the step in the right direction. I agree it is a change in semantics. Might be a good idea to see what others think. [~sandyr], [~vinodkv] - do you guys think it is okay to change the semantics on how a scheduler is used: - Before this patch, we create a scheduler and call reinitialize(). - After this patch, I am proposing scheduler.setRMContext(), scheduler.init(), and then scheduler.reinitialize() for later updates to allocation-files etc. Scheduler initialization is within the RM, and we haven't exposed the scheduler API for users to write custom schedulers yet. > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, > YARN-1474.17.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, > YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, > YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
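A hedged sketch of the calling order proposed above (the method names are the ones mentioned in the comment; exact signatures in the patch may differ):
{code}
FairScheduler scheduler = new FairScheduler();
scheduler.setRMContext(rmContext);        // 1. wire in the RM context first
scheduler.init(conf);                     // 2. one-time init via the service API
scheduler.start();                        // 3. start as a service
// ... later, e.g. when the allocation file changes:
scheduler.reinitialize(conf, rmContext);  // 4. reinitialize only for updates
{code}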
[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011325#comment-14011325 ] Hadoop QA commented on YARN-1338: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647161/YARN-1338v6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 16 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3844//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3844//console This message is automatically generated. > Recover localized resource cache state upon nodemanager restart > --- > > Key: YARN-1338 > URL: https://issues.apache.org/jira/browse/YARN-1338 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1338.patch, YARN-1338v2.patch, > YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch, YARN-1338v5.patch, > YARN-1338v6.patch > > > Today when node manager restarts we clean up all the distributed cache files > from disk. This is definitely not ideal from 2 aspects. > * For work preserving restart we definitely want them as running containers > are using them > * For even non work preserving restart this will be useful in the sense that > we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011320#comment-14011320 ] Tsuyoshi OZAWA commented on YARN-1474: -- Thanks Karthik for the comments. I'd like to make sure one point: {quote} 1. In each of the schedulers, I don't think we need the following snippet or for that matter the variable initialized at all. reinitialize() would have just the contents of else-block. {quote} If we change {{reinitialize()}} so that it has just the contents of the else-block, we need to change lots of scheduler-related test cases that don't use ResourceManager/MockRM so that they call {{scheduler.init()}} right after {{scheduler.setRMContext()}}. Is that acceptable for us? > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, > YARN-1474.17.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, > YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, > YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler
Anubhav Dhoot created YARN-2109: --- Summary: TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler Key: YARN-2109 URL: https://issues.apache.org/jira/browse/YARN-2109 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Anubhav Dhoot testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set it to be CapacityScheduler. But if the default scheduler is set to FairScheduler then the rest of the tests that execute after this will fail with invalid cast exceptions when getting queuemetrics. This is based on test execution order as only the tests that execute after this test will fail. This is because the queuemetrics will be initialized by this test to QueueMetrics and shared by the subsequent tests. We can explicitly clear the metrics at the end of this test to fix this. For example java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot be cast to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:90) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:85) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:81) at org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232) -- This message was sent by Atlassian JIRA (v6.2#6252)
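A hedged sketch of the "explicitly clear the metrics" idea from YARN-2109 above (assuming QueueMetrics.clearQueueMetrics() and a metrics-system shutdown are the right reset hooks; the actual fix may differ):
{code}
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;
import org.junit.After;

public class TestRmCleanupSketch {
  @After
  public void tearDown() {
    // Drop the statically cached QueueMetrics so a later test that runs with
    // FairScheduler re-creates FSQueueMetrics instead of hitting the
    // ClassCastException shown above.
    QueueMetrics.clearQueueMetrics();
    DefaultMetricsSystem.shutdown();
  }
}
{code}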
[jira] [Updated] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1338: - Attachment: YARN-1338v6.patch Thanks for the additional comments, Junping. bq. Do we have any code to destroy DB items for NMState when NM is decommissioned (not expecting short-term restart)? Good point. I added shutdown code that removes the recovery directory if the shutdown is due to a decommission. I also added a unit test for this scenario. {quote} In LocalResourcesTrackerImpl#recoverResource() +incrementFileCountForLocalCacheDirectory(localDir.getParent()); Given localDir is already the parent of localPath, may be we should just increment locaDir rather than its parent? I didn't see we have unit test to check file count for resource directory after recovery. May be we should add some? {quote} The last component of localDir is the unique resource ID and not a directory managed by the local cache directory manager. The directory allocated by the local cache directory manager has an additional directory added by the localization process which is named after the unique ID for the local resource. For example, the localPath might be something like /local/root/0/1/52/resource.jar and localDir is /local/root/0/1/52. The '52' is the unique resource ID (always >= 10 so it can't conflict with single-character cache mgr subdirs) and /local/root/0/1 is the directory managed by the local dir cache manager. If we passed localDir to the local dir cache manager it would get confused since it would try to parse the last component as a subdirectory it created but it isn't that. I did add a unit test to verify local cache directory counts are incremented properly when resources are recovered. This required exposing a couple of methods as package-private to get the necessary information for the test. > Recover localized resource cache state upon nodemanager restart > --- > > Key: YARN-1338 > URL: https://issues.apache.org/jira/browse/YARN-1338 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1338.patch, YARN-1338v2.patch, > YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch, YARN-1338v5.patch, > YARN-1338v6.patch > > > Today when node manager restarts we clean up all the distributed cache files > from disk. This is definitely not ideal from 2 aspects. > * For work preserving restart we definitely want them as running containers > are using them > * For even non work preserving restart this will be useful in the sense that > we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
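A small illustration of the directory layout described in the YARN-1338 comment above (the concrete paths are the example values from the comment, not real cluster paths):
{code}
import org.apache.hadoop.fs.Path;

// The localized file itself:
Path localPath = new Path("/local/root/0/1/52/resource.jar");
// Its parent's last component ("52") is the unique resource ID, not a
// directory managed by LocalCacheDirectoryManager:
Path localDir = localPath.getParent();        // /local/root/0/1/52
// The directory the cache manager actually tracks is one level further up,
// hence incrementFileCountForLocalCacheDirectory(localDir.getParent()):
Path cacheManagedDir = localDir.getParent();  // /local/root/0/1
{code}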
[jira] [Commented] (YARN-2012) Fair Scheduler: allow default queue placement rule to take an arbitrary queue
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011183#comment-14011183 ] Hudson commented on YARN-2012: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1784 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1784/]) YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > Fair Scheduler: allow default queue placement rule to take an arbitrary queue > - > > Key: YARN-2012 > URL: https://issues.apache.org/jira/browse/YARN-2012 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Fix For: 2.5.0 > > Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt, YARN-2012-v3.txt > > > Currently 'default' rule in queue placement policy,if applied,puts the app in > root.default queue. It would be great if we can make 'default' rule > optionally point to a different queue as default queue . > This default queue can be a leaf queue or it can also be an parent queue if > the 'default' rule is nested inside nestedUserQueue rule(YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2105) Fix TestFairScheduler after YARN-2012
[ https://issues.apache.org/jira/browse/YARN-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011182#comment-14011182 ] Hudson commented on YARN-2105: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1784 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1784/]) YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > Fix TestFairScheduler after YARN-2012 > - > > Key: YARN-2105 > URL: https://issues.apache.org/jira/browse/YARN-2105 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: Ashwin Shankar > Fix For: 2.5.0 > > Attachments: YARN-2105-v1.txt > > > The following tests fail in trunk: > {code} > Failed tests: > TestFairScheduler.testDontAllowUndeclaredPools:2412 expected:<1> but was:<0> > Tests in error: > TestFairScheduler.testQueuePlacementWithPolicy:624 NullPointer > TestFairScheduler.testNotUserAsDefaultQueue:530 » NullPointer > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2105) Fix TestFairScheduler after YARN-2012
[ https://issues.apache.org/jira/browse/YARN-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011101#comment-14011101 ] Hudson commented on YARN-2105: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1757 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1757/]) YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > Fix TestFairScheduler after YARN-2012 > - > > Key: YARN-2105 > URL: https://issues.apache.org/jira/browse/YARN-2105 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: Ashwin Shankar > Fix For: 2.5.0 > > Attachments: YARN-2105-v1.txt > > > The following tests fail in trunk: > {code} > Failed tests: > TestFairScheduler.testDontAllowUndeclaredPools:2412 expected:<1> but was:<0> > Tests in error: > TestFairScheduler.testQueuePlacementWithPolicy:624 NullPointer > TestFairScheduler.testNotUserAsDefaultQueue:530 » NullPointer > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2012) Fair Scheduler: allow default queue placement rule to take an arbitrary queue
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011103#comment-14011103 ] Hudson commented on YARN-2012: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1757 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1757/]) YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > Fair Scheduler: allow default queue placement rule to take an arbitrary queue > - > > Key: YARN-2012 > URL: https://issues.apache.org/jira/browse/YARN-2012 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Fix For: 2.5.0 > > Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt, YARN-2012-v3.txt > > > Currently 'default' rule in queue placement policy,if applied,puts the app in > root.default queue. It would be great if we can make 'default' rule > optionally point to a different queue as default queue . > This default queue can be a leaf queue or it can also be an parent queue if > the 'default' rule is nested inside nestedUserQueue rule(YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2012) Fair Scheduler: allow default queue placement rule to take an arbitrary queue
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011022#comment-14011022 ] Hudson commented on YARN-2012: -- FAILURE: Integrated in Hadoop-Yarn-trunk #566 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/566/]) YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > Fair Scheduler: allow default queue placement rule to take an arbitrary queue > - > > Key: YARN-2012 > URL: https://issues.apache.org/jira/browse/YARN-2012 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Fix For: 2.5.0 > > Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt, YARN-2012-v3.txt > > > Currently 'default' rule in queue placement policy,if applied,puts the app in > root.default queue. It would be great if we can make 'default' rule > optionally point to a different queue as default queue . > This default queue can be a leaf queue or it can also be an parent queue if > the 'default' rule is nested inside nestedUserQueue rule(YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2105) Fix TestFairScheduler after YARN-2012
[ https://issues.apache.org/jira/browse/YARN-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011021#comment-14011021 ] Hudson commented on YARN-2105: -- FAILURE: Integrated in Hadoop-Yarn-trunk #566 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/566/]) YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > Fix TestFairScheduler after YARN-2012 > - > > Key: YARN-2105 > URL: https://issues.apache.org/jira/browse/YARN-2105 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: Ashwin Shankar > Fix For: 2.5.0 > > Attachments: YARN-2105-v1.txt > > > The following tests fail in trunk: > {code} > Failed tests: > TestFairScheduler.testDontAllowUndeclaredPools:2412 expected:<1> but was:<0> > Tests in error: > TestFairScheduler.testQueuePlacementWithPolicy:624 NullPointer > TestFairScheduler.testNotUserAsDefaultQueue:530 » NullPointer > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010998#comment-14010998 ] Hadoop QA commented on YARN-2054: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647073/yarn-2054-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 27 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-common-project/hadoop-nfs hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs-nfs hadoop-tools/hadoop-distcp hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell org.apache.hadoop.yarn.client.TestRMAdminCLI org.apache.hadoop.hdfs.server.namenode.TestSecondaryNameNodeUpgrade {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3843//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3843//console This message is automatically generated. > Poor defaults for YARN ZK configs for retries and retry-inteval > --- > > Key: YARN-2054 > URL: https://issues.apache.org/jira/browse/YARN-2054 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch > > > Currenly, we have the following default values: > # yarn.resourcemanager.zk-num-retries - 500 > # yarn.resourcemanager.zk-retry-interval-ms - 2000 > This leads to a cumulate 1000 seconds before the RM gives up trying to > connect to the ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2092) Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/YARN-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010906#comment-14010906 ] Steve Loughran commented on YARN-2092: -- # code using Apache Curator was one -in YARN and on the client if you picked up the locally installed hadoop CP. And we couldn't upload a newer version of Jackson, because again, what's on the CP is what you get. # If, client-side you abandon that CP and use/redist your entire hadoop binary set then you can reduce the risk here, at the expense of taking away from ops any control of the versions of things you run on the cluster, but then you now have to deal with older files in the cluster. # or you ignore yarn.lib.classpath entirely, *somehow* work out the values of yarn-site.xml &c, and re-upload every single hadoop-*.jar and its chosen binaries into every single container. having a YARN artifact repo will reduce the cost of that, but add a new one: bug fixes in hadoop will only propagate when the apps are rebuilt. # ..if you look at the HADOOP-9991 issue you can see links to some places where the outdated JARs in Hadoop cause problems for other ASF projects. # Tez appears to have broken because it was explicity putting the 1.8.x JARs on its list of binaries to upload. It only worked because it was using exactly the same version. # if you adopt a policy of change no dependencies that break apps that upload duplicate JARs to the CP -then this goes beyond Jackson, it says "hadoop cannot update any of its dependencies". That would go for 2.x and no doubt even if we did update things for 3.x, then we'll still get "you broke my code that uploaded jackson 1.8" issues. # ...and we haven't gone near Guava yet, which is frozen because it really is so brittle, but that means we can't pick up the guava 16.x-only fixes needed to work with the latest JVMs. If you do want to revoke jackson, I'm not going to veto it -but it goes beyond YARN, and we may as well revert every single HADOOP-9991-related upgrade. > Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to > 2.5.0-SNAPSHOT > > > Key: YARN-2092 > URL: https://issues.apache.org/jira/browse/YARN-2092 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Hitesh Shah > > Came across this when trying to integrate with the timeline server. Using a > 1.8.8 dependency of jackson works fine against 2.4.0 but fails against > 2.5.0-SNAPSHOT which needs 1.9.13. This is in the scenario where the user > jars are first in the classpath. -- This message was sent by Atlassian JIRA (v6.2#6252)