[jira] [Updated] (YARN-1874) Cleanup: Move RMActiveServices out of ResourceManager into its own file

2014-05-05 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1874:
-

Attachment: YARN-1874.2.patch

> Cleanup: Move RMActiveServices out of ResourceManager into its own file
> ---
>
> Key: YARN-1874
> URL: https://issues.apache.org/jira/browse/YARN-1874
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Karthik Kambatla
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-1874.1.patch, YARN-1874.2.patch
>
>
> As [~vinodkv] noticed on YARN-1867, ResourceManager is hard to maintain. We 
> should move RMActiveServices out to make it more manageable. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy

2014-05-05 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-2022:
--

Description: 
Cluster Size = 16GB [2NM's]
Queue A Capacity = 50%
Queue B Capacity = 50%
Consider there are 3 applications running in Queue A which has taken the full 
cluster capacity. 
J1 = 2GB AM + 1GB * 4 Maps
J2 = 2GB AM + 1GB * 4 Maps
J3 = 2GB AM + 1GB * 2 Maps

Another job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps].
Currently in this scenario, job J3 will get killed, including its AM.

It is better if the AM can be given the least priority among the multiple applications. In
this same scenario, map tasks from J3 and J4 can be preempted.
Later, when the cluster is free, maps can be allocated to these jobs.
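
One possible way to express the AM-last ordering (a rough sketch only; whether a container is an AM container is assumed to be queryable via an isAMContainer()-style flag, which is an assumption here, not the existing API):
{code}
// Sketch: when picking preemption victims, consider AM containers only after
// all other candidates, so maps/reduces are preempted before any AM.
List<RMContainer> orderCandidates(List<RMContainer> candidates) {
  List<RMContainer> nonAM = new ArrayList<RMContainer>();
  List<RMContainer> amLast = new ArrayList<RMContainer>();
  for (RMContainer c : candidates) {
    if (c.isAMContainer()) {      // assumed flag, for illustration only
      amLast.add(c);
    } else {
      nonAM.add(c);
    }
  }
  nonAM.addAll(amLast);           // AM containers go to the end of the list
  return nonAM;
}
{code}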

  was:
Cluster Size = 16GB [2NM's]
Queue A Capacity = 50%
Queue B Capacity = 50%
Consider there are 3 applications running in Queue A which has taken the full 
cluster capacity. 
J1 = 2GB AM + 1GB * 4 Maps
J2 = 2GB AM + 1GB * 4 Maps
J3 = 2GB AM + 1GB * 2 Maps

Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
Currently in this scenario, Jobs J3 will get killed including its AM.

It is better if AM can be given least priroity among multiple applications. In 
this same scenario, map tasks from J3 and J4 can be preempted.
Later when cluster is free, maps can be allocated to these Jobs.


> Preempting an Application Master container can be kept as least priority when 
> multiple applications are marked for preemption by 
> ProportionalCapacityPreemptionPolicy
> -
>
> Key: YARN-2022
> URL: https://issues.apache.org/jira/browse/YARN-2022
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Sunil G
>Assignee: Sunil G
>
> Cluster Size = 16GB [2NM's]
> Queue A Capacity = 50%
> Queue B Capacity = 50%
> Consider there are 3 applications running in Queue A which has taken the full 
> cluster capacity. 
> J1 = 2GB AM + 1GB * 4 Maps
> J2 = 2GB AM + 1GB * 4 Maps
> J3 = 2GB AM + 1GB * 2 Maps
> Another job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps].
> Currently in this scenario, job J3 will get killed, including its AM.
> It is better if the AM can be given the least priority among the multiple applications.
> In this same scenario, map tasks from J3 and J4 can be preempted.
> Later, when the cluster is free, maps can be allocated to these jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy

2014-05-05 Thread Sunil G (JIRA)
Sunil G created YARN-2022:
-

 Summary: Preempting an Application Master container can be kept as 
least priority when multiple applications are marked for preemption by 
ProportionalCapacityPreemptionPolicy
 Key: YARN-2022
 URL: https://issues.apache.org/jira/browse/YARN-2022
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Sunil G
Assignee: Sunil G


Cluster Size = 16GB [2NM's]
Queue A Capacity = 50%
Queue B Capacity = 50%
Consider there are 3 applications running in Queue A which has taken the full 
cluster capacity. 
J1 = 2GB AM + 1GB * 4 Maps
J2 = 2GB AM + 1GB * 4 Maps
J3 = 2GB AM + 1GB * 2 Maps

Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
Currently in this scenario, Jobs J3 will get killed including its AM.

It is better if AM can be given least priroity among multiple applications. In 
this same scenario, map tasks from J3 and J4 can be preempted.
Later when cluster is free, maps can be allocated to these Jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]

2014-05-05 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G reassigned YARN-2003:
-

Assignee: Sunil G

> Support to process Job priority from Submission Context in 
> AppAttemptAddedSchedulerEvent [RM side]
> --
>
> Key: YARN-2003
> URL: https://issues.apache.org/jira/browse/YARN-2003
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
>
> AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from 
> Submission Context and store.
> Later this can be used by Scheduler.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-05-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990304#comment-13990304
 ] 

Hadoop QA commented on YARN-1857:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12643410/YARN-1857.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
  org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3697//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3697//console

This message is automatically generated.

> CapacityScheduler headroom doesn't account for other AM's running
> -
>
> Key: YARN-1857
> URL: https://issues.apache.org/jira/browse/YARN-1857
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Chen He
> Attachments: YARN-1857.patch, YARN-1857.patch
>
>
> Its possible to get an application to hang forever (or a long time) in a 
> cluster with multiple users.  The reason why is that the headroom sent to the 
> application is based on the user limit but it doesn't account for other 
> Application masters using space in that queue.  So the headroom (user limit - 
> user consumed) can be > 0 even though the cluster is 100% full because the 
> other space is being used by application masters from other users.  
> For instance if you have a cluster with 1 queue, user limit is 100%, you have 
> multiple users submitting applications.  One very large application by user 1 
> starts up, runs most of its maps and starts running reducers. other users try 
> to start applications and get their application masters started but not 
> tasks.  The very large application then gets to the point where it has 
> consumed the rest of the cluster resources with all reduces.  But at this 
> point it needs to still finish a few maps.  The headroom being sent to this 
> application is only based on the user limit (which is 100% of the cluster 
> capacity) its using lets say 95% of the cluster for reduces and then other 5% 
> is being used by other users running application masters.  The MRAppMaster 
> thinks it still has 5% so it doesn't know that it should kill a reduce in 
> order to run a map.  
> This can happen in other scenarios also.  Generally in a large cluster with 
> multiple queues this shouldn't cause a hang forever but it could cause the 
> application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-05 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Description: 
After failover, the RM may require a certain threshold to determine whether it's
safe to make scheduling decisions and start accepting new container requests
from AMs. The threshold could be a certain number of nodes, i.e. the RM waits until
a certain number of nodes have joined before accepting new container requests. Or
it could simply be a timeout; only after the timeout does the RM accept new requests.
NMs that join after the threshold can be treated as new NMs and instructed to kill
all their containers.
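
One possible shape of such a gate (illustrative only; the parameter names below are hypothetical and are not existing YARN configuration keys):
{code}
// Sketch of a "safe to accept allocate requests" check after becoming active.
// minNodesToSchedule and schedulingWaitMs are hypothetical knobs.
boolean safeToAcceptAllocate(int registeredNMs, long activeSinceMillis,
    int minNodesToSchedule, long schedulingWaitMs) {
  boolean enoughNodes = registeredNMs >= minNodesToSchedule;
  boolean waitedLongEnough =
      System.currentTimeMillis() - activeSinceMillis >= schedulingWaitMs;
  return enoughNodes || waitedLongEnough;
}
{code}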

  was:After failover, the RM may require a certain threshold to determine whether
it's safe to make scheduling decisions and start accepting new container
requests from AMs. The threshold could be a certain number of nodes, i.e. the RM
waits until a certain number of nodes have joined before accepting new container
requests. Or it could simply be a timeout; only after the timeout does the RM accept
new requests.


> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> After failover, the RM may require a certain threshold to determine whether it's
> safe to make scheduling decisions and start accepting new container requests
> from AMs. The threshold could be a certain number of nodes, i.e. the RM waits
> until a certain number of nodes have joined before accepting new container
> requests. Or it could simply be a timeout; only after the timeout does the RM accept
> new requests.
> NMs that join after the threshold can be treated as new NMs and instructed to
> kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-1986) After upgrade from 2.2.0 to 2.4.0, NPE on first job start.

2014-05-05 Thread Hong Zhiguo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Zhiguo reassigned YARN-1986:
-

Assignee: Hong Zhiguo

> After upgrade from 2.2.0 to 2.4.0, NPE on first job start.
> --
>
> Key: YARN-1986
> URL: https://issues.apache.org/jira/browse/YARN-1986
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Jon Bringhurst
>Assignee: Hong Zhiguo
>
> After upgrade from 2.2.0 to 2.4.0, NPE on first job start.
> After RM was restarted, the job runs without a problem.
> {noformat}
> 19:11:13,441 FATAL ResourceManager:600 - Error in handling event type 
> NODE_UPDATE to the scheduler
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:462)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:714)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:743)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:104)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591)
>   at java.lang.Thread.run(Thread.java:744)
> 19:11:13,443  INFO ResourceManager:604 - Exiting, bbye..
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-05 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Description: After failover, the RM may require a certain threshold to
determine whether it's safe to make scheduling decisions and start accepting
new container requests from AMs. The threshold could be a certain number of
nodes, i.e. the RM waits until a certain number of nodes have joined before accepting
new container requests. Or it could simply be a timeout; only after the
timeout does the RM accept new requests.  (was: RM may not accept allocate requests
from AMs until all the NMs have re-synced back with RM. This is to eliminate
some race conditions like containerIds overlapping between 
)

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> After failover, the RM may require a certain threshold to determine whether it's
> safe to make scheduling decisions and start accepting new container requests
> from AMs. The threshold could be a certain number of nodes, i.e. the RM waits
> until a certain number of nodes have joined before accepting new container
> requests. Or it could simply be a timeout; only after the timeout does the RM accept
> new requests.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-05 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Description: 
RM may not accept allocate requests from AMs until all the NMs have re-synced 
back with RM. This is to eliminate some race conditions like containerIds 
overlapping between 


  was:
RM should not accept allocate requests from AMs until all the NMs have 
registered with RM. For that, RM needs to remember the previous NMs and wait 
for all the NMs to register.
This is also useful for remembering decommissioned nodes across restarts.


> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> RM may not accept allocate requests from AMs until all the NMs have re-synced 
> back with RM. This is to eliminate some race conditions like containerIds 
> overlapping between 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-05 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990254#comment-13990254
 ] 

Jian He commented on YARN-2001:
---

bq. Then node1 comes back to RM, RM recovers all containers on node1.
On second thought, this can be changed to not recover those containers and
instead kill them to meet the resource limit. That's another decision to make.

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> RM should not accept allocate requests from AMs until all the NMs have 
> registered with RM. For that, RM needs to remember the previous NMs and wait 
> for all the NMs to register.
> This is also useful for remembering decommissioned nodes across restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2001) Threshold for RM to accept requests from AM after failover

2014-05-05 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2001:
--

Summary: Threshold for RM to accept requests from AM after failover  (was: 
Persist NMs info for RM restart)

> Threshold for RM to accept requests from AM after failover
> --
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> RM should not accept allocate requests from AMs until all the NMs have 
> registered with RM. For that, RM needs to remember the previous NMs and wait 
> for all the NMs to register.
> This is also useful for remembering decommissioned nodes across restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Persist NMs info for RM restart

2014-05-05 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990252#comment-13990252
 ] 

Jian He commented on YARN-2001:
---

Consider a simple case where an application is granted 50% of the cluster resources
and the cluster has 2 nodes. The application has used up all of its resource quota and
launched all of its containers on node1. The RM fails over and node2 re-syncs back
with the RM first. Since node2 has no containers running for this application, the AM
asks for more containers, and the RM will think this AM hasn't used any resources and
will grant it more resources on node2. Then node1 comes back to the RM, and the RM
recovers all the containers on node1. The application ends up with more than its 50%
resource limit.

Another example: the RM needs to generate new container Ids for the new
containers requested by the AM. If the RM accepts new requests from the AM before the
nodes sync back, the new container Ids may overlap with the Ids of the recovered
containers.
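
A rough illustration of the Id overlap (hypothetical code; the assumption here is that container Ids are built from the app attempt Id plus a monotonically increasing counter that starts over in the new RM):
{code}
// Illustration only, not the actual RM code.
AtomicInteger containerIdCounter = new AtomicInteger(0); // fresh counter after failover

// Before failover, this attempt was already assigned containers with ids 1..5
// (still running on node1, which has not re-synced yet).

// If the RM serves the AM's new request before node1 re-syncs:
int newId = containerIdCounter.incrementAndGet();   // -> 1

// node1 then re-syncs and its running containers are recovered with ids 1..5,
// so the freshly allocated container and a recovered container now share id 1.
{code}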

> Persist NMs info for RM restart
> ---
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> RM should not accept allocate requests from AMs until all the NMs have 
> registered with RM. For that, RM needs to remember the previous NMs and wait 
> for all the NMs to register.
> This is also useful for remembering decommissioned nodes across restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Persist NMs info for RM restart

2014-05-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990241#comment-13990241
 ] 

Karthik Kambatla commented on YARN-2001:


bq. we may run into a condition where the resource usage and capacity limits (e.g.
headroom, queue capacity, etc.) in the scheduler are not yet correct until all the
nodes sync back all the running containers belonging to the app, and
applications/queues can potentially go beyond their limits.

My understanding has been that the RM's scheduler starts from scratch on 
restart/failover and rebuilds its state as nodes heartbeat. At any point in 
time, the cluster's resources correspond only to the NMs that have registered 
with the "new" RM. IOW, this should be no different from a new cluster. Given 
this, I am not sure how the scheduler can have incorrect information. 

> Persist NMs info for RM restart
> ---
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> RM should not accept allocate requests from AMs until all the NMs have 
> registered with RM. For that, RM needs to remember the previous NMs and wait 
> for all the NMs to register.
> This is also useful for remembering decommissioned nodes across restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1708) Add a public API to reserve resources (part of YARN-1051)

2014-05-05 Thread Subramaniam Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990239#comment-13990239
 ] 

Subramaniam Krishnan commented on YARN-1708:


Attaching the patch

> Add a public API to reserve resources (part of YARN-1051)
> -
>
> Key: YARN-1708
> URL: https://issues.apache.org/jira/browse/YARN-1708
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Carlo Curino
>Assignee: Subramaniam Krishnan
> Attachments: YARN-1708.patch
>
>
> This JIRA tracks the definition of a new public API for YARN, which allows 
> users to reserve resources (think of time-bounded queues). This is part of 
> the admission control enhancement proposed in YARN-1051.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1708) Add a public API to reserve resources (part of YARN-1051)

2014-05-05 Thread Subramaniam Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramaniam Krishnan updated YARN-1708:
---

Attachment: YARN-1708.patch

> Add a public API to reserve resources (part of YARN-1051)
> -
>
> Key: YARN-1708
> URL: https://issues.apache.org/jira/browse/YARN-1708
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Carlo Curino
>Assignee: Subramaniam Krishnan
> Attachments: YARN-1708.patch
>
>
> This JIRA tracks the definition of a new public API for YARN, which allows 
> users to reserve resources (think of time-bounded queues). This is part of 
> the admission control enhancement proposed in YARN-1051.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1708) Add a public API to reserve resources (part of YARN-1051)

2014-05-05 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990240#comment-13990240
 ] 

Carlo Curino commented on YARN-1708:



The attached patch represents a proposal for an extension of YARN's APIs.
This is the externally visible portion of the umbrella JIRA YARN-1051,
and provides users with the opportunity to create/update/delete time-varying
resource reservations within a queue (if the queue allows it).

Reservations are expressed by leveraging existing ResourceRequest objects
and extending them with temporal semantics (e.g., I need 1h of 20 containers
of size <2GB,1vcore> some time between 2pm and 6pm). We also allow expressing
minimum concurrency constraints, and dependencies among different stages of
a pipeline.

The reservationID token obtained by the user during the reservation process
is passed during application submission, and instructs the RM to use the
reserved resources to satisfy this application's needs.

The patch posted here is not submitted, since it depends on many other patches
that are part of the umbrella JIRA; the separation is designed only for ease of
reviewing.

A broader discussion of this idea, and some experimental results, are provided in
the tech report attached to the umbrella JIRA YARN-1051.

We have a complete solution backing this API, which we are testing/hardening, and
we will be posting the rest of it in the upcoming days/weeks.
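
To make the flow concrete, a hypothetical client-side usage could look like the sketch below. All class and method names here are illustrative only, not the API defined by the attached patch:
{code}
// Hypothetical illustration of the reservation flow described above.
ReservationRequest req = ReservationRequest.newInstance(
    Resource.newInstance(2048, 1),  // <2GB, 1 vcore> per container
    20,                             // 20 containers
    20,                             // minimum concurrency: all 20 together
    3600);                          // for 1 hour
ReservationDefinition def = ReservationDefinition.newInstance(
    startOfWindow,                  // 2pm, as an epoch timestamp
    endOfWindow,                    // 6pm, as an epoch timestamp
    req);

ReservationId reservationId = client.submitReservation(def);

// Later, the reservationId is set on the ApplicationSubmissionContext so the
// RM satisfies this application from the reserved resources.
appContext.setReservationID(reservationId);
{code}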

> Add a public API to reserve resources (part of YARN-1051)
> -
>
> Key: YARN-1708
> URL: https://issues.apache.org/jira/browse/YARN-1708
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Carlo Curino
>Assignee: Subramaniam Krishnan
> Attachments: YARN-1708.patch
>
>
> This JIRA tracks the definition of a new public API for YARN, which allows 
> users to reserve resources (think of time-bounded queues). This is part of 
> the admission control enhancement proposed in YARN-1051.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-05-05 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990233#comment-13990233
 ] 

Wangda Tan commented on YARN-1368:
--

Hi [~jianhe], thanks for this patch. I agree with the major strategies, but I have
some comments and questions.

In AbstractYarnScheduler:recoverContainersOnNode
{code}
+  if (rmApp.getApplicationSubmissionContext().getUnmanagedAM()) {
+if (LOG.isDebugEnabled()) {
+  LOG.debug("Skip recovering container " + status
+  + " for unmanaged AM." + rmApp.getApplicationId());
+}
+continue;
+  }
{code}
Why don't we recover containers in the unmanaged AM case? In my understanding, no
matter whether it's a managed or unmanaged AM, the recovery process should be the same.
Is there any difference between them?

Should this be included in schedulerAttempt.recoverContainer(...)?
{code}
+  // recover app scheduling info
+  schedulerAttempt.appSchedulingInfo.recoverContainer(rmContainer);
{code}

In AppSchedulingInfo.recoverContainer(...)
{code}
+QueueMetrics metrics = queue.getMetrics();
+if (pending) {
+  // If there was any running containers, the application was
+  // running from scheduler's POV.
+  pending = false;
+  metrics.runAppAttempt(applicationId, user);
+}
+if (rmContainer.getState().equals(RMContainerState.COMPLETED)) {
+  return;
+}
+metrics.allocateResources(user, 1, Resource.newInstance(1024, 1), false);
{code}
Should this be a part of queue.recoverContainer(...)? Is it better to create 
QueueMetrics.recoverContainer(...)?
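
For example, something like the following (purely a sketch of the suggestion, not an existing method):
{code}
// Sketch of a hypothetical QueueMetrics.recoverContainer(...).
public void recoverContainer(String user, RMContainer rmContainer) {
  if (rmContainer.getState().equals(RMContainerState.COMPLETED)) {
    return;  // nothing to account for an already-finished container
  }
  // charge the recovered container's actual resource against the queue/user
  allocateResources(user, 1, rmContainer.getContainer().getResource(), false);
}
{code}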

In CapacityScheduler,
{code}
-Collection nodes = cs.getAllNodes().values();
+Collection nodes = cs.getAllNodes().values();
{code}
Could you elaborate on why this is done, and on the series of changes between SchedulerNode
and FiCaSchedulerNode? I don't really understand them.

For recoverContainer in the queues, should we go top-down (recover from the root queue)
or bottom-up (recover from the leaf queues)? I found the patch does it bottom-up;
should this be decided by the scheduler implementation?

> Common work to re-populate containers’ state into scheduler
> ---
>
> Key: YARN-1368
> URL: https://issues.apache.org/jira/browse/YARN-1368
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Anubhav Dhoot
> Attachments: YARN-1368.1.patch, YARN-1368.preliminary.patch
>
>
> YARN-1367 adds support for the NM to tell the RM about all currently running 
> containers upon registration. The RM needs to send this information to the 
> schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
> the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-05-05 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990221#comment-13990221
 ] 

Rohith commented on YARN-2010:
--

Thank you [~kasha] for reviewing the patch. I updated the patch to continue on
recovery failure for finished applications.

Correct me if I am wrong, but continuing on a recovery failure for a running
application may cause the application to hang. So we need to consider the final
state of the application.

> RM can't transition to active if it can't recover an app attempt
> 
>
> Key: YARN-2010
> URL: https://issues.apache.org/jira/browse/YARN-2010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: bc Wong
>Assignee: Rohith
>Priority: Critical
> Attachments: YARN-2010.patch
>
>
> If the RM fails to recover an app attempt, it won't come up. We should make 
> it more resilient.
> Specifically, the underlying error is that the app was submitted before 
> Kerberos security got turned on. Makes sense for the app to fail in this 
> case. But YARN should still start.
> {noformat}
> 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Exception handling the winning of election 
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
> Active 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>  
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
>  
> ... 4 more 
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
>  
> ... 5 more 
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
> ... 8 more 
> Caused by: java.lang.IllegalArgumentException: Missing argument 
> at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) 
> at 
> org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
>  
> ... 13 more 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Persist NMs info for RM restart

2014-05-05 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990220#comment-13990220
 ] 

Bikas Saha commented on YARN-2001:
--

What if users want to have multiple standbys for fault tolerance? In a large
cluster there could be 3-4 distinct fault domains where more than 1
standby may be good to guarantee availability. Until now, in the design we have
not restricted the number of standbys. Having all NMs ping all RMs will
cause a lot of communication overhead in a healthy cluster.
The design already encompasses NMs discovering and syncing with the new active
RM. So that is not the problem. The problem is a restart during an upgrade, where
it may be common that a bunch of NMs don't come back up. The RM needs to be
resilient to that while maintaining availability. Having a threshold of NMs
sounds like a reasonable solution. The threshold can be calculated based on the
scheduling margin of error wrt queue capacity.

At this point my suggestion would be to clarify the problem being addressed in
this jira. Is the problem that after RM failover, the new RM needs to have a
certain minimum number of machines join it before it can safely make scheduling
decisions? If that's the case, then please update the title to reflect that
problem and not the solution.

> Persist NMs info for RM restart
> ---
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> RM should not accept allocate requests from AMs until all the NMs have 
> registered with RM. For that, RM needs to remember the previous NMs and wait 
> for all the NMs to register.
> This is also useful for remembering decommissioned nodes across restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Persist NMs info for RM restart

2014-05-05 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990215#comment-13990215
 ] 

Ming Ma commented on YARN-2001:
---

1. In the HA setup, could we make the standby RMs hot by having NMs send heartbeats
to all RMs? NMs would ignore the commands in heartbeat responses from standby
RMs. In that way, the new active RM will have the most recent NM state right after
the failover.

2. Decommission handling. If the decommission state can be reconstructed via the
include and exclude files, maybe we can ask admins to update the include and
exclude files on all RM nodes during the decommission process.

> Persist NMs info for RM restart
> ---
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> RM should not accept allocate requests from AMs until all the NMs have 
> registered with RM. For that, RM needs to remember the previous NMs and wait 
> for all the NMs to register.
> This is also useful for remembering decommissioned nodes across restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2020) observeOnly should be checked before any preemption computation started inside containerBasedPreemptOrKill() of ProportionalCapacityPreemptionPolicy.java

2014-05-05 Thread yeqi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990210#comment-13990210
 ] 

yeqi commented on YARN-2020:


Carlo Curino, thanks for your explanation. You are right, observeOnly is
useful for debugging and it would be meaningless to change it as I proposed.
I will close this ticket.

> observeOnly should be checked before any preemption computation started 
> inside containerBasedPreemptOrKill() of 
> ProportionalCapacityPreemptionPolicy.java
> -
>
> Key: YARN-2020
> URL: https://issues.apache.org/jira/browse/YARN-2020
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.4.0
> Environment: all
>Reporter: yeqi
>Priority: Trivial
> Fix For: 2.5.0
>
> Attachments: YARN-2020.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> observeOnly should be checked at the very beginning of
> ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(), so as
> to avoid unnecessary workload.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2020) observeOnly should be checked before any preemption computation started inside containerBasedPreemptOrKill() of ProportionalCapacityPreemptionPolicy.java

2014-05-05 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990203#comment-13990203
 ] 

Carlo Curino commented on YARN-2020:


I might be missing your point, but it seems to me that observeOnly is used to
compute the ideal allocation and log it without affecting the actual scheduler
allocation...
This is useful for debugging, and for operators to gain insight into what would
happen if they turned on preemption in their cluster, before actually doing so.

By moving the observeOnly check earlier as you propose, you prevent the
computation and the logging from happening. You are right that this will save some
computation, but it also makes it pointless to do the invocation altogether. The effect
you desire can be obtained by turning off the preemption policy altogether.

Is there anything else I am missing?

> observeOnly should be checked before any preemption computation started 
> inside containerBasedPreemptOrKill() of 
> ProportionalCapacityPreemptionPolicy.java
> -
>
> Key: YARN-2020
> URL: https://issues.apache.org/jira/browse/YARN-2020
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.4.0
> Environment: all
>Reporter: yeqi
>Priority: Trivial
> Fix For: 2.5.0
>
> Attachments: YARN-2020.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> observeOnly should be checked at the very beginning of
> ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(), so as
> to avoid unnecessary workload.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2021) Allow AM to set failed final status

2014-05-05 Thread David Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990196#comment-13990196
 ] 

David Chen commented on YARN-2021:
--

I would like to learn more about YARN and would like to pick this up.

> Allow AM to set failed final status
> ---
>
> Key: YARN-2021
> URL: https://issues.apache.org/jira/browse/YARN-2021
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jakob Homan
>
> Background: SAMZA-117. It would be good if an AM were able to signal via its 
> final status that the job itself has failed, even if the AM itself has finished up 
> in a tidy fashion.  It would be good if either (a) the AM could signal a final 
> status of failed and exit cleanly, or (b) we had another status, say 
> Application Failed, to indicate that the AM itself gave up.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1868) YARN status web ui does not show correctly in IE 11

2014-05-05 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990145#comment-13990145
 ] 

Vinod Kumar Vavilapalli commented on YARN-1868:
---

Seems reasonable. From what I understand, this header is only interpreted by 
IE. Right?

Can you leave a code comment as to why the header is set and add a test?

> YARN status web ui does not show correctly in IE 11
> ---
>
> Key: YARN-1868
> URL: https://issues.apache.org/jira/browse/YARN-1868
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 3.0.0
>Reporter: Chuan Liu
>Assignee: Chuan Liu
> Attachments: YARN-1868.patch, YARN_status.png
>
>
> The YARN status web ui does not show correctly in IE 11. The drop down menu 
> for app entries are not shown. Also the navigation menu displays incorrectly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Persist NMs info for RM restart

2014-05-05 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990120#comment-13990120
 ] 

Jian He commented on YARN-2001:
---

If the RM starts accepting application requests before the NMs sync back, for example,
we may run into a condition where the resource usage and capacity limits (e.g.
headroom, queue capacity, etc.) in the scheduler are not yet correct until all the
nodes sync back all the running containers belonging to the app, and
applications/queues can potentially go beyond their limits.
It would definitely be good if we can think of a way to not make the RM wait without
hitting race conditions.

> Persist NMs info for RM restart
> ---
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> RM should not accept allocate requests from AMs until all the NMs have 
> registered with RM. For that, RM needs to remember the previous NMs and wait 
> for all the NMs to register.
> This is also useful for remembering decommissioned nodes across restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2017) Merge common code in schedulers

2014-05-05 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990110#comment-13990110
 ] 

Jian He commented on YARN-2017:
---

Yup. Put the common code in an abstract class which the scheduler-specific node
classes can extend to implement their own specific logic.
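
For example, roughly along these lines (a sketch only; the class and field names are illustrative):
{code}
// Common bookkeeping lives in one base class; each scheduler's node type
// extends it and adds its own logic.
public abstract class AbstractSchedulerNode {
  protected Resource available;
  protected Resource used;
  protected final Map<ContainerId, RMContainer> launchedContainers =
      new HashMap<ContainerId, RMContainer>();

  public synchronized void allocateContainer(RMContainer rmContainer) {
    Resource res = rmContainer.getContainer().getResource();
    launchedContainers.put(rmContainer.getContainerId(), rmContainer);
    Resources.subtractFrom(available, res);
    Resources.addTo(used, res);
  }
}

public class FiCaSchedulerNode extends AbstractSchedulerNode {
  // CapacityScheduler-specific state (e.g. reservations) stays here
}
{code}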

> Merge common code in schedulers
> ---
>
> Key: YARN-2017
> URL: https://issues.apache.org/jira/browse/YARN-2017
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> A bunch of same code is repeated among schedulers, e.g:  between 
> FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a 
> common base.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore

2014-05-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990108#comment-13990108
 ] 

Karthik Kambatla commented on YARN-2019:


[~djp] - any particular ideas on how this should behave? 

> Retrospect on decision of making RM crashed if any exception throw in 
> ZKRMStateStore
> 
>
> Key: YARN-2019
> URL: https://issues.apache.org/jira/browse/YARN-2019
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Priority: Critical
>  Labels: ha
> Attachments: YARN-2019.1-wip.patch
>
>
> Currently, if anything abnormal happens in ZKRMStateStore, it will throw a fatal 
> exception to crash the RM. As shown in YARN-1924, it could be due to an RM HA 
> internal bug itself, and not a truly fatal condition. We should revisit some of the 
> decisions here, as the HA feature is designed to protect the key component, not 
> disturb it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2017) Merge common code in schedulers

2014-05-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990104#comment-13990104
 ] 

Karthik Kambatla commented on YARN-2017:


s/too/two/

> Merge common code in schedulers
> ---
>
> Key: YARN-2017
> URL: https://issues.apache.org/jira/browse/YARN-2017
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> A bunch of same code is repeated among schedulers, e.g:  between 
> FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a 
> common base.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2017) Merge common code in schedulers

2014-05-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990100#comment-13990100
 ] 

Karthik Kambatla commented on YARN-2017:


Instead of removing these too, can we make them extend AbstractSchedulerNode so 
we can store any other information that is scheduler specific? 

> Merge common code in schedulers
> ---
>
> Key: YARN-2017
> URL: https://issues.apache.org/jira/browse/YARN-2017
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> A bunch of same code is repeated among schedulers, e.g:  between 
> FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a 
> common base.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2001) Persist NMs info for RM restart

2014-05-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990091#comment-13990091
 ] 

Karthik Kambatla commented on YARN-2001:


I am not quite sure if this is a good idea. The RM should start doing what it 
can do irrespective of the NMs coming up - accepting applications, serving 
information etc. 

> Persist NMs info for RM restart
> ---
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
>
> RM should not accept allocate requests from AMs until all the NMs have 
> registered with RM. For that, RM needs to remember the previous NMs and wait 
> for all the NMs to register.
> This is also useful for remembering decommissioned nodes across restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-05 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990092#comment-13990092
 ] 

Tsuyoshi OZAWA commented on YARN-1474:
--

Yes, it's OK :-)

> Make schedulers services
> 
>
> Key: YARN-1474
> URL: https://issues.apache.org/jira/browse/YARN-1474
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.3.0
>Reporter: Sandy Ryza
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-1474.1.patch, YARN-1474.2.patch, YARN-1474.3.patch, 
> YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, 
> YARN-1474.8.patch, YARN-1474.9.patch
>
>
> Schedulers currently have a reinitialize but no start and stop.  Fitting them 
> into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990088#comment-13990088
 ] 

Karthik Kambatla commented on YARN-1474:


Just got back and catching up on a number of things. Is it okay if I take a 
look later this week?

> Make schedulers services
> 
>
> Key: YARN-1474
> URL: https://issues.apache.org/jira/browse/YARN-1474
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.3.0
>Reporter: Sandy Ryza
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-1474.1.patch, YARN-1474.2.patch, YARN-1474.3.patch, 
> YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, 
> YARN-1474.8.patch, YARN-1474.9.patch
>
>
> Schedulers currently have a reinitialize but no start and stop.  Fitting them 
> into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1987) Wrapper for leveldb DBIterator to aid in handling database exceptions

2014-05-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990077#comment-13990077
 ] 

Karthik Kambatla commented on YARN-1987:


Looks good to me, except for one nit -  LevelDBIterator is marked Public. We 
should probably add Evolving to say it is not completely stable yet.
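
A minimal sketch of what the wrapper with both annotations might look like (illustrative only; see the patch for the actual class):
{code}
@InterfaceAudience.Public
@InterfaceStability.Evolving
public class LevelDBIterator
    implements Iterator<Map.Entry<byte[], byte[]>>, Closeable {
  private final DBIterator iter;

  public LevelDBIterator(DB db) {
    this.iter = db.iterator();
  }

  @Override
  public boolean hasNext() {
    try {
      return iter.hasNext();
    } catch (RuntimeException e) {
      // translate raw runtime errors into DBException for callers
      throw (e instanceof DBException) ? (DBException) e
          : new DBException(e.getMessage(), e);
    }
  }

  @Override
  public Map.Entry<byte[], byte[]> next() {
    try {
      return iter.next();
    } catch (RuntimeException e) {
      throw (e instanceof DBException) ? (DBException) e
          : new DBException(e.getMessage(), e);
    }
  }

  @Override
  public void remove() {
    iter.remove();
  }

  @Override
  public void close() throws IOException {
    iter.close();
  }
}
{code}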

> Wrapper for leveldb DBIterator to aid in handling database exceptions
> -
>
> Key: YARN-1987
> URL: https://issues.apache.org/jira/browse/YARN-1987
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.4.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-1987.patch
>
>
> Per discussions in YARN-1984 and MAPREDUCE-5652, it would be nice to have a 
> utility wrapper around leveldb's DBIterator to translate the raw 
> RuntimeExceptions it can throw into DBExceptions to make it easier to handle 
> database errors while iterating.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2021) Allow AM to set failed final status

2014-05-05 Thread Jakob Homan (JIRA)
Jakob Homan created YARN-2021:
-

 Summary: Allow AM to set failed final status
 Key: YARN-2021
 URL: https://issues.apache.org/jira/browse/YARN-2021
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jakob Homan


Background: SAMZA-117. It would be good if an AM were able to signal via its 
final status that the job itself has failed, even if the AM itself has finished up 
in a tidy fashion.  It would be good if either (a) the AM could signal a final 
status of failed and exit cleanly, or (b) we had another status, say 
Application Failed, to indicate that the AM itself gave up.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore

2014-05-05 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990050#comment-13990050
 ] 

Tsuyoshi OZAWA commented on YARN-2019:
--

This means that all RMs can terminate when ZK cannot be accessed from the RMs. If
we should retry until ZK comes back up, one solution is handling
STATE_STORE_OP_FAILED in RMFatalEventDispatcher and going into standby state.
Please see the attached patch.
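
Roughly along these lines (a sketch of the idea only, not the attached patch itself):
{code}
// Sketch: also transition to standby on store op failures when HA is enabled,
// instead of terminating the RM outright.
if (event.getType() == RMFatalEventType.STATE_STORE_FENCED
    || event.getType() == RMFatalEventType.STATE_STORE_OP_FAILED) {
  if (rmContext.isHAEnabled()) {
    try {
      LOG.info("Transitioning RM to Standby mode");
      rm.transitionToStandby(true);
      return;
    } catch (Exception e) {
      LOG.fatal("Failed to transition RM to Standby mode.");
    }
  }
}
ExitUtil.terminate(1, event.getCause());
{code}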

> Retrospect on decision of making RM crashed if any exception throw in 
> ZKRMStateStore
> 
>
> Key: YARN-2019
> URL: https://issues.apache.org/jira/browse/YARN-2019
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Priority: Critical
>  Labels: ha
> Attachments: YARN-2019.1-wip.patch
>
>
> Currently, if anything abnormal happens in ZKRMStateStore, it will throw a fatal 
> exception to crash the RM. As shown in YARN-1924, it could be due to an RM HA 
> internal bug itself, and not a truly fatal condition. We should revisit some of the 
> decisions here, as the HA feature is designed to protect the key component, not 
> disturb it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore

2014-05-05 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2019:
-

Attachment: YARN-2019.1-wip.patch

> Retrospect on decision of making RM crashed if any exception throw in 
> ZKRMStateStore
> 
>
> Key: YARN-2019
> URL: https://issues.apache.org/jira/browse/YARN-2019
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Priority: Critical
>  Labels: ha
> Attachments: YARN-2019.1-wip.patch
>
>
> Currently, if anything abnormal happens in ZKRMStateStore, it will throw a fatal 
> exception to crash the RM. As shown in YARN-1924, it could be due to an RM HA 
> internal bug itself, and not a truly fatal condition. We should revisit some of the 
> decisions here, as the HA feature is designed to protect the key component, not 
> disturb it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore

2014-05-05 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990037#comment-13990037
 ] 

Tsuyoshi OZAWA commented on YARN-2019:
--

RMStateStore handles the exceptions in ZKRMStateStore like this: 
{code}
try {
  // ZK related operations
  removeRMDTMasterKeyState(delegationKey);
} catch (Exception e) {
  notifyStoreOperationFailed(e);
}
{code}

If it's fenced, RMFatalEventDispatcher handles the exception and the RM goes into
standby state. However, if STATE_STORE_OP_FAILED occurs, the active RM terminates.
After fail-over to the standby RM, the exception could be repeated on the new active
RM. Maybe this is the case [~djp] mentioned. Please correct me if I am wrong.

{code}
  @Private
  public static class RMFatalEventDispatcher
  implements EventHandler<RMFatalEvent> {
@Override
public void handle(RMFatalEvent event) {
  LOG.fatal("Received a " + RMFatalEvent.class.getName() + " of type " +
  event.getType().name() + ". Cause:\n" + event.getCause());

  if (event.getType() == RMFatalEventType.STATE_STORE_FENCED) {
LOG.info("RMStateStore has been fenced");
if (rmContext.isHAEnabled()) {
  try {
// Transition to standby and reinit active services
LOG.info("Transitioning RM to Standby mode");
rm.transitionToStandby(true);
return;
  } catch (Exception e) {
LOG.fatal("Failed to transition RM to Standby mode.");
  }
}
  }

  ExitUtil.terminate(1, event.getCause());
}
  }
{code}



> Retrospect on decision of making RM crashed if any exception throw in 
> ZKRMStateStore
> 
>
> Key: YARN-2019
> URL: https://issues.apache.org/jira/browse/YARN-2019
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Priority: Critical
>  Labels: ha
>
> Currently, if anything abnormal happens in ZKRMStateStore, it will throw a fatal 
> exception to crash the RM. As shown in YARN-1924, it could be due to an RM HA 
> internal bug itself, and not a truly fatal condition. We should revisit some of the 
> decisions here, as the HA feature is designed to protect the key component, not 
> disturb it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-05-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990026#comment-13990026
 ] 

Karthik Kambatla commented on YARN-2010:


I do think this is a critical issue, but I don't believe it is necessarily a
blocker for 2.4.1.

In terms of the fix, I believe it shouldn't be security-specific. We
should probably add a recovery-specific config to say it is okay to continue
starting the RM even if we fail to recover some applications.
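
For example (illustrative only; the config key name below is hypothetical, not an existing yarn-site property):
{code}
// Sketch: optionally skip apps whose recovery throws, instead of failing the
// whole transition to active.
boolean failFast = conf.getBoolean(
    "yarn.resourcemanager.fail-fast-on-recovery", true);   // hypothetical key

for (ApplicationState appState : rmState.getApplicationState().values()) {
  try {
    recoverApplication(appState, rmState);
  } catch (Exception e) {
    if (failFast) {
      throw e;
    }
    LOG.error("Failed to recover application "
        + appState.getAppId() + ", skipping it", e);
  }
}
{code}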

> RM can't transition to active if it can't recover an app attempt
> 
>
> Key: YARN-2010
> URL: https://issues.apache.org/jira/browse/YARN-2010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: bc Wong
>Assignee: Rohith
>Priority: Critical
> Attachments: YARN-2010.patch
>
>
> If the RM fails to recover an app attempt, it won't come up. We should make 
> it more resilient.
> Specifically, the underlying error is that the app was submitted before 
> Kerberos security got turned on. Makes sense for the app to fail in this 
> case. But YARN should still start.
> {noformat}
> 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Exception handling the winning of election 
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
> Active 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>  
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
>  
> ... 4 more 
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
>  
> ... 5 more 
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
> ... 8 more 
> Caused by: java.lang.IllegalArgumentException: Missing argument 
> at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) 
> at 
> org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
>  
> ... 13 more 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-05-05 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2010:
---

Priority: Critical  (was: Major)
Target Version/s: 2.5.0

> RM can't transition to active if it can't recover an app attempt
> 
>
> Key: YARN-2010
> URL: https://issues.apache.org/jira/browse/YARN-2010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.3.0
>Reporter: bc Wong
>Assignee: Rohith
>Priority: Critical
> Attachments: YARN-2010.patch
>
>
> If the RM fails to recover an app attempt, it won't come up. We should make 
> it more resilient.
> Specifically, the underlying error is that the app was submitted before 
> Kerberos security got turned on. Makes sense for the app to fail in this 
> case. But YARN should still start.
> {noformat}
> 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Exception handling the winning of election 
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
> Active 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
>  
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>  
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
>  
> ... 4 more 
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
>  
> ... 5 more 
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
> java.lang.IllegalArgumentException: Missing argument 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
>  
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
> ... 8 more 
> Caused by: java.lang.IllegalArgumentException: Missing argument 
> at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) 
> at 
> org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
>  
> ... 13 more 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1701) Improve default paths of timeline store and generic history store

2014-05-05 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990009#comment-13990009
 ] 

Tsuyoshi OZAWA commented on YARN-1701:
--

Sure.

> Improve default paths of timeline store and generic history store
> -
>
> Key: YARN-1701
> URL: https://issues.apache.org/jira/browse/YARN-1701
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.0
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
>Priority: Blocker
> Attachments: YARN-1701.3.patch, YARN-1701.v01.patch, 
> YARN-1701.v02.patch
>
>
> When I enable AHS via yarn.ahs.enabled, the app history is still not visible 
> in the AHS web UI. This is due to NullApplicationHistoryStore being used as 
> yarn.resourcemanager.history-writer.class. It would be good to have just one 
> key to enable the basic functionality.
> yarn.ahs.fs-history-store.uri uses {code}${hadoop.log.dir}{code}, which is a 
> local file system location. However, FileSystemApplicationHistoryStore uses 
> DFS by default.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2020) observeOnly should be checked before any preemption computation started inside containerBasedPreemptOrKill() of ProportionalCapacityPreemptionPolicy.java

2014-05-05 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989928#comment-13989928
 ] 

Tsuyoshi OZAWA commented on YARN-2020:
--

[~yeqi], thank you for taking this JIRA. Your patch doesn't include path 
information. The {{git diff --no-prefix}} command can help you create one. 
Thanks.
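
For example, something like the following from the top of the source tree 
(adjust the revision you diff against as needed):
{noformat}
git diff --no-prefix HEAD > YARN-2020.patch
{noformat}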

> observeOnly should be checked before any preemption computation started 
> inside containerBasedPreemptOrKill() of 
> ProportionalCapacityPreemptionPolicy.java
> -
>
> Key: YARN-2020
> URL: https://issues.apache.org/jira/browse/YARN-2020
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.4.0
> Environment: all
>Reporter: yeqi
>Priority: Trivial
> Fix For: 2.5.0
>
> Attachments: YARN-2020.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> observeOnly should be checked at the very beginning of 
> ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(), so as to 
> avoid unnecessary work.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1906) TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2

2014-05-05 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989915#comment-13989915
 ] 

Mit Desai commented on YARN-1906:
-

[~zjshen] and [~wangda], totally agree. We should add a message to the 
assertion as part of the fix.
[~ashwinshankar77]: Thanks for pointing this out. I also got the failure on 
this line last time. Additionally, I ran the test a couple of times and found 
that it fails randomly on any of the asserts. Still trying to figure out where 
the race is.
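
For example, something along these lines (values and variable name are just 
illustrative) would make it obvious which metric is off when the test fails:
{code}
// Attach a message so the failure report names the metric being checked.
assertEquals("appsSubmitted not restored after RM restart",
    2, metrics.getAppsSubmitted());
{code}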

> TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and 
> branch2
> ---
>
> Key: YARN-1906
> URL: https://issues.apache.org/jira/browse/YARN-1906
> Project: Hadoop YARN
>  Issue Type: Test
>Affects Versions: 2.4.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-1906.patch, YARN-1906.patch
>
>
> Here is the output of the failure:
> {noformat}
> testQueueMetricsOnRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
>   Time elapsed: 9.757 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<2> but was:<1>
>   at org.junit.Assert.fail(Assert.java:93)
>   at org.junit.Assert.failNotEquals(Assert.java:647)
>   at org.junit.Assert.assertEquals(Assert.java:128)
>   at org.junit.Assert.assertEquals(Assert.java:472)
>   at org.junit.Assert.assertEquals(Assert.java:456)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.assertQueueMetrics(TestRMRestart.java:1735)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart(TestRMRestart.java:1706)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1864) Fair Scheduler Dynamic Hierarchical User Queues

2014-05-05 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989888#comment-13989888
 ] 

Ashwin Shankar commented on YARN-1864:
--

FYI, there is some work going on to fix these two test failures : YARN-1906 and 
YARN-2018.

> Fair Scheduler Dynamic Hierarchical User Queues
> ---
>
> Key: YARN-1864
> URL: https://issues.apache.org/jira/browse/YARN-1864
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: scheduler
>Reporter: Ashwin Shankar
>  Labels: scheduler
> Attachments: YARN-1864-v1.txt, YARN-1864-v2.txt, YARN-1864-v3.txt, 
> YARN-1864-v4.txt, YARN-1864-v5.txt
>
>
> In Fair Scheduler, we want to be able to create user queues under any parent 
> queue in the hierarchy. For example, say user1 submits a job to a parent 
> queue called root.allUserQueues; we want to be able to create a new queue 
> called root.allUserQueues.user1 and run user1's job in it. Any further jobs 
> submitted by this user to root.allUserQueues will run in this newly created 
> root.allUserQueues.user1.
> This is very similar to the 'user-as-default' feature in Fair Scheduler, 
> which creates user queues under the root queue. But we want the ability to 
> create user queues under ANY parent queue.
> Why do we want this?
> 1. Preemption: these dynamically created user queues can preempt each other 
> if their fair share is not met, so there is fairness among users.
> User queues can also preempt other non-user leaf queues if they are below 
> their fair share.
> 2. Allocation to user queues: we want all the ad hoc user queries to consume 
> only a fraction of the resources in the shared cluster. With this feature, 
> we could do that by giving a fair share to the parent queue, which is then 
> redistributed to all the dynamically created user queues.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1906) TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and branch2

2014-05-05 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989883#comment-13989883
 ] 

Ashwin Shankar commented on YARN-1906:
--

Hey [~mitdesai], I encountered this issue in my pre-commit build, but it seems 
to have happened at a different place in this test. Here is the link:
https://builds.apache.org/job/PreCommit-YARN-Build/3686//testReport/org.apache.hadoop.yarn.server.resourcemanager/TestRMRestart/testQueueMetricsOnRMRestart/


> TestRMRestart#testQueueMetricsOnRMRestart fails intermittently on trunk and 
> branch2
> ---
>
> Key: YARN-1906
> URL: https://issues.apache.org/jira/browse/YARN-1906
> Project: Hadoop YARN
>  Issue Type: Test
>Affects Versions: 2.4.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-1906.patch, YARN-1906.patch
>
>
> Here is the output of the failure:
> {noformat}
> testQueueMetricsOnRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
>   Time elapsed: 9.757 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<2> but was:<1>
>   at org.junit.Assert.fail(Assert.java:93)
>   at org.junit.Assert.failNotEquals(Assert.java:647)
>   at org.junit.Assert.assertEquals(Assert.java:128)
>   at org.junit.Assert.assertEquals(Assert.java:472)
>   at org.junit.Assert.assertEquals(Assert.java:456)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.assertQueueMetrics(TestRMRestart.java:1735)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart(TestRMRestart.java:1706)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-05-05 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1368:
--

Attachment: YARN-1368.1.patch

Uploaded a new patch.
- AbstractYarnScheduler#recoverContainersOnNode() does the bulk of the recovery 
work: it recovers the RMContainer, SchedulerNode, Queue, 
SchedulerApplicationAttempt and appSchedulingInfo accordingly.
- ResourceTrackerService#handleContainerStatus is not needed anymore; that is 
handled in the common recovery flow.
- Changed RMAppRecoveredTransition to add the current attempt to the scheduler.
- Changed a few RMAppAttempt transitions to capture the completed containers 
that are recovered.
- Some modifications in CapacityScheduler to avoid sending unnecessary 
app_accepted/attempt_added events to the recovered apps/attempts.

Todo:
- Replace the containerStatus sent via NM registration with a new object that 
captures the resource capability of the container.
- FSQueue needs to implement its own recoverContainer method.

> Common work to re-populate containers’ state into scheduler
> ---
>
> Key: YARN-1368
> URL: https://issues.apache.org/jira/browse/YARN-1368
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Anubhav Dhoot
> Attachments: YARN-1368.1.patch, YARN-1368.preliminary.patch
>
>
> YARN-1367 adds support for the NM to tell the RM about all currently running 
> containers upon registration. The RM needs to send this information to the 
> schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
> the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-05-05 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-1857:
--

Attachment: YARN-1857.patch
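
To put concrete (made-up) numbers on the scenario described in this issue, 
assume a 100-container cluster with a single queue:
{code}
// Made-up numbers for the scenario in the description below.
int clusterCapacity = 100;              // total containers
int userLimit       = clusterCapacity;  // user limit is 100% of the queue
int user1Consumed   = 95;               // user 1's reduces
int otherUsersAMs   = 5;                // other users' application masters

int reportedHeadroom = userLimit - user1Consumed;                   // 5
int actuallyFree = clusterCapacity - user1Consumed - otherUsersAMs; // 0
// The MRAppMaster sees headroom 5 and keeps waiting instead of preempting a
// reduce, even though nothing is actually free.
{code}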

> CapacityScheduler headroom doesn't account for other AM's running
> -
>
> Key: YARN-1857
> URL: https://issues.apache.org/jira/browse/YARN-1857
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Chen He
> Attachments: YARN-1857.patch, YARN-1857.patch
>
>
> It's possible for an application to hang forever (or for a long time) in a 
> cluster with multiple users. The reason is that the headroom sent to the 
> application is based on the user limit, but it doesn't account for other 
> application masters using space in that queue. So the headroom (user limit - 
> user consumed) can be > 0 even though the cluster is 100% full, because the 
> remaining space is being used by application masters from other users.
> For instance, suppose you have a cluster with one queue, the user limit is 
> 100%, and multiple users are submitting applications. One very large 
> application by user 1 starts up, runs most of its maps and starts running 
> reducers. Other users try to start applications and get their application 
> masters started, but no tasks. The very large application then gets to the 
> point where it has consumed the rest of the cluster resources with reduces, 
> but it still needs to finish a few maps. The headroom being sent to this 
> application is only based on the user limit (which is 100% of the cluster 
> capacity): it is using, say, 95% of the cluster for reduces, and the other 
> 5% is being used by other users' application masters. The MRAppMaster thinks 
> it still has 5% headroom, so it doesn't know that it should kill a reduce in 
> order to run a map.
> This can happen in other scenarios as well. Generally, in a large cluster 
> with multiple queues this shouldn't cause a permanent hang, but it could 
> cause the application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running

2014-05-05 Thread Chen He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989865#comment-13989865
 ] 

Chen He commented on YARN-1857:
---

This failure is related to YARN-1906.

> CapacityScheduler headroom doesn't account for other AM's running
> -
>
> Key: YARN-1857
> URL: https://issues.apache.org/jira/browse/YARN-1857
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Chen He
> Attachments: YARN-1857.patch
>
>
> It's possible for an application to hang forever (or for a long time) in a 
> cluster with multiple users. The reason is that the headroom sent to the 
> application is based on the user limit, but it doesn't account for other 
> application masters using space in that queue. So the headroom (user limit - 
> user consumed) can be > 0 even though the cluster is 100% full, because the 
> remaining space is being used by application masters from other users.
> For instance, suppose you have a cluster with one queue, the user limit is 
> 100%, and multiple users are submitting applications. One very large 
> application by user 1 starts up, runs most of its maps and starts running 
> reducers. Other users try to start applications and get their application 
> masters started, but no tasks. The very large application then gets to the 
> point where it has consumed the rest of the cluster resources with reduces, 
> but it still needs to finish a few maps. The headroom being sent to this 
> application is only based on the user limit (which is 100% of the cluster 
> capacity): it is using, say, 95% of the cluster for reduces, and the other 
> 5% is being used by other users' application masters. The MRAppMaster thinks 
> it still has 5% headroom, so it doesn't know that it should kill a reduce in 
> order to run a map.
> This can happen in other scenarios as well. Generally, in a large cluster 
> with multiple queues this shouldn't cause a permanent hang, but it could 
> cause the application to take much longer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1896) For FairScheduler expose MinimumQueueResource of each queue in QueueMetrics

2014-05-05 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated YARN-1896:
--

Attachment: YARN-1896.v2.patch
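
A minimal sketch of what exposing the min/max queue resources in QueueMetrics 
could look like (gauge names and the setter are illustrative, not taken from 
the attached patch):
{code}
// Illustrative only: expose the configured min/max memory share of a queue
// as metrics2 gauges so they can be graphed next to the current usage.
@Metric("Minimum share of memory for this queue in MB")
MutableGaugeInt minShareMB;
@Metric("Maximum share of memory for this queue in MB")
MutableGaugeInt maxShareMB;

public void setMinMaxShare(Resource minShare, Resource maxShare) {
  minShareMB.set(minShare.getMemory());
  maxShareMB.set(maxShare.getMemory());
}
{code}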

> For FairScheduler expose MinimumQueueResource of each queue in QueueMetrics
> ---
>
> Key: YARN-1896
> URL: https://issues.apache.org/jira/browse/YARN-1896
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Siqi Li
> Attachments: YARN-1896.v1.patch, YARN-1896.v2.patch
>
>
> For FairScheduler, it's very useful to expose the MinimumQueueResource and 
> MaximumQueueResource of each queue in QueueMetrics. That way, people can use 
> monitoring graphs to see their current usage and their limits.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1805) Signal container request delivery from resourcemanager to nodemanager

2014-05-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989672#comment-13989672
 ] 

Hadoop QA commented on YARN-1805:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12643371/YARN-1805.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3695//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/3695//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3695//console

This message is automatically generated.

> Signal container request delivery from resourcemanager to nodemanager
> -
>
> Key: YARN-1805
> URL: https://issues.apache.org/jira/browse/YARN-1805
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: YARN-1805.patch
>
>
> 1. Update ResourceTracker's HeartbeatResponse to include the list of 
> SignalContainerRequest.
> 2. Upon receiving the request, NM's NodeStatusUpdater will deliver the 
> request to ContainerManager.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2020) observeOnly should be checked before any preemption computation started inside containerBasedPreemptOrKill() of ProportionalCapacityPreemptionPolicy.java

2014-05-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989643#comment-13989643
 ] 

Hadoop QA commented on YARN-2020:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12643377/YARN-2020.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3696//console

This message is automatically generated.

> observeOnly should be checked before any preemption computation started 
> inside containerBasedPreemptOrKill() of 
> ProportionalCapacityPreemptionPolicy.java
> -
>
> Key: YARN-2020
> URL: https://issues.apache.org/jira/browse/YARN-2020
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.4.0
> Environment: all
>Reporter: yeqi
>Priority: Trivial
> Fix For: 2.5.0
>
> Attachments: YARN-2020.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> observeOnly should be checked at the very beginning of 
> ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(), so as to 
> avoid unnecessary work.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2020) observeOnly should be checked before any preemption computation started inside containerBasedPreemptOrKill() of ProportionalCapacityPreemptionPolicy.java

2014-05-05 Thread yeqi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yeqi updated YARN-2020:
---

Attachment: YARN-2020.patch

Patch submitted.

> observeOnly should be checked before any preemption computation started 
> inside containerBasedPreemptOrKill() of 
> ProportionalCapacityPreemptionPolicy.java
> -
>
> Key: YARN-2020
> URL: https://issues.apache.org/jira/browse/YARN-2020
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.4.0
> Environment: all
>Reporter: yeqi
>Priority: Trivial
> Fix For: 2.5.0
>
> Attachments: YARN-2020.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> observeOnly should be checked at the very beginning of 
> ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(), so as to 
> avoid unnecessary work.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2020) observeOnly should be checked before any preemption computation started inside containerBasedPreemptOrKill() of ProportionalCapacityPreemptionPolicy.java

2014-05-05 Thread yeqi (JIRA)
yeqi created YARN-2020:
--

 Summary: observeOnly should be checked before any preemption 
computation started inside containerBasedPreemptOrKill() of 
ProportionalCapacityPreemptionPolicy.java
 Key: YARN-2020
 URL: https://issues.apache.org/jira/browse/YARN-2020
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.0
 Environment: all
Reporter: yeqi
Priority: Trivial
 Fix For: 2.5.0


observeOnly should be checked at the very beginning of 
ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(), so as to 
avoid unnecessary work.
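
In code terms, the suggestion is roughly the following early exit at the top 
of the method (sketch only; the rest of the method body stays as it is):
{code}
private void containerBasedPreemptOrKill(CSQueue root,
    Resource clusterResources) {
  // Suggested change (sketch): bail out before any preemption computation
  // when the policy is configured to only observe.
  if (observeOnly) {
    return;
  }
  // ... existing ideal-allocation computation and container selection ...
}
{code}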



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1805) Signal container request delivery from resourcemanager to nodemanager

2014-05-05 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated YARN-1805:
--

Attachment: YARN-1805.patch

The patch includes YARN-1803 and YARN-1897 so that Jenkins can build it.

> Signal container request delivery from resourcemanager to nodemanager
> -
>
> Key: YARN-1805
> URL: https://issues.apache.org/jira/browse/YARN-1805
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Ming Ma
> Attachments: YARN-1805.patch
>
>
> 1. Update ResourceTracker's HeartbeatResponse to include the list of 
> SignalContainerRequest.
> 2. Upon receiving the request, NM's NodeStatusUpdater will deliver the 
> request to ContainerManager.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-1805) Signal container request delivery from resourcemanager to nodemanager

2014-05-05 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma reassigned YARN-1805:
-

Assignee: Ming Ma

> Signal container request delivery from resourcemanager to nodemanager
> -
>
> Key: YARN-1805
> URL: https://issues.apache.org/jira/browse/YARN-1805
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: YARN-1805.patch
>
>
> 1. Update ResourceTracker's HeartbeatResponse to include the list of 
> SignalContainerRequest.
> 2. Upon receiving the request, NM's NodeStatusUpdater will deliver the 
> request to ContainerManager.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1201) TestAMAuthorization fails with local hostname cannot be resolved

2014-05-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989571#comment-13989571
 ] 

Hudson commented on YARN-1201:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1775 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1775/])
YARN-1201. TestAMAuthorization fails with local hostname cannot be resolved. 
(Wangda Tan via junping_du) (junping_du: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1592197)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestAMAuthorization.java


> TestAMAuthorization fails with local hostname cannot be resolved
> 
>
> Key: YARN-1201
> URL: https://issues.apache.org/jira/browse/YARN-1201
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.1.0-beta
> Environment: SUSE Linux Enterprise Server 11 (x86_64)
>Reporter: Nemon Lou
>Assignee: Wangda Tan
>Priority: Minor
> Fix For: 2.4.1
>
> Attachments: YARN-1201.patch, YARN-1201.patch, YARN-1201.patch, 
> YARN-1201.patch, YARN-1201.patch, YARN-1201.patch, YARN-1201.patch
>
>
> When hostname is 158-1-131-10, TestAMAuthorization fails.
> {code}
> Running org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
> Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 14.034 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
> testUnauthorizedAccess[0](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization)
>   Time elapsed: 3.952 sec  <<< ERROR!
> java.lang.NullPointerException: null
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:284)
> testUnauthorizedAccess[1](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization)
>   Time elapsed: 3.116 sec  <<< ERROR!
> java.lang.NullPointerException: null
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:284)
> Results :
> Tests in error:
>   TestAMAuthorization.testUnauthorizedAccess:284 NullPointer
>   TestAMAuthorization.testUnauthorizedAccess:284 NullPointer
> Tests run: 4, Failures: 0, Errors: 2, Skipped: 0
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore

2014-05-05 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989487#comment-13989487
 ] 

Junping Du commented on YARN-2019:
--

The bad news: the exception could be repeated on the new active RM, since the 
ZKRMStateStore is shared. Am I missing anything here?

> Retrospect on decision of making RM crashed if any exception throw in 
> ZKRMStateStore
> 
>
> Key: YARN-2019
> URL: https://issues.apache.org/jira/browse/YARN-2019
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Priority: Critical
>  Labels: ha
>
> Currently, if anything abnormal happens in ZKRMStateStore, it throws a fatal 
> exception that crashes the RM. As shown in YARN-1924, it could be due to an 
> internal RM HA bug itself rather than a truly fatal condition. We should 
> revisit this decision, as the HA feature is designed to protect the key 
> component, not disturb it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1201) TestAMAuthorization fails with local hostname cannot be resolved

2014-05-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989451#comment-13989451
 ] 

Hudson commented on YARN-1201:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1749 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1749/])
YARN-1201. TestAMAuthorization fails with local hostname cannot be resolved. 
(Wangda Tan via junping_du) (junping_du: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1592197)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestAMAuthorization.java


> TestAMAuthorization fails with local hostname cannot be resolved
> 
>
> Key: YARN-1201
> URL: https://issues.apache.org/jira/browse/YARN-1201
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.1.0-beta
> Environment: SUSE Linux Enterprise Server 11 (x86_64)
>Reporter: Nemon Lou
>Assignee: Wangda Tan
>Priority: Minor
> Fix For: 2.4.1
>
> Attachments: YARN-1201.patch, YARN-1201.patch, YARN-1201.patch, 
> YARN-1201.patch, YARN-1201.patch, YARN-1201.patch, YARN-1201.patch
>
>
> When hostname is 158-1-131-10, TestAMAuthorization fails.
> {code}
> Running org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
> Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 14.034 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
> testUnauthorizedAccess[0](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization)
>   Time elapsed: 3.952 sec  <<< ERROR!
> java.lang.NullPointerException: null
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:284)
> testUnauthorizedAccess[1](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization)
>   Time elapsed: 3.116 sec  <<< ERROR!
> java.lang.NullPointerException: null
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:284)
> Results :
> Tests in error:
>   TestAMAuthorization.testUnauthorizedAccess:284 NullPointer
>   TestAMAuthorization.testUnauthorizedAccess:284 NullPointer
> Tests run: 4, Failures: 0, Errors: 2, Skipped: 0
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1201) TestAMAuthorization fails with local hostname cannot be resolved

2014-05-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989424#comment-13989424
 ] 

Hudson commented on YARN-1201:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #558 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/558/])
YARN-1201. TestAMAuthorization fails with local hostname cannot be resolved. 
(Wangda Tan via junping_du) (junping_du: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1592197)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestAMAuthorization.java


> TestAMAuthorization fails with local hostname cannot be resolved
> 
>
> Key: YARN-1201
> URL: https://issues.apache.org/jira/browse/YARN-1201
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.1.0-beta
> Environment: SUSE Linux Enterprise Server 11 (x86_64)
>Reporter: Nemon Lou
>Assignee: Wangda Tan
>Priority: Minor
> Fix For: 2.4.1
>
> Attachments: YARN-1201.patch, YARN-1201.patch, YARN-1201.patch, 
> YARN-1201.patch, YARN-1201.patch, YARN-1201.patch, YARN-1201.patch
>
>
> When hostname is 158-1-131-10, TestAMAuthorization fails.
> {code}
> Running org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
> Tests run: 4, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 14.034 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
> testUnauthorizedAccess[0](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization)
>   Time elapsed: 3.952 sec  <<< ERROR!
> java.lang.NullPointerException: null
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:284)
> testUnauthorizedAccess[1](org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization)
>   Time elapsed: 3.116 sec  <<< ERROR!
> java.lang.NullPointerException: null
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization.testUnauthorizedAccess(TestAMAuthorization.java:284)
> Results :
> Tests in error:
>   TestAMAuthorization.testUnauthorizedAccess:284 NullPointer
>   TestAMAuthorization.testUnauthorizedAccess:284 NullPointer
> Tests run: 4, Failures: 0, Errors: 2, Skipped: 0
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)