[jira] [Created] (YARN-4235) FairScheduler PrimaryGroup does not handle empty groups returned for a user

2015-10-07 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4235:
---

 Summary: FairScheduler PrimaryGroup does not handle empty groups 
returned for a user 
 Key: YARN-4235
 URL: https://issues.apache.org/jira/browse/YARN-4235
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


We see an exception if empty groups are returned for a user, which causes the RM to crash as shown below.

{noformat}
2015-09-22 16:51:52,780 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ADDED to the scheduler
java.lang.IndexOutOfBoundsException: Index: 0
at java.util.Collections$EmptyList.get(Collections.java:3212)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule$PrimaryGroup.getQueueForApp(QueuePlacementRule.java:149)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule.assignAppToQueue(QueuePlacementRule.java:74)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementPolicy.assignAppToQueue(QueuePlacementPolicy.java:167)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:689)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:595)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1180)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:111)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
at java.lang.Thread.run(Thread.java:745)
2015-09-22 16:51:52,797 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
{noformat}
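
A minimal sketch of the kind of guard PrimaryGroup needs, assuming the group lookup returns a possibly empty List<String>; the method name and fallback value are illustrative, not the committed patch:

{code}
// Sketch only: treat an empty group list as "rule does not match" instead of
// unconditionally calling get(0) on it.
String primaryGroupQueue(List<String> groupList) {
  if (groupList == null || groupList.isEmpty()) {
    return "";   // empty result lets the placement policy fall through to the next rule
  }
  return "root." + groupList.get(0);
}
{code}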



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo

2015-09-23 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4204:
---

 Summary: ConcurrentModificationException in FairSchedulerQueueInfo
 Key: YARN-4204
 URL: https://issues.apache.org/jira/browse/YARN-4204
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Saw this exception
{noformat}
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.<init>(FairSchedulerQueueInfo.java:100)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.<init>(FairSchedulerInfo.java:46)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84)
at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589)
at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291)
at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552)
at org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:84)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1279)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at
{noformat}
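
One way to avoid the race, sketched here under the assumption that the web DAO can tolerate a slightly stale snapshot; queue, scheduler and childQueues refer to the surrounding FairSchedulerQueueInfo code, and this is not necessarily the committed fix:

{code}
// Sketch only: iterate a defensive copy of the child queue list so a concurrent
// modification by the scheduler cannot throw ConcurrentModificationException here.
List<FSQueue> children = new ArrayList<>(queue.getChildQueues());
for (FSQueue child : children) {
  childQueues.add(new FairSchedulerQueueInfo(child, scheduler));
}
{code}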

[jira] [Created] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry

2015-09-18 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4185:
---

 Summary: Retry interval delay for NM client can be improved from 
the fixed static retry 
 Key: YARN-4185
 URL: https://issues.apache.org/jira/browse/YARN-4185
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Instead of having a fixed retry interval that starts off very high and stays there, we are better off using an exponential backoff with the same fixed maximum limit. Today the retry interval is fixed at 10 seconds, which can be unnecessarily high, especially when NMs can complete a rolling restart within a second.
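
A minimal sketch of the capped exponential backoff being proposed; the base delay and cap below are placeholders, not the values the NM client would actually use:

{code}
// Sketch only: grow the delay exponentially per attempt, but never exceed the
// same maximum the current fixed interval uses.
long computeRetryDelayMs(int attempt) {
  final long baseDelayMs = 100;                         // placeholder initial delay
  final long maxDelayMs = 10_000;                       // cap at today's 10s interval
  long delay = baseDelayMs << Math.min(attempt, 30);    // bound the shift to avoid overflow
  return Math.min(delay, maxDelayMs);
}
{code}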



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4184) Remove update reservation state api from state store as its not used by ReservationSystem

2015-09-18 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4184:
---

 Summary: Remove update reservation state api from state store as 
its not used by ReservationSystem
 Key: YARN-4184
 URL: https://issues.apache.org/jira/browse/YARN-4184
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


ReservationSystem uses remove/add for updates, so the update API in the state store is not needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4180) AMLauncher does not retry on failures when talking to NM

2015-09-17 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4180:
---

 Summary: AMLauncher does not retry on failures when talking to NM 
 Key: YARN-4180
 URL: https://issues.apache.org/jira/browse/YARN-4180
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


We see issues with the RM trying to launch a container while an NM is restarting, and we get exceptions like NMNotReadyException. While YARN-3842 added retries for other NM clients (mainly AMs), they are not used by AMLauncher in the RM, so these intermittent errors cause job failures. This can manifest during rolling restarts of NMs.
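
A hedged sketch of the retry loop this would add around the NM call; launchOnNM is a placeholder, and the real change would more likely reuse the NMProxy retry policy added in YARN-3842:

{code}
// Sketch only: retry the container launch instead of failing the attempt on the
// first NMNotReadyException while the NM is restarting.
void launchWithRetries(int maxAttempts) throws Exception {
  for (int attempt = 1; ; attempt++) {
    try {
      launchOnNM();                      // placeholder for the actual startContainers call
      return;
    } catch (Exception e) {              // e.g. NMNotReadyException during NM restart
      if (attempt >= maxAttempts) {
        throw e;                         // give up after the configured number of attempts
      }
      Thread.sleep(1000L * attempt);     // simple backoff between attempts
    }
  }
}
{code}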



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4150) Failure in TestNMClient because nodereports were not available

2015-09-11 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4150:
---

 Summary: Failure in TestNMClient because nodereports were not 
available
 Key: YARN-4150
 URL: https://issues.apache.org/jira/browse/YARN-4150
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Saw a failure in a test run




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4144) Add NM that causes LaunchFailedTransition to blacklist

2015-09-10 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4144:
---

 Summary: Add NM that causes LaunchFailedTransition to blacklist
 Key: YARN-4144
 URL: https://issues.apache.org/jira/browse/YARN-4144
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


During the discussion of YARN-2005 it was noted that we need to add more cases where blacklisting can occur. This tracks making launch failures via LaunchFailedTransition also contribute to blacklisting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4145) Make RMHATestBase abstract so its not run when running all tests under that namespace

2015-09-10 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4145:
---

 Summary: Make RMHATestBase abstract so its not run when running 
all tests under that namespace
 Key: YARN-4145
 URL: https://issues.apache.org/jira/browse/YARN-4145
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Priority: Minor


Trivial patch to make it abstract



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType

2015-09-10 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4143:
---

 Summary: Optimize the check for AMContainer allocation needed by 
blacklisting and ContainerType
 Key: YARN-4143
 URL: https://issues.apache.org/jira/browse/YARN-4143
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot


In YARN-2005 there are checks made to determine whether the allocation is for an AM container. This happens on every allocate call and should be optimized away, since the answer changes only once per SchedulerApplicationAttempt.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4115) Reduce loglevel of ContainerManagementProtocolProxy to Debug

2015-09-04 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4115:
---

 Summary: Reduce loglevel of ContainerManagementProtocolProxy to 
Debug
 Key: YARN-4115
 URL: https://issues.apache.org/jira/browse/YARN-4115
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Priority: Minor


We see log spam like the following:
Aug 28, 1:57:52.441 PM INFO org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy Opening proxy : :8041



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4077) FairScheduler Reservation should wait for most relaxed scheduling delay permitted before issuing reservation

2015-08-24 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4077:
---

 Summary: FairScheduler Reservation should wait for most relaxed 
scheduling delay permitted before issuing reservation
 Key: YARN-4077
 URL: https://issues.apache.org/jira/browse/YARN-4077
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot


Today, if an allocation has a node-local request that allows for relaxation, we do not wait for the relaxation delay before issuing the reservation. This can be too aggressive. Instead we should let the scheduling relaxation delays expire before we choose to reserve a node for the container. This allows the request to be satisfied on a different node instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4076) FairScheduler does not allow AM to choose which containers to preempt

2015-08-24 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4076:
---

 Summary: FairScheduler does not allow AM to choose which 
containers to preempt
 Key: YARN-4076
 URL: https://issues.apache.org/jira/browse/YARN-4076
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


CapacityScheduler allows the AM to choose which containers will be preempted. See the comment about the corresponding work pending for FairScheduler:
https://issues.apache.org/jira/browse/YARN-568?focusedCommentId=13649126&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13649126



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4046) NM container recovery is broken on some linux distro because of syntax of signal

2015-08-11 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4046:
---

 Summary: NM container recovery is broken on some linux distro 
because of syntax of signal
 Key: YARN-4046
 URL: https://issues.apache.org/jira/browse/YARN-4046
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Priority: Critical


On a Debian machine we have seen NodeManager recovery of containers fail because the signal syntax for a process group may not work. We see errors when checking whether the process is alive during container recovery, which causes the container to be declared LOST (exit code 154) on a NodeManager restart.

The application then fails with an error like:
{noformat}
Application application_1439244348718_0001 failed 1 times due to Attempt 
recovered after RM restartAM Container for appattempt_1439244348718_0001_01 
exited with exitCode: 154
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4032) Corrupted state from a previous version can still cause RM to fail with NPE due to same reasons as YARN-2834

2015-08-07 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4032:
---

 Summary: Corrupted state from a previous version can still cause 
RM to fail with NPE due to same reasons as YARN-2834
 Key: YARN-4032
 URL: https://issues.apache.org/jira/browse/YARN-4032
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


YARN-2834 ensures that in 2.6.0 there will not be any inconsistent state. But if someone is upgrading from a previous version, the state can still be inconsistent, and the RM will then still fail with an NPE after the upgrade to 2.6.0.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4030) Make Nodemanager cgroup usage for container easier to use when its running inside a cgroup

2015-08-07 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4030:
---

 Summary: Make Nodemanager cgroup usage for container easier to use 
when its running inside a cgroup 
 Key: YARN-4030
 URL: https://issues.apache.org/jira/browse/YARN-4030
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Today the NodeManager uses the cgroup prefix pointed to by yarn.nodemanager.linux-container-executor.cgroups.hierarchy (default value /hadoop-yarn) directly under the controller path, e.g. /sys/fs/cgroup/cpu/hadoop-yarn.

If there are NodeManagers running inside Docker containers on a host, each would typically be separated by a cgroup under the controller path, say /sys/fs/cgroup/cpu/docker/dockerid1/nmcgroup for NM1 and /sys/fs/cgroup/cpu/docker/dockerid2/nmcgroup for NM2.

In this case the correct behavior would be to use the Docker cgroup paths:
/sys/fs/cgroup/cpu/docker/dockerid1/hadoop-yarn for NM1
/sys/fs/cgroup/cpu/docker/dockerid2/hadoop-yarn for NM2
But the default behavior makes both NMs try to use /sys/fs/cgroup/cpu/hadoop-yarn, which is incorrect and would usually fail based on the permissions setup.
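
A hedged sketch of how the NM could derive the prefix relative to its own cgroup instead of assuming the controller root; the parsing of /proc/self/cgroup is simplified and the method is illustrative only:

{code}
// Sketch only: find the cgroup this NM process already lives in for a controller
// (e.g. "cpu") and nest the hadoop-yarn hierarchy underneath it.
String yarnCGroupPath(String controllerMount, String controller) throws IOException {
  for (String line : Files.readAllLines(Paths.get("/proc/self/cgroup"))) {
    String[] parts = line.split(":", 3);               // format: id:controller-list:path
    if (parts.length == 3 && Arrays.asList(parts[1].split(",")).contains(controller)) {
      // e.g. /sys/fs/cgroup/cpu + /docker/dockerid1 + /hadoop-yarn
      return controllerMount + parts[2] + "/hadoop-yarn";
    }
  }
  return controllerMount + "/hadoop-yarn";             // fall back to today's behavior
}
{code}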





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4021) RuntimeException/YarnRuntimeException sent over to the client can cause client to assume a local fatal failure

2015-08-05 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4021:
---

 Summary: RuntimeException/YarnRuntimeException sent over to the 
client can cause client to assume a local fatal failure 
 Key: YARN-4021
 URL: https://issues.apache.org/jira/browse/YARN-4021
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Currently RuntimeException and its derived types, such as YarnRuntimeException, are serialized over to the client and rethrown at the client after YARN-731. This can cause issues like MAPREDUCE-6439, where we assume a local fatal exception has happened.
Instead we should have a way to distinguish a local RuntimeException from a remote one to avoid these issues. We also need to go over all the current client-side code that expects a remote RuntimeException in order to make it work with this change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3996) YARN-789 (Support for zero capabilities in fairscheduler) is broken after YARN-3305

2015-07-29 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3996:
---

 Summary: YARN-789 (Support for zero capabilities in fairscheduler) 
is broken after YARN-3305
 Key: YARN-3996
 URL: https://issues.apache.org/jira/browse/YARN-3996
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, fairscheduler
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Priority: Critical


RMAppManager#validateAndCreateResourceRequest calls into normalizeRequest with minimumResource as the incrementResource. This causes normalize to return zero if the minimum is set to zero as per YARN-789.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs

2015-07-27 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3985:
---

 Summary: Make ReservationSystem persist state using RMStateStore 
reservation APIs 
 Key: YARN-3985
 URL: https://issues.apache.org/jira/browse/YARN-3985
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


YARN-3736 adds the RMStateStore APIs to store and load reservation state. This JIRA adds the actual storing of state from the ReservationSystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3957) FairScheduler NPE In FairSchedulerQueueInfo causing scheduler page to return 500

2015-07-22 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3957:
---

 Summary: FairScheduler NPE In FairSchedulerQueueInfo causing 
scheduler page to return 500
 Key: YARN-3957
 URL: https://issues.apache.org/jira/browse/YARN-3957
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


There is an NPE causing the web page at http://localhost:23188/cluster/scheduler to return a 500.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3961) Expose queue container information (pending, running, reserved) in UI and yarn top

2015-07-22 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3961:
---

 Summary: Expose queue container information (pending, running, 
reserved) in UI and yarn top
 Key: YARN-3961
 URL: https://issues.apache.org/jira/browse/YARN-3961
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler, webapp
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


It would be nice to expose container information (allocated, pending, reserved) in the UI and in the yarn top tool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3920) FairScheduler Reserving a node for a container should be configurable to allow it used only for large containers

2015-07-13 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3920:
---

 Summary: FairScheduler Reserving a node for a container should be 
configurable to allow it used only for large containers
 Key: YARN-3920
 URL: https://issues.apache.org/jira/browse/YARN-3920
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Reserving a node for a container was designed to prevent large containers from being starved by small requests that keep landing on a node. Today we allow this even for a small container request. This has a huge impact on scheduling, since we block other scheduling requests until that reservation is fulfilled. We should make this configurable so its impact can be minimized by limiting it to large container requests, as originally intended.
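
A hedged sketch of the configurable check being proposed; the threshold semantics and any config key name are illustrative only:

{code}
// Sketch only: only reserve a node for "large" requests, e.g. requests needing
// more than a configurable fraction of the node's capacity.
boolean shouldReserve(long requestMemoryMb, long nodeMemoryMb, float reservableFraction) {
  return requestMemoryMb > nodeMemoryMb * reservableFraction;
}
{code}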



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3900) Protobuf layout of yarn_security_token causes errors in other protos that include it

2015-07-08 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3900:
---

 Summary: Protobuf layout  of yarn_security_token causes errors in 
other protos that include it
 Key: YARN-3900
 URL: https://issues.apache.org/jira/browse/YARN-3900
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Because of the subdirectory server used in
{{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}}, there are errors in other protos that include it.
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3890) FairScheduler should show the scheduler health metrics similar to ones added in CapacityScheduler

2015-07-06 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3890:
---

 Summary: FairScheduler should show the scheduler health metrics 
similar to ones added in CapacityScheduler
 Key: YARN-3890
 URL: https://issues.apache.org/jira/browse/YARN-3890
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: Anubhav Dhoot


We should add the information displayed in YARN-3293 to FairScheduler as well, possibly sharing the implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3800) Simplify inmemory state for ReservationAllocation

2015-06-12 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3800:
---

 Summary: Simplify inmemory state for ReservationAllocation
 Key: YARN-3800
 URL: https://issues.apache.org/jira/browse/YARN-3800
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Instead of storing the ReservationRequest, we store the Resource for allocations, as that's the only thing we need. Ultimately we convert everything to resources anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3675) FairScheduler: RM quits when node removal races with continousscheduling on the same node

2015-05-18 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3675:
---

 Summary: FairScheduler: RM quits when node removal races with 
continousscheduling on the same node
 Key: YARN-3675
 URL: https://issues.apache.org/jira/browse/YARN-3675
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


With continuous scheduling, scheduling can be done on a node that has just been removed, causing errors like the one below.

{noformat}
12:28:53.782 AM FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager

Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.unreserve(FSAppAttempt.java:469)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:815)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:763)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1217)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:111)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
at java.lang.Thread.run(Thread.java:745)
12:28:53.783 AM INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager Exiting, bbye..
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3392) Change NodeManager metrics to not populate resource usage metrics if they are unavailable

2015-04-29 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot resolved YARN-3392.
-
Resolution: Duplicate

 Change NodeManager metrics to not populate resource usage metrics if they are 
 unavailable 
 --

 Key: YARN-3392
 URL: https://issues.apache.org/jira/browse/YARN-3392
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
 Attachments: YARN-3392.prelim.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3533) Test: Fix launchAM in MockRM to wait for attempt to be scheduled

2015-04-22 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3533:
---

 Summary: Test: Fix launchAM in MockRM to wait for attempt to be 
scheduled
 Key: YARN-3533
 URL: https://issues.apache.org/jira/browse/YARN-3533
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.6.0
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


MockRM#launchAM fails in many test runs because it does not wait for the app attempt to be scheduled before the NM update is sent, as noted in [recent builds|https://issues.apache.org/jira/browse/YARN-3387?focusedCommentId=14507255&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14507255].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3424) Reduce log for ContainerMonitorImpl resoure monitoring from info to debug

2015-03-31 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3424:
---

 Summary: Reduce log for ContainerMonitorImpl resoure monitoring 
from info to debug
 Key: YARN-3424
 URL: https://issues.apache.org/jira/browse/YARN-3424
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Today we log the memory usage of a process at INFO level, which spams the log with hundreds of log lines. Proposing to change this to DEBUG level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3392) Change NodeManager metrics to not populate resource usage metrics if they are unavailable

2015-03-23 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3392:
---

 Summary: Change NodeManager metrics to not populate resource usage 
metrics if they are unavailable 
 Key: YARN-3392
 URL: https://issues.apache.org/jira/browse/YARN-3392
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3351) AppMaster tracking URL is broken in HA

2015-03-16 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3351:
---

 Summary: AppMaster tracking URL is broken in HA
 Key: YARN-3351
 URL: https://issues.apache.org/jira/browse/YARN-3351
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


After YARN-2713, the AppMaster link is broken in HA.
The log and full stack trace are shown below.
{noformat}
2015-02-05 20:47:43,478 WARN org.mortbay.log: /proxy/application_1423182188062_0002/: java.net.BindException: Cannot assign requested address
{noformat}
{noformat}
java.net.BindException: Cannot assign requested address
at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376)
at java.net.Socket.bind(Socket.java:631)
at java.net.Socket.<init>(Socket.java:423)
at java.net.Socket.<init>(Socket.java:280)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346)
at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:188)
at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:345)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3259) FairScheduler: Update to fairShare could be triggered early on node events instead of waiting for update interval

2015-02-25 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3259:
---

 Summary: FairScheduler: Update to fairShare could be triggered 
early on node events instead of waiting for update interval 
 Key: YARN-3259
 URL: https://issues.apache.org/jira/browse/YARN-3259
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Instead of waiting for the update interval unconditionally, we can trigger early updates on important events, e.g. node join and node leave.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3258) FairScheduler: Need to add more logging to investigate allocations

2015-02-25 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3258:
---

 Summary: FairScheduler: Need to add more logging to investigate 
allocations
 Key: YARN-3258
 URL: https://issues.apache.org/jira/browse/YARN-3258
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Priority: Minor


It's hard to investigate allocation failures without any logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3257) FairScheduler: MaxAm may be set too low preventing apps from starting

2015-02-25 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3257:
---

 Summary: FairScheduler: MaxAm may be set too low preventing apps 
from starting
 Key: YARN-3257
 URL: https://issues.apache.org/jira/browse/YARN-3257
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: Anubhav Dhoot


In YARN-2637, CapacityScheduler's LeafQueue does not enforce the max AM share if the limit would prevent the first application from starting. This would be good to add to FSLeafQueue as well.
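
A hedged sketch of the FSLeafQueue check with the CapacityScheduler-style exception; the names are illustrative, not the committed patch:

{code}
// Sketch only: never let maxAMShare block the queue completely; the first
// application's AM is always allowed to start.
boolean canRunAppAM(Resource amResource) {
  if (getNumRunnableApps() == 0) {
    return true;                                        // first app: always allow its AM
  }
  Resource maxAMResource = Resources.multiply(getFairShare(), maxAMShare);
  Resource amUsageAfterLaunch = Resources.add(amResourceUsage, amResource);
  return Resources.fitsIn(amUsageAfterLaunch, maxAMResource);
}
{code}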



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3256) TestClientToAMToken#testClientTokenRace is not running against all Schedulers even when using ParameterizedSchedulerTestBase

2015-02-25 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3256:
---

 Summary: TestClientToAMToken#testClientTokenRace is not running 
against all Schedulers even when using ParameterizedSchedulerTestBase
 Key: YARN-3256
 URL: https://issues.apache.org/jira/browse/YARN-3256
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


The test testClientTokenRace was not using the base class conf, causing it to run twice against the same scheduler configured as the default.
All tests deriving from ParameterizedSchedulerTestBase should use the conf created in the base class instead of newing one up inside the test and hiding the member.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3229) Incorrect processing of container as LOST on Interruption during NM shutdown

2015-02-19 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3229:
---

 Summary: Incorrect processing of container as LOST on Interruption 
during NM shutdown
 Key: YARN-3229
 URL: https://issues.apache.org/jira/browse/YARN-3229
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot


YARN-2846 fixed the issue of incorrectly writing to the state store that the process is LOST. But even after that we still process the ContainerExitEvent. If notInterrupted is false in RecoveredContainerLaunch#call, we should skip the following:
{noformat}
if (retCode != 0) {
  LOG.warn("Recovered container exited with a non-zero exit code "
      + retCode);
  this.dispatcher.getEventHandler().handle(new ContainerExitEvent(
      containerId,
      ContainerEventType.CONTAINER_EXITED_WITH_FAILURE, retCode,
      "Container exited with a non-zero exit code " + retCode));
  return retCode;
}
{noformat}
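
A hedged sketch of the guard, reusing the variable names from the snippet above (not the committed patch):

{code}
// Sketch only: only report a failure when the wait was not interrupted; an
// interruption during NM shutdown should not mark the container as failed.
if (notInterrupted && retCode != 0) {
  LOG.warn("Recovered container exited with a non-zero exit code " + retCode);
  this.dispatcher.getEventHandler().handle(new ContainerExitEvent(
      containerId, ContainerEventType.CONTAINER_EXITED_WITH_FAILURE, retCode,
      "Container exited with a non-zero exit code " + retCode));
  return retCode;
}
{code}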



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3209) RM and NM state should be added to the list of Hadoop Compatibility File list

2015-02-17 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3209:
---

 Summary: RM and NM state should be added to the list of Hadoop 
Compatibility File list 
 Key: YARN-3209
 URL: https://issues.apache.org/jira/browse/YARN-3209
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Reporter: Anubhav Dhoot


The Hadoop Compatibility guide lists the internal file formats used by different components at
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#System-internal_file_formats

We should add the NodeManager recovery state and the ResourceManager ZK state to that list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3184) Inefficient iteration of map

2015-02-11 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot resolved YARN-3184.
-
Resolution: Duplicate

 Inefficient iteration of map
 

 Key: YARN-3184
 URL: https://issues.apache.org/jira/browse/YARN-3184
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Priority: Minor
 Attachments: YARN-3184.001.patch


 Iteration of keySet and then lookup of value is not as efficient as iterating 
 the entrySet



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3184) Inefficient iteration of map

2015-02-11 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3184:
---

 Summary: Inefficient iteration of map
 Key: YARN-3184
 URL: https://issues.apache.org/jira/browse/YARN-3184
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Priority: Minor


Iteration of keySet and then lookup of value is not as efficient as iterating 
the entrySet
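
For illustration, the difference being referred to, over a hypothetical Map<String, Long> named counts (not the actual patch):

{code}
// Iterating the keySet forces an extra map.get() hash lookup per key.
long slowSum = 0;
for (String key : counts.keySet()) {
  slowSum += counts.get(key);
}

// Iterating the entrySet reads the value straight from the entry.
long fastSum = 0;
for (Map.Entry<String, Long> e : counts.entrySet()) {
  fastSum += e.getValue();
}
{code}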



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3138) TestFairScheduler#testContinuousScheduling fails intermittently on trunk

2015-02-04 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3138:
---

 Summary: TestFairScheduler#testContinuousScheduling fails 
intermittently on trunk
 Key: YARN-3138
 URL: https://issues.apache.org/jira/browse/YARN-3138
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: Anubhav Dhoot


This test failed randomly in a precheckin and passed on rerun

https://builds.apache.org/job/PreCommit-YARN-Build/6497//testReport/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3138) TestFairScheduler#testContinuousScheduling fails intermittently on trunk

2015-02-04 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot resolved YARN-3138.
-
Resolution: Duplicate

 TestFairScheduler#testContinuousScheduling fails intermittently on trunk
 

 Key: YARN-3138
 URL: https://issues.apache.org/jira/browse/YARN-3138
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: Anubhav Dhoot

 This test failed randomly in a precheckin and passed on rerun
 https://builds.apache.org/job/PreCommit-YARN-Build/6497//testReport/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3122) Metrics for container's actual CPU usage

2015-01-30 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3122:
---

 Summary: Metrics for container's actual CPU usage
 Key: YARN-3122
 URL: https://issues.apache.org/jira/browse/YARN-3122
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Anubhav Dhoot
Assignee: Karthik Kambatla
 Fix For: 2.7.0


It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track CPU usage.

YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3121) FairScheduler preemption metrics

2015-01-30 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3121:
---

 Summary: FairScheduler preemption metrics
 Key: YARN-3121
 URL: https://issues.apache.org/jira/browse/YARN-3121
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Reporter: Anubhav Dhoot


Add FSQueueMetrics for preemption-related information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3101) FairScheduler#fitInMaxShare was added to validate reservations but it does not consider it

2015-01-26 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3101:
---

 Summary: FairScheduler#fitInMaxShare was added to validate 
reservations but it does not consider it 
 Key: YARN-3101
 URL: https://issues.apache.org/jira/browse/YARN-3101
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: Anubhav Dhoot


YARN-2811 added fitInMaxShare to validate reservations on a queue, but did not count the reservation during its calculations. It also had the condition reversed, so the test was still passing because the two errors cancelled each other out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3082) Non thread safe access to systemCredentials in NodeHeartbeatResponse processing

2015-01-21 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3082:
---

 Summary: Non thread safe access to systemCredentials in 
NodeHeartbeatResponse processing
 Key: YARN-3082
 URL: https://issues.apache.org/jira/browse/YARN-3082
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


When you use system credentials via the feature added in YARN-2704, the proto conversion code throws an exception when converting the ByteBuffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3027) Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation

2015-01-09 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3027:
---

 Summary: Scheduler should use totalAvailable resource from node 
instead of availableResource for maxAllocation
 Key: YARN-3027
 URL: https://issues.apache.org/jira/browse/YARN-3027
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


YARN-2604 added support for updating the maximum allocation resource size based on nodes. But it incorrectly uses the available resource instead of the maximum resource.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3022) Expose Container resource information from NodeManager for monitoring

2015-01-08 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3022:
---

 Summary: Expose Container resource information from NodeManager 
for monitoring
 Key: YARN-3022
 URL: https://issues.apache.org/jira/browse/YARN-3022
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


Along with exposing the resource consumption of each container (such as YARN-2141), it is worth exposing the actual resource limit associated with it, to get better insight into YARN allocation and consumption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2574) Add support for FairScheduler to the ReservationSystem

2015-01-06 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot resolved YARN-2574.
-
   Resolution: Fixed
Fix Version/s: 2.7.0

 Add support for FairScheduler to the ReservationSystem
 --

 Key: YARN-2574
 URL: https://issues.apache.org/jira/browse/YARN-2574
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Subru Krishnan
Assignee: Anubhav Dhoot
 Fix For: 2.7.0


 YARN-1051 introduces the ReservationSystem and the current implementation is 
 based on CapacityScheduler. This JIRA proposes adding support for 
 FairScheduler



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3008) FairScheduler: Use lock for queuemanager instead of synchronized on FairScheduler

2015-01-05 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-3008:
---

 Summary: FairScheduler: Use lock for queuemanager instead of 
synchronized on FairScheduler
 Key: YARN-3008
 URL: https://issues.apache.org/jira/browse/YARN-3008
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Reporter: Anubhav Dhoot


Instead of a big monolithic lock on FairScheduler, we can have an explicit lock on the QueueManager and revisit all synchronized methods in FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2998) Abstract out scheduler independant PlanFollower components into AbstractSchedulerPLanFollower

2014-12-30 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2998:
---

 Summary: Abstract out scheduler independant PlanFollower 
components into AbstractSchedulerPLanFollower
 Key: YARN-2998
 URL: https://issues.apache.org/jira/browse/YARN-2998
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Anubhav Dhoot






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2982) Use ReservationQueueConfiguration in CapacityScheduler

2014-12-19 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2982:
---

 Summary: Use ReservationQueueConfiguration in CapacityScheduler
 Key: YARN-2982
 URL: https://issues.apache.org/jira/browse/YARN-2982
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Anubhav Dhoot


ReservationQueueConfiguration is common to reservations irrespective of the scheduler. It would be good to have CapacityScheduler also support it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2881) Implement PlanFollower for FairScheduler

2014-11-20 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2881:
---

 Summary: Implement PlanFollower for FairScheduler
 Key: YARN-2881
 URL: https://issues.apache.org/jira/browse/YARN-2881
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2773) ReservationSystem's use of Queue names vs paths is inconsistent for CapacityReservationSystem and FairReservationSystem

2014-10-29 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2773:
---

 Summary: ReservationSystem's use of Queue names vs paths is 
inconsistent for CapacityReservationSystem and FairReservationSystem  
 Key: YARN-2773
 URL: https://issues.apache.org/jira/browse/YARN-2773
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Anubhav Dhoot
Priority: Minor


The reservation system requires the ReservationDefinition to use a queue name to choose which reservation queue is being used. CapacityScheduler does not allow duplicate leaf queue names. Because of this we can refer to a unique leaf queue simply by its name rather than its full path (which includes parentName + "."). FairScheduler allows duplicate leaf queue names, because of which one needs to refer to the full queue name to identify a queue uniquely. This is inconsistent for the implementation of AbstractReservationSystem, where one implementation of getQueuePath does the conversion (CapacityReservationSystem) while FairReservationSystem returns the same value back.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2738) Add FairReservationSystem for FairScheduler

2014-10-24 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2738:
---

 Summary: Add FairReservationSystem for FairScheduler
 Key: YARN-2738
 URL: https://issues.apache.org/jira/browse/YARN-2738
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Anubhav Dhoot


We need to create a FairReservationSystem that implements ReservationSystem for FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2690) Make ReservationSystem and its dependent classes independent of Scheduler type

2014-10-15 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2690:
---

 Summary: Make ReservationSystem and its dependent classes 
independent of Scheduler type  
 Key: YARN-2690
 URL: https://issues.apache.org/jira/browse/YARN-2690
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Anubhav Dhoot


A lot of common reservation classes depend on CapacityScheduler, and specifically on its configuration. This JIRA is to make them ready for other schedulers by abstracting out the configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2661) Container Localization is not resource limited

2014-10-08 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2661:
---

 Summary: Container Localization is not resource limited
 Key: YARN-2661
 URL: https://issues.apache.org/jira/browse/YARN-2661
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Anubhav Dhoot


Container localization itself can take up a lot of resources. Today this is not resource-limited in any way and can adversely affect the actual containers running on the node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2624) Resource Localization fails on a secure cluster until nm are restarted

2014-09-29 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2624:
---

 Summary: Resource Localization fails on a secure cluster until nm 
are restarted
 Key: YARN-2624
 URL: https://issues.apache.org/jira/browse/YARN-2624
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


We have found that resource localization fails on a secure cluster with the following error in certain cases. This happens at some indeterminate point, after which it keeps failing until the NM is restarted.

{noformat}
INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING}
java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27
at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228)
at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2224) Let TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective of the default settings

2014-06-27 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2224:
---

 Summary: Let 
TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective of 
the default settings
 Key: YARN-2224
 URL: https://issues.apache.org/jira/browse/YARN-2224
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot


If the default setting DEFAULT_NM_VMEM_CHECK_ENABLED is set to false, the test will fail. Make the test not rely on the default settings; instead, let it turn the setting on explicitly and verify that the memory check actually happens.
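
A hedged sketch of what turning the setting on explicitly could look like in the test configuration, assuming the standard YarnConfiguration keys:

{code}
// Sketch only: enable the memory checks explicitly so the test outcome does not
// depend on DEFAULT_NM_VMEM_CHECK_ENABLED.
YarnConfiguration conf = new YarnConfiguration();
conf.setBoolean(YarnConfiguration.NM_VMEM_CHECK_ENABLED, true);
conf.setBoolean(YarnConfiguration.NM_PMEM_CHECK_ENABLED, true);
{code}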



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2192) TestRMHA fails when run with a mix of Schedulers

2014-06-22 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2192:
---

 Summary: TestRMHA fails when run with a mix of Schedulers
 Key: YARN-2192
 URL: https://issues.apache.org/jira/browse/YARN-2192
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot


Some TestRMHA tests assume CapacityScheduler. If the tests are run with a mix of schedulers, some of them fail because the metrics system objects are shared across tests, as shown below.

{code}
Error Message

Metrics source QueueMetrics,q0=root already exists!
Stacktrace

org.apache.hadoop.metrics2.MetricsException: Metrics source QueueMetrics,q0=root already exists!
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:126)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:107)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:96)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1281)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:427)
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2119) Fix the DEFAULT_PROXY_ADDRESS used for getBindAddress to fix 1590

2014-06-02 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2119:
---

 Summary: Fix the DEFAULT_PROXY_ADDRESS used for getBindAddress to 
fix 1590
 Key: YARN-2119
 URL: https://issues.apache.org/jira/browse/YARN-2119
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot


The fix for [YARN-1590|https://issues.apache.org/jira/browse/YARN-1590] introduced a method to get the web proxy bind address with an incorrect default port. Because the only user of the method ignores the port, it's not breaking anything yet. Fixing it in case someone else uses this in the future.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler

2014-05-28 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2109:
---

 Summary: TestRM fails some tests when some tests run with 
CapacityScheduler and some with FairScheduler
 Key: YARN-2109
 URL: https://issues.apache.org/jira/browse/YARN-2109
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Anubhav Dhoot


testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set it to CapacityScheduler. But if the default scheduler is set to FairScheduler, then the rest of the tests that execute after this one will fail with invalid cast exceptions when getting queue metrics. This is based on test execution order, as only the tests that execute after this test will fail. This is because the queue metrics will be initialized by this test to QueueMetrics and shared by the subsequent tests.

We can explicitly clear the metrics at the end of this test to fix this.
For example

java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot be cast to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:90)
at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:85)
at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:81)
at org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232)
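
A hedged sketch of the cleanup being suggested, assuming the test-only QueueMetrics.clearQueueMetrics() helper and the DefaultMetricsSystem shutdown hook; the actual fix may clear the metrics differently:

{code}
// Sketch only: reset shared metrics state after each test so a QueueMetrics
// instance registered for one scheduler does not leak into the next test.
@After
public void clearMetrics() {
  QueueMetrics.clearQueueMetrics();
  DefaultMetricsSystem.shutdown();
}
{code}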




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler

2014-05-28 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2110:
---

 Summary: TestAMRestart#testAMRestartWithExistingContainers assumes 
CapacityScheduler
 Key: YARN-2110
 URL: https://issues.apache.org/jira/browse/YARN-2110
 Project: Hadoop YARN
  Issue Type: Bug
 Environment: The TestAMRestart#testAMRestartWithExistingContainers test does a cast to CapacityScheduler in a couple of places:
{code}
((CapacityScheduler) rm1.getResourceScheduler())
{code}

If run with FairScheduler as the default scheduler, the test throws {code}java.lang.ClassCastException{code}.
Reporter: Anubhav Dhoot






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2096) testQueueMetricsOnRMRestart has race condition

2014-05-23 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2096:
---

 Summary: testQueueMetricsOnRMRestart has race condition
 Key: YARN-2096
 URL: https://issues.apache.org/jira/browse/YARN-2096
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot


org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart fails randomly because of a race condition.
The test validates that metrics are incremented, but does not wait for all transitions to finish before checking the values.
It also resets metrics after kicking off recovery of the second RM. The metrics that need to be incremented race with this reset, causing the test to fail randomly.
We need to wait for the right transitions.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2089) FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations

2014-05-21 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2089:
---

 Summary: FairScheduler: QueuePlacementPolicy and 
QueuePlacementRule are missing audience annotations
 Key: YARN-2089
 URL: https://issues.apache.org/jira/browse/YARN-2089
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.4.0
Reporter: Anubhav Dhoot


We should mark QueuePlacementPolicy and QueuePlacementRule with the audience annotations @Private and @Unstable.
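
For illustration, the kind of annotation being proposed (class body elided):

{code}
import org.apache.hadoop.classification.InterfaceAudience.Private;
import org.apache.hadoop.classification.InterfaceStability.Unstable;

@Private
@Unstable
public abstract class QueuePlacementRule {
  // ... existing rule implementation unchanged ...
}
{code}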



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1923) Make FairScheduler resource ratio calculations terminate faster

2014-04-10 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-1923:
---

 Summary: Make FairScheduler resource ratio calculations terminate 
faster
 Key: YARN-1923
 URL: https://issues.apache.org/jira/browse/YARN-1923
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


In FairScheduler, computing resource share ratios continues until all iterations are complete, even when we have a perfect match between the resource shares and the total resources. This is because the binary search checks only for less-than or greater-than, not equality. Add an early termination condition for the equal case.
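
A hedged sketch of the early-exit condition, reduced to a plain binary search on a single ratio (the real computeShares logic does more than this):

{code}
// Sketch only: stop the binary search as soon as the midpoint is an exact fit,
// instead of always running the fixed number of iterations.
double findShareRatio(double lo, double hi, int iterations, double totalResource, double demand) {
  for (int i = 0; i < iterations; i++) {
    double mid = (lo + hi) / 2.0;
    double used = mid * demand;            // resources consumed at this ratio
    if (used == totalResource) {
      return mid;                          // perfect match: terminate early
    } else if (used < totalResource) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return (lo + hi) / 2.0;
}
{code}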



--
This message was sent by Atlassian JIRA
(v6.2#6252)