[jira] [Created] (YARN-4235) FairScheduler PrimaryGroup does not handle empty groups returned for a user
Anubhav Dhoot created YARN-4235: --- Summary: FairScheduler PrimaryGroup does not handle empty groups returned for a user Key: YARN-4235 URL: https://issues.apache.org/jira/browse/YARN-4235 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot We see a crash if an empty group list is returned for a user: the PrimaryGroup placement rule throws an IndexOutOfBoundsException, which brings down the RM as shown below {noformat} 2015-09-22 16:51:52,780 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ADDED to the scheduler java.lang.IndexOutOfBoundsException: Index: 0 at java.util.Collections$EmptyList.get(Collections.java:3212) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule$PrimaryGroup.getQueueForApp(QueuePlacementRule.java:149) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule.assignAppToQueue(QueuePlacementRule.java:74) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementPolicy.assignAppToQueue(QueuePlacementPolicy.java:167) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:689) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:595) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:111) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684) at java.lang.Thread.run(Thread.java:745) 2015-09-22 16:51:52,797 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
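A minimal sketch of the kind of guard this report calls for. The method shape follows the PrimaryGroup.getQueueForApp frame in the trace, but the group-lookup call, the variable names, and the "empty string means try the next rule" contract are assumptions for illustration:
{code}
// Hypothetical guard in QueuePlacementRule.PrimaryGroup.getQueueForApp.
// An empty group list must never reach groupList.get(0), which is the
// call that throws IndexOutOfBoundsException in the trace above.
List<String> groupList = groups.getGroups(user);
if (groupList == null || groupList.isEmpty()) {
  return "";  // assumed contract: no match here, fall through to next rule
}
return "root." + groupList.get(0);
{code}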
[jira] [Created] (YARN-4204) ConcurrentModificationException in FairSchedulerQueueInfo
Anubhav Dhoot created YARN-4204: --- Summary: ConcurrentModificationException in FairSchedulerQueueInfo Key: YARN-4204 URL: https://issues.apache.org/jira/browse/YARN-4204 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Saw this exception {noformat} java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) at java.util.ArrayList$Itr.next(ArrayList.java:851) at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.<init>(FairSchedulerQueueInfo.java:100) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.<init>(FairSchedulerInfo.java:46) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:229) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:589) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:552) at org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:84) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1279) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
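The failure class is easy to reproduce outside YARN. A self-contained, JDK-only demo (not the RM code) of structurally modifying a list while iterating it:
{code}
import java.util.ArrayList;
import java.util.List;

public class ComodificationDemo {
  public static void main(String[] args) {
    List<String> childQueues = new ArrayList<>();
    childQueues.add("root.a");
    childQueues.add("root.b");
    // Adding to the list while a for-each iterator is live triggers
    // ConcurrentModificationException on the next iterator step, the same
    // ArrayList$Itr.checkForComodification frame seen in the trace above.
    for (String q : childQueues) {
      childQueues.add(q + ".child");
    }
  }
}
{code}
In FairSchedulerQueueInfo the modification comes from the scheduler thread rather than the iterating thread, so the usual fix is to snapshot the child-queue collection under the scheduler lock before building the web DAO.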
[jira] [Created] (YARN-4185) Retry interval delay for NM client can be improved from the fixed static retry
Anubhav Dhoot created YARN-4185: --- Summary: Retry interval delay for NM client can be improved from the fixed static retry Key: YARN-4185 URL: https://issues.apache.org/jira/browse/YARN-4185 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Instead of having a fixed retry interval that starts off very high and stays there, we are better off using an exponential backoff with the same fixed maximum limit. Today the retry interval is fixed at 10 seconds, which can be unnecessarily high, especially when NMs can complete a rolling restart within a second. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
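A minimal backoff sketch of the proposal, not the actual NM client code; the initial delay, the cap, and the connectToNM() helper are illustrative assumptions:
{code}
// Fragment: exponential backoff capped at today's fixed 10s interval.
long delayMs = 100;                  // start low: NMs may be back in ~1s
final long maxDelayMs = 10_000;      // cap at the current fixed interval
while (true) {
  try {
    return connectToNM();            // hypothetical connection attempt
  } catch (IOException e) {
    Thread.sleep(delayMs);           // wait before retrying
    delayMs = Math.min(delayMs * 2, maxDelayMs);  // double, up to the cap
  }
}
{code}
Hadoop's org.apache.hadoop.io.retry.RetryPolicies also ships an exponentialBackoffRetry policy that could plausibly serve here instead of hand-rolled loops.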
[jira] [Created] (YARN-4184) Remove update reservation state API from state store as it's not used by ReservationSystem
Anubhav Dhoot created YARN-4184: --- Summary: Remove update reservation state API from state store as it's not used by ReservationSystem Key: YARN-4184 URL: https://issues.apache.org/jira/browse/YARN-4184 Project: Hadoop YARN Issue Type: Sub-task Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot The ReservationSystem uses remove/add for updates, so the update API in the state store is not needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4180) AMLauncher does not retry on failures when talking to NM
Anubhav Dhoot created YARN-4180: --- Summary: AMLauncher does not retry on failures when talking to NM Key: YARN-4180 URL: https://issues.apache.org/jira/browse/YARN-4180 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot We see issues with the RM trying to launch a container while an NM is restarting, producing exceptions like NMNotReadyException. While YARN-3842 added retries for other clients of the NM (mainly AMs), they are not used by the AMLauncher in the RM, so these intermittent errors cause job failures. This can manifest during a rolling restart of NMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4150) Failure in TestNMClient because nodereports were not available
Anubhav Dhoot created YARN-4150: --- Summary: Failure in TestNMClient because nodereports were not available Key: YARN-4150 URL: https://issues.apache.org/jira/browse/YARN-4150 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Saw a failure in a test run -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4144) Add NM that causes LaunchFailedTransition to blacklist
Anubhav Dhoot created YARN-4144: --- Summary: Add NM that causes LaunchFailedTransition to blacklist Key: YARN-4144 URL: https://issues.apache.org/jira/browse/YARN-4144 Project: Hadoop YARN Issue Type: Improvement Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot During the discussion of YARN-2005 it was noted that we need to add more cases where blacklisting can occur. This tracks making any launch failure via LaunchFailedTransition also contribute to blacklisting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4145) Make RMHATestBase abstract so it's not run when running all tests under that namespace
Anubhav Dhoot created YARN-4145: --- Summary: Make RMHATestBase abstract so it's not run when running all tests under that namespace Key: YARN-4145 URL: https://issues.apache.org/jira/browse/YARN-4145 Project: Hadoop YARN Issue Type: Improvement Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Minor Trivial patch to make it abstract. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
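The whole change, in sketch form. JUnit does not instantiate abstract classes, so marking the shared base abstract keeps the runner from executing the base class itself when the whole package is run:
{code}
// Before: public class RMHATestBase { ... }  // picked up as a test class
// After: only the concrete subclasses are run.
public abstract class RMHATestBase {
  // shared RM HA setup and helper methods for the concrete tests
}
{code}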
[jira] [Created] (YARN-4143) Optimize the check for AMContainer allocation needed by blacklisting and ContainerType
Anubhav Dhoot created YARN-4143: --- Summary: Optimize the check for AMContainer allocation needed by blacklisting and ContainerType Key: YARN-4143 URL: https://issues.apache.org/jira/browse/YARN-4143 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot In YARN-2005, checks are made to determine whether an allocation is for an AM container. This happens on every allocate call and should be optimized away, since the answer changes only once per SchedulerApplicationAttempt. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
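One way to realize the optimization, sketched under assumed names (the real SchedulerApplicationAttempt fields and helpers may differ): latch the answer the first time the AM container is seen, since it can never flip back.
{code}
// Assumed field on the attempt; set once, then every later allocate()
// call answers from the cached flag instead of re-deriving the check.
private boolean amContainerAllocated = false;

boolean isWaitingForAMContainer() {
  if (!amContainerAllocated && !getLiveContainers().isEmpty()) {
    amContainerAllocated = true;  // the transition happens at most once
  }
  return !amContainerAllocated;
}
{code}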
[jira] [Created] (YARN-4115) Reduce log level of ContainerManagementProtocolProxy to DEBUG
Anubhav Dhoot created YARN-4115: --- Summary: Reduce log level of ContainerManagementProtocolProxy to DEBUG Key: YARN-4115 URL: https://issues.apache.org/jira/browse/YARN-4115 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Minor We see log spam like the following: {noformat} Aug 28, 1:57:52.441 PM INFO org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy Opening proxy : :8041 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
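The proposed demotion is the standard logging pattern below; the actual call site is in ContainerManagementProtocolProxy, but the variable name here is an assumption:
{code}
// Demote from INFO to DEBUG, and guard the concatenation so the message
// string is only built when debug logging is actually enabled.
if (LOG.isDebugEnabled()) {
  LOG.debug("Opening proxy : " + cmAddr);
}
{code}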
[jira] [Created] (YARN-4077) FairScheduler Reservation should wait for most relaxed scheduling delay permitted before issuing reservation
Anubhav Dhoot created YARN-4077: --- Summary: FairScheduler Reservation should wait for most relaxed scheduling delay permitted before issuing reservation Key: YARN-4077 URL: https://issues.apache.org/jira/browse/YARN-4077 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Today, if an allocation has a node-local request that allows for relaxation, we do not wait for the relaxation delay before issuing the reservation. This can be too aggressive. Instead we should let the relaxation scheduling delays expire before choosing to reserve a node for the container. This allows the request to be satisfied on a different node instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4076) FairScheduler does not allow AM to choose which containers to preempt
Anubhav Dhoot created YARN-4076: --- Summary: FairScheduler does not allow AM to choose which containers to preempt Key: YARN-4076 URL: https://issues.apache.org/jira/browse/YARN-4076 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot CapacityScheduler allows the AM to choose which containers will be preempted. See the comment about corresponding pending work for FairScheduler: https://issues.apache.org/jira/browse/YARN-568?focusedCommentId=13649126&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13649126 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4046) NM container recovery is broken on some linux distro because of syntax of signal
Anubhav Dhoot created YARN-4046: --- Summary: NM container recovery is broken on some linux distro because of syntax of signal Key: YARN-4046 URL: https://issues.apache.org/jira/browse/YARN-4046 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical On a Debian machine we have seen NodeManager recovery of containers fail because the signal syntax used for the process group may not work. We see errors when checking whether the process is alive during container recovery, which causes the container to be declared LOST (exit code 154) on a NodeManager restart. The application then fails with the error {noformat} Application application_1439244348718_0001 failed 1 times due to Attempt recovered after RM restartAM Container for appattempt_1439244348718_0001_01 exited with exitCode: 154 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4032) Corrupted state from a previous version can still cause RM to fail with NPE due to same reasons as YARN-2834
Anubhav Dhoot created YARN-4032: --- Summary: Corrupted state from a previous version can still cause RM to fail with NPE due to same reasons as YARN-2834 Key: YARN-4032 URL: https://issues.apache.org/jira/browse/YARN-4032 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot YARN-2834 ensures that in 2.6.0 there will not be any inconsistent state. But if someone upgrades from a previous version, the state can still be inconsistent, and the RM will then still fail with an NPE after the upgrade to 2.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4030) Make NodeManager cgroup usage for container easier to use when it's running inside a cgroup
Anubhav Dhoot created YARN-4030: --- Summary: Make NodeManager cgroup usage for container easier to use when it's running inside a cgroup Key: YARN-4030 URL: https://issues.apache.org/jira/browse/YARN-4030 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Today the NodeManager uses the cgroup prefix configured by yarn.nodemanager.linux-container-executor.cgroups.hierarchy (default value /hadoop-yarn) directly under the controller path, e.g. /sys/fs/cgroup/cpu/hadoop-yarn. If NodeManagers run inside Docker containers on a host, each would typically be separated by a cgroup under the controller path, say /sys/fs/cgroup/cpu/docker/dockerid1/nmcgroup for NM1 and /sys/fs/cgroup/cpu/docker/dockerid2/nmcgroup for NM2. In this case the correct behavior is to use the Docker cgroup paths, i.e. /sys/fs/cgroup/cpu/docker/dockerid1/hadoop-yarn for NM1 and /sys/fs/cgroup/cpu/docker/dockerid2/hadoop-yarn for NM2. But the default behavior makes both NMs try to use /sys/fs/cgroup/cpu/hadoop-yarn, which is incorrect and will usually fail given the permissions setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
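A hedged sketch of one way an NM could discover its own cgroup and nest the YARN hierarchy under it: parse /proc/self/cgroup for the controller's relative path. This is a pure-JDK illustration, not the committed fix; the method name and fallback behavior are assumptions.
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Returns e.g. "/docker/dockerid1" for controller "cpu", so the NM can
// place its hierarchy at <mount>/cpu/docker/dockerid1/hadoop-yarn instead
// of the shared (and likely permission-restricted) /cpu/hadoop-yarn.
static String selfCgroupFor(String controller) throws IOException {
  for (String line : Files.readAllLines(Paths.get("/proc/self/cgroup"))) {
    String[] parts = line.split(":", 3);       // format: id:controllers:path
    if (parts.length == 3) {
      for (String c : parts[1].split(",")) {
        if (c.equals(controller)) {
          return parts[2];                     // path relative to the mount
        }
      }
    }
  }
  return "/";                                  // assumed fallback: not nested
}
{code}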
[jira] [Created] (YARN-4021) RuntimeException/YarnRuntimeException sent over to the client can cause client to assume a local fatal failure
Anubhav Dhoot created YARN-4021: --- Summary: RuntimeException/YarnRuntimeException sent over to the client can cause client to assume a local fatal failure Key: YARN-4021 URL: https://issues.apache.org/jira/browse/YARN-4021 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Currently RuntimeException and its derived types, such as YarnRuntimeException, are serialized over to the client and rethrown there after YARN-731. This can cause issues like MAPREDUCE-6439, where we assume a local fatal exception has happened. Instead we should have a way to distinguish a local RuntimeException from a remote one to avoid these issues. We need to go over all the current client-side code that expects a remote RuntimeException in order to make it work with this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
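One possible shape for the distinction (an assumption, not the committed design): rethrow deserialized server-side RuntimeExceptions under a dedicated marker type so client code can tell remote faults from genuine local failures.
{code}
// Marker type for RuntimeExceptions that originated on the remote side.
public class RemoteRuntimeException extends RuntimeException {
  public RemoteRuntimeException(String remoteClass, String message) {
    super(remoteClass + ": " + message);
  }
}

// Client-side handling can then separate the two cases:
//   catch (RemoteRuntimeException e) { /* remote fault: fail the call */ }
//   catch (RuntimeException e)       { /* local fatal error, as before */ }
{code}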
[jira] [Created] (YARN-3996) YARN-789 (Support for zero capabilities in fairscheduler) is broken after YARN-3305
Anubhav Dhoot created YARN-3996: --- Summary: YARN-789 (Support for zero capabilities in fairscheduler) is broken after YARN-3305 Key: YARN-3996 URL: https://issues.apache.org/jira/browse/YARN-3996 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical RMAppManager#validateAndCreateResourceRequest calls into normalizeRequest with minimumResource as the incrementResource. This causes normalize to return zero if the minimum is set to zero, as permitted by YARN-789. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
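A worked illustration of why passing the minimum as the increment breaks zero capabilities. This is a simplified model of round-up normalization, not the exact Hadoop code; the zero-step behavior below is modeled on what the report describes.
{code}
// normalize() conceptually rounds a request up to a multiple of the
// increment; using the minimum (0) as the increment collapses the result.
static long roundUp(long value, long step) {
  return step == 0 ? 0 : ((value + step - 1) / step) * step;
}
// roundUp(1000, 512) == 1024  (normal rounding to the next increment)
// roundUp(1536, 512) == 1536  (already a multiple, unchanged)
// roundUp(1000, 0)   == 0     (the reported breakage when minimum == 0)
{code}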
[jira] [Created] (YARN-3985) Make ReservationSystem persist state using RMStateStore reservation APIs
Anubhav Dhoot created YARN-3985: --- Summary: Make ReservationSystem persist state using RMStateStore reservation APIs Key: YARN-3985 URL: https://issues.apache.org/jira/browse/YARN-3985 Project: Hadoop YARN Issue Type: Sub-task Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot YARN-3736 adds the RMStateStore APIs to store and load reservation state. This JIRA adds the actual storing of state from the ReservationSystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3957) FairScheduler NPE In FairSchedulerQueueInfo causing scheduler page to return 500
Anubhav Dhoot created YARN-3957: --- Summary: FairScheduler NPE In FairSchedulerQueueInfo causing scheduler page to return 500 Key: YARN-3957 URL: https://issues.apache.org/jira/browse/YARN-3957 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot There is an NPE causing the scheduler web page at http://localhost:23188/cluster/scheduler to return a 500. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3961) Expose queue container information (pending, running, reserved) in UI and yarn top
Anubhav Dhoot created YARN-3961: --- Summary: Expose queue container information (pending, running, reserved) in UI and yarn top Key: YARN-3961 URL: https://issues.apache.org/jira/browse/YARN-3961 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler, webapp Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot It would be nice to expose container (allocated, pending, reserved) information in the UI and in the yarn top tool. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3920) FairScheduler Reserving a node for a container should be configurable to allow it to be used only for large containers
Anubhav Dhoot created YARN-3920: --- Summary: FairScheduler Reserving a node for a container should be configurable to allow it to be used only for large containers Key: YARN-3920 URL: https://issues.apache.org/jira/browse/YARN-3920 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Reserving a node for a container was designed to prevent large containers from being starved by small requests that keep landing on a node. Today we allow this even for small container requests. This has a huge impact on scheduling, since we block other scheduling requests until that reservation is fulfilled. We should make this configurable so its impact can be minimized by limiting it to large container requests, as originally intended. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3900) Protobuf layout of yarn_security_token causes errors in other protos that include it
Anubhav Dhoot created YARN-3900: --- Summary: Protobuf layout of yarn_security_token causes errors in other protos that include it Key: YARN-3900 URL: https://issues.apache.org/jira/browse/YARN-3900 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Because of the server subdirectory used in {{hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto}}, there are errors in other protos that include it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3890) FairScheduler should show the scheduler health metrics similar to ones added in CapacityScheduler
Anubhav Dhoot created YARN-3890: --- Summary: FairScheduler should show the scheduler health metrics similar to ones added in CapacityScheduler Key: YARN-3890 URL: https://issues.apache.org/jira/browse/YARN-3890 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot We should add the information displayed in YARN-3293 to FairScheduler as well, possibly sharing the implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3800) Simplify in-memory state for ReservationAllocation
Anubhav Dhoot created YARN-3800: --- Summary: Simplify in-memory state for ReservationAllocation Key: YARN-3800 URL: https://issues.apache.org/jira/browse/YARN-3800 Project: Hadoop YARN Issue Type: Sub-task Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Instead of storing the ReservationRequest we store the Resource for allocations, as that's the only thing we need; ultimately we convert everything to Resources anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3675) FairScheduler: RM quits when node removal races with continuous scheduling on the same node
Anubhav Dhoot created YARN-3675: --- Summary: FairScheduler: RM quits when node removal races with continuous scheduling on the same node Key: YARN-3675 URL: https://issues.apache.org/jira/browse/YARN-3675 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot With continuous scheduling, scheduling can be done on a node that has just been removed, causing errors like those below. {noformat} 12:28:53.782 AM FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager Error in handling event type APP_ATTEMPT_REMOVED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.unreserve(FSAppAttempt.java:469) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:815) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:763) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:111) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684) at java.lang.Thread.run(Thread.java:745) 12:28:53.783 AM INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager Exiting, bbye.. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
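A hedged sketch of the defensive direction (field and variable names assumed): re-validate the node under the scheduler lock before a continuous-scheduling pass acts on it, so a concurrently removed node cannot later surface as a null during unreserve.
{code}
// Inside the continuous-scheduling pass, while holding the scheduler lock:
FSSchedulerNode node = nodes.get(nodeId);  // 'nodes' is the live node map
if (node == null) {
  // The node was removed between being snapshotted for this pass and now;
  // skip it rather than scheduling (and later unreserving) against it.
  return;
}
{code}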
[jira] [Resolved] (YARN-3392) Change NodeManager metrics to not populate resource usage metrics if they are unavailable
[ https://issues.apache.org/jira/browse/YARN-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot resolved YARN-3392. - Resolution: Duplicate Change NodeManager metrics to not populate resource usage metrics if they are unavailable -- Key: YARN-3392 URL: https://issues.apache.org/jira/browse/YARN-3392 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3392.prelim.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3533) Test: Fix launchAM in MockRM to wait for attempt to be scheduled
Anubhav Dhoot created YARN-3533: --- Summary: Test: Fix launchAM in MockRM to wait for attempt to be scheduled Key: YARN-3533 URL: https://issues.apache.org/jira/browse/YARN-3533 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.6.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot MockRM#launchAM fails in many test runs because it does not wait for the app attempt to be scheduled before the NM update is sent, as noted in [recent builds|https://issues.apache.org/jira/browse/YARN-3387?focusedCommentId=14507255&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14507255] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3424) Reduce log for ContainersMonitorImpl resource monitoring from info to debug
Anubhav Dhoot created YARN-3424: --- Summary: Reduce log for ContainersMonitorImpl resource monitoring from info to debug Key: YARN-3424 URL: https://issues.apache.org/jira/browse/YARN-3424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Today we log the memory usage of each process at INFO level, which spams the log with hundreds of log lines. Proposing to change this to DEBUG level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3392) Change NodeManager metrics to not populate resource usage metrics if they are unavailable
Anubhav Dhoot created YARN-3392: --- Summary: Change NodeManager metrics to not populate resource usage metrics if they are unavailable Key: YARN-3392 URL: https://issues.apache.org/jira/browse/YARN-3392 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3351) AppMaster tracking URL is broken in HA
Anubhav Dhoot created YARN-3351: --- Summary: AppMaster tracking URL is broken in HA Key: YARN-3351 URL: https://issues.apache.org/jira/browse/YARN-3351 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot After YARN-2713, the AppMaster link is broken in HA. The log and full stack trace are shown below {noformat} 2015-02-05 20:47:43,478 WARN org.mortbay.log: /proxy/application_1423182188062_0002/: java.net.BindException: Cannot assign requested address {noformat} {noformat} java.net.BindException: Cannot assign requested address at java.net.PlainSocketImpl.socketBind(Native Method) at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376) at java.net.Socket.bind(Socket.java:631) at java.net.Socket.<init>(Socket.java:423) at java.net.Socket.<init>(Socket.java:280) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:188) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:345) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3259) FairScheduler: Update to fairShare could be triggered early on node events instead of waiting for update interval
Anubhav Dhoot created YARN-3259: --- Summary: FairScheduler: Update to fairShare could be triggered early on node events instead of waiting for update interval Key: YARN-3259 URL: https://issues.apache.org/jira/browse/YARN-3259 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Instead of waiting for the update interval unconditionally, we can trigger early updates on important events, e.g. node join and leave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3258) FairScheduler: Need to add more logging to investigate allocations
Anubhav Dhoot created YARN-3258: --- Summary: FairScheduler: Need to add more logging to investigate allocations Key: YARN-3258 URL: https://issues.apache.org/jira/browse/YARN-3258 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Minor It's hard to investigate allocation failures without any logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3257) FairScheduler: MaxAm may be set too low preventing apps from starting
Anubhav Dhoot created YARN-3257: --- Summary: FairScheduler: MaxAm may be set too low preventing apps from starting Key: YARN-3257 URL: https://issues.apache.org/jira/browse/YARN-3257 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot In YARN-2637, CapacityScheduler#LeafQueue does not enforce the max AM share if the limit prevents the first application from starting. This would be good to add to FSLeafQueue as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3256) TestClientToAMToken#testClientTokenRace is not running against all Schedulers even when using ParameterizedSchedulerTestBase
Anubhav Dhoot created YARN-3256: --- Summary: TestClientToAMToken#testClientTokenRace is not running against all Schedulers even when using ParameterizedSchedulerTestBase Key: YARN-3256 URL: https://issues.apache.org/jira/browse/YARN-3256 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot The test testClientTokenRace was not using the base class's conf, causing it to run twice against the same scheduler configured by default. All tests deriving from ParameterizedSchedulerTestBase should use the conf created in the base class instead of creating a new one inside the test and hiding the member. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
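The bug pattern, in sketch form; the class body is abbreviated and getConf() stands in for however the base class actually exposes its parameterized configuration:
{code}
public class TestClientToAMTokens extends ParameterizedSchedulerTestBase {
  @Test
  public void testClientTokenRace() throws Exception {
    // Wrong: ignores the scheduler chosen by the test parameter.
    // Configuration conf = new YarnConfiguration();

    // Right: reuse the conf prepared by the parameterized base class.
    Configuration conf = getConf();
    // ... rest of the test uses 'conf' ...
  }
}
{code}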
[jira] [Created] (YARN-3229) Incorrect processing of container as LOST on Interruption during NM shutdown
Anubhav Dhoot created YARN-3229: --- Summary: Incorrect processing of container as LOST on Interruption during NM shutdown Key: YARN-3229 URL: https://issues.apache.org/jira/browse/YARN-3229 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot YARN-2846 fixed the issue of incorrectly writing to the state store that the process is LOST. But even after that we still process the ContainerExitEvent. If notInterrupted is false in RecoveredContainerLaunch#call, we should skip the following {noformat} if (retCode != 0) { LOG.warn("Recovered container exited with a non-zero exit code " + retCode); this.dispatcher.getEventHandler().handle(new ContainerExitEvent( containerId, ContainerEventType.CONTAINER_EXITED_WITH_FAILURE, retCode, "Container exited with a non-zero exit code " + retCode)); return retCode; } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3209) RM and NM state should be added to the list of Hadoop Compatibility File list
Anubhav Dhoot created YARN-3209: --- Summary: RM and NM state should be added to the list of Hadoop Compatibility File list Key: YARN-3209 URL: https://issues.apache.org/jira/browse/YARN-3209 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Anubhav Dhoot The Hadoop Compatibility guide lists different internal files used by different components at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#System-internal_file_formats We should add NodeManager recovery state and ResourceManager ZK state to the list -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3184) Inefficient iteration of map
[ https://issues.apache.org/jira/browse/YARN-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot resolved YARN-3184. - Resolution: Duplicate Inefficient iteration of map Key: YARN-3184 URL: https://issues.apache.org/jira/browse/YARN-3184 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Minor Attachments: YARN-3184.001.patch Iteration of keySet and then lookup of value is not as efficient as iterating the entrySet -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3184) Inefficient iteration of map
Anubhav Dhoot created YARN-3184: --- Summary: Inefficient iteration of map Key: YARN-3184 URL: https://issues.apache.org/jira/browse/YARN-3184 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Priority: Minor Iteration of keySet and then lookup of value is not as efficient as iterating the entrySet -- This message was sent by Atlassian JIRA (v6.3.4#6332)
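Concretely, the inefficiency and its fix; the map contents and the process() helper are illustrative:
{code}
// Inefficient: every iteration pays an extra hash lookup for the value.
for (String key : map.keySet()) {
  process(key, map.get(key));
}

// Better: iterate the entry set and read key and value off each entry.
for (Map.Entry<String, Resource> e : map.entrySet()) {
  process(e.getKey(), e.getValue());
}
{code}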
[jira] [Created] (YARN-3138) TestFairScheduler#testContinuousScheduling fails intermittently on trunk
Anubhav Dhoot created YARN-3138: --- Summary: TestFairScheduler#testContinuousScheduling fails intermittently on trunk Key: YARN-3138 URL: https://issues.apache.org/jira/browse/YARN-3138 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot This test failed randomly in a precheckin and passed on rerun https://builds.apache.org/job/PreCommit-YARN-Build/6497//testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3138) TestFairScheduler#testContinuousScheduling fails intermittently on trunk
[ https://issues.apache.org/jira/browse/YARN-3138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot resolved YARN-3138. - Resolution: Duplicate TestFairScheduler#testContinuousScheduling fails intermittently on trunk Key: YARN-3138 URL: https://issues.apache.org/jira/browse/YARN-3138 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot This test failed randomly in a precheckin and passed on rerun https://builds.apache.org/job/PreCommit-YARN-Build/6497//testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3122) Metrics for container's actual CPU usage
Anubhav Dhoot created YARN-3122: --- Summary: Metrics for container's actual CPU usage Key: YARN-3122 URL: https://issues.apache.org/jira/browse/YARN-3122 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Anubhav Dhoot Assignee: Karthik Kambatla Fix For: 2.7.0 It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track CPU usage. YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3121) FairScheduler preemption metrics
Anubhav Dhoot created YARN-3121: --- Summary: FairScheduler preemption metrics Key: YARN-3121 URL: https://issues.apache.org/jira/browse/YARN-3121 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Anubhav Dhoot Add FSQueueMetrics for preemption-related information. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3101) FairScheduler#fitInMaxShare was added to validate reservations but it does not consider it
Anubhav Dhoot created YARN-3101: --- Summary: FairScheduler#fitInMaxShare was added to validate reservations but it does not consider it Key: YARN-3101 URL: https://issues.apache.org/jira/browse/YARN-3101 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot YARN-2811 added fitInMaxShare to validate reservations on a queue, but did not count the reservation itself in its calculations. It also had the condition reversed, so the test still passed because the two bugs cancelled each other out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3082) Non thread safe access to systemCredentials in NodeHeartbeatResponse processing
Anubhav Dhoot created YARN-3082: --- Summary: Non thread safe access to systemCredentials in NodeHeartbeatResponse processing Key: YARN-3082 URL: https://issues.apache.org/jira/browse/YARN-3082 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot When system credentials are used via the feature added in YARN-2704, the proto conversion code throws an exception when converting the ByteBuffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
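For background on why a shared ByteBuffer is race-prone (a JDK-level illustration, not the YARN-2704 code path): position and limit are mutable cursor state on the buffer instance, so concurrent readers of the same instance corrupt each other, while duplicate() gives each reader independent cursors over the same bytes.
{code}
ByteBuffer shared = ByteBuffer.wrap(credentialBytes);

// Unsafe across threads: every get() advances the one shared position.
// byte b = shared.get();

// Safe: each consumer duplicates first; the contents are shared, the
// position/limit cursors are not.
ByteBuffer forThisReader = shared.duplicate();
byte first = forThisReader.get();   // does not disturb 'shared'
{code}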
[jira] [Created] (YARN-3027) Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation
Anubhav Dhoot created YARN-3027: --- Summary: Scheduler should use totalAvailable resource from node instead of availableResource for maxAllocation Key: YARN-3027 URL: https://issues.apache.org/jira/browse/YARN-3027 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot YARN-2604 added support for updating the maximum allocation resource size based on nodes. But it incorrectly uses the available resource instead of the maximum resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3022) Expose Container resource information from NodeManager for monitoring
Anubhav Dhoot created YARN-3022: --- Summary: Expose Container resource information from NodeManager for monitoring Key: YARN-3022 URL: https://issues.apache.org/jira/browse/YARN-3022 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Along with exposing the resource consumption of each container (such as YARN-2141), it's worth exposing the actual resource limits associated with them to get better insight into YARN allocation and consumption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2574) Add support for FairScheduler to the ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot resolved YARN-2574. - Resolution: Fixed Fix Version/s: 2.7.0 Add support for FairScheduler to the ReservationSystem -- Key: YARN-2574 URL: https://issues.apache.org/jira/browse/YARN-2574 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Subru Krishnan Assignee: Anubhav Dhoot Fix For: 2.7.0 YARN-1051 introduces the ReservationSystem and the current implementation is based on CapacityScheduler. This JIRA proposes adding support for FairScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3008) FairScheduler: Use lock for queuemanager instead of synchronized on FairScheduler
Anubhav Dhoot created YARN-3008: --- Summary: FairScheduler: Use lock for QueueManager instead of synchronized on FairScheduler Key: YARN-3008 URL: https://issues.apache.org/jira/browse/YARN-3008 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Anubhav Dhoot Instead of a big monolithic lock on FairScheduler, we can have an explicit lock on the QueueManager and revisit all synchronized methods in FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
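A hedged sketch of the finer-grained locking (field and method names assumed): guard the QueueManager's queue map with its own read/write lock instead of synchronizing every FairScheduler method.
{code}
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

private final ReadWriteLock queueLock = new ReentrantReadWriteLock();

FSQueue getQueue(String name) {
  queueLock.readLock().lock();   // many concurrent readers are fine
  try {
    return queues.get(name);
  } finally {
    queueLock.readLock().unlock();
  }
}
// Mutations (queue add/remove) take queueLock.writeLock() instead.
{code}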
[jira] [Created] (YARN-2998) Abstract out scheduler-independent PlanFollower components into AbstractSchedulerPlanFollower
Anubhav Dhoot created YARN-2998: --- Summary: Abstract out scheduler-independent PlanFollower components into AbstractSchedulerPlanFollower Key: YARN-2998 URL: https://issues.apache.org/jira/browse/YARN-2998 Project: Hadoop YARN Issue Type: Sub-task Reporter: Anubhav Dhoot -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2982) Use ReservationQueueConfiguration in CapacityScheduler
Anubhav Dhoot created YARN-2982: --- Summary: Use ReservationQueueConfiguration in CapacityScheduler Key: YARN-2982 URL: https://issues.apache.org/jira/browse/YARN-2982 Project: Hadoop YARN Issue Type: Sub-task Reporter: Anubhav Dhoot ReservationQueueConfiguration is common to reservations irrespective of the scheduler. It would be good to have CapacityScheduler support this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2881) Implement PlanFollower for FairScheduler
Anubhav Dhoot created YARN-2881: --- Summary: Implement PlanFollower for FairScheduler Key: YARN-2881 URL: https://issues.apache.org/jira/browse/YARN-2881 Project: Hadoop YARN Issue Type: Sub-task Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2773) ReservationSystem's use of Queue names vs paths is inconsistent for CapacityReservationSystem and FairReservationSystem
Anubhav Dhoot created YARN-2773: --- Summary: ReservationSystem's use of Queue names vs paths is inconsistent for CapacityReservationSystem and FairReservationSystem Key: YARN-2773 URL: https://issues.apache.org/jira/browse/YARN-2773 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Anubhav Dhoot Priority: Minor The reservation system uses the queue name in the ReservationDefinition to choose which reservation queue is being used. CapacityScheduler does not allow duplicate leaf queue names, so a unique leaf queue can be referred to simply by its name rather than its full path (which includes the parent name plus a '.'). FairScheduler allows duplicate leaf queue names, so the full queue name is needed to identify a queue uniquely. This is inconsistent in the implementation of AbstractReservationSystem, where one implementation of getQueuePath does the conversion (CapacityReservationSystem) while FairReservationSystem returns the same value back. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2738) Add FairReservationSystem for FairScheduler
Anubhav Dhoot created YARN-2738: --- Summary: Add FairReservationSystem for FairScheduler Key: YARN-2738 URL: https://issues.apache.org/jira/browse/YARN-2738 Project: Hadoop YARN Issue Type: Sub-task Reporter: Anubhav Dhoot Need to create a FairReservationSystem that will implement ReservationSystem for FairScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2690) Make ReservationSystem and its dependent classes independent of Scheduler type
Anubhav Dhoot created YARN-2690: --- Summary: Make ReservationSystem and its dependent classes independent of Scheduler type Key: YARN-2690 URL: https://issues.apache.org/jira/browse/YARN-2690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Anubhav Dhoot A lot of common reservation classes depend on CapacityScheduler, and specifically its configuration. This JIRA is to make them ready for other schedulers by abstracting out the configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2661) Container Localization is not resource limited
Anubhav Dhoot created YARN-2661: --- Summary: Container Localization is not resource limited Key: YARN-2661 URL: https://issues.apache.org/jira/browse/YARN-2661 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Container localization itself can take up a lot of resources. Today this is not resource limited in any way and can adversely affect actual containers running on the node -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2624) Resource Localization fails on a secure cluster until NMs are restarted
Anubhav Dhoot created YARN-2624: --- Summary: Resource Localization fails on a secure cluster until NMs are restarted Key: YARN-2624 URL: https://issues.apache.org/jira/browse/YARN-2624 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot We have found that resource localization fails on a secure cluster with the following error in certain cases. This happens at some indeterminate point, after which it keeps failing until the NM is restarted. {noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2224) Let TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective of the default settings
Anubhav Dhoot created YARN-2224: --- Summary: Let TestContainersMonitor#testContainerKillOnMemoryOverflow work irrespective of the default settings Key: YARN-2224 URL: https://issues.apache.org/jira/browse/YARN-2224 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot If the default setting DEFAULT_NM_VMEM_CHECK_ENABLED is set to false, the test will fail. Make the test not rely on the default settings; instead, let it verify that once the setting is turned on, the memory check actually happens. -- This message was sent by Atlassian JIRA (v6.2#6252)
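The fix direction, sketched: pin the setting on explicitly in the test's configuration so the assertion no longer depends on the build's default.
{code}
// In the test setup, before starting the containers monitor:
conf.setBoolean(YarnConfiguration.NM_VMEM_CHECK_ENABLED, true);
// The test now verifies the kill-on-overflow behavior of the check itself,
// independent of what DEFAULT_NM_VMEM_CHECK_ENABLED happens to be.
{code}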
[jira] [Created] (YARN-2192) TestRMHA fails when run with a mix of Schedulers
Anubhav Dhoot created YARN-2192: --- Summary: TestRMHA fails when run with a mix of Schedulers Key: YARN-2192 URL: https://issues.apache.org/jira/browse/YARN-2192 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Some TestRMHA tests assume the CapacityScheduler. If the test is run with multiple schedulers, some of the tests fail because the MetricsSystem objects are shared across tests, failing as below. {code} Error Message Metrics source QueueMetrics,q0=root already exists! Stacktrace org.apache.hadoop.metrics2.MetricsException: Metrics source QueueMetrics,q0=root already exists! at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:126) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:107) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:96) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1281) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:427) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2119) Fix the DEFAULT_PROXY_ADDRESS used for getBindAddress to fix 1590
Anubhav Dhoot created YARN-2119: --- Summary: Fix the DEFAULT_PROXY_ADDRESS used for getBindAddress to fix 1590 Key: YARN-2119 URL: https://issues.apache.org/jira/browse/YARN-2119 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot The fix for [YARN-1590|https://issues.apache.org/jira/browse/YARN-1590] introduced a method to get the web proxy bind address with an incorrect default port. Because the method's only user ignores the port, it's not breaking anything yet. Fixing it in case someone else uses it in the future. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler
Anubhav Dhoot created YARN-2109: --- Summary: TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler Key: YARN-2109 URL: https://issues.apache.org/jira/browse/YARN-2109 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Anubhav Dhoot testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set it to be CapacityScheduler. But if the default scheduler is set to FairScheduler, then the rest of the tests that execute after this one will fail with ClassCastExceptions when getting the queue metrics. This is based on test execution order, as only the tests that execute after this test will fail. This is because the queue metrics will be initialized by this test to QueueMetrics and shared by the subsequent tests. We can explicitly clear the metrics at the end of this test to fix this. For example: {code} java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot be cast to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:90) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:85) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:81) at org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
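The cleanup the description suggests, in sketch form; whether to clear just the queue metrics (QueueMetrics.clearQueueMetrics()) or reset the whole metrics system is an implementation choice:
{code}
@After
public void tearDown() {
  // Drop the QueueMetrics source registered by this test so a subsequent
  // FairScheduler test can register FSQueueMetrics under the same name.
  DefaultMetricsSystem.shutdown();
}
{code}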
[jira] [Created] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
Anubhav Dhoot created YARN-2110: --- Summary: TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler Key: YARN-2110 URL: https://issues.apache.org/jira/browse/YARN-2110 Project: Hadoop YARN Issue Type: Bug Environment: The TestAMRestart#testAMRestartWithExistingContainers does a cast to CapacityScheduler in a couple of places {code} ((CapacityScheduler) rm1.getResourceScheduler()) {code} If run with FairScheduler as the default scheduler, the test throws a {code} java.lang.ClassCastException {code}. Reporter: Anubhav Dhoot -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2096) testQueueMetricsOnRMRestart has race condition
Anubhav Dhoot created YARN-2096: --- Summary: testQueueMetricsOnRMRestart has race condition Key: YARN-2096 URL: https://issues.apache.org/jira/browse/YARN-2096 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testQueueMetricsOnRMRestart fails randomly because of a race condition. The test validates that metrics are incremented, but does not wait for all transitions to finish before checking the values. It also resets metrics after kicking off recovery of the second RM. The metrics that need to be incremented race with this reset, causing the test to fail randomly. We need to wait for the right transitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
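The standard cure for this class of flakiness, sketched with Hadoop's test utility; the target state and timeouts here are assumptions:
{code}
// Poll every 100 ms, for up to 10 s, until the app finishes its
// transitions, and only then assert on the queue metrics.
GenericTestUtils.waitFor(
    () -> app.getState() == RMAppState.FINISHED, 100, 10_000);
{code}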
[jira] [Created] (YARN-2089) FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations
Anubhav Dhoot created YARN-2089: --- Summary: FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations Key: YARN-2089 URL: https://issues.apache.org/jira/browse/YARN-2089 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.0 Reporter: Anubhav Dhoot We should mark QueuePlacementPolicy and QueuePlacementRule with audience annotations @Private @Unstable -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1923) Make FairScheduler resource ratio calculations terminate faster
Anubhav Dhoot created YARN-1923: --- Summary: Make FairScheduler resource ratio calculations terminate faster Key: YARN-1923 URL: https://issues.apache.org/jira/browse/YARN-1923 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot In the FairScheduler, computing shares continues until all iterations are complete even when there is a perfect match between the resource shares and the total resources. This is because the binary search checks only less-than and greater-than, never equality. Add an early termination condition for when they are equal. -- This message was sent by Atlassian JIRA (v6.2#6252)
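The proposed early exit, sketched against a simplified version of the share-computation loop; usedWithRatio() stands in for the real weight-to-resource helper, and the bound names are assumptions:
{code}
double left = 0.0, right = rMax;          // rMax: upper bound on the ratio
for (int i = 0; i < COMPUTATION_ITERATIONS; i++) {
  double mid = (left + right) / 2.0;
  long used = usedWithRatio(mid);         // resources consumed at this ratio
  if (used == totalResource) {
    break;   // perfect match: stop early instead of finishing all iterations
  } else if (used < totalResource) {
    left = mid;
  } else {
    right = mid;
  }
}
{code}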