[jira] [Created] (YARN-9013) [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner
Botong Huang created YARN-9013:
-------------------------------

Summary: [GPG] fix order of steps cleaning Registry entries in ApplicationCleaner
Key: YARN-9013
URL: https://issues.apache.org/jira/browse/YARN-9013
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

ApplicationCleaner today deletes the entries for all finished (non-running) applications in the Yarn Registry using this logic:
1. GPG gets the list of running applications from the Router.
2. GPG gets the full list of applications in the Registry.
3. GPG deletes from the Registry every app in 2 that is not in 1.

The problem is that jobs started between 1 and 2 meet the criteria in 3, and thus get deleted by mistake. The fix/right order should be 2 -> 1 -> 3, rather than 1 -> 2 -> 3.
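For illustration, a minimal sketch of the corrected ordering; the method names (getRegistryApps, getRouterRunningApps, deleteRegistryEntry) are hypothetical stand-ins, not the real GPG/Router/Registry API:

{code:java}
import java.util.HashSet;
import java.util.Set;

public class ApplicationCleanerOrderSketch {

  /** Hypothetical stand-ins for the real Router/Registry calls. */
  interface Gpg {
    Set<String> getRegistryApps();       // full list of apps in the Yarn Registry
    Set<String> getRouterRunningApps();  // running apps known to the Router
    void deleteRegistryEntry(String appId);
  }

  // Snapshot the Registry BEFORE asking the Router (2 -> 1 -> 3): an app that
  // starts in between is absent from the Registry snapshot, so it can never be
  // deleted by mistake.
  static void cleanup(Gpg gpg) {
    Set<String> registrySnapshot = new HashSet<>(gpg.getRegistryApps()); // step 2
    Set<String> running = gpg.getRouterRunningApps();                    // step 1
    for (String appId : registrySnapshot) {                              // step 3
      if (!running.contains(appId)) {
        gpg.deleteRegistryEntry(appId);
      }
    }
  }
}
{code}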
[jira] [Created] (YARN-8933) [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in allocation response
Botong Huang created YARN-8933:
-------------------------------

Summary: [AMRMProxy] Fix potential null AvailableResource and NumClusterNode in allocation response
Key: YARN-8933
URL: https://issues.apache.org/jira/browse/YARN-8933
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

After YARN-8696, the allocate response returned by FederationInterceptor is merged from the responses of a random subset of all sub-clusters, depending on the async heartbeat timing. As a result, cluster-wide information fields in the response, e.g. AvailableResources and NumClusterNodes, are not consistent at all. They can even be null/zero when a particular response happens to be merged from an empty set of sub-cluster responses. In this patch, we let FederationInterceptor remember the last allocate response from every known sub-cluster, and always construct the cluster-wide info fields from all of them. We also move the sub-cluster timeout from LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that sub-clusters that have expired (no successful allocate response for a while) are not included in the computation.
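A rough sketch of the idea, using hypothetical class and field names rather than the real FederationInterceptor members: keep the last response per sub-cluster and rebuild the cluster-wide fields from the non-expired ones on every merge.

{code:java}
import java.util.Map;

public class ClusterWideInfoSketch {

  /** Hypothetical snapshot of the last allocate response from one sub-cluster. */
  static class LastResponse {
    long lastSuccessMillis;
    int numClusterNodes;
    long availableMemoryMb;
    int availableVcores;
  }

  static class ClusterWideInfo {
    long availableMemoryMb;
    int availableVcores;
    int numClusterNodes;
  }

  // Sub-clusters without a successful allocate response within the timeout are
  // considered expired and excluded from the merged cluster-wide view.
  static ClusterWideInfo merge(Map<String, LastResponse> lastResponses,
      long nowMillis, long subClusterTimeoutMillis) {
    ClusterWideInfo info = new ClusterWideInfo();
    for (LastResponse r : lastResponses.values()) {
      if (nowMillis - r.lastSuccessMillis > subClusterTimeoutMillis) {
        continue;
      }
      info.availableMemoryMb += r.availableMemoryMb;
      info.availableVcores += r.availableVcores;
      info.numClusterNodes += r.numClusterNodes;
    }
    return info;
  }
}
{code}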
[jira] [Created] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
Botong Huang created YARN-8893:
-------------------------------

Summary: [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
Key: YARN-8893
URL: https://issues.apache.org/jira/browse/YARN-8893
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

Fix the thread leak in AMRMClientRelayer and the UAM client used by FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy.
[jira] [Created] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner
Botong Huang created YARN-8862:
-------------------------------

Summary: [GPG] add Yarn Registry cleanup in ApplicationCleaner
Key: YARN-8862
URL: https://issues.apache.org/jira/browse/YARN-8862
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

In Yarn Federation, we use the Yarn Registry to store the AM token for UAMs in secondary sub-clusters. Because there may be more app attempts later, AMRMProxy cannot kill the UAM and delete the tokens when one local attempt finishes. So, similar to the StateStore application table, we need the ApplicationCleaner in GPG to clean up the finished app entries in the Yarn Registry.
[jira] [Resolved] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator
[ https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Botong Huang resolved YARN-7599.
Resolution: Fixed

> [GPG] ApplicationCleaner in Global Policy Generator
> ---------------------------------------------------
>
> Key: YARN-7599
> URL: https://issues.apache.org/jira/browse/YARN-7599
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Botong Huang
> Assignee: Botong Huang
> Priority: Minor
> Labels: federation, gpg
> Attachments: YARN-7599-YARN-7402.v1.patch, YARN-7599-YARN-7402.v2.patch, YARN-7599-YARN-7402.v3.patch, YARN-7599-YARN-7402.v4.patch, YARN-7599-YARN-7402.v5.patch, YARN-7599-YARN-7402.v6.patch, YARN-7599-YARN-7402.v7.patch, YARN-7599-YARN-7402.v8.patch
>
> In Federation, we need a cleanup service for StateStore as well as Yarn Registry. For the former, we need to remove old application records. For the latter, failed and killed applications might leave records in the Yarn Registry (see YARN-6128). We plan to do both cleanup work in ApplicationCleaner in GPG.
[jira] [Created] (YARN-8760) Fix concurrent re-register due to YarnRM failover in AMRMClientRelayer
Botong Huang created YARN-8760:
-------------------------------

Summary: Fix concurrent re-register due to YarnRM failover in AMRMClientRelayer
Key: YARN-8760
URL: https://issues.apache.org/jira/browse/YARN-8760
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

When the home YarnRM is failing over, the FinishApplicationMaster call from the AM can have multiple retry threads outstanding in FederationInterceptor. When the new YarnRM comes back up, all retry threads will re-register with it. The first one will succeed, but the rest will get an "Application Master is already registered" exception. We should catch and swallow this exception and move on.
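A minimal sketch of the intended handling, assuming a hypothetical RM client stub and matching on the exception message (the exact exception type and message matching in the real patch may differ):

{code:java}
public class ReRegisterSketch {

  /** Hypothetical protocol stub standing in for the real RM client. */
  interface RmClient {
    void registerApplicationMaster() throws Exception;
  }

  // Several retry threads may re-register concurrently after an RM failover;
  // only the first succeeds, so the "already registered" failure is benign and
  // should be swallowed instead of failing the FinishApplicationMaster call.
  static void reRegister(RmClient rm) throws Exception {
    try {
      rm.registerApplicationMaster();
    } catch (Exception e) {
      String msg = e.getMessage();
      if (msg != null && msg.contains("Application Master is already registered")) {
        return; // benign: another retry thread won the race
      }
      throw e;
    }
  }
}
{code}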
[jira] [Created] (YARN-8705) Refactor in preparation for YARN-8696
Botong Huang created YARN-8705:
-------------------------------

Summary: Refactor in preparation for YARN-8696
Key: YARN-8705
URL: https://issues.apache.org/jira/browse/YARN-8705
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

Refactor the UAM heartbeat thread as well as the callback method, in preparation for the YARN-8696 FederationInterceptor upgrade.
[jira] [Created] (YARN-8697) LocalityMulticastAMRMProxyPolicy should fallback to random sub-cluster when cannot resolve resource
Botong Huang created YARN-8697:
-------------------------------

Summary: LocalityMulticastAMRMProxyPolicy should fallback to random sub-cluster when cannot resolve resource
Key: YARN-8697
URL: https://issues.apache.org/jira/browse/YARN-8697
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

Right now in LocalityMulticastAMRMProxyPolicy, whenever we cannot resolve the resource name (node or rack), we always route the request to the home sub-cluster. However, the home sub-cluster might not always be ready to use (timed out, see YARN-8581) or enabled (by the AMRMProxyPolicy weights). It might also be overwhelmed by such requests if the sub-cluster resolver has an issue. In this JIRA, we change the policy to pick a random active and enabled sub-cluster for resource requests we cannot resolve.
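A sketch of the fallback selection, with the list of active and enabled sub-clusters as a hypothetical stand-in for the policy's internal view:

{code:java}
import java.util.List;
import java.util.Random;

public class RandomFallbackSketch {

  // When a resource name (node or rack) cannot be resolved, pick a random
  // sub-cluster among the active AND enabled ones instead of always using home.
  static String pickSubCluster(List<String> activeAndEnabled, Random rng) {
    if (activeAndEnabled.isEmpty()) {
      throw new IllegalStateException("no active and enabled sub-cluster");
    }
    return activeAndEnabled.get(rng.nextInt(activeAndEnabled.size()));
  }
}
{code}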
[jira] [Created] (YARN-8696) FederationInterceptor upgrade: home sub-cluster heartbeat async
Botong Huang created YARN-8696:
-------------------------------

Summary: FederationInterceptor upgrade: home sub-cluster heartbeat async
Key: YARN-8696
URL: https://issues.apache.org/jira/browse/YARN-8696
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

Today in _FederationInterceptor_, the heartbeat to the home sub-cluster is synchronous. After the heartbeat is sent out to the home sub-cluster, it waits for the home response to come back before merging and returning the (merged) heartbeat result back to the AM. If the home sub-cluster is suffering from connection issues, or is down during a YarnRM master-slave switch, all heartbeat threads in _FederationInterceptor_ will be blocked waiting for the home response. As a result, the successful UAM heartbeats from secondary sub-clusters will not be returned to the AM at all.

Additionally, because we kept the same heartbeat responseId between the AM and the home RM, a lot of tricky handling is needed around the responseId resync when it comes to _FederationInterceptor_ (part of AMRMProxy, NM) work-preserving restart (YARN-6127, YARN-1336), home RM master-slave switches, etc.

In this patch, we change the heartbeat to the home sub-cluster to asynchronous, the same way we handle UAM heartbeats in the secondaries, so that any sub-cluster outage or connection issue won't stop the AM from getting responses from the other sub-clusters. The responseId is also managed separately for the home sub-cluster and the AM, and they increment independently. The resync logic becomes much cleaner.
[jira] [Created] (YARN-8673) [AMRMProxy] More robust responseId resync after an YarnRM master slave switch
Botong Huang created YARN-8673:
-------------------------------

Summary: [AMRMProxy] More robust responseId resync after an YarnRM master slave switch
Key: YARN-8673
URL: https://issues.apache.org/jira/browse/YARN-8673
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

After a master-slave switch of YarnRM, an _ApplicationNotRegisteredException_ will be thrown by the new YarnRM. The AM will re-register and reset the responseId to zero. _AMRMClientRelayer_ inside _FederationInterceptor_ follows the same protocol and does the automatic re-register and responseId resync. However, when an exception or a temporary network issue happens in the allocate call after re-register, the resync logic might break. This patch improves the robustness of the process by parsing the expected responseId from the YarnRM exception message, so that whenever the responseId is out of sync for whatever reason, we can automatically resync and move on.
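A sketch of the parsing idea only; the message pattern below is an assumption for illustration, the real YarnRM exception text (and the pattern used by the patch) may differ:

{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResponseIdResyncSketch {

  // Hypothetical message format: extract the responseId the RM says it expects
  // and jump straight to it, instead of relying on a fragile resync sequence.
  private static final Pattern EXPECTED_ID =
      Pattern.compile("expect responseId to be (\\d+)");

  static int parseExpectedResponseId(String exceptionMessage, int fallback) {
    if (exceptionMessage == null) {
      return fallback;
    }
    Matcher m = EXPECTED_ID.matcher(exceptionMessage);
    return m.find() ? Integer.parseInt(m.group(1)) : fallback;
  }
}
{code}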
[jira] [Created] (YARN-8658) Metrics for AMRMClientRelayer inside FederationInterceptor
Botong Huang created YARN-8658:
-------------------------------

Summary: Metrics for AMRMClientRelayer inside FederationInterceptor
Key: YARN-8658
URL: https://issues.apache.org/jira/browse/YARN-8658
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Young Chen

AMRMClientRelayer (YARN-7900) is introduced for the stateful FederationInterceptor (YARN-7899) to keep track of all pending requests sent to every sub-cluster YarnRM. We need to add metrics for AMRMClientRelayer to show the state of things in FederationInterceptor.
[jira] [Created] (YARN-8581) [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy
Botong Huang created YARN-8581:
-------------------------------

Summary: [AMRMProxy] Add sub-cluster timeout in LocalityMulticastAMRMProxyPolicy
Key: YARN-8581
URL: https://issues.apache.org/jira/browse/YARN-8581
Project: Hadoop YARN
Issue Type: Task
Components: amrmproxy, federation
Reporter: Botong Huang
Assignee: Botong Huang

In Federation, every time an AM heartbeat comes in, LocalityMulticastAMRMProxyPolicy in AMRMProxy splits the asks according to the list of active and enabled sub-clusters. However, if we haven't been able to heartbeat to a sub-cluster for some time (network issues, we keep hitting some exception from YarnRM, a YarnRM master-slave switch is taking a long time, etc.), we should consider the sub-cluster unhealthy and stop routing asks there, until the heartbeat channel becomes healthy again.
[jira] [Created] (YARN-8536) Add max heap config option for Federation Router
Botong Huang created YARN-8536:
-------------------------------

Summary: Add max heap config option for Federation Router
Key: YARN-8536
URL: https://issues.apache.org/jira/browse/YARN-8536
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
[jira] [Created] (YARN-8534) Add max heap config option for Federation Router and GPG
Botong Huang created YARN-8534:
-------------------------------

Summary: Add max heap config option for Federation Router and GPG
Key: YARN-8534
URL: https://issues.apache.org/jira/browse/YARN-8534
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
[jira] [Created] (YARN-8481) AMRMProxyPolicies should accept heartbeat response from new/unknown subclusters
Botong Huang created YARN-8481:
-------------------------------

Summary: AMRMProxyPolicies should accept heartbeat response from new/unknown subclusters
Key: YARN-8481
URL: https://issues.apache.org/jira/browse/YARN-8481
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang

Currently BroadcastAMRMProxyPolicy assumes that the application only spans to the sub-clusters it instructed via _splitResourceRequests_. However, with AMRMProxy HA, a second attempt of the application might initially come up with multiple sub-clusters without consulting the AMRMProxyPolicy at all. This leads to exceptions in _notifyOfResponse_. It should simply accept heartbeat responses from new/unknown sub-clusters.
[jira] [Created] (YARN-8451) Multiple NM heartbeat thread created when a slow NM resync with RM
Botong Huang created YARN-8451:
-------------------------------

Summary: Multiple NM heartbeat thread created when a slow NM resync with RM
Key: YARN-8451
URL: https://issues.apache.org/jira/browse/YARN-8451
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang

During an NM resync with the RM (say the RM did a master-slave switch), if the NM is running slow, more than one RESYNC event may be put into the NM dispatcher by the existing heartbeat thread before they are processed. As a result, multiple new heartbeat threads are later created and start to heartbeat to the RM concurrently, each with its own responseId. If at some point in time one thread falls more than one step behind the others, the RM will send back a resync signal in that heartbeat response, killing all containers on this NM. See comments below for details on how this can happen.
[jira] [Created] (YARN-8433) TestAMRestart flaky in trunk
Botong Huang created YARN-8433:
-------------------------------

Summary: TestAMRestart flaky in trunk
Key: YARN-8433
URL: https://issues.apache.org/jira/browse/YARN-8433
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang

[org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testContainersFromPreviousAttemptsWithRMRestart[FAIR]|https://builds.apache.org/job/PreCommit-YARN-Build/21002/testReport/org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager/TestAMRestart/testContainersFromPreviousAttemptsWithRMRestart_FAIR_/]
Attempt state is not correct (timeout). expected: but was:

[org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testPreemptedAMRestartOnRMRestart[FAIR]|https://builds.apache.org/job/PreCommit-YARN-Build/21014/testReport/org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager/TestAMRestart/testPreemptedAMRestartOnRMRestart_FAIR_/]
test timed out after 6 milliseconds
[jira] [Created] (YARN-8412) Move ResourceRequest.clone logic everywhere into a proper API
Botong Huang created YARN-8412:
-------------------------------

Summary: Move ResourceRequest.clone logic everywhere into a proper API
Key: YARN-8412
URL: https://issues.apache.org/jira/browse/YARN-8412
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

ResourceRequest.clone code is replicated in lots of places, some copies missing a field or two because new fields were added over time. This JIRA moves the logic into a proper API so that everyone can use a single implementation.
[jira] [Resolved] (YARN-8334) [GPG] Fix potential connection leak in GPGUtils
[ https://issues.apache.org/jira/browse/YARN-8334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Botong Huang resolved YARN-8334.
Resolution: Fixed

> [GPG] Fix potential connection leak in GPGUtils
> -----------------------------------------------
>
> Key: YARN-8334
> URL: https://issues.apache.org/jira/browse/YARN-8334
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Giovanni Matteo Fumarola
> Assignee: Giovanni Matteo Fumarola
> Priority: Minor
> Attachments: YARN-8334-YARN-7402.v1.patch, YARN-8334-YARN-7402.v2.patch
>
> Missing ClientResponse.close and Client.destroy can lead to a connection leak.
[jira] [Created] (YARN-8227) TestPlacementConstraintTransformations is failing in trunk
Botong Huang created YARN-8227:
-------------------------------

Summary: TestPlacementConstraintTransformations is failing in trunk
Key: YARN-8227
URL: https://issues.apache.org/jira/browse/YARN-8227
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang

[ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.1 s <<< FAILURE! - in org.apache.hadoop.yarn.api.resource.TestPlacementConstraintTransformations
[ERROR] testCardinalityConstraint(org.apache.hadoop.yarn.api.resource.TestPlacementConstraintTransformations) Time elapsed: 0.007 s <<< FAILURE!
java.lang.AssertionError: expected: java.util.HashSet<[hb]> but was: java.util.HashSet<[hb]>
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.failNotEquals(Assert.java:743)
    at org.junit.Assert.assertEquals(Assert.java:118)
    at org.junit.Assert.assertEquals(Assert.java:144)
    at org.apache.hadoop.yarn.api.resource.TestPlacementConstraintTransformations.testCardinalityConstraint(TestPlacementConstraintTransformations.java:116)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
[jira] [Created] (YARN-8110) AMRMProxy recover should catch for all throwable retrying to recover apps
Botong Huang created YARN-8110:
-------------------------------

Summary: AMRMProxy recover should catch for all throwable retrying to recover apps
Key: YARN-8110
URL: https://issues.apache.org/jira/browse/YARN-8110
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

In NM work-preserving restart, when AMRMProxy recovers applications one by one, the current code only catches IOException. If one app recovery throws anything else (e.g. a RuntimeException), it will fail the entire AMRMProxy recovery.
[jira] [Created] (YARN-8010) add config in FederationRMFailoverProxy to not bypass facade cache when failing over
Botong Huang created YARN-8010:
-------------------------------

Summary: add config in FederationRMFailoverProxy to not bypass facade cache when failing over
Key: YARN-8010
URL: https://issues.apache.org/jira/browse/YARN-8010
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

Today when YarnRM is failing over, the FederationRMFailoverProxy running in AMRMProxy will perform failover, try to get the latest sub-cluster info from the FederationStateStore, and then retry connecting to the latest YarnRM master. When calling getSubCluster() on FederationStateStoreFacade, it bypasses the cache with a flush flag. While YarnRM is failing over, every AM heartbeat thread creates a different thread inside FederationInterceptor, each of which keeps performing failover several times. This leads to a big spike of getSubCluster calls to the FederationStateStore.

Depending on the cluster setup (e.g. putting a VIP in front of all YarnRMs), a YarnRM master-slave switch might not result in an RM address change. In other cases, a small delay in getting the latest sub-cluster information may be acceptable. This patch therefore adds a config option that allows the FederationRMFailoverProxy to not flush the cache when calling getSubCluster().
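A minimal sketch of the flag, kept dependency-free with java.util.Properties; the property name is hypothetical, the real YarnConfiguration key added by the patch may differ:

{code:java}
import java.util.Properties;

public class FacadeCacheFlagSketch {

  // Hypothetical property name for illustration only.
  static final String FLUSH_CACHE_KEY =
      "yarn.federation.failover.flush-cache-on-get-subcluster";

  // Default true preserves today's behavior; setting it to false lets the
  // failover proxy serve getSubCluster() from the facade cache during failover.
  static boolean shouldFlushCache(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(FLUSH_CACHE_KEY, "true"));
  }
}
{code}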
[jira] [Created] (YARN-7918) TestAMRMClientPlacementConstraints.testAMRMClientWithPlacementConstraints failing in trunk
Botong Huang created YARN-7918:
-------------------------------

Summary: TestAMRMClientPlacementConstraints.testAMRMClientWithPlacementConstraints failing in trunk
Key: YARN-7918
URL: https://issues.apache.org/jira/browse/YARN-7918
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang

java.lang.AssertionError: expected:<2> but was:<1>
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.failNotEquals(Assert.java:743)
    at org.junit.Assert.assertEquals(Assert.java:118)
    at org.junit.Assert.assertEquals(Assert.java:555)
    at org.junit.Assert.assertEquals(Assert.java:542)
    at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientPlacementConstraints.testAMRMClientWithPlacementConstraints(TestAMRMClientPlacementConstraints.java:161)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
[jira] [Created] (YARN-7900) [AMRMProxy] AMRMClientRelayer for stateful FederationInterceptor
Botong Huang created YARN-7900:
-------------------------------

Summary: [AMRMProxy] AMRMClientRelayer for stateful FederationInterceptor
Key: YARN-7900
URL: https://issues.apache.org/jira/browse/YARN-7900
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

Inside the stateful FederationInterceptor (YARN-7899), we need a component similar to AMRMClient that remembers all pending (outstanding) requests we've sent to YarnRM, auto re-registers, and does a full pending resend when YarnRM fails over and throws an ApplicationMasterNotRegisteredException back. This JIRA adds this component as AMRMClientRelayer.
[jira] [Created] (YARN-7899) [AMRMProxy] Stateful FederationInterceptor for pending requests
Botong Huang created YARN-7899:
-------------------------------

Summary: [AMRMProxy] Stateful FederationInterceptor for pending requests
Key: YARN-7899
URL: https://issues.apache.org/jira/browse/YARN-7899
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

Today FederationInterceptor (in AMRMProxy for YARN Federation) is stateless in terms of pending (outstanding) requests. Whenever the AM issues new requests, FI simply splits and sends them to the sub-cluster YarnRMs and forgets about them. This JIRA attempts to make FI stateful so that it remembers the pending requests in all relevant sub-clusters. This has two major benefits:

1. It is a prerequisite for FI to be able to cancel a pending request in one sub-cluster and re-send it to other sub-clusters. This is needed for load balancing and to fully comply with the relaxed-locality fallback-to-ANY semantics. When we send a request to one sub-cluster, we have effectively constrained the allocation for this request to that sub-cluster rather than everywhere. If the cluster capacity in this sub-cluster for this app is exhausted, or this YarnRM is overloaded and slow, the request will be stuck there for a long time even if there is free capacity in other sub-clusters. We need FI to remember and adjust the pending requests on the fly.

2. It makes pending request recovery easier when a YarnRM fails over. Today, whenever one sub-cluster RM fails over, in order to recover the lost pending requests for that sub-cluster we have to propagate the ApplicationMasterNotRegisteredException from the YarnRM back to the AM, triggering a full pending resend from the AM. This resend contains pending requests not only for the failing-over sub-cluster but for all of them. Since our split-merge (AMRMProxyPolicy) does not guarantee idempotency, the same request we sent to sub-cluster-1 earlier might be resent to sub-cluster-2. If both of these YarnRMs have not failed over, they will both allocate for this request, leading to over-allocation. These full pending resends also put unnecessary load on every YarnRM in the cluster every time one YarnRM fails over. With a stateful FederationInterceptor, since we remember the pending requests we have sent out earlier, we can shield the AM from the ApplicationMasterNotRegisteredException and resend the pending requests only to the failed-over YarnRM. This eliminates the over-allocation and minimizes the recovery overhead.
[jira] [Created] (YARN-7720) [Federation] Race condition between second app attempt and UAM heartbeat when first attempt node is down
Botong Huang created YARN-7720:
-------------------------------

Summary: [Federation] Race condition between second app attempt and UAM heartbeat when first attempt node is down
Key: YARN-7720
URL: https://issues.apache.org/jira/browse/YARN-7720
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Botong Huang
Assignee: Botong Huang

In Federation, multiple attempts of an application share the same UAM in each secondary sub-cluster. When the first attempt fails, we rely on the fact that the secondary RM won't kill the existing UAM before the AM heartbeat timeout (default 10 min). When the second attempt comes up in the home sub-cluster, it picks up the UAM token from the Yarn Registry and resumes the UAM heartbeats to the secondary RMs.

The default heartbeat timeouts for NM and AM are both 10 minutes. The problem is that when the first attempt's node goes down or loses connectivity, only after 10 minutes will the home RM mark the first attempt as failed and then schedule the second attempt on some other node. By then the UAMs in the secondaries are already timing out, and they might not survive until the second attempt comes up.
[jira] [Created] (YARN-7676) Fix inconsistent priority ordering in Priority and SchedulerRequestKey
Botong Huang created YARN-7676:
-------------------------------

Summary: Fix inconsistent priority ordering in Priority and SchedulerRequestKey
Key: YARN-7676
URL: https://issues.apache.org/jira/browse/YARN-7676
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor

Today the priority ordering in _Priority.compareTo()_ and _SchedulerRequestKey.compareTo()_ is inconsistent. Both _compareTo_ methods try to reverse the order: P0.compareTo(P1) > 0 means that priority-wise P0 < P1, yet SK(P0).compareTo(SK(P1)) < 0 means that priority-wise SK(P0) > SK(P1). This JIRA attempts to fix that by undoing the reversing logic in both, so that the orderings agree: priority-wise P0 > P1 and SK(P0) > SK(P1).
[jira] [Created] (YARN-7631) ResourceRequest with different Capacity (Resource) overrides each other in RM
Botong Huang created YARN-7631:
-------------------------------

Summary: ResourceRequest with different Capacity (Resource) overrides each other in RM
Key: YARN-7631
URL: https://issues.apache.org/jira/browse/YARN-7631
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang

Today in AMRMClientImpl, the ResourceRequests (RR) are kept as: RequestId -> Priority -> ResourceName -> ExecutionType -> Resource (Capacity) -> ResourceRequestInfo (the actual RR). This means that only RRs with the same (requestId, priority, resourceName, executionType, resource) will be grouped and aggregated together. On the RM side, however, the mapping is SchedulerRequestKey (RequestId, priority) -> LocalityAppPlacementAllocator (ResourceName -> RR).

The issue is that on the RM side, Resource is not part of the key for the RR at all. (Note that executionType is also not part of the RM-side key, but that is fine because the RM handles it separately as container update requests.) This means that under the same (requestId, priority, resourceName), RRs with different Resource values will be grouped together and override each other in the RM. As a result, some of the container requests are lost and will never be allocated. Furthermore, since the two RRs are kept under different keys on the AMRMClient side, an allocation for RR1 will only trigger a cancel for RR1; the pending RR2 will not get resent either. I've attached a unit test (resourcebug.patch), failing in trunk, that illustrates this issue.
[jira] [Created] (YARN-7630) Fix AMRMToken handling in AMRMProxy
Botong Huang created YARN-7630:
-------------------------------

Summary: Fix AMRMToken handling in AMRMProxy
Key: YARN-7630
URL: https://issues.apache.org/jira/browse/YARN-7630
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor

Symptom: after the RM rolls over the master key for the AMRMToken, whenever the RPC connection from FederationInterceptor to the RM breaks due to a transient network issue and reconnects, heartbeats to the RM start failing with the "Invalid AMRMToken" exception. Whenever it hits, it happens for both the home RM and the secondary RMs.

Related facts:
1. When the RM issues a new AMRMToken, it always sends it with the service name field as an empty string. The RPC layer on the AM side sets it properly before starting to use it.
2. UGI keeps all tokens in a map from serviceName -> Token. Initially, AMRMClientUtils.createRMProxy() is used to load the first token and start the RM connection.
3. When the RM renews the token, YarnServerSecurityUtils.updateAMRMToken() is used to load it into the UGI and replace the existing token (with the same serviceName key).

Bug: 2 (AMRMClientUtils.createRMProxy()) and 3 (YarnServerSecurityUtils.updateAMRMToken()) do not handle the sequence consistently. We always need to load the token (with the empty service name) into the UGI first, before we set the serviceName, so that the previous AMRMToken is overridden. But 2 does it the other way around. That's why, after the RM rolls the AMRMToken, the UGI ends up with two tokens. Whenever the RPC connection breaks and reconnects, the wrong token can be picked, triggering the exception.
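A simplified illustration of the ordering requirement, with a hypothetical Token class and a plain map standing in for the UGI credential store (the real UGI/Credentials API behaves differently in detail):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class AmrmTokenUpdateSketch {

  /** Hypothetical minimal token: the RM issues it with an empty service name. */
  static class Token {
    String service = "";
    final String secret;
    Token(String secret) { this.secret = secret; }
  }

  /** Hypothetical stand-in for the UGI token map keyed by service name. */
  static final Map<String, Token> ugiTokens = new HashMap<>();

  // Consistent sequence: add the token to the credentials FIRST, while its
  // service name is still the empty string, so it replaces the previous
  // AMRMToken stored under the same key; only then set the service name for
  // the RPC layer. Doing this in the reverse order can leave two tokens in
  // the UGI, so a reconnect may pick the stale one ("Invalid AMRMToken").
  static void updateAmrmToken(Token newToken, String rmAddress) {
    ugiTokens.put(newToken.service, newToken); // overrides the old token under ""
    newToken.service = rmAddress;              // now bind it to the RM address
  }
}
{code}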
[jira] [Created] (YARN-7599) Application cleaner and subcluster cleaner in Global Policy Generator
Botong Huang created YARN-7599:
-------------------------------

Summary: Application cleaner and subcluster cleaner in Global Policy Generator
Key: YARN-7599
URL: https://issues.apache.org/jira/browse/YARN-7599
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor

In Federation, we need a cleanup service for the StateStore as well as the Yarn Registry. For the former, we need to remove old application records as well as inactive sub-clusters. For the latter, failed and killed applications might leave records in the Yarn Registry (see YARN-6128). We plan to add both cleanup services in GPG.
[jira] [Created] (YARN-7479) TestContainerManagerSecurity.testContainerManager[Simple] flaky in trunk
Botong Huang created YARN-7479:
-------------------------------

Summary: TestContainerManagerSecurity.testContainerManager[Simple] flaky in trunk
Key: YARN-7479
URL: https://issues.apache.org/jira/browse/YARN-7479
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang

Was waiting for container_1_0001_01_00 to get to state COMPLETE but was in state RUNNING after the timeout

java.lang.AssertionError: Was waiting for container_1_0001_01_00 to get to state COMPLETE but was in state RUNNING after the timeout
    at org.junit.Assert.fail(Assert.java:88)
    at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:431)
    at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:360)
    at org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:171)

Pasting some exception messages from the test run here:

org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN]
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Given NMToken for application : appattempt_1_0001_01 seems to have been generated illegally.
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491)
    at org.apache.hadoop.ipc.Client.call(Client.java:1437)
    at org.apache.hadoop.ipc.Client.call(Client.java:1347)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Given NMToken for application : appattempt_1_0001_01 is not valid for current node manager. expected : localhost:46649 found : InvalidHost:1234
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491)
    at org.apache.hadoop.ipc.Client.call(Client.java:1437)
    at org.apache.hadoop.ipc.Client.call(Client.java:1347)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
[jira] [Created] (YARN-7339) LocalityMulticastAMRMProxyPolicy should handle cancel request properly
Botong Huang created YARN-7339:
-------------------------------

Summary: LocalityMulticastAMRMProxyPolicy should handle cancel request properly
Key: YARN-7339
URL: https://issues.apache.org/jira/browse/YARN-7339
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor

Currently, inside AMRMProxy, LocalityMulticastAMRMProxyPolicy is not handling and splitting cancel requests from the AM properly:
1. For a node cancel request, we should not treat it as a localized resource request. Otherwise it can lead to an all-weights-zero issue when computing the localized resource weights.
2. For an ANY cancel, we should broadcast to all known sub-clusters, not just the ones associated with localized resources.
[jira] [Created] (YARN-7317) Fix overallocation resulted from ceiling in LocalityMulticastAMRMProxyPolicy
Botong Huang created YARN-7317:
-------------------------------

Summary: Fix overallocation resulted from ceiling in LocalityMulticastAMRMProxyPolicy
Key: YARN-7317
URL: https://issues.apache.org/jira/browse/YARN-7317
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor

When LocalityMulticastAMRMProxyPolicy splits up the ANY requests across different sub-clusters, we currently do Ceil(N * weight), leading to over-allocation overall. It is better to do Floor(N * weight) for each sub-cluster and then assign the residue randomly according to the weights, so that the total number of containers we ask for across all sub-clusters sums up to exactly N.
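A self-contained sketch of the floor-plus-residue split described above (names are hypothetical, and positive weights are assumed): each sub-cluster gets Floor(N * weight), and the leftover containers are handed out one at a time by sampling sub-clusters proportionally to their weights, so the counts always sum to N.

{code:java}
import java.util.Random;

public class WeightedSplitSketch {

  // Splits n containers across sub-clusters so that counts sum to exactly n.
  // Assumes all weights are positive.
  static int[] split(int n, double[] weights, Random rng) {
    double total = 0;
    for (double w : weights) {
      total += w;
    }
    int[] counts = new int[weights.length];
    int assigned = 0;
    for (int i = 0; i < weights.length; i++) {
      counts[i] = (int) Math.floor(n * weights[i] / total); // floor, not ceil
      assigned += counts[i];
    }
    // Hand out the residue randomly, proportionally to the weights.
    for (int r = assigned; r < n; r++) {
      double pick = rng.nextDouble() * total;
      int i = 0;
      while (i < weights.length - 1 && pick >= weights[i]) {
        pick -= weights[i];
        i++;
      }
      counts[i]++;
    }
    return counts;
  }
}
{code}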
[jira] [Created] (YARN-7281) Auto inject AllocationRequestId in AMRMClient.ContainerRequest when not supplied
Botong Huang created YARN-7281:
-------------------------------

Summary: Auto inject AllocationRequestId in AMRMClient.ContainerRequest when not supplied
Key: YARN-7281
URL: https://issues.apache.org/jira/browse/YARN-7281
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor

AllocationRequestId was introduced in YARN-4879 to simplify the resource allocation protocol in the AM-RM heartbeat. Many new features (e.g. Yarn Federation) are or will be built preferring that an AllocationRequestId be present. This JIRA modifies AMRMClient so that when the AM does not supply an AllocationRequestId, one is auto-generated in the constructor of AMRMClient.ContainerRequest.
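A minimal sketch of the auto-generation idea only; the simplified ContainerRequest below, and the use of 0 as the "not supplied" sentinel, are assumptions of this sketch rather than the real AMRMClient semantics:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

public class AutoAllocationRequestIdSketch {

  // Monotonically increasing id source shared by all requests in this client.
  static final AtomicLong NEXT_ID = new AtomicLong(1);

  static class ContainerRequest {
    final long allocationRequestId;

    // If the AM does not supply an id (sentinel 0 in this sketch), generate a
    // unique one so downstream components can always rely on it being present.
    ContainerRequest(long allocationRequestId) {
      this.allocationRequestId =
          allocationRequestId == 0 ? NEXT_ID.getAndIncrement() : allocationRequestId;
    }
  }
}
{code}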
[jira] [Created] (YARN-7203) Add container ExecutionType into ContainerReport
Botong Huang created YARN-7203:
-------------------------------

Summary: Add container ExecutionType into ContainerReport
Key: YARN-7203
URL: https://issues.apache.org/jira/browse/YARN-7203
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor
[jira] [Created] (YARN-7199) TestAMRMClientContainerRequest.testOpportunisticAndGuaranteedRequests is failing in trunk
Botong Huang created YARN-7199:
-------------------------------

Summary: TestAMRMClientContainerRequest.testOpportunisticAndGuaranteedRequests is failing in trunk
Key: YARN-7199
URL: https://issues.apache.org/jira/browse/YARN-7199
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang

java.lang.IllegalArgumentException: The profile name cannot be null
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
    at org.apache.hadoop.yarn.api.records.ProfileCapability.newInstance(ProfileCapability.java:68)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.addContainerRequest(AMRMClientImpl.java:512)
    at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientContainerRequest.testOpportunisticAndGuaranteedRequests(TestAMRMClientContainerRequest.java:59)
[jira] [Created] (YARN-7102) NM heartbeat stuck when responseId overflows MAX_INT
Botong Huang created YARN-7102:
-------------------------------

Summary: NM heartbeat stuck when responseId overflows MAX_INT
Key: YARN-7102
URL: https://issues.apache.org/jira/browse/YARN-7102
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang

ResponseId overflow problem in the NM-RM heartbeat. This is the same issue as the AM-RM heartbeat in YARN-6640; please refer to YARN-6640 for details.
[jira] [Created] (YARN-7074) Fix NM state store update comment
Botong Huang created YARN-7074:
-------------------------------

Summary: Fix NM state store update comment
Key: YARN-7074
URL: https://issues.apache.org/jira/browse/YARN-7074
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor
[jira] [Created] (YARN-6962) Federation interceptor should support full allocate request/response api
Botong Huang created YARN-6962:
-------------------------------

Summary: Federation interceptor should support full allocate request/response api
Key: YARN-6962
URL: https://issues.apache.org/jira/browse/YARN-6962
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor
[jira] [Created] (YARN-6955) Concurrent registerAM thread in Federation Interceptor
Botong Huang created YARN-6955:
-------------------------------

Summary: Concurrent registerAM thread in Federation Interceptor
Key: YARN-6955
URL: https://issues.apache.org/jira/browse/YARN-6955
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor

The timeout between the AM and AMRMProxy is shorter than the timeout plus failover time between FederationInterceptor (AMRMProxy) and the RM. When the first register thread in FI is blocked because of an RM failover, the AM can time out and resend the register call, leading to two outstanding register calls inside FI. Eventually, when the RM comes back up, one thread's register succeeds and the other gets an "application already registered" exception. FI should swallow the exception and return success back to the AM in both threads.
[jira] [Created] (YARN-6902) Update SQL server note in License.txt
Botong Huang created YARN-6902:
-------------------------------

Summary: Update SQL server note in License.txt
Key: YARN-6902
URL: https://issues.apache.org/jira/browse/YARN-6902
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor
[jira] [Created] (YARN-6730) Make sure NM state store is not null consistently
Botong Huang created YARN-6730:
-------------------------------

Summary: Make sure NM state store is not null consistently
Key: YARN-6730
URL: https://issues.apache.org/jira/browse/YARN-6730
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor

In the NM state store code for NM restart, there are a lot of places where we check whether stateStore != null; this is true in the existing codebase too. Ideally, the stateStore should never be null, because we have the NullStateStore implementation and we should not have to perform so many defensive checks.
[jira] [Created] (YARN-6704) Add Federation Interceptor restart when work preserving NM is enabled
Botong Huang created YARN-6704:
-------------------------------

Summary: Add Federation Interceptor restart when work preserving NM is enabled
Key: YARN-6704
URL: https://issues.apache.org/jira/browse/YARN-6704
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang

YARN-1336 added the ability to restart the NM without losing any running containers. {{AMRMProxy}} restart was added in YARN-6127. In a federated YARN environment, there is additional state in the {{FederationInterceptor}} to allow spanning across multiple sub-clusters, so we need to enhance {{FederationInterceptor}} to support work-preserving restart.
[jira] [Created] (YARN-6667) Handle containerId duplicate without throwing in Federation Interceptor
Botong Huang created YARN-6667:
-------------------------------

Summary: Handle containerId duplicate without throwing in Federation Interceptor
Key: YARN-6667
URL: https://issues.apache.org/jira/browse/YARN-6667
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor
[jira] [Created] (YARN-6666) Fix unit test in TestRouterClientRMService
Botong Huang created YARN-6666:
-------------------------------

Summary: Fix unit test in TestRouterClientRMService
Key: YARN-6666
URL: https://issues.apache.org/jira/browse/YARN-6666
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor

Running org.apache.hadoop.yarn.server.router.clientrm.TestRouterClientRMService
Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.041 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.router.clientrm.TestRouterClientRMService
testRouterClientRMServiceE2E(org.apache.hadoop.yarn.server.router.clientrm.TestRouterClientRMService) Time elapsed: 0.07 sec <<< ERROR!
java.lang.reflect.UndeclaredThrowableException: null
    at org.apache.hadoop.yarn.server.MockResourceManagerFacade.forceKillApplication(MockResourceManagerFacade.java:457)
    at org.apache.hadoop.yarn.server.router.clientrm.DefaultClientRequestInterceptor.forceKillApplication(DefaultClientRequestInterceptor.java:166)
    at org.apache.hadoop.yarn.server.router.clientrm.PassThroughClientRequestInterceptor.forceKillApplication(PassThroughClientRequestInterceptor.java:105)
    at org.apache.hadoop.yarn.server.router.clientrm.PassThroughClientRequestInterceptor.forceKillApplication(PassThroughClientRequestInterceptor.java:105)
    at org.apache.hadoop.yarn.server.router.clientrm.PassThroughClientRequestInterceptor.forceKillApplication(PassThroughClientRequestInterceptor.java:105)
    at org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.forceKillApplication(RouterClientRMService.java:217)
    at org.apache.hadoop.yarn.server.router.clientrm.BaseRouterClientRMTest$3.run(BaseRouterClientRMTest.java:218)
    at org.apache.hadoop.yarn.server.router.clientrm.BaseRouterClientRMTest$3.run(BaseRouterClientRMTest.java:212)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965)
    at org.apache.hadoop.yarn.server.router.clientrm.BaseRouterClientRMTest.forceKillApplication(BaseRouterClientRMTest.java:212)
    at org.apache.hadoop.yarn.server.router.clientrm.TestRouterClientRMService.testRouterClientRMServiceE2E(TestRouterClientRMService.java:111)
[jira] [Created] (YARN-6648) Add FederationStateStore interfaces for Global Policy Generator
Botong Huang created YARN-6648:
-------------------------------

Summary: Add FederationStateStore interfaces for Global Policy Generator
Key: YARN-6648
URL: https://issues.apache.org/jira/browse/YARN-6648
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor
[jira] [Created] (YARN-6640) AM heartbeat stuck when responseId overflows MAX_INT
Botong Huang created YARN-6640:
-------------------------------

Summary: AM heartbeat stuck when responseId overflows MAX_INT
Key: YARN-6640
URL: https://issues.apache.org/jira/browse/YARN-6640
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor

The current code in {{ApplicationMasterService}}:

{code:java}
if ((request.getResponseId() + 1) == lastResponse.getResponseId()) {
  /* old heartbeat */
  return lastResponse;
} else if (request.getResponseId() + 1 < lastResponse.getResponseId()) {
  throw ...
}
process the heartbeat...
{code}

When a heartbeat comes in, in the usual case we expect request.getResponseId() == lastResponse.getResponseId(). The "if" handles the duplicate heartbeat that is one step old, the "else if" throws and complains for heartbeats two or more steps old, and otherwise we accept the new heartbeat and process it.

So the bug is: when lastResponse.getResponseId() == MAX_INT, the newest heartbeat comes in with responseId == MAX_INT. However, responseId + 1 overflows to MIN_INT, so we fall into the "else if" case and the RM throws. Then we are stuck here...
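One overflow-safe way to do the comparison, shown for illustration only (not necessarily the exact fix committed for this JIRA): compute the signed difference, which wraps correctly across MAX_INT thanks to two's-complement arithmetic.

{code:java}
public class ResponseIdCompareSketch {

  // Duplicate heartbeat: the request is exactly one step behind the last response.
  // At the wrap point, MIN_INT - MAX_INT overflows back to 1, so this still works.
  static boolean isDuplicateOfLast(int requestId, int lastResponseId) {
    return lastResponseId - requestId == 1;
  }

  // Too-old heartbeat: the request is two or more steps behind the last response.
  static boolean isTooOld(int requestId, int lastResponseId) {
    return lastResponseId - requestId > 1;
  }
}
{code}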
[jira] [Created] (YARN-6565) Fix memory leak and finish app trigger in AMRMProxy
Botong Huang created YARN-6565:
-------------------------------

Summary: Fix memory leak and finish app trigger in AMRMProxy
Key: YARN-6565
URL: https://issues.apache.org/jira/browse/YARN-6565
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor

Two issues in AMRMProxy:
1. When an application finishes, AMRMTokenSecretManager is not updated to remove the related data, leading to a memory leak.
2. When we kill an application, we should remove the pipeline after the AM container is killed. The FINISH_APPLICATION event is sent while the AM container is still being killed; after we remove the pipeline, we might still get heartbeats from the AM, triggering exception messages. Instead, we should wait for the APPLICATION_RESOURCES_CLEANEDUP event, which is sent after the AM container is killed.
[jira] [Created] (YARN-6511) Federation Intercepting and propagating AM-RM communications (part two: secondary subclusters added)
Botong Huang created YARN-6511:
-------------------------------

Summary: Federation Intercepting and propagating AM-RM communications (part two: secondary subclusters added)
Key: YARN-6511
URL: https://issues.apache.org/jira/browse/YARN-6511
Project: Hadoop YARN
Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang
[jira] [Created] (YARN-6404) Avoid misleading NoClassDefFoundError caused by ExceptionInInitializerError in FederationStateStoreFacade
Botong Huang created YARN-6404:
-------------------------------

Summary: Avoid misleading NoClassDefFoundError caused by ExceptionInInitializerError in FederationStateStoreFacade
Key: YARN-6404
URL: https://issues.apache.org/jira/browse/YARN-6404
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor

Currently the singleton is created in a static initializer:

{code:java}
private static final FederationStateStoreFacade FACADE = new FederationStateStoreFacade();
{code}

If the constructor fails and throws, we will see the full exception stack only the first time, wrapped in an {{ExceptionInInitializerError}}. After that, all later uses are rejected by the JVM with a misleading {{NoClassDefFoundError}}.

Here's more explanation from Stack Overflow: http://stackoverflow.com/questions/34413/why-am-i-getting-a-noclassdeffounderror-in-java
The earlier failure could be a ClassNotFoundException or an ExceptionInInitializerError (indicating a failure in the static initialization block) or any number of other problems. The point is, a NoClassDefFoundError is not necessarily a classpath problem.
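An illustrative alternative (not necessarily what the patch does): create the instance lazily and remember the original failure, so every caller sees the real root cause instead of a misleading NoClassDefFoundError.

{code:java}
public class FacadeSingletonSketch {

  private static FacadeSingletonSketch instance;
  private static RuntimeException initFailure;

  private FacadeSingletonSketch() {
    // the real constructor would load configuration, create the state store client, ...
  }

  // Lazy, synchronized initialization: the first failure is cached and rethrown
  // as the cause on every subsequent call, preserving the original stack trace.
  static synchronized FacadeSingletonSketch getInstance() {
    if (instance == null && initFailure == null) {
      try {
        instance = new FacadeSingletonSketch();
      } catch (RuntimeException e) {
        initFailure = e;
      }
    }
    if (initFailure != null) {
      throw new IllegalStateException("Facade initialization failed", initFailure);
    }
    return instance;
  }
}
{code}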
[jira] [Created] (YARN-6370) Properly handle rack requests for non-active subclusters in LocalityMulticastAMRMProxyPolicy
Botong Huang created YARN-6370:
-------------------------------

Summary: Properly handle rack requests for non-active subclusters in LocalityMulticastAMRMProxyPolicy
Key: YARN-6370
URL: https://issues.apache.org/jira/browse/YARN-6370
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor
[jira] [Created] (YARN-6282) Recreate interceptor chain when different attempt in the same node in AMRMProxy
Botong Huang created YARN-6282:
-------------------------------

Summary: Recreate interceptor chain when different attempt in the same node in AMRMProxy
Key: YARN-6282
URL: https://issues.apache.org/jira/browse/YARN-6282
Project: Hadoop YARN
Issue Type: Bug
Reporter: Botong Huang
Assignee: Botong Huang
Priority: Minor

In AMRMProxy, an interceptor chain is created per application attempt, but the pipeline mapping uses the application id as the key. So when a different attempt comes up on the same node, we need to recreate the interceptor chain for it, instead of using the existing one.
[jira] [Created] (YARN-6281) Cleanup when AMRMProxy fails to initialize a new interceptor chain
Botong Huang created YARN-6281: -- Summary: Cleanup when AMRMProxy fails to initialize a new interceptor chain Key: YARN-6281 URL: https://issues.apache.org/jira/browse/YARN-6281 Project: Hadoop YARN Issue Type: Bug Reporter: Botong Huang Assignee: Botong Huang Priority: Minor When an app starts, AMRMProxy.initializePipeline creates a new interceptor chain, adds it to its pipeline mapping, initializes the chain, and returns. The problem is that when the chain initialization throws (e.g. because of a configuration error, an interceptor class not being found, etc.), the chain is not removed from AMRMProxy's pipeline mapping. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
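A minimal sketch of the cleanup YARN-6281 proposes, with hypothetical names rather than the actual AMRMProxy code: if chain initialization throws, the registration is rolled back before the exception propagates.
{code:java}
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PipelineManager {
  static class InterceptorChain {
    void init(String conf) throws IOException {
      if (conf == null) {
        throw new IOException("bad config or interceptor class not found");
      }
    }
  }

  private final Map<String, InterceptorChain> pipelines = new ConcurrentHashMap<>();

  void initializePipeline(String appId, String conf) throws IOException {
    InterceptorChain chain = new InterceptorChain();
    pipelines.put(appId, chain);
    try {
      chain.init(conf); // may throw on configuration errors
    } catch (IOException | RuntimeException e) {
      pipelines.remove(appId); // don't leave a broken chain in the mapping
      throw e;
    }
  }
}
{code}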
[jira] [Created] (YARN-6247) Add SubClusterResolver into FederationStateStoreFacade
Botong Huang created YARN-6247: -- Summary: Add SubClusterResolver into FederationStateStoreFacade Key: YARN-6247 URL: https://issues.apache.org/jira/browse/YARN-6247 Project: Hadoop YARN Issue Type: Task Reporter: Botong Huang Assignee: Botong Huang Priority: Minor Add SubClusterResolver into FederationStateStoreFacade. Since the resolver might involve some overhead (reading a file in the background, potentially periodically), it is better to keep it inside the FederationStateStoreFacade singleton, so that only one instance is created. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
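A simplified sketch of the design choice in YARN-6247, with hypothetical class shapes rather than the real facade: the resolver is constructed inside the facade singleton, so its file-loading overhead is paid once per process and every caller shares the same instance.
{code:java}
class StateStoreFacade {
  private static final StateStoreFacade INSTANCE = new StateStoreFacade();

  private final SubClusterResolver resolver;

  private StateStoreFacade() {
    this.resolver = new SubClusterResolver();
    this.resolver.load(); // potentially expensive: reads a node-to-subcluster file
  }

  static StateStoreFacade getInstance() { return INSTANCE; }

  SubClusterResolver getSubClusterResolver() { return resolver; }

  static class SubClusterResolver {
    void load() { /* read the mapping file, possibly refreshed periodically */ }
  }
}
{code}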
[jira] [Created] (YARN-6213) Failure handling and retry for performFailover in RetryInvocationHandler
Botong Huang created YARN-6213: -- Summary: Failure handling and retry for performFailover in RetryInvocationHandler Key: YARN-6213 URL: https://issues.apache.org/jira/browse/YARN-6213 Project: Hadoop YARN Issue Type: Bug Reporter: Botong Huang Assignee: Botong Huang Priority: Minor In {{RetryInvocationHandler}}, when the method invocation fails, we rely on {{FailoverProxyProvider}} to perform failover and get a new proxy, so that we can retry the method invocation. However, performFailover and getting the new proxy can themselves fail (throw an exception or return a null proxy). This is not handled properly today; we end up throwing the exception out of the while loop. Instead, we should catch the exception (or check for a null proxy) and retry performFailover until the failover count reaches its maximum. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
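A rough sketch of the retry behavior YARN-6213 proposes, with hypothetical types standing in for RetryInvocationHandler and FailoverProxyProvider: a failure inside performFailover, or a null proxy, is caught and retried up to the maximum failover count instead of escaping the invocation loop.
{code:java}
import java.io.IOException;

interface ProxyProvider<T> {
  void performFailover(T current);
  T getProxy();
}

class FailoverRetrier {
  static <T> T failoverWithRetry(ProxyProvider<T> provider, T current, int maxFailovers)
      throws IOException {
    for (int attempt = 0; attempt < maxFailovers; attempt++) {
      try {
        provider.performFailover(current);
        T next = provider.getProxy();
        if (next != null) {
          return next; // usable proxy obtained, resume the method invocation
        }
        // null proxy: treat it like a failed failover and try again
      } catch (RuntimeException e) {
        // failover itself failed; retry until the failover budget is exhausted
      }
    }
    throw new IOException("failover did not yield a proxy after " + maxFailovers + " attempts");
  }
}
{code}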
[jira] [Created] (YARN-6203) Occasional test failure in TestWeightedRandomRouterPolicy
Botong Huang created YARN-6203: -- Summary: Occasional test failure in TestWeightedRandomRouterPolicy Key: YARN-6203 URL: https://issues.apache.org/jira/browse/YARN-6203 Project: Hadoop YARN Issue Type: Bug Reporter: Botong Huang Assignee: Carlo Curino Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-6190) Bug fixes in federation policies
Botong Huang created YARN-6190: -- Summary: Bug fixes in federation policies Key: YARN-6190 URL: https://issues.apache.org/jira/browse/YARN-6190 Project: Hadoop YARN Issue Type: Bug Components: federation Reporter: Botong Huang Assignee: Botong Huang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-6093) Invalid AMRM token exception when RM renew AMRMtoken and FederationRMFailoverProxyProvider failover
Botong Huang created YARN-6093: -- Summary: Invalid AMRM token exception when RM renew AMRMtoken and FederationRMFailoverProxyProvider failover Key: YARN-6093 URL: https://issues.apache.org/jira/browse/YARN-6093 Project: Hadoop YARN Issue Type: Bug Components: federation Reporter: Botong Huang Assignee: Botong Huang Priority: Minor Fix For: YARN-2915 AMRMProxy uses an expired AMRMToken to talk to the RM, leading to the "Invalid AMRMToken" exception. The bug is triggered when both conditions are met: 1. The RM rolls the master key and renews the AMRMToken for a running AM. 2. The existing RPC connection between AMRMProxy and the RM drops, and we attempt to reconnect via failover in FederationRMFailoverProxyProvider. Here's what happens: In DefaultRequestInterceptor.init(), we create a proxy ugi, load it with the initial AMRMToken issued by the RM, and use it to initiate rmClient. Then in FederationRMFailoverProxyProvider.init(), a full copy of the ugi's tokens is saved locally, an actual RM proxy is created, and the RPC connection is set up. Later, when the RM rolls the master key and issues a new AMRMToken, DefaultRequestInterceptor.updateAMRMToken() updates it into the proxy ugi. However, the new token is never used until the existing RPC connection between AMRMProxy and the RM drops for other reasons (say the master RM crashes). At this point, since the service name of the new AMRMToken is not yet set correctly in DefaultRequestInterceptor.updateAMRMToken(), RPC finds no valid AMRMToken when trying to set up a new connection. We first hit a "Client cannot authenticate via:[TOKEN]" exception. This is expected. Next, FederationRMFailoverProxyProvider fails over, we reset the service token via ClientRMProxy.getRMAddress() and reconnect. Supposedly this would have worked. However, since DefaultRequestInterceptor does not use the proxy user for later calls to rmClient, when performing failover in FederationRMFailoverProxyProvider we are not running as the proxy user. Currently the code tries to solve the problem by reloading the current ugi with all the tokens saved locally in originalTokens in method addOriginalTokens(). The problem is that the original AMRMToken loaded is no longer accepted by the RM, and thus we keep hitting the "Invalid AMRMToken" exception until the AM fails. The correct fix is that rather than saving the original tokens from the proxy ugi, we save the original ugi itself. Every time we perform failover and create a new RM proxy, we use the original ugi, which is always loaded with the up-to-date AMRMToken. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
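A rough sketch of the fix YARN-6093 proposes (simplified; not the actual patch, helper names hypothetical): keep a reference to the original ugi from init time and re-create the RM proxy under it with doAs, so the new connection always sees whatever AMRMToken has most recently been loaded into that ugi.
{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

class FailoverHelper<T> {
  private final UserGroupInformation originalUgi;

  FailoverHelper(UserGroupInformation originalUgi) {
    // Saved once at init time, instead of copying its tokens into a local list.
    this.originalUgi = originalUgi;
  }

  T createProxy(PrivilegedExceptionAction<T> createAction) throws Exception {
    // Any AMRMToken later added to originalUgi (e.g. after an RM master key
    // roll) is visible here, so the new RPC connection authenticates with the
    // up-to-date token rather than a stale locally saved copy.
    return originalUgi.doAs(createAction);
  }
}
{code}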
[jira] [Created] (YARN-6016) Bugs in AMRMProxy handling AMRMToken and local AMRMToken
Botong Huang created YARN-6016: -- Summary: Bugs in AMRMProxy handling AMRMToken and local AMRMToken Key: YARN-6016 URL: https://issues.apache.org/jira/browse/YARN-6016 Project: Hadoop YARN Issue Type: Bug Components: federation Reporter: Botong Huang Assignee: Botong Huang Priority: Minor Two AMRMProxy bugs: First, the AMRMToken from the RM should not be propagated to the AM, since AMRMProxy creates a local AMRMToken for it. Second, the AMRMProxy context currently parses the localAMRMTokenKeyId from amrmToken, but it should be parsed from localAmrmToken. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
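A small sketch of the second fix in YARN-6016 (simplified; the helper class is hypothetical, not the real AMRMProxy context code): the key id must be decoded from the local AMRMToken that AMRMProxy issued, not from the RM-issued amrmToken.
{code:java}
import java.io.IOException;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.security.AMRMTokenIdentifier;

class LocalTokenKeyId {
  static int localKeyId(Token<AMRMTokenIdentifier> localAmrmToken) throws IOException {
    // decodeIdentifier() deserializes the identifier carried in the token;
    // passing localAmrmToken here (not the RM's amrmToken) yields the key id
    // of the locally issued token, which is what the proxy context needs.
    return localAmrmToken.decodeIdentifier().getKeyId();
  }
}
{code}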
[jira] [Created] (YARN-5836) NMToken passwd not checked in ContainerManagerImpl, so that malicious AM can fake the Token and kill containers of other apps at will
Botong Huang created YARN-5836: -- Summary: NMToken passwd not checked in ContainerManagerImpl, so that malicious AM can fake the Token and kill containers of other apps at will Key: YARN-5836 URL: https://issues.apache.org/jira/browse/YARN-5836 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Botong Huang Assignee: Botong Huang Priority: Minor When the AM calls the NM via stopContainers in ContainerManagementProtocol, the NMToken (generated by the RM) is passed along via the user ugi. However, ContainerManagerImpl currently does not validate this token correctly, specifically in authorizeGetAndStopContainerRequest. It blindly trusts the content of the NMTokenIdentifier without verifying the password (the RM-generated signature) in the NMToken, so a malicious AM can fake the content of the NMTokenIdentifier and pass it to NMs. Moreover, even for the plain-text checks, when the appId doesn't match, all it does is log a warning and continue to kill the container. For startContainers the NMToken is not checked correctly in authorizeUser either; however, the ContainerToken is verified properly by regenerating and comparing the password in verifyAndGetContainerTokenIdentifier, so a malicious AM cannot launch containers at will. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
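An illustrative sketch only, not the actual ContainerManagerImpl code: the essence of the check YARN-5836 says is missing is recomputing the password for the received NMTokenIdentifier via the NM's secret manager and comparing it against the password presented on the connection, analogous to how ContainerTokens are verified.
{code:java}
import java.security.MessageDigest;
import org.apache.hadoop.security.token.SecretManager;
import org.apache.hadoop.security.token.SecretManager.InvalidToken;
import org.apache.hadoop.yarn.security.NMTokenIdentifier;

class NMTokenCheck {
  static void verify(SecretManager<NMTokenIdentifier> secretManager,
      NMTokenIdentifier identifier, byte[] presentedPassword) throws InvalidToken {
    // Recompute the expected password (RM-generated signature) for this
    // identifier from the NM's current secret keys.
    byte[] expected = secretManager.retrievePassword(identifier);
    // Constant-time comparison; reject the request if the signatures differ.
    if (!MessageDigest.isEqual(expected, presentedPassword)) {
      throw new InvalidToken("NMToken password does not match for "
          + identifier.getApplicationAttemptId());
    }
  }
}
{code}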