[ 
https://issues.apache.org/jira/browse/YARN-3673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426990#comment-15426990
 ] 

Subru Krishnan commented on YARN-3673:
--------------------------------------

Thanks [~jianhe] for reviewing the patch and your detailed feedback.

 bq. I think for this patch the FEDERATION_FAILOVER_ENABLED and 
FEDERATION_ENABLED flag are not needed

There is a nuance here that unlike the other failover providers which are only 
initialized when RM HA is enabled,{{FederationRMFailoverProxyProvider}} is used 
with both standalone RM and RM HA as the main purpose is to bypass _yarn-site_ 
and obtain the RM connection address from {{FederationStateStore}}. 
The FEDERATION_ENABLED flag indicates that irrespective of whether RM has HA 
enabled or not, we should configure the {{FederationRMFailoverProxyProvider}}.
When we have both Federation & RM HA, the FEDERATION_FAILOVER_ENABLED indicates 
that we should configure failover {{RetryPolicies}} as otherwise normal ones 
(like _retryForeverWithFixedSleep_) would be initialized. This is needed 
because we override HA enabled in conf as otherwise that triggers the code path 
of looking up RMs from xml using rmId index.

bq. Why do we need to store the tokens and then re-add the tokens?

This is to cover a corner case scenario which we hit during testing. IIRC, it 
was required for the UnmanagedAMs (used for transparently spanning a job across 
clusters) to reconnect to RM post failover. Without this, I was getting 
authentication error with missing token from the RM primary RM.
Do you think I should leave it or add it when we reach the e2e testing phase?

bq. It appears to me the FederationStateStoreFacade is not needed for this patch

Since we are using in-memory store for the test case, you are right. The issue 
is with the SQL based store (YARN-3663) which is what we want to deploy in 
production. We did some simulated scale testing with 10s of thousands of nodes 
and there was performance degradation without connection pooling. For pooling, 
you need Singleton as otherwise every proxy instance would create its own 
private pool and that's what {{FederationStateStoreFacade}} provides.

bq. Test: IIUC, I don't actually see FederationRMFailoverProxyProvider 
configured in the test, is it tested?

Yes! In the test, I create a proxy using {{FederationProxyProviderUtil}}:
{code}
ApplicationClientProtocol client = FederationProxyProviderUtil
                .createRMProxy(conf, ApplicationClientProtocol.class, 
subClusterId,
                    UserGroupInformation.getCurrentUser());
{code}

And {{FederationProxyProviderUtil}} always sets 
{{FederationRMFailoverProxyProvider}} as the failover provider:
{code}
// updating the conf with the refreshed RM addresses as proxy creations
          // are based out of conf
          private static void updateConf(Configuration conf,
              SubClusterId subClusterId) {
            conf.set(YarnConfiguration.FEDERATION_SUBCLUSTER_ID, 
subClusterId.getId());
            conf.setBoolean(YarnConfiguration.FEDERATION_ENABLED, true);
            conf.setClass(YarnConfiguration.CLIENT_FAILOVER_PROXY_PROVIDER,
                FederationRMFailoverProxyProvider.class, 
RMFailoverProxyProvider.class);
            // we will failover using FederationStateStore so set that and skip
            // traditional HA
            if (HAUtil.isHAEnabled(conf)) {
              conf.setBoolean(YarnConfiguration.FEDERATION_FAILOVER_ENABLED, 
true);
              conf.setBoolean(YarnConfiguration.RM_HA_ENABLED, false);
            }
{code}

This also is how FEDERATION_FAILOVER_ENABLED and FEDERATION_ENABLED flags are 
used.

And I had to do multiple rounds of debugging and step through 
{{FederationRMFailoverProxyProvider}} so many times while writing the tests, so 
most definitely it's being exercised :).


 



> Create a FailoverProxy for Federation services
> ----------------------------------------------
>
>                 Key: YARN-3673
>                 URL: https://issues.apache.org/jira/browse/YARN-3673
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, resourcemanager
>            Reporter: Subru Krishnan
>            Assignee: Subru Krishnan
>         Attachments: YARN-3673-YARN-2915-v1.patch
>
>
> This JIRA proposes creating a failover proxy for Federation based on the 
> cluster membership information in the StateStore that can be used by both 
> Router & AMRMProxy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to