[ https://issues.apache.org/jira/browse/HDFS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079584#comment-13079584 ]

Aaron T. Myers commented on HDFS-1973:
--------------------------------------

h3. Client Failover overview

On failover between active and standby NNs, it's necessary for clients to be 
redirected to the new active NN. The goal of HDFS-1623 is to provide a 
framework for HDFS HA which can in fact support multiple underlying mechanisms. 
As such, the client failover approach should support multiple options.

h3. Cases to support

# Proxy-based client failover. Clients always communicate with an in-band proxy 
service which forwards all RPCs on to the correct NN. On failure, a process 
causes this proxy to begin sending requests to the now-active NN.
# Virtual IP-based client failover. Clients always connect to a hostname which 
resolves to a particular IP address. On failure of the active NN, a process is 
initiated so that packets destined for that IP address are delivered to the NIC 
of the now-active NN instead. (From a client's perspective, this case is 
equivalent to case #1.)
# ZooKeeper-based client failover. The URI to contact the active NN is stored 
in ZooKeeper or some other highly-available service. Clients determine which NN 
to talk to by looking up this URI in ZK. On failure, some process causes the 
address stored in ZK to be changed to point to the now-active NN.
# Configuration-based client failover. Clients are configured with a set of NN 
addresses to try until an operation succeeds. This configuration might exist in 
client-side configuration files, or perhaps in DNS via an SRV record that lists 
the NNs with different priorities; a hypothetical example follows this list.
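
To illustrate the DNS variant of case #4, the SRV records might look like the 
following (purely hypothetical: the {{_hdfs}} service label and host names are 
invented, and 8020 is just the customary NN RPC port):

{code}
; Two NNs listed with different priorities; clients try the record with the
; lower priority value first and fall back to the other on failure.
_hdfs._tcp.cluster1.foo.com. 300 IN SRV 10 0 8020 nn1.foo.com.
_hdfs._tcp.cluster1.foo.com. 300 IN SRV 20 0 8020 nn2.foo.com.
{code}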

h3. Assumptions

This proposal assumes that NN fencing works, and that after a failover any 
standby NN is either unreachable or will throw a {{StandbyException}} on any 
RPC from a client. That is, a client cannot receive incorrect results if it 
happens to contact the wrong NN. This proposal also presumes that there is no 
direct coordination required between any central failover coordinator and 
clients, i.e. there is an intermediate name resolution system of some sort (ZK, 
DNS, local configuration, etc.).

h3. Proposal

The commit of HADOOP-7380 already introduced a facility whereby an IPC 
{{RetryInvocationHandler}} can utilize a {{FailoverProxyProvider}} 
implementation to perform the appropriate client-side action in the event of 
failover. At the moment, the only implementation of a {{FailoverProxyProvider}} 
is the {{DefaultFailoverProxyProvider}}, which does nothing in the case of 
failover. HADOOP-7380 also added an {{@Idempotent}} annotation which can be 
used to identify which methods can be safely retried during a failover event.
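
For reference, the contract those pieces establish looks roughly like the 
following (a simplified sketch; see the actual {{org.apache.hadoop.io.retry}} 
classes for the real signatures):

{code}
import java.io.Closeable;

// Simplified sketch of the HADOOP-7380 contract (method names approximate; see
// the actual org.apache.hadoop.io.retry classes for the real signatures).
public interface FailoverProxyProvider extends Closeable {
  /** The protocol interface that proxies returned by this provider implement. */
  Class<?> getInterface();

  /** The proxy the client should currently be sending RPCs to. */
  Object getProxy();

  /** Invoked by the retry logic when it decides the client must fail over. */
  void performFailover(Object currentProxy);
}
{code}

On a failed invocation, the {{RetryInvocationHandler}} consults its retry 
policy; when the policy calls for failover it invokes {{performFailover()}} 
and, for methods marked {{@Idempotent}}, retries the call against whatever 
proxy the next {{getProxy()}} returns.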

What remains, then, is:

# To implement {{FailoverProxyProviders}} which can support the cases outlined 
above (and perhaps others).
# To provide a mechanism to select which {{FailoverProxyProvider}} 
implementation to use for a given HDFS URI.
# To annotate the appropriate HDFS {{ClientProtocol}} interface methods with 
the {{@Idempotent}} tag.
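
Item 3 amounts to something like the following sketch (signatures abbreviated; 
which methods are genuinely safe to retry still needs a careful audit):

{code}
import java.io.IOException;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlocks;
import org.apache.hadoop.io.retry.Idempotent;

// Sketch of ClientProtocol with @Idempotent applied (most methods omitted,
// signatures abbreviated).
public interface ClientProtocol {

  @Idempotent  // read-only, so safe to re-issue against the new active NN
  LocatedBlocks getBlockLocations(String src, long offset, long length)
      throws IOException;

  // No annotation: blindly re-issuing a mutation such as append() after a
  // failover is not obviously safe, so the retry logic must not retry it.
  LocatedBlock append(String src, String clientName) throws IOException;
}
{code}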

h4. {{FailoverProxyProvider}} implementations

Cases 1 and 2 above can be handled by a single {{FailoverProxyProvider}} 
implementation which simply attempts to reconnect to the same hostname/IP 
address on failover. Cases 3 and 4 can be implemented as distinct custom 
{{FailoverProxyProviders}}.
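
A minimal sketch of such a provider for cases 1 and 2, assuming the interface 
shape above (the class name is invented and the RPC plumbing is elided):

{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.retry.FailoverProxyProvider;

// Hypothetical provider for cases 1 and 2 (invented class name): "failing over"
// just means dropping the cached proxy and reconnecting to the same address,
// because the proxy/VIP layer already routes that address to the new active NN.
public class ReconnectingFailoverProxyProvider implements FailoverProxyProvider {
  private final Configuration conf;
  private final InetSocketAddress addr;   // the proxy's or VIP's address
  private final Class<?> iface;
  private Object currentProxy;

  public ReconnectingFailoverProxyProvider(Configuration conf,
      InetSocketAddress addr, Class<?> iface) {
    this.conf = conf;
    this.addr = addr;
    this.iface = iface;
  }

  @Override
  public Class<?> getInterface() {
    return iface;
  }

  @Override
  public synchronized Object getProxy() {
    if (currentProxy == null) {
      currentProxy = createRpcProxy(addr, iface, conf);   // RPC setup elided
    }
    return currentProxy;
  }

  @Override
  public synchronized void performFailover(Object current) {
    // Drop the old connection; the next getProxy() reconnects to the same
    // hostname/IP, which now reaches the new active NN.
    currentProxy = null;
  }

  @Override
  public void close() throws IOException {
    // Stop the cached RPC proxy here; elided in this sketch.
  }

  // Stand-in for the real RPC proxy creation plumbing.
  private Object createRpcProxy(InetSocketAddress a, Class<?> p, Configuration c) {
    throw new UnsupportedOperationException("sketch only");
  }
}
{code}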

h4. A mechanism to select the appropriate {{FailoverProxyProvider}} implementation

I propose we add a mechanism to configure a mapping from URI authority -> 
{{FailoverProxyProvider}} implementation. Absolute URIs which previously 
specified the NN host name will instead contain a logical cluster name (which 
might be chosen to be identical to one of the NN's host names) which will be 
used by the chosen {{FailoverProxyProvider}} to determine the appropriate host 
to connect to. Introducing the concept of a cluster name will be a useful 
abstraction in general; if, for example, someone develops a fully-distributed 
NN in the future, the cluster name will still apply.

On instantiation of a {{DFSClient}} (or other user of an HDFS URI, e.g. HFTP), 
the mapping would be checked to see if there's an entry for the given URI 
authority. If there is not, then a normal RPC client with a socket connected to 
the given authority will be created, as is done today, using a 
{{DefaultFailoverProxyProvider}}. If there is an entry, then the authority will be 
treated as a logical cluster name, a {{FailoverProxyProvider}} of the correct 
type will be instantiated (via a factory class), and an RPC client will be 
created which utilizes this {{FailoverProxyProvider}}. The various 
{{FailoverProxyProvider}} implementations are responsible for their own 
configuration.
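
Concretely, the selection step during {{DFSClient}} construction might look 
roughly like this (a fragment only, assuming {{uri}} and {{conf}} are in hand; 
the config key pattern matches the straw man below, and {{createDefaultProxy()}}, 
{{lookupProvider()}}, and {{wrapWithRetries()}} are invented stand-ins for the 
real RPC and retry-proxy plumbing):

{code}
// Hypothetical sketch of choosing a FailoverProxyProvider for a URI authority.
String authority = uri.getAuthority();                // e.g. "cluster1.foo.com"
String providerClassName =
    conf.get("dfs.ha.client.failover-method." + authority);

ClientProtocol namenode;
if (providerClassName == null) {
  // No mapping: the authority is a real NN hostname; behave exactly as today.
  namenode = createDefaultProxy(authority, conf);
} else {
  // Mapping present: the authority is a logical cluster name; the configured
  // provider decides which physical NN to contact and handles failover.
  FailoverProxyProvider provider =
      lookupProvider(providerClassName, authority, conf);
  namenode = wrapWithRetries(ClientProtocol.class, provider);
}
{code}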

As a straw man example, consider the following configuration:

{code}
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://cluster1.foo.com</value>
  </property>

  <property>
    <name>dfs.ha.client.failover-method.cluster1.foo.com</name>
    <value>org.apache.hadoop.ha.ZookeeperFailoverProxyProvider</value>
  </property>
</configuration>
{code}

This would cause URIs which begin with {{hdfs://cluster1.foo.com}} to use the 
{{ZookeeperFailoverProxyProvider}}. Slash-relative URIs would also default to 
using this. An absolute URI which, for example, referenced an NN in another, 
non-HA-enabled cluster (e.g. {{nn.cluster2.foo.com}}) would default to using 
the {{DefaultFailoverProxyProvider}}.
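
Application code would be unchanged; callers keep using the logical URI through 
the ordinary {{FileSystem}} API, e.g.:

{code}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogicalUriExample {
  public static void main(String[] args) throws Exception {
    // The logical cluster name is resolved through whatever
    // FailoverProxyProvider the configuration maps it to.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://cluster1.foo.com/"), conf);
    for (FileStatus stat : fs.listStatus(new Path("/user"))) {
      System.out.println(stat.getPath());
    }
  }
}
{code}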

h3. Questions

# I believe this scheme will work transparently with viewfs. Instead of 
configuring the mount table to communicate with a particular NN for a given 
portion of the name space, one would configure viewfs to use the logical 
cluster name, which, when paired with the URI authority -> 
{{FailoverProxyProvider}} mapping, will cause the appropriate 
{{FailoverProxyProvider}} to be selected and the appropriate NN to be located. 
I'm no viewfs expert and so would love to hear any thoughts on this (a 
hypothetical mount-table entry is sketched after this list).
# Are there any desirable client failover mechanisms I'm forgetting about?
# I'm sure there are places (which I haven't fully identified yet) where host 
names are cached client-side. Those may need to be changed as well.
# Federation already introduced a concept of a "cluster ID", but in Federation 
this is not intended to be user-facing. Should we combine this notion with the 
"logical cluster name" I described above?

> HA: HDFS clients must handle namenode failover and switch over to the new 
> active namenode.
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1973
>                 URL: https://issues.apache.org/jira/browse/HDFS-1973
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Suresh Srinivas
>            Assignee: Aaron T. Myers
>
> During failover, a client must detect the current active namenode failure and 
> switch over to the new active namenode. The switch over might make use of IP 
> failover or something more elaborate such as ZooKeeper to discover the new 
> active.
