I'm having an issue in client code where there are multiple clusters with HA 
namenodes involved. Example setup using Hadoop 2.3.0:

Cluster A with the following properties defined in core-site.xml, hdfs-site.xml, 
etc.:

dfs.nameservices=clusterA
dfs.ha.namenodes.clusterA=nn1,nn2
dfs.namenode.rpc-address.clusterA.nn1=
dfs.namenode.http-address.clusterA.nn1=
dfs.namenode.rpc-address.clusterA.nn2=
dfs.namenode.http-address.clusterA.nn2=
dfs.client.failover.proxy.provider.clusterA=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

Cluster B has similar properties defined in its core-site.xml, hdfs-site.xml, 
etc.
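
For concreteness, cluster B's equivalent properties would look roughly like 
this (the namenode IDs nn1/nn2 mirror cluster A and are an assumption; the 
actual addresses are omitted as above):

dfs.nameservices=clusterB
dfs.ha.namenodes.clusterB=nn1,nn2
dfs.namenode.rpc-address.clusterB.nn1=
dfs.namenode.http-address.clusterB.nn1=
dfs.namenode.rpc-address.clusterB.nn2=
dfs.namenode.http-address.clusterB.nn2=
dfs.client.failover.proxy.provider.clusterB=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider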

Now, I want to be able to distcp from clusterA to clusterB. Regardless of which 
cluster I execute this from, neither configuration has all of the information 
the client needs. Looking at DFSClient and DataNode:

  - if I put both clusterA and clusterB into dfs.nameservices (see the merged 
configuration sketched after this list), then the datanodes will try to 
federate the blocks from both nameservices.
  - if I don't put both clusterA and clusterB into dfs.nameservices, then the 
client won't know how to resolve the namenodes for both nameservices named in 
the distcp command.
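
To make the first case concrete, a merged configuration carrying both 
nameservices would look roughly like this (http-address entries omitted for 
brevity; the clusterB namenode IDs are assumptions as above):

dfs.nameservices=clusterA,clusterB
dfs.ha.namenodes.clusterA=nn1,nn2
dfs.ha.namenodes.clusterB=nn1,nn2
dfs.namenode.rpc-address.clusterA.nn1=
dfs.namenode.rpc-address.clusterA.nn2=
dfs.namenode.rpc-address.clusterB.nn1=
dfs.namenode.rpc-address.clusterB.nn2=
dfs.client.failover.proxy.provider.clusterA=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.client.failover.proxy.provider.clusterB=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

A client reading this can resolve both hdfs://clusterA and hdfs://clusterB, 
but a datanode reading the same dfs.nameservices list will register with every 
nameservice in it, which is the federation behaviour described above.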

I'm wondering if I am missing a property or something that will allow me to 
define both nameservices on both clusters and have the datanodes for each 
cluster *not* try to federate. Looking at DataNode, it appears to connect to 
every namenode defined, and the first one that sets the clusterid wins. It 
seems that there should be a dfs.datanode.clusterid property that the datanode 
uses; that would line up with the 'namenode -format -clusterid <cluster>' 
command used when you have multiple nameservices. Am I missing something in 
the configuration that will allow me to do what I want? To get distcp to work 
I had to create a third set of configuration files just for the client to use.
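
As a sketch of that workaround, the client-only configuration directory holds 
a merged hdfs-site.xml equivalent to the properties above (the daemons keep 
their own single-nameservice configs), and distcp is pointed at it; the config 
directory and the source/destination paths below are placeholders:

hadoop --config /path/to/client-conf distcp hdfs://clusterA/path/to/src hdfs://clusterB/path/to/dst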