[jira] Commented: (HADOOP-692) Rack-aware Replica Placement

Sameer Paranjpye (JIRA) Tue, 14 Nov 2006 15:50:34 -0800

    [ 
http://issues.apache.org/jira/browse/HADOOP-692?page=comments#action_12449865 ] 
            
Sameer Paranjpye commented on HADOOP-692:
-----------------------------------------


>  Sould we consider a mobile network ...

No, that is not the concern. The concern is that in a large enough installation 
(1000s of nodes) there are going to be a several instances a week of  nodes 
going down and getting repaired. When the nodes return they may not be in the 
same racks or on the same switches. This probably does not happen by design, 
but mistakes are always possible.

> I would simply update network toplogy when a datanode registers or exits.

Certainly, that seems like the right approach. The question is how the topology 
is updated.
 
Is it by updating a central configuration file and having the namenode read it? 
This implies potentially updating configuration every time anything changes in 
a datacenter.

Is it by running timing experiments every time a datanode registers? This can 
be biased by transient network conditions.

Neither of the above seem like a productive use of admin or namenode time.

Instead of a network topology interface, why not have a network location 
interface on the Datanode.

public interface NetworkLocation {
  String[] getLocation();
}

This returns an array of hubs that characterize a nodes location on the 
network. This is probably an array of string of the form, <key>=<value>. So a 
possible output could be:

rack=r1, switch=s2, datacenter=d3, ... 

as many levels as are desirable. Also, network distances between nodes aren't 
the only things that are interesting. I think it's useful to distinguish 
between a rack and a switch because a rack is commonly a physical power domain.

Given this output from each Datanode we can then have a concrete implementation 
of NetworkTopology that simply tracks the membership of each hub. Finding the 
distance between two nodes is done by comparing their arrays of hubs and 
stopping where they differ.

> Allocating all blocks of a file to the same 3 racks limits the aggaregate 
> read bandwith. 

It limits the aggregate read bandwidth to that particular file only. Every file 
will have it's own set of 3 racks, so I don't think it affects overall 
filesystem bandwidth. On the other hand, it potentially gives a client  that is 
reading an entire file better locality.

> Users may specify its replica placement policy 

I agree with Doug. This seems like a subsequent feature.





> Rack-aware Replica Placement
> ----------------------------
>
>                 Key: HADOOP-692
>                 URL: http://issues.apache.org/jira/browse/HADOOP-692
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.8.0
>            Reporter: Hairong Kuang
>         Assigned To: Hairong Kuang
>             Fix For: 0.9.0
>
>
> This issue assumes that HDFS runs on a cluster of computers that spread 
> across many racks. Communication between two nodes on different racks needs 
> to go through switches. Bandwidth in/out of a rack may be less than the total 
> bandwidth of machines in the rack. The purpose of rack-aware replica 
> placement is to improve data reliability, availability, and network bandwidth 
> utilization. The basic idea is that each data node determines to which rack 
> it belongs at the startup time and notifies the name node of the rack id upon 
> registration. The name node maintains a rackid-to-datanode map and tries to 
> place replicas across racks.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-692) Rack-aware Replica Placement

Reply via email to