[ 
http://issues.apache.org/jira/browse/HADOOP-692?page=comments#action_12448244 ] 
            
Hairong Kuang commented on HADOOP-692:
--------------------------------------

I plan to address the issue from three areas:

* Determine rack id
 
Each data node gets its rack id from the command line: the data node supports an 
option "-r <rack id>" or "--rack <rack id>". A rack id is a unique string 
representation of the rack that the data node belongs to. It usually consists 
of a name/ip plus the port number of the switch to which the data node directly 
connects. If the option is not set, the data node belongs to a default rack.

How to get the rack id is proprietary to each organization, so a mechanism 
needs to be provided when starting HDFS, for example, a script that prints the 
rack id on the screen. The output is fed to a data node when it starts. 

The rack id is then stored in the DatanodeID and sent to the name node as part 
of the registration information.
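The option handling above could be sketched as follows. This is an illustrative sketch only, not the actual data node code; the class name, the default rack string, and the parsing helper are all assumptions, while the "-r"/"--rack" option names follow the proposal above.

```java
// Hypothetical sketch of picking up the rack id from the data node's
// startup arguments; only the option names come from the proposal.
public class RackIdOption {
    public static final String DEFAULT_RACK = "/default-rack";

    // Scan the arguments for "-r <rack id>" or "--rack <rack id>";
    // fall back to the default rack when the option is absent.
    public static String parseRackId(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].equals("-r") || args[i].equals("--rack")) {
                return args[i + 1];
            }
        }
        return DEFAULT_RACK;
    }
}
```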

* Maintain rackid-to-datanodes map

A name node maintains a rackid-to-datanode-descriptors map that maps a rack 
id to the list of data nodes that belong to that rack. When the name node 
receives a registration message from a data node, it first checks whether the 
map already has an entry for the data node. If so, it removes the old entry. It 
then adds a new entry. When the name node removes a data node, the data node's 
entry in the map is also removed.

If all data nodes start without providing a rack id, the map contains one 
default rack id mapping to all the data nodes in the system, so HDFS behaves 
the same as it does now.
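The map maintenance described above could look roughly like this. A minimal sketch under stated assumptions: the class and method names are invented for illustration, and nodes are represented as plain strings rather than datanode descriptors.

```java
import java.util.*;

// Hypothetical sketch of the name node's rackid-to-datanodes map.
public class RackMap {
    public static final String DEFAULT_RACK = "/default-rack";

    // rack id -> data nodes currently registered on that rack
    private final Map<String, List<String>> rackToNodes = new HashMap<>();
    // data node -> rack it last registered from, so a re-registration
    // can find and remove the stale entry first
    private final Map<String, String> nodeToRack = new HashMap<>();

    // Called when the name node receives a registration message.
    public void register(String node, String rackId) {
        String rack = (rackId == null) ? DEFAULT_RACK : rackId;
        remove(node);  // drop any old entry for this data node
        rackToNodes.computeIfAbsent(rack, r -> new ArrayList<>()).add(node);
        nodeToRack.put(node, rack);
    }

    // Called when the name node removes a data node.
    public void remove(String node) {
        String oldRack = nodeToRack.remove(node);
        if (oldRack != null) {
            List<String> nodes = rackToNodes.get(oldRack);
            nodes.remove(node);
            if (nodes.isEmpty()) rackToNodes.remove(oldRack);
        }
    }

    public List<String> nodesOnRack(String rackId) {
        return rackToNodes.getOrDefault(rackId, Collections.emptyList());
    }
}
```

Note that when no data node supplies a rack id, every node lands under the single default rack, matching the fallback behavior described above.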

* Place replicas

A simple policy is to place replicas across racks. This prevents losing data 
when an entire rack fails and makes use of the bandwidth of multiple racks when 
reading a file. This policy also distributes replicas evenly in the cluster, 
which makes it easy to balance load when a map/reduce job that takes the data 
as input is scheduled to run. However, the cost of the policy falls on writes: 
a block needs to be written to multiple racks. One minor optimization is to 
place one replica on the local node where the writer is located or, failing 
that, on a different node in the local rack.

For the most common case, when the replication factor is three, another 
possible placement policy is to place one replica on the local node, one on a 
different node in the local rack, and one in a different rack. This policy cuts 
the out-of-rack write traffic and hence improves write performance. Because the 
chance of a rack failure is far less than that of a node failure, it does not 
affect data reliability and availability much. But it reduces the aggregate 
network bandwidth when reading a file, since a block is now placed in two racks 
rather than three. The replicas of a file also do not distribute evenly across 
the racks: one third of the replicas are on one node, two thirds are in one 
rack, and the remaining third is evenly distributed across the other racks. A 
map/reduce job still gets a chance to balance its load, however, because each 
block has one replica placed in a random rack. Overall I feel that this 
placement policy is a good trade-off: it greatly improves write performance at 
little cost in data reliability or read performance. 
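The replication-factor-3 policy above could be sketched as follows. This is a simplified illustration, not the actual chooser: the class name is invented, racks are modeled as a map from rack id to node list, and tie-breaking simply takes the first candidate instead of choosing randomly.

```java
import java.util.*;

// Hypothetical sketch of the rep-factor-3 placement: replica 1 on the
// writer's node, replica 2 on another node in the writer's rack,
// replica 3 on a node in a different rack.
public class PlacementSketch {
    public static List<String> chooseTargets(String writerNode,
                                             String writerRack,
                                             Map<String, List<String>> racks) {
        List<String> targets = new ArrayList<>();
        targets.add(writerNode);                      // replica 1: local node

        for (String node : racks.get(writerRack)) {   // replica 2: local rack
            if (!node.equals(writerNode)) {
                targets.add(node);
                break;
            }
        }
        for (Map.Entry<String, List<String>> e : racks.entrySet()) {
            if (!e.getKey().equals(writerRack) && !e.getValue().isEmpty()) {
                targets.add(e.getValue().get(0));     // replica 3: remote rack
                break;
            }
        }
        return targets;
    }
}
```

With this shape, only the third replica crosses a rack boundary, which is exactly where the out-of-rack write savings come from.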

> Rack-aware Replica Placement
> ----------------------------
>
>                 Key: HADOOP-692
>                 URL: http://issues.apache.org/jira/browse/HADOOP-692
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.8.0
>            Reporter: Hairong Kuang
>         Assigned To: Hairong Kuang
>             Fix For: 0.9.0
>
>
> This issue assumes that HDFS runs on a cluster of computers spread across 
> many racks. Communication between two nodes on different racks needs to go 
> through switches. Bandwidth into/out of a rack may be less than the total 
> bandwidth of the machines in the rack. The purpose of rack-aware replica 
> placement is to improve data reliability, availability, and network bandwidth 
> utilization. The basic idea is that each data node determines which rack it 
> belongs to at startup and notifies the name node of its rack id upon 
> registration. The name node maintains a rackid-to-datanode map and tries to 
> place replicas across racks.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira