[ 
https://issues.apache.org/jira/browse/HADOOP-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515441
 ] 

Hairong Kuang commented on HADOOP-1652:
---------------------------------------

Some more thoughts for discussion...

1. Put the cluster in safe mode while rebalancing. This allows us to schedule 
block-moving tasks more aggressively, but it interrupts the normal operation 
of the cluster.
2. Spawn a separate process on the client side to do all the scheduling work. The 
namenode ships a snapshot of all data node descriptors and all blocks to the 
process at the beginning. At the end, the process sends all the scheduled tasks 
back to the namenode. This approach does not interrupt namenode work, but it 
requires shipping a large amount of data from the namenode at the beginning and 
back to the namenode at the end.
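
To make option 2 concrete, here is a minimal sketch of a scheduler that works 
only on a snapshot and returns a list of move tasks. It is an illustration, not 
the design proposed for this issue: the NodeSnapshot type, the plan_moves 
function, and the use of block counts as the unit of space are all assumptions, 
and the sketch ignores the replica and rack constraints entirely.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    """Hypothetical per-node record shipped from the namenode."""
    name: str
    capacity: int  # number of blocks the node can hold (assumed unit)
    used: int      # number of blocks the node currently holds


def plan_moves(nodes):
    """Greedily pair over-utilized nodes with under-utilized ones.

    Works purely on the snapshot; returns (source, target, n_blocks) tuples
    that would be sent back to the namenode as scheduled tasks.
    """
    total_used = sum(n.used for n in nodes)
    total_capacity = sum(n.capacity for n in nodes)
    target_ratio = total_used / total_capacity

    # Positive surplus: node holds more than its share of the data.
    surplus = [(n.name, n.used - int(target_ratio * n.capacity)) for n in nodes]
    donors = [[name, amt] for name, amt in surplus if amt > 0]
    receivers = [[name, -amt] for name, amt in surplus if amt < 0]

    moves = []
    i = j = 0
    while i < len(donors) and j < len(receivers):
        n_blocks = min(donors[i][1], receivers[j][1])
        moves.append((donors[i][0], receivers[j][0], n_blocks))
        donors[i][1] -= n_blocks
        receivers[j][1] -= n_blocks
        if donors[i][1] == 0:
            i += 1
        if receivers[j][1] == 0:
            j += 1
    return moves
```

The namenode is not consulted while the plan is computed, which is the point of 
this option. The cost is that the snapshot may be stale by the time the tasks 
are sent back.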

> Rebalance data blocks when new data nodes added or data nodes become full
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-1652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1652
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.13.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.15.0
>
>
> When a new data node joins an HDFS cluster, it does not hold much data. So any 
> map task assigned to the machine most likely does not read local data, thus 
> increasing the use of network bandwidth. On the other hand, when some data 
> nodes become full, new data blocks are placed on only non-full data nodes, 
> thus reducing their read parallelism. 
> This jira aims to find an approach to redistribute data blocks when imbalance 
> occurs in the cluster.  A solution should meet the following requirements:
> 1. It maintains data availability guarantees in the sense that rebalancing 
> does not reduce the number of replicas that a block has or the number of 
> racks on which the block resides.
> 2. An administrator should be able to invoke and interrupt rebalancing from a 
> command line.
> 3. Rebalancing should be throttled so that rebalancing does not cause a 
> namenode to be too busy to serve any incoming request or saturate the network.
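
Requirement 1 in the quoted description can be read as a check applied to each 
candidate move. The sketch below is one possible reading, not code from Hadoop: 
the function name, the rack_of mapping, and the data node names are all 
hypothetical.

```python
def move_preserves_availability(replica_nodes, rack_of, source, target):
    """Return True if moving one replica from source to target keeps both
    the replica count and the number of distinct racks from decreasing.

    replica_nodes: nodes currently holding a replica of the block
    rack_of:       mapping from node name to rack name (assumed input)
    """
    before = set(replica_nodes)
    # The source must hold a replica, and the target must not already hold one;
    # otherwise the move would lower the number of distinct replicas.
    if source not in before or target in before:
        return False

    after = (before - {source}) | {target}
    racks_before = {rack_of[n] for n in before}
    racks_after = {rack_of[n] for n in after}
    return len(after) >= len(before) and len(racks_after) >= len(racks_before)
```

For example, with replicas on two nodes of one rack and one node of a second 
rack, moving the lone replica of the second rack onto the first rack is 
rejected, while moving it to another node on the second rack or to a third 
rack is accepted.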

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
