Normally, when we run HBase and HDFS in the same cluster (e.g., the region 
server runs on the datanode), we get reasonably good data locality, as 
explained<http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html> 
by Lars. Work<https://issues.apache.org/jira/browse/HBASE-2896> has also been 
done by Jonathan to address the startup situation.

There are scenarios where a region can be on a different machine from the 
machines that hold its underlying HFile blocks, at least for some period of 
time. This will hurt the performance of whole-table scans and MapReduce jobs 
during that time.


1. After the load balancer moves a region, and before a compaction on that 
region rewrites its HFiles on the new region server, the HDFS blocks can be 
remote.

2. When a machine is added or removed, HBase's region assignment policy is 
different from HDFS's block reassignment policy, so a region and its blocks 
can end up on different machines.

3. Even if there is not much HBase activity, HDFS can rebalance HFile blocks 
as other non-HBase applications push other data to HDFS.

A lot has been done, or will be done, in the load balancer, as 
summarized<http://zhihongyu.blogspot.com/2011/04/load-balancer-in-hbase-090.html>
 by Ted. I am curious whether HFile HDFS block locality should be used as 
another factor there.
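
To make the question concrete, here is a rough sketch of what I mean by using 
block locality as a factor. It is not how the HBase balancer works today; the 
function names, the 0..1 load figure, and the weighting are all assumptions I 
made up for illustration. The idea is to compute, for each candidate host, the 
fraction of a region's HFile bytes that have a replica on that host, and then 
trade that off against the host's load.

```python
from collections import defaultdict

# Hypothetical sketch: none of these names come from HBase or HDFS.

def locality_by_host(blocks):
    """blocks: list of (size_in_bytes, [hosts holding a replica]).

    Returns {host: fraction of the region's HFile bytes stored on that host}.
    """
    total = sum(size for size, _ in blocks)
    if total == 0:
        return {}
    local_bytes = defaultdict(int)
    for size, hosts in blocks:
        for host in set(hosts):  # count each replica host once per block
            local_bytes[host] += size
    return {host: n / total for host, n in local_bytes.items()}


def pick_server(blocks, load, locality_weight=0.5):
    """Pick the host with the best trade-off between locality and load.

    load: {host: current load normalized to 0..1} (an assumed metric).
    locality_weight: assumed knob; 1.0 means locality only, 0.0 means load only.
    """
    locality = locality_by_host(blocks)

    def score(host):
        return (locality_weight * locality.get(host, 0.0)
                - (1 - locality_weight) * load[host])

    # Sort first so ties are broken deterministically by host name.
    return max(sorted(load), key=score)


# Made-up example: three HFile blocks of one region, replicated three ways.
blocks = [
    (64, ["rs1", "rs2", "rs3"]),
    (64, ["rs1", "rs3", "rs4"]),
    (32, ["rs2", "rs3", "rs4"]),
]
load = {"rs1": 0.2, "rs2": 0.1, "rs3": 0.9, "rs4": 0.5}
best = pick_server(blocks, load)
```

In this made-up example rs3 holds a replica of every block, but it is also the 
most heavily loaded host, so with equal weighting the lightly loaded rs1 wins. 
Where the real balancer would get the block locations and the load figure from 
is exactly the part I am unsure about.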

Thanks.

Ming
