Hi,

Read this:
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

[...]

> In the thread "Data distribution in HBase", one of the people
> mentioned that the data hosted by the Region Server may not
> actually reside on the same machine.
> So when asked for data, it fetches from the system
> containing the data. Am I right? Why doesn't the data hosted by a
> "Region Server" lie on the same machine? Doesn't the
> name "Region Server" imply that it holds all
> the regions it contains? Is it due to splits or restarting
> HBase?

No. HBase fetches file data from HDFS. Typically HBase region servers
and HDFS DataNodes are run together on the same servers. But the local
DataNode may not have a replica of a given block (blocks are replicated
randomly across the cluster).

> Also the same case applies here I guess. When a map is run
> on a Region Server, its data may not actually lie on the same
> machine

If the data is in the memstore or block cache, then it will be served
from the same server. Otherwise, it depends on whether the HDFS DataNode
colocated with the region server has a replica of the necessary blocks.
Regardless, the client (here, the map task) and the region server are on
the same server, which improves performance due to data locality, in
this case region locality. If the HDFS layer also finds local block
replicas, then all communication is local.

HBase region servers periodically compact regions. (Of course the region
must be updated by at least one write at some point to trigger an
eventual compaction.) Compactions are a rewrite of region data, and HDFS
normally places one replica of newly written blocks on the writer's own
DataNode. So in effect writes bring region data local. Region assignment
is stable on a cluster in steady state. This means that over time reads
tend to find local replicas.

Hope that helps,

- Andy
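
P.S. If it helps to see the compaction argument in numbers, here is a
toy simulation in Python. It is only a sketch, not HBase or HDFS code:
every name in it (place_replicas, locality, compact, node0) is made up
for illustration, and it assumes the usual HDFS behavior of putting the
first replica of a new block on the writer's own DataNode and the
remaining replicas on other nodes chosen at random.

```python
import random

def place_replicas(nodes, writer, replication=3, rng=random):
    """Toy placement: one replica on the writer's node, the rest on random other nodes."""
    others = [n for n in nodes if n != writer]
    return {writer, *rng.sample(others, replication - 1)}

def locality(blocks, host):
    """Fraction of blocks that have a replica on `host`."""
    if not blocks:
        return 0.0
    return sum(host in replicas for replicas in blocks) / len(blocks)

def compact(blocks, nodes, region_server, replication=3, rng=random):
    """Rewrite every block from `region_server`, as a compaction rewrites region data."""
    return [place_replicas(nodes, region_server, replication, rng) for _ in blocks]

rng = random.Random(0)
nodes = [f"node{i}" for i in range(10)]

# Region files that were written by other servers, e.g. before the
# region was assigned to node0.
blocks = [place_replicas(nodes, rng.choice(nodes[1:]), rng=rng) for _ in range(100)]

before = locality(blocks, "node0")
after = locality(compact(blocks, nodes, "node0", rng=rng), "node0")
print(f"locality on node0 before compaction: {before:.2f}, after: {after:.2f}")
```

Before the rewrite only some of the blocks happen to have a replica on
node0; after it, every block does, because node0 was the writer. A real
cluster is messier than this (rack awareness, full disks, the balancer),
but the direction of the effect is the same.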