I have a couple of questions related to MapReduce over HBase:

1. HBase guarantees data locality between store files and their RegionServer only if the RegionServer stays up for a long time. If there are too many region movements, or the server has been recycled recently, there is a high probability that the store file blocks are not local to the RegionServer. But getSplits always returns the RegionServer of the StoreFile. So in this scenario, does MapReduce lose its data locality?

2. Because getSplits returns only the RegionServer, the MR job is not aware of the multiple replicas of the StoreFile block. It accesses only one copy of the block (which is local, unless the situation in point 1 applies). This can constrain MR processing, because the work cannot be distributed across the replicas in the best possible manner. Is this correct?

3. A guess: since MR processing goes through the RegionServer, could it impact RegionServer performance for other random operations?

Thanks in advance,
Hemant