I have a couple of questions about MapReduce over HBase.

 

1. HBase provides data locality between store files and their RegionServer
only if the server stays up for a long time. If there are many region
movements, or the server has been recycled recently, there is a high
probability that the store file blocks are not local to the RegionServer.
But getSplits always returns the RegionServer that serves the StoreFile.
So in this scenario, does MapReduce lose its data locality?
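
To make the scenario in point 1 concrete, here is a minimal sketch that
measures how much of a store file is actually local to the host reported for
a split. It is a plain Java model, not HBase or HDFS API code: the Block class,
the localFraction method, and the host names (rs1, dn2, ...) are all
hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SplitLocalitySketch {
    /** Hypothetical model of one HDFS block: its size and the hosts holding a replica. */
    static final class Block {
        final long bytes;
        final Set<String> replicaHosts;

        Block(long bytes, String... hosts) {
            this.bytes = bytes;
            this.replicaHosts = new HashSet<>(Arrays.asList(hosts));
        }
    }

    /** Fraction of the store file's bytes that have a replica on the given host. */
    static double localFraction(List<Block> blocks, String host) {
        long total = 0;
        long local = 0;
        for (Block b : blocks) {
            total += b.bytes;
            if (b.replicaHosts.contains(host)) {
                local += b.bytes;
            }
        }
        return total == 0 ? 0.0 : (double) local / total;
    }

    public static void main(String[] args) {
        // Assumed layout: the region moved to rs1 recently, so only the newest
        // block was written locally; the older blocks live on other datanodes.
        List<Block> storeFile = Arrays.asList(
            new Block(128, "rs1", "dn2", "dn3"),
            new Block(128, "dn2", "dn3", "dn4"),
            new Block(64, "dn2", "dn4", "dn5"));

        // The split reports rs1, but dn2 holds a replica of every block.
        System.out.println("rs1 local fraction: " + localFraction(storeFile, "rs1"));
        System.out.println("dn2 local fraction: " + localFraction(storeFile, "dn2"));
    }
}
```

In this made-up layout a map task scheduled on the reported host would still
read most of the data over the network, which is the situation I am asking
about.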

 

2. Since getSplits returns only the RegionServer, the MR job is not aware
of the multiple replicas of each StoreFile block. It accesses only one
replica (which is local, unless the situation in point 1 applies). This can
constrain MR processing, because the work cannot be distributed across the
hosts that hold the other replicas. Is this correct?
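
As a rough illustration of what point 2 means by replica awareness, the
sketch below ranks hosts by how many bytes of a store file they hold across
all replicas. Again this is a plain Java model under my own assumptions, not
HBase or HDFS API code: Block, bytesPerHost, rankedHosts, and the host names
are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ReplicaAwareHostsSketch {
    /** Hypothetical model of one HDFS block: its size and the hosts holding a replica. */
    static final class Block {
        final long bytes;
        final Set<String> replicaHosts;

        Block(long bytes, String... hosts) {
            this.bytes = bytes;
            this.replicaHosts = new HashSet<>(Arrays.asList(hosts));
        }
    }

    /** Bytes held per host, summed over every replica of every block. */
    static Map<String, Long> bytesPerHost(List<Block> blocks) {
        Map<String, Long> totals = new HashMap<>();
        for (Block b : blocks) {
            for (String host : b.replicaHosts) {
                totals.merge(host, b.bytes, Long::sum);
            }
        }
        return totals;
    }

    /** Hosts ordered by descending local bytes; ties are broken by host name. */
    static List<String> rankedHosts(List<Block> blocks) {
        final Map<String, Long> totals = bytesPerHost(blocks);
        List<String> hosts = new ArrayList<>(totals.keySet());
        hosts.sort((a, b) -> {
            int byBytes = Long.compare(totals.get(b), totals.get(a));
            return byBytes != 0 ? byBytes : a.compareTo(b);
        });
        return hosts;
    }

    public static void main(String[] args) {
        // Assumed layout: rs1 is the RegionServer, dn2..dn5 are other datanodes.
        List<Block> storeFile = Arrays.asList(
            new Block(100, "rs1", "dn2", "dn3"),
            new Block(100, "dn2", "dn3", "dn4"),
            new Block(50, "dn3", "dn4", "dn5"));

        System.out.println("Hosts by local bytes: " + rankedHosts(storeFile));
    }
}
```

A split that carried a ranked list like this, rather than the single
RegionServer host, would give the scheduler more than one candidate for a
local task. Whether that helps when the reads still go through the
RegionServer is part of what I am unsure about.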

 

3. A guess: since MR processing goes through the RegionServer, could it
degrade RegionServer performance for other random-access operations?

 

Thanks in advance,

Hemant 

 
