Hi, I have some questions regarding HBase and locality issues - I'd appreciate some explanations and clarifications.
I understand that HBase is built on top of HDFS. Say an HRegionServer creates an HStoreFile in which it stores the contents of some column family. Does HDFS split the file into multiple HDFS blocks and distribute them across a bunch of machines? If so, when the region server needs to access the file, does HDFS underneath contact remote machines to read the various blocks? Doesn't that hurt performance, since there is no locality in data access (the region server is effectively working on remote blocks)?

Or is the HStoreFile implemented in some other way that writes it to the local disks of the region server machine that owns it? If so, how? Does that code override the default HDFS behavior?

Another related question is about MapReduce and HBase. When a MapReduce job runs on top of HBase, i.e. takes a table as its input, how does the MapReduce framework know to schedule map tasks near the data? Does it have any knowledge of the actual location of the pieces of data that make up the table to be processed?

I'd also be glad to get pointers to the related source code (classes).

Thanks for any information,
Naama

--
"If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales." (Albert Einstein)
