Hello,

As far as I understand, the Bulk Import functionality does not take data locality into account. The MR job creates as many reducer tasks as there are regions to write into, but it does not "advise" on which nodes to run these tasks. As a result, the reducer task that writes the HFiles of a region may not be physically located on the same node as the RS that serves that region. Given the way HDFS writes data, there will (likely) be one full replica of the blocks of that region's HFiles on the node where the reducer task ran, and the other replicas (if replication > 1) will be distributed randomly over the cluster. Thus the RS serving that region will (most likely) not read local data (the data will be transferred from other datanodes). I.e., data locality will be broken.
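To make the concern concrete, here is a toy sketch of the placement behaviour described above. It is not HBase or HDFS code: the class `LocalitySketch` and its methods are hypothetical names, nodes are plain integers, and it assumes the first replica lands on the writer's node while the remaining replicas go to distinct nodes picked uniformly at random (real HDFS placement is rack-aware, which this ignores).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy model only: nodes are integers 0..numNodes-1, no racks, no real HDFS API.
public class LocalitySketch {

    // Replica nodes for one block: the writer's node first,
    // then distinct other nodes chosen uniformly at random.
    static List<Integer> placeReplicas(int writerNode, int numNodes,
                                       int replication, Random rnd) {
        List<Integer> replicas = new ArrayList<>();
        replicas.add(writerNode);
        int target = Math.min(replication, numNodes);
        while (replicas.size() < target) {
            int candidate = rnd.nextInt(numNodes);
            if (!replicas.contains(candidate)) {
                replicas.add(candidate);
            }
        }
        return replicas;
    }

    // Fraction of blocks that have a replica on the region server's node.
    static double localFraction(int numNodes, int replication, int blocks,
                                int writerNode, int regionServerNode, Random rnd) {
        int local = 0;
        for (int b = 0; b < blocks; b++) {
            List<Integer> replicas = placeReplicas(writerNode, numNodes, replication, rnd);
            if (replicas.contains(regionServerNode)) {
                local++;
            }
        }
        return (double) local / blocks;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // Reducer ran on the same node as the RS: every block is local.
        System.out.println("same node:      " + localFraction(20, 3, 10000, 0, 0, rnd));
        // Reducer ran elsewhere: only a randomly placed replica can be local.
        System.out.println("different node: " + localFraction(20, 3, 10000, 0, 7, rnd));
    }
}
```

Under these assumptions, when the reducer and the RS are on different nodes the local fraction is roughly (replication - 1) / (numNodes - 1), which gets small quickly as the cluster grows.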
Is this correct?

If yes, I guess it would help if we could tell the MR framework where (on which nodes) to launch certain reducer tasks. I believe this is not possible with MR1; please correct me if I'm wrong. Perhaps this is possible with MR2?

I assume there's also no way to provide a "hint" to the NameNode about where to place the blocks of a new file, right?

Thank you,
-- 
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr