I'm adding some Q&As I went through with Lars. Hope this is somewhat
related to your data locality questions.
On Jun 15, 2012, at 6:56 AM, Ben Kim wrote:

> Hi,
>
> I've been posting questions to the mailing list quite often lately,
> and here goes another one about data locality. I read the excellent
> blog post about data locality that Lars George wrote at
> http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
>
> I understand data locality in HBase as locating a region in a
> region server where most of its data blocks reside.

The opposite is happening, i.e. it is the region server process that
triggers all the data it writes to be located on the same physical
machine.

> So that way fast data access is guaranteed when running an MR job,
> because each map/reduce task is run for its region on the tasktracker
> where the region is co-located.

Correct.

> But what if the data blocks of the region are evenly spread over
> multiple region servers?

This will not happen unless the original server fails. Then the region
is moved to another server that now needs to do a lot of remote reads
over the network. This is why there is work being done to allow for
custom placement policies in HDFS. That way you can store the entire
region and all copies as complete units on three data nodes. In case of
a failure you can then move the region to one of the two copies. This
is not available yet, though, but it is being worked on (so I heard).

> Does an MR task have to remotely access the data blocks from other
> region servers?

For the above failure case, it would be the region server accessing the
remote data, yes.

> How good is HBase at locating data blocks where a region resides?

That is again the wrong way around. HBase has no clue as to where
blocks reside, nor does it know that the file system in fact uses
separate blocks. HBase stores files; HDFS does the block magic under
the hood, transparent to HBase.

> Also, is it correct to say that if I set a smaller data block size,
> data locality gets worse, and if the data block size gets bigger,
> data locality gets better?

This is not applicable here; I am assuming this stems from the above
confusion about which system is handling the blocks, HBase or HDFS.
See above.

HTH,
Lars
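Side note from me: since HBase itself has no idea where the blocks
live, the only way to see how local a region's data really is, is to
ask HDFS directly. Here is a rough sketch of how that could look with
the plain Hadoop FileSystem API (the region directory and host name
below are placeholders I made up; substitute your own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RegionLocalityCheck {

  // Walks a region directory and counts how many HDFS blocks of its
  // HFiles have a replica on the given host.
  static long[] countBlocks(FileSystem fs, Path dir, String host)
      throws Exception {
    long local = 0, total = 0;
    for (FileStatus st : fs.listStatus(dir)) {
      if (st.isDir()) { // descend into column family directories
        long[] sub = countBlocks(fs, st.getPath(), host);
        local += sub[0];
        total += sub[1];
        continue;
      }
      for (BlockLocation loc :
          fs.getFileBlockLocations(st, 0, st.getLen())) {
        total++;
        for (String h : loc.getHosts()) {
          if (h.startsWith(host)) {
            local++;
            break;
          }
        }
      }
    }
    return new long[] { local, total };
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Placeholder region path and region server host.
    long[] c = countBlocks(fs,
        new Path("/hbase/mytable/1028785192"), "regionserver01");
    System.out.printf("%d of %d blocks local (%.1f%%)%n",
        c[0], c[1], 100.0 * c[0] / c[1]);
  }
}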
And here is the related thread on bulk import and locality:

On Thu, Jul 19, 2012 at 6:39 AM, Cristofer Weber
<cristofer.we...@neogrid.com> wrote:

> Hi Alex,
>
> I ran one of our bulk import jobs with partial payload, without
> proceeding with major compaction, and you are right: some HDFS blocks
> are in a different datanode.
>
> -----Original Message-----
> From: Alex Baranau [mailto:alex.barano...@gmail.com]
> Sent: Wednesday, July 18, 2012 12:46
> To: hbase-u...@hadoop.apache.org; mapreduce-u...@hadoop.apache.org;
> hdfs-u...@hadoop.apache.org
> Subject: Bulk Import & Data Locality
>
> Hello,
>
> As far as I understand, the Bulk Import functionality does not take
> data locality into account. The MR job will create as many reducer
> tasks as there are regions to write into, but it will not "advise" on
> which nodes to run these tasks. In that case, the reducer task that
> writes the HFiles of some region may not be physically located on the
> same node as the RS that serves that region.
>
> The way HDFS writes data, there will (most likely) be one full
> replica of the blocks of this region's HFiles on the node where the
> reducer task ran, while the other replicas (if replication > 1) will
> be distributed randomly over the cluster. Thus, the RS serving the
> data of that region will (most likely) not be reading local data (the
> data will be transferred from other datanodes). I.e., data locality
> will be broken.
>
> Is this correct?
>
> If yes, I guess that if we could tell the MR framework where (on
> which nodes) to launch certain reducer tasks, this would help us. I
> believe this is not possible with MR1; please correct me if I'm
> wrong. Perhaps this is possible with MR2?
>
> I assume there's also no way to provide a "hint" to the NameNode
> about where to place the blocks of a new file, right?
>
> Thank you,
> --
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase -
> ElasticSearch - Solr

--
Benjamin Kim
benkimkimben at gmail
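One practical follow-up to Alex's question: until reducer placement or
HDFS placement hints become possible, the usual way to get locality
back after a bulk import is to major-compact the table. Each region
server then rewrites the HFiles of its own regions, so HDFS places the
first replica of every new block on the RS's own node. A minimal
sketch against the 0.92-era client API (the table name is a
placeholder):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CompactAfterBulkLoad {
  public static void main(String[] args) throws Exception {
    // Asks each region server to rewrite the HFiles of "mytable"
    // (placeholder name). The request is asynchronous; the actual
    // compaction runs in the background on the servers.
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    try {
      admin.majorCompact("mytable");
    } finally {
      admin.close();
    }
  }
}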