[ https://issues.apache.org/jira/browse/HBASE-13042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320618#comment-14320618 ]

Enis Soztutar commented on HBASE-13042:
---------------------------------------

Here is an idea. I'm not sure whether it will help you.

{{TableSnapshotInputFormat}} allows you to run any MR job directly over a 
snapshot. It also accepts key ranges and eliminates regions outside the range. 
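
For example, here is a minimal sketch of a driver that runs a job over a 
snapshot restricted to one key range. The snapshot name, row bounds, and the 
{{ExportMapper}} are placeholder assumptions, not part of this issue:
{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class SnapshotRangeExport {

  // Hypothetical mapper: re-emits each cell so a downstream
  // HFileOutputFormat2 reducer could write files for bulk load.
  static class ExportMapper extends TableMapper<ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context ctx)
        throws IOException, InterruptedException {
      for (Cell cell : result.rawCells()) {
        ctx.write(row, (KeyValue) cell); // assumes KeyValue-backed cells
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "snapshot-range-export");
    job.setJarByClass(SnapshotRangeExport.class);

    // Restrict the scan to Range[i]; regions entirely outside
    // [startRow, stopRow) are eliminated by the input format.
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(args[1]));
    scan.setStopRow(Bytes.toBytes(args[2]));

    TableMapReduceUtil.initTableSnapshotMapperJob(
        args[0],                            // snapshot name
        scan,
        ExportMapper.class,
        ImmutableBytesWritable.class,
        KeyValue.class,
        job,
        true,                               // ship dependency jars
        new Path("/tmp/snapshot-restore")); // scratch dir for the restored snapshot

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
{code}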

Without HBASE-13031, a snapshot is still a full-table snapshot, but what you 
can do is: 

Decide on table ranges (let's say N ranges):
{code}
for i in 0..N-1
  (1) take a snapshot
  (2) run a custom MR job over the snapshot to export the data for Range[i]
      (creating HFiles for bulk load)
  (3) delete the snapshot
{code}
You only hold onto a single snapshot during step (2), and you can control how 
long that step takes by choosing the size of Range[i].
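
For example, a rough driver for this loop using the HBase 1.0-style Admin API. 
The table name, range boundaries, and the {{runExportJob}} helper are 
placeholder assumptions:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class RangedSnapshotExport {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName table = TableName.valueOf("my_table"); // placeholder table

    // N ranges need N+1 boundaries; an empty byte[] means "open ended".
    byte[][] bounds = {
        Bytes.toBytes(""), Bytes.toBytes("m"), Bytes.toBytes("")
    };

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      for (int i = 0; i < bounds.length - 1; i++) {
        String snapshot = "export_" + i;
        admin.snapshot(snapshot, table);                        // (1)
        runExportJob(conf, snapshot, bounds[i], bounds[i + 1]); // (2)
        admin.deleteSnapshot(snapshot);                         // (3)
      }
    }
  }

  // Hypothetical helper: submits the TableSnapshotInputFormat job
  // sketched above with scan bounds [start, stop) and waits for it.
  static void runExportJob(Configuration conf, String snapshot,
                           byte[] start, byte[] stop) throws Exception {
    // see the SnapshotRangeExport sketch
  }
}
{code}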


> MR Job to export HFiles directly from an online cluster
> -------------------------------------------------------
>
>                 Key: HBASE-13042
>                 URL: https://issues.apache.org/jira/browse/HBASE-13042
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Dave Latham
>
> We're looking at the best way to bootstrap a new remote cluster.  The source 
> cluster has a large table of compressed data using more than 50% of the HDFS 
> capacity, and we have a WAN link to the remote cluster.  Ideally we would 
> set up replication to a new table remotely, snapshot the source table, copy 
> the snapshot across, then bulk load it into the new table.  However, the 
> time to copy the data remotely is greater than the major compaction 
> interval, so the source cluster would run out of storage.
> One approach is HBASE-13031, which would allow operators to snapshot and 
> copy one key range at a time.  Here's another idea:
> Create an MR job that tries to do a robust remote HFile copy directly:
>  * Each split is responsible for a key range.
>  * Each map task looks up its key range and maps it to a set of HDFS store 
> directories (one per region/family)
>  * For each store:
>    ** List the HFiles in the store (there must be fewer than 1000 files to 
> guarantee an atomic listing)
>    ** Attempt to copy the store files (copy in increasing size order to 
> minimize the likelihood of a compaction removing a file mid-copy)
>    ** If some of the files disappear (compaction), retry the directory 
> listing / copy
>  * If any of the stores disappear (region split / merge), retry the map task 
> (and remap the key range to stores); a sketch of this copy/retry loop follows 
> below
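>
> A rough sketch of the per-store copy loop, assuming a retry cap and 
> placeholder paths (none of this exists in HBase yet):
> {code}
> import java.io.FileNotFoundException;
> import java.io.IOException;
> import java.util.Arrays;
> import java.util.Comparator;
> 
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.FileUtil;
> import org.apache.hadoop.fs.Path;
> 
> public class StoreCopier {
>   // Copy one store directory, re-listing and retrying whenever a
>   // compaction removes a file out from under us.
>   static void copyStore(FileSystem srcFs, FileSystem dstFs, Path storeDir,
>       Path destDir, Configuration conf, int maxRetries) throws IOException {
>     for (int attempt = 0; attempt < maxRetries; attempt++) {
>       try {
>         FileStatus[] files = srcFs.listStatus(storeDir);
>         // Increasing size order: the small, recent files are the most
>         // likely to be compacted away, so copy them first.
>         Arrays.sort(files, Comparator.comparingLong(FileStatus::getLen));
>         for (FileStatus f : files) {
>           FileUtil.copy(srcFs, f.getPath(), dstFs,
>               new Path(destDir, f.getPath().getName()), false, conf);
>         }
>         return; // every file from one consistent listing was copied
>       } catch (FileNotFoundException e) {
>         // A file (or the store itself) vanished; if the store directory is
>         // gone, the map task should fail so the key range gets remapped.
>       }
>     }
>     throw new IOException("Too many compactions while copying " + storeDir);
>   }
> }
> {code}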
> Or maybe there are HBase locking mechanisms for a region or store that would 
> work better.  Otherwise the question is how often compactions or region 
> splits would force retries.
> Is this crazy? 



