[ https://issues.apache.org/jira/browse/HBASE-13042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323310#comment-14323310 ]

Dave Latham commented on HBASE-13042:
-------------------------------------

Thanks, [~jmhsieh].  We are aware of ExportSnapshot, which has a somewhat 
easier task since it's operating on a snapshot, so if a file is compacted it 
will still be found in the archive.  The problem in this case is that keeping 
an entire snapshot archived for that long would run this cluster out of 
storage space.

It's looking more like we'll go with Andrew's suggestion of an Export 
(compressed), DistCp, Import (to HFiles), then bulk load, doing half the table 
at a time.  It unfortunately requires some extra data copies, but the extra 
compression should shrink the transfer enough that we can do it in only two 
passes (half the table each time) and cut down the data transfer time.
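
For concreteness, a rough sketch of one such pass as a single Java driver might 
look like the following.  This assumes the HBase 1.x MR tool classes (Export, 
Import, LoadIncrementalHFiles) and their usual entry points, which may differ 
slightly by version; the table name, midpoint row key, and paths are 
placeholders, the DistCp step runs out of band, and in practice step 1 and 
steps 3/4 would run on different clusters.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.Export;
import org.apache.hadoop.hbase.mapreduce.Import;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.ToolRunner;

public class HalfTableCopy {
  public static void main(String[] args) throws Exception {
    // Step 1 (source cluster): export rows below the midpoint key, compressed.
    Configuration src = HBaseConfiguration.create();
    src.set("mapreduce.output.fileoutputformat.compress", "true");
    src.set("mapreduce.output.fileoutputformat.compress.codec",
        "org.apache.hadoop.io.compress.GzipCodec");
    // Assumes Export honors the standard scan row stop property.
    src.set("hbase.mapreduce.scan.row.stop", "m");  // illustrative midpoint key
    Job export = Export.createSubmittableJob(src,
        new String[] { "big_table", "/export/big_table_half1" });
    if (!export.waitForCompletion(true)) System.exit(1);

    // Step 2: DistCp the compressed SequenceFiles across the WAN
    // (run separately with the distcp tool; omitted here).

    // Step 3 (destination cluster): Import into HFiles rather than live puts.
    Configuration dst = HBaseConfiguration.create();
    dst.set(Import.BULK_OUTPUT_CONF_KEY, "/staging/big_table_half1_hfiles");
    Job importJob = Import.createSubmittableJob(dst,
        new String[] { "big_table", "/copied/big_table_half1" });
    if (!importJob.waitForCompletion(true)) System.exit(1);

    // Step 4 (destination cluster): bulk load the generated HFiles.
    int rc = ToolRunner.run(dst, new LoadIncrementalHFiles(dst),
        new String[] { "/staging/big_table_half1_hfiles", "big_table" });
    System.exit(rc);
  }
}
{code}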

> MR Job to export HFiles directly from an online cluster
> -------------------------------------------------------
>
>                 Key: HBASE-13042
>                 URL: https://issues.apache.org/jira/browse/HBASE-13042
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Dave Latham
>
> We're looking at the best way to bootstrap a new remote cluster.  The source 
> cluster has a large table of compressed data using more than 50% of the HDFS 
> capacity, and we have a WAN link to the remote cluster.  Ideally we would 
> set up replication to a new table remotely, snapshot the source table, copy 
> the snapshot across, then bulk load it into the new table.  However, the 
> amount of time to copy the data remotely is greater than the major compaction 
> interval, so the source cluster would run out of storage.
> One approach is HBASE-13031, which would allow operators to snapshot and copy 
> one key range at a time.  Here's another idea:
> Create a MR job that tries to do a robust remote HFile copy directly (a rough 
> sketch of the copy/retry loop follows the description below):
>  * Each split is responsible for a key range.
>  * The map task looks up that key range and maps it to a set of HDFS store 
> directories (one for each region/family)
>  * For each store:
>    ** List the HFiles in the store (needs to be fewer than 1000 files to 
> guarantee an atomic listing)
>    ** Attempt to copy the store files (copy in increasing size order to 
> minimize the likelihood of a compaction removing a file during the copy)
>    ** If some of the files disappear (compaction), retry the directory 
> listing / copy
>  * If any of the stores disappear (region split / merge), then retry the map 
> task (and remap the key range to stores)
> Or maybe there are some HBase locking mechanisms for a region or store that 
> would be better.  Otherwise the question is how often compactions or region 
> splits would force retries.
> Is this crazy? 
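
For illustration, the per-store copy/retry step described above might look 
roughly like this against the plain HDFS FileSystem API; the method name, 
retry limit, and error handling are illustrative assumptions, not part of the 
proposal:

{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class StoreCopier {
  /**
   * Copies every HFile under one store directory (region/family) to the
   * remote cluster, re-listing and retrying if a compaction removes a file
   * mid-copy.  Returns false if the store itself has vanished (region
   * split/merge), signalling the caller to re-map the key range and retry
   * the map task.
   */
  static boolean copyStore(FileSystem srcFs, FileSystem dstFs, Path srcStore,
      Path dstStore, Configuration conf, int maxRetries) throws IOException {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      FileStatus[] files;
      try {
        files = srcFs.listStatus(srcStore);   // must stay under ~1000 entries
      } catch (FileNotFoundException e) {
        return false;                         // store gone: region split/merge
      }
      // Copy the smallest files first to shrink the window in which a
      // compaction can delete a file we have not copied yet.
      Arrays.sort(files, Comparator.comparingLong(FileStatus::getLen));
      try {
        for (FileStatus f : files) {
          FileUtil.copy(srcFs, f.getPath(), dstFs,
              new Path(dstStore, f.getPath().getName()),
              false /* don't delete source */, conf);
        }
        return true;                          // all HFiles copied
      } catch (FileNotFoundException e) {
        // A file disappeared (compaction); fall through to re-list and retry.
      }
    }
    throw new IOException("Too many compaction-induced retries for " + srcStore);
  }
}
{code}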


