Dave Latham created HBASE-13042:
-----------------------------------

             Summary: MR Job to export HFiles directly from an online cluster
                 Key: HBASE-13042
                 URL: https://issues.apache.org/jira/browse/HBASE-13042
             Project: HBase
          Issue Type: New Feature
            Reporter: Dave Latham


We're looking at the best way to bootstrap a new remote cluster.  The source 
cluster has a large table of compressed data using more than 50% of the HDFS 
capacity and we have a WAN link to the remote cluster.  Ideally we would set 
up replication to a new table remotely, snapshot the source table, copy the 
snapshot across, then bulk load it into the new table.  However, the time to 
copy the data remotely is greater than the major compaction interval, so the 
snapshot would keep the old HFiles pinned while compactions write new ones, 
and the source cluster would run out of storage.

One approach is HBASE-13031, which would allow operators to snapshot and copy 
one key range at a time.  Here's another idea:

Create a MR job that tries to do a robust remote HFile copy directly (a rough 
sketch of the per-store copy follows the list):
 - Each split is responsible for a key range.
 - The map task looks up that key range and maps it to a set of HDFS store 
directories (one for each region/family).
 - For each store:
   - List the HFiles in the store (needs to be fewer than 1000 files to 
guarantee an atomic listing).
   - Attempt to copy the store files (copy in increasing size order to 
minimize the likelihood of compaction removing a file during the copy).
   - If some of the files disappear (compaction), retry the directory listing 
/ copy.
 - If any of the stores disappear (region split / merge), retry the map task 
(and remap the key range to stores).
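Here's a minimal sketch of the per-store copy with retries, assuming the map 
task has already resolved its key range to store directories.  MAX_RETRIES 
and StoreGoneException are made-up names; only the FileSystem / FileUtil 
calls are stock Hadoop API.

{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class StoreCopier {
  /** The store directory itself vanished (region split/merge); the caller
   *  should remap its key range to stores and retry the map task. */
  public static class StoreGoneException extends IOException {
    StoreGoneException(Path store, Throwable cause) {
      super("store disappeared: " + store, cause);
    }
  }

  private static final int MAX_RETRIES = 5;  // arbitrary retry budget
  private final FileSystem srcFs;
  private final FileSystem dstFs;

  public StoreCopier(FileSystem srcFs, FileSystem dstFs) {
    this.srcFs = srcFs;
    this.dstFs = dstFs;
  }

  public void copyStore(Path storeDir, Path destDir) throws IOException {
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
      FileStatus[] files;
      try {
        // One listing batch is only atomic below ~1000 entries, hence the
        // "fewer than 1000 files" caveat above.
        files = srcFs.listStatus(storeDir);
      } catch (FileNotFoundException e) {
        throw new StoreGoneException(storeDir, e);
      }
      // Smallest first: the slow, large copies run last, so a compaction
      // that lands mid-pass invalidates as little copied data as possible.
      Arrays.sort(files, new Comparator<FileStatus>() {
        public int compare(FileStatus a, FileStatus b) {
          return Long.compare(a.getLen(), b.getLen());
        }
      });
      try {
        for (FileStatus f : files) {
          FileUtil.copy(srcFs, f.getPath(), dstFs,
              new Path(destDir, f.getPath().getName()),
              false /* deleteSource */, srcFs.getConf());
        }
        return;  // one consistent pass completed
      } catch (FileNotFoundException e) {
        // An HFile vanished under compaction; re-list and try again.
      }
    }
    throw new IOException("Gave up on " + storeDir + " after "
        + MAX_RETRIES + " attempts");
  }
}
{code}

If the whole store directory is gone, listStatus throws 
FileNotFoundException, which the sketch turns into a "remap the key range and 
retry the map task" signal rather than another pass over the same path.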

Or maybe there are HBase locking mechanisms for a region or store that would 
work better.  Otherwise the question is how often compactions or region 
splits would force retries.

Is this crazy? 


