[ https://issues.apache.org/jira/browse/HBASE-13031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320401#comment-14320401 ]

Dave Latham commented on HBASE-13031:
-------------------------------------

The question is how to best bootstrap a new cluster from an old one that 
doesn't have enough disk space around to store an additional full table 
snapshot if major compaction creates new HFiles.

The general idea is to start replicating to the new cluster.  Then take a 
snapshot, copy it to the new cluster, and bulk load it into the table.  Data is 
copied only a single time.  However, since the data is so large (~1PB 
compressed), copying it over a WAN link will take weeks, during which time the 
table will undergo major compaction (we are also investigating the major 
compaction schedule, but let's assume we can't afford to simply disable it for 
weeks).  Holding a snapshot of the full table through a major compaction would 
end up doubling the storage usage on the source cluster, which would fill it 
up.  If we can break the snapshot/copy up into 2 or 4 chunks, then it wouldn't 
be a problem.
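To make the constraint concrete, here is a back-of-the-envelope sketch.  The 
~1PB figure is from this thread; the WAN throughput is purely an illustrative 
assumption, not a measured number:

```python
# Rough arithmetic: extra space pinned on the source cluster when a major
# compaction rewrites HFiles that a live snapshot still references.
# TABLE_SIZE_PB comes from the discussion above; WAN_GBPS is an assumed,
# illustrative effective throughput.

TABLE_SIZE_PB = 1.0   # ~1 PB compressed
WAN_GBPS = 1.0        # assumed WAN throughput in Gbit/s (illustrative)

def transfer_weeks(size_pb, gbps):
    """Weeks needed to ship `size_pb` petabytes at `gbps` Gbit/s."""
    bits = size_pb * 8e15            # 1 PB = 1e15 bytes
    seconds = bits / (gbps * 1e9)
    return seconds / (7 * 24 * 3600)

def peak_extra_space_pb(size_pb, chunks):
    """Worst case if a major compaction rewrites everything a snapshot
    references: one full extra copy of the chunk currently being shipped."""
    return size_pb / chunks

print(f"full table: {transfer_weeks(TABLE_SIZE_PB, WAN_GBPS):.1f} weeks, "
      f"up to {peak_extra_space_pb(TABLE_SIZE_PB, 1):.2f} PB extra")
for n in (2, 4):
    print(f"{n} chunks: {transfer_weeks(TABLE_SIZE_PB / n, WAN_GBPS):.1f} "
          f"weeks each, up to {peak_extra_space_pb(TABLE_SIZE_PB, n):.2f} PB extra")
```

At the assumed rate, a full-table copy holds a full extra copy of the data at 
risk for the whole transfer, while 4 chunks bound the extra space to a quarter 
of the table at any one time.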

Jesse asks "what good is a snapshot if it doesn't capture the state of an 
entire table?" It would certainly help with this use case, but I could also see 
it being used for efficient sampling (and exporting) of a portion of a table's 
data, or, for tables keyed by time ranges, for snapshotting the data for a 
specific period of time, as an example.

Andrew suggests using Export.  That would require exporting to the local dfs 
(as compressed sequence files), DistCp'ing the sequence files, then Importing 
them into the destination cluster.  That's two extra data copies on the 
surface, not counting the write amplification of Importing data through the 
memstore / flush / compaction process.  I'm not sure exactly what steps 
Vladimir is suggesting, but it again sounds like extra data copies.

This proposal seems like a small change to allow for an efficient bootstrap of 
a remote cluster under storage constraints.

> Ability to snapshot based on a key range
> ----------------------------------------
>
>                 Key: HBASE-13031
>                 URL: https://issues.apache.org/jira/browse/HBASE-13031
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: churro morales
>            Assignee: churro morales
>             Fix For: 2.0.0, 0.94.26, 1.1.0, 0.98.11
>
>
> Posted on the mailing list and seems like some people are interested.  A 
> little background for everyone.
> We have a very large table that we would like to snapshot and transfer to 
> another cluster (compressed data is always better to ship).  Our problem 
> lies in the fact that it could take many weeks to transfer all of the data, 
> and during that time, with major compactions, the data stored in dfs has 
> the potential to double, which would cause us to run out of disk space.
> So we were thinking about allowing the ability to snapshot a specific key 
> range.  
> Ideally I feel the approach is that the user would specify a start and stop 
> key, and those would be aligned to region boundaries.  If the boundaries 
> change between the time the user submits the request and the time the 
> snapshot is taken (due to merging or splitting of regions), the snapshot 
> should fail.
> We would know which regions to snapshot, and if those changed between when 
> the request was submitted and the regions locked, the snapshot could simply 
> fail and the user would try again, instead of potentially getting more or 
> less data than they had anticipated.  I was planning on storing the start / 
> stop key in the SnapshotDescription, and from there it looks pretty 
> straightforward: we just have to change the verifier code to accommodate 
> the key ranges.
> If this design sounds good to anyone, or if I am overlooking anything please 
> let me know.  Once we agree on the design, I'll write and submit the patches.
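The fail-if-boundaries-changed check described in the issue can be sketched as 
follows.  This is plain illustrative Python, with (start, end) byte-string 
pairs standing in for real region boundaries, not the actual HBase verifier 
code:

```python
# Sketch of the proposed verification: a key-range snapshot selects the
# regions overlapping [start_key, stop_key) at request time, and fails if
# those boundaries change (split/merge) before the regions are locked.
# Regions are modeled as (start, end) byte-string pairs; an empty string
# means the table's first/last boundary, mirroring HBase's convention.

def regions_in_range(regions, start_key, stop_key):
    """Return the regions whose key span overlaps [start_key, stop_key)."""
    def overlaps(r_start, r_end):
        after_start = r_end == b"" or r_end > start_key
        before_stop = stop_key == b"" or r_start < stop_key
        return after_start and before_stop
    return [r for r in regions if overlaps(*r)]

def verify_snapshot_regions(at_request, at_lock, start_key, stop_key):
    """False (snapshot should fail) if the covered region boundaries
    changed between request time and lock time, e.g. a split or merge."""
    return (regions_in_range(at_request, start_key, stop_key)
            == regions_in_range(at_lock, start_key, stop_key))

before = [(b"", b"g"), (b"g", b"p"), (b"p", b"")]
after_split = [(b"", b"g"), (b"g", b"k"), (b"k", b"p"), (b"p", b"")]
print(verify_snapshot_regions(before, before, b"g", b"p"))       # unchanged
print(verify_snapshot_regions(before, after_split, b"g", b"p"))  # split -> fail
```

The key design choice, per the issue, is to fail fast and let the user retry 
rather than silently snapshot more or less data than the requested range.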



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
