Charles Mason wrote:
As far as I can tell there are two options:
...
2, Image the data stored on the HDFS cluster. Aren't there some big
issues with it not grabbing a consistent image as some updates won't
be flushed? Is there any way to force that, or to make it be
consistent some way, perhaps via snapshotting?
Yes (as others have said on this thread).
We need to add a means of snapshotting an HBase cluster: send a signal
to all members, who on receipt flush their in-memory content to the
filesystem, then write across the cluster some sort of snapshot label or
a manifest of all the files that comprise the snapshot. Thereafter, I'd
imagine an administrator would start up a big MR job to distcp the
snapshot from one filesystem to another on some other cluster. HBASE-50
is the pertinent issue.
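
To make the flush-then-manifest idea concrete, here is a rough sketch in
Python. It is a toy in-memory model only: the RegionServer class, the
snapshot() function, and the file-naming scheme are all made up for
illustration and are not HBase APIs, and it ignores writes that arrive
while the snapshot is in progress.

```python
from dataclasses import dataclass, field


@dataclass
class RegionServer:
    """Hypothetical stand-in for one cluster member; not an HBase class."""
    name: str
    memstore: dict = field(default_factory=dict)  # edits not yet on disk
    files: list = field(default_factory=list)     # names of flushed files
    flush_count: int = 0

    def put(self, key, value):
        # Buffer the edit in memory, as a server would before flushing.
        self.memstore[key] = value

    def flush(self):
        """Move in-memory edits into a new (simulated) file.

        Returns the names of every file this server now holds.
        """
        if self.memstore:
            self.flush_count += 1
            self.files.append(f"{self.name}/flush-{self.flush_count}")
            self.memstore.clear()
        return list(self.files)


def snapshot(servers, label):
    """Signal each member to flush, then collect a manifest of its files."""
    manifest = {"label": label, "files": []}
    for server in servers:
        manifest["files"].extend(server.flush())
    return manifest


servers = [RegionServer("rs1"), RegionServer("rs2")]
servers[0].put("row1", "a")
servers[1].put("row2", "b")
manifest = snapshot(servers, "snap-001")
print(manifest)
```

The manifest is what the copy job would consume: it names exactly the
files that make up the snapshot, so anything flushed afterwards is left
out of the copy.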
Related, this proposed feature in HDFS looks like it would make
snapshotting HDFS a breeze:
https://issues.apache.org/jira/browse/HADOOP-3637.
St.Ack