HBase offers several options for backing up the data stored in a table.
The "Export" tool is described in the O'Reilly book (HBase: The Definitive
Guide) and also here [
http://blog.cloudera.com/blog/2013/11/approaches-to-backup-and-disaster-recovery-in-hbase/comment-page-1/#comment-63294]
as a way to perform hot, incremental backups of a table.

Essentially, the procedure consists of:
- performing the backup from time 0 to time t1
- performing the backup from time t1 to time t2
- ... and so on
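
To make the windows above concrete, here is a minimal sketch of how they
could be turned into Export invocations. It assumes the usual argument order
of the Export job (table, output directory, versions, start time, end time,
with times in milliseconds since the epoch). The table name, the output
paths and the checkpoint values are hypothetical, and the script only prints
the commands rather than running them.

```python
from typing import List, Tuple

# Class name of the Export MapReduce job shipped with HBase.
EXPORT_CLASS = "org.apache.hadoop.hbase.mapreduce.Export"


def backup_windows(checkpoints: List[int]) -> List[Tuple[int, int]]:
    """Turn sorted checkpoints [0, t1, t2, ...] into (start, end) windows."""
    return list(zip(checkpoints, checkpoints[1:]))


def export_command(table: str, out_dir: str, start_ms: int, end_ms: int,
                   versions: int = 1) -> List[str]:
    """Build the argument list for one incremental Export run."""
    return ["hbase", EXPORT_CLASS, table, out_dir,
            str(versions), str(start_ms), str(end_ms)]


if __name__ == "__main__":
    # Hypothetical checkpoints: time 0, t1, t2 (milliseconds since the epoch).
    checkpoints = [0, 1_700_000_000_000, 1_700_003_600_000]
    for start, end in backup_windows(checkpoints):
        out_dir = f"/backups/mytable/{start}-{end}"  # hypothetical path
        print(" ".join(export_command("mytable", out_dir, start, end)))
```

As far as I understand, the end time is treated as exclusive, so using t1 as
the end of one window and the start of the next should neither skip nor
duplicate cells stamped exactly at t1.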

Suppose we want to perform an incremental backup from t1 to t2.
The backup will necessarily start at a time t3 greater than or equal to t2
and finish at a time t4.
An export backup is a MapReduce job that essentially scans HBase in order
to retrieve the data updated between t1 and t2.

Now, suppose that a client starts writing a particular cell right before t2
and updates it continuously with a different value every second.

Fresh data is written only to the WAL (which the export tool does not read)
and to the memstore, so every time the client writes a new value to the
cell, the old value is lost (assuming we are not using data versioning).

This means that if the client overwrites the cell after t2 but before t3,
the backup process will not export a consistent snapshot as of time t2;
instead, the backup will contain the fresh data written after t2. The same
could happen with data written by the client after t3 and before t4
(i.e. while the backup is in progress).

In order to make the incremental (consistent) backup work, I see two
options:
- Enable (infinite) version history for all data written to HBase (to
avoid overwriting in the memstore)
- Disable compaction temporarily, force a memstore flush (e.g. with a
"snapshot" command), perform the backup with t2 being the snapshot time,
then re-enable compaction.
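
To illustrate the race described above and why the first option would help,
here is a toy model in Python. It is not HBase code: ToyCellStore is a
hypothetical in-memory stand-in, and it bakes in my assumption that an
overwrite physically discards the old value when only one version is kept,
which may not match how HBase actually retains versions until a flush or a
compaction. The timestamps (t1 = 100, t2 = 200, writes from 199 to 204) are
made up.

```python
from typing import Dict, List, Optional, Tuple


class ToyCellStore:
    """Toy model of one table: row key -> list of (timestamp, value)."""

    def __init__(self, max_versions: Optional[int] = 1) -> None:
        # None keeps every version; otherwise expected to be >= 1.
        self.max_versions = max_versions
        self.cells: Dict[str, List[Tuple[int, str]]] = {}

    def put(self, key: str, ts: int, value: str) -> None:
        """Store a value; puts are assumed to arrive in timestamp order."""
        versions = self.cells.setdefault(key, [])
        versions.append((ts, value))
        if self.max_versions is not None:
            # Keep only the newest max_versions entries (the "overwrite").
            del versions[:-self.max_versions]

    def export(self, start: int, end: int) -> Dict[str, str]:
        """Newest value per key whose timestamp is in [start, end)."""
        result: Dict[str, str] = {}
        for key, versions in self.cells.items():
            in_range = [(ts, v) for ts, v in versions if start <= ts < end]
            if in_range:
                result[key] = max(in_range)[1]
        return result


if __name__ == "__main__":
    t1, t2 = 100, 200
    single = ToyCellStore(max_versions=1)     # no version history
    full = ToyCellStore(max_versions=None)    # "infinite" version history
    # The client starts right before t2 and rewrites the cell every tick,
    # continuing until the export finally starts (t3 = 205 here).
    for ts in range(199, 205):
        single.put("row1", ts, f"v{ts}")
        full.put("row1", ts, f"v{ts}")
    print("single version:", single.export(t1, t2))
    print("all versions:  ", full.export(t1, t2))
```

In this model the export filters strictly by timestamp, so with a single
version the overwritten cell is simply absent from the t1..t2 window rather
than showing the newer value. Either way, the value the cell had as of t2
cannot be recovered, whereas the store that keeps all versions still returns
it.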

I don't know if the second option is feasible as I did not find a way to
disable compaction temporarily.

Is there any other, reliable, feasible option to execute hot +
consistent + incremental backups with HBase?

Nicola
