Ankit, thanks for starting this discussion. It'd be great to integrate
streaming of WAL edits to a backup destination. We've done this for years
internally at my company. It's critical to achieving only a few minutes of
RPO, but also complicated for us to maintain. Having it in hbase would
benefit all.

I commented on the doc, but my main point is that this should be
integrated into the hbase-backup module. We could build it so that it can
also be used separately, but I do think it should live there and work
natively with the existing backup features. Before getting too deep here,
it may make sense to read through the docs, designs, and code of
hbase-backup so that it can be integrated appropriately.

Admittedly, hbase-backup still has some rough edges that we've been
working on. It was originally designed a few years ago, then stalled, and
was only recently revived and integrated into our release branches. Having
more contributors in the area would be great, both in terms of this new
feature and in terms of integrating it into a cohesive solution and helping
clean up the code. I could imagine something like this:

- full backup weekly
- incremental backup daily
- continuous backup enabled with X days retention

In our experience, restoring a week's worth of WALs can be quite slow and
computationally expensive for a large cluster. Any disaster recovery plan
needs to define both an RPO (acceptable data loss) and an RTO (acceptable
recovery time). Continuous backup helps tackle RPO, while incremental
backup helps tackle RTO, since it limits how many WALs have to be replayed
on restore. Incremental backup is also more storage efficient than
continuous, because HFiles are more storage efficient than WALs. So one can
decide what sort of retention policy to have on their continuous backups --
maybe they only need an RPO of minutes for the most recent week, and beyond
that an RPO of 1 day is ok. So they can keep 1 week of WALs, 1 month of
daily backups, etc.
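
To make the tiering concrete, here is a rough sketch of how the retention
windows translate into restore granularity, and how a restore would be
assembled from a full backup, the incrementals after it, and a short WAL
replay. This is illustration only and not hbase-backup code: the function
names, the constants, and the idea of representing backups as plain
timestamps are all assumptions of mine, using the example numbers above.

```python
from datetime import datetime, timedelta

# Hypothetical retention settings, taken from the example figures above.
WAL_RETENTION = timedelta(days=7)      # continuous backup (WALs) kept 1 week
DAILY_RETENTION = timedelta(days=30)   # daily incremental backups kept 1 month


def achievable_rpo(now, target):
    """Return the restore granularity available for a target point in time."""
    age = now - target
    if age < timedelta(0):
        raise ValueError("target is in the future")
    if age <= WAL_RETENTION:
        return "minutes"   # WALs still retained: point-in-time restore
    if age <= DAILY_RETENTION:
        return "1 day"     # only the daily incrementals remain
    return None            # outside every retention window


def restore_plan(target, fulls, incrementals):
    """Pick the backups to restore and the WAL range left to replay.

    fulls and incrementals are lists of backup completion timestamps.
    Raises ValueError if no full backup exists at or before the target.
    """
    base = max(t for t in fulls if t <= target)
    incs = sorted(t for t in incrementals if base < t <= target)
    replay_from = incs[-1] if incs else base
    return base, incs, (replay_from, target)
```

With daily incrementals, the WAL replay window returned by restore_plan is
always less than a day long, which is the RTO benefit described above. Without
them it could stretch back to the last weekly full backup.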

On Thu, Sep 26, 2024 at 10:17 PM Ankit Singhal <ankitsingha...@gmail.com>
wrote:

> Hello everyone,
>
> We’ve been discussing an idea internally at Cloudera about implementing
> continuous backups using the replication workflow. The concept involves
> writing database edits to external storage for backup as soon as they’re
> written to the database, minimizing the gap between system failures and
> data availability. This approach would allow for recovery from accidental
> deletions, erroneous writes, or data corruption at any point in time.
>
> Additionally, it could serve as a cost-effective disaster recovery
> solution. While it offers a longer recovery time compared to a fully
> operational DR cluster, it significantly reduces the costs associated with
> running and maintaining a dedicated DR environment.
>
> The idea is still in its early stages, and we’re working through the finer
> details. However, we’ve created a document outlining the concept [1] and
> how it is gonna be different from current incremental backups.
>
> We’d greatly appreciate your feedback in the document: whether it’s about
> the viability of the idea, areas for improvement, or suggestions to
> simplify the approach.
>
>
> [1]
>
> https://docs.google.com/document/d/1csQBMyM1mwpe4QpWkCbyqvsC9F5nUBr4ierOo8IuGpE/edit
>
>
> Regards,
>
> Ankit Singhal
>
