Ankit, thanks for starting this discussion. It'd be great to integrate streaming of WAL edits to a backup destination. We've done this for years internally at my company. It's critical to achieving an RPO of only a few minutes, but it's also complicated for us to maintain. Having it in HBase would benefit everyone.
I commented on the doc, but my main point is to ensure that this gets integrated into the hbase-backup module. We could build it such that it could be used separately, but I do think it should live there and work natively with those features. Before getting too deep here, it may make sense to read through the docs/designs/code of hbase-backup so that it can be integrated appropriately.

Admittedly, hbase-backup still has some rough edges that we've been working on. It was originally designed a few years ago and then stalled, and was only recently revived and integrated into our release branches. Having more contributors in the area would be great, both in terms of this new feature and in terms of integrating it into a cohesive solution and helping clean up the code.

I could imagine something like this:

- full backup weekly
- incremental backup daily
- continuous backup enabled with X days retention

In our experience, restoring a week's worth of WALs can be quite slow and computationally expensive for a large cluster. Any disaster recovery plan needs to define both an RPO (acceptable data loss) and an RTO (acceptable recovery time). Continuous backup helps tackle RPO, while incremental backup helps tackle RTO. Incremental backup is also more storage efficient than continuous, because HFiles are more storage efficient than WALs.

So one can decide what sort of retention policy to have on their continuous backups: maybe they only need an RPO of minutes for the most recent week, and beyond that an RPO of 1 day is ok. In that case they can keep 1 week of WALs, 1 month of daily backups, etc.

On Thu, Sep 26, 2024 at 10:17 PM Ankit Singhal <ankitsingha...@gmail.com> wrote:

> Hello everyone,
>
> We’ve been discussing an idea internally at Cloudera about implementing
> continuous backups using the replication workflow. The concept involves
> writing database edits to external storage for backup as soon as they’re
> written to the database, minimizing the gap between system failures and
> data availability. This approach would allow for recovery from accidental
> deletions, erroneous writes, or data corruption at any point in time.
>
> Additionally, it could serve as a cost-effective disaster recovery
> solution. While it offers a longer recovery time compared to a fully
> operational DR cluster, it significantly reduces the costs associated with
> running and maintaining a dedicated DR environment.
>
> The idea is still in its early stages, and we’re working through the finer
> details. However, we’ve created a document outlining the concept [1] and
> how it is gonna be different from current incremental backups.
>
> We’d greatly appreciate your feedback in the document: whether it’s about
> the viability of the idea, areas for improvement, or suggestions to
> simplify the approach
>
>
> [1]
>
> https://docs.google.com/document/d/1csQBMyM1mwpe4QpWkCbyqvsC9F5nUBr4ierOo8IuGpE/edit
>
>
> Regards,
>
> Ankit Singhal
>