Vinayak Hegde created HBASE-29518:
-------------------------------------
Summary: Support Moving Bulkloaded Files to External Storage in
Continuous Backup
Key: HBASE-29518
URL: https://issues.apache.org/jira/browse/HBASE-29518
Project: HBase
Issue Type: Task
Components: backup&restore
Reporter: Vinayak Hegde
Assignee: Vinayak Hegde
Currently, bulkloaded files are copied to external storage (e.g., S3) as part
of incremental backup, but not during continuous backup. This leaves a gap in
disaster recovery scenarios, as bulkloaded data may remain only on the source
cluster. If the cluster storage is lost, those files are unrecoverable even if
WALs are available.
We had previously implemented bulkload handling in continuous backup but
reverted it due to performance concerns
(https://issues.apache.org/jira/browse/HBASE-29406). At that time, we assumed
bulkload operations needed to be applied in strict order with WAL edits, which
added complexity and overhead.
*Why we are reconsidering this now:*
* *High bulkload usage:* Many users regularly use bulkload (often at scale,
e.g., generating HFiles with Spark and bulkloading them) as their primary data
ingestion method.
* *Order independence:* Recent discussions confirmed that in HBase, the order
between WAL replay and bulkload operations does not matter, because every update
(put/delete) carries a timestamp and reads resolve cells by timestamp rather
than by arrival order. This allows us to replay all WAL edits first and bulkload
the HFiles afterward, reducing complexity and performance impact.
* *Disaster recovery importance:* Storing all backup data, including
bulkloaded files, in an external location ensures recovery even if the entire
HDFS cluster is inaccessible or destroyed. Keeping backups off-cluster is a
best practice to protect against site-level failures.
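The order-independence point can be illustrated with a small, simplified model. This is only a sketch, not HBase code: the tuple layout and the names {{apply_ops}}, {{read}} and {{snapshot}} are hypothetical, the model assumes every cell for a given row has a distinct timestamp, and it ignores sequence IDs, column families and compaction.

```python
from itertools import permutations

# Hypothetical model: each operation is (kind, row, timestamp, value),
# where kind is "put" or "delete". This is not an HBase data structure.

def apply_ops(ops):
    """Append operations to a per-row list in arrival order."""
    store = {}
    for kind, row, ts, value in ops:
        store.setdefault(row, []).append((kind, ts, value))
    return store

def read(store, row):
    """Return the visible value for a row, resolved purely by timestamp."""
    cells = store.get(row, [])
    # A delete marker masks every put with a timestamp at or below its own.
    delete_ts = max((ts for kind, ts, _ in cells if kind == "delete"), default=None)
    live = [(ts, value) for kind, ts, value in cells
            if kind == "put" and (delete_ts is None or ts > delete_ts)]
    if not live:
        return None
    # Newest timestamp wins (assumes no two live puts share a timestamp).
    return max(live, key=lambda cell: cell[0])[1]

def snapshot(ops):
    """Apply ops in the given order and return the visible value of each row."""
    store = apply_ops(ops)
    return {row: read(store, row) for row in sorted(store)}

# Edits that arrived through the WAL.
wal_edits = [
    ("put", "r1", 10, "a"),
    ("delete", "r1", 20, None),
    ("put", "r2", 5, "x"),
]
# Cells that arrived through a bulkloaded HFile.
bulkloaded = [
    ("put", "r1", 30, "b"),
    ("put", "r2", 3, "old"),
    ("put", "r3", 15, "c"),
]

wal_first = snapshot(wal_edits + bulkloaded)
bulkload_first = snapshot(bulkloaded + wal_edits)
# Every interleaving of the six operations yields the same visible state.
all_orders_agree = all(
    snapshot(list(order)) == wal_first
    for order in permutations(wal_edits + bulkloaded)
)
```

In this model, replaying the WAL edits first and bulkloading afterward produces the same visible state as any other ordering, which is the property the restore workflow relies on. Cells with identical timestamps are outside what this sketch covers.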
*Proposed approach:*
* Update the continuous backup replication endpoint to copy bulkloaded files
to the backup location.
* Optimize performance through batching or asynchronous copying where possible.
* Restore workflow: replay WAL entries first, then bulkload HFiles from the
backup location.
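As a rough illustration of the "asynchronous copying" idea, the sketch below copies a batch of files concurrently with a thread pool. It is not the replication endpoint's implementation: the function {{copy_bulkloaded_files}} is hypothetical, a local directory stands in for the external backup location, and files are assumed to have unique names.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def copy_bulkloaded_files(hfiles, backup_dir, max_workers=4):
    """Copy a batch of files into backup_dir concurrently.

    Hypothetical helper: a local directory stands in for external storage,
    and file names are assumed to be unique within the batch.
    """
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    def copy_one(src):
        dst = backup_dir / Path(src).name
        shutil.copy2(src, dst)
        return dst

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order and propagates a copy
        # failure to the caller when its result is consumed.
        return list(pool.map(copy_one, hfiles))

# Small demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    staging = Path(tmp) / "staging"
    staging.mkdir()
    hfiles = []
    for i in range(3):
        path = staging / f"hfile-{i}"
        path.write_bytes(b"cells-%d" % i)
        hfiles.append(path)

    copied = copy_bulkloaded_files(hfiles, Path(tmp) / "backup")
    copied_names = [p.name for p in copied]
    copied_bytes = [p.read_bytes() for p in copied]
```

Submitting a whole batch to the pool lets the caller wait once per batch rather than once per file. A real endpoint would also need retry and failure handling, which this sketch leaves out.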
*Benefits:*
* Ensures all ingested data is protected in the backup location.
* Eliminates dependency on the source cluster for recovery.
* Aligns continuous backup behavior with incremental backup for consistency.