[ 
https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273376#comment-16273376
 ] 

Josh Elser commented on HBASE-17852:
------------------------------------

bq. Operators who'd rather avoid reading logs and having to run repair tools 
are 'lazy'.

bq. Do not we still have hbck for this reason? Repair \[...\] which happens 
periodically in HBase cluster.

Let me also expand on this: I would consider "lazy" as a virtue for operators. 
The system should automatically handle as much as possible. There's a 
fundamental difference between what hbck is and what `hbase backup repair` is: 
HBCK fixes things that inadvertently happen server-side (hopefully, only 
because of bugs which have since been fixed), whereas hbase-backup operations 
are completely client-driven. For example, something as benign as a user 
ctrl-C'ing a backup 
because they mis-typed the backup name or table being backed up would cause the 
backup table to need a repair.

bq. This is the question actually, should we do repair automatically or we need 
to inform user, that there was abnormal failure of a last backup/merge/delete 
command and user need to run repair.

I was about to write that I thought it was a no-brainer to blindly run a repair 
as a part of the BackupDriver, but now I wonder about the following:

Take two administrators running backups, unaware of each other. Admin1 starts a 
backup on Table1. Before Admin1's backup finishes, Admin2 tries to do a backup 
on Table2. Could Admin2 preempt/fail Admin1's backup by running a {{hbase 
backup repair}} while Admin1 is using the system?

In other words: does {{hbase backup repair}} have the ability to differentiate 
between "user is currently executing a backup" and "stale state exists in the 
table from an aborted/unfinished operation"?
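
To make the concern concrete, here is a minimal in-memory sketch. It assumes the "operation in progress" state is recorded as a single marker with no liveness information; the class and method names are hypothetical and are not taken from the backup code. Whether the real tool records more than this is exactly the open question.

```java
import java.util.Optional;

/**
 * Hypothetical in-memory model of an "operation in progress" marker.
 * Not the real backup system table; names are illustrative only.
 */
final class BackupMarkerSketch {
    // Description of the operation that set the marker, or null when unset.
    private String activeOperation;

    /** Sets the marker; refuses if one is already present. */
    void start(String operation) {
        if (activeOperation != null) {
            throw new IllegalStateException(
                "marker already set by: " + activeOperation);
        }
        activeOperation = operation;
    }

    /** Normal completion clears the marker. */
    void finish() {
        activeOperation = null;
    }

    /** Blind repair: clears the marker without checking whether its owner is alive. */
    void repair() {
        activeOperation = null;
    }

    /** What a second client can observe: only the marker itself. */
    Optional<String> marker() {
        return Optional.ofNullable(activeOperation);
    }

    public static void main(String[] args) {
        // Case A: Admin1 hit ctrl-C, so finish() never ran.
        BackupMarkerSketch stale = new BackupMarkerSketch();
        stale.start("backup Table1");

        // Case B: Admin1's backup is still running.
        BackupMarkerSketch live = new BackupMarkerSketch();
        live.start("backup Table1");

        // Admin2 observes the same marker in both cases.
        System.out.println("indistinguishable: "
            + stale.marker().equals(live.marker()));

        // A blind repair in case B removes the marker of a running backup.
        live.repair();
        System.out.println("live marker after repair: "
            + live.marker().isPresent());
    }
}
```

Under that assumption, an automatic repair in the BackupDriver is only safe if something extra (an owner id, a timestamp, a heartbeat) lets it tell case A from case B.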

> Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental 
> backup)
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-17852
>                 URL: https://issues.apache.org/jira/browse/HBASE-17852
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>             Fix For: 2.0.0
>
>         Attachments: HBASE-17852-v1.patch, HBASE-17852-v2.patch, 
> HBASE-17852-v3.patch, HBASE-17852-v4.patch, HBASE-17852-v5.patch, 
> HBASE-17852-v6.patch, HBASE-17852-v7.patch, HBASE-17852-v8.patch, 
> HBASE-17852-v9.patch
>
>
> Design approach (rollback-via-snapshot) implemented in this ticket:
> # Before a backup create/delete/merge starts, we take a snapshot of the 
> backup meta-table (the backup system table). This procedure is lightweight 
> because the meta-table is small and usually fits in a single region.
> # When an operation fails on the server side, we handle the failure by 
> cleaning up partial data in the backup destination and then restoring the 
> backup meta-table from the snapshot. 
> # When an operation fails on the client side (abnormal termination, for 
> example), the next time the user tries create/merge/delete, they will see an 
> error message saying that the system is in an inconsistent state and that a 
> repair is required; they will need to run the backup repair tool.
> # To avoid multiple writers to the backup system table (the backup client and 
> the BackupObservers), we introduce a small table used ONLY to keep the list 
> of bulk loaded files. All backup observers will work only with this new 
> table. The reason: if a failure occurs during backup 
> create/delete/merge/restore and the system performs an automatic rollback, 
> some data written by backup observers during the failed operation may be 
> lost. This is what we are trying to avoid.
> # The second table keeps only bulk-load-related references. We do not care 
> about the consistency of this table, because bulk load is an idempotent 
> operation and can be repeated after a failure. Partially written data in the 
> second table does not affect the BackupHFileCleaner plugin, because this data 
> (the list of bulk loaded files) corresponds to files which have not yet been 
> loaded successfully and hence are not visible to the system.
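
The first three steps of the design above can be sketched as follows. This is a minimal in-memory model, not the patch itself: a `HashMap` stands in for the backup system table, a copy of the map stands in for an HBase snapshot, and every class and method name is hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical model of rollback-via-snapshot; not the HBase implementation. */
final class RollbackViaSnapshotSketch {
    // Stand-in for the backup system table.
    private Map<String, String> metaTable = new HashMap<>();
    // Copy taken before the current operation, or null if none is pending.
    private Map<String, String> snapshot;

    /** Steps 1 and 2: snapshot, run the operation, restore on failure. */
    boolean runWithRollback(Consumer<Map<String, String>> operation) {
        if (snapshot != null) {
            // Step 3: a previous client died mid-operation.
            throw new IllegalStateException("inconsistent state: run repair");
        }
        snapshot = new HashMap<>(metaTable);
        try {
            operation.accept(metaTable);
            snapshot = null; // success: the snapshot is no longer needed
            return true;
        } catch (RuntimeException e) {
            metaTable = snapshot; // failure: restore the pre-operation state
            snapshot = null;
            return false;
        }
    }

    /** Models a client killed mid-operation: the snapshot is left behind. */
    void crashDuring(Consumer<Map<String, String>> partialOperation) {
        snapshot = new HashMap<>(metaTable);
        partialOperation.accept(metaTable);
        // No cleanup here: the process is assumed to have died at this point.
    }

    /** Models the repair tool: restore from the leftover snapshot, if any. */
    void repair() {
        if (snapshot != null) {
            metaTable = snapshot;
            snapshot = null;
        }
    }

    /** Read-only view of the current table contents. */
    Map<String, String> view() {
        return Collections.unmodifiableMap(metaTable);
    }

    public static void main(String[] args) {
        RollbackViaSnapshotSketch s = new RollbackViaSnapshotSketch();
        s.runWithRollback(t -> t.put("backup_1", "COMPLETE"));
        s.crashDuring(t -> t.put("backup_2", "RUNNING"));
        s.repair();
        System.out.println(s.view());
    }
}
```

Note that `repair()` in this model has no way to know whether the operation that left the snapshot behind is dead or still running, which is the case the comment above asks about.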



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
