[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338635#comment-16338635 ]
Vladimir Rodionov edited comment on HBASE-17852 at 1/25/18 2:43 AM:
--------------------------------------------------------------------

[~appy] wrote:
{quote}
Off the top of my head, I think the main areas to touch upon are:
- Make backups concurrent
- Use procedure framework: Long-standing request. The procv2 framework has features like locking, queuing operations, etc. Replication is already moving to it. I don't see a reason why backup can't too.
- Can't use CP hooks for incremental backup. Backup should/will become first class feature - more important and critical than Coprocessor.
- There should be some basic access control, if only, limiting everything to ADMIN (like RS group recently did in HBASE-19483)
{quote}

OK,

h4. Concurrent backups

It is doable, but ...
# It will require transaction management support, which complicates the implementation a lot. We will need to provide full isolation of operations and complex conflict resolution on commit. And what about rollback?
# It complicates testing a lot as well. Imagine all the possible collisions between create, merge, and delete sessions.

What I suggest is a slightly different approach:
# Make restore operations concurrent
# Implement fair queuing for *create-merge-delete* sessions
# *create-merge-delete* executions will be serialized (one by one), but from the user's point of view they will appear to run in parallel.

YES/NO

h4. Use procedure framework

Short answer: no. I will wait until procv2 becomes more mature and robust. I do not want to build a new feature on the foundation of another new feature. Too risky, in my opinion.

NO

h4. Can't use CP hooks for incremental backup

Currently backup lives in a separate module, and we would like to keep it there. There is no need for tight integration of HBase core and backup; therefore, CP is our only option here.

NO

h4. Access control

Currently, only ADMIN can run backup/restore/delete/merge operations, but we do not enforce this explicitly, so we should probably do the access rights check *before* starting a critical operation.

YES.

[~appy], [~elserj] - comments?

> Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-17852
>                 URL: https://issues.apache.org/jira/browse/HBASE-17852
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: HBASE-17852-v10.patch, screenshot-1.png
>
>
> Design approach rollback-via-snapshot implemented in this ticket:
> # Before a backup create/delete/merge starts, we take a snapshot of the backup meta-table (backup system table). This procedure is lightweight because the meta table is small and usually fits in a single region.
> # When an operation fails on the server side, we handle the failure by cleaning up partial data in the backup destination and then restoring the backup meta-table from the snapshot.
> # When an operation fails on the client side (abnormal termination, for example), the next time the user tries create/merge/delete they will see an error message saying that the system is in an inconsistent state and repair is required; they will need to run the backup repair tool.
> # To avoid multiple writers to the backup system table (the backup client and the BackupObservers), we introduce a small table used ONLY to keep the list of bulk loaded files. All backup observers will work only with this new table. The reason: in case of a failure during backup create/delete/merge/restore, when the system performs an automatic rollback, some data written by backup observers during the failed operation may be lost. This is what we try to avoid.
> # The second table keeps only bulk-load-related references.
> We do not care about the consistency of this table, because bulk load is an idempotent operation and can be repeated after a failure. Partially written data in the second table does not affect the BackupHFileCleaner plugin, because this data (the list of bulk loaded files) corresponds to files which have not yet been loaded successfully and hence are not visible to the system.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
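
The rollback-via-snapshot flow from the ticket description (snapshot the backup meta-table before the operation; on a server-side failure, clean up partial data in the destination and restore the meta-table from the snapshot) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the attached patch: `BackupMetaTable`, `BackupDestination`, `runWithRollback`, and `Recorder` are hypothetical names invented for this sketch and are not HBase APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical sketch of rollback-via-snapshot; none of these types exist in HBase. */
final class RollbackViaSnapshotSketch {

    /** Stand-in for the backup system (meta) table. */
    interface BackupMetaTable {
        void snapshot();         // taken before create/delete/merge starts
        void restoreSnapshot();  // rolls the metadata back after a failure
        void deleteSnapshot();   // drops the snapshot once it is no longer needed
    }

    /** Stand-in for the backup destination that may hold partial data. */
    interface BackupDestination {
        void cleanupPartialData();
    }

    /** Runs one backup operation, rolling back metadata and partial data on failure. */
    static <T> T runWithRollback(BackupMetaTable meta, BackupDestination dest,
                                 Callable<T> operation) throws Exception {
        meta.snapshot();
        T result;
        try {
            result = operation.call();
        } catch (Exception e) {
            // Order follows the ticket: clean the destination first, then restore metadata.
            dest.cleanupPartialData();
            meta.restoreSnapshot();
            meta.deleteSnapshot();
            throw e;
        }
        meta.deleteSnapshot();
        return result;
    }

    /** In-memory implementation that only records which steps were invoked. */
    static final class Recorder implements BackupMetaTable, BackupDestination {
        final List<String> events = new ArrayList<>();
        @Override public void snapshot() { events.add("snapshot"); }
        @Override public void restoreSnapshot() { events.add("restoreSnapshot"); }
        @Override public void deleteSnapshot() { events.add("deleteSnapshot"); }
        @Override public void cleanupPartialData() { events.add("cleanupPartialData"); }
    }
}
```

Calling `runWithRollback(recorder, recorder, () -> "done")` records only the snapshot and its deletion, while an operation that throws also records the cleanup and restore steps before the exception is rethrown. The sketch covers the server-side failure path only; the client-side case (abnormal termination followed by the repair tool) is not modeled here.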