[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344578#comment-16344578 ]

Appy commented on HBASE-17852:
------------------------------

bq. My intent was not to squash your opinions, but to avoid being blocked if you were not interested/busy as seemed might be the case.

That's reasonable. Sorry for the delay on my part.

[~elserj] I believe you. Everyone makes mistakes from time to time; I'm certain I must have too. Always happy with the "acknowledge, learn, and move past them" way.
All's good (between us two).

> Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-17852
>                 URL: https://issues.apache.org/jira/browse/HBASE-17852
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: HBASE-17852-v10.patch, screenshot-1.png
>
>
> Design approach rollback-via-snapshot implemented in this ticket:
> # Before a backup create/delete/merge starts, we take a snapshot of the backup meta-table (backup system table). This procedure is lightweight because the meta-table is small and usually fits in a single region.
> # When an operation fails on the server side, we handle the failure by cleaning up partial data in the backup destination, followed by restoring the backup meta-table from the snapshot.
> # When an operation fails on the client side (abnormal termination, for example), the next time the user tries create/merge/delete they will see an error message saying that the system is in an inconsistent state and that repair is required; they will need to run the backup repair tool.
> # To avoid multiple writers to the backup system table (the backup client and BackupObservers), we introduce a small table ONLY to keep the listing of bulk loaded files. All backup observers will work only with this new table. The reason: in case of a failure during backup create/delete/merge/restore, when the system performs automatic rollback, some data written by backup observers during the failed operation may be lost. This is what we try to avoid.
> # The second table keeps only bulk load related references. We do not care about the consistency of this table, because bulk load is an idempotent operation and can be repeated after failure. Partially written data in the second table does not affect the BackupHFileCleaner plugin, because this data (the list of bulk loaded files) corresponds to files which have not yet been loaded successfully and hence are not visible to the system.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
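For readers following the thread, the rollback-via-snapshot design quoted in the issue description above can be sketched roughly as follows. This is a toy in-memory model in Python, not HBase code: `ToyBackupStore`, `BackupFailed`, `ClientCrash` and every method name are hypothetical and exist only for illustration. It assumes the snapshot is a plain copy of the meta-table and that "cleaning up partial data" means dropping files written since the operation started.

```python
import copy


class BackupFailed(Exception):
    """Hypothetical server-side failure raised by a backup step."""


class ClientCrash(Exception):
    """Hypothetical stand-in for abnormal client termination."""


class ToyBackupStore:
    """In-memory stand-in for the two tables in the description (illustrative only)."""

    def __init__(self):
        self.system_table = {}     # backup meta-table, written by the backup client
        self.bulk_load_table = {}  # separate table, written only by observers
        self.destination = []      # files copied to the backup destination
        self._snapshot = None      # held while an operation is in flight
        self._files_before = 0

    def needs_repair(self):
        # A leftover snapshot means an earlier operation never completed.
        return self._snapshot is not None

    def run_operation(self, operation):
        if self.needs_repair():
            raise RuntimeError("inconsistent state; run repair first")
        # Step 1: snapshot the meta-table before create/delete/merge.
        self._snapshot = copy.deepcopy(self.system_table)
        self._files_before = len(self.destination)
        try:
            operation(self)
        except BackupFailed:
            # Step 2: server-side failure, so clean up and restore immediately.
            self._rollback()
            raise
        # Any other exception (step 3) leaves the snapshot behind for repair().
        self._snapshot = None

    def repair(self):
        if self.needs_repair():
            self._rollback()

    def _rollback(self):
        del self.destination[self._files_before:]  # drop partial backup data
        self.system_table = self._snapshot         # restore the meta-table
        self._snapshot = None                      # bulk_load_table is untouched


def failing_backup(store):
    store.system_table["backup_1"] = "RUNNING"
    store.destination.append("part-0001")
    store.bulk_load_table["hfile-a"] = "table1"  # observer write during the operation
    raise BackupFailed("simulated server-side failure")


def crashing_backup(store):
    store.system_table["backup_2"] = "RUNNING"
    store.destination.append("part-0002")
    raise ClientCrash("simulated client termination")


store = ToyBackupStore()
store.system_table["backup_0"] = "COMPLETE"

try:
    store.run_operation(failing_backup)
except BackupFailed:
    pass
state_after_server_failure = (
    dict(store.system_table), list(store.destination), dict(store.bulk_load_table)
)

try:
    store.run_operation(crashing_backup)
except ClientCrash:
    pass
repair_needed = store.needs_repair()
store.repair()
```

In this sketch the observer's entry in `bulk_load_table` survives the rollback, which is the point of items 4 and 5: the rollback restores only the meta-table, so observer writes made during a failed operation are not lost.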
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1631#comment-1631 ]

Josh Elser commented on HBASE-17852:
------------------------------------

bq. I'm not messing with you, Appy. Check the push logs/comments on the other JIRA issue.. I swear to you that I did not push this until after I heard back from you. My guess is that this is due to me using git-am or cherry picking a commit from a local branch.

My apologies, Appy. I am wrong. I apparently got impatient and pushed this because there was silence from the Dec 6th mention and the Jan 12th re-ping. My intent was not to squash your opinions, but to avoid being blocked if you were not interested/busy as seemed might be the case.

If you have since changed your mind about the reduced patch hitting master, my offer to revert stands.

My apologies again for arguing with you while in the wrong.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344425#comment-16344425 ]

Vladimir Rodionov commented on HBASE-17852:
-------------------------------------------

Hmm, did I insult someone savagely, [~elserj]?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344416#comment-16344416 ]

Josh Elser commented on HBASE-17852:
------------------------------------

[~vrodionov] that's also plenty shit-slinging from you too on the matter. Thanks.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344407#comment-16344407 ]

Josh Elser commented on HBASE-17852:
------------------------------------

Also, again, if you want this reverted, please say so.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344404#comment-16344404 ]

Josh Elser commented on HBASE-17852:
------------------------------------

bq. I did say it was okay to go in master, but that's like 4 days after the commit - 2018-01-16T12:46:19-0800

I'm not messing with you, [~appy]. Check the push logs/comments on the other JIRA issue.. I swear to you that I did not push this until after I heard back from you. My guess is that this is due to me using git-am or cherry picking a commit from a local branch.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344380#comment-16344380 ]

Vladimir Rodionov commented on HBASE-17852:
-------------------------------------------

{quote}
that's precisely the reason why i can't trust you.
{quote}
You can start a discussion about trust and respect in the HBase community, and I assure you I have a lot to say about it.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344359#comment-16344359 ]

Appy commented on HBASE-17852:
------------------------------

If your mental radar doesn't tick-off in loud red alarms between the time of choosing the second approach and getting someone to commit it, that's precisely the reason why i can't trust you.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344341#comment-16344341 ]

Vladimir Rodionov commented on HBASE-17852:
-------------------------------------------

{quote}
I did say it was okay to go in master, but that's like 4 days after the commit - 2018-01-16T12:46:19-0800
{quote}
OK, there was an issue found during QA testing - HBASE-19568. It turned out that HBASE-17852 fixes the issue. Let us say I had two options:
# Find out which part of HBASE-17852 fixes the issue and create a smaller, HBASE-19568-specific patch
# Apply the HBASE-17852 patch directly (with some of the refactoring stripped out)

So I have chosen the latter one. Reasons: time, time, time. We can revert HBASE-19568 if there are so many objections.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344327#comment-16344327 ]

Appy commented on HBASE-17852:
------------------------------

Commit date is 12th Jan
{noformat}
commit a5601c8eac6bfcac7d869574547f505d44e49065
Author:     Vladimir Rodionov
AuthorDate: Wed Jan 10 16:26:09 2018 -0800
Commit:     Josh Elser
CommitDate: Fri Jan 12 13:13:17 2018 -0500

    HBASE-19568: Restore of HBase table using incremental backup doesn't restore rows from an earlier incremental backup

    Signed-off-by: Josh Elser
{noformat}
I did say it was okay to go in master, but that's like 4 days after the commit - 2018-01-16T12:46:19-0800
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344314#comment-16344314 ]

Appy commented on HBASE-17852:
------------------------------

Okay, f**k it, I really don't want to waste any more of my time fighting some fight.
It's obvious from events what happened here, and that it shouldn't have - makes me very sad and angry.
I leave its further handling to PMC.
At the very least, someone lost my basic trust and respect.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344310#comment-16344310 ]

Josh Elser commented on HBASE-17852:
------------------------------------

bq. I'll can't believe that because I can't believe that..

[~appy], truly, boss, if you weren't giving your blessing on the fix going into master, say so and I'll revert it when next at a computer. I was operating under the assumption that we had time to address design and not look gift-contribution(horses) in the mouth. The rest of this is a product of some heavy-handedness about the busted Yetus after the JIRA upgrade.

Not trying to tell you something different than what you think happened, did. Trying to express that I thought you were ok with this plan against master (not branch-2).
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344307#comment-16344307 ]

Vladimir Rodionov commented on HBASE-17852:

{quote}
Wasn't the said patch objected against committing by multiple members of the community?
{quote}
Calm down, [~appy]. We are not doing anything criminal here. The result of these two patches is what you have agreed on personally : https://issues.apache.org/jira/browse/HBASE-17852?focusedCommentId=16327774&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16327774
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344302#comment-16344302 ]

stack commented on HBASE-17852:

{quote}The majority of this code (but not all) went into master in HBASE-19568 btw. {quote}
The majority of 'HBASE-17852 Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)', a contentious issue, went into another commit named 'HBASE-19568 Restore of HBase table using incremental backup doesn't restore rows from an earlier incremental backup' with no outline of what made it and what did not, and no changeset explanation. There is no release note. The two JIRAs are not even linked.

{quote}Nope, it turned out that this patch (HBASE-17852) also fixes the issue raised in HBASE-19568, that is why it was committed (with refactoring code stripped down). No conspiracy here. {quote}
But hang on, now the patch here on 'fault tolerance' fixes issues over in the 'restore rows' issue, -HBASE-19568?- I can see how [~appy] might arrive at his assessment.

On the 'declarations', the first offers options free of context or explanation. This one I find interesting:

# Use procedure framework: Short answer - no. I will wait until procv2 becomes more mature and robust. I do not want to build new feature on a foundation of a new feature. Too risky in my opinion. NO

when we are talking about a hbase3 (possibly) feature and when there is no alternative.

Anyway, keeping it short.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344287#comment-16344287 ]

Vladimir Rodionov commented on HBASE-17852:

{quote}
he had random urge to delete all previous 9 patches from this jira
{quote}
No conspiracy here as well. I was not able to submit patch v10 due to some Apache Jira issues and had to remove all previous patches to be able to submit v10.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344272#comment-16344272 ]

Appy commented on HBASE-17852:

bq. There was nothing malicious intending to happen here

I'll can't believe that because I can't believe that
- he started fixing the other jira from clean slate and somehow mysteriously ended up with exact same diff as was here, and which we all were against.
- he had random urge to delete all previous 9 patches from this jira, but not from phase1 jira HBASE-14030 or phase2 jira HBASE-14123, which both have like 40 patches each
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344258#comment-16344258 ]

Appy commented on HBASE-17852:

bq. Nope, it turned out that this patch (HBASE-17852) also fixes the issue raised in HBASE-19568, that is why it was committed (with refactoring code stripped down).

Not a justification! Did you not use the patch in this jira to fix HBASE-19568? Wasn't the said patch objected against committing by multiple members of the community? Did you bring to the attention of anyone who raised the objections (me/stack/andrew/[~mdrob]) the fact that you were committing these changes?

bq. No conspiracy here. Besides this, I thought that we have agreed on pushing this to the master branch and continue working on a critical changes after that?

You really think that'd work? People can match timestamps, you committed 4 days before i even replied back!
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344231#comment-16344231 ]

Josh Elser commented on HBASE-17852:

{quote}
HBASE-19568 had basically everything that was objected in the reviews here, why wasn't it brought to the attention of people who raised objections? The title/reason of that jira reason doesn't matter. I see it as a really sly move - going behind community and committed changes which were heavily objected against, by using separate jira.
{quote}
[~appy], let's take a step back, please. I called this out to your attention -- I was under the impression, based on your earlier comment ([here|https://issues.apache.org/jira/browse/HBASE-17852?focusedCommentId=16327774&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16327774]), that you were OK with this implementation landing in master as-is. HBASE-19568 was used to commit to master (with what I thought was your blessing) while we continue to use this JIRA issue to flesh out design because of all of the discussion that has happened.

If I misunderstood you or poorly asked you the question, let's take that over to HBASE-19568 and get a revert in place. There was nothing malicious intending to happen here.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344228#comment-16344228 ]

Vladimir Rodionov commented on HBASE-17852:

Nope, it turned out that this patch (HBASE-17852) also fixes the issue raised in HBASE-19568, that is why it was committed (with refactoring code stripped down). No conspiracy here.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344219#comment-16344219 ]

Appy commented on HBASE-17852:

Forget all the design discussion, that's not important anymore.

HBASE-19568 had basically everything that was objected in the reviews here, why wasn't it brought to the attention of people who raised objections? The title/reason of that jira reason doesn't matter. I see it as a really sly move - going behind community and committed changes which were heavily objected against, by using separate jira.

Ping reviewers of other jira: [~elserj] [~tedyu]
Ping [~stack] [~apurtell]
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344163#comment-16344163 ]

Vladimir Rodionov commented on HBASE-17852:

I will quote myself
{quote}
I will rebase patch to the current master. The majority of this code (but not all) went into master in HBASE-19568 btw.
{quote}
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344160#comment-16344160 ]

Appy commented on HBASE-17852:

I see only patch v10 in the attached files, and all it's doing is changing name of BackupSystemTable to BackupMetaTable. It's far from what the title says - "Add Fault Tolerance". What am i missing?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339615#comment-16339615 ]

Vladimir Rodionov commented on HBASE-17852:

[~appy] we have fully functional module already, but you suggest rewriting 20%-40% of code. That is why my response is so strong. As for procv2, I have heard a lot from other developers who worked on procv2-related bugs. Backup is not like table create, truncate, split etc - it is in its own league.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339026#comment-16339026 ]

Appy commented on HBASE-17852:
------------------------------

Man (lightly shaking head side-to-side)... such strong responses when we are trying to scope out needed work/design changes for a better B&R in 2.1. Please work with me here... smile.
Why do you believe procv2 is a new feature? It has been used for core HBase functionality (create, delete tables, etc.) since the 1.2 release.
What would make it mature and robust enough for B&R, in your opinion?

> Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
>
> Key: HBASE-17852
> URL: https://issues.apache.org/jira/browse/HBASE-17852
> Project: HBase
> Issue Type: Sub-task
> Reporter: Vladimir Rodionov
> Assignee: Vladimir Rodionov
> Priority: Major
> Fix For: 3.0.0
> Attachments: HBASE-17852-v10.patch, screenshot-1.png
>
> Design approach rollback-via-snapshot implemented in this ticket:
> # Before a backup create/delete/merge starts, we take a snapshot of the backup meta-table (backup system table). This procedure is lightweight because the meta table is small and usually fits in a single region.
> # When an operation fails on the server side, we handle the failure by cleaning up partial data in the backup destination, followed by restoring the backup meta-table from the snapshot.
> # When an operation fails on the client side (abnormal termination, for example), the next time the user tries create/merge/delete, they will see an error message saying that the system is in an inconsistent state and repair is required; they will need to run the backup repair tool.
> # To avoid multiple writers to the backup system table (the backup client and BackupObservers), we introduce a small table used ONLY to keep the listing of bulk loaded files. All backup observers will work only with this new table. The reason: in case of a failure during backup create/delete/merge/restore, when the system performs automatic rollback, some data written by backup observers during the failed operation may be lost. This is what we try to avoid.
> # The second table keeps only bulk load related references. We do not care about the consistency of this table, because bulk load is an idempotent operation and can be repeated after failure. Partially written data in the second table does not affect the BackupHFileCleaner plugin, because this data (the list of bulk loaded files) corresponds to files that have not yet been loaded successfully and hence are not visible to the system.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
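The rollback-via-snapshot steps in the issue description can be illustrated with a small, self-contained sketch. This is not HBase code: plain dictionaries stand in for the backup system table, the backup destination, and the separate bulk-load table, and the names `run_with_rollback`, `BackupFailed`, and `failing_backup` are hypothetical, chosen only for this example.

```python
import copy


class BackupFailed(Exception):
    """Raised by an operation to simulate a server-side failure."""


def run_with_rollback(meta_table, destination, operation):
    """Run `operation`; on failure clean the destination and restore the meta table.

    meta_table  -- dict standing in for the backup system table (assumption)
    destination -- dict standing in for the backup destination (assumption)
    operation   -- callable(meta_table, destination) that may raise BackupFailed
    Returns True on success, False if the operation was rolled back.
    """
    snapshot = copy.deepcopy(meta_table)   # step 1: snapshot before the operation
    files_before = set(destination)
    try:
        operation(meta_table, destination)
        return True
    except BackupFailed:
        # step 2a: remove partial data written to the backup destination
        for name in set(destination) - files_before:
            del destination[name]
        # step 2b: restore the meta table from the snapshot
        meta_table.clear()
        meta_table.update(snapshot)
        return False


def failing_backup(meta_table, destination):
    """Writes some partial state, then fails."""
    meta_table["backup_2"] = "RUNNING"
    destination["backup_2/part-0"] = b"partial"
    raise BackupFailed("simulated failure")


meta = {"backup_1": "COMPLETE"}
dest = {"backup_1/part-0": b"data"}
# Kept in its own table so that a rollback of `meta` never discards it.
bulk_load_table = {"hfile-a": "pending"}

ok = run_with_rollback(meta, dest, failing_backup)
```

After the failed run, `meta` and `dest` are back to their pre-operation contents, while `bulk_load_table` is untouched because the rollback never looks at it. That separation is the point of items 4 and 5 in the description.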
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338635#comment-16338635 ]

Vladimir Rodionov commented on HBASE-17852:
-------------------------------------------

[~appy] wrote:
{quote}
Off the top of my head, I think the main areas to touch upon are:
- Make backups concurrent
- Use procedure framework: Long-standing request. The procv2 framework has features like locking, queuing operations, etc. Replication is already moving to it. I don't see a reason why backup can't too.
- Can't use CP hooks for incremental backup. Backup should/will become first class feature - more important and critical than Coprocessor.
- There should be some basic access control, if only, limiting everything to ADMIN (like RS group recently did in HBASE-19483)
{quote}

OK,

h4. Concurrent backups

It is doable, but ...
# It will require transaction management support, which complicates the implementation a lot. We will need to provide full isolation of operations and complex conflict resolution on commit. And rollback?
# It complicates testing as well, a lot. Imagine all the different possible collisions between create, merge, and delete sessions.

What I suggest is a slightly different approach:
# Make restore operations concurrent
# Implement fair queuing for *create-merge-delete* sessions
# *create-merge-restore* executions will be serialized (one-by-one), but from the user's point of view they will run, kind of, in parallel.

YES/NO

h4. Use procedure framework

Short answer - no. I will wait until procv2 becomes more mature and robust. I do not want to build a new feature on the foundation of a new feature. Too risky in my opinion.

NO

h4. Can't use CP hooks for incremental backup

Currently backup lives in a separate module and we would like to keep it there. There is no need for tight integration of HBase core and backup, and therefore CP is our only option here.

NO

h4. Access control

Currently, only ADMIN can run backup/restore/delete/merge operations, but we do not enforce this explicitly, so we should probably do the access rights check *before* starting a critical operation.

YES.

[~appy], [~elserj] - comments?
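The queuing idea in the comment above (sessions accepted from many users but executed one-by-one) might look roughly like the following sketch. It assumes "fair" means first-in, first-out, which the comment does not spell out, and the class `SessionQueue` and its methods are hypothetical names, not part of the backup module.

```python
from collections import deque


class SessionQueue:
    """FIFO queue that runs create/merge/delete sessions one at a time."""

    ALLOWED = ("create", "merge", "delete")

    def __init__(self):
        self._pending = deque()
        self.completed = []

    def submit(self, kind, table):
        """Enqueue a session and return its position in the queue."""
        if kind not in self.ALLOWED:
            raise ValueError(f"unsupported session kind: {kind}")
        self._pending.append((kind, table))
        return len(self._pending)

    def drain(self):
        """Execute queued sessions strictly in submission order."""
        while self._pending:
            session = self._pending.popleft()
            # A real session would do its work here; the sketch only records it.
            self.completed.append(session)
        return self.completed


queue = SessionQueue()
queue.submit("create", "t1")
queue.submit("delete", "t1")
position = queue.submit("merge", "t2")
order = queue.drain()
```

Callers get an immediate answer from `submit`, so the sessions appear to be accepted in parallel, while `drain` guarantees that only one of them touches the backup system table at a time.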
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336001#comment-16336001 ]

Josh Elser commented on HBASE-17852:
------------------------------------

{quote}I'm fine with this landing in master. I'll try to take a thorough look at the code after 2.0 release (If i miss that, i'll consider myself ineligible for casting any +/- 1).
{quote}
Thanks Appy. Your input is appreciated. I think the direction you're proposing makes sense, but it might be premature to push this forward right now. I've been seeing some funkiness in branch-2 work around procv2. Letting it burn in on branch-2 first is probably a good idea. I'm glad we can help Vlad move forward now and revisit this later.

I'm +1 on this one. Committing it now to master.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331092#comment-16331092 ]

Hadoop QA commented on HBASE-17852:
-----------------------------------

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 4m 56s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | findbugs | 0m 0s | Findbugs executables are not available. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 19 new or modified test files. |
|| || || || master Compile Tests ||
| 0 | mvndep | 1m 26s | Maven dependency ordering for branch |
| +1 | mvninstall | 7m 35s | master passed |
| +1 | compile | 0m 49s | master passed |
| +1 | checkstyle | 0m 41s | master passed |
| +1 | shadedjars | 6m 27s | branch has no errors when building our shaded downstream artifacts. |
| +1 | javadoc | 0m 31s | master passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 19s | Maven dependency ordering for patch |
| +1 | mvninstall | 5m 41s | the patch passed |
| +1 | compile | 0m 54s | the patch passed |
| +1 | javac | 0m 54s | the patch passed |
| +1 | checkstyle | 0m 20s | hbase-backup: The patch generated 0 new + 162 unchanged - 2 fixed = 162 total (was 164) |
| +1 | checkstyle | 0m 24s | The patch hbase-it passed checkstyle |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 5m 57s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 24m 53s | Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. |
| +1 | javadoc | 0m 25s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 13m 8s | hbase-backup in the patch passed. |
| +1 | unit | 0m 46s | hbase-it in the patch passed. |
| +1 | asflicense | 0m 19s | The patch does not generate ASF License warnings. |
| | | 69m 21s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-17852 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12906516/HBASE-17852-v10.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 550f5f3def54 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 11:55:51 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 09ffbb5b68 |
| maven | version: Apa
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329767#comment-16329767 ]

Hadoop QA commented on HBASE-17852:
-----------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 5s | HBASE-17852 does not apply to master. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/0.6.0/precommit-patchnames for help. |

|| Subsystem || Report/Notes ||
| JIRA Issue | HBASE-17852 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12887458/HBASE-17852-v1.patch |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/11096/console |
| Powered by | Apache Yetus 0.6.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329648#comment-16329648 ]

Vladimir Rodionov commented on HBASE-17852:
-------------------------------------------

{quote}
For now, what do you think are the biggest blockers for making procv2 + backup happen [~vrodionov]?
{quote}
If we could do the procv2 implementation without getting into a server <- backup dependency, then there are no blockers.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329642#comment-16329642 ]

Vladimir Rodionov commented on HBASE-17852:
-------------------------------------------

I will rebase the patch onto the current master. The majority of this code (but not all) went into master in HBASE-19568, btw.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329597#comment-16329597 ]

Hadoop QA commented on HBASE-17852:
-----------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 4s | HBASE-17852 does not apply to master. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/0.6.0/precommit-patchnames for help. |

|| Subsystem || Report/Notes ||
| JIRA Issue | HBASE-17852 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12887458/HBASE-17852-v1.patch |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/11094/console |
| Powered by | Apache Yetus 0.6.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329577#comment-16329577 ]

Appy commented on HBASE-17852:
------------------------------

I said hbase-backup --> hbase-server above because backup needs snapshot.
Our dependencies are in a state of orgy right now; otherwise the following would have been the perfect shape to be in.

!screenshot-1.png!

That said, we should still be able to do procv2 + backup without all the other refactoring.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329538#comment-16329538 ]

Appy commented on HBASE-17852:
------------------------------

Adding more, so the likely dependencies will end up being:
hbase-backup --> hbase-server
hbase-backup --> hbase-procedure

B&R's functionalities will be implementations of Procedure/StateMachineProcedure and use masterServices.getMasterProcedureExecutor().submitProcedure() to get stuff done.
I do see some deps issues, but we can come up with solutions. One thing we should definitely try to stay away from is merging the code back into the hbase-server module.
For now, what do you think are the biggest blockers for making procv2 + backup happen [~vrodionov]?

You're right, we should definitely discuss concrete design/problems/solutions before starting with the refactoring. Can help with design review.
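To make the state-machine idea in the comment above more concrete, here is a toy model of an operation expressed as ordered steps, each with an undo action that runs in reverse order on failure. It deliberately does not use or imitate the real Procedure/StateMachineProcedure classes; `StateMachineSketch`, the step names, and the log messages are all hypothetical.

```python
class StateMachineSketch:
    """Runs named steps in order; on failure undoes completed steps in reverse."""

    def __init__(self, steps):
        # steps: list of (name, execute, rollback); both callables take a log list
        self.steps = steps

    def run(self, log):
        done = []
        for name, execute, rollback in self.steps:
            try:
                execute(log)
                done.append((name, rollback))
            except Exception:
                # Undo only the steps that completed, most recent first.
                for _, undo in reversed(done):
                    undo(log)
                return False
        return True


def fail(log):
    """Simulates a step that cannot complete."""
    raise RuntimeError("simulated failure")


steps = [
    ("snapshot",
     lambda log: log.append("snapshot taken"),
     lambda log: log.append("snapshot dropped")),
    ("copy",
     lambda log: log.append("files copied"),
     lambda log: log.append("files removed")),
    ("finish", fail, lambda log: None),
]

events = []
succeeded = StateMachineSketch(steps).run(events)
```

The log ends with the undo actions in reverse order of execution, which is the property that makes a step-based design attractive for backup create/merge/delete: each state knows how to reverse itself, so a failure does not need a separate repair pass.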
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329494#comment-16329494 ]

Appy commented on HBASE-17852:
------------------------------

Replication is doing it, but it's already in the hbase-server module, so it's definitely not the ideal example. But I think it's possible to do procv2 + backup without tight integration with hbase-server, i.e. while keeping things in a separate module. I won't be surprised if it requires some refactoring/small design improvements in the procv2 code itself, but that'll be all for the good. Maybe the backup module becomes the poster child for "Building features with procv2" and we make replication do the same.

> Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
>
> Key: HBASE-17852
> URL: https://issues.apache.org/jira/browse/HBASE-17852
> Project: HBase
> Issue Type: Sub-task
> Reporter: Vladimir Rodionov
> Assignee: Vladimir Rodionov
> Priority: Major
> Fix For: 2.0.0
>
> Attachments: HBASE-17852-v1.patch, HBASE-17852-v2.patch, HBASE-17852-v3.patch, HBASE-17852-v4.patch, HBASE-17852-v5.patch, HBASE-17852-v6.patch, HBASE-17852-v7.patch, HBASE-17852-v8.patch, HBASE-17852-v9.patch
>
> Design approach rollback-via-snapshot implemented in this ticket:
> # Before a backup create/delete/merge starts, we take a snapshot of the backup meta-table (backup system table). This procedure is lightweight because the meta table is small and should usually fit in a single region.
> # When an operation fails on the server side, we handle the failure by cleaning up partial data in the backup destination, followed by restoring the backup meta-table from the snapshot.
> # When an operation fails on the client side (abnormal termination, for example), the next time the user tries create/merge/delete they will see an error message saying that the system is in an inconsistent state and repair is required; they will need to run the backup repair tool.
> # To avoid multiple writers to the backup system table (the backup client and BackupObservers), we introduce a small table ONLY to keep the listing of bulk loaded files. All backup observers will work only with this new table. The reason: in case of a failure during backup create/delete/merge/restore, when the system performs automatic rollback, some data written by backup observers during the failed operation may be lost. This is what we try to avoid.
> # The second table keeps only bulk load related references. We do not care about the consistency of this table, because bulk load is an idempotent operation and can be repeated after failure. Partially written data in the second table does not affect the BackupHFileCleaner plugin, because this data (the list of bulk loaded files) corresponds to files which have not yet been loaded successfully and, hence, are not visible to the system.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
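The first two steps of the quoted design, plus the reason for the separate bulk-load table, can be illustrated with a minimal in-memory sketch. This is not the patch's code: `SnapshotRollbackSketch`, `backupMeta`, `bulkLoadTable` and `runWithRollback` are hypothetical names, and plain maps stand in for the HBase tables and the snapshot.

```java
import java.util.HashMap;
import java.util.Map;

public class SnapshotRollbackSketch {
    /** Hypothetical stand-in for the backup system (meta) table. */
    static final Map<String, String> backupMeta = new HashMap<>();
    /** Hypothetical stand-in for the separate table written by backup observers. */
    static final Map<String, String> bulkLoadTable = new HashMap<>();

    /** A backup create/delete/merge step that may fail. */
    interface Operation {
        void run() throws Exception;
    }

    /**
     * Copies backupMeta before running op (the "snapshot") and, if op throws,
     * replaces the contents of backupMeta with that copy (the "restore").
     * Returns true when op completed, false when it was rolled back.
     */
    static boolean runWithRollback(Operation op) {
        Map<String, String> snapshot = new HashMap<>(backupMeta);
        try {
            op.run();
            return true;
        } catch (Exception e) {
            // Cleanup of partial data in the backup destination would happen here.
            backupMeta.clear();
            backupMeta.putAll(snapshot);
            return false;
        }
    }

    public static void main(String[] args) {
        backupMeta.put("backup_1", "COMPLETE");
        boolean ok = runWithRollback(() -> {
            backupMeta.put("backup_2", "RUNNING");
            // An observer records a bulk loaded file while the operation runs.
            bulkLoadTable.put("hfile_a", "table1");
            throw new IllegalStateException("simulated server-side failure");
        });
        // The meta table is back to its pre-operation state; the observer's
        // record survives because it lives in the other table.
        System.out.println(ok + " " + backupMeta + " " + bulkLoadTable);
    }
}
```

Because the observer's write goes to `bulkLoadTable`, restoring `backupMeta` does not discard it, which is the property item 4 of the description is after.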
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329466#comment-16329466 ]

Vladimir Rodionov commented on HBASE-17852:
-------------------------------------------

So, we are going back to procV2 and tight integration with hbase-server? [~appy], we used to have this before, but had to move everything out of hbase-server more than a year ago at the request of [~stack]. Therefore, I need a +1 from [~stack] on this plan before I start working on refactoring again.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16327774#comment-16327774 ]

Appy commented on HBASE-17852:
------------------------------

I'm fine with this landing in master. I'll try to take a thorough look at the code after the 2.0 release (if I miss that, I'll consider myself ineligible for casting any +/-1).

Off the top of my head, I think the main areas to touch upon are:
- Make backups concurrent.
- Use the procedure framework: a long-standing request. The procv2 framework has features like locking, queuing operations, etc. Replication is already moving to it. I don't see a reason why backup can't too.
- Can't use CP hooks for incremental backup. Backup should/will become a first-class feature, more important and critical than a Coprocessor.
- There should be some basic access control, if only limiting everything to ADMIN (like RS group recently did in HBASE-19483).
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324439#comment-16324439 ]

Josh Elser commented on HBASE-17852:
------------------------------------

bq. I'd prefer to see this land in master, then we take the concept back to the drawing board and, with all of your help, we revisit this and come up with a design and implementation that works for concurrent backup sessions (as Vlad has this on the Phase4 roadmap already).

Ping [~appy].
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280521#comment-16280521 ]

Josh Elser commented on HBASE-17852:
------------------------------------

bq. Yea, retry would be good. File a JIRA?

HBASE-19441. Tagged in the "Phase 4" umbrella.

{quote}
So, I'd say, there are many things implicitly broken with current design. Strong -1 on shipping it unless they are fixed.
{quote}

[~appy], while I appreciate the keen eye you're applying here, how can we move forward? I know it's very frustrating for Vlad to have something that he's already built and tested be taken back to the drawing board abruptly. Do you truly feel that this feature is harmful compared to the current implementation?

I'd prefer to see this land in master, then we take the concept back to the drawing board and, with all of your help, we revisit this and come up with a design and implementation that works for concurrent backup sessions (as Vlad has this on the Phase4 roadmap already).
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275203#comment-16275203 ]

Vladimir Rodionov commented on HBASE-17852:
-------------------------------------------

[~appy],
# Only the Admin user can run backups; therefore, there is no need to run multiple backups in parallel. The Admin can run them in a single backup command.
# Restore can be run in parallel with other commands. The current restriction is an artificial limitation and can be removed easily. It means the Admin can run a backup session and multiple restore sessions in parallel. I personally do not see or anticipate a strong request to allow multiple backup sessions in parallel. I advise you to go through the doc; you will find it easy to work around parallel sessions by combining them into a single one, [~appy].
# There are no issues with cross-RPC in the backup case, because the RPC call is a single hop and, hence, deadlock-free.
# Failure of BackupObserver to record a bulk loaded file will result in bulk load failure, yes. *But I do not see an alternative here*, do you? We need to record *all bulk loaded file names and store them persistently before the bulk load operation completes*. Do you have an idea how this can be achieved w/o failing the bulk load itself and w/o touching hbase core code?

The only thing I agree with here is support for parallel deletes and merges, and if we introduce this support we can easily add multiple backup session support for free.

I personally was very impressed by you guys: you spent so much time looking for design and implementation flaws when time was literally running out, during this week. Good job. Why didn't you do this a couple of months ago?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275107#comment-16275107 ]

Appy commented on HBASE-17852:
------------------------------

bq. To try to move the conversation forward, I tend to agree with Vlad that I don't seen an inherent problem with the rollback-via-snapshot implementation

The inherent problem with the rollback-via-snapshot approach is that one operation takes an "exclusive lock" on the backup meta table, and in a very weird way at that. It's weird because:
1) It behaves like an exclusive lock only in certain cases. (We only restore on failure, i.e. exclusion kicks in only on failures. That leads to the waterfall of issues mentioned below.)
2) Some other operations on that table follow "exclusion" semantics (via locking a row), while others do not.

As a result, we see so many problems:
1) Different table for incremental backup data: the problem is not that there's a different table, that's fine, but the reason which led to it.
2) You can't run any other command in parallel! No restores (data loss, services are down, everything is on fire, oh but there's a cron job taking a backup, so I can't do zilch!?), no merges, no deletes (prod cluster, running out of space, I have to wait for the backup before I can free up space?). That's just absurd.
3) Other successful commands are rolled back silently. If an operator adds/removes/deletes sets, they are gone if a totally different thing fails!
4) During restore, the backup table goes offline, a cron job attempts a backup and fails.

Others:
- And then there are the issues around cross-RS RPC from the observer during bulk load. Was the alternative suggested yesterday considered in the design? Was any alternative considered?
- (Ref: Bulk loads) Backups are very important. But more important is the user being able to load their data and use it. Preventing users from working with their data by putting backup in the load path and failing everything if backup doesn't work is plain wrong. Find a different way to back up bulk load data without affecting core read/write paths.

So, I'd say, there are many things implicitly broken with current design. Strong -1 on shipping it unless they are fixed.
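Problem 3 in the comment above (an unrelated, successful command being undone when a different operation rolls back) can be shown with a small in-memory sketch. It assumes, as the comment describes, that another command can write to the table between the snapshot and the restore; `SilentRollbackSketch`, `simulate` and the row names are hypothetical, and a map stands in for the backup system table.

```java
import java.util.Map;
import java.util.TreeMap;

public class SilentRollbackSketch {
    /** Replays the sequence of events and returns the final table contents. */
    static Map<String, String> simulate() {
        Map<String, String> backupMeta = new TreeMap<>();
        backupMeta.put("backup_1", "COMPLETE");

        // A backup session starts and takes its pre-operation snapshot.
        Map<String, String> snapshot = new TreeMap<>(backupMeta);
        backupMeta.put("backup_2", "RUNNING");

        // Meanwhile an unrelated command succeeds (hypothetical backup-set update).
        backupMeta.put("set:nightly", "table1,table2");

        // The backup session fails and the table is restored from the snapshot.
        backupMeta.clear();
        backupMeta.putAll(snapshot);
        return backupMeta;
    }

    public static void main(String[] args) {
        // The backup-set row written by the successful command is gone.
        System.out.println(simulate()); // prints {backup_1=COMPLETE}
    }
}
```

The restore is indifferent to who wrote a row after the snapshot was taken, so the only ways to keep such writes are to block them for the duration of the operation or to store them elsewhere.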
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274762#comment-16274762 ]

Mike Drob commented on HBASE-17852:
-----------------------------------

bq. This is much easier to fix, than concurrent backup sessions support, because restore does not access meta table.

Restore doesn't need to update the set of backup files (to remove references to files that are no longer referenced)? If we back up, add data, take an incremental backup, add data, restore to the first backup, add data, and take another incremental backup, will this all work correctly without the restore having needed to update any backup state? Where do I look for how this works?

bq. No, this is client-side operation. Can someone queue hbck?

It's client-side... kind of. We're encouraging folks to automate these operations, so comparing to hbck isn't the same.

bq. "manual cleanup" is only running the hbase backup repair command. I don't feel like that is too onerous and goes back to my original feelings (acceptable limitation to get this in the hands of users).

Yea, this is probably ok. I thought we still had a pretty hairy situation here.

bq. Specifically, the client does a checkAndPut to specific coordinates in the backup table and throws an exception when that fails. Remember that backups are client driven (per some design review from a long time ago), so queuing is tough to reason about (we have no "centralized" execution system to use). At a glance, it seems pretty straightforward to add some retry/backoff semantics to BackupSystemTable#startBackupExclusiveOperation(). Isn't exactly a "queue", but it would ease the pain you allude to.

Yea, retry would be good. File a JIRA?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274744#comment-16274744 ]

Josh Elser commented on HBASE-17852:
------------------------------------

bq. The current options appear to be wait until the backup finishes (maybe ok, depending on sizes/bandwidth/etc...) or cancel the nightly backup (very bad, especially if we have to do manual cleanup of things).

"Manual cleanup" is only running the {{hbase backup repair}} command. I don't feel like that is too onerous, and it goes back to my original feelings (acceptable limitation to get this in the hands of users).

bq. Can an admin queue sessions? That would help the user experience quite a bit until we get parallel sessions. (Not that I'm suggesting that this is either necessary or sufficient; I would much rather see effort towards a proper solution rather than temporary workaround after workaround, but queued operations may be useful in other contexts.)

Specifically, the client does a checkAndPut to specific coordinates in the backup table and throws an exception when that fails. Remember that backups are client driven (per some design review from a long time ago), so queuing is tough to reason about (we have no "centralized" execution system to use). At a glance, it seems pretty straightforward to add some retry/backoff semantics to {{BackupSystemTable#startBackupExclusiveOperation()}}. It isn't exactly a "queue", but it would ease the pain you allude to.
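The retry/backoff idea can be sketched without touching HBase at all. The following is only an illustration under stated assumptions: it does not call `BackupSystemTable#startBackupExclusiveOperation()`, the atomic `ConcurrentMap.putIfAbsent` stands in for the checkAndPut on the backup table, and `ExclusiveOpRetrySketch`, `LOCK_ROW`, `tryStart`, `finish` and `startWithRetry` are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ExclusiveOpRetrySketch {
    /** Hypothetical stand-in for the backup system table. */
    static final ConcurrentMap<String, String> table = new ConcurrentHashMap<>();
    /** Hypothetical coordinates of the "exclusive operation in progress" cell. */
    static final String LOCK_ROW = "exclusive-op";

    /** Atomically claims the cell if it is empty; stands in for checkAndPut. */
    static boolean tryStart(String owner) {
        return table.putIfAbsent(LOCK_ROW, owner) == null;
    }

    /** Releases the cell, but only if this owner currently holds it. */
    static void finish(String owner) {
        table.remove(LOCK_ROW, owner);
    }

    /**
     * Tries to claim the cell up to maxAttempts times, sleeping between
     * attempts and doubling the sleep each time. Returns false if every
     * attempt failed or the thread was interrupted while waiting.
     */
    static boolean startWithRetry(String owner, int maxAttempts, long initialBackoffMs) {
        long backoffMs = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryStart(owner)) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
                backoffMs *= 2;
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        tryStart("nightly-backup");
        // Another client finishes its operation shortly after we start waiting.
        Thread releaser = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            finish("nightly-backup");
        });
        releaser.start();
        boolean started = startWithRetry("manual-delete", 6, 20);
        releaser.join();
        System.out.println("manual-delete started: " + started);
    }
}
```

As the comment says, this is not a queue: waiting clients are not ordered, and a client that exhausts its attempts still fails. It only turns an immediate exception into a bounded wait.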
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274742#comment-16274742 ]

Vladimir Rodionov commented on HBASE-17852:
-------------------------------------------

{quote}
The problem is that backups and restores cannot occur simultaneously.
{quote}

This is much easier to fix than concurrent backup session support, because restore does not access the meta table.

{quote}
Can an admin queue sessions? That would help the user experience quite a bit until we get parallel sessions.
{quote}

No, this is a client-side operation. Can someone queue *hbck*?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274706#comment-16274706 ]

Mike Drob commented on HBASE-17852:

bq. because backup sessions always update shared records.

This sounds like a design flaw.

bq. what is the use case when Admin starts two sessions in parallel if he can run them serially

Can an admin queue sessions? That would help the user experience quite a bit until we get parallel sessions. (Not that I'm suggesting that this is either necessary or sufficient; I would much rather see effort towards a proper solution than temporary workaround after workaround, but queued operations may be useful in other contexts.)
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274700#comment-16274700 ]

Mike Drob commented on HBASE-17852:

The problem is that backups and restores cannot occur simultaneously.

Let's say that we have a hypothetical system set to back up nightly (via cron or some other non-interactive mechanism). While this full system backup is running, some problem is detected with a single table and it is determined that the correct course of action is to restore that table. Given that we base backup and restore operations on snapshots, this should be straightforward: the large backup can continue to run while a restore of the specific table (to the last known good state) is put in place without waiting for the backup to complete.

The current options appear to be to wait until the backup finishes (maybe ok, depending on sizes/bandwidth/etc.) or to cancel the nightly backup (very bad, especially if we have to do manual cleanup of things).

I think the position that I'm slowly arriving at is that we shouldn't be recommending nightly backups at all to folks; this is probably a use case better served by replication and having a wider variety of sinks available instead of only another HBase cluster (HBASE-18846 might help with this?). That said, we would still need some kind of bulk restore wrappers. Let me think on this more...
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274659#comment-16274659 ]

Josh Elser commented on HBASE-17852:

bq. Hmm... I don't think we can publish backup/restore without HBASE-16391 in a 2.0 release. I'd like to have confidence that the feature is rock solid before telling users that it's ok to use, parallel operations seems like a major shortcoming to me.

Let's dig in on this some more, [~mdrob].

B&R is much more of an "administrative function" as opposed to a "client feature". My general expectation would be that, most aggressively, HBase admins (a couple of people) would run incremental backups on the order of "hours", e.g. an incremental backup every 8 hours. I could see the extremely paranoid wanting to do incremental backups every hour over some collection of tables, which _could_ cause issues if we can only execute one backup operation at a time (I'm thinking along the lines of 3 backup sets, incremental backups every hour, merging of those backups every few hours, full backup every day, etc.).

As such, my opinion differs in that I don't see the lack of concurrent backup operations being a major impediment for "most" users. I completely agree with you that there will be some users for whom this limitation would be problematic for what they want to do, but even for these edge cases, B&R without this would still have value to them.

I think getting this feature into the hands of users (with extremely clear caveats on the current implementation) would actually better serve the feature than letting it fester more on JIRA. Thoughts?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274008#comment-16274008 ]

Vladimir Rodionov commented on HBASE-17852:

{quote}
When we switch away from restore-via-snapshot and have proper transactions, does that mean this extra table will go away?
{quote}
Yes, but you need to understand that proper Tx management is a hard task in this case. It is even harder than classic Tx management. A DB Tx is rolled back automatically in case of a collision (updates to the same record), but we have to merge these updates correctly, because backup sessions always update shared records. Is it worth doing? Only an Admin can run backups, and what is the use case for an Admin starting two sessions in parallel if he can run them serially?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273958#comment-16273958 ]

Mike Drob commented on HBASE-17852:

When we switch away from restore-via-snapshot and have proper transactions, does that mean this extra table will go away?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273955#comment-16273955 ]

Vladimir Rodionov commented on HBASE-17852:

{quote}
There were concerns above on cross RS rpc to write the paths, I was trying to think of easiest way of avoiding that. How about returning the map as part of response here and then issue rpc to master from client side. It's easy and safer to retry from client side if remote resource isn't available.
I'd suggest going extra step, an easy one though - collect all paths on client side and do single put request. That'll give two benefits:
- Will make it transactional incremental backup
- If put fails repeatedly, you can either fail bulk load altogether, or throw error to user telling that these bulk loaded files failed to backup and that only full backup will include them.
{quote}
I will think about this and get back to you shortly, [~appy]. Thanks for the suggestion.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273953#comment-16273953 ]

Vladimir Rodionov commented on HBASE-17852:

{quote}
What happens if during an ongoing backup, i create some backup sets, but then the backup fails? Snapshot restore will remove my backup sets?
{quote}
Yes. Any modifications made to the backup meta table *during* a backup create/merge/delete session that *fails* will be lost. This is a current limitation. As a simple workaround, updates to the backup meta table (*backup set operations only*) can be disabled during these sessions.
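The limitation and the proposed workaround from the comment above can be illustrated with a small model. This is a sketch under stated assumptions, not HBase code: `BackupMetaTable`, `begin_session`, `end_session`, `add_backup_set`, and the `block_set_updates_during_session` flag are hypothetical names, and a dict stands in for the backup meta table.

```python
import copy


class SessionActive(Exception):
    """Raised when a backup set update is rejected during a session."""


class BackupMetaTable:
    def __init__(self, block_set_updates_during_session=False):
        self.backup_sets = {}
        self.session_active = False
        self.block_set_updates_during_session = block_set_updates_during_session
        self._snapshot = None

    def begin_session(self):
        # Snapshot taken before backup create/merge/delete starts.
        self._snapshot = copy.deepcopy(self.backup_sets)
        self.session_active = True

    def add_backup_set(self, name, tables):
        # Workaround: refuse backup set updates while a session is running.
        if self.session_active and self.block_set_updates_during_session:
            raise SessionActive("backup set updates are disabled during a session")
        self.backup_sets[name] = list(tables)

    def end_session(self, succeeded):
        # On failure, restoring the snapshot discards every change made
        # during the session, including unrelated backup set updates.
        if not succeeded:
            self.backup_sets = self._snapshot
        self._snapshot = None
        self.session_active = False
```

Without the flag, a backup set created during a session that later fails silently disappears on rollback. With the flag, the update is rejected up front and the admin can retry it once the session has ended.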
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273901#comment-16273901 ]

Appy commented on HBASE-17852:

A few questions. Pardon me if my high-level analysis of the design is off.

Is the following a correct description of the current design?
Start bulk load from client -> each RS gets its RPC for prepare and then does the actual bulk load --> internally, when the bulk load is done, BackupObserver#postBulkLoadHFile writes paths to the backup table.
And to keep full backup failures from affecting incremental backups (due to snapshot restore), you are putting the bulk loaded paths in a separate table, right?

There were concerns above about the cross-RS RPC to write the paths; I was trying to think of the easiest way of avoiding that. How about returning the [map as part of response here|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L2251] and then issuing an RPC to master from the client side? It's easier and safer to retry from the client side if a remote resource isn't available.
I'd suggest going an extra step, an easy one though: collect all paths on the client side and do a single put request. That'll give two benefits:
- It will make it a transactional incremental backup.
- If the put fails repeatedly, you can either fail the bulk load altogether, or throw an error to the user saying that these bulk loaded files failed to back up and that only a full backup will include them.

What happens if, during an ongoing backup, I create some backup sets, but then the backup fails? Will snapshot restore remove my backup sets?
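The client-side suggestion in the comment above (collect all paths, issue a single put, retry, then either fail or warn) could look roughly like this. It is a minimal sketch in Python with no HBase calls: `bulk_load_with_backup_record`, `region_loads`, `put_paths`, `max_attempts`, and `fail_on_put_error` are hypothetical names, and the callables stand in for the per-region-server bulk load RPCs and the write to the bulk load table.

```python
class PutFailed(Exception):
    """Stand-in for a failed write of paths to the bulk load table."""


class BulkLoadFailed(Exception):
    """Raised when the bulk load is failed because paths could not be recorded."""


def bulk_load_with_backup_record(region_loads, put_paths, max_attempts=3,
                                 fail_on_put_error=True):
    # Each callable performs the bulk load on one region server and
    # returns the list of paths it loaded (the "map in the response").
    all_paths = []
    for load in region_loads:
        all_paths.extend(load())

    # A single put carrying every path, retried from the client side.
    last_error = None
    for _ in range(max_attempts):
        try:
            put_paths(list(all_paths))
            return {"recorded": True, "paths": all_paths, "warning": None}
        except PutFailed as error:
            last_error = error

    # Repeated failure: either fail the bulk load or warn the user.
    if fail_on_put_error:
        raise BulkLoadFailed("could not record bulk loaded files") from last_error
    warning = ("these bulk loaded files were not recorded for incremental "
               "backup; only a full backup will include them")
    return {"recorded": False, "paths": all_paths, "warning": warning}
```

Because all paths travel in one request, the record is written either completely or not at all, which is the sense in which the comment calls the result transactional. Whether a put failure should abort the bulk load is left as a caller decision here, mirroring the two options named in the comment.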
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273856#comment-16273856 ]

Mike Drob commented on HBASE-17852:

Hmm... I don't think we can publish backup/restore without HBASE-16391 in a 2.0 release. I'd like to have confidence that the feature is rock solid before telling users that it's ok to use; the lack of parallel operations seems like a major shortcoming to me. Maybe this isn't the right JIRA to discuss this; apologies for stepping into the crossfire here.

I left a few comments on the RB and will continue to look after reading more of the general design.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273661#comment-16273661 ]

Vladimir Rodionov commented on HBASE-17852:

Yes, concurrent backup support will require some code rewrite. Rollback-via-snapshot probably won't work in this case, but these are internal implementation details and they won't affect users; backward compatibility is a must here.

We do not have any specific timeouts for backups, only the low-level HBase timeouts for RPC and distributed procedure ops. If they time out, the backup fails.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273568#comment-16273568 ]

Mike Drob commented on HBASE-17852:
---

Thank you for pointing at the parent ticket; I missed that there was a design doc in there.

I'm worried about what our current design choices mean for a future concurrent backup design. Please correct any gaps in my understanding here:

The current approach, which is limited to a single backup operation, involves taking a snapshot of the backup state table (what you refer to as the backup system table, but I think "state" is the more appropriate term) and then, if there is a failure, restoring the state?

In the future, are we going to have multiple tables in the backup namespace, one for each table to be backed up, so that we can have concurrent approaches? Otherwise I expect the concurrent backup solution will be a complete rewrite.

Do backup operations have timeouts? I don't see them in the code, but I could be looking in the wrong place.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273529#comment-16273529 ]

Vladimir Rodionov commented on HBASE-17852:
---

The approach described in this JIRA description and in the parent ticket, [~mdrob].
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273491#comment-16273491 ]

Mike Drob commented on HBASE-17852:
---

bq. not a generic questions: Why have you chosen this design approach, especially when this approach has been discussed many times with other developers before

Can you point me at a design document that covers this?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273426#comment-16273426 ]

Vladimir Rodionov commented on HBASE-17852:
---

{quote}
I was about to write that I thought it was a no-brainer to blindly run a repair as a part of the BackupDriver, but now I wonder about the following:

Take two administrators running backups, unaware of each other. Admin1 starts a backup on Table1. Before Admin1's backup finishes, Admin2 tries to do a backup on Table2. Could Admin2 preempt/fail Admin1's backup by running a hbase backup repair while Admin1 is using the system?

In other words: does hbase backup repair have the ability to differentiate between "user is currently executing a backup" and "stale state exists in the table from an aborted/unfinished operation"?
{quote}

All operations are serialized. Admin 2's operation will fail, and Admin 2 will have to wait until the first operation is complete (successfully or not). Support for multiple parallel backup sessions is on the roadmap for the 2.1 release: https://issues.apache.org/jira/browse/HBASE-16391.
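The serialization described in the comment above (a second operation fails while the first is still in progress) can be pictured with a small Java sketch. This is an assumption-laden illustration, not the HBase code: `ExclusiveSessionSketch`, `tryStart`, and `finish` are hypothetical names, and the single in-memory field stands in for whatever state the backup system actually records.

```java
/** Illustrative sketch only; not the HBase implementation. */
class ExclusiveSessionSketch {
    // Id of the session that currently owns the backup system, or null if none.
    private String activeSessionId;

    /** Tries to start a session; fails if another session is still recorded as active. */
    synchronized boolean tryStart(String sessionId) {
        if (activeSessionId != null) {
            return false;
        }
        activeSessionId = sessionId;
        return true;
    }

    /** Marks the session finished, successfully or not, so the next one may start. */
    synchronized boolean finish(String sessionId) {
        if (sessionId.equals(activeSessionId)) {
            activeSessionId = null;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ExclusiveSessionSketch backupSystem = new ExclusiveSessionSketch();
        System.out.println("admin1 starts: " + backupSystem.tryStart("admin1-table1"));
        System.out.println("admin2 starts: " + backupSystem.tryStart("admin2-table2"));
        backupSystem.finish("admin1-table1");
        System.out.println("admin2 retries: " + backupSystem.tryStart("admin2-table2"));
    }
}
```

Note that a flag like this cannot by itself tell a session whose client crashed from one that is still running, which is the distinction the quoted question asks about.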
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273414#comment-16273414 ]

Vladimir Rodionov commented on HBASE-17852:
---

{quote}
I'm not sure what this is intended to prove. Sometimes I get patches right on the first try, sometimes it takes twenty tries.
{quote}

Nothing, actually, except that when a contributor submits a patch he expects comments/questions related to the code of the patch, not generic questions such as "Why have you chosen this design approach?", especially when this approach has been discussed many times with other developers before. It is very hard and time-consuming to explain everything from the very beginning to a person who wants to participate in a review but is not familiar with the code.

I have two committers on the feature, [~te...@apache.org] and [~elserj], who have spent a lot of time working on the code. I trust them, and although I appreciate help from other developers, I expect them to spend some time digging into the full feature code before trying to review a particular patch (one of more than 100 already). This requires some commitment.

Any question on the patch itself?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273399#comment-16273399 ]

Vladimir Rodionov commented on HBASE-17852:
---

{quote}
It seems like the feature is being moved out because it's incomplete...
{quote}

Really, what is missing? Have you read the doc? Everything described in the B&R documentation has been implemented and tested. I am running integration tests at scale now; this is probably the last thing a developer is supposed to do before declaring a feature fully complete? We are still at the alpha stage, with plenty of time to harden the feature before beta-1 or 2.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273376#comment-16273376 ]

Josh Elser commented on HBASE-17852:
---

bq. Operators who'd rather avoid reading logs and having to run repair tools are 'lazy'.
bq. Do not we still have hbck for this reason? Repair \[...\] which happens periodically in HBase cluster.

Let me also expand on this: I would consider "lazy" a virtue for operators. The system should automatically handle as much as possible. There's a fundamental difference between what hbck is and what `hbase backup repair` is: HBCK fixes things that inadvertently happen server-side (hopefully, only around bugs which have since been fixed), whereas the hbase-backup ones are completely client-driven. For example, something as benign as a user ctrl-C'ing a backup because they mis-typed the backup name or the table being backed up would cause the backup table to need a repair.

bq. This is the question actually, should we do repair automatically or we need to inform user, that there was abnormal failure of a last backup/merge/delete command and user need to run repair.

I was about to write that I thought it was a no-brainer to blindly run a repair as a part of the BackupDriver, but now I wonder about the following:

Take two administrators running backups, unaware of each other. Admin1 starts a backup on Table1. Before Admin1's backup finishes, Admin2 tries to do a backup on Table2. Could Admin2 preempt/fail Admin1's backup by running a {{hbase backup repair}} while Admin1 is using the system?

In other words: does {{hbase backup repair}} have the ability to differentiate between "user is currently executing a backup" and "stale state exists in the table from an aborted/unfinished operation"?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273371#comment-16273371 ]

Mike Drob commented on HBASE-17852:
---

bq. The reason is the simplicity of the implementation. Is not this obvious?

It's not obvious, hence the need for clarifying questions. We're all collaborators here, Vlad, not adversaries. I haven't reviewed the code, so in this instance I'm a messenger and attempted mediator.

bq. Should I have spent time trying to implement Tx management instead?

Maybe!

bq. Did I answer original question? I thought that we are technical guys and we need technical answers. It seems that I was wrong.

I'm reminded of advice I got early in my software engineering career - it's easy to write code, it's less easy to write correct code, and it is actively hard to know which code to write. The technical answer may have been obvious like you assert, but it's not a complete answer. Understanding how the operators will need to use this feature and how they will interact with it is important in building something that is useful to them.

bq. User intervention is required only if user kills backup process or it dies on a client side, for some other reason.

There are lots of reasons that a process might die on the client side. Seems we may disagree on the frequency here.

bq. Do not we still have hbck for this reason?

Sure, we can extend hbck to take care of these failures as well. Does it currently do so? I have no idea. Probably not, given that I don't think hbck works with hbase-2.0 due to AMv2.

bq. Moving feature out of beta-1 only because someone does not like attitude of a contributor

It seems like the feature is being moved out because it's incomplete...

And some earlier comments:

bq. But for lazy operators...

Lazy operators are the best kind. They are the ones that automate things, the ones that prepare and test for failure so that they don't get called in the middle of the night, the ones that actually make sure that the ship stays sailing.

bq. The patch is no 8 already

I'm not sure what this is intended to prove. Sometimes I get patches right on the first try, sometimes it takes twenty tries.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273329#comment-16273329 ]

Vladimir Rodionov commented on HBASE-17852:
---

The reason is the simplicity of the implementation. Is not this obvious? Should I have spent time trying to implement Tx management instead? I doubt it.

Did I answer the original question? I thought that we are technical guys and we need technical answers. It seems that I was wrong.

User intervention is required only if the user kills the backup process or it dies on the client side for some other reason. All cluster-side failures get repaired automatically. I see nothing painful for users here, [~mdrob], especially once I implement the auto-repair feature. This is the question, actually: should we do the repair automatically, or do we need to inform the user that there was an abnormal failure of the last backup/merge/delete command and that the user needs to run repair?

Do not we still have *hbck* for this reason? Repair all the s**t which happens periodically in an HBase cluster.

Moving the feature out of beta-1 only because someone does not like the *attitude of a contributor* means that something is not going well in the HBase community.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273167#comment-16273167 ]

Mike Drob commented on HBASE-17852:
---

[~vrodionov] - I think Josh clarified Stack's question to explain why he posits that you "didn't answer the question":

bq. That's the technical explanation for why it is implemented as such, but I think the spirit of the question is more: "what are the reasons for making this choice and is there something that could be done to make this less painful for users?"
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273141#comment-16273141 ]

Vladimir Rodionov commented on HBASE-17852:

My attitude? Nice. Maybe yours? I tried several times to explain obvious things to you, but you are still not getting them.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272169#comment-16272169 ]

stack commented on HBASE-17852:

Moving out of beta-1. I ask questions and get rubbish back. Contributor has wrong attitude. Operators who'd rather avoid reading logs and having to run repair tools are 'lazy'.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272106#comment-16272106 ]

Hadoop QA commented on HBASE-17852:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 2m 9s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | findbugs | 0m 0s | Findbugs executables are not available. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 19 new or modified test files. |
|| || || || master Compile Tests ||
| 0 | mvndep | 0m 25s | Maven dependency ordering for branch |
| +1 | mvninstall | 5m 15s | master passed |
| +1 | compile | 0m 50s | master passed |
| +1 | checkstyle | 0m 41s | master passed |
| +1 | shadedjars | 5m 41s | branch has no errors when building our shaded downstream artifacts. |
| +1 | javadoc | 0m 25s | master passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 13s | Maven dependency ordering for patch |
| +1 | mvninstall | 4m 51s | the patch passed |
| +1 | compile | 0m 43s | the patch passed |
| +1 | javac | 0m 43s | the patch passed |
| -1 | checkstyle | 0m 17s | hbase-backup: The patch generated 4 new + 179 unchanged - 18 fixed = 183 total (was 197) |
| -1 | whitespace | 0m 0s | The patch has 670 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| -1 | whitespace | 0m 16s | The patch 384 line(s) with tabs. |
| +1 | shadedjars | 5m 17s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 54m 25s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 2.7.4 or 3.0.0-alpha4. |
| +1 | javadoc | 0m 24s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 9m 16s | hbase-backup in the patch passed. |
| +1 | unit | 0m 28s | hbase-it in the patch passed. |
| +1 | asflicense | 0m 20s | The patch does not generate ASF License warnings. |
| | | 86m 27s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-17852 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12899932/HBASE-17852-v9.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux b272d49628c9 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Bui
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271978#comment-16271978 ]

Vladimir Rodionov commented on HBASE-17852:

{quote}
You don't answer the question
{quote}
What question? What does "corrupt" mean? Why do I need to restore meta table? I am afraid I can't add anything else to my answers above.

{quote}
Don't follow. An operator sets up a cron job. Works great for a few days. Then it stops. Operator needs to figure that he has to run a repair. Operator sets up two cron jobs? Or cron probes first for breakage...
{quote}
Stops means fails. If the cron job fails, the operator will need to intervene, read logs and manuals, and figure out that a repair is required. Not a big deal, imo. We clearly log a message that the repair tool has to be run. But for lazy operators I will add an auto-repair mode of execution (see above ticket).
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271903#comment-16271903 ]

stack commented on HBASE-17852:

bq. Can you post your comments on RB, Stack?

Traditionally it is the contributor's job to keep the feedback in order and to make sure it is all addressed, whether in JIRA or RB. Not addressing reviewers' feedback or dropping it w/o comment is a total no-no.

bq. I have explained this many times already ...

You don't answer the question. You just assert that we have to roll back, w/o justification other than that backups 'become corrupt' or that a backup is only 'safe' if it completes. Sounds like it needs to be 'transactional' but you don't describe the transaction (correct me if I'm wrong). I don't get why a completed backup can't just write a completion marker to the backup table. W/o it the backup is corrupt/incomplete and we just move on.

bq. Running backup repair automatically in case of a backup failure won't hurt and can be incorporated into cron job

Don't follow. An operator sets up a cron job. Works great for a few days. Then it stops. Operator needs to figure that he has to run a repair. Operator sets up two cron jobs? Or cron probes first for breakage...
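To make stack's completion-marker suggestion concrete, here is a minimal in-memory sketch in Java. It is only an illustration of the idea as worded in the comment: `CompletionMarkerSketch`, `backupTable`, `begin`, `complete`, and `usableBackups` are hypothetical names, not HBase classes or APIs, and the map merely stands in for the backup system table. It does not address the shared per-RegionServer state (such as the last WAL roll timestamps) that Vladimir raises elsewhere in the thread, so it should not be read as a working alternative to the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model: a backup counts as usable only once its marker is COMPLETE.
class CompletionMarkerSketch {
    enum State { RUNNING, COMPLETE }

    // Stands in for the backup system table: backupId -> state.
    static final Map<String, State> backupTable = new LinkedHashMap<>();

    // Record that a backup session has started.
    static void begin(String backupId) {
        backupTable.put(backupId, State.RUNNING);
    }

    // Write the completion marker as the very last step of a backup.
    static void complete(String backupId) {
        backupTable.put(backupId, State.COMPLETE);
    }

    // Backups without a completion marker are treated as incomplete and skipped.
    static List<String> usableBackups() {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, State> e : backupTable.entrySet()) {
            if (e.getValue() == State.COMPLETE) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        begin("b1");
        complete("b1");
        begin("b2");            // simulated crash: the marker is never written
        begin("b3");
        complete("b3");
        System.out.println(usableBackups());
    }
}
```

In this toy model the failed session `b2` is simply ignored by later readers; whether the real backup metadata allows that is exactly the open question in the discussion.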
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271873#comment-16271873 ]

Vladimir Rodionov commented on HBASE-17852:

{quote}
If the standard-procedures would be to run a repair blindly, why can't this be encapsulated in BackupDriver? Making the user's life easier is certainly beneficial.
{quote}
I can add auto-repair mode of execution for create/merge/delete.
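The auto-repair mode proposed above could look roughly like the following wrapper, shown here as a self-contained Java sketch. All names (`AutoRepairSketch`, `InconsistentStateException`, `runWithAutoRepair`) are assumptions made for illustration; the actual BackupDriver code and its exception types are not shown in this thread. The sketch runs the operation, and if it reports an inconsistent state, runs the repair once and retries.

```java
import java.util.function.Supplier;

// Hypothetical wrapper: repair once, then retry, if the system reports an inconsistent state.
class AutoRepairSketch {
    // Stand-in for the "system is in inconsistent state, repair is required" error.
    static class InconsistentStateException extends RuntimeException {
        InconsistentStateException(String msg) {
            super(msg);
        }
    }

    static <T> T runWithAutoRepair(Supplier<T> operation, Runnable repair) {
        try {
            return operation.get();
        } catch (InconsistentStateException e) {
            repair.run();              // what the operator would otherwise do by hand
            return operation.get();    // single retry; a second failure propagates
        }
    }

    public static void main(String[] args) {
        boolean[] needsRepair = {true};   // simulated leftover state from a failed session
        String result = runWithAutoRepair(
            () -> {
                if (needsRepair[0]) {
                    throw new InconsistentStateException("repair required");
                }
                return "backup created";
            },
            () -> needsRepair[0] = false);
        System.out.println(result);
    }
}
```

With a wrapper of this shape, a single cron entry would be enough, which is the usability concern raised by stack and Josh; the retry is deliberately limited to one attempt so that a persistent failure still surfaces to the operator.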
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271843#comment-16271843 ]

Josh Elser commented on HBASE-17852:

I think the only outstanding code-review comment from [~stack] was consolidation of two log messages into one (other questions were "why the bulk load backup table" which I think we better understand now and the use of TableDescriptorBuilder which I had already dinged and Vlad has fixed).
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271835#comment-16271835 ]

Josh Elser commented on HBASE-17852:

{quote}
bq. Why do this? Why not just mark the backup as corrupt and move on? (Why does an incomplete back-up freeze all backups -- which you say above I'm trying to understand).

I have explained this many times already ... Restoring meta table in case of a backup failure is a necessary step to make future backups possible. We write some data during backup create, which is safe only if backup succeeds, such as last WAL roll timestamp per table-per RS. If backup fails, this data becomes corrupt w/o restoring meta table from snapshot.
{quote}
That's the technical explanation for why it is implemented as such, but I think the spirit of the question is more: "what are the reasons for making this choice and is there something that could be done to make this less painful for users?"

{quote}
bq. What if it's a cron job? Does this inability at moving on past failure make it so backup cannot be cron'd?

Running backup repair automatically in case of a backup failure won't hurt and can be incorporated into cron job
{quote}
If the standard-procedures would be to run a repair blindly, why can't this be encapsulated in BackupDriver? Making the user's life easier is certainly beneficial.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269756#comment-16269756 ]

Vladimir Rodionov commented on HBASE-17852:

{quote}
Why do this? Why not just mark the backup as corrupt and move on? (Why does an incomplete back-up freeze all backups -- which you say above I'm trying to understand).
{quote}
I have explained this many times already ... Restoring meta table in case of a backup failure is a necessary step to make future backups possible. We write some data during backup create, which is safe only if backup succeeds, such as last WAL roll timestamp per table-per RS. If backup fails, this data becomes corrupt w/o restoring meta table from snapshot.

{quote}
What if it's a cron job? Does this inability at moving on past failure make it so backup cannot be cron'd?
{quote}
Running backup repair automatically in case of a backup failure won't hurt and can be incorporated into cron job

{quote}
If we weren't snapshotting/restoring the backup table, we wouldn't have to make a separate table to hold bulkloaded files? Is that so? (I'm not asking for a rewrite...).
{quote}
Yes, correct.

{quote}
I am asking questions to try and understand what is going on in here. When the response is terse or lean on info, I'm going to ask another question... and so on. As to whether snapshot/restore of the meta backup table is 'bad' or not, I'm still trying to understand why we would go to the extreme of offlining a whole table -- even though rare when in error and then it seems, this offlining is making it so we have to add yet another table just to hold bulk loaded files... Pardon my being slow.
{quote}
Yes, the second table was added long after the initial implementation was complete, as a result of hardening the bulk load support feature. You may consider this as a work-around, but it is a pretty lightweight work-around. W/o snapshots, we have to make all the changes to meta table fully transactional ones. I think it is much harder.
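The rollback-via-snapshot argument in the comment above can be illustrated with a small in-memory Java model. This is a sketch under stated assumptions, not the patch's code: `SnapshotRollbackSketch`, `metaTable`, and `runWithRollback` are hypothetical names, the map stands in for the backup system table, and the key `rs1:lastWalRoll` is an invented placeholder for the "last WAL roll timestamp per table-per RS" data mentioned in the comment.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical toy model of rollback-via-snapshot; none of these names are HBase APIs.
class SnapshotRollbackSketch {
    // Stands in for the backup system (meta) table.
    static final Map<String, Long> metaTable = new HashMap<>();

    // Runs one backup operation; on failure, restores the pre-operation state.
    static boolean runWithRollback(Consumer<Map<String, Long>> operation) {
        Map<String, Long> snapshot = new HashMap<>(metaTable); // 1. snapshot before start
        try {
            operation.accept(metaTable);                       // 2. operation mutates the table
            return true;
        } catch (RuntimeException e) {
            metaTable.clear();                                 // 3. roll back on failure
            metaTable.putAll(snapshot);
            return false;
        }
    }

    public static void main(String[] args) {
        metaTable.put("rs1:lastWalRoll", 100L);
        boolean ok = runWithRollback(meta -> {
            meta.put("rs1:lastWalRoll", 200L);                 // written mid-backup
            throw new IllegalStateException("simulated backup failure");
        });
        // After the failed run, the timestamp is back at its pre-backup value.
        System.out.println(ok + " " + metaTable.get("rs1:lastWalRoll"));
    }
}
```

The model also hints at why a second table is needed: anything another writer adds to `metaTable` between the snapshot and the restore is discarded by the rollback, which is the data-loss case the separate bulk-load table is meant to avoid.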
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269733#comment-16269733 ] Vladimir Rodionov commented on HBASE-17852: ---
{quote}
What of my review comments are addressed in latest patch?
{quote}
Can you post your comments on RB, Stack?

> Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
>
> Key: HBASE-17852
> URL: https://issues.apache.org/jira/browse/HBASE-17852
> Project: HBase
> Issue Type: Sub-task
> Reporter: Vladimir Rodionov
> Assignee: Vladimir Rodionov
> Fix For: 2.0.0-beta-1
> Attachments: HBASE-17852-v1.patch, HBASE-17852-v2.patch, HBASE-17852-v3.patch, HBASE-17852-v4.patch, HBASE-17852-v5.patch, HBASE-17852-v6.patch, HBASE-17852-v7.patch, HBASE-17852-v8.patch
>
> Design approach rollback-via-snapshot implemented in this ticket:
> # Before a backup create/delete/merge starts, we take a snapshot of the backup meta-table (backup system table). This procedure is lightweight because the meta table is small and usually fits in a single region.
> # When an operation fails on the server side, we handle the failure by cleaning up partial data in the backup destination and then restoring the backup meta-table from the snapshot.
> # When an operation fails on the client side (abnormal termination, for example), the next time the user tries create/merge/delete, he (she) will see an error message saying that the system is in an inconsistent state and repair is required; he (she) will need to run the backup repair tool.
> # To avoid multiple writers to the backup system table (the backup client and BackupObservers), we introduce a small table used ONLY to keep the listing of bulk loaded files. All backup observers will work only with this new table. The reason: if a failure occurs during backup create/delete/merge/restore and the system performs an automatic rollback, some data written by backup observers during the failed operation may be lost. This is what we try to avoid.
> # The second table keeps only bulk-load-related references. We do not care about the consistency of this table, because bulk load is an idempotent operation and can be repeated after a failure. Partially written data in the second table does not affect the BackupHFileCleaner plugin, because this data (the list of bulk loaded files) corresponds to files which have not yet been loaded successfully and hence are not visible to the system.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
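The first two steps of the design above can be illustrated with a minimal sketch. It is an in-memory model only, not the code in the attached patches: `BackupMetaTable`, `BackupOperation`, and `RollbackViaSnapshot` are hypothetical names, plain maps stand in for the backup meta-table and the backup destination, and "cleaning up partial data" is simplified to putting the destination back to its prior contents.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for the backup meta-table (hypothetical, not HBase code). */
class BackupMetaTable {
    private Map<String, String> rows = new HashMap<>();

    void put(String key, String value) { rows.put(key, value); }
    String get(String key) { return rows.get(key); }
    int size() { return rows.size(); }

    /** Step 1: copy the current contents before the operation starts. */
    Map<String, String> snapshot() { return new HashMap<>(rows); }

    /** Step 2: replace the contents with the snapshot after a failure. */
    void restore(Map<String, String> snapshot) { rows = new HashMap<>(snapshot); }
}

/** A create/delete/merge operation that may fail part-way through. */
interface BackupOperation {
    void run(BackupMetaTable meta, Map<String, String> destination) throws Exception;
}

class RollbackViaSnapshot {
    /** Returns true on success, false if the operation failed and was rolled back. */
    static boolean execute(BackupMetaTable meta,
                           Map<String, String> destination,
                           BackupOperation op) {
        Map<String, String> metaSnapshot = meta.snapshot();
        Map<String, String> destinationBefore = new HashMap<>(destination);
        try {
            op.run(meta, destination);
            return true;
        } catch (Exception e) {
            // Clean up partial data in the destination first ...
            destination.clear();
            destination.putAll(destinationBefore);
            // ... then restore the meta-table from the snapshot.
            meta.restore(metaSnapshot);
            return false;
        }
    }
}
```

In this model a failed operation leaves both the meta-table and the destination exactly as they were before it started, which is the property the snapshot is meant to provide.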
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269677#comment-16269677 ] stack commented on HBASE-17852: ---
Write-up helps. Some questions.

bq. When operation fails on a server side, we handle this failure by cleaning up partial data in backup destination, followed by restoring backup meta-table from a snapshot.

Why do this? Why not just mark the backup as corrupt and move on? (Why does an incomplete back-up freeze all backups -- which you say above? I'm trying to understand.)

bq. When operation fails on a client side (abnormal termination, for example), next time user will try create/merge/delete he(she) will see error message, that system is in inconsistent state and repair is required, he(she) will need to run backup repair tool.

What if it's a cron job? Does this inability to move on past failure make it so backup cannot be cron'd?

bq. To avoid multiple writers to the backup system table (backup client and BackupObserver's) we introduce small table ONLY to keep listing of bulk loaded files.

If we weren't snapshotting/restoring the backup table, we wouldn't have to make a separate table to hold bulkloaded files? Is that so? (I'm not asking for a rewrite...).

bq. You are the only one who is objecting to the snapshot-based approach, but I am still waiting for a single argument why this is bad.

I am asking questions to try and understand what is going on in here. When the response is terse or lean on info, I'm going to ask another question... and so on. As to whether snapshot/restore of the meta backup table is 'bad' or not, I'm still trying to understand why we would go to the extreme of offlining a whole table -- even though rare, when in error -- and then, it seems, this offlining is making it so we have to add yet another table just to hold bulk loaded files... Pardon my being slow.

What of my review comments are addressed in latest patch?

Thanks.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265492#comment-16265492 ] Vladimir Rodionov commented on HBASE-17852: ---
{quote}
Let me just assume this stuff is handled, but a walk through of what happens when the backup table goes away in different scenarios would be good.
Is the above answered? (Copied from earlier in this dialog).
{quote}
When the backup meta table goes away, bulk load will continue, because bulk load observers do not write to the main meta table. When the second table (for bulk loaded files) gets offlined, bulk loading fails.
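The behaviour described in this reply follows from the observers writing only to the second table. The sketch below models that with hypothetical names (`SimpleTable`, `BulkLoadObserverSketch`); it is an assumption-laden illustration, not the BackupObserver from the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for a table that can be taken offline (hypothetical). */
class SimpleTable {
    private final List<String> rows = new ArrayList<>();
    private boolean online = true;

    void setOnline(boolean online) { this.online = online; }

    void append(String row) {
        if (!online) {
            throw new IllegalStateException("table is offline");
        }
        rows.add(row);
    }

    int rowCount() { return rows.size(); }
}

/** Observer that records bulk loaded files; it holds no reference to the meta table. */
class BulkLoadObserverSketch {
    private final SimpleTable bulkLoadTable;

    BulkLoadObserverSketch(SimpleTable bulkLoadTable) {
        this.bulkLoadTable = bulkLoadTable;
    }

    /** Writes only to the bulk-load table, so the state of the meta table is irrelevant here. */
    void recordBulkLoad(String hfilePath) {
        bulkLoadTable.append(hfilePath);
    }
}
```

Because the observer never touches the meta table, offlining the meta table for a restore cannot fail a bulk load in this model, while offlining the bulk-load table does.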
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261687#comment-16261687 ] Hadoop QA commented on HBASE-17852: ---
| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | findbugs | 0m 0s | Findbugs executables are not available. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 19 new or modified test files. |
|| || || || master Compile Tests ||
| 0 | mvndep | 0m 11s | Maven dependency ordering for branch |
| +1 | mvninstall | 4m 11s | master passed |
| +1 | compile | 0m 38s | master passed |
| +1 | checkstyle | 0m 36s | master passed |
| +1 | shadedjars | 5m 6s | branch has no errors when building our shaded downstream artifacts. |
| +1 | javadoc | 0m 23s | master passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 14s | Maven dependency ordering for patch |
| +1 | mvninstall | 4m 17s | the patch passed |
| +1 | compile | 0m 39s | the patch passed |
| +1 | javac | 0m 39s | the patch passed |
| -1 | checkstyle | 0m 16s | hbase-backup: The patch generated 3 new + 179 unchanged - 18 fixed = 182 total (was 197) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 26s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 47m 8s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 2.7.4 or 3.0.0-alpha4. |
| +1 | javadoc | 0m 23s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 10m 28s | hbase-backup in the patch passed. |
| +1 | unit | 0m 23s | hbase-it in the patch passed. |
| +1 | asflicense | 0m 18s | The patch does not generate ASF License warnings. |
| | | 75m 39s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-17852 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12898743/HBASE-17852-v8.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux bd11d8af32a0 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 3b2b22b5fa |
| maven | version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
| check
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261499#comment-16261499 ] Hadoop QA commented on HBASE-17852: ---
| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 9s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | findbugs | 0m 0s | Findbugs executables are not available. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 19 new or modified test files. |
|| || || || master Compile Tests ||
| 0 | mvndep | 0m 11s | Maven dependency ordering for branch |
| +1 | mvninstall | 4m 32s | master passed |
| +1 | compile | 0m 40s | master passed |
| +1 | checkstyle | 0m 36s | master passed |
| +1 | shadedjars | 5m 26s | branch has no errors when building our shaded downstream artifacts. |
| +1 | javadoc | 0m 22s | master passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 13s | Maven dependency ordering for patch |
| +1 | mvninstall | 4m 31s | the patch passed |
| +1 | compile | 0m 40s | the patch passed |
| +1 | javac | 0m 40s | the patch passed |
| -1 | checkstyle | 0m 15s | hbase-backup: The patch generated 6 new + 187 unchanged - 10 fixed = 193 total (was 197) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 52s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 52m 1s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 2.7.4 or 3.0.0-alpha4. |
| +1 | javadoc | 0m 23s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 9m 59s | hbase-backup in the patch passed. |
| +1 | unit | 0m 26s | hbase-it in the patch passed. |
| +1 | asflicense | 0m 18s | The patch does not generate ASF License warnings. |
| | | 80m 52s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-17852 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12898716/HBASE-17852-v7.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 13b57d82c6f1 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 984e0ecfc4 |
| maven | version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
| ch
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260212#comment-16260212 ] Vladimir Rodionov commented on HBASE-17852: ---
{quote}
My gut reaction is that the number of backups which would need to be retained in the system (e.g. rows in the hbase backup "system" table) would have to be quite large to even grow beyond a single region (many thousands to millions). As such, the snapshot restore isn't much more than grabbing the write lock and replacing one data file and some Region metadata. This is on my list today to investigate and confirm.
{quote}
Yes, [~elserj], you are right. The backup system table will fit in a single region for the vast majority of deployments. It is metadata, not data. Therefore, creating a snapshot and restoring from a snapshot are very lightweight operations. That was a major reason I chose the rollback-via-snapshot approach.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259911#comment-16259911 ] Vladimir Rodionov commented on HBASE-17852: ---
{quote}
2) recover from client side failure (and, probably, implicitly meant to include un-handled server-side failure conditions too).
{quote}
For 2, we have the backup repair tool. The client will be asked to run the repair tool the next time he/she tries to run backup/restore/merge/delete.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259504#comment-16259504 ] Josh Elser commented on HBASE-17852:
bq. I am not asking for any particular implementation, to be clear. I'm just trying to understand and am having trouble digesting full restore of a meta table whatever the size or traffic on error. It strikes me as whack (You seem to at least agree it is 'overkill')

Got it. To clarify my previous message, by "overkill" I only mean "non-ideal". As in, there is likely a more complicated solution that could accomplish the same net effect with less computation and time required. I didn't mean to say that I believed using a snapshot and table-restore is invalid or wrong.

My gut reaction is that the number of backups which would need to be retained in the system (e.g. rows in the hbase backup "system" table) would have to be quite large to even grow beyond a single region (many thousands to millions). As such, the snapshot restore isn't much more than grabbing the write lock and replacing one data file and some Region metadata. This is on my list today to investigate and confirm.

To try to move the conversation forward, I tend to agree with Vlad that I don't see an inherent problem with the rollback-via-snapshot implementation. Architecturally, Vlad is using the snapshot feature exactly how it was intended to be used (shallow copy and restore of a table).

bq. the idea to offline a system table and then restore from a snapshot on error with clients 'advised' to stop writing as some-sort of 2PC

Let's revisit this again: in the parent JIRA issue, Vlad outlined two cases: 1) recover from a "server-side" failure and 2) recover from a client-side failure (and, probably, implicitly meant to include un-handled server-side failure conditions too).

For #1, clients don't need to do anything special (specifically mentioned on the parent issue). Mutual exclusion is already built in to manage the serialized state in the backup "system" table. So, we're just looking at the cost of these steps. Offline+snapshot+online should be one of the rock-solid features of the system.

For #2, we're in the situation that you outline. Per the concerns you raised about "coordination" (the special handshake, to use another metaphor), this seems mitigate-able via the return code of the {{hbase backup}} command and a prominent error message in this case. I don't know if either presently exists (Could you comment, [~vrodionov]?).

Both of these are predicated on the mutual exclusion of multiple clients at a higher level. Obviously, a finer-grained exclusion strategy is desirable for multiple reasons, but, given my current understanding, I don't see any fundamental problem with this approach.
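The mitigation suggested for case #2 (a return code plus a prominent error message) could look roughly like the sketch below. Everything in it is hypothetical: the class name, the exit-code values, and the boolean flag standing in for "a previous client-side session did not finish" are assumptions for illustration, and the thread itself leaves open whether the real {{hbase backup}} command behaves this way.

```java
/** Sketch of a command entry point that a cron job can check (hypothetical). */
class BackupCommandSketch {
    static final int EXIT_OK = 0;
    // Assumed value; any distinct non-zero code would serve the same purpose.
    static final int EXIT_REPAIR_REQUIRED = 2;

    /**
     * @param previousSessionUnfinished stand-in for detecting an earlier
     *        client-side failure that left the system inconsistent
     * @param err collects the message an operator would see on stderr
     * @return the process exit code
     */
    static int run(boolean previousSessionUnfinished, StringBuilder err) {
        if (previousSessionUnfinished) {
            err.append("ERROR: backup system is in an inconsistent state; ")
               .append("run the backup repair tool before retrying.");
            return EXIT_REPAIR_REQUIRED;
        }
        // The actual backup work would run here.
        return EXIT_OK;
    }
}
```

A cron wrapper could then branch on the exit code, for example alerting an operator instead of silently retrying when the repair code is returned.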
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257974#comment-16257974 ] Vladimir Rodionov commented on HBASE-17852: ---
{quote}
This isn't helpful and, likely, directly harmful :\
{quote}
What about "what the fuck Vlad", Josh? Is it harmful?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257972#comment-16257972 ] Vladimir Rodionov commented on HBASE-17852: ---
There is nothing wrong with the snapshot approach until someone proves it is wrong. I am waiting for your arguments, Stack.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257968#comment-16257968 ] Vladimir Rodionov commented on HBASE-17852: ---
{quote}
Vlad, you seem to be doing your utmost to sabotage the delivery of this feature.
{quote}
Do you really believe that? Josh is just too polite, in my opinion. He is trying to be good with you. I am just who I am. I am too straightforward, Stack. The only person who sabotages this feature here is you.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257930#comment-16257930 ] stack commented on HBASE-17852: ---
bq. Stack, are you essentially asking why this isn't implemented on top of ProcV2?
bq. I think at this point, it would be more productive if we can say more "there is something implicitly broken with this approach" instead of "there is a more elegant implementation to be had".

I am not asking for any particular implementation, to be clear. I'm just trying to understand and am having trouble digesting full restore of a meta table whatever the size or traffic on error. It strikes me as whack (You seem to at least agree it is 'overkill').

There seems to be no write-up on the approach here ahead of piecemeal code drops (w/o overarching description of what all is entailed), so the only way to figure it out, as best as I can ascertain, is via this really pleasant back and forth w/ the author.

Vlad, you seem to be doing your utmost to sabotage the delivery of this feature. The sort of answers you give us reviewers is one thing. Will operators who run into issues w/ this feature get the same treatment?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257920#comment-16257920 ] Josh Elser commented on HBASE-17852:
{code}
In hbase2, we have builders for the below instead...
HTableDescriptor tableDesc = new HTableDescriptor(getTableNameForBulkLoadedData(conf));
{code}
I had left a similar comment on RB. This was fixed in v6 (patchset 5 on RB). I think the majority of other changes were suggestions I had left on RB -- have not explicitly checked, just going off of the "issues" being resolved.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257919#comment-16257919 ] Josh Elser commented on HBASE-17852:
bq. This is going to confuse. 'system' tables have a particular meaning in hbase.

Should be easy enough to rename with your IDE of choice, right Vlad? Avoiding overloading terminology is always a good idea. "BackupMetadata" and "BackupBulkLoadFiles"? (just pitching ideas)

bq. The snapshot/restore of a whole system table strikes me as a bunch of moving parts. I have to ask why we got such an extreme? 2PC is tough-enough w/o offlining/restore of whole meta table. During restore, all clients are frozen out or something so they can't pollute the restored version? Restore is not atomic, right? We couldn't have something like a row-per-backup with a success tag if all went well (I've not been following closely – pardon all the questions).

Stack, are you essentially asking why this isn't implemented on top of ProcV2? I'm trying to read between the lines but am not sure if I'm inventing something that isn't there.

There are definitely areas of the code in which the acknowledgement has already been made that a better implementation can be done. For example, clients _are_ "frozen out" right now from concurrent operations (a nod that backups, merges, and restores could be done concurrently).

I think at this point, it would be more productive if we can say more "there is something implicitly broken with this approach" instead of "there is a more elegant implementation to be had". I don't think anyone is arguing against that. Yes, rolling back the entire backup "system" table is overkill (for what may sometimes be deleting a single row/column -- the ACTIVE_SNAPSHOT as mentioned in the parent) and would take much longer than it necessarily needs to.

bq. You suggest I review code. I have been reviewing code. That's how we got here.

And thank you for that. I know your intentions are good. We're all ultimately working towards a common goal here.

bq. Sure, you can start from very beginning, Stack. Go ahead.

This isn't helpful and, likely, directly harmful :\
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257737#comment-16257737 ] Vladimir Rodionov commented on HBASE-17852:

{quote} So, the idea to offline a system table and then restore from a snapshot on error with clients 'advised' to stop writing as some-sort of 2PC got buy-in from others? This is 'fault-tolerance'? Is there a write-up somewhere that explains why we have to offline and then restore a whole table (whatever its size) just because a particular op failed and how it is more simple and elegant than other solutions (what others?), I'd like to read it. Otherwise, I just don't get it (neither will the operator whose cron job failed because backup table was gone when it ran). {quote}

Stack, you are just out of context right now, but I appreciate that you want to spend so much time digging into my code once again. Thanks. You are the only one objecting to the snapshot-based approach, but I am still waiting for a single argument as to why it is bad.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257729#comment-16257729 ] Vladimir Rodionov commented on HBASE-17852:

{quote} You suggest I review code. I have been reviewing code. That's how we got here. {quote}

Sure, you can start from very beginning, Stack. Go ahead.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257706#comment-16257706 ] Hadoop QA commented on HBASE-17852:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 2m 25s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | findbugs | 0m 0s | Findbugs executables are not available. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 6m 11s | master passed |
| +1 | compile | 0m 41s | master passed |
| +1 | checkstyle | 0m 18s | master passed |
| +1 | shadedjars | 6m 7s | branch has no errors when building our shaded downstream artifacts. |
| +1 | javadoc | 0m 19s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 6m 36s | the patch passed |
| +1 | compile | 0m 33s | the patch passed |
| +1 | javac | 0m 33s | the patch passed |
| -1 | checkstyle | 0m 25s | hbase-backup: The patch generated 4 new + 69 unchanged - 6 fixed = 73 total (was 75) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 6m 23s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 64m 41s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 2.7.4 or 3.0.0-alpha4. |
| +1 | javadoc | 0m 16s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 13m 16s | hbase-backup in the patch passed. |
| +1 | asflicense | 0m 9s | The patch does not generate ASF License warnings. |
| | | 101m 53s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-17852 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12898274/HBASE-17852-v6.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 5973dfc868e3 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / ca74ec7740 |
| maven | version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
| checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/9901/artifact/patchprocess/diff-checkstyle-hbase-backup.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/9901/testReport/ |
| modules | C: hbase-backup U: hbase-backup |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/9901/console |
| Powered by | Apache Yetus 0.6.0 http://yetus.apache.org |

This message was automat
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257621#comment-16257621 ] stack commented on HBASE-17852:

My comments above are not in RB. Were they addressed? Patches should include a description. It helps reviewers and those trying to follow behind. Yours have none. You don't use the suggested patch-making tool either, in spite of an earlier request.

So, the idea to offline a system table and then restore from a snapshot on error with clients 'advised' to stop writing as some-sort of 2PC got buy-in from others? This is 'fault-tolerance'? Is there a write-up somewhere that explains why we have to offline and then restore a whole table (whatever its size) just because a particular op failed and how it is more simple and elegant than other solutions (what others?), I'd like to read it. Otherwise, I just don't get it (neither will the operator whose cron job failed because backup table was gone when it ran).

You suggest I review code. I have been reviewing code. That's how we got here.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257590#comment-16257590 ] Vladimir Rodionov commented on HBASE-17852:

You can find them as fixed on RB: https://reviews.apache.org/r/63155/
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257580#comment-16257580 ] Vladimir Rodionov commented on HBASE-17852:

{quote} The snapshot/restore of a whole system table strikes me as a bunch of moving parts. {quote}

That is only one backup system table.

{quote} I have to ask why we got such an extreme? {quote}

What is so extreme here? A snapshot of a system table? I consider this approach much more simple and elegant than the others.

{quote} During restore, all clients are frozen out or something so they can't pollute the restored version? {quote}

Yes. During a table restore operation, all clients (of this table) must be stopped. In theory, this is not a hard requirement; it is just advice. But we truncate the table before the restore, and this may definitely affect incoming writes in unexpected ways. Is there any database system that allows writes to a table while that table is being restored?

Stack, if you have doubts about the implementation, I suggest you go over the code and find the places where you think it has issues.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257572#comment-16257572 ] stack commented on HBASE-17852:

bq. v6 addresses some of the RB comments

Which comments were addressed?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257561#comment-16257561 ] stack commented on HBASE-17852:

bq. That is why we take snapshot of a backup system table and restore this table from snapshot, previously taken, in case of a command (create/delete/merge) failure.

Was this written up somewhere previously and the design shopped before others with buy-in? The snapshot/restore of a whole system table strikes me as a bunch of moving parts. I have to ask why we got such an extreme? 2PC is tough-enough w/o offlining/restore of whole meta table. During restore, all clients are frozen out or something so they can't pollute the restored version? Restore is not atomic, right? We couldn't have something like a row-per-backup with a success tag if all went well (I've not been following closely -- pardon all the questions).
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257502#comment-16257502 ] Vladimir Rodionov commented on HBASE-17852:

Backup/Delete/Merge operations must be executed in a transactional manner. The backup system table keeps the data (meta-data) which allows us to run backups and other commands. During backup create, delete or merge, we update the backup system table multiple times, and we do not want these updates to be partial ones (when the operation fails), because *it will prevent further backups/deletes/merges after a failure*. That is why we take a snapshot of the backup system table and, if a command (create/delete/merge) fails, restore the table from that previously taken snapshot.

By consistency of data I mean that no partial updates should be visible to a user after the operation completes (either successfully or not). Partial updates in the backup system table == corruption of a system table and MUST be avoided. When corruption happens, the only way to restore the backup system is to truncate the backup system table and re-run all backups in full mode.
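The snapshot-then-restore rollback described in the comment above can be sketched as follows. This is only a minimal in-memory illustration, not the HBase backup code: `BackupMetaTable`, `run_with_rollback`, and the row contents are hypothetical names invented for the example, and a plain dictionary stands in for the backup system table.

```python
import copy


class BackupMetaTable:
    """Hypothetical in-memory stand-in for the backup system (meta-data) table."""

    def __init__(self):
        self.rows = {}

    def snapshot(self):
        # Capture the full table contents before an operation starts.
        return copy.deepcopy(self.rows)

    def restore(self, snap):
        # Mirrors "truncate, then restore": current contents are discarded.
        self.rows = copy.deepcopy(snap)


def run_with_rollback(table, operation):
    """Run operation(table); on any failure, restore the pre-operation snapshot."""
    snap = table.snapshot()
    try:
        operation(table)
        return True
    except Exception:
        table.restore(snap)
        return False


def failing_backup(table):
    # Two partial updates, followed by a simulated failure.
    table.rows["backup_2"] = {"state": "RUNNING"}
    table.rows["backup_2"]["wal_ts"] = 12345
    raise RuntimeError("simulated server-side failure")


def good_backup(table):
    table.rows["backup_3"] = {"state": "COMPLETE"}
```

After a failed operation no partial update remains visible, which is the consistency property the comment asks for. The sketch also shows the cost raised elsewhere in this thread: the whole table is copied and replaced even when only one row changed.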
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16256202#comment-16256202 ] stack commented on HBASE-17852:

Thanks for the pointer. I'd not read it previously. It does not answer my question though: "Why would you restore a backup system table from a snapshot when a 'backup' fails? Backups are of user-space tables. How does this impinge on the backup 'system' table?"

"in case if operation fails to restore meta - data consistency in a backup system table..." Yeah, which operation? Which meta? A backup meta? What consistency needs to be maintained in the backup table?
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16256193#comment-16256193 ] Vladimir Rodionov commented on HBASE-17852:

Please refer to the parent ticket for a description of what we do in case of a failure: https://issues.apache.org/jira/browse/HBASE-15227

In a few words, we take a snapshot of the backup system table before backup/merge/delete and, if the operation fails, restore the table from that snapshot in order to restore meta-data consistency in the backup system table.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16256183#comment-16256183 ] stack commented on HBASE-17852:

bq. When backup fails, we restore backup system table from snapshot.

Why would you restore a backup system table from a snapshot when a 'backup' fails? Backups are of user-space tables. How does this impinge on the backup 'system' table?

bq. If Observers write to the same table as general backup operation, some data from Observers may be lost when we restore table from snapshot. I thought, I explained that.

Where? Is there a writeup on how this all works? (It is not in the user-guide)

bq. They are system from the point of view of a user. checkSystemTable checks backup system table.

This is going to confuse. 'system' tables have a particular meaning in hbase.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16256126#comment-16256126 ] Vladimir Rodionov commented on HBASE-17852:

{quote} In the patch, it is called a ' Backup System table name for bulk loaded files' ... but it is not a system table? Is that so? And this is different from the BackupSystemTable which also is not a system table, right? BackupSystemTable does a checkSystemTable(); It is checking system tables, or not? {quote}

They are system from the point of view of a user. checkSystemTable checks backup system table.
[jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16256124#comment-16256124 ] Vladimir Rodionov commented on HBASE-17852:

{quote} How do general backup ops and bulk loaded files affect each other? {quote}

When backup fails, we restore backup system table from snapshot. If Observers write to the same table as general backup operation, some data from Observers may be lost when we restore table from snapshot. I thought, I explained that.

The second table keeps only bulk-load-related references. We do not care about the consistency of this table, because bulk load is an idempotent operation and can be repeated after a failure. Partially written data in the second table does not affect the BackupHFileCleaner plugin, because this data (the list of bulk loaded files) corresponds to files which *have not been loaded successfully yet* and hence are not visible to the system.
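The reason for keeping observer writes in a separate table can be illustrated with a small sketch. It assumes nothing about the real HBase classes: `rollback_demo`, the dictionaries, and the `bulkload:` key prefix are hypothetical, and restoring from a snapshot is modelled as clearing a dictionary and refilling it from a copy.

```python
import copy


def rollback_demo(shared_table):
    """Return the bulk-load records that survive a failed backup's rollback."""
    meta = {"backup_1": "COMPLETE"}
    # Observers write either into the meta table itself or into their own table.
    bulk = meta if shared_table else {}

    snap = copy.deepcopy(meta)          # snapshot taken before the backup starts
    meta["backup_2"] = "RUNNING"        # partial update by the backup client
    bulk["bulkload:hfile_a"] = "t1"     # concurrent write by an observer

    # The backup fails: restore the meta table from the snapshot.
    meta.clear()
    meta.update(snap)

    return sorted(k for k in bulk if k.startswith("bulkload:"))
```

With `shared_table=True` the observer's record is wiped by the restore; with `shared_table=False` it survives, because the rollback never touches the second table.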