[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020849#comment-16020849 ]

Hudson commented on HBASE-17938:

FAILURE: Integrated in Jenkins build HBase-HBASE-14614 #244 (See [https://builds.apache.org/job/HBase-HBASE-14614/244/])
HBASE-17938 General fault - tolerance framework for backup/restore (tedyu: rev 305ffcb04025ea6f7880e9961120d309f55bf8ba)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/BackupAdminImpl.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/BackupCommands.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/BackupManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/FullTableBackupClient.java
* (add) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/BackupClientFactory.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/BackupDriver.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/BackupSystemTable.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/BackupRestoreConstants.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/IncrementalTableBackupClient.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/TableBackupClient.java
* (add) hbase-server/src/test/java/org/apache/hadoop/hbase/backup/TestFullBackupWithFailures.java

> General fault - tolerance framework for backup/restore operations
>
> Key: HBASE-17938
> URL: https://issues.apache.org/jira/browse/HBASE-17938
> Project: HBase
> Issue Type: Sub-task
> Reporter: Vladimir Rodionov
> Assignee: Vladimir Rodionov
> Fix For: 2.0.0
>
> Attachments: HBASE-17938-v1.patch, HBASE-17938-v2.patch, HBASE-17938-v3.patch, HBASE-17938-v4.patch, HBASE-17938-v5.patch, HBASE-17938-v6.patch, HBASE-17938-v7.patch, HBASE-17938-v8.patch
>
> The framework must take care of all general types of failures during backup/restore and restore the system to its original state in case of a failure.
> That won't solve all possible issues, but we have separate JIRAs for them as sub-tasks of HBASE-15277.

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008530#comment-16008530 ]

Hudson commented on HBASE-17938:

FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #2998 (See [https://builds.apache.org/job/HBase-Trunk_matrix/2998/])
HBASE-17938 General fault - tolerance framework for backup/restore (tedyu: rev 305ffcb04025ea6f7880e9961120d309f55bf8ba)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/BackupManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/BackupCommands.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/FullTableBackupClient.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/TableBackupClient.java
* (add) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/BackupClientFactory.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/BackupRestoreConstants.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/BackupDriver.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/BackupAdminImpl.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/BackupSystemTable.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/backup/impl/IncrementalTableBackupClient.java
* (add) hbase-server/src/test/java/org/apache/hadoop/hbase/backup/TestFullBackupWithFailures.java
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006711#comment-16006711 ]

Vladimir Rodionov commented on HBASE-17938:

Test failures are unrelated.
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005837#comment-16005837 ]

Hadoop QA commented on HBASE-17938:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 41s | Docker mode activated. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 7m 45s | master passed |
| +1 | compile | 1m 43s | master passed |
| +1 | checkstyle | 1m 38s | master passed |
| +1 | mvneclipse | 0m 29s | master passed |
| +1 | findbugs | 4m 59s | master passed |
| +1 | javadoc | 1m 7s | master passed |
| +1 | mvninstall | 1m 50s | the patch passed |
| +1 | compile | 1m 32s | the patch passed |
| +1 | javac | 1m 32s | the patch passed |
| +1 | checkstyle | 1m 27s | the patch passed |
| +1 | mvneclipse | 0m 26s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | hadoopcheck | 57m 4s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha2. |
| +1 | findbugs | 4m 32s | the patch passed |
| +1 | javadoc | 0m 52s | the patch passed |
| -1 | unit | 109m 58s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 40s | The patch does not generate ASF License warnings. |
| | | 197m 32s | |

|| Reason || Tests ||
| Timed out junit tests | org.apache.hadoop.hbase.snapshot.TestMobSecureExportSnapshot |
| | org.apache.hadoop.hbase.snapshot.TestSecureExportSnapshot |
| | org.apache.hadoop.hbase.snapshot.TestMobExportSnapshot |
| | org.apache.hadoop.hbase.snapshot.TestExportSnapshot |

|| Subsystem || Report/Notes ||
| Docker | Client=17.03.0-ce Server=17.03.0-ce Image:yetus/hbase:757bf37 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12867473/HBASE-17938-v8.patch |
| JIRA Issue | HBASE-17938 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 1408fe7d28f1 4.8.3-std-1 #1 SMP Fri Oct 21 11:15:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / c5cc81d |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/6757/artifact/patchprocess/patch-unit-hbase-server.txt |
| unit test logs | https://builds.apache.org/job/PreCommit-HBASE-Build/6757/artifact/patchprocess/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/6757/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/6757/console |
| Powered by | Apache Yetus 0.3.0 http://yetus.apache.org |
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005657#comment-16005657 ]

Ted Yu commented on HBASE-17938:

Can you submit the patch for a QA run?
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003970#comment-16003970 ]

Ted Yu commented on HBASE-17938:

For XXTableBackupClient, you can make failStageIf() a no-op. In the test(s), create a test client that extends XXTableBackupClient with a failStageIf() that does fault injection. This would reduce code duplication.
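The suggestion above (a no-op failStageIf() in the production client, overridden by a test subclass that injects faults) could look roughly like the following. This is only a sketch under stated assumptions: the class names, the Stage enum, and the execute() loop are hypothetical stand-ins, not the code in the patch; only the method name failStageIf() comes from the comment.

```java
import java.io.IOException;

public class FailStageSketch {

  /** Hypothetical stage names; the real patch may define different ones. */
  public enum Stage { STAGE_0, STAGE_1, STAGE_2 }

  /** Hypothetical stand-in for a production XXTableBackupClient. */
  public static class TableBackupClientSketch {

    /** No-op in production code; test subclasses override it to inject faults. */
    protected void failStageIf(Stage stage) throws IOException {
      // intentionally empty
    }

    /** Walks through the stages in order and returns the last one completed. */
    public Stage execute() throws IOException {
      Stage last = null;
      for (Stage stage : Stage.values()) {
        failStageIf(stage); // hook is called before each stage's work
        last = stage;       // the real work of the stage would happen here
      }
      return last;
    }
  }

  /** Test-only client that throws when the chosen stage is reached. */
  public static class FailingBackupClientSketch extends TableBackupClientSketch {
    private final Stage failAt;

    public FailingBackupClientSketch(Stage failAt) {
      this.failAt = failAt;
    }

    @Override
    protected void failStageIf(Stage stage) throws IOException {
      if (stage == failAt) {
        throw new IOException("Injected failure at " + stage);
      }
    }
  }
}
```

With this shape, the production execute() path is shared by both classes, so a separate executeForTesting() copy would not be needed.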
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003833#comment-16003833 ]

Vladimir Rodionov commented on HBASE-17938:

We have discussed this already. If some step fails during failBackup execution, the user will be notified of the failure and advised to run the repair tool manually. I will fix the wording of the IOException for the case where the operation fails in the repair phase (failBackup).
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003820#comment-16003820 ]

Ted Yu commented on HBASE-17938:

In cleanupAndRestoreBackupSystem(),
{code}
if (type == BackupType.FULL) {
  deleteSnapshots(conn, backupInfo, conf);
  cleanupExportSnapshotLog(conf);
}
restoreBackupTable(conn, conf);
{code}
What if deleteSnapshots() throws an exception? restoreBackupTable() would be skipped.
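To illustrate the concern raised above, here is one possible way to make sure a later cleanup step still runs when an earlier one throws. It is a sketch of an alternative, not what the patch does (the thread settles on notifying the user and running the repair tool manually). The Step interface and runAll() helper are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.util.List;

public class CleanupOrderSketch {

  /** Hypothetical cleanup step that may fail with an IOException. */
  @FunctionalInterface
  public interface Step {
    void run() throws IOException;
  }

  /**
   * Runs every step even if an earlier one throws. The first failure is
   * rethrown after all steps have been attempted; later failures are
   * attached to it as suppressed exceptions.
   */
  public static void runAll(List<Step> steps) throws IOException {
    IOException first = null;
    for (Step step : steps) {
      try {
        step.run();
      } catch (IOException e) {
        if (first == null) {
          first = e;
        } else {
          first.addSuppressed(e);
        }
      }
    }
    if (first != null) {
      throw first;
    }
  }
}
```

Under this scheme, a failure in a step such as deleteSnapshots() would still be reported to the caller, but a following step such as restoreBackupTable() would be attempted first.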
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995868#comment-15995868 ]

Ted Yu commented on HBASE-17938:

Somehow review board doesn't accept my review comments.
{code}
conn = ConnectionFactory.createConnection(getConf());
{code}
Where is conn released?
{code}
firstBackup = savedStartCode == null || Long.parseLong(savedStartCode) == 0L;
{code}
What if a second client comes and sees the savedStartCode as zero (written by line 130)?
There is duplicate code between executeForTesting() and execute(). Extract the common code.
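On the question of where conn is released: one common Java idiom for this is try-with-resources, which closes the resource even when the body throws. The sketch below uses a hypothetical FakeConnection stand-in so that it runs without HBase on the classpath; it does not show how the patch itself handles the connection.

```java
import java.io.Closeable;
import java.io.IOException;

public class ConnectionReleaseSketch {

  /** Hypothetical stand-in for a closeable connection object. */
  public static class FakeConnection implements Closeable {
    private boolean closed;

    public boolean isClosed() {
      return closed;
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  /** Hypothetical unit of work performed with an open connection. */
  @FunctionalInterface
  public interface Work {
    void run(FakeConnection conn) throws IOException;
  }

  /** Runs the work and closes the connection whether or not the work throws. */
  public static void runWithConnection(FakeConnection conn, Work work) throws IOException {
    try (FakeConnection c = conn) {
      work.run(c);
    }
  }
}
```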
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993471#comment-15993471 ]

Vladimir Rodionov commented on HBASE-17938:

OK.
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993305#comment-15993305 ]

Ted Yu commented on HBASE-17938:

Vlad: Mind adding a unit test that exercises the newly added code?
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15991883#comment-15991883 ]

Devaraj Das commented on HBASE-17938:

I agree. Let's keep it simple for now. Once we get more experience, we can enhance as needed.
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15991862#comment-15991862 ]

Vladimir Rodionov commented on HBASE-17938:

{quote}
Currently fault handling is coarse grained (see cleanupAndRestoreBackupSystem). I suggest investigating fine grained approach where last successful sub-step is recorded so that subsequent run can omit unnecessary work.
{quote}
The same as above. Complex algorithms have proved to be fragile. There is no need for additional complexity here. The repair operation is very fast (it deletes some files and restores the table from a snapshot).
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15991857#comment-15991857 ]

Vladimir Rodionov commented on HBASE-17938:

{quote}
Assuming IOE may come out of each of the calls above, shouldn't a state machine be designed for more robustness ?
{quote}
There is no need to overcomplicate the feature. If system repair fails in the middle, the user will be notified and will have a chance to fix everything by running the repair tool manually. The repair operation is idempotent and can be run as many times as needed.
Did I address your concerns, [~tedyu]?
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15991816#comment-15991816 ]

Ted Yu commented on HBASE-17938:

I did look at the review board 5 days ago - my latest comments were not addressed.
Currently, fault handling is coarse-grained (see cleanupAndRestoreBackupSystem). I suggest investigating a fine-grained approach where the last successful sub-step is recorded so that a subsequent run can omit unnecessary work.
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15991233#comment-15991233 ]

Vladimir Rodionov commented on HBASE-17938:

[~tedyu], when you have time, please take a look at the above RB submission.
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985724#comment-15985724 ]

Vladimir Rodionov commented on HBASE-17938:

https://reviews.apache.org/r/58757/
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985614#comment-15985614 ]

Ted Yu commented on HBASE-17938:

{code}
+throw new IOException("Active session found, aborted command execution");
{code}
Include information about the first session in the message.
{code}
+ if (type == BackupType.FULL) {
+   deleteSnapshots(conn, backupInfo, conf);
+   cleanupExportSnapshotLog(conf);
+ }
+ restoreBackupTable(conn, conf);
+ deleteBackupTableSnapshot(conn, conf);
+ // clean up the uncompleted data at target directory if the ongoing backup has already entered
+ // the copy phase
+ // For incremental backup, DistCp logs will be cleaned with the targetDir.
+ cleanupTargetDir(backupInfo, conf);
{code}
Assuming an IOE may come out of each of the calls above, shouldn't a state machine be designed for more robustness?
Please put the next patch on review board.
[jira] [Commented] (HBASE-17938) General fault - tolerance framework for backup/restore operations
[ https://issues.apache.org/jira/browse/HBASE-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983725#comment-15983725 ]

Ted Yu commented on HBASE-17938:

{code}
+snapshotBackupSystemTable();
{code}
The backup table is not in the hbase namespace anymore, right? Rename the above method and other new methods.
{code}
-LOG.debug("when deleting snapshot " + snapshotName, ioe);
+LOG.error("when deleting snapshot " + snapshotName, ioe);
{code}
What's the action for the above error?