[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571002#comment-13571002 ]

Hudson commented on HBASE-7643:
-------------------------------

Integrated in HBase-0.94-security-on-Hadoop-23 #11 (See [https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/11/])
HBASE-7643 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss (Revision 1438973)

Result = FAILURE
mbertozzi :
Files :
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/backup/HFileArchiver.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/backup/TestHFileArchiving.java

HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
-------------------------------------------------------------------------------

                 Key: HBASE-7643
                 URL: https://issues.apache.org/jira/browse/HBASE-7643
             Project: HBase
          Issue Type: Bug
    Affects Versions: hbase-6055, 0.96.0
            Reporter: Matteo Bertozzi
            Assignee: Matteo Bertozzi
            Priority: Blocker
             Fix For: 0.96.0, 0.94.5
         Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch

* The master has an HFile cleaner thread that is responsible for cleaning the /hbase/.archive dir
** /hbase/.archive/table/region/family/hfile
** if the table/region/family directory is empty, the cleaner removes it
* The master can archive files (from another thread, e.g. DeleteTableHandler)
* The region can archive files (from another server/process, e.g. compaction)

The simplified file archiving code looks like this:
{code}
HFileArchiver.resolveAndArchive(...) {
  // ensure that the archive dir exists
  fs.mkdir(archiveDir);
  // move the file to the archive
  success = fs.rename(originalPath/fileName, archiveDir/fileName);
  // if the rename failed, delete the file without archiving
  if (!success) fs.delete(originalPath/fileName);
}
{code}

Since there is no synchronization between HFileArchiver.resolveAndArchive() and the cleaner run (different process, thread, ...), you can end up moving a file into a directory that no longer exists:
{code}
fs.mkdir(archiveDir);
// The HFileCleaner chore starts at this point, and the archive directory
// that we just ensured to be present gets removed.
// The rename will then fail since the parent directory is missing.
success = fs.rename(originalPath/fileName, archiveDir/fileName);
{code}

The bad thing about deleting the file without archiving is that if a snapshot, or a cloned table, relies on that file being present, you are losing data.

Possible solutions:
* Create a ZooKeeper lock, to notify the master ("Hey, I'm archiving something, wait a bit")
* Add an RS -> Master call to let the master remove files and avoid this kind of situation
* Avoid removing empty directories from the archive if the table exists or is not disabled
* Add a retry loop around the fs.rename

The last one, the easiest one, looks like:
{code}
for (int i = 0; i < retries; ++i) {
  // ensure the archive directory is present
  fs.mkdir(archiveDir);  // possible race with the cleaner
  // try to archive the file
  success = fs.rename(originalPath/fileName, archiveDir/fileName);
  if (success) break;
}
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
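The retry loop in the description can be exercised end to end. The following is a hypothetical, self-contained sketch that uses java.nio.file on a local filesystem in place of the actual Hadoop FileSystem API; the class and method names here are invented for the example:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical illustration of the retry-based fix (the fourth option above).
public class ArchiveRetry {
    // Try up to `retries` times to move `file` under `archiveDir`,
    // re-creating the archive directory before each attempt so that a
    // concurrent cleaner deleting it costs at most one extra iteration.
    public static boolean archiveWithRetries(Path file, Path archiveDir, int retries)
            throws IOException {
        Path target = archiveDir.resolve(file.getFileName());
        for (int i = 0; i < retries; i++) {
            Files.createDirectories(archiveDir); // possible race: cleaner may remove it here
            try {
                Files.move(file, target);        // the "rename" step
                return true;                     // archived successfully
            } catch (NoSuchFileException e) {
                // parent directory vanished between mkdir and move; loop and retry
            }
        }
        return false; // unlike the old code, the caller no longer deletes the file blindly
    }
}
```

Because Files.move throws when the target's parent directory has vanished, catching the exception and looping gives the same "mkdir again and retry" behavior as the patch's rename loop, without ever deleting the source file.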
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563643#comment-13563643 ]

Lars Hofhansl commented on HBASE-7643:
--------------------------------------

You going to commit, [~mbertozzi]?
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563648#comment-13563648 ]

Matteo Bertozzi commented on HBASE-7643:
----------------------------------------

Committed to trunk and 0.94, thanks guys for the review!
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563675#comment-13563675 ]

Hudson commented on HBASE-7643:
-------------------------------

Integrated in HBase-TRUNK #3806 (See [https://builds.apache.org/job/HBase-TRUNK/3806/])
HBASE-7643 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss (Revision 1438972)

Result = FAILURE
mbertozzi :
Files :
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/backup/HFileArchiver.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/backup/TestHFileArchiving.java
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563683#comment-13563683 ]

Hudson commented on HBASE-7643:
-------------------------------

Integrated in HBase-0.94 #782 (See [https://builds.apache.org/job/HBase-0.94/782/])
HBASE-7643 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss (Revision 1438973)

Result = FAILURE
mbertozzi :
Files :
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/backup/HFileArchiver.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/backup/TestHFileArchiving.java
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563691#comment-13563691 ]

Hudson commented on HBASE-7643:
-------------------------------

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #377 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/377/])
HBASE-7643 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss (Revision 1438972)

Result = FAILURE
mbertozzi :
Files :
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/backup/HFileArchiver.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/backup/TestHFileArchiving.java
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563733#comment-13563733 ]

Hudson commented on HBASE-7643:
-------------------------------

Integrated in HBase-0.94-security #99 (See [https://builds.apache.org/job/HBase-0.94-security/99/])
HBASE-7643 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss (Revision 1438973)

Result = FAILURE
mbertozzi :
Files :
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/backup/HFileArchiver.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/backup/TestHFileArchiving.java
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562883#comment-13562883 ]

Jesse Yates commented on HBASE-7643:
------------------------------------

Looks good to me too. Annoying that this isn't covered by the master not doing a recursive delete of the directory (which fails if there are things underneath)... grr, race conditions. Thanks Matteo!
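Jesse's point about non-recursive deletes can be seen with a small local-filesystem sketch (hypothetical names, java.nio.file instead of the Hadoop API): a non-recursive delete does fail on a non-empty directory, but the cleaner's emptiness check and its delete are still two separate filesystem operations, so a rename that lands between them is not protected.

```java
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration: a non-recursive delete, mirroring the cleaner's
// behavior of only removing empty archive directories.
public class NonRecursiveDelete {
    // Attempt to delete `dir` without recursing; returns false if it still
    // contains files. Note this check-then-delete is itself not atomic with
    // respect to concurrent writers, which is the race Jesse describes.
    public static boolean tryDeleteDir(Path dir) {
        try {
            Files.delete(dir); // throws DirectoryNotEmptyException if non-empty
            return true;
        } catch (DirectoryNotEmptyException e) {
            return false;      // something was moved in; leave the directory alone
        } catch (Exception e) {
            return false;      // already gone, permissions, etc.
        }
    }
}
```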
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13561500#comment-13561500 ]

Hadoop QA commented on HBASE-7643:
----------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12566268/HBASE-7653-p4-v6.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.
{color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4156//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4156//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4156//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4156//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4156//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4156//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4156//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4156//console

This message is automatically generated.
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562136#comment-13562136 ] Lars Hofhansl commented on HBASE-7643: --
Patch looks good. Curious about RETRIES=6: how did you come up with 6? This should be a very rare condition, so I would think something like 3 is better.
Looking at all the deletingXXXWithoutArchiving methods: should we rename these all to deletingXXXWithoutArchivingForTests? These should only ever be called during the tests, right?
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562143#comment-13562143 ] Matteo Bertozzi commented on HBASE-7643:
I wanted a while (true) just to be sure, but unless you're spinning like in the test, just one retry should be fine. I'll change it to 3.
Some of the deletingXXXWithoutArchiving() methods are used in the main code, but mostly to delete the empty directory from the original location.
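The bounded-retry fix being discussed (re-create the archive directory on each attempt instead of looping forever) can be sketched as follows. This is a minimal local-filesystem sketch; RetryArchiver, archiveWithRetries, and selfTest are illustrative names and not the patch's actual code, which works against the Hadoop FileSystem API.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class RetryArchiver {
    /**
     * Try to move a file into the archive directory, re-creating the
     * directory on each attempt in case a concurrent cleaner removed it
     * between the mkdirs() and the rename. Returns true once the rename
     * succeeds. On false, the caller must NOT fall back to deleting the
     * source file: that fallback is the data-loss bug.
     */
    public static boolean archiveWithRetries(File file, File archiveDir, int retries) {
        for (int i = 0; i < retries; ++i) {
            archiveDir.mkdirs();  // ensure the archive dir is present again
            if (file.renameTo(new File(archiveDir, file.getName()))) {
                return true;
            }
        }
        return false;
    }

    // Tiny self-check: create a temp file and archive it in one attempt.
    public static boolean selfTest() throws IOException {
        File root = Files.createTempDirectory("retry-archiver").toFile();
        File archiveDir = new File(root, "archive");
        File f = new File(root, "hfile1");
        if (!f.createNewFile()) return false;
        return archiveWithRetries(f, archiveDir, 3)
            && new File(archiveDir, "hfile1").exists();
    }
}
```

Since the cleaner can only remove the directory while it is empty, a lost race just means the next iteration re-creates it, so a small constant such as 3 leaves ample headroom.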
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562251#comment-13562251 ] Matteo Bertozzi commented on HBASE-7643:
I noticed that the deletingXXXWithoutArchiving() methods are private, so they are not used by tests.
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562269#comment-13562269 ] Lars Hofhansl commented on HBASE-7643: --
Sorry, what I meant to say is that the conditions for these to be called only occur in tests.
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562312#comment-13562312 ] Hadoop QA commented on HBASE-7643: --
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12566425/HBASE-7653-p4-v7.patch
against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.
{color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100.
{color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.TestZooKeeper
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4173//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4173//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4173//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4173//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4173//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4173//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4173//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4173//console
This message is automatically generated.
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562461#comment-13562461 ] Lars Hofhansl commented on HBASE-7643: --
+1 on v7. Mr. [~jesse_yates], wanna have a look too?
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13560939#comment-13560939 ] Hadoop QA commented on HBASE-7643: --
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12566153/HBASE-7653-p4-v5.patch
against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.
{color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:red}-1 findbugs{color}. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100.
{color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.TestLocalHBaseCluster
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4149//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4149//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4149//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4149//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4149//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4149//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4149//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4149//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4149//console
This message is automatically generated.
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13561143#comment-13561143 ] Matteo Bertozzi commented on HBASE-7643:
Committed p4-v5 to the snapshot branch, to have more coverage (jenkins, test rig, ...). I'll commit it to trunk in a couple of days if everything is fine and there are no objections.
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13561355#comment-13561355 ] Lars Hofhansl commented on HBASE-7643: --
Feel free to commit to 0.94 as well.
--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira