[jira] [Commented] (HDFS-8486) DN startup may cause severe data loss
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963432#comment-14963432 ]

Dave Marion commented on HDFS-8486:
-----------------------------------

Thanks for the quick response!

> DN startup may cause severe data loss
> -------------------------------------
>
>                 Key: HDFS-8486
>                 URL: https://issues.apache.org/jira/browse/HDFS-8486
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 0.23.1, 2.0.0-alpha
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Blocker
>              Labels: 2.6.1-candidate
>             Fix For: 2.6.1, 2.7.1
>
>         Attachments: HDFS-8486-branch-2.6.02.patch, HDFS-8486-branch-2.6.addendum.patch, HDFS-8486-branch-2.6.patch, HDFS-8486.patch, HDFS-8486.patch
>
>
> A race condition between block pool initialization and the directory scanner may cause a mass deletion of blocks in multiple storages.
> If block pool initialization finds a block on disk that is already in the replica map, it deletes one of the blocks based on size, GS, etc. Unfortunately it _always_ deletes one of the blocks even if identical, thus the replica map _must_ be empty when the pool is initialized.
> The directory scanner starts at a random time within its periodic interval (default 6h). If the scanner starts very early it races to populate the replica map, causing the block pool init to erroneously delete blocks.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
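The duplicate-resolution flaw described above can be illustrated with a minimal sketch. This is a hypothetical stand-in, not the actual BlockPoolSlice code: the class, method, and field names below are invented for illustration. It shows why a resolver that always picks a victim deletes a replica even when the "duplicate" is the very same on-disk file, and how a same-path guard (the approach the patch's selectReplicaToDelete change is described as taking) avoids it:

```java
import java.util.HashMap;
import java.util.Map;

public class DuplicateResolutionSketch {
    // Hypothetical stand-in for a replica record (block id, generation stamp, on-disk path).
    static final class Replica {
        final long id, genStamp;
        final String path;
        Replica(long id, long genStamp, String path) {
            this.id = id; this.genStamp = genStamp; this.path = path;
        }
    }

    /** Buggy behavior: a duplicate map key always produces a victim to delete. */
    static Replica selectReplicaToDelete(Replica existing, Replica found) {
        // Keep the higher generation stamp; on a tie, the newly found
        // entry is deleted -- even if it is identical to the existing one.
        return (found.genStamp > existing.genStamp) ? existing : found;
    }

    /** Guarded behavior: two entries for the same file are not a real conflict. */
    static Replica selectReplicaToDeleteFixed(Replica existing, Replica found) {
        if (existing.path.equals(found.path)) {
            return null; // same on-disk file: nothing to delete
        }
        return selectReplicaToDelete(existing, found);
    }

    public static void main(String[] args) {
        Map<Long, Replica> replicaMap = new HashMap<>();
        // The directory scanner raced in first and populated the map...
        replicaMap.put(42L, new Replica(42L, 1001L, "/data1/blk_42"));
        // ...then block pool initialization finds the very same block on disk.
        Replica initFound = new Replica(42L, 1001L, "/data1/blk_42");

        Replica buggyVictim = selectReplicaToDelete(replicaMap.get(42L), initFound);
        Replica guardedVictim = selectReplicaToDeleteFixed(replicaMap.get(42L), initFound);

        System.out.println("buggy deletes: " + (buggyVictim != null));    // true
        System.out.println("guarded deletes: " + (guardedVictim != null)); // false
    }
}
```

Run across every block in a storage, the buggy variant deletes one copy of each, which is the mass deletion the description warns about; the guard makes an empty-at-init replica map a safety improvement rather than a hard correctness requirement.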
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963398#comment-14963398 ]

Dave Marion commented on HDFS-8486:
-----------------------------------

Does this also affect 2.5.0? If so, can someone provide a patch for it? The branch-2.6 patches don't apply cleanly and the code is different.
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963429#comment-14963429 ]

Arpit Agarwal commented on HDFS-8486:
-------------------------------------

2.5.0 is not affected.
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963431#comment-14963431 ]

Dave Marion commented on HDFS-8486:
-----------------------------------

Thanks for the quick response!
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707163#comment-14707163 ]

Chris Nauroth commented on HDFS-8486:
-------------------------------------

+1 for the addendum patch. Thank you, Arpit.
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707204#comment-14707204 ]

Arpit Agarwal commented on HDFS-8486:
-------------------------------------

Thanks Chris, pushed to branch-2.6.
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660889#comment-14660889 ]

Arpit Agarwal commented on HDFS-8486:
-------------------------------------

Merged for 2.6.1.
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658729#comment-14658729 ]

Xiaoyu Yao commented on HDFS-8486:
----------------------------------

Thanks Arpit. The branch-2.6 patch LGTM, +1.
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659030#comment-14659030 ]

Arpit Agarwal commented on HDFS-8486:
-------------------------------------

Thanks [~xyao], will hold off committing for a couple of days in case there are additional comments.
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633828#comment-14633828 ]

Daryn Sharp commented on HDFS-8486:
-----------------------------------

Public service notice:
* _Every restart of a 2.6.x or 2.7.0 DN incurs a risk of unwanted block deletion._
* Apply this patch if you are running a pre-2.7.1 release.

I previously attributed this to an ancient bug, but it is new to 2.6. HDFS-2560 did start the scanner too early, but that race only caused a benign log warning. In 2.6, HDFS-6931 made an unrelated change that introduced the faulty (mass) deletion logic.
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570632#comment-14570632 ]

Hudson commented on HDFS-8486:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #217 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/217/])
HDFS-8486. DN startup may cause severe data loss (Daryn Sharp via Colin P. McCabe) (cmccabe: rev 03fb5c642589dec4e663479771d0ae1782038b63)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570644#comment-14570644 ]

Hudson commented on HDFS-8486:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #947 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/947/])
HDFS-8486. DN startup may cause severe data loss (Daryn Sharp via Colin P. McCabe) (cmccabe: rev 03fb5c642589dec4e663479771d0ae1782038b63)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570909#comment-14570909 ]

Hudson commented on HDFS-8486:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #206 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/206/])
HDFS-8486. DN startup may cause severe data loss (Daryn Sharp via Colin P. McCabe) (cmccabe: rev 03fb5c642589dec4e663479771d0ae1782038b63)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570862#comment-14570862 ]

Hudson commented on HDFS-8486:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2145 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2145/])
HDFS-8486. DN startup may cause severe data loss (Daryn Sharp via Colin P. McCabe) (cmccabe: rev 03fb5c642589dec4e663479771d0ae1782038b63)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570823#comment-14570823 ]

Hudson commented on HDFS-8486:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #215 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/215/])
HDFS-8486. DN startup may cause severe data loss (Daryn Sharp via Colin P. McCabe) (cmccabe: rev 03fb5c642589dec4e663479771d0ae1782038b63)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570814#comment-14570814 ]

Hudson commented on HDFS-8486:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2163 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2163/])
HDFS-8486. DN startup may cause severe data loss (Daryn Sharp via Colin P. McCabe) (cmccabe: rev 03fb5c642589dec4e663479771d0ae1782038b63)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569547#comment-14569547 ]

Colin Patrick McCabe commented on HDFS-8486:
--------------------------------------------

Since the only change I was requesting was adding the {{\@VisibleForTesting}} annotation, and since this fix is so critical, I'm going to commit it now and file a follow-on to add the annotation.
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569562#comment-14569562 ]

Hudson commented on HDFS-8486:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #7944 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7944/])
HDFS-8486. DN startup may cause severe data loss (Daryn Sharp via Colin P. McCabe) (cmccabe: rev 03fb5c642589dec4e663479771d0ae1782038b63)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567998#comment-14567998 ]

Hadoop QA commented on HDFS-8486:
---------------------------------

(x) *{color:red}-1 overall{color}*

|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch | 15m 10s | Findbugs (version ) appears to be broken on trunk. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 28s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 30s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 50s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 3m 12s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | native | 3m 16s | Pre-build of native portion |
| {color:red}-1{color} | hdfs tests | 162m 13s | Tests failed in hadoop-hdfs. |
| | | | 204m 13s | |

|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.server.namenode.TestFileTruncate |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12736613/HDFS-8486.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 63e3fee |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/11190/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/11190/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/11190/console |

This message was automatically generated.
[jira] [Commented] (HDFS-8486) DN startup may cause severe data loss
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568046#comment-14568046 ] Colin Patrick McCabe commented on HDFS-8486:
---
Great find, [~daryn]. And nice work fixing it... as usual.

It sounds like this change:
{code}
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
@@ -1370,9 +1370,9 @@ void initBlockPool(BPOfferService bpos) throws IOException {
     // failures.
     checkDiskError();

-    initDirectoryScanner(conf);
     data.addBlockPool(nsInfo.getBlockPoolID(), conf);
     blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
+    initDirectoryScanner(conf);
{code}
should be sufficient to avoid the problem for the non-federation case, since the {{FsDatasetSpi#addBlockPool}} code path will do the initial scan even before the {{DirectoryScanner}} is created.

The change to {{selectReplicaToDelete}} should guard against the problem in the federation case, by never deleting a replica just because we already have a replica with the same path in the set. It's a nice robustness improvement.

bq. Note I found writing a unit test to be extremely difficult. The BlockPoolSlice ctor has numerous side-effects. I instead split out part of duplicate resolution into a static method (sigh, makes future mocking impossible).

Hmm... it seems like you could create a mock for {{BlockPoolSlice#resolveDuplicateReplicas}}, which is the only caller of the static method. For that reason, perhaps we should add {{@VisibleForTesting}} to {{selectReplicaToDelete}}?

+1 pending that change. Great work, Daryn.
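As an aside for readers following along, the duplicate-resolution rule described above ("deletes one of the blocks based on size, GS, etc.") can be sketched as follows. This is an illustrative toy, not the actual {{BlockPoolSlice#selectReplicaToDelete}} code; the class and field names are hypothetical. The key point is the {{null}} return for identical replicas, which the buggy behavior lacked:

```java
// Illustrative sketch of duplicate-replica resolution (hypothetical names,
// not the real HDFS implementation): prefer the replica with the higher
// generation stamp, then the larger on-disk length. Identical replicas
// should resolve to "delete nothing" -- the bug was always deleting one.
class ReplicaResolutionSketch {
  static final class Replica {
    final long genStamp;  // generation stamp (GS)
    final long numBytes;  // on-disk length
    Replica(long genStamp, long numBytes) {
      this.genStamp = genStamp;
      this.numBytes = numBytes;
    }
  }

  /** Returns the replica to delete, or null if the two are identical. */
  static Replica selectReplicaToDelete(Replica a, Replica b) {
    if (a.genStamp != b.genStamp) {
      return a.genStamp < b.genStamp ? a : b;  // keep the newer genstamp
    }
    if (a.numBytes != b.numBytes) {
      return a.numBytes < b.numBytes ? a : b;  // keep the larger replica
    }
    return null;  // identical: nothing should be deleted
  }

  public static void main(String[] args) {
    Replica older = new Replica(1, 100);
    Replica newer = new Replica(2, 100);
    System.out.println(selectReplicaToDelete(older, newer) == older);  // true
    System.out.println(selectReplicaToDelete(newer, newer) == null);   // true
  }
}
```

A static method like this is also trivially unit-testable, which is presumably why {{@VisibleForTesting}} was suggested.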
[jira] [Commented] (HDFS-8486) DN startup may cause severe data loss
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563827#comment-14563827 ] Hadoop QA commented on HDFS-8486:
---
(x) *{color:red}-1 overall{color}*

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 18m 11s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 41s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 50s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 25s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 2m 13s | There were no new checkstyle issues. |
| {color:red}-1{color} | whitespace | 0m 0s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 3m 20s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | native | 3m 18s | Pre-build of native portion |
| {color:green}+1{color} | hdfs tests | 161m 56s | Tests passed in hadoop-hdfs. |
| | | | 209m 11s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12735927/HDFS-8486.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 7ebe80e |
| whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/11153/artifact/patchprocess/whitespace.txt |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/11153/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/11153/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/11153/console |

This message was automatically generated.
[jira] [Commented] (HDFS-8486) DN startup may cause severe data loss
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563196#comment-14563196 ] Daryn Sharp commented on HDFS-8486:
---
What you'll notice is a spike in corrupt blocks that tapers down. What's going on is that the DN's block report included all the blocks it deleted. Over the next 6 hours, the slice scanner slowly detects missing blocks and reports them as corrupt. After 6 hours, the directory scanner detects and mass removes all the missing blocks. In that 6-hour window, the NN does not know the block is under-replicated and continues to send clients to the DN. Will file a separate bug for the DN not informing the NN when it's missing a block it thought it had.
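The 6-hour window comes from the scanner's randomized first run, per the issue description ("starts at a random time within its periodic interval, default 6h"). A minimal sketch of how a periodic scanner might pick a random initial offset inside its interval, with hypothetical names rather than the actual {{DirectoryScanner}} code:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: a periodic task that jitters its first run by a
// uniformly random offset within [0, interval), so a fleet of DNs does
// not scan in lockstep. Names are hypothetical, not the real HDFS code.
class ScannerStartSketch {
  /** Random first-run delay in [0, intervalMs). */
  static long randomInitialDelayMs(long intervalMs) {
    return ThreadLocalRandom.current().nextLong(intervalMs);
  }

  public static void main(String[] args) {
    long intervalMs = TimeUnit.HOURS.toMillis(6);  // default 6h interval
    long delayMs = randomInitialDelayMs(intervalMs);
    // Delay is always inside the interval; "very early" draws are what
    // let the scanner race block pool initialization.
    System.out.println(delayMs >= 0 && delayMs < intervalMs);  // true
  }
}
```

A draw near zero is exactly the "scanner starts very early" case that races block pool initialization.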
[jira] [Commented] (HDFS-8486) DN startup may cause severe data loss
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561250#comment-14561250 ] Daryn Sharp commented on HDFS-8486:
---
A subtle reordering of method invocation appears to be the source of the bug.
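To make the ordering hazard concrete: block pool initialization assumes the replica map is empty when it scans disk, so any entry already present looks like a duplicate to resolve (i.e. delete). A toy model of that assumption, with hypothetical names and a plain map standing in for the real replica map:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model (hypothetical names, not the real DataNode code) of why the
// replica map must be empty at block pool init: every block already in
// the map is misclassified as a duplicate needing resolution/deletion.
class InitRaceSketch {
  /** Counts on-disk blocks that init would treat as duplicates. */
  static int duplicatesSeen(Map<String, String> replicaMap, String[] onDisk) {
    int dups = 0;
    for (String blk : onDisk) {
      // putIfAbsent returns non-null iff the block was already mapped,
      // i.e. init would see a "duplicate" and delete one copy.
      if (replicaMap.putIfAbsent(blk, blk) != null) {
        dups++;
      }
    }
    return dups;
  }

  public static void main(String[] args) {
    String[] onDisk = {"blk_1", "blk_2"};

    // Correct order: map is empty when init runs -> no spurious duplicates.
    System.out.println(duplicatesSeen(new ConcurrentHashMap<>(), onDisk));  // 0

    // Racy order: an early directory scanner pre-populated the map ->
    // every on-disk block now looks like a duplicate.
    Map<String, String> prePopulated = new ConcurrentHashMap<>();
    prePopulated.put("blk_1", "blk_1");
    prePopulated.put("blk_2", "blk_2");
    System.out.println(duplicatesSeen(prePopulated, onDisk));  // 2
  }
}
```

Moving {{initDirectoryScanner}} after {{addBlockPool}}, as in the patch, corresponds to the "correct order" branch above.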