[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2015-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15026360#comment-15026360
 ] 

刘喆 commented on HDFS-7208:
--

The code in my comment looks ugly, so I have attached it as a patch file.

> NN doesn't schedule replication when a DN storage fails
> ---
>
> Key: HDFS-7208
> URL: https://issues.apache.org/jira/browse/HDFS-7208
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Ming Ma
>Assignee: Ming Ma
> Fix For: 2.6.0
>
> Attachments: HDFS-7208-2.patch, HDFS-7208-3.patch, 
> HDFS-7208-AdMaster.patch, HDFS-7208.patch
>
>
> We found the following problem. When a storage device on a DN fails, NN 
> continues to believe replicas of those blocks on that storage are valid and 
> doesn't schedule replication.
> A DN has 12 storage disks. So there is one blockReport for each storage. When 
> a disk fails, # of blockReport from that DN is reduced from 12 to 11. Given 
> dfs.datanode.failed.volumes.tolerated is configured to be > 0, NN still 
> considers that DN healthy.
> 1. A disk failed. All blocks of that disk are removed from DN dataset.
>  
> {noformat}
> 2014-10-04 02:11:12,626 WARN 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing 
> replica BP-1748500278-xx.xx.xx.xxx-1377803467793:1121568886 on failed volume 
> /data/disk6/dfs/current
> {noformat}
> 2. NN receives DatanodeProtocol.DISK_ERROR. But that isn't enough to have NN 
> remove the DN and the replicas from the BlocksMap. In addition, blockReport 
> doesn't provide the diff given that is done per storage.
> {noformat}
> 2014-10-04 02:11:12,681 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: 
> Disk error on DatanodeRegistration(xx.xx.xx.xxx, 
> datanodeUuid=f3b8a30b-e715-40d6-8348-3c766f9ba9ab, infoPort=50075, 
> ipcPort=50020, 
> storageInfo=lv=-55;cid=CID-e3c38355-fde5-4e3a-b7ce-edacebdfa7a1;nsid=420527250;c=1410283484939):
>  DataNode failed volumes:/data/disk6/dfs/current
> {noformat}
> 3. Run fsck on the file and confirm the NN's BlocksMap still has that replica.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2015-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15026339#comment-15026339
 ] 

刘喆 commented on HDFS-7208:
--

We met the same problem, but we have a very simple patch that works.  We can 
treat it as if the datanode had deleted the related blocks, so we only need one 
line to fix it.


{code}
diff --git a/hadoop/adh/src/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java b/hadoop/adh/src/hadoop-hdfs-proje
index 3320c65..7a10072 100644
--- a/hadoop/adh/src/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
+++ b/hadoop/adh/src/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
@@ -1332,6 +1332,8 @@ public void checkDataDir() throws DiskErrorException {
             + " on failed volume " + fv.getCurrentDir().getAbsolutePath());
         ib.remove();
         removedBlocks++;
+        datanode.notifyNamenodeDeletedBlock(new ExtendedBlock(bpid, b.getBlockId()), b.getStorageUuid());
       }
     }
   }
{code}



[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-11-04 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196752#comment-14196752
 ] 

Chris Nauroth commented on HDFS-7208:
-

The new test cannot work correctly on Windows.  See HDFS-7355 for a full 
explanation and a trivial patch to skip the test on Windows.



[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173639#comment-14173639
 ] 

Hudson commented on HDFS-7208:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #713 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/713/])
HDFS-7208. NN doesn't schedule replication when a DN storage fails.  
Contributed by Ming Ma (szetszwo: rev 41980c56d3c01d7a0ddc7deea2d89b7f28026722)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailure.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeStorageInfo.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/DatanodeStorage.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java




[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173759#comment-14173759
 ] 

Hudson commented on HDFS-7208:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1903 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1903/])
HDFS-7208. NN doesn't schedule replication when a DN storage fails.  
Contributed by Ming Ma (szetszwo: rev 41980c56d3c01d7a0ddc7deea2d89b7f28026722)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailure.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeStorageInfo.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/DatanodeStorage.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java




[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173769#comment-14173769
 ] 

Hudson commented on HDFS-7208:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1928 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1928/])
HDFS-7208. NN doesn't schedule replication when a DN storage fails.  
Contributed by Ming Ma (szetszwo: rev 41980c56d3c01d7a0ddc7deea2d89b7f28026722)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailure.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/DatanodeStorage.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeStorageInfo.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java




[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-15 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172940#comment-14172940
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7208:
---

{quote}
The latest patch addresses all your comments, except for the allAlive one. The reason is the patch handles deadnode separately from the failedStorage.
{quote}

We need to change allAlive.  Otherwise, the while loop won't work if there is 
only a failed storage.  Of course, we also need to update the if-condition for 
the dead datanode.  Here is my suggestion:
{code}
while (!allAlive) {
  ...
  allAlive = dead == null && failedStorage == null;
  if (dead != null) {
    ...
  }
  ...
}
{code}

We should also call namesystem.checkSafeMode() in removeBlocksAssociatedTo(..).
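A minimal, self-contained sketch of the suggested control flow, with queues standing in for the repeated scans that find at most one dead datanode or failed storage per pass (all class and method names here are illustrative, not the actual HeartbeatManager code):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class HeartbeatLoopSketch {
    // Simulates heartbeatCheck(): each pass finds at most one dead DN and
    // one failed storage, and the loop repeats until neither is found.
    static int process(Queue<String> deadNodes, Queue<String> failedStorages) {
        int handled = 0;
        boolean allAlive = false;
        while (!allAlive) {
            String dead = deadNodes.poll();               // one dead DN per pass
            String failedStorage = failedStorages.poll(); // one failed storage per pass
            // The key change: terminate only when BOTH kinds of failures are
            // exhausted, not just dead datanodes.
            allAlive = dead == null && failedStorage == null;
            if (dead != null) {
                handled++;   // e.g. remove the DN's replicas from the BlocksMap
            }
            if (failedStorage != null) {
                handled++;   // e.g. remove the storage's replicas from the BlocksMap
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        Queue<String> dead = new ArrayDeque<>();
        Queue<String> failed = new ArrayDeque<>();
        failed.add("storage-1");  // only failed storages, no dead DN
        failed.add("storage-2");
        int n = process(dead, failed);
        // With the old condition (checking dead alone), the loop would have
        // exited immediately and never processed the failed storages.
        System.out.println(n);
    }
}
```

With the old `allAlive = dead == null` condition, a pass that found only a failed storage would still end the loop, which is exactly the case this issue is about.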



[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173216#comment-14173216
 ] 

Hadoop QA commented on HDFS-7208:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12675133/HDFS-7208-3.patch
  against trunk revision 0af1a2b.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication
  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.server.mover.TestStorageMover

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8435//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8435//console

This message is automatically generated.



[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173312#comment-14173312
 ] 

Hudson commented on HDFS-7208:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6271 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6271/])
HDFS-7208. NN doesn't schedule replication when a DN storage fails.  
Contributed by Ming Ma (szetszwo: rev 41980c56d3c01d7a0ddc7deea2d89b7f28026722)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/DatanodeStorage.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailure.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeStorageInfo.java




[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-15 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173320#comment-14173320
 ] 

Ming Ma commented on HDFS-7208:
---

Thanks Daryn for the input and Nicholas for the review and the commit.



[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-14 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171928#comment-14171928
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7208:
---

Hi Ming, thanks for working on this.  The patch looks pretty good.  Some 
comments:

- For heartbeatedSinceRegistration == false, let's check failed storage anyway, 
i.e. there is no need to compare storageMap.size() with reports.length.
- The method removeBlocksOnDatanodeStorage(..) does not use anything in 
DatanodeManager.  We may move the code to 
BlockManager.removeBlocksAssociatedTo(..).
- In HeartbeatManager.heartbeatCheck(), allAlive should be changed to allAlive 
= dead == null && failedStorage == null.
- In DatanodeDescriptor.updateFailedStorage(..), check whether a storage has 
already failed.  Log and update the state only if it has not.
- HeartbeatManager.register(..) also calls 
DatanodeDescriptor.updateHeartbeat(..).  So setting 
heartbeatedSinceRegistration = true in updateHeartbeat(..) is wrong and needs 
to be fixed.
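The fourth point above (only log and transition a storage the first time it is reported failed) can be sketched as follows; the class, field, and method names are hypothetical stand-ins, not the actual DatanodeDescriptor code:

```java
import java.util.HashMap;
import java.util.Map;

public class FailedStorageSketch {
    enum State { NORMAL, FAILED }

    private final Map<String, State> storageMap = new HashMap<>();
    private int transitions = 0;   // counts actual state changes (~ log lines)

    void addStorage(String uuid) {
        storageMap.put(uuid, State.NORMAL);
    }

    // Returns true only on the first failure report for a known storage;
    // repeated reports for an already-failed storage are no-ops.
    boolean updateFailedStorage(String uuid) {
        State s = storageMap.get(uuid);
        if (s == null || s == State.FAILED) {
            return false;          // unknown or already failed: do nothing
        }
        storageMap.put(uuid, State.FAILED);
        transitions++;             // log + update state exactly once
        return true;
    }

    int transitions() { return transitions; }
}
```

Without this guard, every heartbeat carrying the same failed-volume report would re-log and re-process the failure.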




[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-14 Thread cho ju il (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171994#comment-14171994
 ] 

cho ju il commented on HDFS-7208:
-

In which version does this bug occur?
My cluster runs version 2.4.1.
Can I apply the patch without service downtime?



[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172029#comment-14172029
 ] 

Hadoop QA commented on HDFS-7208:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12674917/HDFS-7208-2.patch
  against trunk revision 0260231.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The test build failed in 
hadoop-hdfs-project/hadoop-hdfs 

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8429//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8429//console

This message is automatically generated.



[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170375#comment-14170375
 ] 

Hadoop QA commented on HDFS-7208:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12674633/HDFS-7208.patch
  against trunk revision 178bc50.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication
  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8413//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8413//console

This message is automatically generated.



[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-10 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167154#comment-14167154
 ] 

Ming Ma commented on HDFS-7208:
---

Thanks, Daryn. We can do #3, but I want to frame the approaches in the 
following way, in part to clarify the design of heterogeneous storage. 
[~arpitagarwal] and others might have more input here. Note that 
dfs.datanode.failed.volumes.tolerated > 0 in this discussion.

1. Have the DN eventually deliver a failed-storage notification. Prior to 
heterogeneous storage, the NN detected missing replicas on a failed storage via 
block reports (BRs). So if we use BRs to report failed storage, we are on par 
in terms of time-to-detect metrics. However, we have to make sure the DN 
eventually delivers the failed-storage notification in all cases. Hot swap is 
one scenario. Here is another: a) a storage fails; b) the DN restarts prior to 
the next BR; c) the DN cannot send a BR after the restart because it excluded 
the failed storage during startup. To address this, we can persist the storage 
IDs the DN still needs to report on, perhaps on the other healthy storages.

2. Have the DN deliver the failed-storage notification in a timely manner. The 
DN provides a StorageReport via heartbeat (HB). With this, the NN could detect 
a failed storage much faster, which would greatly improve time-to-detect 
metrics. But processing it in the HB requires taking the FSNamesystem write 
lock; we could make that async without the FSNS write lock. This can be done 
in a separate jira.

3. Time out on DN storage notifications. Similar to how the NN uses HBs to 
manage DNs, we can track a heartbeat for each storage, with some maximum 
timeout on notifications for any given storage. But if the design of 
heterogeneous storage is to allow a DN to use different BR intervals for 
different storages, the timeout for a given storage could have to be much 
larger.
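Approach 3 amounts to bookkeeping like the following. This is an illustrative sketch with made-up names (StorageTimeoutSketch, the timeout constant), not the real HeartbeatManager code:

```java
import java.util.HashMap;
import java.util.Map;

// Track the last report time per storage and declare a storage stale
// once no report has arrived within the timeout (approach 3 above).
public class StorageTimeoutSketch {
    // Assumed value for illustration; a real timeout would have to exceed
    // the largest per-storage block-report interval in use.
    static final long STORAGE_TIMEOUT_MS = 10 * 60 * 1000L;

    final Map<String, Long> lastReportMs = new HashMap<>();

    void reportReceived(String storageId, long nowMs) {
        lastReportMs.put(storageId, nowMs);
    }

    boolean isStorageStale(String storageId, long nowMs) {
        Long last = lastReportMs.get(storageId);
        return last != null && nowMs - last > STORAGE_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        StorageTimeoutSketch s = new StorageTimeoutSketch();
        s.reportReceived("DS-1", 0L);
        System.out.println(s.isStorageStale("DS-1", 60_000L));      // false: within timeout
        System.out.println(s.isStorageStale("DS-1", 11 * 60_000L)); // true: past timeout
    }
}
```

The caveat from the comment applies directly here: if heterogeneous storage allows different BR intervals per storage, STORAGE_TIMEOUT_MS must be sized against the largest interval, which weakens the detection-time guarantee.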



[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-08 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163169#comment-14163169
 ] 

Ming Ma commented on HDFS-7208:
---

We can work around it by setting dfs.datanode.failed.volumes.tolerated to zero, 
so that as soon as there is one disk failure, the NN will remove that DN. For 
the fix, there are several possible approaches:

1. Have the DN notify the NN via DatanodeProtocol.reportBadBlocks for these blocks.
2. Modify DatanodeProtocol.errorReport so that the DN can pass the storage ID to the NN.
3. Have the DN send a blockReport for the failed storage so that the NN can detect it.

Appreciate any suggestions.
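The workaround corresponds to the following hdfs-site.xml entry on the DataNodes (the key is the one named above; 0 is believed to be the default, so this only matters where the cluster has raised it):

```xml
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>0</value>
  <!-- Any single volume failure takes the whole DN offline,
       so the NN schedules re-replication for all of its blocks. -->
</property>
```

The cost of the workaround is losing the other 11 healthy disks on that DN until the failed one is replaced.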



[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-10-08 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163615#comment-14163615
 ] 

Daryn Sharp commented on HDFS-7208:
---

I think #3 will be the minimally invasive change.  It should just need to send 
an empty report for the failed storage.
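A toy model of that idea, using plain collections rather than the real DatanodeProtocol types: because the NN diffs each report against the replicas it has recorded per storage, an empty report for the failed storage drops every replica recorded there.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of per-storage block reports: the NN keeps a map of
// storageId -> block IDs, and each report replaces that storage's set.
public class EmptyReportSketch {
    final Map<String, Set<Long>> replicasByStorage = new HashMap<>();

    // Returns the block IDs that disappeared relative to the last report;
    // these become candidates for re-replication.
    Set<Long> processReport(String storageId, Set<Long> reported) {
        Set<Long> previous =
            replicasByStorage.getOrDefault(storageId, Collections.<Long>emptySet());
        Set<Long> removed = new HashSet<>(previous);
        removed.removeAll(reported);
        replicasByStorage.put(storageId, new HashSet<>(reported));
        return removed;
    }

    public static void main(String[] args) {
        EmptyReportSketch nn = new EmptyReportSketch();
        nn.processReport("DS-6", new HashSet<>(Arrays.asList(1L, 2L, 3L)));
        // Disk fails: DN sends an empty report for that storage.
        Set<Long> toReplicate = nn.processReport("DS-6", new HashSet<Long>());
        System.out.println(toReplicate.size()); // 3 blocks need re-replication
    }
}
```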
