[ https://issues.apache.org/jira/browse/HDFS-7722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315064#comment-14315064 ]

Colin Patrick McCabe edited comment on HDFS-7722 at 2/10/15 10:14 PM:
----------------------------------------------------------------------

FsDatasetSpi#checkDataDir is really checking volumes and returning the set of 
failed volumes.  But it returns a {{Set<File>}} of paths, and then we look up 
the {{FsVolume}} objects again based on those paths.  Why?  Why not simply pass 
back a list of {{FsVolumeRef}} objects that we could use directly?

Similarly, {{FsVolumeList#checkDirs}} should be returning a list of 
{{FsVolumeRef}}, not {{FsVolumeImpl}}.
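A rough sketch of the shape being suggested (these are simplified, hypothetical interfaces for illustration, not the real Hadoop ones): the caller gets back references it can use directly, rather than paths it has to re-resolve.

```java
import java.io.File;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified shapes illustrating the suggestion: hand failed
// volumes back as references the caller can use directly, instead of bare
// paths that must be looked up again.
interface FsVolumeRefSketch extends AutoCloseable {
  File getCurrentDir();
  @Override
  void close();  // releases the reference when the caller is done with it
}

interface CheckDirsShapes {
  // Current shape: callers get paths and must re-resolve the FsVolume objects.
  Set<File> checkDataDir();

  // Proposed shape: failed volumes come back as directly usable references.
  List<FsVolumeRefSketch> checkDirs();
}
```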

We should be rate-limiting {{FsVolumeList#checkDirs}} as well.  Please remember 
that this scans *all* files on a volume, which is an expensive operation.  If 
we hit a bunch of I/O errors, we could end up calling checkDirs over and over 
within the same minute or two.  We should rate-limit this so that we ignore 
calls to checkDirs on a volume that happen more than once every 10 minutes or 
something.
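The rate limit could be as simple as remembering the last check time per volume and skipping calls that arrive inside the window. A minimal, self-contained sketch (class and method names here are hypothetical, not Hadoop APIs):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the proposed rate limit: ignore a checkDirs call on a
// volume if that volume was already checked within the configured interval
// (e.g. 10 minutes), since a full scan of the volume is expensive.
class CheckDirsRateLimiter {
  private final long minIntervalMs;
  private final Map<String, Long> lastCheckMs = new ConcurrentHashMap<>();

  CheckDirsRateLimiter(long minIntervalMs) {
    this.minIntervalMs = minIntervalMs;
  }

  /** Returns true if the volume should be checked now, recording the check. */
  boolean shouldCheck(String volumePath, long nowMs) {
    Long last = lastCheckMs.get(volumePath);
    if (last != null && nowMs - last < minIntervalMs) {
      return false;  // checked recently; skip the expensive full scan
    }
    lastCheckMs.put(volumePath, nowMs);
    return true;
  }
}
```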

{code}
+        // Disable the volume from the service
+        asyncDiskService.removeVolume(fv.getCurrentDir());
{code}

Why are we duplicating all the logic of {{FsDatasetImpl#removeVolumes}} here?  
If we want {{checkDirs}} to remove a volume, it should just call that function. 
 That would also avoid the need to have {{DataNode#checkDiskError}} do anything 
but simply call {{FsDatasetSpi#checkDirs}}.
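In other words, keep one method that knows how to detach a volume, and have the check path call it. A toy sketch of that structure (the class and the string-based "disk check" are stand-ins for illustration, not the actual Hadoop code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch of the suggested structure: checkDirs only detects
// failed volumes and delegates removal to a single removeVolumes method,
// rather than duplicating its cleanup steps inline.
class VolumeManagerSketch {
  private final List<String> activeVolumes = new ArrayList<>();
  final List<String> removed = new ArrayList<>();

  VolumeManagerSketch(Collection<String> volumes) {
    activeVolumes.addAll(volumes);
  }

  // The one place that knows how to detach a volume (async disk service,
  // block maps, storage records, and so on would all be handled here).
  void removeVolumes(Collection<String> failed) {
    activeVolumes.removeAll(failed);
    removed.addAll(failed);
  }

  // checkDirs finds failures, then simply calls removeVolumes.
  List<String> checkDirs() {
    List<String> failed = new ArrayList<>();
    for (String v : activeVolumes) {
      if (v.contains("bad")) {  // stand-in for the real disk health check
        failed.add(v);
      }
    }
    removeVolumes(failed);
    return failed;
  }
}
```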



> DataNode#checkDiskError should also remove Storage when error is found.
> -----------------------------------------------------------------------
>
>                 Key: HDFS-7722
>                 URL: https://issues.apache.org/jira/browse/HDFS-7722
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>            Reporter: Lei (Eddy) Xu
>            Assignee: Lei (Eddy) Xu
>         Attachments: HDFS-7722.000.patch
>
>
> When {{DataNode#checkDiskError}} finds disk errors, it removes all block 
> metadata from {{FsDatasetImpl}}.  However, it does not remove the 
> corresponding {{DataStorage}} and {{BlockPoolSliceStorage}}. 
> The result is that we cannot directly run {{reconfig}} to hot-swap the 
> failed disks without changing the configuration file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
