[ https://issues.apache.org/jira/browse/HDFS-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979430#comment-16979430 ]

Stephen O'Donnell commented on HDFS-14993:
------------------------------------------

If I chown and chmod a directory so it is not writeable, then I get this stack 
on startup:

{code}
2019-11-21 15:52:44,383 INFO datanode.DataNode: registered UNIX signal handlers for [TERM, HUP, INT]
2019-11-21 15:52:44,529 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-11-21 15:52:44,647 INFO checker.ThrottledAsyncChecker: Scheduling a check for [DISK]file:/tmp/hadoop-sodonnell/dfs/data
2019-11-21 15:52:44,654 INFO checker.ThrottledAsyncChecker: Scheduling a check for [DISK]file:/tmp/hadoop-sodonnell/dfs/data2
2019-11-21 15:52:44,680 WARN checker.StorageLocationChecker: Exception checking StorageLocation [DISK]file:/tmp/hadoop-sodonnell/dfs/data
ExitCodeException exitCode=1: chmod: Unable to change file mode on /private/tmp/hadoop-sodonnell/dfs/data: Operation not permitted
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)
        at org.apache.hadoop.util.Shell.run(Shell.java:901)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:867)
        at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:550)
        at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:531)
        at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:553)
        at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:234)
        at org.apache.hadoop.util.DiskChecker.checkDirInternal(DiskChecker.java:141)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:116)
        at org.apache.hadoop.hdfs.server.datanode.StorageLocation.check(StorageLocation.java:239)
        at org.apache.hadoop.hdfs.server.datanode.StorageLocation.check(StorageLocation.java:52)
        at org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker$1.call(ThrottledAsyncChecker.java:142)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2019-11-21 15:52:44,682 ERROR datanode.DataNode: Exception in secureMain
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 1, volumes configured: 2, volumes failed: 1, volume failures tolerated: 0
        at org.apache.hadoop.hdfs.server.datanode.checker.StorageLocationChecker.check(StorageLocationChecker.java:233)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2836)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2749)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2793)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2937)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2961)
2019-11-21 15:52:44,683 INFO util.ExitUtil: Exiting with status 1: org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 1, volumes configured: 2, volumes failed: 1, volume failures tolerated: 0
{code}

So it looks like a disk check is already scheduled by 
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2836), 
which finds its way into DiskChecker:

{code}
  private static void checkDirInternal(LocalFileSystem localFS, Path dir,
                                       FsPermission expected)
  throws DiskErrorException, IOException {
    mkdirsWithExistsAndPermissionCheck(localFS, dir, expected);
    checkAccessByFileMethods(localFS.pathToFile(dir));
  }
{code}
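
As the stack above shows, the first step, mkdirsWithExistsAndPermissionCheck, ends up 
calling setPermission on the existing directory, which shells out to chmod and fails 
with "Operation not permitted" because the directory is owned by another user. A 
minimal standalone sketch of that failure mode (illustration only, not the Hadoop 
code; the path is hypothetical):

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermissions;

public class ChmodProbe {
  public static void main(String[] args) throws IOException {
    // Hypothetical path; stands in for one of the dfs.datanode.data.dir volumes.
    Path dir = Paths.get(args.length > 0 ? args[0] : "/tmp/hadoop-sodonnell/dfs/data");
    // Changing the mode of a directory owned by another user fails with
    // EPERM ("Operation not permitted"), the same error chmod reports in
    // the DataNode startup log above.
    Files.setPosixFilePermissions(dir, PosixFilePermissions.fromString("rwx------"));
    System.out.println("chmod succeeded on " + dir);
  }
}
{code}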

The call to checkAccessByFileMethods then runs a read and write test on the directory.
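
For illustration, a probe in that spirit could look roughly like the following; this 
is a hedged sketch with made-up names (DirProbe, probe), not the actual DiskChecker 
implementation:

{code}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class DirProbe {
  // Sketch of a read/write probe on a storage directory.
  static void probe(File dir) throws IOException {
    if (!dir.isDirectory() || !dir.canRead() || !dir.canWrite()) {
      throw new IOException("Directory failed access checks: " + dir);
    }
    // Write a small temporary file and read it back to confirm real I/O works.
    File probe = File.createTempFile("disk-probe", ".tmp", dir);
    try {
      Files.write(probe.toPath(), new byte[] {1, 2, 3});
      if (Files.readAllBytes(probe.toPath()).length != 3) {
        throw new IOException("Short read from probe file in " + dir);
      }
    } finally {
      probe.delete();
    }
  }
}
{code}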

Rather than moving the call to checkDiskError() after the addBlockPool() call, 
I wonder whether we need it there at all.
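
For context, the no-op the description quoted below points at is just iteration over 
an empty map: before addBlockPool() populates bpSlices, the loop body never runs and 
the volume is reported healthy without any disk I/O. A trivial standalone illustration 
(hypothetical names, not the FsVolumeImpl code):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class EmptyCheckDemo {
  // Stand-in for FsVolumeImpl#bpSlices before addBlockPool() has populated it.
  static final Map<String, Runnable> bpSlices = new ConcurrentHashMap<>();

  static String check() {
    // With no block pool slices registered yet, the loop body never executes,
    // so the "check" reports healthy without touching the disk.
    for (Runnable sliceCheck : bpSlices.values()) {
      sliceCheck.run();
    }
    return "HEALTHY";
  }

  public static void main(String[] args) {
    System.out.println("Result before addBlockPool(): " + check());
  }
}
{code}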

> checkDiskError doesn't work during datanode startup
> ---------------------------------------------------
>
>                 Key: HDFS-14993
>                 URL: https://issues.apache.org/jira/browse/HDFS-14993
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Yang Yun
>            Assignee: Yang Yun
>            Priority: Major
>         Attachments: HDFS-14993.patch, HDFS-14993.patch
>
>
> The function checkDiskError() is called before addBlockPool(), but the list
> bpSlices is empty at that point, so the function check() in FsVolumeImpl.java
> does nothing:
> {code}
> @Override
> public VolumeCheckResult check(VolumeCheckContext ignored)
>     throws DiskErrorException {
>   // TODO:FEDERATION valid synchronization
>   for (BlockPoolSlice s : bpSlices.values()) {
>     s.checkDirs();
>   }
>   return VolumeCheckResult.HEALTHY;
> }
> {code}


