[ https://issues.apache.org/jira/browse/HDFS-14333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787159#comment-16787159 ]
Hadoop QA commented on HDFS-14333:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 26s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 30s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 56s{color} | {color:red} hadoop-hdfs-project_hadoop-hdfs generated 1 new + 476 unchanged - 1 fixed = 477 total (was 477) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 52s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 6 new + 441 unchanged - 0 fixed = 447 total (was 441) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 27s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}101m 56s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 31s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}160m 24s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | HDFS-14333 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12961595/HDFS-14333.003.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux b03a350958b9 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 1bc282e |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| javac | https://builds.apache.org/job/PreCommit-HDFS-Build/26427/artifact/out/diff-compile-javac-hadoop-hdfs-project_hadoop-hdfs.txt |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/26427/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/26427/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/26427/testReport/ |
| Max. process+thread count | 3157 (vs. ulimit of 10000) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/26427/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.

> Datanode fails to start if any disk has errors during Namenode registration
> ---------------------------------------------------------------------------
>
> Key: HDFS-14333
> URL: https://issues.apache.org/jira/browse/HDFS-14333
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 3.3.0
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: HADOOP-16119.poc.patch, HDFS-14333.001.patch,
> HDFS-14333.002.patch, HDFS-14333.003.patch
>
>
> This is closely related to HDFS-9908, where it was reported that a datanode
> would fail to start if an IO error occurred on a single disk when running du
> during Datanode registration. That Jira was closed due to HADOOP-12973, which
> refactored how du is called and prevents any exception being thrown. However,
> this problem can still occur if the volume has errors (e.g. permission or
> filesystem corruption) when the disk is scanned to load all the replicas. The
> method chain is:
> DataNode.initBlockPool -> FsDatasetImpl.addBlockPool ->
> FsVolumeList.getAllVolumesMap -> Throws exception which goes unhandled.
> The DN logs will contain a stack trace for the problem volume, so the
> workaround is to remove the volume from the DN config and the DN will start,
> but the logs are a little confusing, so it is not always obvious what the
> issue is.
> These are the cut-down logs from an occurrence of this issue.
> {code}
> 2019-03-01 08:58:24,830 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-240961797-x.x.x.x-1392827522027 on volume /data/18/dfs/dn/current...
> ...
> 2019-03-01 08:58:27,029 WARN org.apache.hadoop.fs.CachingGetSpaceUsed: Could not get disk usage information
> ExitCodeException exitCode=1: du: cannot read directory `/data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir149/subdir215': Permission denied
> du: cannot read directory `/data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir149/subdir213': Permission denied
> du: cannot read directory `/data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir97/subdir25': Permission denied
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
>     at org.apache.hadoop.util.Shell.run(Shell.java:504)
>     at org.apache.hadoop.fs.DU$DUShell.startRefresh(DU.java:61)
>     at org.apache.hadoop.fs.DU.refresh(DU.java:53)
>     at org.apache.hadoop.fs.CachingGetSpaceUsed.init(CachingGetSpaceUsed.java:84)
>     at org.apache.hadoop.fs.GetSpaceUsed$Builder.build(GetSpaceUsed.java:166)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.<init>(BlockPoolSlice.java:145)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:881)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2.run(FsVolumeList.java:412)
> ...
> 2019-03-01 08:58:27,043 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-240961797-x.x.x.x-1392827522027 on /data/18/dfs/dn/current: 2202ms
> {code}
> So we can see that a du error occurred and was logged but not re-thrown (due
> to HADOOP-12973), and the block pool scan completed. However, the 'add
> replicas to map' logic then hit another exception stemming from the same
> problem:
> {code}
> 2019-03-01 08:58:27,564 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Adding replicas to map for block pool BP-240961797-x.x.x.x-1392827522027 on volume /data/18/dfs/dn/current...
> ...
> 2019-03-01 08:58:31,155 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Caught exception while adding replicas from /data/18/dfs/dn/current. Will throw later.
> java.io.IOException: Invalid directory or I/O error occurred for dir: /data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir149/subdir215
>     at org.apache.hadoop.fs.FileUtil.listFiles(FileUtil.java:1167)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:445)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:448)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:448)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.getVolumeMap(BlockPoolSlice.java:342)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getVolumeMap(FsVolumeImpl.java:861)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$1.run(FsVolumeList.java:191)
> < The message 2019-03-01 08:59:00,989 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to add replicas to map for block pool BP-240961797-x.x.x.x-1392827522027 on volume xxx did not appear for this volume as it failed >
> {code}
> The exception is re-thrown, so the DN fails registration and then retries.
> On the retry it finds all volumes already locked and exits with an 'all
> volumes failed' error.
> I believe we should handle the failing volume like a runtime volume failure
> and only abort the DN if too many volumes have failed.
> I will post a patch for this.
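
The proposal in the description (treat a volume that fails while its replicas are loaded like a runtime volume failure, and abort the DN only if too many volumes have failed) can be sketched roughly as below. This is a minimal, self-contained illustration and not the code from any attached patch: `VolumeLoadSketch`, `VolumeScanner`, `LoadResult`, `loadVolumes` and the `maxFailuresTolerated` parameter are all hypothetical names, and the volume paths in `main` are made up for the example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VolumeLoadSketch {

    /** Hypothetical stand-in for the per-volume replica scan. */
    interface VolumeScanner {
        void scan(String volume) throws IOException;
    }

    /** Outcome of trying to load every configured volume. */
    static final class LoadResult {
        final List<String> healthy = new ArrayList<>();
        final Map<String, IOException> failed = new LinkedHashMap<>();
    }

    /**
     * Scans each volume, recording failures instead of rethrowing the first one.
     * Aborts (throws) only if more volumes failed than the caller tolerates.
     */
    static LoadResult loadVolumes(List<String> volumes, VolumeScanner scanner,
                                  int maxFailuresTolerated) throws IOException {
        LoadResult result = new LoadResult();
        for (String volume : volumes) {
            try {
                scanner.scan(volume);
                result.healthy.add(volume);
            } catch (IOException e) {
                // Keep going: one bad disk should not stop the others from loading.
                result.failed.put(volume, e);
            }
        }
        if (result.failed.size() > maxFailuresTolerated) {
            throw new IOException("Too many failed volumes: " + result.failed.size()
                    + " failed, " + maxFailuresTolerated + " tolerated: "
                    + result.failed.keySet());
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Illustrative paths only; the second one simulates the unreadable directory.
        List<String> volumes = Arrays.asList(
                "/data/17/dfs/dn", "/data/18/dfs/dn", "/data/19/dfs/dn");
        VolumeScanner scanner = volume -> {
            if (volume.startsWith("/data/18/")) {
                throw new IOException("Invalid directory or I/O error occurred for dir: " + volume);
            }
        };
        LoadResult result = loadVolumes(volumes, scanner, 1);
        System.out.println("Healthy volumes: " + result.healthy);
        System.out.println("Failed volumes:  " + result.failed.keySet());
    }
}
```

With a tolerance of 1 the simulated bad volume is dropped and startup continues on the remaining volumes; with a tolerance of 0 the same input throws, which corresponds to the current fail-fast behaviour. In a real DataNode the threshold would presumably come from the existing `dfs.datanode.failed.volumes.tolerated` setting, but that wiring is an assumption here.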