[ https://issues.apache.org/jira/browse/HDFS-14333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786299#comment-16786299 ]
Hadoop QA commented on HDFS-14333:
----------------------------------

-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 14s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 19m 5s | trunk passed |
| +1 | compile | 0m 58s | trunk passed |
| +1 | checkstyle | 0m 56s | trunk passed |
| +1 | mvnsite | 1m 5s | trunk passed |
| +1 | shadedclient | 13m 15s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 5s | trunk passed |
| +1 | javadoc | 0m 49s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 59s | the patch passed |
| +1 | compile | 0m 55s | the patch passed |
| +1 | javac | 0m 55s | the patch passed |
| -0 | checkstyle | 0m 52s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 210 unchanged - 0 fixed = 212 total (was 210) |
| +1 | mvnsite | 1m 2s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 75 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply |
| -1 | whitespace | 0m 14s | The patch has 19849 line(s) with tabs. |
| +1 | shadedclient | 77m 34s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 26s | the patch passed |
| +1 | javadoc | 0m 47s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 83m 17s | hadoop-hdfs in the patch failed. |
| -1 | asflicense | 0m 32s | The patch generated 1 ASF License warnings. |
| | | 206m 59s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDFSInotifyEventInputStreamKerberized |
| | hadoop.hdfs.server.namenode.ha.TestConsistentReadsObserver |
| | hadoop.hdfs.web.TestWebHdfsTimeouts |
| | hadoop.hdfs.server.datanode.TestBPOfferService |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | HDFS-14333 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12961460/HDFS-14333.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 8500782c4bdc 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 45f976f |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/26417/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/26417/artifact/out/whitespace-eol.txt |
| whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/26417/artifact/out/whitespace-tabs.txt |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/26417/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/26417/testReport/ |
| asflicense | https://builds.apache.org/job/PreCommit-HDFS-Build/26417/artifact/out/patch-asflicense-problems.txt |
| Max. process+thread count | 3501 (vs. ulimit of 10000) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/26417/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.


> Datanode fails to start if any disk has errors during Namenode registration
> ----------------------------------------------------------------------------
>
>                 Key: HDFS-14333
>                 URL: https://issues.apache.org/jira/browse/HDFS-14333
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: HDFS-14333.001.patch, HDFS-14333.002.patch
>
>
> This is closely related to HDFS-9908, where it was reported that a datanode
> would fail to start if an IO error occurred on a single disk when running du
> during Datanode registration. That Jira was closed due to HADOOP-12973, which
> refactored how du is called and prevents any exception being thrown. However,
> this problem can still occur if the volume has errors (e.g. permission
> problems or filesystem corruption) when the disk is scanned to load all the
> replicas. The method chain is:
> DataNode.initBlockPool -> FsDatasetImpl.addBlockPool ->
> FsVolumeList.getAllVolumesMap -> throws an exception that goes unhandled.
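> To make the chain concrete, the sketch below shows the shape of the failure
> path (illustrative pseudo-Java with simplified, hypothetical types; the names
> echo the stack traces below, but this is not the actual Hadoop source):
> {code}
> import java.io.IOException;
> import java.util.List;
>
> // Illustrative sketch only: why one bad volume aborts the whole startup.
> class VolumeScanSketch {
>   interface Volume {
>     // Walks current/finalized/... on disk; an unreadable subdir throws.
>     void getVolumeMap(String bpid) throws IOException;
>   }
>
>   private final List<Volume> volumes;
>
>   VolumeScanSketch(List<Volume> volumes) {
>     this.volumes = volumes;
>   }
>
>   // Stands in for DataNode.initBlockPool -> FsDatasetImpl.addBlockPool
>   // -> FsVolumeList.getAllVolumesMap.
>   void initBlockPool(String bpid) throws IOException {
>     for (Volume v : volumes) {
>       // An IOException from any single volume propagates all the way up,
>       // so the DN fails registration even if every other volume is healthy.
>       v.getVolumeMap(bpid);
>     }
>   }
> }
> {code}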
> The DN logs will contain a stack trace for the problem volume, so the
> workaround is to remove the volume from the DN config and the DN will start,
> but the logs are a little confusing, so it's not always obvious what the
> issue is.
> These are cut-down logs from an occurrence of this issue:
> {code}
> 2019-03-01 08:58:24,830 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-240961797-x.x.x.x-1392827522027 on volume /data/18/dfs/dn/current...
> ...
> 2019-03-01 08:58:27,029 WARN org.apache.hadoop.fs.CachingGetSpaceUsed: Could not get disk usage information
> ExitCodeException exitCode=1: du: cannot read directory `/data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir149/subdir215': Permission denied
> du: cannot read directory `/data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir149/subdir213': Permission denied
> du: cannot read directory `/data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir97/subdir25': Permission denied
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
>         at org.apache.hadoop.util.Shell.run(Shell.java:504)
>         at org.apache.hadoop.fs.DU$DUShell.startRefresh(DU.java:61)
>         at org.apache.hadoop.fs.DU.refresh(DU.java:53)
>         at org.apache.hadoop.fs.CachingGetSpaceUsed.init(CachingGetSpaceUsed.java:84)
>         at org.apache.hadoop.fs.GetSpaceUsed$Builder.build(GetSpaceUsed.java:166)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.<init>(BlockPoolSlice.java:145)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:881)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2.run(FsVolumeList.java:412)
> ...
> 2019-03-01 08:58:27,043 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-240961797-x.x.x.x-1392827522027 on /data/18/dfs/dn/current: 2202ms
> {code}
> So we can see a du error occurred and was logged but not re-thrown (due to
> HADOOP-12973), and the block pool scan completed. However, in the 'add
> replicas to map' logic that follows, we got another exception stemming from
> the same problem:
> {code}
> 2019-03-01 08:58:27,564 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Adding replicas to map for block pool BP-240961797-x.x.x.x-1392827522027 on volume /data/18/dfs/dn/current...
> ...
> 2019-03-01 08:58:31,155 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Caught exception while adding replicas from /data/18/dfs/dn/current. Will throw later.
> java.io.IOException: Invalid directory or I/O error occurred for dir: /data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir149/subdir215
>         at org.apache.hadoop.fs.FileUtil.listFiles(FileUtil.java:1167)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:445)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:448)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:448)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.getVolumeMap(BlockPoolSlice.java:342)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getVolumeMap(FsVolumeImpl.java:861)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$1.run(FsVolumeList.java:191)
> < The message "2019-03-01 08:59:00,989 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to add replicas to map for block pool BP-240961797-x.x.x.x-1392827522027 on volume xxx" did not appear for this volume as it failed >
> {code}
> The exception is re-thrown, so the DN fails registration and then retries.
> It then finds all volumes already locked and exits with an 'all volumes
> failed' error.
> I believe we should handle the failing volume like a runtime volume failure
> and only abort the DN if too many volumes have failed.
> I will post a patch for this.
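> For illustration, the handling I have in mind is roughly the following
> (a hypothetical sketch with made-up names; the actual patch will work with
> the real FsVolumeList/DataNode classes):
> {code}
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> // Hypothetical sketch of the proposed behaviour -- not the actual patch.
> class TolerantVolumeScanSketch {
>   interface Volume {
>     void getVolumeMap(String bpid) throws IOException;
>   }
>
>   void addBlockPool(String bpid, List<Volume> volumes,
>                     int failedVolumesTolerated) throws IOException {
>     List<Volume> failed = new ArrayList<>();
>     for (Volume v : volumes) {
>       try {
>         v.getVolumeMap(bpid);
>       } catch (IOException e) {
>         // Record the failure instead of letting it abort startup.
>         failed.add(v);
>       }
>     }
>     // Abort only if more volumes failed than we are configured to
>     // tolerate (cf. dfs.datanode.failed.volumes.tolerated); otherwise
>     // remove the failed volumes from service and carry on, as is already
>     // done for volume failures at runtime.
>     if (failed.size() > failedVolumesTolerated) {
>       throw new IOException("Too many failed volumes: " + failed.size());
>     }
>   }
> }
> {code}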