[ https://issues.apache.org/jira/browse/HDFS-14333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786404#comment-16786404 ]
Wei-Chiu Chuang commented on HDFS-14333: ---------------------------------------- Looks mostly good. {code:java} @Test(timeout=5000) {code} let's use a larger timeout so it doesn't time out artificially on a busier box (our precommit is notoriously overloaded). This test takes roughly 4-5 seconds on my laptop and I think it's not unreasonable to set 15000 ms, given the complexity of the test. Could you replace the following {code:java} try { // Wait until the datanode is fully started GenericTestUtils.waitFor(new Supplier<Boolean>() { @Override public Boolean get() { return dn.isDatanodeFullyStarted(); } }, 100, 2000); } catch (TimeoutException e) { fail("Datanode did not get fully started"); } {code} with {code:java} cluster.waitDataNodeFullyStarted(dn); {code} (make it a public method) I don't like the following test. It's too fragile. {code:java} try { // Then wait to ensure it stays up GenericTestUtils.waitFor(new Supplier<Boolean>() { @Override public Boolean get() { return !dn.isDatanodeUp(); } }, 500, 2000); } catch (TimeoutException e) { // Do nothing, this is the expected outcome } assertTrue(dn.isDatanodeUp()); {code} > Datanode fails to start if any disk has errors during Namenode registration > --------------------------------------------------------------------------- > > Key: HDFS-14333 > URL: https://issues.apache.org/jira/browse/HDFS-14333 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode > Affects Versions: 3.3.0 > Reporter: Stephen O'Donnell > Assignee: Stephen O'Donnell > Priority: Major > Fix For: 3.3.0 > > Attachments: HADOOP-16119.poc.patch, HDFS-14333.001.patch, > HDFS-14333.002.patch > > > This is closely related to HDFS-9908, where it was reported that a datanode > would fail to start if an IO error occurred on a single disk when running du > during Datanode registration. That Jira was closed due to HADOOP-12973 which > refactored how du is called and prevents any exception being thrown. However > this problem can still occur if the volume has errors (eg permission or > filesystem corruption) when the disk is scanned to load all the replicas. The > method chain is: > DataNode.initBlockPool -> FSDataSetImpl.addBlockPool -> > FSVolumeList.getAllVolumesMap -> Throws exception which goes unhandled. > The DN logs will contain a stack trace for the problem volume, so the > workaround is to remove the volume from the DN config and the DN will start, > but the logs are a little confusing, so its always not obvious what the issue > is. > These are the cut down logs from an occurrence of this issue. > {code} > 2019-03-01 08:58:24,830 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning > block pool BP-240961797-x.x.x.x-1392827522027 on volume > /data/18/dfs/dn/current... > ... > 2019-03-01 08:58:27,029 WARN org.apache.hadoop.fs.CachingGetSpaceUsed: Could > not get disk usage information > ExitCodeException exitCode=1: du: cannot read directory > `/data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir149/subdir215': > Permission denied > du: cannot read directory > `/data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir149/subdir213': > Permission denied > du: cannot read directory > `/data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir97/subdir25': > Permission denied > at org.apache.hadoop.util.Shell.runCommand(Shell.java:601) > at org.apache.hadoop.util.Shell.run(Shell.java:504) > at org.apache.hadoop.fs.DU$DUShell.startRefresh(DU.java:61) > at org.apache.hadoop.fs.DU.refresh(DU.java:53) > at > org.apache.hadoop.fs.CachingGetSpaceUsed.init(CachingGetSpaceUsed.java:84) > at > org.apache.hadoop.fs.GetSpaceUsed$Builder.build(GetSpaceUsed.java:166) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.<init>(BlockPoolSlice.java:145) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:881) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2.run(FsVolumeList.java:412) > ... > 2019-03-01 08:58:27,043 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time > taken to scan block pool BP-240961797-x.x.x.x-1392827522027 on > /data/18/dfs/dn/current: 2202ms > {code} > So we can see a du error occurred, was logged but not re-thrown (due to > HADOOP-12973) and the blockpool scan completed. However then in the 'add > replicas to map' logic, we got another exception stemming from the same > problem: > {code} > 2019-03-01 08:58:27,564 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Adding > replicas to map for block pool BP-240961797-x.x.x.x-1392827522027 on volume > /data/18/dfs/dn/current... > ... > 2019-03-01 08:58:31,155 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Caught > exception while adding replicas from /data/18/dfs/dn/current. Will throw > later. > java.io.IOException: Invalid directory or I/O error occurred for dir: > /data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir149/subdir215 > at org.apache.hadoop.fs.FileUtil.listFiles(FileUtil.java:1167) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:445) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:448) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:448) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.getVolumeMap(BlockPoolSlice.java:342) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getVolumeMap(FsVolumeImpl.java:861) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$1.run(FsVolumeList.java:191) > < The message 2019-03-01 08:59:00,989 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to > add replicas to map for block pool BP-240961797-x.x.x.x-1392827522027 on > volume xxx did not appear for this volume as it failed > > {code} > The exception is re-thrown, so the DN fails registration and then retries. > Then it finds all volumes already locked and exits with a 'all volumes > failed' error. > I believe we should handle the failing volume like a runtime volume failure > and only abort the DN if too many volumes have failed. > I will post a patch for this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org