[ https://issues.apache.org/jira/browse/HDFS-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092245#comment-15092245 ]
Wei-Chiu Chuang commented on HDFS-9631:
---------------------------------------

Thanks [~kihwal] for the analysis. Yes, it does look like the NameNode got stuck in safe mode. I'll add more logging to figure out what went wrong.

> Restarting namenode after deleting a directory with snapshot will fail
> ----------------------------------------------------------------------
>
>                 Key: HDFS-9631
>                 URL: https://issues.apache.org/jira/browse/HDFS-9631
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>
> I found that a number of {{TestOpenFilesWithSnapshot}} tests fail quite frequently.
> {noformat}
> FAILED:  org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testParentDirWithUCFileDeleteWithSnapShot
>
> Error Message:
> Timed out waiting for Mini HDFS Cluster to start
>
> Stack Trace:
> java.io.IOException: Timed out waiting for Mini HDFS Cluster to start
>     at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1345)
>     at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2024)
>     at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1985)
>     at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testParentDirWithUCFileDeleteWithSnapShot(TestOpenFilesWithSnapshot.java:82)
> {noformat}
> These tests ({{testParentDirWithUCFileDeleteWithSnapshot}}, {{testOpenFilesWithRename}} and {{testWithCheckpoint}}) are unable to reconnect to the namenode after a restart. It looks like the reconnection fails due to an EOFException when BPServiceActor sends a heartbeat.
> {noformat}
> 2016-01-07 23:25:43,678 [main] WARN  hdfs.MiniDFSCluster (MiniDFSCluster.java:waitClusterUp(1338)) - Waiting for the Mini HDFS Cluster to start...
> 2016-01-07 23:25:44,679 [main] WARN  hdfs.MiniDFSCluster (MiniDFSCluster.java:waitClusterUp(1338)) - Waiting for the Mini HDFS Cluster to start...
> 2016-01-07 23:25:44,720 [DataNode: [[[DISK]file:/home/weichiu/hadoop2/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/, [DISK]file:/home/weichiu/hadoop2/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data2/]] heartbeating to localhost/127.0.0.1:60472] WARN  datanode.DataNode (BPServiceActor.java:offerService(752)) - IOException in offerService
> java.io.EOFException: End of File Exception between local host is: "weichiu.vpc.cloudera.com/172.28.211.219"; destination host is: "localhost":60472; For more details see: http://wiki.apache.org/hadoop/EOFException
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:793)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:766)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1452)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1385)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>     at com.sun.proxy.$Proxy18.sendHeartbeat(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:154)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:557)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:660)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:851)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1110)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1005)
> {noformat}
> All three of these tests call {{doWriteAndAbort()}}, which creates files and aborts them, takes a snapshot of the parent directory, and then deletes that parent directory.
> Interestingly, if the parent directory does not have a snapshot, the tests do not fail. Likewise, if the parent directory is not deleted, the tests do not fail.
> The following test fails intermittently:
> {code:java}
> public void testDeleteParentDirWithSnapShot() throws Exception {
>   Path path = new Path("/test");
>   fs.mkdirs(path);
>   fs.allowSnapshot(path);
>
>   // Write 2 MiB to the first file ("hell" is 4 bytes; each pass of the
>   // inner loop writes 1 MiB, and the outer loop runs twice).
>   Path file = new Path("/test/test/test2");
>   FSDataOutputStream out = fs.create(file);
>   for (int i = 0; i < 2; i++) {
>     long count = 0;
>     while (count < 1048576) {
>       out.writeBytes("hell");
>       count += 4;
>     }
>   }
>   ((DFSOutputStream) out.getWrappedStream()).hsync(
>       EnumSet.of(SyncFlag.UPDATE_LENGTH));
>
>   // Write 2 MiB to a second file the same way.
>   Path file2 = new Path("/test/test/test3");
>   FSDataOutputStream out2 = fs.create(file2);
>   for (int i = 0; i < 2; i++) {
>     long count = 0;
>     while (count < 1048576) {
>       out2.writeBytes("hell");
>       count += 4;
>     }
>   }
>   ((DFSOutputStream) out2.getWrappedStream()).hsync(
>       EnumSet.of(SyncFlag.UPDATE_LENGTH));
>
>   fs.createSnapshot(path, "s1");
>   // Delete the parent directory; note that neither stream has been closed.
>   fs.delete(new Path("/test/test"), true);
>   cluster.restartNameNode();
> }
> {code}
> I am not sure whether this is a test-case issue or something to do with snapshots.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
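A side note on the write loops in the repro above: each {{writeBytes("hell")}} call appends 4 bytes, so one pass of the inner loop writes exactly 1 MiB, and the two outer iterations write 2 MiB per file before the stream is hsync'ed with UPDATE_LENGTH. A quick sanity check of that arithmetic, in plain Python and independent of Hadoop:

```python
# Mirror the repro's inner/outer write loops and count bytes,
# instead of actually writing to HDFS.
CHUNK = len("hell")  # writeBytes("hell") appends 4 bytes per call

total_per_file = 0
for _ in range(2):             # outer loop runs twice
    count = 0
    while count < 1048576:     # inner loop stops at 1 MiB
        count += CHUNK
    total_per_file += count

print(CHUNK, total_per_file)   # 4 2097152
```

Since 1048576 is an exact multiple of 4, each inner loop lands on exactly 1,048,576 bytes, for 2,097,152 bytes (2 MiB) per file.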