[ https://issues.apache.org/jira/browse/HDFS-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-9631:
----------------------------------
    Description: 
I found that a number of {{TestOpenFilesWithSnapshot}} tests fail quite frequently.
{noformat}
FAILED:  org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testParentDirWithUCFileDeleteWithSnapShot

Error Message:
Timed out waiting for Mini HDFS Cluster to start

Stack Trace:
java.io.IOException: Timed out waiting for Mini HDFS Cluster to start
        at org.apache.hadoop.hdfs.MiniDFSCluster.waitClusterUp(MiniDFSCluster.java:1345)
        at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:2024)
        at org.apache.hadoop.hdfs.MiniDFSCluster.restartNameNode(MiniDFSCluster.java:1985)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testParentDirWithUCFileDeleteWithSnapShot(TestOpenFilesWithSnapshot.java:82)
{noformat}
These tests ({{testParentDirWithUCFileDeleteWithSnapShot}}, {{testOpenFilesWithRename}}, and {{testWithCheckpoint}}) are unable to reconnect to the namenode after it is restarted, which suggests the namenode never comes back up. It looks like the reconnection fails with an EOFException when the {{BPServiceActor}} sends a heartbeat.
{noformat}
2016-01-07 23:25:43,678 [main] WARN  hdfs.MiniDFSCluster (MiniDFSCluster.java:waitClusterUp(1338)) - Waiting for the Mini HDFS Cluster to start...
2016-01-07 23:25:44,679 [main] WARN  hdfs.MiniDFSCluster (MiniDFSCluster.java:waitClusterUp(1338)) - Waiting for the Mini HDFS Cluster to start...
2016-01-07 23:25:44,720 [DataNode: [[[DISK]file:/home/weichiu/hadoop2/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/, [DISK]file:/home/weichiu/hadoop2/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data2/]] heartbeating to localhost/127.0.0.1:60472] WARN  datanode.DataNode (BPServiceActor.java:offerService(752)) - IOException in offerService
java.io.EOFException: End of File Exception between local host is: "weichiu.vpc.cloudera.com/172.28.211.219"; destination host is: "localhost":60472; :; For more details see:  http://wiki.apache.org/hadoop/EOFException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:793)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:766)
        at org.apache.hadoop.ipc.Client.call(Client.java:1452)
        at org.apache.hadoop.ipc.Client.call(Client.java:1385)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at com.sun.proxy.$Proxy18.sendHeartbeat(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:154)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:557)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:660)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:851)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1110)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1005)
{noformat}
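
The EOFException on {{sendHeartbeat}} only says that the RPC connection to the namenode was dropped, so the heartbeat failure looks like a symptom of the namenode not coming back up rather than a datanode-side problem. One way to dig further (a sketch; the exact report file name depends on the Surefire configuration) is to grep the captured test output for the namenode's startup error:
{noformat}
grep -n -A 5 "Encountered exception\|Failed to start namenode" \
    hadoop-hdfs-project/hadoop-hdfs/target/surefire-reports/*TestOpenFilesWithSnapshot*.txt
{noformat}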

It appears that these three tests all call {{doWriteAndAbort()}}: files are created and then aborted, a snapshot is taken on the parent directory, and the parent directory is then deleted.

Interestingly, if the parent directory does not have a snapshot, the tests do not fail. Additionally, if the parent directory is not deleted, the tests do not fail.

The following test fails intermittently:
{code:java}
// Note: fs and cluster are the DistributedFileSystem and MiniDFSCluster
// fields set up by TestOpenFilesWithSnapshot.
@Test
public void testDeleteParentDirWithSnapShot() throws Exception {
    Path path = new Path("/test");
    fs.mkdirs(path);
    fs.allowSnapshot(path);

    // Write 2 MB ("hell" is 4 bytes; 1048576 bytes per outer iteration)
    // to a file that is never closed, so it stays under construction.
    Path file = new Path("/test/test/test2");
    FSDataOutputStream out = fs.create(file);
    for (int i = 0; i < 2; i++) {
      long count = 0;
      while (count < 1048576) {
        out.writeBytes("hell");
        count += 4;
      }
    }
    // Flush to the datanodes and persist the new length on the namenode.
    ((DFSOutputStream) out.getWrappedStream()).hsync(EnumSet
        .of(SyncFlag.UPDATE_LENGTH));

    // A second under-construction file in the same directory.
    Path file2 = new Path("/test/test/test3");
    FSDataOutputStream out2 = fs.create(file2);
    for (int i = 0; i < 2; i++) {
      long count = 0;
      while (count < 1048576) {
        out2.writeBytes("hell");
        count += 4;
      }
    }
    ((DFSOutputStream) out2.getWrappedStream()).hsync(EnumSet
        .of(SyncFlag.UPDATE_LENGTH));

    // Snapshot the snapshottable ancestor, then recursively delete the
    // parent directory holding the two open files.
    fs.createSnapshot(path, "s1");
    fs.delete(new Path("/test/test"), true);

    // The restart is where the test times out waiting for the cluster
    // to come back up.
    cluster.restartNameNode();
  }
{code}
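
For reference, the affected tests can be re-run from the module seen in the log paths above with Maven's Surefire test filter (assuming a normal dev checkout):
{noformat}
cd hadoop-hdfs-project/hadoop-hdfs
mvn test -Dtest=TestOpenFilesWithSnapshot
{noformat}
Since the failure is intermittent, it may take several runs to reproduce.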

I am not sure whether this is a test case issue or a genuine problem with snapshot handling.
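
If it helps with triage: since the hang happens inside {{restartNameNode()}}, the edit log operations recorded by the repro (the two files opened and hsync'd but never closed, the snapshot creation, and the recursive delete) can be inspected with the offline edits viewer. A sketch, with an illustrative edits file name (the real name depends on the transaction IDs in the test's name dir, typically under target/test/data/dfs/name1/current):
{noformat}
hdfs oev -i edits_inprogress_0000000000000000001 -o edits.xml -p XML
{noformat}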


> Restarting namenode after deleting a directory with snapshot will fail
> ----------------------------------------------------------------------
>
>                 Key: HDFS-9631
>                 URL: https://issues.apache.org/jira/browse/HDFS-9631
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>

