[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15448770#comment-15448770 ]
Yiqun Lin edited comment on HDFS-6532 at 8/30/16 11:24 AM:
-----------------------------------------------------------

I looked into this issue again and I might have found the root cause. As [~kihwal] mentioned, the failing case does not print the following info:
{code}
(TestCrcCorruption.java:testCorruptionDuringWrt(140)) - Got expected exception
java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
{code}
Instead, the failing case prints info like this:
{code}
(TestCrcCorruption.java:testCorruptionDuringWrt(140)) - Got expected exception
java.io.InterruptedIOException: Interrupted while waiting for data to be acknowledged by pipeline
	at org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:775)
	at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:697)
	at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:778)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:755)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
{code}
That means the program sometimes returns before performing the pipeline recovery operations. The related code:
{code:title=DataStreamer.java|borderStyle=solid}
private boolean processDatanodeOrExternalError() throws IOException {
  if (!errorState.hasDatanodeError() && !shouldHandleExternalError()) {
    return false;
  }
  LOG.debug("start process datanode/external error, {}", this);

  // If the responder has not been closed, this method will just return
  if (response != null) {
    LOG.info("Error Recovery for " + block +
        " waiting for responder to exit. ");
    return true;
  }
  closeStream();
  ...
{code}
I looked into the code and I think there is a bug causing this. The related code:
{code:title=DataStreamer.java|borderStyle=solid}
public void run() {
  long lastPacket = Time.monotonicNow();
  TraceScope scope = null;
  while (!streamerClosed && dfsClient.clientRunning) {
    // if the Responder encountered an error, shutdown Responder
    if (errorState.hasError() && response != null) {
      try {
        response.close();
        response.join();
        response = null;
      } catch (InterruptedException e) {
        // If an InterruptedException happens, response will not be set to null
        LOG.warn("Caught exception", e);
      }
    }
    // Need to add a finally block here to set response to null
    ...
{code}
I think we should move the line {{response = null;}} into a {{finally}} block (a quick sketch is at the end of this comment). Finally, I have attached a patch for this. This test has failed intermittently for a long time; I hope my patch makes sense. Gently pinging [~xiaochen], [~kihwal] and [~yzhangal] for comments. Thanks.
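For clarity, here is a minimal sketch of what I am proposing (the attached patch is the actual change; this fragment only illustrates moving the assignment into a {{finally}} block):
{code:title=DataStreamer.java (proposed fix, sketch)|borderStyle=solid}
// if the Responder encountered an error, shutdown Responder
if (errorState.hasError() && response != null) {
  try {
    response.close();
    response.join();
  } catch (InterruptedException e) {
    LOG.warn("Caught exception", e);
  } finally {
    // Reset response even when close()/join() is interrupted, so that
    // processDatanodeOrExternalError() no longer keeps returning early
    // on a dead responder and can proceed with pipeline recovery.
    response = null;
  }
}
{code}
With this change, an interrupt during responder shutdown can no longer leave a stale {{response}} reference behind, so the next call to {{processDatanodeOrExternalError()}} reaches {{closeStream()}} and the recovery path instead of returning early.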
> Intermittent test failure
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6532
>                 URL: https://issues.apache.org/jira/browse/HDFS-6532
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.4.0
>            Reporter: Yongjun Zhang
>            Assignee: Yiqun Lin
>         Attachments: HDFS-6532.001.patch, TEST-org.apache.hadoop.hdfs.TestCrcCorruption.xml
>
>
> Per https://builds.apache.org/job/Hadoop-Hdfs-trunk/1774/testReport, we had the following failure. Local rerun is successful.
> {code}
> Regression
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> Failing for the past 1 build (Since Failed#1774 )
> Took 50 sec.
> Error Message
>
> test timed out after 50000 milliseconds
>
> Stacktrace
>
> java.lang.Exception: test timed out after 50000 milliseconds
> 	at java.lang.Object.wait(Native Method)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2024)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2008)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2107)
> 	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
> 	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:98)
> 	at org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:133)
> {code}
> See relevant exceptions in the log:
> {code}
> 2014-06-14 11:56:15,283 WARN datanode.DataNode (BlockReceiver.java:verifyChunks(404)) - Checksum error in block BP-1675558312-67.195.138.30-1402746971712:blk_1073741825_1001 from /127.0.0.1:41708
> org.apache.hadoop.fs.ChecksumException: Checksum error: DFSClient_NONMAPREDUCE_-1139495951_8 at 64512 exp: 1379611785 got: -12163112
> 	at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:353)
> 	at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:284)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:402)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:537)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
> 	at java.lang.Thread.run(Thread.java:662)
> 2014-06-14 11:56:15,285 WARN datanode.DataNode (BlockReceiver.java:run(1207)) - IOException in BlockReceiver.run():
> java.io.IOException: Shutting down writer and responder due to a checksum error in received data. The error response has been sent upstream.
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1352)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1278)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1199)
> 	at java.lang.Thread.run(Thread.java:662)
> ...
> {code}