[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15448770#comment-15448770 ]
Yiqun Lin edited comment on HDFS-6532 at 8/30/16 11:24 AM:
-----------------------------------------------------------

I looked into this issue again and I might have found the root cause. As [~kihwal] mentioned, the failing case does not print the following info:
{code}
(TestCrcCorruption.java:testCorruptionDuringWrt(140)) - Got expected exception
java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
{code}
Instead, the failing case prints info like this:
{code}
(TestCrcCorruption.java:testCorruptionDuringWrt(140)) - Got expected exception
java.io.InterruptedIOException: Interrupted while waiting for data to be acknowledged by pipeline
	at org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:775)
	at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:697)
	at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:778)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:755)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
{code}
That means the program sometimes returns before performing the pipeline recovery operations. The related code:
{code:title=DataStreamer.java|borderStyle=solid}
private boolean processDatanodeOrExternalError() throws IOException {
  if (!errorState.hasDatanodeError() && !shouldHandleExternalError()) {
    return false;
  }
  LOG.debug("start process datanode/external error, {}", this);

  // If the responder has not been closed, this method will just return
  if (response != null) {
    LOG.info("Error Recovery for " + block +
        " waiting for responder to exit. ");
    return true;
  }
  closeStream();
  ...
{code}
I looked into the code and I think there is a bug causing this. The related code:
{code:title=DataStreamer.java|borderStyle=solid}
public void run() {
  long lastPacket = Time.monotonicNow();
  TraceScope scope = null;
  while (!streamerClosed && dfsClient.clientRunning) {
    // if the Responder encountered an error, shutdown Responder
    if (errorState.hasError() && response != null) {
      try {
        response.close();
        response.join();
        response = null;
      } catch (InterruptedException e) {
        // If an InterruptedException happens, response will not be set to null
        LOG.warn("Caught exception", e);
      }
    }
    // Need to add a finally block here to set response to null
    ...
{code}
I think we should move the line {{response = null;}} into a {{finally}} block (a quick sketch is at the end of this comment). Finally, I have attached a patch for this. This test has failed intermittently for a long time; I hope my patch makes sense. Gently pinging [~xiaochen], [~kihwal] and [~yzhangal] for comments. Thanks.
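For clarity, here is a minimal sketch of what I am proposing (the attached patch is the actual change; this fragment only illustrates moving the assignment into a {{finally}} block):
{code:title=DataStreamer.java (proposed fix, sketch)|borderStyle=solid}
// if the Responder encountered an error, shutdown Responder
if (errorState.hasError() && response != null) {
  try {
    response.close();
    response.join();
  } catch (InterruptedException e) {
    LOG.warn("Caught exception", e);
  } finally {
    // Reset response even when close()/join() is interrupted, so that
    // processDatanodeOrExternalError() no longer keeps returning early
    // on a dead responder and can proceed with pipeline recovery.
    response = null;
  }
}
{code}
With this change, an interrupt during responder shutdown can no longer leave a stale {{response}} reference behind, so the next call to {{processDatanodeOrExternalError()}} reaches {{closeStream()}} and the recovery path instead of returning early.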
> Intermittent test failure
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6532
>                 URL: https://issues.apache.org/jira/browse/HDFS-6532
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.4.0
>            Reporter: Yongjun Zhang
>            Assignee: Yiqun Lin
>         Attachments: HDFS-6532.001.patch, TEST-org.apache.hadoop.hdfs.TestCrcCorruption.xml
>
>
> Per https://builds.apache.org/job/Hadoop-Hdfs-trunk/1774/testReport, we had the following failure. Local rerun is successful.
> {code}
> Regression
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> Failing for the past 1 build (Since Failed#1774 )
> Took 50 sec.
> Error Message
>
> test timed out after 50000 milliseconds
>
> Stacktrace
>
> java.lang.Exception: test timed out after 50000 milliseconds
> 	at java.lang.Object.wait(Native Method)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2024)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2008)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2107)
> 	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
> 	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:98)
> 	at org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:133)
> {code}
> See relevant exceptions in the log:
> {code}
> 2014-06-14 11:56:15,283 WARN datanode.DataNode (BlockReceiver.java:verifyChunks(404)) - Checksum error in block BP-1675558312-67.195.138.30-1402746971712:blk_1073741825_1001 from /127.0.0.1:41708
> org.apache.hadoop.fs.ChecksumException: Checksum error: DFSClient_NONMAPREDUCE_-1139495951_8 at 64512 exp: 1379611785 got: -12163112
> 	at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:353)
> 	at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:284)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:402)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:537)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
> 	at java.lang.Thread.run(Thread.java:662)
> 2014-06-14 11:56:15,285 WARN datanode.DataNode (BlockReceiver.java:run(1207)) - IOException in BlockReceiver.run():
> java.io.IOException: Shutting down writer and responder due to a checksum error in received data. The error response has been sent upstream.
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1352)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1278)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1199)
> 	at java.lang.Thread.run(Thread.java:662)
> ...
> {code}