[ 
https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15449965#comment-15449965
 ] 

Xiao Chen commented on HDFS-6532:
---------------------------------

Thanks Yiqun for working on this. It does look like we can reuse the 
{{closeResponder}} method in the loop, but I don't think that's the root cause 
here.

Taking the failure log in attachment as an example, the test is supposed to end 
quickly (around 15:41:58) after 5 times failure on checksum error. But somehow 
it did not, and hangs there until the 50 seconds test timeout is reached. After 
test timeout, junit interrupts all threads which is what we see in the last 3 
messages (around 15:42:43).

I looked into this too, and still think this is some error on triggering / 
handling the interrupt after the 5th checksum error. Don't have any concrete 
progress though.
{noformat}
2016-08-20 15:41:58,084 INFO  datanode.DataNode 
(DataXceiver.java:writeBlock(835)) - opWriteBlock 
BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 received exception 
java.io.IOException: Terminating due to a checksum error.java.io.IOException: 
Unexpected checksum mismatch while writing 
BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 from /127.0.0.1:49059
2016-08-20 15:41:58,084 ERROR datanode.DataNode (DataXceiver.java:run(273)) - 
127.0.0.1:52977:DataXceiver error processing WRITE_BLOCK operation  src: 
/127.0.0.1:49059 dst: /127.0.0.1:52977
java.io.IOException: Terminating due to a checksum error.java.io.IOException: 
Unexpected checksum mismatch while writing 
BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 from /127.0.0.1:49059
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:606)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:896)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:802)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
        at java.lang.Thread.run(Thread.java:745)
2016-08-20 15:41:58,258 INFO  BlockStateChange 
(BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask 
127.0.0.1:51819 to delete [blk_1073741825_1002]
2016-08-20 15:41:58,258 INFO  BlockStateChange 
(BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask 
127.0.0.1:39731 to delete [blk_1073741825_1002]
2016-08-20 15:41:58,258 INFO  BlockStateChange 
(BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask 
127.0.0.1:52977 to delete [blk_1073741825_1002]
2016-08-20 15:41:59,235 INFO  BlockStateChange (InvalidateBlocks.java:add(116)) 
- BLOCK* InvalidateBlocks: add blk_1073741825_1001 to 127.0.0.1:49498
2016-08-20 15:41:59,238 INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling 
blk_1073741825_1002 file 
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data5/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
 for deletion
2016-08-20 15:41:59,240 INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:run(295)) - Deleted 
BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1002 file 
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data5/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
2016-08-20 15:41:59,378 INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling 
blk_1073741825_1002 file 
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data9/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
 for deletion
2016-08-20 15:41:59,378 INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:run(295)) - Deleted 
BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1002 file 
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data9/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
2016-08-20 15:41:59,698 INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling 
blk_1073741825_1002 file 
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data17/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
 for deletion
2016-08-20 15:41:59,698 INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:run(295)) - Deleted 
BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1002 file 
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data17/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
2016-08-20 15:42:01,259 INFO  BlockStateChange 
(BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask 
127.0.0.1:49498 to delete [blk_1073741825_1001]
2016-08-20 15:42:02,098 INFO  impl.FsDatasetImpl 
(FsVolumeList.java:waitVolumeRemoved(286)) - Volume reference is released.
2016-08-20 15:42:02,232 INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling 
blk_1073741825_1001 file 
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current/BP-1703495320-172.17.0.1-1471707714371/current/rbw/blk_1073741825
 for deletion
2016-08-20 15:42:02,233 INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:run(295)) - Deleted 
BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1001 file 
/tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current/BP-1703495320-172.17.0.1-1471707714371/current/rbw/blk_1073741825
2016-08-20 15:42:03,051 INFO  impl.FsDatasetImpl 
(FsVolumeList.java:waitVolumeRemoved(286)) - Volume reference is released.
2016-08-20 15:42:03,056 INFO  impl.FsDatasetImpl 
(FsVolumeList.java:waitVolumeRemoved(286)) - Volume reference is released.
2016-08-20 15:42:25,287 INFO  blockmanagement.CacheReplicationMonitor 
(CacheReplicationMonitor.java:run(179)) - Rescanning after 30000 milliseconds
2016-08-20 15:42:25,288 INFO  blockmanagement.CacheReplicationMonitor 
(CacheReplicationMonitor.java:run(202)) - Scanned 0 directive(s) and 0 block(s) 
in 1 millisecond(s).
2016-08-20 15:42:43,521 WARN  hdfs.DFSClient 
(DFSOutputStream.java:closeResponder(865)) - Caught exception 
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1281)
        at java.lang.Thread.join(Thread.java:1355)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:863)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeInternal(DFSOutputStream.java:831)
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:827)
2016-08-20 15:42:43,522 INFO  datanode.DataNode 
(BlockReceiver.java:receiveBlock(935)) - Exception for 
BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005
java.nio.channels.ClosedByInterruptException
        at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:417)
        at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:500)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:896)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:802)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
        at java.lang.Thread.run(Thread.java:745)
2016-08-20 15:42:43,523 INFO  datanode.DataNode 
(DataXceiver.java:writeBlock(835)) - opWriteBlock 
BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 received exception 
java.nio.channels.ClosedByInterruptException
2016-08-20 15:42:43,523 ERROR datanode.DataNode (DataXceiver.java:run(273)) - 
127.0.0.1:35247:DataXceiver error processing WRITE_BLOCK operation  src: 
/127.0.0.1:44749 dst: /127.0.0.1:35247
java.nio.channels.ClosedByInterruptException
        at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:417)
        at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:500)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:896)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:802)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
        at java.lang.Thread.run(Thread.java:745)
2016-08-20 15:42:43,523 INFO  hdfs.DFSClient 
(TestCrcCorruption.java:testCorruptionDuringWrt(143)) - Got expected exception
java.io.InterruptedIOException: Interrupted while waiting for data to be 
acknowledged by pipeline
        at 
org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2442)
        at 
org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2420)
        at 
org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2582)
        at 
org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2541)
        at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
        at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
        at 
org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:139)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
2016-08-20 15:42:43,523 INFO  hdfs.MiniDFSCluster 
(MiniDFSCluster.java:shutdown(1744)) - Shutting down the Mini HDFS Cluster
{noformat}

> Intermittent test failure 
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6532
>                 URL: https://issues.apache.org/jira/browse/HDFS-6532
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.4.0
>            Reporter: Yongjun Zhang
>            Assignee: Yiqun Lin
>         Attachments: HDFS-6532.001.patch, 
> TEST-org.apache.hadoop.hdfs.TestCrcCorruption.xml
>
>
> Per https://builds.apache.org/job/Hadoop-Hdfs-trunk/1774/testReport, we had 
> the following failure. Local rerun is successful
> {code}
> Regression
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> Failing for the past 1 build (Since Failed#1774 )
> Took 50 sec.
> Error Message
> test timed out after 50000 milliseconds
> Stacktrace
> java.lang.Exception: test timed out after 50000 milliseconds
>       at java.lang.Object.wait(Native Method)
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2024)
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2008)
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2107)
>       at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
>       at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:98)
>       at 
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:133)
> {code}
> See relevant exceptions in log
> {code}
> 2014-06-14 11:56:15,283 WARN  datanode.DataNode 
> (BlockReceiver.java:verifyChunks(404)) - Checksum error in block 
> BP-1675558312-67.195.138.30-1402746971712:blk_1073741825_1001 from 
> /127.0.0.1:41708
> org.apache.hadoop.fs.ChecksumException: Checksum error: 
> DFSClient_NONMAPREDUCE_-1139495951_8 at 64512 exp: 1379611785 got: -12163112
>       at 
> org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:353)
>       at 
> org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:284)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:402)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:537)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
>       at java.lang.Thread.run(Thread.java:662)
> 2014-06-14 11:56:15,285 WARN  datanode.DataNode 
> (BlockReceiver.java:run(1207)) - IOException in BlockReceiver.run(): 
> java.io.IOException: Shutting down writer and responder due to a checksum 
> error in received data. The error response has been sent upstream.
>       at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1352)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1278)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1199)
>       at java.lang.Thread.run(Thread.java:662)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to