[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15449965#comment-15449965 ]
Xiao Chen commented on HDFS-6532:
---------------------------------

Thanks Yiqun for working on this. It does look like we can reuse the {{closeResponder}} method in the loop, but I don't think that's the root cause here. Taking the failure log in the attachment as an example, the test is supposed to end quickly (around 15:41:58) after the 5th checksum-error failure, but somehow it does not, and hangs until the 50-second test timeout is reached. After the timeout, junit interrupts all threads, which is what we see in the last 3 messages (around 15:42:43). I looked into this too, and still think there is some error in triggering / handling the interrupt after the 5th checksum error. I don't have any concrete progress to report though.

{noformat}
2016-08-20 15:41:58,084 INFO datanode.DataNode (DataXceiver.java:writeBlock(835)) - opWriteBlock BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 received exception java.io.IOException: Terminating due to a checksum error.java.io.IOException: Unexpected checksum mismatch while writing BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 from /127.0.0.1:49059
2016-08-20 15:41:58,084 ERROR datanode.DataNode (DataXceiver.java:run(273)) - 127.0.0.1:52977:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:49059 dst: /127.0.0.1:52977
java.io.IOException: Terminating due to a checksum error.java.io.IOException: Unexpected checksum mismatch while writing BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 from /127.0.0.1:49059
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:606)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:896)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:802)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
    at java.lang.Thread.run(Thread.java:745)
2016-08-20 15:41:58,258 INFO BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask 127.0.0.1:51819 to delete [blk_1073741825_1002]
2016-08-20 15:41:58,258 INFO BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask 127.0.0.1:39731 to delete [blk_1073741825_1002]
2016-08-20 15:41:58,258 INFO BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask 127.0.0.1:52977 to delete [blk_1073741825_1002]
2016-08-20 15:41:59,235 INFO BlockStateChange (InvalidateBlocks.java:add(116)) - BLOCK* InvalidateBlocks: add blk_1073741825_1001 to 127.0.0.1:49498
2016-08-20 15:41:59,238 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling blk_1073741825_1002 file /tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data5/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825 for deletion
2016-08-20 15:41:59,240 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(295)) - Deleted BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1002 file /tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data5/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
2016-08-20 15:41:59,378 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling blk_1073741825_1002 file /tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data9/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825 for deletion
2016-08-20 15:41:59,378 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(295)) - Deleted BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1002 file /tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data9/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
2016-08-20 15:41:59,698 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling blk_1073741825_1002 file /tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data17/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825 for deletion
2016-08-20 15:41:59,698 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(295)) - Deleted BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1002 file /tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data17/current/BP-1703495320-172.17.0.1-1471707714371/current/finalized/subdir0/subdir0/blk_1073741825
2016-08-20 15:42:01,259 INFO BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3667)) - BLOCK* BlockManager: ask 127.0.0.1:49498 to delete [blk_1073741825_1001]
2016-08-20 15:42:02,098 INFO impl.FsDatasetImpl (FsVolumeList.java:waitVolumeRemoved(286)) - Volume reference is released.
2016-08-20 15:42:02,232 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(217)) - Scheduling blk_1073741825_1001 file /tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current/BP-1703495320-172.17.0.1-1471707714371/current/rbw/blk_1073741825 for deletion
2016-08-20 15:42:02,233 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(295)) - Deleted BP-1703495320-172.17.0.1-1471707714371 blk_1073741825_1001 file /tmp/run_tha_test5KJcML/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current/BP-1703495320-172.17.0.1-1471707714371/current/rbw/blk_1073741825
2016-08-20 15:42:03,051 INFO impl.FsDatasetImpl (FsVolumeList.java:waitVolumeRemoved(286)) - Volume reference is released.
2016-08-20 15:42:03,056 INFO impl.FsDatasetImpl (FsVolumeList.java:waitVolumeRemoved(286)) - Volume reference is released.
2016-08-20 15:42:25,287 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(179)) - Rescanning after 30000 milliseconds
2016-08-20 15:42:25,288 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2016-08-20 15:42:43,521 WARN hdfs.DFSClient (DFSOutputStream.java:closeResponder(865)) - Caught exception
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:863)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeInternal(DFSOutputStream.java:831)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:827)
2016-08-20 15:42:43,522 INFO datanode.DataNode (BlockReceiver.java:receiveBlock(935)) - Exception for BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:417)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:500)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:896)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:802)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
    at java.lang.Thread.run(Thread.java:745)
2016-08-20 15:42:43,523 INFO datanode.DataNode (DataXceiver.java:writeBlock(835)) - opWriteBlock BP-1703495320-172.17.0.1-1471707714371:blk_1073741826_1005 received exception java.nio.channels.ClosedByInterruptException
2016-08-20 15:42:43,523 ERROR datanode.DataNode (DataXceiver.java:run(273)) - 127.0.0.1:35247:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:44749 dst: /127.0.0.1:35247
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:417)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:500)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:896)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:802)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
    at java.lang.Thread.run(Thread.java:745)
2016-08-20 15:42:43,523 INFO hdfs.DFSClient (TestCrcCorruption.java:testCorruptionDuringWrt(143)) - Got expected exception
java.io.InterruptedIOException: Interrupted while waiting for data to be acknowledged by pipeline
    at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2442)
    at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2420)
    at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2582)
    at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2541)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
    at org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:139)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
2016-08-20 15:42:43,523 INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1744)) - Shutting down the Mini HDFS Cluster
{noformat}

> Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6532
>                 URL: https://issues.apache.org/jira/browse/HDFS-6532
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.4.0
>            Reporter: Yongjun Zhang
>            Assignee: Yiqun Lin
>         Attachments: HDFS-6532.001.patch, TEST-org.apache.hadoop.hdfs.TestCrcCorruption.xml
>
>
> Per https://builds.apache.org/job/Hadoop-Hdfs-trunk/1774/testReport, we had the following failure. Local rerun is successful
> {code}
> Regression
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> Failing for the past 1 build (Since Failed#1774 )
> Took 50 sec.
> Error Message
> test timed out after 50000 milliseconds
> Stacktrace
> java.lang.Exception: test timed out after 50000 milliseconds
>     at java.lang.Object.wait(Native Method)
>     at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2024)
>     at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2008)
>     at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2107)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
>     at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:98)
>     at org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:133)
> {code}
> See relevant exceptions in log
> {code}
> 2014-06-14 11:56:15,283 WARN datanode.DataNode (BlockReceiver.java:verifyChunks(404)) - Checksum error in block BP-1675558312-67.195.138.30-1402746971712:blk_1073741825_1001 from /127.0.0.1:41708
> org.apache.hadoop.fs.ChecksumException: Checksum error: DFSClient_NONMAPREDUCE_-1139495951_8 at 64512 exp: 1379611785 got: -12163112
>     at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:353)
>     at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:284)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:402)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:537)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
>     at java.lang.Thread.run(Thread.java:662)
> 2014-06-14 11:56:15,285 WARN datanode.DataNode (BlockReceiver.java:run(1207)) - IOException in BlockReceiver.run():
> java.io.IOException: Shutting down writer and responder due to a checksum error in received data. The error response has been sent upstream.
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1352)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1278)
>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1199)
>     at java.lang.Thread.run(Thread.java:662)
> ...
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
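For what it's worth, the interrupt-handling theory in the comment above can be sketched in isolation. The snippet below is a hypothetical, self-contained illustration (it is NOT the actual DFSOutputStream code; the class and method names are made up): if a close helper catches InterruptedException from Thread.join() and swallows it without re-asserting the interrupt, the calling thread's interrupt status is lost, so a later blocking wait in that thread no longer returns early — which would produce exactly this kind of hang until the junit timeout fires a second interrupt.

```java
// Hypothetical demo of losing vs. preserving a thread's interrupt status.
// Names are illustrative only; this is not Hadoop code.
public class InterruptSwallowDemo {

    // Suspected anti-pattern: InterruptedException is caught and dropped,
    // which CLEARS the caller's interrupt status permanently.
    static void closeQuietlySwallowing(Thread t) {
        try {
            t.join(1000);
        } catch (InterruptedException e) {
            // interrupt status is cleared here and never restored
        }
    }

    // Correct variant: re-assert the interrupt so later blocking calls
    // (e.g. an Object.wait() loop) still observe it.
    static void closeQuietlyRestoring(Thread t) {
        try {
            t.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt
        }
    }

    public static void main(String[] args) throws Exception {
        Thread sleeper = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
        });
        sleeper.start();

        // Simulate an external interrupt (like junit's timeout handler).
        Thread.currentThread().interrupt();
        closeQuietlySwallowing(sleeper);
        boolean afterSwallow = Thread.interrupted();  // false: interrupt lost

        Thread.currentThread().interrupt();
        closeQuietlyRestoring(sleeper);
        boolean afterRestore = Thread.interrupted();  // true: interrupt kept

        sleeper.interrupt();
        sleeper.join();
        System.out.println(afterSwallow + " " + afterRestore);  // false true
    }
}
```

If the real code path matches this pattern, the first interrupt (after the 5th checksum error) would be silently consumed inside the close path, and the streamer would go back to waiting until the test-timeout interrupt at 15:42:43 finally breaks it — consistent with the log above. Again, this is only a sketch of the hypothesis, not a diagnosis.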