[ https://issues.apache.org/jira/browse/HDFS-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Udit Mehrotra updated HDFS-15837:
---------------------------------
    Description: 
We are seeing cases of HDFS blocks being marked as *bad* after the initial block receive fails during *update pipeline*, followed by the *HDFS recovery* for the block failing as well. Here is the life cycle of a block *{{blk_1342440165_272630578}}* that was ultimately marked as corrupt:

1. The block creation starts at the name node as part of the *update pipeline* process:
{noformat}
2021-01-25 03:41:17,335 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 61 on 8020): updatePipeline(blk_1342440165_272500939 => blk_1342440165_272630578) success{noformat}
2. The block receiver on the data node fails with a socket timeout exception, and so do the retries:
{noformat}
2021-01-25 03:42:22,525 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.226.26:56294 remote=/172.21.246.239:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:400)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1305)
        at java.lang.Thread.run(Thread.java:748)
2021-01-25 03:42:22,526 WARN org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): IOException in BlockReceiver.run():
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1552)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1489)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1402)
        at java.lang.Thread.run(Thread.java:748)
2021-01-25 03:42:22,526 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1552)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1489)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1402)
        at java.lang.Thread.run(Thread.java:748){noformat}
3. The data node terminates receiving the block and initiates recovery:
{noformat}
2021-01-25 03:42:22,526 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010] terminating
2021-01-25 03:42:22,529 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:60430 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): Receiving BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578 src: /172.21.240.31:60430 dest: /172.21.226.26:50010
2021-01-25 03:42:22,529 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:60430 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): Recover RBW replica BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578{noformat}
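For context, this recovery is driven by the writing client rather than by the data node itself: after the pipeline ack failure the client obtains a new generation stamp from the Name Node, re-opens the pipeline to the surviving node(s) in recovery mode (which is what produces the "Recover RBW replica" message above), and finally commits the new pipeline. A rough sketch of that sequence against the public {{ClientProtocol}} API is below; it is a simplification of the client's {{DataStreamer}} logic, and the parameters ({{bytesAcked}}, {{storageIDs}}, the surviving node list) are placeholders for values the client already tracks:
{code:java}
import java.io.IOException;
import org.apache.hadoop.hdfs.protocol.ClientProtocol;
import org.apache.hadoop.hdfs.protocol.DatanodeID;
import org.apache.hadoop.hdfs.protocol.ExtendedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;

// Simplified sketch of client-driven pipeline recovery; not the actual
// DataStreamer implementation.
public class PipelineRecoverySketch {
  static void recoverPipeline(ClientProtocol namenode, String clientName,
      ExtendedBlock oldBlock, DatanodeID[] survivingNodes, String[] storageIDs,
      long bytesAcked) throws IOException {
    // 1. Ask the Name Node for a new generation stamp for the block.
    LocatedBlock updated = namenode.updateBlockForPipeline(oldBlock, clientName);
    long newGS = updated.getBlock().getGenerationStamp();

    // 2. (Not shown) Re-open the write pipeline to the surviving data node(s) in
    //    recovery mode, which makes the data node recover its existing RBW replica
    //    instead of creating a new one ("Recover RBW replica ..." in the log above).

    // 3. Commit the new pipeline at the Name Node with the length acked so far;
    //    this is where newGS=272645069 and newLength=3072 in step 6 come from.
    ExtendedBlock newBlock = new ExtendedBlock(oldBlock.getBlockPoolId(),
        oldBlock.getBlockId(), bytesAcked, newGS);
    namenode.updatePipeline(clientName, oldBlock, newBlock, survivingNodes, storageIDs);
  }
}
{code}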
4. The recovery on the datanode for *{{blk_1342440165_272630578}}* fails with:
{noformat}
2021-01-25 03:42:34,223 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:45918 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272500939]): Exception for BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578
java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
        at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.adjustCrcChannelPosition(FsDatasetImpl.java:1716)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.adjustCrcFilePosition(BlockReceiver.java:1052)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:733)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:923)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:854)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
        at java.lang.Thread.run(Thread.java:748)
2021-01-25 03:42:34,223 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:45918 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272500939]): opWriteBlock BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578 received exception java.nio.channels.ClosedByInterruptException
2021-01-25 03:42:34,223 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:45918 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272500939]): ip-172-21-226-26.ec2.internal:50010:DataXceiver error processing WRITE_BLOCK operation src: /172.21.240.31:45918 dst: /172.21.226.26:50010
java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
        at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.adjustCrcChannelPosition(FsDatasetImpl.java:1716)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.adjustCrcFilePosition(BlockReceiver.java:1052)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:733)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:923)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:854)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
        at java.lang.Thread.run(Thread.java:748){noformat}
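The {{ClosedByInterruptException}} above does not by itself point at a problem with the replica data; it is the standard {{java.nio}} behavior when the thread performing a channel operation is interrupted, which is what happens to the previous {{DataXceiver}} thread for this block when it is torn down during pipeline recovery. A minimal stand-alone illustration of that behavior (plain Java, unrelated to the HDFS code paths; behavior assumed for a typical JDK):
{code:java}
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ClosedByInterruptDemo {
  public static void main(String[] args) throws Exception {
    Path tmp = Files.createTempFile("interrupt-demo", ".bin");
    try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
      // Mark the current thread as interrupted before touching the channel.
      Thread.currentThread().interrupt();
      try {
        // A blocking operation on an interruptible channel now closes the channel
        // and throws ClosedByInterruptException -- the same exception surfaced by
        // adjustCrcChannelPosition() when the DataXceiver thread is interrupted.
        ch.position(0);
      } catch (ClosedByInterruptException e) {
        System.out.println("channel closed by interrupt: " + e);
      } finally {
        Thread.interrupted(); // clear the interrupt flag so cleanup can proceed
      }
    }
    Files.deleteIfExists(tmp);
  }
}
{code}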
5. This is followed by the resetting of bytes on the data node to the correct length {{3072}}, which is expected behavior based on https://issues.apache.org/jira/browse/hdfs-11472:
{noformat}
2021-01-25 03:42:40,007 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:60430 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): At ip-172-21-226-26.ec2.internal:50010, Recovering ReplicaBeingWritten, blk_1342440165_272630578, RBW
  getNumBytes() = 3072
  getBytesOnDisk() = 2678
  getVisibleLength()= 3072
  getVolume() = /mnt/hdfs/current
  getBlockFile() = /mnt/hdfs/current/BP-908477295-172.21.224.178-1606768078949/current/rbw/blk_1342440165
  bytesAcked=3072
  bytesOnDisk=2678
2021-01-25 03:42:40,011 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client DFSClient_attempt_16089610225137_1287341_r_000139_0_392781198_30 at /172.21.244.117:34354 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342762434_272648220]): Receiving BP-908477295-172.21.224.178-1606768078949:blk_1342762434_272648220 src: /172.21.244.117:34354 dest: /172.21.226.26:50010
2021-01-25 03:42:40,013 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:60430 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): Resetting bytesOnDisk to match blockDataLength (=3072) for replica ReplicaBeingWritten, blk_1342440165_272630578, RBW
  getNumBytes() = 3072
  getBytesOnDisk() = 2678
  getVisibleLength()= 3072
  getVolume() = /mnt/hdfs/current
  getBlockFile() = /mnt/hdfs/current/BP-908477295-172.21.224.178-1606768078949/current/rbw/blk_1342440165
  bytesAcked=3072
  bytesOnDisk=2678{noformat}
6. Since recovery failed, the Name Node restarts the {{updatePipeline}} for the block by increasing the generation stamp:
{noformat}
2021-01-25 03:42:40,034 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 15 on 8020): updatePipeline(blk_1342440165_272630578, newGS=272645069, newLength=3072, newNodes=[172.21.226.26:50010], client=DFSClient_NONMAPREDUCE_390070548_42)
2021-01-25 03:42:40,034 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 15 on 8020): updatePipeline(blk_1342440165_272630578 => blk_1342440165_272645069) success{noformat}
Notice how the correct *{{newLength}}* is sent in the above log from the Name Node, i.e. *{{3072}}*.

7. The data node then receives the new generation of this block correctly:
{noformat}
2021-01-25 03:46:40,066 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069, type=LAST_IN_PIPELINE): src: /172.21.240.31:60430, dest: /172.21.226.26:50010, bytes: 3554, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_390070548_42, offset: 0, srvID: 1a2ff556-73f5-4e10-b018-a0c2105b242e, blockid: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069, duration: 240052941459
2021-01-25 03:46:40,066 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069, type=LAST_IN_PIPELINE): PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069, type=LAST_IN_PIPELINE terminating{noformat}
But notice here that *bytes: 3554* is different from what was sent from the Name Node, i.e. *3072*.
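To make the numbers printed in step 5 easier to follow, the length reconciliation performed during RBW recovery can be summarized roughly as below. This is an illustrative sketch only, not the actual {{FsDatasetImpl#recoverRbw}} code; the method and variable names are made up to mirror the values from the log ({{bytesAcked=3072}}, {{bytesOnDisk=2678}}, on-disk block file length 3072):
{code:java}
// Illustrative sketch of RBW length reconciliation during pipeline recovery.
final class RbwRecoverySketch {
  static long reconcile(long bytesAcked, long bytesOnDisk, long blockDataLength) {
    if (bytesOnDisk != blockDataLength) {
      // Trust the actual block file length over the stale in-memory counter,
      // as in "Resetting bytesOnDisk to match blockDataLength (=3072)" above
      // (behavior described in HDFS-11472).
      bytesOnDisk = blockDataLength;
    }
    if (bytesOnDisk > bytesAcked) {
      // A replica may hold more data than the client ever received an ack for;
      // recovery would cut it back to the acknowledged length.
      return bytesAcked;
    }
    // Here 3072 == 3072, so the recovered replica length is 3072, matching the
    // newLength later sent in updatePipeline.
    return bytesOnDisk;
  }
}
{code}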
8. The data node then successfully replicates this block downstream to another node:
{noformat}
2021-01-25 03:46:43,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (BP-908477295-172.21.224.178-1606768078949 heartbeating to /172.21.238.225:8020): DatanodeRegistration(172.21.226.26:50010, datanodeUuid=1a2ff556-73f5-4e10-b018-a0c2105b242e, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-57;cid=CID-7608840c-3634-4377-9ba0-1c710e6d08d5;nsid=2011779031;c=1606768078949) Starting thread to transfer BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069 to 172.21.234.181:50010
2021-01-25 03:46:43,094 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer@5cb29d58): DataTransfer, at ip-172-21-226-26.ec2.internal:50010: Transmitted BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069 (numBytes=3554) to /172.21.234.181:5001{noformat}
9. However, immediately after the successful transmission and replication, the namenode declares the block as corrupt:
{noformat}
2021-01-25 03:46:43,106 INFO org.apache.hadoop.hdfs.StateChange (IPC Server handler 46 on 8020): *DIR* reportBadBlocks for block: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069 on datanode: 172.21.226.26:50010{noformat}
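A plausible reading of the final {{reportBadBlocks}} is a length-consistency check on the Name Node side: the block was committed with the {{newLength}} from {{updatePipeline}} (3072), but the finalized replica is reported with 3554 bytes. An illustrative sketch of such a check (not the actual {{BlockManager}} code) is:
{code:java}
// Illustrative sketch of a Name Node side consistency check that would flag the
// replica reported in this issue; the real logic also considers generation
// stamps, replica states, etc.
final class ReportedReplicaCheckSketch {
  enum Reason { NONE, GENSTAMP_MISMATCH, SIZE_MISMATCH }

  static Reason check(long expectedGenStamp, long expectedLength,
                      long reportedGenStamp, long reportedLength) {
    if (reportedGenStamp != expectedGenStamp) {
      return Reason.GENSTAMP_MISMATCH;
    }
    // With the values above: expectedLength=3072 (newLength committed via
    // updatePipeline) vs reportedLength=3554 (bytes actually written and
    // transferred), so the replica would be flagged on a size mismatch.
    if (reportedLength != expectedLength) {
      return Reason.SIZE_MISMATCH;
    }
    return Reason.NONE;
  }
}
{code}
If that is what is happening here, it is the mismatch between the committed length and the bytes actually written after recovery, rather than any on-disk damage, that drives the corrupt marking.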
> Incorrect bytes causing block corruption after update pipeline and recovery failure
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-15837
>                 URL: https://issues.apache.org/jira/browse/HDFS-15837
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs
>    Affects Versions: 2.8.5
>            Reporter: Udit Mehrotra
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org