[ 
https://issues.apache.org/jira/browse/HDFS-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HDFS-15837:
---------------------------------
    Description: 
We are seeing cases of HDFS blocks being marked as *bad* after the initial 
block receive fails during *update pipeline*, followed by the *HDFS recovery* for 
the block failing as well. Here is the life cycle of a block 
*{{blk_1342440165_272630578}}* that was ultimately marked as corrupt:

1. The block creation starts at the name node as part of the *update pipeline* process:
{noformat}
2021-01-25 03:41:17,335 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 61 on 
8020): updatePipeline(blk_1342440165_272500939 => blk_1342440165_272630578) 
success{noformat}
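For context, here is a minimal, self-contained sketch of what this step conceptually tracks for an under-construction block on the Name Node side. The class and field names below are illustrative only (not the actual {{FSNamesystem}} code), but they show why the generation stamp changes from 272500939 to 272630578:
{code:java}
// Illustrative sketch only: a toy model of the state updatePipeline refreshes for an
// under-construction block. Names are hypothetical, not actual FSNamesystem code.
class UnderConstructionBlock {
    final long blockId;
    long generationStamp;   // e.g. 272500939 -> 272630578 in the log above
    long numBytes;          // length acknowledged by the client so far
    String[] pipelineNodes; // data nodes currently in the write pipeline

    UnderConstructionBlock(long blockId, long gs, long numBytes, String[] nodes) {
        this.blockId = blockId;
        this.generationStamp = gs;
        this.numBytes = numBytes;
        this.pipelineNodes = nodes;
    }

    // updatePipeline swaps in a new generation stamp, the acked length and the surviving
    // nodes; the Name Node then logs "updatePipeline(old => new) success" as above.
    void updatePipeline(long newGS, long newLength, String[] newNodes) {
        this.generationStamp = newGS;
        this.numBytes = newLength;
        this.pipelineNodes = newNodes;
    }
}
{code}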
2. The block receiver on the data node fails with a socket timeout exception, 
and so do the retries:
{noformat}
2021-01-25 03:42:22,525 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(PacketResponder: 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, 
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): 
PacketResponder: 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, 
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/172.21.226.26:56294 remote=/172.21.246.239:50010]
    at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at 
org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:400)
    at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1305)
    at java.lang.Thread.run(Thread.java:748)

2021-01-25 03:42:22,526 WARN org.apache.hadoop.hdfs.server.datanode.DataNode 
(PacketResponder: 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, 
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): 
IOException in BlockReceiver.run(): 
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
    at 
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1552)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1489)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1402)
    at java.lang.Thread.run(Thread.java:748)

2021-01-25 03:42:22,526 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(PacketResponder: 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, 
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): 
PacketResponder: 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, 
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
    at 
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1552)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1489)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1402)
    at java.lang.Thread.run(Thread.java:748){noformat}
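The timeout above happens while the PacketResponder waits for the downstream data node's ack. As a rough, standalone illustration of that failure mode (using a plain {{java.net.Socket}} instead of Hadoop's {{SocketIOWithTimeout}}, and a hypothetical ack format), this is the kind of read that throws once the downstream node stops responding within the 65 s read timeout:
{code:java}
import java.io.DataInputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Rough illustration only: a blocking downstream-ack read with a 65s timeout, analogous
// to what the PacketResponder does (the real code uses SocketIOWithTimeout and protobuf acks).
public class AckReadSketch {
    public static void main(String[] args) throws Exception {
        try (Socket downstream = new Socket("172.21.246.239", 50010)) {
            downstream.setSoTimeout(65_000); // read timeout, as in the exception message
            DataInputStream in = new DataInputStream(downstream.getInputStream());
            int ackBytes = in.readInt();     // blocks until the downstream node answers
            System.out.println("received ack of " + ackBytes + " bytes");
        } catch (SocketTimeoutException e) {
            // Step 2's condition: the downstream never answered in time, so the responder
            // gives up and the client has to rebuild the pipeline.
            System.err.println("timed out waiting for downstream ack: " + e);
        }
    }
}
{code}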
3. The data node terminates receiving the block and initiates recovery:
{noformat}
2021-01-25 03:42:22,526 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(PacketResponder: 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, 
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): 
PacketResponder: 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, 
type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010] 
terminating

2021-01-25 03:42:22,529 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at 
/172.21.240.31:60430 [Receiving block 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): Receiving 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578 src: 
/172.21.240.31:60430 dest: /172.21.226.26:50010

2021-01-25 03:42:22,529 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl 
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at 
/172.21.240.31:60430 [Receiving block 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): Recover 
RBW replica 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578{noformat}
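For readers less familiar with this path: "Recover RBW replica" means the data node stops the previous writer and reuses the existing replica-being-written under the new pipeline. A simplified conceptual model (not the actual {{FsDatasetImpl.recoverRbw}} implementation) is sketched below; stopping the old writer interrupts its thread, which is consistent with the {{ClosedByInterruptException}} seen in step 4:
{code:java}
// Conceptual model only, not FsDatasetImpl.recoverRbw. On pipeline recovery the data node
// stops the old writer thread and continues the same RBW replica under a new generation stamp.
class RbwReplicaSketch {
    long generationStamp;
    long bytesAcked;   // bytes acknowledged through the old pipeline (3072 in this case)
    long bytesOnDisk;  // bytes actually flushed to the block file (2678 in this case)

    void recoverRbw(long newGS, long minBytesRcvd, long maxBytesRcvd) {
        // Stop the old writer: interrupting the old BlockReceiver thread is what makes its
        // in-flight FileChannel calls fail (see step 4).
        if (bytesAcked < minBytesRcvd || bytesAcked > maxBytesRcvd) {
            throw new IllegalStateException("unexpected replica length " + bytesAcked);
        }
        generationStamp = newGS; // the replica carries on under the new pipeline's GS
    }
}
{code}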
4. The recovery on the datanode for *{{blk_1342440165_272630578}}* fails with:
{noformat}
2021-01-25 03:42:34,223 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at 
/172.21.240.31:45918 [Receiving block 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272500939]): Exception 
for BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578
java.nio.channels.ClosedByInterruptException
    at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
    at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.adjustCrcChannelPosition(FsDatasetImpl.java:1716)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.adjustCrcFilePosition(BlockReceiver.java:1052)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:733)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:923)
    at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:854)
    at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
    at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
    at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
    at java.lang.Thread.run(Thread.java:748)

2021-01-25 03:42:34,223 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at 
/172.21.240.31:45918 [Receiving block 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272500939]): 
opWriteBlock BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578 
received exception java.nio.channels.ClosedByInterruptException
2021-01-25 03:42:34,223 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode 
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at 
/172.21.240.31:45918 [Receiving block 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272500939]): 
ip-172-21-226-26.ec2.internal:50010:DataXceiver error processing WRITE_BLOCK 
operation  src: /172.21.240.31:45918 dst: /172.21.226.26:50010
java.nio.channels.ClosedByInterruptException
    at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
    at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.adjustCrcChannelPosition(FsDatasetImpl.java:1716)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.adjustCrcFilePosition(BlockReceiver.java:1052)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:733)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:923)
    at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:854)
    at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
    at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
    at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
    at java.lang.Thread.run(Thread.java:748){noformat}
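The {{ClosedByInterruptException}} itself is standard JDK NIO behavior rather than anything HDFS-specific: when a thread is interrupted while using an interruptible {{FileChannel}}, the channel is closed and the in-flight call throws. A minimal standalone demonstration of that behavior (unrelated to HDFS code):
{code:java}
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Minimal demonstration of the JDK behavior behind step 4: interrupting a thread that is
// inside an interruptible FileChannel operation closes the channel and throws
// ClosedByInterruptException (the HDFS stack hits this via FileChannelImpl.position()).
public class InterruptDemo {
    public static void main(String[] args) throws Exception {
        Thread writer = new Thread(() -> {
            try (FileChannel ch = new RandomAccessFile("demo.tmp", "rw").getChannel()) {
                ByteBuffer buf = ByteBuffer.allocate(1024);
                while (true) {
                    buf.clear();
                    ch.write(buf);          // throws ClosedByInterruptException once interrupted
                    ch.position(ch.size()); // position() fails the same way, as in adjustCrcChannelPosition
                }
            } catch (Exception e) {
                System.err.println("writer stopped: " + e); // ClosedByInterruptException expected
            }
        });
        writer.start();
        Thread.sleep(100);
        writer.interrupt(); // analogous to the data node stopping the old block writer
        writer.join();
    }
}
{code}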
 

5. This is followed by the data node resetting {{bytesOnDisk}} to {{3072}}, the 
correct length, which is expected behavior per 
https://issues.apache.org/jira/browse/hdfs-11472:
{noformat}
2021-01-25 03:42:40,007 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl 
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at 
/172.21.240.31:60430 [Receiving block 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): At 
ip-172-21-226-26.ec2.internal:50010, Recovering ReplicaBeingWritten, 
blk_1342440165_272630578, RBW
  getNumBytes()     = 3072
  getBytesOnDisk()  = 2678
  getVisibleLength()= 3072
  getVolume()       = /mnt/hdfs/current
  getBlockFile()    = 
/mnt/hdfs/current/BP-908477295-172.21.224.178-1606768078949/current/rbw/blk_1342440165
  bytesAcked=3072
  bytesOnDisk=2678
2021-01-25 03:42:40,011 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(DataXceiver for client 
DFSClient_attempt_16089610225137_1287341_r_000139_0_392781198_30 at 
/172.21.244.117:34354 [Receiving block 
BP-908477295-172.21.224.178-1606768078949:blk_1342762434_272648220]): Receiving 
BP-908477295-172.21.224.178-1606768078949:blk_1342762434_272648220 src: 
/172.21.244.117:34354 dest: /172.21.226.26:50010

2021-01-25 03:42:40,013 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl 
(DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at 
/172.21.240.31:60430 [Receiving block 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): Resetting 
bytesOnDisk to match blockDataLength (=3072) for replica ReplicaBeingWritten, 
blk_1342440165_272630578, RBW
  getNumBytes()     = 3072
  getBytesOnDisk()  = 2678
  getVisibleLength()= 3072
  getVolume()       = /mnt/hdfs/current
  getBlockFile()    = 
/mnt/hdfs/current/BP-908477295-172.21.224.178-1606768078949/current/rbw/blk_1342440165
  bytesAcked=3072
  bytesOnDisk=2678{noformat}
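The "Resetting bytesOnDisk to match blockDataLength" message is the expected HDFS-11472 behavior: when the in-memory {{bytesOnDisk}} (2678) disagrees with what is physically in the block file (3072), recovery trusts the on-disk length. A simplified sketch of that reconciliation (names simplified, not the actual {{FsDatasetImpl}} code):
{code:java}
import java.io.File;

// Simplified sketch of the HDFS-11472 reconciliation seen above; names are simplified and
// this is not the actual FsDatasetImpl code.
class ReplicaLengthFixSketch {
    static long reconcileBytesOnDisk(long bytesOnDisk, File blockFile) {
        long blockDataLength = blockFile.length(); // 3072 for blk_1342440165 above
        if (bytesOnDisk != blockDataLength) {
            // "Resetting bytesOnDisk to match blockDataLength (=3072) for replica ..."
            bytesOnDisk = blockDataLength;
        }
        return bytesOnDisk;
    }
}
{code}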
6. Since recovery failed, the Name Node restarts {{updatePipeline}} for the 
block by bumping the generation stamp:
{noformat}
2021-01-25 03:42:40,034 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 15 on 
8020): updatePipeline(blk_1342440165_272630578, newGS=272645069, 
newLength=3072, newNodes=[172.21.226.26:50010], 
client=DFSClient_NONMAPREDUCE_390070548_42)

2021-01-25 03:42:40,034 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 15 on 
8020): updatePipeline(blk_1342440165_272630578 => blk_1342440165_272645069) 
success{noformat}
Notice that the correct *{{newLength}}*, i.e. *{{3072}}*, is sent in the Name Node 
log above.

7. The data node then correctly receives the new generation of this block:
{noformat}
2021-01-25 03:46:40,066 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace (PacketResponder: 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069, 
type=LAST_IN_PIPELINE): src: /172.21.240.31:60430, dest: /172.21.226.26:50010, 
bytes: 3554, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_390070548_42, 
offset: 0, srvID: 1a2ff556-73f5-4e10-b018-a0c2105b242e, blockid: 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069, duration: 
240052941459

2021-01-25 03:46:40,066 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(PacketResponder: 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069, 
type=LAST_IN_PIPELINE): PacketResponder: 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069, 
type=LAST_IN_PIPELINE terminating{noformat}
But notice that the reported length here, *bytes: 3554*, differs from what the 
Name Node sent, i.e. *3072*.

8. The data node then successfully replicates this block downstream to another 
node:
{noformat}
2021-01-25 03:46:43,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(BP-908477295-172.21.224.178-1606768078949 heartbeating to 
/172.21.238.225:8020): DatanodeRegistration(172.21.226.26:50010, 
datanodeUuid=1a2ff556-73f5-4e10-b018-a0c2105b242e, infoPort=50075, 
infoSecurePort=0, ipcPort=50020, 
storageInfo=lv=-57;cid=CID-7608840c-3634-4377-9ba0-1c710e6d08d5;nsid=2011779031;c=1606768078949)
 Starting thread to transfer 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069 to 
172.21.234.181:50010 

2021-01-25 03:46:43,094 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer@5cb29d58): 
DataTransfer, at ip-172-21-226-26.ec2.internal:50010: Transmitted 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069 
(numBytes=3554) to /172.21.234.181:5001{noformat}
9. However, immediately after the successful transmission and replication, the 
name node declares the block as corrupt:
{noformat}
2021-01-25 03:46:43,106 INFO org.apache.hadoop.hdfs.StateChange (IPC Server 
handler 46 on 8020): *DIR* reportBadBlocks for block: 
BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069 on datanode: 
172.21.226.26:50010{noformat}
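Putting steps 6 through 9 together, the suspicion is a length disagreement: the Name Node committed the block with {{newLength=3072}}, while the replica the data node finalized and transferred reports {{numBytes=3554}}. The toy check below only illustrates that suspected inconsistency (it is not the actual NameNode/reportBadBlocks code), but it shows why a replica whose reported length differs from the committed length would end up flagged as bad:
{code:java}
// Toy illustration of the suspected inconsistency, not actual NameNode code. The block was
// committed at 3072 bytes (step 6), but the finalized replica reports 3554 bytes (steps 7-8),
// so it is reported/marked bad (step 9).
public class LengthMismatchSketch {
    static boolean lengthMismatch(long committedLength, long reportedReplicaLength) {
        return reportedReplicaLength != committedLength;
    }

    public static void main(String[] args) {
        long committed = 3072L; // newLength sent in updatePipeline (step 6)
        long reported  = 3554L; // bytes written and transferred by the data node (steps 7-8)
        if (lengthMismatch(committed, reported)) {
            // corresponds to the "*DIR* reportBadBlocks for block: ..." line in step 9
            System.out.println("replica length mismatch: " + reported + " != " + committed);
        }
    }
}
{code}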
> Incorrect bytes causing block corruption after update pipeline and recovery 
> failure
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-15837
>                 URL: https://issues.apache.org/jira/browse/HDFS-15837
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs
>    Affects Versions: 2.8.5
>            Reporter: Udit Mehrotra
>            Priority: Major
>


