[jira] [Commented] (HDFS-15837) Incorrect bytes causing block corruption after update pipeline and recovery failure
[ https://issues.apache.org/jira/browse/HDFS-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285488#comment-17285488 ]

Udit Mehrotra commented on HDFS-15837:
--------------------------------------

[~kihwal] The Hadoop version is *2.8.5*, running on EMR. So it seems the block was successfully written with size *3554* and was ultimately reported as a bad block. Can you point to some of the probable causes for this that you have seen? Could this be due to high network load on the particular data node? Also, is there something we can do to avoid a recurrence? At this time we are investigating a strategy to distribute the network load so that no single data node is overwhelmed, but we still want to understand the root cause.

> Incorrect bytes causing block corruption after update pipeline and recovery
> failure
> ----------------------------------------------------------------------------
>
>                 Key: HDFS-15837
>                 URL: https://issues.apache.org/jira/browse/HDFS-15837
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs
>    Affects Versions: 2.8.5
>            Reporter: Udit Mehrotra
>            Priority: Major
>
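On the recurrence question: the 65000 ms figure in the PacketResponder timeout (see the description in the [Updated] message below) is consistent with the default dfs.client.socket-timeout of 60 s plus the 5 s read-timeout extension the datanode adds per downstream node in the pipeline; that derivation is an inference from defaults, not something confirmed from this cluster's configuration. A minimal sketch of the knobs involved follows. The key names are standard HDFS configuration keys, but the values are illustrative, they would normally be set in hdfs-site.xml on the datanodes and clients rather than in code, and raising them only buys headroom under heavy network load rather than addressing the corruption itself.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class PipelineTimeoutKnobs {
  public static Configuration relaxedPipelineTimeouts() {
    Configuration conf = new Configuration();

    // Read timeout used while waiting for data/acks on pipeline sockets
    // (default 60000 ms); the datanode adds roughly 5000 ms per downstream
    // node, which is plausibly where the 65000 ms in the stack trace comes from.
    conf.setInt("dfs.client.socket-timeout", 120000);

    // Write timeout towards downstream datanodes (default 480000 ms).
    conf.setInt("dfs.datanode.socket.write.timeout", 600000);

    // Keep replacing failed datanodes in the pipeline instead of continuing
    // with fewer replicas (this is already the default behaviour).
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);

    return conf;
  }
}
{code}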
[jira] [Updated] (HDFS-15837) Incorrect bytes causing block corruption after update pipeline and recovery failure
[ https://issues.apache.org/jira/browse/HDFS-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Udit Mehrotra updated HDFS-15837:
---------------------------------
    Description: 

We are seeing cases of HDFS blocks being marked as *bad* after the initial block receive fails during *update pipeline* and the subsequent *HDFS recovery* for the block fails as well. Here is the life cycle of a block *{{blk_1342440165_272630578}}* that was ultimately marked as corrupt:

1. The block creation starts at the name node as part of the *update pipeline* process:
{noformat}
2021-01-25 03:41:17,335 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 61 on 8020): updatePipeline(blk_1342440165_272500939 => blk_1342440165_272630578) success
{noformat}

2. The block receiver on the data node fails with a socket timeout exception, and so do the retries:
{noformat}
2021-01-25 03:42:22,525 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.226.26:56294 remote=/172.21.246.239:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:400)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1305)
        at java.lang.Thread.run(Thread.java:748)
2021-01-25 03:42:22,526 WARN org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): IOException in BlockReceiver.run():
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1552)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1489)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1402)
        at java.lang.Thread.run(Thread.java:748)
2021-01-25 03:42:22,526 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117
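For readers less familiar with the write path, here is a minimal, hedged sketch (public HDFS client API only; the path name and data are purely illustrative) of the writer-side pattern that drives the code path in steps 1 and 2. The PacketResponder threads in the stack traces are acking packets produced by a write like this, and when the pipeline breaks the client drops the bad datanode, bumps the block's generation stamp and calls updatePipeline on the namenode; the blk_1342440165_272500939 => blk_1342440165_272630578 message in step 1 is the namenode side of exactly that kind of call. Step 2 is the upstream datanode timing out while waiting for such an ack from its downstream peer.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelineWriteSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/tmp/hdfs-15837-illustration");  // hypothetical path

    try (FSDataOutputStream out = fs.create(path, true)) {
      byte[] chunk = new byte[512];
      for (int i = 0; i < 100; i++) {
        // Each write is packetized by the client and streamed through the
        // datanode pipeline; the PacketResponder threads in the stack traces
        // above forward acks for these packets back upstream.
        out.write(chunk);

        // hflush() blocks until the datanodes in the current pipeline have
        // acked the outstanding packets. If an ack never arrives, the client
        // removes the bad datanode and renegotiates the pipeline with a new
        // generation stamp via updatePipeline on the namenode.
        out.hflush();
      }
    }
  }
}
{code}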
[jira] [Created] (HDFS-15837) Incorrect bytes causing block corruption after update pipeline and recovery failure
Udit Mehrotra created HDFS-15837:
------------------------------------

             Summary: Incorrect bytes causing block corruption after update pipeline and recovery failure
                 Key: HDFS-15837
                 URL: https://issues.apache.org/jira/browse/HDFS-15837
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode, hdfs
    Affects Versions: 2.8.5
            Reporter: Udit Mehrotra


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org