[ https://issues.apache.org/jira/browse/HDFS-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
YCozy updated HDFS-15414:
-------------------------
Description:

We observed the following exception in a DataNode's log even though no nodes in the cluster were being shut down. The cluster has 3 DataNodes (DN1, DN2, DN3) and 2 NameNodes (NN1, NN2). At some point, this exception appears in DN3's log:

{noformat}
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007) Starting thread to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007) Starting thread to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,381 INFO org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for /app/dn3/current
2020-06-08 21:53:03,383 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007):Failed to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766 got
java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
    at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:280)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:620)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:804)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:751)
    at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2469)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
    ... 11 more
{noformat}

Port 9766 is DN2's address.
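For context on the failure mode itself: a "Broken pipe" from FileChannel.transferTo() means the receiving side closed its end of the TCP connection while the sender was still pushing data. The self-contained sketch below (plain JDK, hypothetical names, not Hadoop code) reproduces the same low-level symptom as the stack trace above: a receiver that accepts a connection and closes it without reading, and a sender that keeps calling transferTo() until the kernel reports the dead connection.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical demo, not Hadoop code: peer closes early => transferTo fails. */
public class BrokenPipeDemo {
    public static void main(String[] args) throws Exception {
        // "Receiver": accepts one connection and closes it without reading,
        // standing in for a DataNode that aborts an incoming transfer.
        ServerSocketChannel server =
                ServerSocketChannel.open().bind(new InetSocketAddress("127.0.0.1", 0));
        int port = ((InetSocketAddress) server.getLocalAddress()).getPort();
        Thread receiver = new Thread(() -> {
            try (SocketChannel peer = server.accept()) {
                // Close immediately; unread data makes the kernel send RST.
            } catch (IOException ignored) {
            }
        });
        receiver.start();

        // A temp file standing in for the block replica being transferred.
        Path block = Files.createTempFile("blk_demo", ".bin");
        Files.write(block, new byte[8 * 1024 * 1024]);

        try (SocketChannel out = SocketChannel.open(new InetSocketAddress("127.0.0.1", port));
             FileChannel in = FileChannel.open(block, StandardOpenOption.READ)) {
            long pos = 0, size = in.size();
            while (pos < size) {
                // Keeps writing after the peer closed; the kernel eventually
                // reports EPIPE/ECONNRESET, surfaced as an IOException.
                pos += in.transferTo(pos, size - pos, out);
            }
        } catch (IOException e) {
            // Typically "Broken pipe" or "Connection reset by peer".
            System.out.println("Expected failure: " + e);
        } finally {
            Files.deleteIfExists(block);
            receiver.join();
            server.close();
        }
    }
}
{code}

Under this mechanism, whatever made DN2 drop the connection mid-transfer would surface on DN3 exactly as the WARN above, so the question is why DN2 closed the connection.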
Around the same time, we observe the following exceptions in DN2's log:

{noformat}
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 src: /127.0.0.1:47618 dest: /127.0.0.1:9766
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in state FINALIZED and thus cannot be created.
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:47618 dst: /127.0.0.1:9766; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in state FINALIZED and thus cannot be created.
{noformat}

However, this exception does not appear to be the cause of the broken pipe: DN2 had logged an earlier ReplicaAlreadyExistsException as well, yet DN3 logged only one broken pipe. Here is the earlier occurrence of ReplicaAlreadyExistsException on DN2:

{noformat}
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 src: /127.0.0.1:47462 dest: /127.0.0.1:9766
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already exists in state FINALIZED and thus cannot be created.
2020-06-08 21:52:54,448 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:47462 dst: /127.0.0.1:9766; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already exists in state FINALIZED and thus cannot be created.
{noformat}

We therefore suspect a bug is causing the broken pipe.
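For reference, the rejection seen in DN2's log is the receiver-side replica bookkeeping at work: a block that already exists in state FINALIZED on the receiving DataNode cannot be created again by a second opWriteBlock for the same block. Below is a deliberately simplified, hypothetical model of that guard (illustration only, not the actual HDFS code):

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical, simplified model of the receiver-side replica guard. */
public class ReplicaGuardSketch {
    enum ReplicaState { TEMPORARY, FINALIZED }

    private final ConcurrentMap<String, ReplicaState> replicaMap = new ConcurrentHashMap<>();

    /** Models the existence check at the start of a block write. */
    void createReplica(String blockId) {
        ReplicaState existing = replicaMap.putIfAbsent(blockId, ReplicaState.TEMPORARY);
        if (existing != null) {
            // Mirrors the message seen in DN2's log.
            throw new IllegalStateException("Block " + blockId
                    + " already exists in state " + existing
                    + " and thus cannot be created.");
        }
    }

    void finalizeReplica(String blockId) {
        replicaMap.put(blockId, ReplicaState.FINALIZED);
    }

    public static void main(String[] args) {
        ReplicaGuardSketch dn2 = new ReplicaGuardSketch();
        dn2.createReplica("blk_1073741825_1002");    // first transfer: accepted
        dn2.finalizeReplica("blk_1073741825_1002");  // replica finalized
        try {
            dn2.createReplica("blk_1073741825_1002"); // duplicate transfer: rejected
        } catch (IllegalStateException e) {
            System.out.println("Expected rejection: " + e.getMessage());
        }
    }
}
{code}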
> java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
> -------------------------------------------------------------------------------
>
> Key: HDFS-15414
> URL: https://issues.apache.org/jira/browse/HDFS-15414
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.10.0
> Reporter: YCozy
> Priority: Major