[ https://issues.apache.org/jira/browse/HDFS-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

YCozy updated HDFS-15414:
-------------------------
    Description: 
We observed this exception in a DataNode's log even though we were not shutting down 
any nodes in the cluster. Specifically, the cluster has three DataNodes 
(DN1, DN2, DN3) and two NameNodes (NN1, NN2). At some point, the following exception 
appears in DN3's log:
{noformat}
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007) Starting thread to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007) Starting thread to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,381 INFO org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for /app/dn3/current
2020-06-08 21:53:03,383 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007):Failed to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766 got
java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
    at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:280)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:620)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:804)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:751)
    at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2469)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
    ... 11 more{noformat}
Port 9766 is DN2's address. 
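A broken pipe on the sending side generally means the receiver closed the connection while the sender was still writing. The following minimal Python sketch (not HDFS code; a hypothetical stand-in for the DN3-to-DN2 transfer) shows how a receiver that closes its end immediately, much as DN2 does after rejecting the block, surfaces on the sender as an OSError such as BrokenPipeError or ConnectionResetError:

```python
import socket
import threading

# Receiver that closes the connection immediately, standing in for DN2
# rejecting the incoming block and tearing down the connection.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

def receiver():
    conn, _ = server.accept()
    conn.close()  # abrupt close while the sender is still writing

threading.Thread(target=receiver).start()

# Sender that keeps streaming data, standing in for DN3's DataTransfer
# thread pushing block packets over the socket.
sender = socket.create_connection(server.getsockname())
err = None
try:
    for _ in range(10_000):          # bounded loop; the error arrives quickly
        sender.sendall(b"\x00" * 65536)
except OSError as e:                 # BrokenPipeError or ConnectionResetError
    err = e
finally:
    sender.close()
    server.close()

print(type(err).__name__)
```

Which concrete subclass is raised depends on timing and platform, but either way the sender fails only once the kernel learns the peer has gone away, which matches DN3 logging a single failure mid-transfer.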

Around the same time, we observed the following exceptions in DN2's log:
{noformat}
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 src: /127.0.0.1:47618 dest: /127.0.0.1:9766
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in state FINALIZED and thus cannot be created.
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation  src: /127.0.0.1:47618 dst: /127.0.0.1:9766; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in state FINALIZED and thus cannot be created.{noformat}
However, this exception does not appear to be the cause of the broken pipe: DN2 
logged an earlier ReplicaAlreadyExistsException as well, yet DN3 logged only one 
broken-pipe error. Here is the earlier occurrence of 
ReplicaAlreadyExistsException on DN2:
{noformat}
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 src: /127.0.0.1:47462 dest: /127.0.0.1:9766
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already exists in state FINALIZED and thus cannot be created.
2020-06-08 21:52:54,448 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation  src: /127.0.0.1:47462 dst: /127.0.0.1:9766; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already exists in state FINALIZED and thus cannot be created.{noformat}
We therefore suspect that a bug is causing the broken pipe.
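The reasoning above, that the counts of the two symptoms do not line up one-to-one, can be checked mechanically by correlating timestamps across the two logs. This is a hypothetical sketch (the parsing pattern and the abbreviated sample records are illustrative, not output of any real tool):

```python
import re
from datetime import datetime

# Leading timestamp of a Hadoop log record, e.g. "2020-06-08 21:53:03,383".
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

def event_times(lines, needle):
    """Return timestamps of log records that mention `needle`."""
    out = []
    for line in lines:
        m = TS.match(line)
        if m and needle in line:
            out.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return out

# Abbreviated stand-ins for the DN3 and DN2 records quoted above.
dn3_log = [
    "2020-06-08 21:53:03,383 WARN DataNode: Failed to transfer blk_1073741825_1002 got Broken pipe",
]
dn2_log = [
    "2020-06-08 21:52:54,438 INFO DataNode: opWriteBlock blk_1073741825_1001 received exception ReplicaAlreadyExistsException",
    "2020-06-08 21:53:03,379 INFO DataNode: opWriteBlock blk_1073741825_1002 received exception ReplicaAlreadyExistsException",
]

pipes = event_times(dn3_log, "Broken pipe")
replicas = event_times(dn2_log, "ReplicaAlreadyExistsException")

# One broken pipe on DN3 versus two ReplicaAlreadyExistsExceptions on DN2:
# the counts alone argue against a one-to-one causal link.
print(len(pipes), len(replicas))
```

Run against the full DataNode logs, this kind of correlation would show whether every ReplicaAlreadyExistsException on DN2 is paired with a transfer failure on DN3, which is the question the report raises.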

  was: (previous revision of the description, identical except that it read "this exception does look like the cause of the broken pipe" where the current text reads "doesn't look like")


> java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
> -------------------------------------------------------------------------------
>
>                 Key: HDFS-15414
>                 URL: https://issues.apache.org/jira/browse/HDFS-15414
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.10.0
>            Reporter: YCozy
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
