YCozy created HDFS-15414:
----------------------------
Summary: java.net.SocketException: Original Exception :
java.io.IOException: Broken pipe
Key: HDFS-15414
URL: https://issues.apache.org/jira/browse/HDFS-15414
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 2.10.0
Reporter: YCozy
We observed this exception in a DataNode's log even though we were not
shutting down any nodes in the cluster. Specifically, we have a cluster with
3 DataNodes (DN1, DN2, DN3) and 2 NameNodes (NN1, NN2). At some point, this
exception appears in DN3's log:
{noformat}
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(127.0.0.1:9666,
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664,
infoSecurePort=0, ipcPort=9667,
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007)
Starting thread to transfer
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(127.0.0.1:9666,
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664,
infoSecurePort=0, ipcPort=9667,
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007)
Starting thread to transfer
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,381 INFO
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker:
Scheduling a check for /app/dn3/current
2020-06-08 21:53:03,383 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(127.0.0.1:9666,
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664,
infoSecurePort=0, ipcPort=9667,
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007):Failed
to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to
127.0.0.1:9766 got
java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
        at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:620)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:804)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:751)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2469)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
... 11 more{noformat}
Port 9766 is DN2's address.
Around the same time, we observe the following exceptions in DN2's log:
{noformat}
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 src:
/127.0.0.1:47618 dest: /127.0.0.1:9766
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 received
exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException:
Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already
exists in state FINALIZED and thus cannot be created.
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation src:
/127.0.0.1:47618 dst: /127.0.0.1:9766;
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in
state FINALIZED and thus cannot be created.{noformat}
However, this exception does not look like the cause of the broken pipe:
DN2 has an earlier occurrence of ReplicaAlreadyExistsException as well, yet
DN3 has only one occurrence of the broken pipe. Here's the earlier
occurrence of ReplicaAlreadyExistsException on DN2:
{noformat}
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 src:
/127.0.0.1:47462 dest: /127.0.0.1:9766
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 received
exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException:
Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already
exists in state FINALIZED and thus cannot be created.
2020-06-08 21:52:54,448 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation src:
/127.0.0.1:47462 dst: /127.0.0.1:9766;
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already exists in
state FINALIZED and thus cannot be created.{noformat}
So we think there is a bug causing the broken pipe.
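For context on what the sender-side stack trace means: BlockSender pushes
block data to the peer DataNode over a TCP socket via FileChannel.transferTo,
so if the receiving side closes the connection while data is still being
written, the sender fails with java.io.IOException: Broken pipe. Below is a
minimal, self-contained Java sketch (illustration only, not Hadoop code; the
class name, temp-file name, and 16 MB payload size are made up) showing how a
peer that closes its socket early surfaces as a broken pipe on the transferTo
side:
{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustration only (not Hadoop code): the peer closes the connection while
// the sender is still streaming a file with FileChannel.transferTo, so the
// sender fails with an IOException such as "Broken pipe".
public class BrokenPipeDemo {
  public static void main(String[] args) throws Exception {
    ServerSocket server = new ServerSocket(0);

    // "Receiver": accept one connection and close it immediately without
    // reading, roughly what happens when the receiving side rejects the
    // write and tears down the connection.
    Thread receiver = new Thread(() -> {
      try (Socket s = server.accept()) {
        // close without reading
      } catch (IOException ignored) {
      }
    });
    receiver.start();

    // "Sender": stream a file to the receiver via transferTo, analogous to
    // BlockSender.sendPacket -> SocketOutputStream.transferToFully.
    Path tmp = Files.createTempFile("fake-block", ".dat");
    Files.write(tmp, new byte[16 * 1024 * 1024]); // 16 MB payload (arbitrary)

    try (SocketChannel sock = SocketChannel.open(
             new InetSocketAddress("127.0.0.1", server.getLocalPort()));
         FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
      long pos = 0, size = ch.size();
      while (pos < size) {
        // Once the receiver has closed its end, this throws
        // java.io.IOException: Broken pipe (or "Connection reset by peer",
        // depending on timing and platform).
        pos += ch.transferTo(pos, size - pos, sock);
      }
      System.out.println("transfer completed (peer closed too late)");
    } catch (IOException e) {
      System.out.println("sender failed as expected: " + e);
    } finally {
      Files.deleteIfExists(tmp);
      server.close();
      receiver.join();
    }
  }
}
{code}
This only illustrates the failure mode on the sending side; it does not
explain why DN2 dropped the connection in the first place.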