[ https://issues.apache.org/jira/browse/HDFS-15709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244946#comment-17244946 ]

Yushi Hayasaka commented on HDFS-15709:
---------------------------------------

[~weichiu] Thanks for taking a look!
We found this issue while recording the checksums of all EC files to detect EC file 
corruption (we had observed a situation similar to 
https://issues.apache.org/jira/browse/HDFS-15240 and needed to verify whether it 
still happens on a Hadoop build with that patch applied).
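For context, a minimal sketch of how we collect those checksums is below. The directory path is a placeholder, and FileSystem#getFileChecksum is the client call that drives the DataNode-side BlockChecksumHelper seen in the stack trace further down.
{code:java}
// Minimal sketch: walk a directory of EC files and record each file checksum.
// The path "/ec/data" is a placeholder used only for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecordEcChecksums {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      for (FileStatus st : fs.listStatus(new Path("/ec/data"))) {
        // For striped (EC) files this triggers the block-group checksum
        // computation on the DataNodes (DataXceiver#blockGroupChecksum).
        FileChecksum checksum = fs.getFileChecksum(st.getPath());
        System.out.println(st.getPath() + "\t" + checksum);
      }
    }
  }
}
{code}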
Therefore, we suspect a problem in the checksum calculation. Also, while recording 
the checksums we sometimes observe errors like the one below on the DataNode:
{noformat}
2020-11-16 23:26:24,908 WARN org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper: Exception while reading checksum
java.net.SocketTimeoutException: 3000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/dn1:46800 remote=/dn1:1019]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:547)
        at org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.checksumBlock(BlockChecksumHelper.java:627)
        at org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.compute(BlockChecksumHelper.java:492)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.blockGroupChecksum(DataXceiver.java:1030)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opStripedBlockChecksum(Receiver.java:327)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:119)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
So we suspect the problematic code is in 
BlockGroupNonStripedChecksumComputer.recalculateChecksum, since that is the path 
taken to handle the exception.
 Then we found that there was no code path that closes the StripedReader in 
StripedBlockChecksumReconstructor. After changing it to close the reader, the leak 
seems to be gone, so we believe that is the cause; a rough sketch of the idea is below.
 In short, it was found by reading the sources manually ^^;
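To illustrate the fix idea, here is a minimal sketch. The class, interface, and method names are illustrative assumptions rather than the actual Hadoop code; the point is only that the checksum reconstructor must close its striped reader when the recomputation finishes, so the sockets opened to the peer DataNodes are released.
{code:java}
// Illustrative sketch only: names are assumptions, not the real HDFS classes.
// The idea: the checksum reconstructor releases its striped reader (and thus
// the peer DataNode sockets) once it is done recomputing the checksum.
class StripedBlockChecksumReconstructorSketch {

  // Stand-in for a striped reader that holds sockets to peer DataNodes.
  interface StripedReaderLike extends java.io.Closeable {
    byte[] readAndDecode() throws java.io.IOException;
  }

  private final StripedReaderLike reader;

  StripedBlockChecksumReconstructorSketch(StripedReaderLike reader) {
    this.reader = reader;
  }

  byte[] recomputeChecksum() throws java.io.IOException {
    try {
      // Read from the live DataNodes, decode the missing block, and return
      // the bytes that feed the checksum computation.
      return reader.readAndDecode();
    } finally {
      // The missing piece in the original code path: without closing the
      // reader, its sockets stay in CLOSE_WAIT and the descriptors leak.
      reader.close();
    }
  }
}
{code}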

> Socket file descriptor leak in StripedBlockChecksumReconstructor
> ----------------------------------------------------------------
>
>                 Key: HDFS-15709
>                 URL: https://issues.apache.org/jira/browse/HDFS-15709
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, ec, erasure-coding
>            Reporter: Yushi Hayasaka
>            Assignee: Yushi Hayasaka
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We found a socket file descriptor leak when we tried to get the checksum of an 
> EC file while reconstruction happened during the operation.
> The cause of the leak appears to be that StripedBlockChecksumReconstructor does 
> not close its StripedReader. Once the reader is closed, the CLOSE_WAIT 
> connections are gone.


