[ https://issues.apache.org/jira/browse/HDFS-15709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244946#comment-17244946 ]
Yushi Hayasaka commented on HDFS-15709:
---------------------------------------

[~weichiu] Thanks for taking a look! We found this issue while recording the checksums of all EC files to detect EC file corruption (we observed a situation similar to https://issues.apache.org/jira/browse/HDFS-15240 and needed to check whether it still happens with the patch applied), so we suspected a problem in the checksum calculation. While recording checksums, we also sometimes observe errors like the one below on the DataNode:

{noformat}
2020-11-16 23:26:24,908 WARN org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper: Exception while reading checksum
java.net.SocketTimeoutException: 3000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/dn1:46800 remote=/dn1:1019]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:547)
	at org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.checksumBlock(BlockChecksumHelper.java:627)
	at org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper$BlockGroupNonStripedChecksumComputer.compute(BlockChecksumHelper.java:492)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.blockGroupChecksum(DataXceiver.java:1030)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opStripedBlockChecksum(Receiver.java:327)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:119)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
	at java.lang.Thread.run(Thread.java:748)
{noformat}

This pointed us at BlockGroupNonStripedChecksumComputer.recalculateChecksum, since that is the path taken to handle the exception. We then found that StripedBlockChecksumReconstructor never closes its StripedReader. After changing the code to close the reader, the leak seems to be gone, so we believe that is the cause. In short, it was found by reading the sources manually ^^;

> Socket file descriptor leak in StripedBlockChecksumReconstructor
> ----------------------------------------------------------------
>
>                 Key: HDFS-15709
>                 URL: https://issues.apache.org/jira/browse/HDFS-15709
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, ec, erasure-coding
>            Reporter: Yushi Hayasaka
>            Assignee: Yushi Hayasaka
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We found a socket file descriptor leak when we tried to get the checksum of
> an EC file with reconstruction happening during the operation.
> The cause of the leak seems to be that StripedBlockChecksumReconstructor does
> not close its StripedReader. With the reader closed, the CLOSE_WAIT connections
> are gone.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
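For illustration, a minimal Java sketch of the leak pattern described in the comment. The class and method names below (FakeStripedReader, ChecksumReconstructorSketch) are hypothetical stand-ins, not the actual Hadoop code: the point is that a reader wrapping sockets must be closed on every exit path, including the exception path, or its connections linger in CLOSE_WAIT.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for StripedReader: it owns sockets that must be released.
class FakeStripedReader implements Closeable {
    boolean closed = false;

    byte[] readChunk() throws IOException {
        // Mimics the SocketTimeoutException seen in the DataNode log.
        throw new IOException("simulated read timeout");
    }

    @Override
    public void close() {
        closed = true; // the real reader would close its peer sockets here
    }
}

// Hypothetical reconstructor illustrating the buggy and the fixed shape.
class ChecksumReconstructorSketch {

    // Buggy shape: the exception is handled, but the reader is never closed,
    // so its file descriptors leak.
    static FakeStripedReader reconstructLeaky() {
        FakeStripedReader reader = new FakeStripedReader();
        try {
            reader.readChunk();
        } catch (IOException ignored) {
            // swallowed; reader (and its sockets) left open
        }
        return reader;
    }

    // Fixed shape: try-with-resources guarantees close() runs even when
    // readChunk() throws.
    static FakeStripedReader reconstructFixed() {
        FakeStripedReader reader = new FakeStripedReader();
        try (FakeStripedReader r = reader) {
            r.readChunk();
        } catch (IOException ignored) {
            // swallowed; reader already closed by try-with-resources
        }
        return reader;
    }

    public static void main(String[] args) {
        System.out.println("leaky reader closed: " + reconstructLeaky().closed);
        System.out.println("fixed reader closed: " + reconstructFixed().closed);
    }
}
```

The same effect can be had with an explicit close() in a finally block; either way, closing the reader unconditionally is what makes the CLOSE_WAIT sockets disappear.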