[jira] [Commented] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

JiangHua Zhu (Jira) Fri, 06 May 2022 04:46:05 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17532808#comment-17532808
 ]


JiangHua Zhu commented on HDFS-16565:
-------------------------------------

Thanks [~hexiaoqiao] for the comment.
This phenomenon occurs on all DataNodes in our online clusters. We use Hadoop 
3.3.x, and these clusters are mainly used to store EC data (RS 6x3). Here is 
what we see on one of the DataNodes:
jsvc 198492 hdfs *100u IPv4 2393306999 0t0 TCP hadoop-ec482.xxx.org:45344->hadoop-ec505.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *117u IPv4 2541480174 0t0 TCP hadoop-ec482.xxx:53954->hadoop-ec495.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *123u IPv4 2542535148 0t0 TCP hadoop-ec482.xxx:39860->hadoop-ec564.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *125u IPv4 2543324650 0t0 TCP hadoop-ec482.xxx:42518->hadoop-ec490.xxx:1004 (CLOSE_WAIT)
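
To see whether the stuck connections cluster around particular peers, the lsof 
lines above can be grouped by remote host. The following is only a sketch: the 
class and method names (CloseWaitByPeer, countByPeer) are hypothetical, and it 
assumes each lsof entry is on a single line with the endpoint pair written as 
local->remote.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CloseWaitByPeer {

    // Counts CLOSE_WAIT entries per remote host in `lsof -i tcp` style lines.
    // Assumption: one entry per line, endpoints written as "local:port->remote:port".
    static Map<String, Integer> countByPeer(List<String> lsofLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lsofLines) {
            if (!line.contains("(CLOSE_WAIT)")) {
                continue; // skip the header and connections in other states
            }
            int arrow = line.indexOf("->");
            if (arrow < 0) {
                continue; // not a connected socket entry
            }
            // The remote endpoint runs from after "->" to the next whitespace.
            String remote = line.substring(arrow + 2).trim().split("\\s+")[0];
            int colon = remote.lastIndexOf(':');
            String host = colon > 0 ? remote.substring(0, colon) : remote;
            counts.merge(host, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample input copied from the lsof output in this comment.
        List<String> sample = List.of(
            "jsvc 198492 hdfs *100u IPv4 2393306999 0t0 TCP hadoop-ec482.xxx.org:45344->hadoop-ec505.xxx:1004 (CLOSE_WAIT)",
            "jsvc 198492 hdfs *117u IPv4 2541480174 0t0 TCP hadoop-ec482.xxx:53954->hadoop-ec495.xxx:1004 (CLOSE_WAIT)",
            "jsvc 198492 hdfs *123u IPv4 2542535148 0t0 TCP hadoop-ec482.xxx:39860->hadoop-ec564.xxx:1004 (CLOSE_WAIT)",
            "jsvc 198492 hdfs *125u IPv4 2543324650 0t0 TCP hadoop-ec482.xxx:42518->hadoop-ec490.xxx:1004 (CLOSE_WAIT)");
        countByPeer(sample).forEach((host, n) -> System.out.println(host + " " + n));
    }
}
```

If the counts are spread evenly over many peers, that would point at a code 
path on the local DataNode rather than at one misbehaving remote node.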

Here, hadoop-ec482.xxx is the local DataNode. When it connects to other nodes 
it uses a random local port, but the connection then stays in CLOSE_WAIT for a 
long time and is not released. CLOSE_WAIT means the remote side has already 
closed the connection, so my guess is that the problem is on nodes like 
hadoop-ec482.xxx, which are not closing the stream or socket properly.
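
The general pattern that avoids lingering CLOSE_WAIT sockets on the client side 
can be sketched as follows. This is not the DataNode code and not a proposed 
patch; the class and method names (CloseAfterEof, readUntilEofAndClose, 
runLoopbackDemo) are hypothetical, and the peer is a throwaway loopback server 
used only to make the example self-contained.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class CloseAfterEof {

    // Connects as a client, reads until the peer closes, then closes the socket.
    // Returns the number of bytes read before end of stream.
    static int readUntilEofAndClose(String host, int port) throws IOException {
        int total = 0;
        try (Socket socket = new Socket(host, port);
             InputStream in = socket.getInputStream()) {
            byte[] buf = new byte[4096];
            int n;
            // read() returns -1 once the peer has closed its side; from that
            // point the local socket sits in CLOSE_WAIT until we close it.
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        } // try-with-resources calls close() here, also when an exception is thrown
        return total;
    }

    // Starts a loopback peer that sends 3 bytes and closes, then runs the client.
    static int runLoopbackDemo() throws Exception {
        try (ServerSocket server =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread peer = new Thread(() -> {
                try (Socket s = server.accept()) {
                    s.getOutputStream().write(new byte[] {1, 2, 3});
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            peer.start();
            int n = readUntilEofAndClose(
                server.getInetAddress().getHostAddress(), server.getLocalPort());
            peer.join();
            return n;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("bytes read before EOF: " + runLoopbackDemo());
    }
}
```

If a code path skips the close() after the remote node has finished, for 
example by returning early on an error, the socket would stay in CLOSE_WAIT 
until the file descriptor is released, which matches what lsof shows.
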
Our cluster is used in 3 ways:
1. The HDFS Client API is used to store EC data.
2. Data is copied or transferred when a DataNode is decommissioned, or when 
the balancer runs.
3. A small amount of replicated (multi-copy) data is also stored.
I'm still investigating the exact cause of what's happening here.
[~hexiaoqiao], do you have any suggestions?
Thank you very much.

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-16565
>                 URL: https://issues.apache.org/jira/browse/HDFS-16565
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.3.0
>         Environment: CentOS Linux release 7.5.1804 (Core)
>            Reporter: JiangHua Zhu
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The number of connections in CLOSE_WAIT state has reached more than 17k and 
> is still growing. Viewing these CLOSE_WAIT connections with the lsof command 
> shows the following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the reason for this phenomenon is that Socket#close() is 
> not called correctly on connections where the DataNode acts as a client to 
> other nodes.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
