Hi,

After running for a while, my job manager holds thousands of CLOSE_WAIT TCP
connections to HDFS datanodes. The number keeps growing slowly, so it will
likely hit the max open file limit eventually. My jobs checkpoint to HDFS every minute.
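(For reference, the per-process limit can be checked with a standard Linux command, where $JMPID is the job manager PID as below:

cat /proc/$JMPID/limits | grep "Max open files")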
If I run lsof -i -a -p $JMPID, I get tons of output like the following:
java    9433  iot  408u  IPv4 4060901898      0t0  TCP jmHost:17922->datanode:50010 (CLOSE_WAIT)
java    9433  iot  409u  IPv4 4061478455      0t0  TCP jmHost:52854->datanode:50010 (CLOSE_WAIT)
java    9433  iot  410r  IPv4 4063170767      0t0  TCP jmHost:49384->datanode:50010 (CLOSE_WAIT)
java    9433  iot  411w  IPv4 4063188376      0t0  TCP jmHost:50516->datanode:50010 (CLOSE_WAIT)
java    9433  iot  412u  IPv4 4061459881      0t0  TCP jmHost:51651->datanode:50010 (CLOSE_WAIT)
java    9433  iot  413u  IPv4 4063737603      0t0  TCP jmHost:31318->datanode:50010 (CLOSE_WAIT)
java    9433  iot  414w  IPv4 4062030625      0t0  TCP jmHost:34033->datanode:50010 (CLOSE_WAIT)
java    9433  iot  415u  IPv4 4062049134      0t0  TCP jmHost:35156->datanode:50010 (CLOSE_WAIT)
java    9433  iot  416u  IPv4 4062615550      0t0  TCP jmHost:16962->datanode:50010 (CLOSE_WAIT)
java    9433  iot  417r  IPv4 4063757056      0t0  TCP jmHost:32553->datanode:50010 (CLOSE_WAIT)
java    9433  iot  418w  IPv4 4064304789      0t0  TCP jmHost:13375->datanode:50010 (CLOSE_WAIT)
java    9433  iot  419u  IPv4 4062599328      0t0  TCP jmHost:15915->datanode:50010 (CLOSE_WAIT)
java    9433  iot  420w  IPv4 4065462963      0t0  TCP jmHost:30432->datanode:50010 (CLOSE_WAIT)
java    9433  iot  421u  IPv4 4067178257      0t0  TCP jmHost:28334->datanode:50010 (CLOSE_WAIT)
java    9433  iot  422u  IPv4 4066022066      0t0  TCP jmHost:11843->datanode:50010 (CLOSE_WAIT)
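A quick way to count them over time, using plain shell tooling and the same $JMPID:

lsof -i -a -p $JMPID | grep CLOSE_WAIT | wc -l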


I know restarting the job manager would clean up those connections, but I
wonder if there is a better solution?
Btw, I am using Flink 1.4.0, running a standalone cluster.

Thanks
Youjun
