Hi André,

I didn't read all of the responses in full, so I hope I don't repeat too much (or, worse, contradict statements made in other replies).

On 08.04.2009 12:32, André Warnier wrote:
> Like the original poster, I am seeing on my systems a fair number of
> sockets apparently stuck for a long time in the CLOSE_WAIT state.
> (Sometimes several hundreds of them).
> They seem to predominantly concern Tomcat and other java processes, but
> as Alan pointed out previously and I confirm, my perspective is slanted,
> because we use a lot of common java programs and webapps on our servers,
> and the ones mostly affected talk to each other and come from the same
> vendor.
> Unfortunately also, I do not have the sources of these programs/webapps
> available, and will not get them, and I can't do without these programs.
>
> It has been previously established that a socket in a
> long-time-lingering CLOSE-WAIT status, is due to one or the other side
> of a TCP connection not properly closing its side of the connection when
> it is done with it.

CLOSE_WAIT means that the other side has shut down its end of the connection. TCP connections are allowed to stay in this half-closed state for an arbitrary time. In general, a TCP connection can be used in a duplex way. But assume one end has finished communicating (sending data): it can then already close its side of the connection.

The nice TCP state diagram from Stevens' fundamental book can be seen e.g. at

http://www.cse.iitb.ac.in/perfnet/cs456/tcp-state-diag.pdf

As you can see, CLOSE_WAIT on one end normally implies FIN_WAIT2 on the other end (except when there is yet another component between the two ends that interferes with the communication, such as a firewall). In the special situation where both ends of the communication are on the same system, one finds each connection twice, once from the point of view of each side of the connection. When interpreting the two lines, it is always important to think about which end one is looking at.
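
To make the half-closed state concrete, here is a minimal Java sketch. It is not taken from any of the programs discussed, and the class and method names are made up for illustration. One end of a loopback connection closes immediately while the other end keeps its socket open. For as long as the second end holds the socket, netstat would list it in CLOSE_WAIT; a read on it should report end-of-stream (-1), which is the cue to close.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class, for illustration only.
final class HalfCloseDemo {

    /** Returns what read() reports on the client after the peer has closed. */
    static int readAfterPeerClose() throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port for the listener.
        try (ServerSocket listener = new ServerSocket(0, 1, loopback)) {
            Socket client = new Socket(loopback, listener.getLocalPort());
            try {
                // The accepting end closes right away: it sends a FIN and,
                // once that is acknowledged, sits in FIN_WAIT2.
                listener.accept().close();

                // From here on the client socket is in CLOSE_WAIT, and it
                // stays there until this process closes it.
                client.setSoTimeout(5000); // do not block forever in read()
                return client.getInputStream().read(); // end-of-stream => -1
            } finally {
                // Closing our side lets the connection leave CLOSE_WAIT.
                client.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read() after peer close returned "
                + readAfterPeerClose());
    }
}
```

An application that never reads from such a socket, and never closes it, leaves it in CLOSE_WAIT indefinitely, which matches the picture in the netstat output.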

> I also surmise (without having a definite proof of this), that this is
> essentially "bad", as it ties up some resources that could be otherwise
> freed.
> I have also been told or discovered that, our servers being Linux Debian
> servers, programs such as "ps", "netstat" and "lsof" can help in
> determining precisely how many such lingering sockets there are, and who
> the culprit processes are (to some extent).

True.

> In our case, we know which are the programs involved, because we know
> which ones open a listening socket and on what fixed port, and we also
> know which are the other processes talking to them.
> But, as mentioned previously, we do not have the source of these
> programs and will not get them, but cannot practically do without them
> for now. But we do have full root control of the Linux servers where
> these programs are running.

The details may depend on the protocols used. Sometimes you can find timeouts that can be set in the application, such as idle timeouts for persistent connections.

> So my question is : considering the situation above, is there something
> I can do locally to free these lingering CLOSE_WAIT sockets, and under
> which conditions ?
> (I must admit that I am a bit lost among the myriad options of lsof)

I would say "no", if you can't change the application and its developer didn't provide any configuration options. From the point of view of TCP, CLOSE_WAIT is a legitimate state without any built-in timeout.

> For example, suppose I start with a "netstat -pan" command and I see the
> display below (sorry for the line-wrapping).
> I see a number of sockets in the CLOSE_WAIT state, and for those I have
> a process-id, which I can associate to a particular process.
> For example, I see this line :
> tcp6 12 0 ::ffff:127.0.0.1:41764 ::ffff:127.0.0.1:11002
> CLOSE_WAIT 29649/java
> which tells me that there is a local process 29649/java, with a "local"
> socket port 41674 in the CLOSE_WAIT state, related to another socket
> #11002 on the same host.
> On the other hand, I see this line :
> tcp 0 0 127.0.0.1:11002 127.0.0.1:41764 FIN_WAIT2 -
> which shows a "local" socket on port 11002, related to this other local
> socket port #41764, with no process-id/program displayed.
> What does that tell me ?

My interpretation (not 100% sure): I don't know what your OS shows in netstat after the local side of a connection has been closed, more precisely whether the pid is still shown or is removed. Depending on the answer, we have either a simple one-sided shutdown or even a process exit. In both cases process 29649 (the owner of port 41764) had no reason to use the established connection in the meantime, so it didn't realise that the connection is only half-open. As soon as it tries to use it, it should detect that and most likely (if programmed correctly) close it.

> I also know that the process-id 29649 corresponds to a local java
> process, of the daemon variety, multi-threaded. That program "talks to"
> another known server program, written in C, of which instances are
> started on an ad-hoc base by inetd, and which "listens" on port 11002
> (in fact it is inetd who does, and it passes this socket on to the
> process it forks, I understand that).
>
> (The link with Tomcat is that I also see frequently the same situation,
> where the process "owning" the CLOSE_WAIT socket is Tomcat, more
> specifically one webapp running inside it. It's just that in this
> particular snapshot it isn't.)

So Tomcat is the end of the connection that didn't close. You should try to guess from the remote port what the other side is. That's the one closing the socket. Then think about what this connection is used for (protocol etc.)
and what inside Tomcat might use it (very likely the webapp, or a database pool, an LDAP-based realm etc.), and whether the Tomcat-side technology for this connection allows idle timeouts or regular probing.

> What it looks like to me in this case, is that at some point one of the
> threads of process # 29649 opened a client socket #41674 to the local
> inetd port #11002; that inetd then started the underlying server process
> (the C program); that the underlying C program then at some point
> exited; but that process #41674 never closes one of the sides of its
> connection with port #11002.
> Can I somehow detect this condition, and "force" the offending thread of
> process #29649 to close that socket (or just force this thread to exit) ?

Not to my knowledge, I mean not from outside process 29649. Maybe running gstack against the native process gives you an idea of what it is doing.

> I realise this may be a complex question, and that the answers may be
> different if it is a Tomcat webapp than a stand-alone process. I would
> be content to just have answers for the webapp case.
>
>
> Full display of "netstat -pan | grep WAIT" :

TIME_WAIT is something completely unrelated.
Separate discussion ;)

> Proto Recv-Q Send-Q Local Address           Foreign Address         State      PID/Program name
> tcp        0      0 127.0.0.1:11002         127.0.0.1:41764         FIN_WAIT2  -
> tcp        0      0 127.0.0.1:11002         127.0.0.1:41739         FIN_WAIT2  -
> tcp        0      0 127.0.0.1:11002         127.0.0.1:41753         FIN_WAIT2  -
> tcp        0      0 127.0.0.1:11002         127.0.0.1:41759         FIN_WAIT2  -
> tcp6       0      0 ::ffff:127.0.0.1:11101  ::ffff:127.0.0.1:41762  FIN_WAIT2  -
> tcp6       0      0 ::ffff:127.0.0.1:11101  ::ffff:127.0.0.1:41748  FIN_WAIT2  -
> tcp6      12      0 ::ffff:127.0.0.1:41711  ::ffff:127.0.0.1:11002  CLOSE_WAIT 13333/java
> tcp6      12      0 ::ffff:127.0.0.1:41708  ::ffff:127.0.0.1:11002  CLOSE_WAIT 13333/java
> tcp6      12      0 ::ffff:127.0.0.1:41764  ::ffff:127.0.0.1:11002  CLOSE_WAIT 29649/java
> tcp6      12      0 ::ffff:127.0.0.1:41753  ::ffff:127.0.0.1:11002  CLOSE_WAIT 13333/java
> tcp6      12      0 ::ffff:127.0.0.1:41759  ::ffff:127.0.0.1:11002  CLOSE_WAIT 29649/java
> tcp6      12      0 ::ffff:127.0.0.1:41739  ::ffff:127.0.0.1:11002  CLOSE_WAIT 13333/java
> tcp6      12      0 ::ffff:127.0.0.1:39436  ::ffff:127.0.0.1:11002  CLOSE_WAIT 13333/java
> tcp6      12      0 ::ffff:127.0.0.1:38989  ::ffff:127.0.0.1:11002  CLOSE_WAIT 13333/java
> tcp6      12      0 ::ffff:127.0.0.1:39364  ::ffff:127.0.0.1:11002  CLOSE_WAIT 13333/java
> tcp6      12      0 ::ffff:127.0.0.1:39390  ::ffff:127.0.0.1:11002  CLOSE_WAIT 13333/java
> tcp6      12      0 ::ffff:127.0.0.1:40859  ::ffff:127.0.0.1:11002  CLOSE_WAIT 13333/java
> tcp6       1      0 ::ffff:127.0.0.1:39412  ::ffff:127.0.0.1:11101  CLOSE_WAIT 2864/java
> tcp6       1      0 ::ffff:127.0.0.1:41249  ::ffff:127.0.0.1:11101  CLOSE_WAIT 2864/java
> tcp6       1      0 ::ffff:127.0.0.1:41748  ::ffff:127.0.0.1:11101  CLOSE_WAIT 2864/java
> tcp6       1      0 ::ffff:127.0.0.1:41731  ::ffff:127.0.0.1:11101  CLOSE_WAIT 2864/java
> tcp6       1      0 ::ffff:127.0.0.1:41762  ::ffff:127.0.0.1:11101  CLOSE_WAIT 2864/java

Regards,

Rainer