[ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829162#action_12829162 ]

Patrick Hunt commented on ZOOKEEPER-662:
----------------------------------------

Qian, if you look at the logs you can see both of these clients: the client I 
mentioned in my earlier comment, and also the "stat" client:

2010-02-01 06:24:49,783 - INFO  [NIOServerCxn.Factory:8181:nioserverc...@698] - Processing stat command from /10.65.7.48:48413
2010-02-01 06:24:49,783 - WARN  [NIOServerCxn.Factory:8181:nioserverc...@494] - Exception causing close of session 0x0 due to java.io.IOException: Responded to info probe

(Really, the second line should not be a WARN; this is improved in the 3.3.0 
codebase.)

From the logs I don't see anything to indicate a problem, though. I'm wondering 
if there is some timing problem in either our C or Java networking code (also, 
you are using Linux 2.6.9, which is an older kernel; I'm wondering if perhaps 
the timing our app sees is different).

One thing about the 4-letter words (like stat): in some cases I've seen the 
response from the 4-letter word be truncated. Perhaps this caused your 
monitoring app to fail? You might add some diagnostics to your monitor app to 
debug this sort of thing.
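
For example, something along these lines could be dropped into the monitor 
(a rough sketch only: the host and port are taken from the zoo.cfg in your 
report, the class name is made up, and the "Node count" completeness check is 
just my assumption about how the stat output normally ends):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic: send "stat" to the client port and report exactly
// how many bytes come back before the server closes the connection.
public class StatProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "10.65.20.68";               // server.200, from zoo.cfg
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8181;          // clientPort, from zoo.cfg

        Socket sock = new Socket();
        sock.connect(new InetSocketAddress(host, port), 3000);
        sock.setSoTimeout(3000);

        OutputStream out = sock.getOutputStream();
        out.write("stat".getBytes("US-ASCII"));
        out.flush();

        // Read until the server closes the socket (EOF), counting bytes as we go.
        InputStream in = sock.getInputStream();
        StringBuilder resp = new StringBuilder();
        byte[] buf = new byte[4096];
        int n, total = 0;
        while ((n = in.read(buf)) != -1) {
            total += n;
            resp.append(new String(buf, 0, n, "US-ASCII"));
        }
        sock.close();

        System.out.println("stat response was " + total + " bytes");
        // Assumption: a complete stat response ends with a "Node count" line;
        // adjust this to whatever your version actually prints last.
        if (!resp.toString().contains("Node count")) {
            System.out.println("response looks truncated:");
        }
        System.out.print(resp);
    }
}

Logging the byte count on every probe would tell you quickly whether the 
failures correlate with short reads.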

What I mean is, you request a "stat" and the client sees some of the response, 
but not all of it. I'm not sure why this is, but it may have something to do 
with either the way nc works (I always use nc for this) or the way the server 
works, in the sense that the server pushes the response text onto the wire and 
then closes the connection. Perhaps in some cases the socket close causes the 
client to not see all of the response? Is that possible in a TCP close?
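
As far as I understand it, it can be: if the side doing the close still has 
unread input pending, many TCP stacks turn the close into a reset, and a reset 
can discard response bytes the peer has received but not yet read. Below is a 
sketch of the usual defensive pattern on the sending side (write, flush, 
half-close, drain, then close); this is purely illustrative and not a claim 
about what the ZooKeeper server code currently does:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

// Illustrative "graceful close" for the side that writes a response and then
// hangs up: finish the write, half-close so a FIN (not a reset) goes out,
// drain anything the peer may still send, and only then close the socket.
public class GracefulClose {
    public static void writeAndClose(Socket sock, byte[] response) throws Exception {
        OutputStream out = sock.getOutputStream();
        out.write(response);
        out.flush();

        sock.shutdownOutput();              // FIN our side, keep the read half open

        InputStream in = sock.getInputStream();
        byte[] buf = new byte[512];
        while (in.read(buf) != -1) {
            // discard: we only want to see the peer's EOF before closing
        }
        sock.close();
    }
}

If a probe like the one above consistently sees the full output while nc 
sometimes doesn't, that would point at the client side rather than the server.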


> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a ZooKeeper cluster with 5 servers, ZooKeeper version 3.2.1. Here is 
> the content of the configuration file, zoo.cfg:
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, server.200 was the leader. Yesterday morning, I 
> found that there were many sockets in the CLOSE_WAIT state on the clientPort 
> (8181); the total was around 120. Because of these CLOSE_WAIT sockets, 
> server.200 could not accept more connections from the clients. The only thing 
> I could do in this situation was restart server.200, at about 
> 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
