[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669794#comment-13669794
 ] 

John Billings commented on ZOOKEEPER-974:
-----------------------------------------

I've run into this same problem during production testing of ZK node failures 
and would like to submit an updated patch to resolve this issue.

Background: We're running a three-node ZK quorum with ~3000 clients. Killing 
one of the quorum members causes roughly 1000 (3000/3) of the clients to 
attempt to reconnect to one of the two remaining members. With the default 
socket connection backlog setting, hundreds of connection attempts fail and 
some sessions time out. 'netstat -s' indicates that this is caused by 
overflow of the socket connection backlog.

I've resolved this issue by patching ZK (v3.4.5) to increase the socket 
connection backlog setting.
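
For illustration, here is a minimal sketch of the kind of change involved. It 
is not the actual patch: the class name and the "zookeeper.clientPortBacklog" 
system property are hypothetical, chosen only for this example. It relies on 
the standard ServerSocket.bind(SocketAddress, int) overload, which takes the 
requested backlog as its second argument and falls back to the JDK default 
when that value is zero or negative.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

public class BacklogBindSketch {
    // Hypothetical property name, used only for this illustration.
    static final String BACKLOG_PROPERTY = "zookeeper.clientPortBacklog";

    // Returns the configured backlog, or -1 meaning "use the JDK default".
    static int configuredBacklog() {
        return Integer.getInteger(BACKLOG_PROPERTY, -1);
    }

    // Opens a non-blocking listener on the given port with the requested
    // backlog. Port 0 asks the OS for any free port.
    static ServerSocketChannel openListener(int port, int backlog) throws IOException {
        ServerSocketChannel channel = ServerSocketChannel.open();
        channel.socket().setReuseAddress(true);
        // A backlog <= 0 makes the JDK use its implementation default.
        channel.socket().bind(new InetSocketAddress(port), backlog);
        channel.configureBlocking(false);
        return channel;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocketChannel channel = openListener(0, configuredBacklog())) {
            System.out.println("listening on port " + channel.socket().getLocalPort());
        }
    }
}
```

The value passed to bind() is only a request; the operating system may cap it.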

Question: Which version(s) of ZK should I target for the patch?  The relevant 
code in NettyServerCnxnFactory and NIOServerCnxnFactory has changed 
substantially between 3.4.5 and trunk (3.5.0?).  Will there be a 3.4.6 
release based on 3.4.5?
                
> Configurable listen socket backlog for the client port
> ------------------------------------------------------
>
>                 Key: ZOOKEEPER-974
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-974
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 3.3.2
>            Reporter: Hoonmin Kim
>            Priority: Minor
>         Attachments: ZOOKEEPER-974.patch
>
>
> We've been running a ZooKeeper ensemble (3-node configuration) in 
> production for months.
> A few days ago, we suffered temporary network(?) problems that caused many 
> reconnections (about 300) of clients with ephemeral nodes on one ZooKeeper 
> server.
> Almost all clients successfully reconnected to the other ZooKeeper servers,
> but one client failed to reconnect in time and got a session expired message 
> from the server.
> (The problem is that our clients died when they got a SessionExpired 
> message.)
> There were many listenQ overflows/drops and out resets in the minute just 
> before the problem occurred.
> ---
> So we patched ZooKeeper to increase the backlog size for the client port 
> socket to avoid unhappy cases like this.
> Because ZooKeeper uses the default backlog size (50) when it calls bind(), 
> we added a "clientPortBacklog" option.
> Although the default backlog should be fine for common environments,
> we believe that making the size configurable is also worthwhile.
> [Note]
> On Linux, the parameter below:
>     net.core.somaxconn
> needs to be at least as large as the "clientPortBacklog" value above for 
> the listen socket backlog to take effect as configured.
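
To illustrate the note above: on Linux the kernel silently caps the backlog 
requested at listen time to net.core.somaxconn, so the effective backlog is 
the smaller of the two values. The sketch below is only an illustration under 
that assumption; the class and method names are hypothetical, and the /proc 
path is the standard Linux location for this sysctl.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SomaxconnCheck {
    // Standard Linux location of the net.core.somaxconn sysctl.
    static final Path SOMAXCONN = Paths.get("/proc/sys/net/core/somaxconn");

    // Backlog the kernel will actually use: the request capped at somaxconn.
    // A negative somaxconn means "unknown", so the request is returned as-is.
    static int effectiveBacklog(int requested, int somaxconn) {
        return somaxconn < 0 ? requested : Math.min(requested, somaxconn);
    }

    // Reads net.core.somaxconn, or returns -1 if it cannot be read or parsed
    // (for example on a non-Linux system).
    static int readSomaxconn() {
        try {
            byte[] raw = Files.readAllBytes(SOMAXCONN);
            return Integer.parseInt(new String(raw, StandardCharsets.US_ASCII).trim());
        } catch (IOException | NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        int requested = 1024; // example value for the backlog option
        int somaxconn = readSomaxconn();
        int effective = effectiveBacklog(requested, somaxconn);
        if (effective < requested) {
            System.out.println("requested " + requested + " but somaxconn caps it at " + effective);
        } else {
            System.out.println("requested backlog " + requested + " is not capped");
        }
    }
}
```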
