[ https://issues.apache.org/jira/browse/ZOOKEEPER-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669794#comment-13669794 ]
John Billings commented on ZOOKEEPER-974:
-----------------------------------------

I've run into this same problem during production testing of ZK node failures
and would like to submit an updated patch to resolve this issue.

Background: We're running a three-node ZK quorum with ~3000 clients. Killing
one of the quorum members causes 3000/3 = 1000 of the clients to attempt to
reconnect to one of the remaining quorum members. With the default socket
connection backlog setting, there are many failed connection attempts
(hundreds) and some session timeouts. 'netstat -s' indicates that this is
caused by overflow of the socket connection backlog. I've resolved this issue
by patching ZK (v3.4.5) to increase the socket connection backlog setting.

Question: Which version(s) of ZK should I target for the patch? The relevant
code in NettyServerCnxnFactory and NIOServerCnxnFactory has changed
substantially between 3.4.5 and trunk (3.5.0?). Will there be a 3.4.6 release
based off 3.4.5?

> Configurable listen socket backlog for the client port
> ------------------------------------------------------
>
>                 Key: ZOOKEEPER-974
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-974
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 3.3.2
>            Reporter: Hoonmin Kim
>            Priority: Minor
>         Attachments: ZOOKEEPER-974.patch
>
>
> We've been running a ZooKeeper ensemble (3-node configuration) in production
> for months.
> A few days ago, we suffered temporary network(?) problems that caused many
> reconnections (about 300) of ephemeral nodes on one ZooKeeper server.
> Almost all clients successfully reconnected to the other ZooKeeper servers,
> but one client failed to reconnect in time and got a session expired message
> from the server.
> (The problem is that our clients died when they got the SessionExpired
> message.)
> There were many listenQ overflows/drops and out resets in the minute just
> before the problem occurred.
> ---
> So we patched ZooKeeper to increase the backlog size for the client port
> socket to avoid unhappy cases like this.
> As ZooKeeper uses the default backlog size (50) for bind(), we added a
> "clientPortBacklog" option.
> Though the default backlog should be fine for common environments,
> we believe that making the size configurable is also worthwhile.
> [Note]
> On Linux, the parameter
>     net.core.somaxconn
> needs to be at least as large as "clientPortBacklog" for the listen socket
> backlog to be configured correctly; otherwise the kernel silently caps the
> backlog at the somaxconn value.
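
For readers unfamiliar with the setting, here is a minimal sketch of what a configurable backlog amounts to at the socket level. It is not the attached patch and not ZooKeeper's actual code: the class BacklogSketch and the helpers readBacklog and openListener are hypothetical names used only for illustration, and only the option name "clientPortBacklog" comes from the description above. It relies on the standard java.net.ServerSocket.bind(SocketAddress, int) overload, which falls back to the JDK default when the backlog is 0 or negative.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.util.Properties;

// Hypothetical illustration only; not taken from the ZooKeeper code base.
class BacklogSketch {

    // Reads the "clientPortBacklog" option; -1 means "not set" (assumed convention).
    static int readBacklog(Properties cfg) {
        return Integer.parseInt(cfg.getProperty("clientPortBacklog", "-1"));
    }

    // Opens a non-blocking listen socket, passing the backlog to bind() when one is set.
    static ServerSocketChannel openListener(int port, int backlog) throws IOException {
        ServerSocketChannel ch = ServerSocketChannel.open();
        ch.socket().setReuseAddress(true);
        if (backlog > 0) {
            // Explicit backlog; on Linux the kernel still caps it at net.core.somaxconn.
            ch.socket().bind(new InetSocketAddress(port), backlog);
        } else {
            // No backlog given: the JDK default is used.
            ch.socket().bind(new InetSocketAddress(port));
        }
        ch.configureBlocking(false);
        return ch;
    }

    public static void main(String[] args) throws IOException {
        Properties cfg = new Properties();
        cfg.setProperty("clientPortBacklog", "1024");
        // Port 0 asks the OS for any free port, so the sketch runs anywhere.
        try (ServerSocketChannel ch = openListener(0, readBacklog(cfg))) {
            System.out.println("listening on port " + ch.socket().getLocalPort());
        }
    }
}
```

Whether a larger value actually helps can be checked the same way the problem was diagnosed: watch the listen queue overflow counters in 'netstat -s' while clients reconnect.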