lvfangmin opened a new pull request #843: ZOOKEEPER-3296: Explicitly closing the sslsocket when it failed handshake to prevent issue where peers cannot join quorum URL: https://github.com/apache/zookeeper/pull/843 The quorum connection manager is handling connections sequentially with a default listen backlog queue size 50, during the network loss, there are socket read timed out, which is syncLimit * tickTime, and almost all the following connect requests in the backlog queue will timed out from the other side before it's being processed. Those timed out learners will try to connect to a different server, and leaves the connect requests on server side without sending the close_notify packet. The server is slowly consuming from these queue with syncLimit * tickTime timeout for each of those requests which haven't sent notify_close packet. Any new connect requests will be queued up again when there is spot in the listen backlog queue, but timed out before the server handles it, and it can never successfully finish any new connection, and it failed to join the quorum. Please check the Jira for more details.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services