[ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933718#action_12933718 ]
Flavio Junqueira commented on ZOOKEEPER-880:
--------------------------------------------

One problem here is that we had some discussions over IRC and that information is not reflected here. If you have a look at the logs, you'll observe this:

{noformat}
2010-09-28 10:31:22,227 DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection request /10.10.20.5:41861
2010-09-28 10:31:22,227 DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection request: 0
2010-09-28 10:31:22,227 DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager: Address of remote peer: 0
2010-09-28 10:31:22,229 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection broken:
java.io.IOException: Channel eof
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:595)
{noformat}

If I remember the discussion with J-D correctly, the node trying to connect is running Nagios. My conjecture at the time was that the IOException was killing the receiver thread but not the sender thread (RecvWorker.finish() does not close its SendWorker counterpart); a rough sketch of what I mean is appended at the end of this message. Your point is good, but it sounds like the race you mention would have to be triggered continuously to cause the number of SendWorker threads to grow steadily. That seems unlikely to me.

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack, TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz
>
>
> We're seeing an issue where one server in the ensemble has a steadily growing number of QuorumCnxManager$SendWorker threads, up to the point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs. This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach thread dumps and logs in a moment.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
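
To make the conjecture above concrete, here is a minimal, self-contained sketch. This is not the actual QuorumCnxManager source: the SendWorker/RecvWorker/finish() names come from the logs above, and everything else is simplified for illustration. It shows how a receiver dying on an IOException without tearing down its paired sender would leave one extra SendWorker thread behind per probe connection:

{noformat}
// Not the actual QuorumCnxManager code -- a minimal sketch of the conjecture.
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class SendWorkerLeakSketch {

    static class SendWorker extends Thread {
        private final AtomicBoolean running = new AtomicBoolean(true);

        @Override
        public void run() {
            // Stand-in for blocking on a send queue for a peer that is gone.
            while (running.get()) {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        void finish() {
            running.set(false);
            interrupt();
        }
    }

    static class RecvWorker extends Thread {
        private final SendWorker sw;

        RecvWorker(SendWorker sw) {
            this.sw = sw;
        }

        @Override
        public void run() {
            try {
                readFromChannel();
            } catch (IOException e) {
                // Matches the "Connection broken: java.io.IOException: Channel eof"
                // lines in the log: the receiver thread dies here.
                System.err.println("Connection broken: " + e);
            } finally {
                finish();
            }
        }

        private void readFromChannel() throws IOException {
            // A monitoring probe (e.g. Nagios) that connects and closes the
            // socket right away shows up as an immediate EOF.
            throw new IOException("Channel eof");
        }

        void finish() {
            // The conjecture: only the receiver is torn down here, so the
            // paired SendWorker keeps running. The fix would be to also call
            // sw.finish() at this point.
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SendWorker sw = new SendWorker();
        RecvWorker rw = new RecvWorker(sw);
        sw.start();
        rw.start();
        rw.join();
        Thread.sleep(200);
        // Prints "true": each such probe connection leaves one more live
        // SendWorker behind, which is how the thread count grows steadily.
        System.out.println("SendWorker still alive after receiver died: " + sw.isAlive());
        sw.finish();   // clean up so the sketch terminates
    }
}
{noformat}

Running the sketch prints that the SendWorker is still alive after the receiver has exited; with the conjectured fix (calling sw.finish() from RecvWorker.finish()), it would report the sender as stopped instead.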