[ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933602#action_12933602 ]
Vishal K commented on ZOOKEEPER-880:
------------------------------------

Hi Benoit,

Could you check whether this problem reproduces with 3.3.3 (with the patch for ZOOKEEPER-822)?

I was going through QuorumCnxManager.java for 3.2.2. It clearly leaks a SendWorker thread for every other connection.

After receiving a connection from a peer, it creates a new thread and inserts its reference in senderWorkerMap:

    SendWorker sw = new SendWorker(s, sid);
    RecvWorker rw = new RecvWorker(s, sid);
    sw.setRecv(rw);
    SendWorker vsw = senderWorkerMap.get(sid);
    senderWorkerMap.put(sid, sw);

Then it kills the old thread for the peer (created from an earlier connection):

    if(vsw != null)
        vsw.finish();

However, the SendWorker.finish method removes the entry for sid from senderWorkerMap:

    senderWorkerMap.remove(sid);

Because the put has already happened, the entry for sid now refers to the recently created SendWorker thread, so this call removes the reference to the new thread. Thus, both entries end up removed. As a result, one thread is leaked for every other connection.

If you count the number of error messages in hbase-hadoop-zookeeper-sv4borg9.log, you will see approximately twice as many messages from RecvWorker as from SendWorker. I think this proves the point.
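To make the sequence above concrete, here is a minimal sketch that replays it against a plain map. It is not the QuorumCnxManager code: the Worker class, the countLeaked helper, and the guardedRemove flag are hypothetical names introduced for illustration, no sockets or threads are created, and the use of ConcurrentHashMap is an assumption. The guarded variant, which removes the entry only if it still points at the worker being finished, is one possible way to avoid the problem and is not claimed to be what the ZOOKEEPER-822 patch does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

final class SenderMapLeakSketch {

    /** Hypothetical stand-in for SendWorker: no socket, no thread, just a flag. */
    static final class Worker {
        final ConcurrentHashMap<Long, Worker> map;
        final long sid;
        final boolean guardedRemove;
        boolean running = true;

        Worker(ConcurrentHashMap<Long, Worker> map, long sid, boolean guardedRemove) {
            this.map = map;
            this.sid = sid;
            this.guardedRemove = guardedRemove;
        }

        void finish() {
            running = false;
            if (guardedRemove) {
                // Removes the entry only if it still maps sid to this worker.
                map.remove(sid, this);
            } else {
                // Behaviour described in the comment: removes whatever is mapped to sid.
                map.remove(sid);
            }
        }
    }

    /**
     * Replays the given number of incoming connections from a single peer and
     * returns how many workers are still running but no longer reachable
     * through the map.
     */
    static int countLeaked(int connections, boolean guardedRemove) {
        ConcurrentHashMap<Long, Worker> senderWorkerMap = new ConcurrentHashMap<>();
        List<Worker> created = new ArrayList<>();
        long sid = 1L;

        for (int i = 0; i < connections; i++) {
            Worker sw = new Worker(senderWorkerMap, sid, guardedRemove);
            created.add(sw);
            Worker vsw = senderWorkerMap.get(sid); // worker from the earlier connection, if any
            senderWorkerMap.put(sid, sw);          // the new worker replaces it in the map
            if (vsw != null) {
                vsw.finish();                      // unguarded: also drops the new entry
            }
        }

        int leaked = 0;
        for (Worker w : created) {
            if (w.running && senderWorkerMap.get(sid) != w) {
                leaked++;
            }
        }
        return leaked;
    }

    public static void main(String[] args) {
        System.out.println("unguarded, 4 connections: " + countLeaked(4, false)); // prints 2
        System.out.println("guarded, 4 connections:   " + countLeaked(4, true));  // prints 0
    }
}
```

With the unguarded remove, every second connection leaves a running worker that nothing in the map refers to, so a later connection never finds it and never calls finish on it. That matches the "one thread for every other connection" pattern described above.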
The counts from hbase-hadoop-zookeeper-sv4borg9.log:

$:/tmp/hadoop # grep "RecvWorker" hbase-hadoop-zookeeper-sv4borg9.log | wc -l
60
$:/tmp/hadoop # grep "SendWorker" hbase-hadoop-zookeeper-sv4borg9.log | wc -l
32

-Vishal

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack, TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz
>
>
> We're seeing an issue where one server in the ensemble has a steadily growing
> number of QuorumCnxManager$SendWorker threads up to a point where the OS runs
> out of native threads, and at the same time we see a lot of exceptions in the
> logs. This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach thread dumps and logs
> in a moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.