[ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915832#action_12915832 ]
Jean-Daniel Cryans commented on ZOOKEEPER-880:
----------------------------------------------

bq. to be overly clear - this is happening on just 1 server, the other servers on the cluster are not seeing this, is that right?

Yes, sv4borg9.

bq. any insight on GC and JVM activity. Are there significant pauses on the GC, or perhaps swapping of that jvm? How active is the JVM? How active (cpu) are the other processes on this host? You mentioned they are using 50% disk, what about cpu?

No swapping, and GC activity looks normal as far as I can tell from the GC log. That process uses 1 CPU according to top; the rest of the CPUs are idle most of the time.

bq. If I understood correctly the JVM hosting the ZK server is hosting other code as well, is that right? You mentioned something about hbase managing the ZK server, could you elaborate on that as well?

That machine is also the Namenode, JobTracker, and HBase master (each in its own JVM). The only thing special is that the quorum peers are started by HBase.

bq. Is there a way you could move the ZK datadir on that host to an unused spindle and see if that helps at all?

I'll look into that.

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz,
>                      hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack
>
>
> We're seeing an issue where one server in the ensemble has a steadily growing
> number of QuorumCnxManager$SendWorker threads, up to a point where the OS runs
> out of native threads, and at the same time we see a lot of exceptions in the
> logs.
> This is on 3.2.2, and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach thread dumps and logs
> in a moment.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
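For reference, the SendWorker growth described above can be confirmed by counting thread headers in successive jstack dumps, such as the one attached to the issue. A minimal sketch; the sample dump lines and the /tmp path below are fabricated for illustration and are not taken from the attachment:

```shell
# Fabricated sample of thread-dump header lines, standing in for a real jstack output.
cat > /tmp/jstack-sample.txt <<'EOF'
"QuorumCnxManager$SendWorker" prio=10 tid=0x00007f01 runnable
"QuorumCnxManager$SendWorker" prio=10 tid=0x00007f02 runnable
"QuorumCnxManager$RecvWorker" prio=10 tid=0x00007f03 runnable
"main" prio=10 tid=0x00007f04 runnable
EOF

# Count SendWorker threads; a count that keeps rising across dumps taken over
# time is the signature of the leak described in this issue.
grep -c 'QuorumCnxManager\$SendWorker' /tmp/jstack-sample.txt
# prints 2 for the sample above
```

Comparing this count between dumps taken a few minutes apart on the affected server (sv4borg9) would show whether the threads are accumulating or being reclaimed.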