[ https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792684#comment-13792684 ]

Mark Miller commented on SOLR-5325:
-----------------------------------

I'm still kind of surprised this would happen - we should be retrying on connectionloss up to an expiration - which would make us the leader no longer. Perhaps the length of retrying can be a little short or something. And perhaps that is part of why it is more difficult for me to reproduce in a test.

> zk connection loss causes overseer leader loss
> ----------------------------------------------
>
>                 Key: SOLR-5325
>                 URL: https://issues.apache.org/jira/browse/SOLR-5325
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.3, 4.4, 4.5
>            Reporter: Christine Poerschke
>            Assignee: Mark Miller
>             Fix For: 4.5.1, 4.6, 5.0
>
>         Attachments: SOLR-5325.patch, SOLR-5325.patch, SOLR-5325.patch
>
>
> The problem we saw was that when the solr overseer leader experienced
> temporary zk connectivity problems it stopped processing overseer queue
> events.
> This first happened when quorum within the external zk ensemble was lost due
> to too many zookeepers being stopped (similar to SOLR-5199). The second time
> it happened when there was a sufficient number of zookeepers but they were
> holding zookeeper leadership elections and thus refused connections (the
> elections were taking several seconds, we were using the default
> zookeeper.cnxTimeout=5s value and it was hit for one ensemble member).
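
To illustrate the behaviour the comment describes (retrying on connection loss up to an expiration), here is a minimal sketch. It is not Solr's actual code. ConnectionLossException is a hypothetical stand-in for ZooKeeper's KeeperException.ConnectionLossException, and the names retryOnConnectionLoss, timeoutMs and pauseMs are assumptions made for this example only.

```java
import java.util.concurrent.Callable;

class RetryUntilExpiry {
    /** Hypothetical stand-in for ZooKeeper's KeeperException.ConnectionLossException. */
    static class ConnectionLossException extends Exception {}

    /**
     * Runs op, retrying whenever it fails with a connection loss, until
     * timeoutMs has elapsed. Once the deadline has passed, the last
     * ConnectionLossException is rethrown so the caller can treat the
     * session as expired (for example, by giving up leadership).
     */
    static <T> T retryOnConnectionLoss(Callable<T> op, long timeoutMs, long pauseMs)
            throws Exception {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            try {
                return op.call();
            } catch (ConnectionLossException e) {
                // Give up only after the retry window has elapsed.
                if (System.nanoTime() - deadline >= 0) {
                    throw e;
                }
                Thread.sleep(pauseMs); // short pause before the next attempt
            }
        }
    }
}
```

If the retry window (timeoutMs in this sketch) is shorter than the time the ensemble needs to recover, the operation fails before the connection comes back, which is the "length of retrying can be a little short" case raised in the comment.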