[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss
[ https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793666#comment-13793666 ]

Mark Miller commented on SOLR-5325:
-----------------------------------

I think the reason this is hard to catch in a test is that we retry on connection loss up to the expiration time. There must be some case, though, where we were still getting a connection loss and no expiration. This issue should handle that case for this particular bit of code, but as an overall precautionary measure I have also bumped up the retries a bit to try to ensure they go beyond the session timeout.

> zk connection loss causes overseer leader loss
> ----------------------------------------------
>
>                 Key: SOLR-5325
>                 URL: https://issues.apache.org/jira/browse/SOLR-5325
>             Project: Solr
>          Issue Type: Bug
>   Affects Versions: 4.3, 4.4, 4.5
>            Reporter: Christine Poerschke
>            Assignee: Mark Miller
>             Fix For: 4.5.1, 4.6, 5.0
>
>         Attachments: SOLR-5325.patch, SOLR-5325.patch, SOLR-5325.patch
>
>
> The problem we saw was that when the solr overseer leader experienced
> temporary zk connectivity problems it stopped processing overseer queue
> events.
> This first happened when quorum within the external zk ensemble was lost due
> to too many zookeepers being stopped (similar to SOLR-5199). The second time
> it happened when there was a sufficient number of zookeepers but they were
> holding zookeeper leadership elections and thus refused connections (the
> elections were taking several seconds, we were using the default
> zookeeper.cnxTimeout=5s value and it was hit for one ensemble member).

--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
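The retry reasoning in Mark Miller's comment above (keep retrying on connection loss until the total wait exceeds the session timeout, plus a little padding) can be sketched roughly as follows. This is only an illustration, not the code from the SOLR-5325 patch: the class `RetryBudget`, the `ConnectionLossException` type, the linear backoff, and all parameter names are assumptions made for the example.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of a retry budget; not Solr's actual ZooKeeper retry code. */
final class RetryBudget {

    /** Stand-in for a transient "connection loss" error (hypothetical type). */
    static final class ConnectionLossException extends Exception {
        ConnectionLossException(String message) {
            super(message);
        }
    }

    /**
     * Returns the smallest number of retries whose linearly growing waits
     * (1*delayMs, 2*delayMs, ...) add up to more than sessionTimeoutMs + paddingMs.
     */
    static int retryCount(long sessionTimeoutMs, long paddingMs, long delayMs) {
        if (delayMs <= 0) {
            throw new IllegalArgumentException("delayMs must be positive");
        }
        long budget = sessionTimeoutMs + paddingMs;
        long waited = 0;
        int retries = 0;
        while (waited <= budget) {
            retries++;
            waited += retries * delayMs;
        }
        return retries;
    }

    /**
     * Runs op, retrying only on ConnectionLossException. Before retry number n
     * it sleeps n * delayMs. Any other exception propagates immediately; the
     * last connection loss is rethrown once the retries are used up.
     */
    static <T> T retryOnConnectionLoss(Callable<T> op, int retries, long delayMs) throws Exception {
        if (retries < 0) {
            throw new IllegalArgumentException("retries must not be negative");
        }
        ConnectionLossException last = null;
        for (int attempt = 0; attempt <= retries; attempt++) {
            if (attempt > 0) {
                Thread.sleep(attempt * delayMs);
            }
            try {
                return op.call();
            } catch (ConnectionLossException e) {
                last = e; // transient: keep trying until the budget is spent
            }
        }
        throw last;
    }
}
```

With these assumed numbers, a 15 s session timeout and a 1 s base delay need six retries, because five retries wait exactly 15 s in total and so do not go beyond the timeout. That boundary case is one place where a retry loop could give up with a connection loss before any expiration is seen.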
[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss
[ https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792695#comment-13792695 ]

ASF subversion and git services commented on SOLR-5325:
-------------------------------------------------------

Commit 1531327 from [~markrmil...@gmail.com] in branch 'dev/branches/lucene_solr_4_5'
[ https://svn.apache.org/r1531327 ]

SOLR-5325: raise retry padding a bit
[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss
[ https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792694#comment-13792694 ]

ASF subversion and git services commented on SOLR-5325:
-------------------------------------------------------

Commit 1531325 from [~markrmil...@gmail.com] in branch 'dev/branches/lucene_solr_4_5'
[ https://svn.apache.org/r1531325 ]

SOLR-5325: ZooKeeper connection loss can cause the Overseer to stop processing commands.
[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss
[ https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792689#comment-13792689 ]

ASF subversion and git services commented on SOLR-5325:
-------------------------------------------------------

Commit 1531324 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1531324 ]

SOLR-5325: raise retry padding a bit
[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss
[ https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792688#comment-13792688 ]

ASF subversion and git services commented on SOLR-5325:
-------------------------------------------------------

Commit 1531323 from [~markrmil...@gmail.com] in branch 'dev/trunk'
[ https://svn.apache.org/r1531323 ]

SOLR-5325: raise retry padding a bit
[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss
[ https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792684#comment-13792684 ]

Mark Miller commented on SOLR-5325:
-----------------------------------

I'm still kind of surprised this would happen - we should be retrying on connection loss up to an expiration, at which point we would no longer be the leader. Perhaps the retry period can come up a little short. That may also be part of why it is more difficult for me to reproduce in a test.
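The distinction drawn in the comment above (a connection loss is transient and worth retrying, while a session expiration ends leadership) can be illustrated with a small queue-processing loop. This is a hypothetical sketch, not the Overseer's actual code: `QueueLoopSketch`, the `Outcome` enum, and `maxLossesPerItem` are names invented for the example, and the retry bound exists only to keep the loop finite. A real loop would also wait between retries.

```java
import java.util.Queue;
import java.util.function.Function;

/** Hypothetical sketch of a leader's queue loop; not Solr's Overseer code. */
final class QueueLoopSketch {

    /** Possible results of handling one queue item (hypothetical). */
    enum Outcome { OK, CONNECTION_LOSS, SESSION_EXPIRED }

    /**
     * Processes items from the head of the queue and returns how many succeeded.
     * A connection loss leaves the item in place and tries it again, up to
     * maxLossesPerItem consecutive losses. A session expiration stops the loop
     * at once, since the caller can no longer assume it is the leader.
     */
    static int drain(Queue<String> queue, Function<String, Outcome> handler, int maxLossesPerItem) {
        int processed = 0;
        int losses = 0;
        while (!queue.isEmpty()) {
            Outcome outcome = handler.apply(queue.peek());
            if (outcome == Outcome.OK) {
                queue.remove();      // done with this item
                processed++;
                losses = 0;
            } else if (outcome == Outcome.CONNECTION_LOSS) {
                losses++;            // transient: keep the item and retry it
                if (losses > maxLossesPerItem) {
                    break;           // bound reached; only here to keep the sketch finite
                }
            } else {
                break;               // SESSION_EXPIRED: leadership is gone, stop processing
            }
        }
        return processed;
    }
}
```

The point of the sketch is that a single connection loss does not end the loop; only an expiration (or, in this bounded version, an exhausted retry limit) does.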
[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss
[ https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792671#comment-13792671 ]

Mark Miller commented on SOLR-5325:
-----------------------------------

Added some more testing that I thought would catch it, but it has not yet on my system. Still poking around a bit. Anyway, I've committed the fix.
[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss
[ https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792663#comment-13792663 ]

ASF subversion and git services commented on SOLR-5325:
-------------------------------------------------------

Commit 1531315 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1531315 ]

SOLR-5325: ZooKeeper connection loss can cause the Overseer to stop processing commands.
[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss
[ https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792661#comment-13792661 ]

ASF subversion and git services commented on SOLR-5325:
-------------------------------------------------------

Commit 1531313 from [~markrmil...@gmail.com] in branch 'dev/trunk'
[ https://svn.apache.org/r1531313 ]

SOLR-5325: ZooKeeper connection loss can cause the Overseer to stop processing commands.
[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss
[ https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792274#comment-13792274 ]

Mark Miller commented on SOLR-5325:
-----------------------------------

Thanks guys - I'll try to get this in quickly, as it would be great to fix it for 4.5.1.

> zk connection loss causes overseer leader loss
> ----------------------------------------------
>
>                 Key: SOLR-5325
>                 URL: https://issues.apache.org/jira/browse/SOLR-5325
>             Project: Solr
>          Issue Type: Bug
>   Affects Versions: 4.3, 4.4
>            Reporter: Christine Poerschke
>            Assignee: Mark Miller
>         Attachments: SOLR-5325.patch