[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss

2013-10-13 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793666#comment-13793666
 ] 

Mark Miller commented on SOLR-5325:
---

I think the reason this is hard to catch in a test is that we retry on 
connection loss up to the session expiration time. There must be some case, 
though, where we were still getting a connection loss and no expiration. 
This issue should handle that case for this particular bit of code, but as an 
overall precaution I have also bumped up the retries a bit to try to ensure 
they extend beyond the session timeout.
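
To illustrate why the retries need to outlast the session timeout, here is a 
minimal Java sketch. It is not the code from the attached patches: the class 
name, the method names, the linear backoff, and the stand-in exception type 
are all assumptions made for illustration only.

```java
import java.util.concurrent.Callable;

final class RetrySketch {
    /** Hypothetical stand-in for a ZooKeeper connection-loss error. */
    static class ConnectionLossException extends Exception {}

    /**
     * Smallest number of retries whose linear backoff delays
     * (baseDelayMs * 1, baseDelayMs * 2, ...) add up to MORE than the
     * session timeout, plus a few extra "padding" retries on top.
     */
    static int retryCountFor(long sessionTimeoutMs, long baseDelayMs, int padding) {
        if (baseDelayMs <= 0) {
            throw new IllegalArgumentException("baseDelayMs must be positive");
        }
        int retries = 0;
        long totalDelayMs = 0;
        while (totalDelayMs <= sessionTimeoutMs) {
            retries++;
            totalDelayMs += retries * baseDelayMs;
        }
        return retries + padding;
    }

    /**
     * Runs op, retrying on connection loss with linear backoff.
     * Rethrows the last connection-loss error once the attempts are used up.
     */
    static <T> T retryOnConnectionLoss(Callable<T> op, int retryCount, long baseDelayMs)
            throws Exception {
        if (retryCount < 1) {
            throw new IllegalArgumentException("retryCount must be at least 1");
        }
        ConnectionLossException last = null;
        for (int attempt = 1; attempt <= retryCount; attempt++) {
            try {
                return op.call();
            } catch (ConnectionLossException e) {
                last = e;
                Thread.sleep(attempt * baseDelayMs); // wait longer after each failure
            }
        }
        throw last;
    }
}
```

With a 15000 ms session timeout and a 1500 ms base delay, the first four 
delays add up to exactly 15000 ms, so the sketch asks for a fifth retry. The 
padding argument adds further attempts on top of that, which is the kind of 
margin meant by going beyond the session timeout.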

> zk connection loss causes overseer leader loss
> --
>
> Key: SOLR-5325
> URL: https://issues.apache.org/jira/browse/SOLR-5325
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.3, 4.4, 4.5
> Reporter: Christine Poerschke
> Assignee: Mark Miller
> Fix For: 4.5.1, 4.6, 5.0
>
> Attachments: SOLR-5325.patch, SOLR-5325.patch, SOLR-5325.patch
>
>
> The problem we saw was that when the Solr overseer leader experienced 
> temporary ZooKeeper connectivity problems, it stopped processing overseer 
> queue events.
> This first happened when quorum within the external ZooKeeper ensemble was 
> lost because too many ZooKeeper nodes had been stopped (similar to 
> SOLR-5199). The second time, there was a sufficient number of ZooKeeper 
> nodes, but they were holding leader elections and thus refused connections 
> (the elections were taking several seconds, we were using the default 
> zookeeper.cnxTimeout=5s value, and it was hit for one ensemble member).
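
The failure described above, a processing loop that gives up on a temporary 
connectivity problem, can be sketched in Java as follows. This is only an 
illustration under stated assumptions, not the change made in the attached 
patches: the class, the three-state leader status, and the LeaderCheck 
interface are hypothetical names. The idea is that an inconclusive leadership 
check should skip one round instead of ending the loop.

```java
import java.util.Queue;

final class OverseerLoopSketch {
    /** Hypothetical result of asking "am I still the leader?". */
    enum LeaderStatus { YES, NO, DONT_KNOW }

    /** Hypothetical stand-in for a leadership check backed by ZooKeeper. */
    interface LeaderCheck {
        LeaderStatus check();
    }

    /**
     * Processes queued events while we are the leader and returns how many
     * were processed. maxIterations only keeps the sketch from looping forever.
     */
    static int run(Queue<String> events, LeaderCheck leaderCheck, int maxIterations) {
        int processed = 0;
        for (int i = 0; i < maxIterations && !events.isEmpty(); i++) {
            LeaderStatus status = leaderCheck.check();
            if (status == LeaderStatus.NO) {
                break; // leadership is definitely gone: stop for good
            }
            if (status == LeaderStatus.DONT_KNOW) {
                continue; // connection trouble: skip this round and ask again
            }
            events.poll(); // we are the leader: consume one event
            processed++;
        }
        return processed;
    }
}
```

A real loop would wait between rounds rather than spin. The sketch leaves that 
out to keep the control flow visible.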



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss

2013-10-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792695#comment-13792695
 ] 

ASF subversion and git services commented on SOLR-5325:
---

Commit 1531327 from [~markrmil...@gmail.com] in branch 
'dev/branches/lucene_solr_4_5'
[ https://svn.apache.org/r1531327 ]

SOLR-5325: raise retry padding a bit




[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss

2013-10-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792694#comment-13792694
 ] 

ASF subversion and git services commented on SOLR-5325:
---

Commit 1531325 from [~markrmil...@gmail.com] in branch 
'dev/branches/lucene_solr_4_5'
[ https://svn.apache.org/r1531325 ]

SOLR-5325: ZooKeeper connection loss can cause the Overseer to stop processing 
commands.




[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss

2013-10-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792689#comment-13792689
 ] 

ASF subversion and git services commented on SOLR-5325:
---

Commit 1531324 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1531324 ]

SOLR-5325: raise retry padding a bit




[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss

2013-10-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792688#comment-13792688
 ] 

ASF subversion and git services commented on SOLR-5325:
---

Commit 1531323 from [~markrmil...@gmail.com] in branch 'dev/trunk'
[ https://svn.apache.org/r1531323 ]

SOLR-5325: raise retry padding a bit




[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss

2013-10-11 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792684#comment-13792684
 ] 

Mark Miller commented on SOLR-5325:
---

I'm still kind of surprised this would happen. We should be retrying on 
connection loss up to a session expiration, at which point we would no longer 
be the leader. Perhaps the total retry time can come up a little short. That 
may also be part of why it is more difficult for me to reproduce in a test.




[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss

2013-10-11 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792671#comment-13792671
 ] 

Mark Miller commented on SOLR-5325:
---

Added some more testing that I thought would catch it, but it has not caught 
it yet on my system. Still poking around a bit.

Anyway, I've committed the fix.




[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss

2013-10-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792663#comment-13792663
 ] 

ASF subversion and git services commented on SOLR-5325:
---

Commit 1531315 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1531315 ]

SOLR-5325: ZooKeeper connection loss can cause the Overseer to stop processing 
commands.




[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss

2013-10-11 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792661#comment-13792661
 ] 

ASF subversion and git services commented on SOLR-5325:
---

Commit 1531313 from [~markrmil...@gmail.com] in branch 'dev/trunk'
[ https://svn.apache.org/r1531313 ]

SOLR-5325: ZooKeeper connection loss can cause the Overseer to stop processing 
commands.




[jira] [Commented] (SOLR-5325) zk connection loss causes overseer leader loss

2013-10-10 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792274#comment-13792274
 ] 

Mark Miller commented on SOLR-5325:
---

Thanks guys - I'll try to get this in quickly, as it would be great to fix it 
for 4.5.1.

> zk connection loss causes overseer leader loss
> --
>
> Key: SOLR-5325
> URL: https://issues.apache.org/jira/browse/SOLR-5325
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.3, 4.4
> Reporter: Christine Poerschke
> Assignee: Mark Miller
> Attachments: SOLR-5325.patch
>