[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107382#comment-14107382 ] Jessica Cheng Mallet commented on SOLR-6405: OK, thanks for the explanations! It's really helpful. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107370#comment-14107370 ] Mark Miller commented on SOLR-6405: --- SOLR-6409 Improve ZkCmdExecutor retry implementation. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107325#comment-14107325 ] Mark Miller commented on SOLR-6405: --- bq. we can really just retry "forever" without limiting the loop count You can get other exceptions as well though. The original implementation around this waited forever I think and it caused a lot of issues in some cases - perhaps that was mainly in failure cases. I don't really remember the details currently and a lot has changed since then. bq. The advantage of this approach is to never sleep for too long before finding out the definitive answer of success or SessionExpired Yeah, I guess it's just not been a big concern of mine since the extra wait is not really enough to matter, and the worst cases is not common or that bad either. This should be an exceptional case and either it's a quick blip or a really expensive recovery anyway. I didn't really see it as a goal to get out as fast as possible because there didn't seem to be much benefit, just to ensure we waited longer than expiration. If you want to explore this alternative approach, let's open a new JIRA for it. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107289#comment-14107289 ] Jessica Cheng Mallet commented on SOLR-6405: Right, most likely the first time it hits the ConnectionLoss it's not time=0 of the connection loss, so by loop i=4, it would've slept for 15s since the i=0 and therefore hit a SessionExpired. But then, thinking about it again, why be clever at all about the padding or back-off? Not to propose that we change this now, but let's pretend we don't do back-off and just sleep 1s between each loop. If we were to get ConnectionLoss back in the next attempt, there's no harm to try at all because if we're disconnected, the attempt wouldn't be hitting zookeeper anyway. If we were to get SessionExpired back, great, we can break out now and throw the exception. If we've reconnected, then yay, we succeeded. Because with each call we're expecting to get either success, failure (SessionExpired), or "in progress" (ConnectionLoss), we can really just retry "forever" without limiting the loop count (unless we're worried that somehow we'll keep getting ConnectionLoss even though the session has expired, but that'd be a pretty serious zookeeper client bug. And if we're really worried about that, we can always say do 10 more loops after we have slept a total of timeout already). In the end, it's really weird that this method should ever semantically allow throwing a ConnectionLoss exception, if we got the math wrong, because the intent is to retry until we get a SessionExpired, isn't it? > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107273#comment-14107273 ] ASF subversion and git services commented on SOLR-6405: --- Commit 1619895 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1619895 ] SOLR-6405: Update comment around retry pause padding. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107270#comment-14107270 ] Mark Miller commented on SOLR-6405: --- bq. I'm much more worried about doing too little than too much Also, keep mind, it only keeps retrying on connection loss - it's going to end up getting the session expiration which will usually cause it to bail before waiting the full timeout. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107261#comment-14107261 ] ASF subversion and git services commented on SOLR-6405: --- Commit 1619892 from [~markrmil...@gmail.com] in branch 'dev/trunk' [ https://svn.apache.org/r1619892 ] SOLR-6405: Update comment around retry pause padding. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107257#comment-14107257 ] ASF subversion and git services commented on SOLR-6405: --- Commit 1619891 from [~markrmil...@gmail.com] in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1619891 ] SOLR-6405: Remove unnecessary attemptCount > 0 check. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107250#comment-14107250 ] Mark Miller commented on SOLR-6405: --- bq. Is that too much? I'm much more worried about doing too little than too much, that is why it was done that way. Perhaps the padding could be brought down (it was probably brought up to combat a sign of this bug - it was 1200 initially I think), but I think we want to error on the long wait side. bq. // 1500 ms over for padding Hmm, actually 500 over was right. It's mean to be a second, so anything over 1000 is the padding. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107242#comment-14107242 ] Jessica Cheng Mallet commented on SOLR-6405: {quote} but not so long as we are waiting for no reason and tying up threads {quote} This is actually the other thing that I was worried about. With the padding being on a multiplier, for the default 15s timeout, we're already doing 7.5s of total extra sleep (1.5+3+4.5+6+7.5=22.5). Is that too much? With your change of the comment "// 1500 ms over for padding", did you actually mean to do something like (attemptCount + 1) * 1000 + retryDelay? > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107236#comment-14107236 ] ASF subversion and git services commented on SOLR-6405: --- Commit 1619886 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1619886 ] SOLR-6405: Remove unnecessary attemptCount > 0 check. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107222#comment-14107222 ] ASF subversion and git services commented on SOLR-6405: --- Commit 1619879 from [~markrmil...@gmail.com] in branch 'dev/trunk' [ https://svn.apache.org/r1619879 ] SOLR-6405: Remove unnecessary attemptCount > 0 check. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107207#comment-14107207 ] Mark Miller commented on SOLR-6405: --- bq. Should the "if (attemptCount > 0)" check be removed in retryDelay, now that the sleep is (attemptCount + 1) * retryDelay? Yeah, good catch. bq. I think in practice we'd never miss the initial 1.5s sleep since the padding on the retryDelay is enough to make up for it, Agreed. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107199#comment-14107199 ] Mark Miller commented on SOLR-6405: --- bq. Is this right? Yeah, that was the idea basically - a fall back timeout that has some padding to ensure we try longer than the session timeout, but not so long as we are waiting for no reason and tying up threads. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107153#comment-14107153 ] Jessica Cheng Mallet commented on SOLR-6405: Should the "if (attemptCount > 0)" check be removed in retryDelay, now that the sleep is (attemptCount + 1) * retryDelay? I think in practice we'd never miss the initial 1.5s sleep since the padding on the retryDelay is enough to make up for it, but it's slightly harder to reason about. (The way I thought it worked was that roughly retry count is calculated so 1+2+3+4+...+retryCount ~= timeoutSec, so when that's multiplied by 1.5x (the retryDelay), we have the timeout covered. Is this right?) > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Fix For: 5.0, 4.10 > > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106883#comment-14106883 ] ASF subversion and git services commented on SOLR-6405: --- Commit 1619817 from [~markrmil...@gmail.com] in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1619817 ] SOLR-6405: ZooKeeper calls can easily not be retried enough on ConnectionLoss. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106849#comment-14106849 ] ASF subversion and git services commented on SOLR-6405: --- Commit 1619811 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1619811 ] SOLR-6405: ZooKeeper calls can easily not be retried enough on ConnectionLoss. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
[ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106847#comment-14106847 ] ASF subversion and git services commented on SOLR-6405: --- Commit 1619810 from [~markrmil...@gmail.com] in branch 'dev/trunk' [ https://svn.apache.org/r1619810 ] SOLR-6405: ZooKeeper calls can easily not be retried enough on ConnectionLoss. > ZooKeeper calls can easily not be retried enough on ConnectionLoss. > --- > > Key: SOLR-6405 > URL: https://issues.apache.org/jira/browse/SOLR-6405 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Critical > Attachments: SOLR-6405.patch > > > The current design requires that we are sure we retry on connection loss > until session expiration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org