[ https://issues.apache.org/jira/browse/SOLR-11932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352678#comment-16352678 ]
Shalin Shekhar Mangar edited comment on SOLR-11932 at 2/5/18 5:41 PM:
----------------------------------------------------------------------

ZkCmdExecutor cannot work on session expiry because the command sent to the ZkCmdExecutor has an instance of SolrZooKeeper which is no longer usable after expiry. A new SolrZooKeeper instance must be created after session expiry. The right way to handle session expiry is to use the OnReconnect hook in ZkController.addOnReconnectListener and re-initialize state as needed.

was (Author: shalinmangar):
ZkCmdExecutor cannot work on session expiry because the command sent to the ZkCmdExecutor has an instance of SolrZooKeeper instance is no longer usable after expiry. A new SolrZooKeeper instance must be created after session expiry. The right way to handle session expiry is to use the OnReconnect hook in ZkController.addOnReconnectListener and re-initialize state as needed.

> ZkCmdExecutor: Retry ZkOperation on SessionExpired
> ---------------------------------------------------
>
>                 Key: SOLR-11932
>                 URL: https://issues.apache.org/jira/browse/SOLR-11932
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>    Affects Versions: 7.2
>            Reporter: John Gallagher
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>         Attachments: SessionExpiredLog.txt, zk_retry.patch
>
>
> We are seeing situations where an operation, such as changing a replica's
> state to active after a recovery, fails because the zk session has expired.
> However, these operations seem like they are retryable, because the
> ZookeeperConnect receives an event that the session expired and tries to
> reconnect.
> That makes the SessionExpired handling scenario seem very similar to the
> ConnectionLoss handling scenario, so the ZkCmdExecutor seems like it could
> handle them in the same way.
>
> Here's an example stack trace with some slight redactions:
> [^SessionExpiredLog.txt] In this case, a zk operation (a read) failed with a
> SessionExpired event, which seems retriable. The exception kicked off a
> reconnection, but seems like the subsequent operation, (publishing as active)
> failed (perhaps it was using a stale connection handle at that point?)
>
> Regardless, the watch mechanism that reestablishes connection on
> SessionExpired seems sufficient to allow the ZkCmdExecutor to retry that
> operation at a later time and have hope of succeeding.
>
> I have included a simple patch we are trying that catches both exceptions
> instead of just ConnectionLossException: [^zk_retry.patch]
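
The point made in the comment at the top of this message can be illustrated with a minimal sketch. Every type here is a hypothetical stand-in, not a Solr or ZooKeeper class: FakeZkHandle plays the role of a SolrZooKeeper instance, FakeController the component that owns and replaces it, ReconnectListener a hook in the spirit of OnReconnect, and IllegalStateException stands in for the session-expired exception. Under those assumptions, an operation that captured a handle before expiry fails on every retry, while state rebuilt from a reconnect listener sees the new handle.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: all nested types are hypothetical stand-ins. */
class SessionExpiryDemo {

    /** Stand-in for a client handle that is bound to a single session. */
    static final class FakeZkHandle {
        private boolean expired;

        void expire() { expired = true; }

        String read(String path) {
            // IllegalStateException stands in for a session-expired error.
            if (expired) throw new IllegalStateException("session expired");
            return "data@" + path;
        }
    }

    /** Stand-in for an operation that closes over a handle when it is built. */
    interface Op { String run(); }

    /** Stand-in for a reconnect hook that receives the fresh handle. */
    interface ReconnectListener { void onReconnect(FakeZkHandle fresh); }

    /** Stand-in for the owner of the handle; replaces it after expiry. */
    static final class FakeController {
        private FakeZkHandle handle = new FakeZkHandle();
        private final List<ReconnectListener> listeners = new ArrayList<>();

        FakeZkHandle handle() { return handle; }

        void addOnReconnectListener(ReconnectListener l) { listeners.add(l); }

        void expireAndReconnect() {
            handle.expire();               // the old handle is dead for good
            handle = new FakeZkHandle();   // a new handle must be created
            for (ReconnectListener l : listeners) l.onReconnect(handle);
        }
    }

    /** Runs op up to the given number of attempts; null if all of them fail. */
    static String retry(Op op, int attempts) {
        for (int i = 0; i < attempts; i++) {
            try {
                return op.run();
            } catch (IllegalStateException e) {
                // Retrying re-runs the same op, which still holds the old handle.
            }
        }
        return null;
    }

    /** Returns { result of retrying the stale op, result from the listener }. */
    static String[] demo() {
        FakeController controller = new FakeController();

        FakeZkHandle captured = controller.handle();
        Op staleOp = () -> captured.read("/state");

        String[] rebuilt = new String[1];
        controller.addOnReconnectListener(fresh -> rebuilt[0] = fresh.read("/state"));

        controller.expireAndReconnect();
        return new String[] { retry(staleOp, 3), rebuilt[0] };
    }

    public static void main(String[] args) {
        String[] r = demo();
        System.out.println("retry with stale handle: " + r[0]);
        System.out.println("re-initialized in listener: " + r[1]);
    }
}
```

In this sketch the retry loop is not at fault. The operation it re-runs is bound to a handle that can never recover, which is why the state has to be rebuilt from the reconnect hook.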