[
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shalin Shekhar Mangar updated SOLR-7819:
----------------------------------------
Attachment: SOLR-7819.patch
Here's a patch which:
# Adds retryOnConnLoss to ZkController's
ensureReplicaInLeaderInitiatedRecovery, updateLeaderInitiatedRecoveryState and
markShardAsDownIfLeader methods.
# Starts a LIR thread if the leader cannot mark the replica as down because of
a connection loss. Earlier, both session loss and connection loss would skip
starting the LIR thread.
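To make the role of the retryOnConnLoss flag concrete, here is a minimal, self-contained sketch of retry-versus-fail-fast behavior. It is not the ZkController code: the class RetryOnConnLossSketch, the ConnectionLossException type, the execute helper and the maxAttempts parameter are all hypothetical names made up for illustration, and it assumes that the indexing path passes retryOnConnLoss=false.

```java
import java.util.function.Supplier;

// Illustrative sketch only; none of these names exist in Solr.
final class RetryOnConnLossSketch {

    /** Hypothetical stand-in for a ZooKeeper connection-loss error. */
    static final class ConnectionLossException extends RuntimeException {
        ConnectionLossException(String message) {
            super(message);
        }
    }

    /**
     * Runs zkOp. With retryOnConnLoss=true the call is retried up to
     * maxAttempts times; with retryOnConnLoss=false the first connection
     * loss is rethrown immediately, so the caller never blocks on retries.
     */
    static <T> T execute(Supplier<T> zkOp, boolean retryOnConnLoss, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                return zkOp.get();
            } catch (ConnectionLossException e) {
                if (!retryOnConnLoss || attempt >= maxAttempts) {
                    throw e;
                }
                // A real client would wait/back off here; that waiting is
                // what makes a calling thread appear to hang.
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated ZK operation that fails twice, then succeeds.
        Supplier<String> flaky = () -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new ConnectionLossException("simulated blip");
            }
            return "marked down";
        };

        System.out.println("with retries: " + execute(flaky, true, 5));

        calls[0] = 0;
        try {
            execute(flaky, false, 5);
        } catch (ConnectionLossException e) {
            System.out.println("fail-fast after " + calls[0] + " attempt(s)");
        }
    }
}
```

With retries enabled, the caller only sees the result once the connection recovers; with retries disabled, it gets the exception right away and can decide what to do next (in this patch, start the LIR thread).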
I'm still running Solr's integration and Jepsen tests.
The patch causes a subtle change in behavior, which is best analyzed with two
different scenarios:
# Leader fails to send an update to a replica but also suffers a temporary blip
in its ZK connection during the DistributedUpdateProcessor's doFinish method.
## Currently, a few indexing threads will hang but eventually succeed in
marking the 'replica' as down and the leader will start a new LIR thread to ask
the replica to recover.
## With this patch, the indexing threads do not hang; instead, a connection
loss exception is thrown. At this point, we start a new LIR thread to ask the
replica to recover. Although this removes the safety of explicitly marking the
'replica' as down, the LIR thread does provide a timeout-based safety net that
makes sure the replica recovers from the leader.
# Leader fails to send an update to a replica but also suffers a long network
partition between itself and the ZK server during the DUP.doFinish method.
## Currently, a few indexing threads will hang in
ZkController.ensureReplicaInLeaderInitiatedRecovery until the ZK operations
time out because of connection loss or session loss and no LIR thread will be
created. This seems okay because the current connection loss timeout value is
higher than the ZK session expiration time, and session loss means that ZK has
already determined that our session has expired. In both cases, a new leader
election should have happened and there's no need to mark the replica as 'down'.
## With this patch, the difference is that the indexing threads do not hang and
ensureReplicaInLeaderInitiatedRecovery fails immediately with a connection
loss exception. A new LIR thread *is* started in this scenario. This is also
fine: we were not able to mark the replica as 'down' and we aren't sure that
the session has expired, so it is important that we start the LIR thread to ask
the replica to recover. Even if a new leader has been elected, there's no major
harm done by asking the replica to recover.
So, on balance, this patch doesn't seem to introduce any new problems in the
system.
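The scenarios above boil down to a small decision table: when does the leader start a LIR thread, depending on how the attempt to mark the replica as down ended? The sketch below encodes my reading of the description. The class LirDecisionSketch, the ZkOutcome enum and both method names are hypothetical, and the assumption that session expiry still skips the LIR thread with the patch is inferred from the patch summary rather than stated outright.

```java
// Illustrative sketch only; none of these names exist in Solr.
final class LirDecisionSketch {

    /** How the attempt to mark the replica as 'down' in ZK ended. */
    enum ZkOutcome { MARKED_DOWN, CONNECTION_LOSS, SESSION_EXPIRED }

    /** Before the patch: any ZK failure skips the LIR thread. */
    static boolean startsLirThreadBeforePatch(ZkOutcome outcome) {
        return outcome == ZkOutcome.MARKED_DOWN;
    }

    /**
     * With the patch: a connection loss no longer skips the LIR thread,
     * because we cannot be sure the session has expired. Session expiry
     * is assumed to still skip it (a new leader should have been elected).
     */
    static boolean startsLirThreadWithPatch(ZkOutcome outcome) {
        switch (outcome) {
            case MARKED_DOWN:
            case CONNECTION_LOSS:
                return true;
            case SESSION_EXPIRED:
                return false;
            default:
                throw new IllegalStateException("unknown outcome: " + outcome);
        }
    }

    public static void main(String[] args) {
        for (ZkOutcome outcome : ZkOutcome.values()) {
            System.out.println(outcome
                + ": before=" + startsLirThreadBeforePatch(outcome)
                + ", with patch=" + startsLirThreadWithPatch(outcome));
        }
    }
}
```

The only row that changes is CONNECTION_LOSS, which matches the claim that the behavioral change is limited to the case where the leader cannot tell whether its session is still alive.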
> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect
> retryOnConnLoss
> ------------------------------------------------------------------------------------
>
> Key: SOLR-7819
> URL: https://issues.apache.org/jira/browse/SOLR-7819
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 5.2, 5.2.1
> Reporter: Shalin Shekhar Mangar
> Labels: Jepsen
> Fix For: 5.3, Trunk
>
> Attachments: SOLR-7819.patch
>
>
> SOLR-7245 added a retryOnConnLoss parameter to
> ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads
> do not hang during a partition on ZK operations. However, some of those
> changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1 where a hung update managed
> to put a leader into a 'down' state (I'm still investigating and will open a
> separate issue about this problem).