[jira] [Created] (SOLR-10889) Stale ZooKeeper information is used during failover check

Mihaly Toth (JIRA) Wed, 14 Jun 2017 08:49:13 -0700

Mihaly Toth created SOLR-10889:
----------------------------------

             Summary: Stale ZooKeeper information is used during failover check
                 Key: SOLR-10889
                 URL: https://issues.apache.org/jira/browse/SOLR-10889
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
    Affects Versions: master (7.0)
            Reporter: Mihaly Toth



{{OverseerAutoReplicaFailoverThread}} iterates over every replica to check 
whether it needs to be reloaded on a new node. In each such round it reads the 
cluster state only once, at the beginning. The cluster state may change while 
the replicas are being iterated, especially in big clusters. As a result, wrong 
decisions may be made: restarting a healthy core, or failing to handle a bad 
node.

The code fragment in question:
{code}
        for (Slice slice : slices) {
          if (slice.getState() == Slice.State.ACTIVE) {
            final Collection<DownReplica> downReplicas = new ArrayList<DownReplica>();
            int goodReplicas = findDownReplicasInSlice(clusterState, docCollection, slice, downReplicas);
{code}

The solution seems rather straightforward, namely re-reading the cluster state for every slice:
{code}
            int goodReplicas = findDownReplicasInSlice(zkStateReader.getClusterState(), docCollection, slice, downReplicas);
{code}
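
To illustrate the difference between the two fragments, here is a minimal, 
self-contained sketch. It does not use any Solr classes: the class 
StaleStateSketch, the method findDown, and the fake recoveringReader are all 
hypothetical stand-ins, and the cluster state is reduced to a map from replica 
name to a "live" flag. The Supplier plays the role that 
zkStateReader.getClusterState() plays in the real code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Simplified model of the failover check; none of these types are Solr's. */
class StaleStateSketch {

    /** Returns the replicas judged to be down (state maps replica name to "is live"). */
    static List<String> findDown(List<String> replicas,
                                 Supplier<Map<String, Boolean>> stateReader,
                                 boolean rereadPerReplica) {
        // Read once at the start of the round, as the current code does.
        Map<String, Boolean> snapshot = stateReader.get();
        List<String> down = new ArrayList<>();
        for (String replica : replicas) {
            // With rereadPerReplica the state is fetched again for every decision.
            Map<String, Boolean> state = rereadPerReplica ? stateReader.get() : snapshot;
            if (!state.getOrDefault(replica, false)) {
                down.add(replica);
            }
        }
        return down;
    }

    /** Fake reader: the first read reports r2 as down, every later read reports it live. */
    static Supplier<Map<String, Boolean>> recoveringReader() {
        int[] reads = {0};
        return () -> {
            Map<String, Boolean> state = new HashMap<>();
            state.put("r1", true);
            state.put("r2", reads[0] > 0);
            reads[0]++;
            return state;
        };
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("r1", "r2");
        // Stale snapshot: r2 recovered in the meantime but is still reported as down.
        System.out.println(findDown(replicas, recoveringReader(), false)); // prints [r2]
        // Fresh read per decision: r2 is seen as live.
        System.out.println(findDown(replicas, recoveringReader(), true));  // prints []
    }
}
```

In this toy model the stale snapshot produces the "restarting a healthy core" 
kind of error; a replica that goes down after the snapshot was taken would be 
missed in the same way.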

The only counterargument that comes to mind is that the cluster state would be 
read too frequently. This naive solution could be refined so that the state is 
re-read only when a bad node is found, but I am not sure such a read 
optimization is necessary.
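
A rough sketch of that optimization, again with hypothetical names 
(ConfirmOnDownSketch, findDownWithConfirmation, countingReader) and a plain map 
standing in for the cluster state: the state is re-read only when a replica 
looks down, and the replica is reported only if the fresh state confirms it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Simplified model of "re-read only on suspicion"; none of these types are Solr's. */
class ConfirmOnDownSketch {

    static List<String> findDownWithConfirmation(List<String> replicas,
                                                 Supplier<Map<String, Boolean>> stateReader) {
        Map<String, Boolean> state = stateReader.get(); // initial read for the round
        List<String> down = new ArrayList<>();
        for (String replica : replicas) {
            if (state.getOrDefault(replica, false)) {
                continue; // looks live in the state we already have, no extra read
            }
            // Replica looks down: refresh the state and confirm before acting.
            state = stateReader.get();
            if (!state.getOrDefault(replica, false)) {
                down.add(replica);
            }
        }
        return down;
    }

    /** Fake reader that counts reads; r2 is down only in the first read, r3 in every read. */
    static Supplier<Map<String, Boolean>> countingReader(AtomicInteger reads) {
        return () -> {
            int n = reads.getAndIncrement();
            Map<String, Boolean> state = new HashMap<>();
            state.put("r1", true);
            state.put("r2", n > 0);
            state.put("r3", false);
            state.put("r4", true);
            return state;
        };
    }

    public static void main(String[] args) {
        AtomicInteger reads = new AtomicInteger();
        List<String> down = findDownWithConfirmation(
                Arrays.asList("r1", "r2", "r3", "r4"), countingReader(reads));
        System.out.println(down + " after " + reads.get() + " reads"); // prints [r3] after 3 reads
    }
}
```

Note that this variant only guards against acting on a replica that has 
already recovered. A replica that goes down after the last read would still be 
missed until the next round, so it trades some accuracy for fewer reads.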

I have written some unit tests around this class, mocking out even the time 
factor, so they run in about a second. I would appreciate feedback on this 
approach. I will upload a patch with this shortly.
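
For readers unfamiliar with mocking time, one common way to do it is to route 
all clock reads and sleeps through an injectable interface. The sketch below 
is only an illustration of that general technique and is not taken from the 
patch: TimeSource, FakeTime, and runRounds are hypothetical names.

```java
/** Illustration of an injectable time source; not the code from the patch. */
class FakeTimeSketch {

    /** Hypothetical seam: the loop asks this interface for the time and for waiting. */
    interface TimeSource {
        long nanoTime();
        void sleep(long millis);
    }

    /** Test double: sleeping only advances a counter, so no real time passes. */
    static class FakeTime implements TimeSource {
        private long nowNanos = 0;

        @Override
        public long nanoTime() {
            return nowNanos;
        }

        @Override
        public void sleep(long millis) {
            nowNanos += millis * 1_000_000L;
        }
    }

    /** Runs check rounds until timeoutMillis has elapsed on the given time source. */
    static int runRounds(TimeSource time, long delayMillis, long timeoutMillis, Runnable check) {
        long deadline = time.nanoTime() + timeoutMillis * 1_000_000L;
        int rounds = 0;
        while (time.nanoTime() < deadline) {
            check.run();
            rounds++;
            time.sleep(delayMillis); // instantaneous with FakeTime
        }
        return rounds;
    }

    public static void main(String[] args) {
        // Simulates 60 seconds of 10-second rounds without actually waiting.
        int rounds = runRounds(new FakeTime(), 10_000, 60_000, () -> { });
        System.out.println(rounds); // prints 6
    }
}
```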



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

