[ 
https://issues.apache.org/jira/browse/SOLR-10889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057705#comment-16057705
 ] 

Mihaly Toth commented on SOLR-10889:
------------------------------------

bq. The only counter argument that comes into my mind is too frequent reading 
of the cluster state. We can enhance this naive solution so that re-reading is 
done only if a bad node is found. But I am not sure if such a read optimization 
is necessary.
Actually, looking into {{ZkStateReader}}, no network activity is involved 
when reading the cluster state. So there is little argument against using 
the latest cluster state instead of a stale one.
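
To illustrate the difference, here is a minimal, self-contained sketch. It does not use the real Solr classes: {{StateCache}}, {{markDown}}, {{findDown}} and the replica names are hypothetical stand-ins, and the only assumption is that reading the state is a cheap in-memory lookup. A replica goes down in the middle of the iteration; the loop that keeps its initial snapshot misses it, while the loop that re-reads the state sees it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FailoverCheckSketch {

    /** Hypothetical stand-in for a reader that serves cluster state from memory. */
    static final class StateCache {
        // replica name -> true if the replica is live
        private volatile Map<String, Boolean> live;

        StateCache(Map<String, Boolean> initial) {
            this.live = Collections.unmodifiableMap(new HashMap<>(initial));
        }

        /** Plain field read: no network round trip in this sketch. */
        Map<String, Boolean> getClusterState() {
            return live;
        }

        /** Publishes a new immutable state, as some background update might. */
        void markDown(String replica) {
            Map<String, Boolean> next = new HashMap<>(live);
            next.put(replica, false);
            live = Collections.unmodifiableMap(next);
        }
    }

    /** Returns the replicas considered down, using either one snapshot or fresh reads. */
    static List<String> findDown(StateCache cache, List<String> replicas,
                                 boolean rereadEachTime, Runnable afterEachCheck) {
        Map<String, Boolean> snapshot = cache.getClusterState(); // read once, up front
        List<String> down = new ArrayList<>();
        for (String replica : replicas) {
            Map<String, Boolean> state = rereadEachTime ? cache.getClusterState() : snapshot;
            if (!state.getOrDefault(replica, false)) {
                down.add(replica);
            }
            afterEachCheck.run(); // simulates the cluster changing mid-iteration
        }
        return down;
    }

    /** Both replicas start live; r2 goes down right after r1 has been checked. */
    static List<String> demo(boolean rereadEachTime) {
        Map<String, Boolean> initial = new HashMap<>();
        initial.put("r1", true);
        initial.put("r2", true);
        StateCache cache = new StateCache(initial);
        return findDown(cache, Arrays.asList("r1", "r2"), rereadEachTime,
                () -> cache.markDown("r2"));
    }

    public static void main(String[] args) {
        System.out.println("snapshot:   " + demo(false)); // prints "snapshot:   []"
        System.out.println("re-reading: " + demo(true));  // prints "re-reading: [r2]"
    }
}
```

In this toy setup the extra reads cost a field access each, which is why the "re-read only if a bad node is found" optimization looks unnecessary as long as the real reader behaves the same way.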

> Stale ZooKeeper information is used during failover check
> ---------------------------------------------------------
>
>                 Key: SOLR-10889
>                 URL: https://issues.apache.org/jira/browse/SOLR-10889
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: master (7.0)
>            Reporter: Mihaly Toth
>            Assignee: Mark Miller
>         Attachments: SOLR-10889.patch
>
>
> {{OverseerAutoReplicaFailoverThread}} iterates over every replica to check 
> whether it needs to be reloaded on a new node. In each such round it reads 
> the cluster state only once, at the beginning. Especially in big clusters, 
> the cluster state may change while the replicas are being iterated. As a 
> result, wrong decisions may be made: restarting a healthy core, or not 
> handling a bad node.
> The code fragment in question:
> {code}
>         for (Slice slice : slices) {
>           if (slice.getState() == Slice.State.ACTIVE) {
>             final Collection<DownReplica> downReplicas = new ArrayList<DownReplica>();
>             int goodReplicas = findDownReplicasInSlice(clusterState, docCollection, slice, downReplicas);
> {code}
> The solution seems rather straightforward: read the state every time.
> {code}
>             int goodReplicas = findDownReplicasInSlice(zkStateReader.getClusterState(), docCollection, slice, downReplicas);
> {code}
> The only counter argument that comes into my mind is too frequent reading of 
> the cluster state. We can enhance this naive solution so that re-reading is 
> done only if a bad node is found. But I am not sure if such a read 
> optimization is necessary.
> I have written some unit tests around this class, mocking out even the time 
> factor. The suite runs in a second. I am interested in getting feedback on 
> this approach. I will upload a patch with this shortly.


