[jira] [Updated] (SOLR-10889) Stale zookeper information is used during failover check

Mihaly Toth (JIRA) Wed, 14 Jun 2017 09:03:46 -0700

     [ https://issues.apache.org/jira/browse/SOLR-10889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Mihaly Toth updated SOLR-10889:
-------------------------------
    Attachment: SOLR-10889.patch

Attached are the unit test and the implementation; the test is the larger of 
the two.
* Time is "mocked" out: an interface is introduced for getting nanoseconds, and 
the test overrides it.
* The test invokes each doWork iteration separately; the endless loop is not 
used.
* Hamcrest matchers are used for collection asserts.
* In the test, updateExecutor executes the code in the calling thread, so there 
is no need to wait for a background thread to complete.
* Core create requests are not actually executed; they are collected into a 
list and verified by the test.
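
For readers without the patch at hand, the test seams listed above can be sketched roughly as follows. This is a minimal illustration, not the code from SOLR-10889.patch: the names NanoTimeSource, FakeTimeSource, SAME_THREAD and Worker are hypothetical, and Worker is only a stand-in for the failover thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

public class FailoverTestSeams {

  /** Hypothetical time seam: code under test asks this instead of System.nanoTime(). */
  interface NanoTimeSource {
    long nanoTime();
  }

  /** Production implementation delegates to the JVM clock. */
  static final NanoTimeSource SYSTEM = System::nanoTime;

  /** Test implementation: time moves only when the test advances it. */
  static final class FakeTimeSource implements NanoTimeSource {
    private long now;

    @Override
    public long nanoTime() {
      return now;
    }

    void advance(long amount, TimeUnit unit) {
      now += unit.toNanos(amount);
    }
  }

  /** Runs every task on the calling thread, so a test never waits for a background thread. */
  static final Executor SAME_THREAD = Runnable::run;

  /** Hypothetical stand-in for the failover worker: it records requests instead of sending them. */
  static final class Worker {
    private final NanoTimeSource time;
    private final Executor executor;
    private final long waitNanos;
    private final long downSince;
    final List<String> createRequests = new ArrayList<>();

    Worker(NanoTimeSource time, Executor executor, long waitNanos) {
      this.time = time;
      this.executor = executor;
      this.waitNanos = waitNanos;
      this.downSince = time.nanoTime();
    }

    /** One iteration; the test calls this directly instead of running an endless loop. */
    void doWork(String downCore) {
      if (time.nanoTime() - downSince >= waitNanos) {
        executor.execute(() -> createRequests.add(downCore));
      }
    }
  }

  public static void main(String[] args) {
    FakeTimeSource clock = new FakeTimeSource();
    Worker worker = new Worker(clock, SAME_THREAD, TimeUnit.SECONDS.toNanos(30));
    worker.doWork("core1");               // wait time not yet elapsed, nothing recorded
    clock.advance(30, TimeUnit.SECONDS);  // the test moves time forward explicitly
    worker.doWork("core1");               // now a create request is recorded
    System.out.println(worker.createRequests);
  }
}
```

Because the clock and the executor are both injected, a test written this way needs no sleeps and no thread synchronization.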
 
Comments are welcome.

> Stale zookeper information is used during failover check
> --------------------------------------------------------
>
>                 Key: SOLR-10889
>                 URL: https://issues.apache.org/jira/browse/SOLR-10889
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: master (7.0)
>            Reporter: Mihaly Toth
>         Attachments: SOLR-10889.patch
>
>
> {{OverseerAutoReplicaFailoverThread}} goes over every replica to check 
> whether it needs to be reloaded on a new node. In each such round it reads 
> the cluster state only once, at the beginning. Especially in big clusters, 
> the cluster state may change while the replicas are being iterated. As a 
> result, wrong decisions may be made: restarting a healthy core, or not 
> handling a bad node.
> The code fragment in question:
> {code}
>         for (Slice slice : slices) {
>           if (slice.getState() == Slice.State.ACTIVE) {
>             final Collection<DownReplica> downReplicas = new ArrayList<DownReplica>();
>             int goodReplicas = findDownReplicasInSlice(clusterState, docCollection, slice, downReplicas);
> {code}
> The solution seems rather straightforward, reading the state every time:
> {code}
>             int goodReplicas = findDownReplicasInSlice(zkStateReader.getClusterState(), docCollection, slice, downReplicas);
> {code}
> The only counterargument that comes to mind is that the cluster state would 
> be read too frequently. We can refine this naive solution so that the state 
> is re-read only when a bad node is found, but I am not sure such a read 
> optimization is necessary.
> I have written some unit tests around this class, mocking out even the time 
> factor. They run in a second. I would appreciate feedback on this approach. 
> I will upload a patch with it shortly.
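
The "re-read only if a bad node is found" variant mentioned in the description could look roughly like the sketch below. It is an illustration under simplifying assumptions, not Solr code: State, findDown and the Supplier standing in for zkStateReader.getClusterState() are all hypothetical, and the cluster state is reduced to a replica-name-to-liveness map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

public class FreshStateSketch {

  /** Hypothetical, simplified cluster state: replica name -> is the replica live? */
  static final class State {
    final Map<String, Boolean> live;

    State(Map<String, Boolean> live) {
      this.live = new LinkedHashMap<>(live);
    }
  }

  /**
   * Returns the replicas considered down. Each replica is first checked against
   * the current snapshot; only when the snapshot reports it as down is a fresh
   * state fetched to confirm the finding.
   */
  static List<String> findDown(List<String> replicas, Supplier<State> reader) {
    State snapshot = reader.get();            // one read at the start of the round
    List<String> down = new ArrayList<>();
    for (String replica : replicas) {
      if (!snapshot.live.getOrDefault(replica, false)) {
        State fresh = reader.get();           // re-read only for a suspected bad replica
        if (!fresh.live.getOrDefault(replica, false)) {
          down.add(replica);
        }
        snapshot = fresh;                     // continue the round with the newer view
      }
    }
    return down;
  }
}
```

Note the trade-off: confirming against a fresh state avoids acting on a replica that has recovered since the snapshot, but a replica that went down after the snapshot was taken is still only noticed in a later round. Reading the state for every slice, as proposed in the description, covers both cases at the cost of more reads.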



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)