[ 
https://issues.apache.org/jira/browse/SOLR-14928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270726#comment-17270726
 ] 

Ilan Ginzburg commented on SOLR-14928:
--------------------------------------

I was working under the (incorrect) assumption that cluster state updates could
be distributed independently of Collection API processing. That assumption
rested on another false one: that all state updates originate in Collection API
calls, so that as long as all of these run on a single node (the Overseer), each
would see the state as updated by previously executed Collection API commands
for a given collection.

As I progress on distributing cluster state updates and make sure all tests
pass, I realize this assumption does not hold. Although _most_ cluster state
updates originate in Collection API commands, *some do not*.

More specifically, in {{ZkController}} there are three operations that trigger
cluster state changes:
* Registering a core with Overseer and the cluster state ({{publish()}}),
* Unregistering a core ({{unregister()}}),
* Marking a node down by updating the state of all replicas 
({{publishNodeAsDown()}}).

Marking the replicas down unbeknownst to the Overseer is likely ok (SOLR-15052
would have run into issues if it weren't), but registering and unregistering a
core is most likely not ok without further changes.

This might force coupling some Collection API distribution changes into the
cluster state update distribution (for example, forcing a freshness check on the
collection before starting work on it through the Collection API).
Nothing here goes beyond what would be needed anyway to distribute both cluster
state updates and Collection API commands (both need to be distributed to
remove the Overseer), but the separation between the two phases may not be as
clean as I had hoped.
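
To illustrate why a freshness check matters once updates can arrive from
outside the Collection API, here is a minimal sketch. It assumes an in-memory
stand-in for a versioned {{state.json}} node and uses neither ZooKeeper nor
Solr classes; {{CasStateSketch}}, {{VersionedStore}} and {{updateWithRetry}}
are hypothetical names, not existing code.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

public class CasStateSketch {

    /** Immutable snapshot of the state together with the version it was read at. */
    static final class Versioned {
        final String data;
        final int version;

        Versioned(String data, int version) {
            this.data = data;
            this.version = version;
        }
    }

    /** Hypothetical in-memory stand-in for a versioned state.json node. */
    static final class VersionedStore {
        private final AtomicReference<Versioned> current =
                new AtomicReference<>(new Versioned("", 0));

        Versioned read() {
            return current.get();
        }

        /** Writes newData only if nobody has written since expectedVersion was read. */
        boolean compareAndSet(int expectedVersion, String newData) {
            Versioned seen = current.get();
            if (seen.version != expectedVersion) {
                return false; // caller worked from a stale view
            }
            return current.compareAndSet(seen, new Versioned(newData, expectedVersion + 1));
        }
    }

    /** Freshness check plus retry: re-read, apply the change, attempt the conditional write. */
    static int updateWithRetry(VersionedStore store, UnaryOperator<String> change) {
        int attempts = 0;
        while (true) {
            attempts++;
            Versioned fresh = store.read();
            if (store.compareAndSet(fresh.version, change.apply(fresh.data))) {
                return attempts;
            }
        }
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();

        // A Collection API command reads the state and starts planning its change.
        Versioned stale = store.read();

        // Meanwhile a node registers a core, outside of any Collection API command.
        updateWithRetry(store, s -> s + "[replica1:active]");

        // The command's write, based on the stale version, is rejected.
        boolean accepted = store.compareAndSet(stale.version, stale.data + "[shard2]");
        System.out.println("stale write accepted: " + accepted);

        // After re-reading, the command's change is applied on top of the registration.
        updateWithRetry(store, s -> s + "[shard2]");
        Versioned result = store.read();
        System.out.println(result.data + " @ version " + result.version);
    }
}
```

In this sketch the stale write is rejected, and the final state holds both
changes at version 2. Without the version check, the command's write would
have silently dropped the core registration.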

To be continued...

> Remove Overseer ClusterStateUpdater
> -----------------------------------
>
>                 Key: SOLR-14928
>                 URL: https://issues.apache.org/jira/browse/SOLR-14928
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Ilan Ginzburg
>            Assignee: Ilan Ginzburg
>            Priority: Major
>              Labels: cluster, collection-api, overseer
>
> Remove the Overseer {{ClusterStateUpdater}} thread and associated Zookeeper 
> queue at {{<_chroot_>/overseer/queue}}.
> Change cluster state updates so that each (Collection API) command execution 
> does the update directly in Zookeeper using optimistic locking (Compare and 
> Swap on the {{state.json}} Zookeeper files).
> Following this change cluster state updates would still be happening only 
> from the Overseer node (that's where Collection API commands are executing), 
> but the code will be ready for distribution once such commands can be 
> executed by any node (other work done in the context of parent task 
> SOLR-14927).
> See the [Cluster State 
> Updater|https://docs.google.com/document/d/1u4QHsIHuIxlglIW6hekYlXGNOP0HjLGVX5N6inkj6Ok/edit#heading=h.ymtfm3p518c]
>  section in the Removing Overseer doc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
