[ https://issues.apache.org/jira/browse/SOLR-14928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270726#comment-17270726 ]
Ilan Ginzburg commented on SOLR-14928:
--------------------------------------

I was working under the assumption that cluster state updates could be distributed independently of Collection API processing. The reasoning was that all state updates originate in Collection API calls, so as long as those calls all run on a single node (the Overseer), each would see the state as updated by previously executed Collection API commands for a given collection.

As I progress on the cluster state update distribution and work on getting all tests to pass, I realize this assumption does not hold. Although _most_ cluster state updates originate in Collection API commands, *some do not*. More specifically, {{ZkController}} triggers cluster state changes in three cases:
* Registering a core with the Overseer and the cluster state ({{publish()}}),
* Unregistering a core ({{unregister()}}),
* Marking a node down by updating the state of all of its replicas ({{publishNodeAsDown()}}).

Marking the replicas down without the Overseer's view of the state reflecting it is likely ok (SOLR-15052 would have run into issues if it wasn't), but registering and unregistering a core is most likely not ok without further changes.

This might force coupling some Collection API distribution changes into the cluster state update distribution (for example, forcing a freshness check on the collection before starting work on it through the Collection API). Nothing here that wouldn't be needed anyway in order to distribute not only cluster state updates but also Collection API commands (both need to be distributed to remove the Overseer), but the separation between the two phases may not be as clean as I had hoped.

To be continued...

> Remove Overseer ClusterStateUpdater
> -----------------------------------
>
>                 Key: SOLR-14928
>                 URL: https://issues.apache.org/jira/browse/SOLR-14928
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SolrCloud
>            Reporter: Ilan Ginzburg
>            Assignee: Ilan Ginzburg
>            Priority: Major
>              Labels: cluster, collection-api, overseer
>
> Remove the Overseer {{ClusterStateUpdater}} thread and associated Zookeeper
> queue at {{<_chroot_>/overseer/queue}}.
> Change cluster state updates so that each (Collection API) command execution
> does the update directly in Zookeeper using optimistic locking (Compare and
> Swap on the {{state.json}} Zookeeper files).
> Following this change cluster state updates would still be happening only
> from the Overseer node (that's where Collection API commands are executing),
> but the code will be ready for distribution once such commands can be
> executed by any node (other work done in the context of parent task
> SOLR-14927).
> See the [Cluster State
> Updater|https://docs.google.com/document/d/1u4QHsIHuIxlglIW6hekYlXGNOP0HjLGVX5N6inkj6Ok/edit#heading=h.ymtfm3p518c]
> section in the Removing Overseer doc.
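
For illustration only, the optimistic locking (Compare and Swap on {{state.json}}) mentioned in the issue description can be sketched as a read, modify, conditional-write loop. This is not the code from the issue: {{VersionedNode}} and {{CasStateUpdater}} are hypothetical names, and the node is an in-memory stand-in so the sketch runs without a ZooKeeper ensemble. With the real ZooKeeper Java client, the conditional write would be {{setData(path, data, version)}}, which fails with {{KeeperException.BadVersionException}} when the version does not match.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

/** Hypothetical in-memory stand-in for a versioned node such as state.json. */
final class VersionedNode {
    /** Immutable snapshot of the node content together with its version. */
    static final class Snapshot {
        final String data;
        final int version;

        Snapshot(String data, int version) {
            this.data = data;
            this.version = version;
        }
    }

    private final AtomicReference<Snapshot> current;

    VersionedNode(String initialData) {
        current = new AtomicReference<>(new Snapshot(initialData, 0));
    }

    /** Returns the current content and version, like a read that also returns the node's Stat. */
    Snapshot read() {
        return current.get();
    }

    /** Conditional write: succeeds only if expectedVersion is still the current version. */
    boolean setData(String newData, int expectedVersion) {
        Snapshot seen = current.get();
        if (seen.version != expectedVersion) {
            return false; // someone else updated the node since it was read
        }
        return current.compareAndSet(seen, new Snapshot(newData, seen.version + 1));
    }
}

/** Hypothetical updater: re-reads and retries when the conditional write loses a race. */
final class CasStateUpdater {
    /** Applies the change and returns the number of attempts it took. */
    static int update(VersionedNode node, UnaryOperator<String> change, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Re-reading on every attempt is what keeps the update based on fresh state.
            VersionedNode.Snapshot snapshot = node.read();
            String updated = change.apply(snapshot.data);
            if (node.setData(updated, snapshot.version)) {
                return attempt;
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }
}
```

In this sketch, a writer holding a stale version (for example, a command that read the state before a core registered itself through another path) has its write rejected and must re-read before retrying, which is one way to picture the freshness concern raised in the comment.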