Distributed mode doesn't behave nicely when there are many concurrent updates to a given collection's state.json.
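
To make that contention concrete, here is a toy model. It assumes updates are applied as compare-and-set writes against a version number and retried on a mismatch; that is an assumption for the sketch, not something spelled out in this mail. `VersionedDoc` and `run_lock_step` are made-up names, and no ZooKeeper or Solr API is involved. It counts write attempts when 8 replica updates race against one shared document, and again when the same updates are spread over one document per shard (the split discussed further down).

```python
class VersionedDoc:
    """Hypothetical stand-in for a znode: replica states plus a version counter."""

    def __init__(self):
        self.version = 0
        self.replicas = {}

    def compare_and_set(self, expected_version, replica, state):
        # Reject the write if the doc changed since the caller read it.
        if expected_version != self.version:
            return False
        self.replicas[replica] = state
        self.version += 1
        return True


def run_lock_step(updates, doc_for):
    """Apply updates in worst-case lock-step rounds; return total write attempts.

    updates: list of (shard, replica, state) tuples
    doc_for: function mapping a shard name to the VersionedDoc holding its replicas
    """
    pending = list(updates)
    attempts = 0
    while pending:
        # Every pending writer first reads the current version of its doc...
        reads = [(u, doc_for(u[0]).version) for u in pending]
        retry = []
        # ...then all of them try to write; writers holding a stale version retry.
        for (shard, replica, state), seen in reads:
            attempts += 1
            if not doc_for(shard).compare_and_set(seen, replica, state):
                retry.append((shard, replica, state))
        pending = retry
    return attempts


# 4 shards x 2 replicas, all changing state at once (e.g. after a node restart).
updates = [(f"shard{s}", f"replica{r}", "active") for s in range(4) for r in range(2)]

# Layout 1: every replica lives in one shared document.
single = VersionedDoc()
attempts_single = run_lock_step(updates, lambda shard: single)

# Layout 2: one document per shard.
per_shard = {f"shard{s}": VersionedDoc() for s in range(4)}
attempts_split = run_lock_step(updates, lambda shard: per_shard[shard])

print(attempts_single, attempts_split)  # prints "36 12"
```

In this lock-step worst case the shared document needs 36 write attempts for 8 updates, while the per-shard layout needs 12. Real numbers depend on timing; the sketch only shows the shape of the problem.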
I'd recommend *against* making it the default at this time.

The "root cause" is the presence of replica specific information in state.json. In addition to relatively rare cases of changes to the sharding of the collection, state.json is updated when replicas are created, destroyed, moved or have their properties changed, and, when PRS is not used, also when replicas change state (which happens a lot when a Solr node restarts, for example).

Therefore, before making distributed mode the default, something has to be done. As Pierre suggests, redesign Collection API operations that require multiple updates to be more efficient and group them when executing in distributed mode. Also make sure that smaller operations that happen concurrently are efficient enough.

Another option is to remove replica information from state.json (keep collection metadata and shard definitions there), and create state-*<shardname>*.json for each shard with the replicas of that shard. Contention on anything replica related will be restricted to replicas of the same shard. There will be more watches on ZooKeeper, but they will trigger less often and less data will be read each time. There will also be less data to compress/uncompress each time state.json is written or read (when so configured).

Throttling goes against making SolrCloud as fast as we can.

SolrCloud started with a single clusterstate.json file describing all collections (removed in 9.0), then moved to per collection state.json files for scalability reasons. Maybe the time has come to split that big blob further?

Ilan

On Tue, Sep 30, 2025 at 12:40 AM Chris Hostetter <[email protected]> wrote:

> : I don't think this should prevent shipping a system that is objectively way
> : simpler than the Overseer. Solr 10 will have both modes, no matter what
> : the default is. Changing the default makes it easier to remove it in Solr
> : 11. The impact on ease of understanding SolrCloud in 11 will be amazing!
>
> I'm not understanding your claim that changing a default from A(x) to A(y)
> in 10.0 makes removing A(x) in 11.0 easier?
>
> You could change the default in 10.1, 10.2, etc... and it would still be
> the same amount of effort to remove it in 11.0.
>
> No matter when you change the default, if the *option* to use A(x) still
> exists in all versions < 11.0, then any "removal" of the code implementing
> A(x) in 11.0 still needs to ensure that all versions >= 11.0 have some
> code/process/documentation enabling users to migrate their cluster to A(y)
>
> -Hoss
> http://www.lucidworks.com/
