Distributed mode doesn't behave nicely when there are many concurrent
updates to a given collection's state.json.

I'd recommend *against* making it the default at this time.

The root cause is the presence of replica-specific information in
state.json. Besides the relatively rare changes to the sharding of the
collection, state.json is updated whenever replicas are created, destroyed,
moved, or have their properties changed. When PRS (Per Replica States) is
not used, it is also updated whenever a replica changes state, which happens
a lot when a Solr node restarts, for example.
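
To make the contention concrete, here is a toy model in Python. It assumes
that each update is a conditional write against the node's version
(compare-and-set) and that a writer that loses the race re-reads and
retries. VersionedNode and count_attempts are hypothetical names used for
illustration only; they are not Solr or ZooKeeper APIs.

```python
class VersionedNode:
    """Toy stand-in for a single versioned node such as state.json."""

    def __init__(self):
        self.version = 0
        self.updates = []

    def compare_and_set(self, expected_version, update):
        # Reject the write if someone else has written since the caller read.
        if expected_version != self.version:
            return False
        self.updates.append(update)
        self.version += 1
        return True


def count_attempts(node, writers):
    """All pending writers read the same version, then race; one wins per round."""
    attempts = 0
    pending = list(writers)
    while pending:
        seen = node.version  # every pending writer reads this version
        losers = []
        for writer in pending:
            attempts += 1
            if not node.compare_and_set(seen, writer):
                losers.append(writer)  # must re-read and retry next round
        pending = losers
    return attempts


for n in (2, 4, 8):
    print(n, "writers:", count_attempts(VersionedNode(), range(n)), "write attempts")
```

In this worst-case model, n writers hitting the same node need n(n+1)/2
write attempts in total, so the cost grows quadratically with the number of
concurrent updates to one file.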

Therefore something has to be done before making distributed mode the
default.
One option, as Pierre suggests, is to redesign the Collection API operations
that require multiple updates so that they are more efficient, and to group
those updates when executing in distributed mode. Smaller operations that
happen concurrently must also be efficient enough.
Another option is to remove replica information from state.json (keeping
collection metadata and shard definitions there) and to create a
state-*<shardname>*.json for each shard, holding the replicas of that shard.
Contention on anything replica-related would then be restricted to replicas
of the same shard.
There would be more watches on ZooKeeper, but they would trigger less often
and less data would be read each time. There would also be less data to
compress/uncompress each time state.json is written or read (when so
configured).
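
A rough sketch of what such a split could look like, in Python. The JSON
layout below is a simplified assumption rather than the exact state.json
schema, and split_state is a hypothetical helper, not existing Solr code.

```python
import copy
import json


def split_state(state):
    """Split one collection's state into a slim state.json and per-shard replica docs."""
    slim = copy.deepcopy(state)  # leave the caller's structure untouched
    per_shard = {}
    for shard_name, shard in slim["shards"].items():
        # Shard definition (range, state) stays; replica details move out.
        replicas = shard.pop("replicas", {})
        per_shard[f"state-{shard_name}.json"] = {"replicas": replicas}
    return slim, per_shard


# Simplified, assumed layout of a collection's state.
state = {
    "router": {"name": "compositeId"},
    "shards": {
        "shard1": {
            "range": "80000000-ffffffff",
            "state": "active",
            "replicas": {"core_node1": {"state": "active", "type": "NRT"}},
        },
        "shard2": {
            "range": "0-7fffffff",
            "state": "active",
            "replicas": {"core_node2": {"state": "down", "type": "NRT"}},
        },
    },
}

slim, per_shard = split_state(state)
print(json.dumps(slim, indent=2))
for name, doc in per_shard.items():
    print(name, json.dumps(doc))
```

With this layout, a state change on core_node2 would rewrite only
state-shard2.json, and only watchers of that file would be notified.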

Throttling goes against making SolrCloud as fast as we can.

SolrCloud started with a single clusterstate.json file describing all
collections (removed in 9.0), then moved to per-collection state.json files
for scalability reasons.
Maybe the time has come to split that big blob further?

Ilan

On Tue, Sep 30, 2025 at 12:40 AM Chris Hostetter <[email protected]>
wrote:

>
> : I don't think this should prevent shipping a system that is objectively
> : way simpler than the Overseer.  Solr 10 will have both modes, no matter
> : what the default is.  Changing the default makes it easier to remove it
> : in Solr 11.  The impact on ease of understanding SolrCloud in 11 will be
> : amazing!
>
> I'm not understanding your claim that changing a default from A(x) to A(y)
> in 10.0 makes removing A(x) in 11.0 easier?
>
> You could change the default in 10.1, 10.2, etc... and it would still be
> the same amount of effort to remove it in 11.0.
>
> No matter when you change the default, if the *option* to use A(x) still
> exists in all versions < 11.0, then any "removal" of the code implementing
> A(x) in 11.0 still needs to ensure that all versions >= 11.0 have some
> code/process/documentation enabling users to migrate their cluster to
> A(y).
>
>
> -Hoss
> http://www.lucidworks.com/
