It's hard to give a recommended threshold on collection size for
distributed mode.
I haven't run tests, and it obviously depends on the number of nodes in the
cluster and on how fast everything (including ZooKeeper) runs, but I'd say
that below a couple hundred replicas total for a collection it should be OK.
When a Solr node starts, it marks all its replicas DOWN before marking them
ACTIVE. If PRS (per-replica states) is not used, this can take a long time in
distributed mode and be slower than the Overseer, due to the lack of batching
of state updates.
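
To make the batching point concrete, here is a minimal sketch of the two
update patterns (hypothetical code, not Solr's actual implementation; the
method names and the withReplicaState() helper are made up), using the plain
ZooKeeper client:

    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Sketch: per-replica read-modify-write of a shared state.json znode
    // (distributed mode without PRS) vs. one coalesced write (Overseer-style).
    class StateUpdateSketch {

      // One conditional write per replica; each write can lose the race
      // against other nodes updating the same znode and must retry.
      static void markReplicasDownOneByOne(
          ZooKeeper zk, String statePath, List<String> replicas) throws Exception {
        for (String replica : replicas) {
          while (true) {
            Stat stat = new Stat();
            byte[] state = zk.getData(statePath, false, stat);
            try {
              zk.setData(statePath, withReplicaState(state, replica, "down"),
                  stat.getVersion()); // fails if someone else wrote first
              break;
            } catch (KeeperException.BadVersionException e) {
              // Lost the race: re-read and retry. With many replicas and many
              // concurrent writers, these retries dominate restart time.
            }
          }
        }
      }

      // Overseer-style batching: fold all N changes into a single write.
      static void markReplicasDownBatched(
          ZooKeeper zk, String statePath, List<String> replicas) throws Exception {
        Stat stat = new Stat();
        byte[] state = zk.getData(statePath, false, stat);
        for (String replica : replicas) {
          state = withReplicaState(state, replica, "down");
        }
        zk.setData(statePath, state, stat.getVersion()); // one write for N changes
      }

      // JSON editing elided; a real version would rewrite the replica's
      // "state" field inside the collection's state.json document.
      static byte[] withReplicaState(byte[] stateJson, String replica, String state) {
        throw new UnsupportedOperationException("sketch only");
      }
    }

The point is just that a single znode write can carry N replica-state
changes when updates are funneled through one actor, while independent
writers pay one conditional write (plus retries) per change.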

Indexing and query performance is obviously not impacted by the choice
between distributed mode and the Overseer, unless shard split performance is
considered part of indexing performance.

Ilan

On Tue, Sep 30, 2025 at 10:20 PM David Smiley <[email protected]> wrote:

> Let's imagine (quite realistically) that I'm updating our upgrade notes and
> providing advice for users to choose the Overseer or not.  Given the
> benefits & risks, at what replica count threshold (for the biggest
> collection) would you advise continued use of the Overseer?
>
> Sufficient stability is the characteristic I'm looking for in the above
> question, not performance.  I understand that it'd be ideal if the
> cluster's state was structured differently to improve performance of
> certain administrative operations, but performance is not a requirement for
> stability.  The most important performance considerations our users have
> relate to index & query.  There's a basic assumption that nodes can restart
> in a "reasonable time"... maybe you'd care to try to define that.  I think
> your recommendations around restructuring the cluster state would
> ultimately impact the performance of restarts and some other administrative
> scenarios but shouldn't be a prerequisite for a stable system.
>
> On Tue, Sep 30, 2025 at 4:20 AM Ilan Ginzburg <[email protected]> wrote:
>
> > Distributed mode doesn't behave nicely when there are many concurrent
> > updates to a given collection's state.json.
> >
> > I'd recommend *against* making it the default at this time.
> >
> > The "root cause" is the presence of replica specific information in
> > state.json. In addition to relatively rare cases of changes to the
> sharding
> > of the collection, state.json is updated when replicas are created or
> > destroyed or moved or have their properties changed, and when PRS is not
> > used, also when replicas change state (which happens a lot when a Solr
> node
> > restarts for example).
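> >
> > Schematically, a collection's state.json looks roughly like this (most
> > fields elided; names are illustrative), with every replica's "state"
> > field living inside the single collection-wide document:
> >
> >     {"techproducts": {
> >       "router": {"name": "compositeId"},
> >       "shards": {
> >         "shard1": {
> >           "range": "80000000-ffffffff",
> >           "state": "active",
> >           "replicas": {
> >             "core_node3": {
> >               "core": "techproducts_shard1_replica_n1",
> >               "node_name": "solr-1:8983_solr",
> >               "type": "NRT",
> >               "state": "active"   <- rewritten on every state change
> >             },
> >             ...
> >           }
> >         },
> >         ...
> >       }
> >     }}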
> >
> > Therefore, before making distributed mode the default, something has to
> > be done.
> > One option, as Pierre suggests: redesign Collection API operations that
> > require multiple updates so they are more efficient, grouping their
> > updates when executing in distributed mode. Also make sure that smaller
> > operations happening concurrently are efficient enough.
> > Another option is to remove replica information from state.json (keeping
> > collection metadata and shard definitions there) and create a state-
> > *<shardname>*.json for each shard with the replicas of that shard.
> > Contention on anything replica-related would then be restricted to
> > replicas of the same shard.
> > There would be more watches on ZooKeeper, but they would trigger less
> > often, and less data would be read each time. There would also be less
> > data to compress/uncompress each time state.json is written or read
> > (when compression is configured).
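> >
> > As a rough sketch of the read/watch side under that layout (hypothetical
> > paths and class/method names; plain ZooKeeper client, not actual Solr
> > code):
> >
> >     import org.apache.zookeeper.Watcher;
> >     import org.apache.zookeeper.ZooKeeper;
> >     import org.apache.zookeeper.data.Stat;
> >
> >     // Sketch: with per-shard state files, a node hosting replicas of
> >     // shard1 watches only shard1's znode, not the whole collection state.
> >     class PerShardWatchSketch {
> >       static void watchShard(ZooKeeper zk, String collection, String shard)
> >           throws Exception {
> >         // Proposed (hypothetical) path: /collections/<coll>/state-<shard>.json
> >         String path = "/collections/" + collection + "/state-" + shard + ".json";
> >         Watcher watcher = event -> {
> >           // Fires only when this shard's replicas change; re-read the
> >           // small per-shard znode and re-arm the watch here.
> >         };
> >         byte[] shardState = zk.getData(path, watcher, new Stat());
> >         // shardState holds the shard definition plus only this shard's
> >         // replicas, so there is less to read, parse, and (de)compress.
> >       }
> >     }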
> >
> > Throttling goes against making SolrCloud as fast as we can.
> >
> > SolrCloud started with a single clusterstate.json file describing all
> > collections (removed in 9.0), then moved to per-collection state.json
> > files for scalability reasons.
> > Maybe the time has come to split that big blob further?
> >
> > Ilan
> >
> > On Tue, Sep 30, 2025 at 12:40 AM Chris Hostetter <[email protected]>
> > wrote:
> >
> > >
> > > : I don't think this should prevent shipping a system that is
> > > : objectively way simpler than the Overseer.  Solr 10 will have both
> > > : modes, no matter what the default is.  Changing the default makes it
> > > : easier to remove it in Solr 11.  The impact on ease of understanding
> > > : SolrCloud in 11 will be amazing!
> > >
> > > I'm not understanding your claim that changing a default from A(x) to
> > > A(y) in 10.0 makes removing A(x) in 11.0 easier?
> > >
> > > You could change the default in 10.1, 10.2, etc... and it would still
> > > be the same amount of effort to remove it in 11.0.
> > >
> > > No matter when you change the default, if the *option* to use A(x) still
> > > exists in all versions < 11.0, then any "removal" of the code implementing
> > > A(x) in 11.0 still needs to ensure that all versions >= 11.0 have some
> > > code/process/documentation enabling users to migrate their cluster to
> > > A(y).
> > >
> > >
> > > -Hoss
> > > http://www.lucidworks.com/
> > >
> >
>
