About the particular scenario: couldn't we just limit the number of
in-flight requests?  Cluster admin commands use HttpShardHandler, whose
executor's maximum pool size defaults to Integer.MAX_VALUE.  This is
configurable in solr.xml with "maximumPoolSize", but I'm not certain
whether that same ShardHandler instance is also used for cluster admin
commands.  We ought to have separately configurable ones.
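
To make that knob concrete, here's roughly the difference between the
default and a capped pool (a plain ThreadPoolExecutor sketch, not Solr's
actual HttpShardHandlerFactory code; the pool sizes are made up):

    import java.util.concurrent.*;

    class PoolSizeSketch {
      // Roughly what maximumPoolSize = Integer.MAX_VALUE amounts to: with a
      // SynchronousQueue every submitted request gets its own thread, so
      // nothing bounds the number of in-flight requests.
      static ExecutorService effectivelyUnbounded() {
        return new ThreadPoolExecutor(
            0, Integer.MAX_VALUE, 5, TimeUnit.SECONDS, new SynchronousQueue<>());
      }

      // A capped pool (64 is a made-up number): at most 64 requests run at
      // once and the rest wait in the queue.
      static ExecutorService capped() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            64, 64, 5, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        pool.allowCoreThreadTimeOut(true);
        return pool;
      }
    }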

I don't think this should prevent shipping a system that is objectively far
simpler than the Overseer.  Solr 10 will have both modes, no matter what
the default is.  Changing the default makes it easier to remove the Overseer
in Solr 11.  The impact on ease of understanding SolrCloud in 11 will be
substantial!  That benefit isn't visible yet because we can't remove the
needless ZkNodeProps admin command conversions that exist only because we
put commands into ZK queues.

~ David

On Mon, Sep 29, 2025 at 9:00 AM Pierre Salagnac <[email protected]>
wrote:

> Indeed, we're having some issues with "no-overseer" mode and we decided to
> hold off deployment of this feature for now.
>
> This is mostly because of a design flaw in distributed cluster updates.
> Once a collection reaches a certain size, we end up with many "writers"
> that want to update that collection's state.json concurrently. Since they
> all do ZooKeeper "check-and-write" operations using the version number,
> there is no data corruption, but it puts a huge load on the cluster.
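>
> For illustration, the pattern is roughly this optimistic retry loop (a
> sketch against the plain ZooKeeper API, not Solr's actual state-update
> code):
>
>     import java.util.function.UnaryOperator;
>     import org.apache.zookeeper.KeeperException;
>     import org.apache.zookeeper.ZooKeeper;
>     import org.apache.zookeeper.data.Stat;
>
>     class StateJsonUpdater {
>       /** Conditionally rewrite a collection's state.json; retry on conflict. */
>       static void update(ZooKeeper zk, String path, UnaryOperator<byte[]> mutate)
>           throws KeeperException, InterruptedException {
>         while (true) {
>           Stat stat = new Stat();
>           byte[] current = zk.getData(path, false, stat); // read data + version
>           byte[] updated = mutate.apply(current);         // apply this node's change
>           try {
>             zk.setData(path, updated, stat.getVersion()); // the "check-and-write"
>             return;
>           } catch (KeeperException.BadVersionException e) {
>             // Another writer got there first: re-read and retry.  With many
>             // concurrent writers, these retries are where the load piles up.
>           }
>         }
>       }
>     }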
>
> This issue has a major impact when deleting a large collection (1000+
> cores). The collection API sends 1000+ UNLOAD core admin requests that are
> processed concurrently, and each one wants to update the collection's
> 'state.json' file. This triggers a tidal wave of write operations that get
> rejected by ZooKeeper.
>
> This mode probably works pretty well and scales better when there are many
> small collections in the cluster. Updates to these collections can then be
> done concurrently since the probability of having write conflicts is low.
> The opposite holds for big collections, where it seems the Overseer scales
> better.
>
> Best,
> Pierre
>
>
> On Fri, Sep 26, 2025 at 3:37 PM Jason Gerlowski <[email protected]>
> wrote:
>
> > In the latest Virtual Community Meetup, Pierre and Bruno shared that
> > they recently enabled "no-overseer" mode (really need a better name!)
> > at their workplace in production and were hitting some stability
> > issues.  Their anecdotes give me at least some "pause" about whether
> > this is really ready to be the default, or whether it needs a bit more
> > time to bake.
> >
> > Any chance either of them are watching this thread and can provide an
> > update or their own 2c here?
> >
> > Best,
> >
> > Jason
> >
> > On Wed, Aug 20, 2025 at 6:35 PM David Smiley <[email protected]> wrote:
> > >
> > > Update:  I proposed the Solr version go into cluster properties[1]
> > > on ZK initialization but Houston pushed back on that approach.  The
> > > other approach is to rely on the "least Solr version" as returned by
> > > the Solr version being added to live nodes[2].  However that's very
> > > dynamic and I'm concerned about an old Solr version somehow joining
> > > an existing cluster.  Rather than simply tell users "don't do that"
> > > (which we should do anyway), I'm inclined to have Solr fail to join
> > > an existing cluster if doing so would *lower* the least Solr version.
> > > Perhaps ignoring the final patch version.  I plan on updating that PR
> > > accordingly.  The PR would have to go into 9.x too, and thus such
> > > logic can't be enforced for older Solr versions.  Regardless, an
> > > upgrading user can choose the setting.
> > >
> > > [1] https://issues.apache.org/jira/browse/SOLR-17664
> > > [2] https://issues.apache.org/jira/browse/SOLR-17620
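> > >
> > > To make the intended guard concrete, a hypothetical sketch (none of
> > > these names exist in Solr; it only encodes the "fail to join if it
> > > would lower the least version, ignoring the patch digit" rule):
> > >
> > >     // Hypothetical -- illustrative only, not Solr code.
> > >     record SolrVersion(int major, int minor, int patch) {
> > >       boolean isOlderIgnoringPatch(SolrVersion other) {
> > >         if (major != other.major()) return major < other.major();
> > >         return minor < other.minor();
> > >       }
> > >     }
> > >
> > >     class JoinGuard {
> > >       // Called with this node's version and the least version currently
> > >       // advertised in live nodes; throws instead of joining the cluster.
> > >       static void assertJoinAllowed(SolrVersion thisNode, SolrVersion least) {
> > >         if (thisNode.isOlderIgnoringPatch(least)) {
> > >           throw new IllegalStateException("Refusing to join: " + thisNode
> > >               + " would lower the cluster's least Solr version " + least);
> > >         }
> > >       }
> > >     }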
> > >
> > > On Tue, Jul 22, 2025 at 1:12 AM David Smiley <[email protected]>
> > > wrote:
> > >
> > > > On the epic of seeking the _eventual_ demise of the Overseer, I'm
> > > > seeking to make it disabled for *new* SolrCloud clusters in Solr
> > > > 10. -- https://issues.apache.org/jira/browse/SOLR-17293
> > > > The epic: https://issues.apache.org/jira/browse/SOLR-14927 (oddly
> > > > no SIP but it has a doc anyway).  I think it's sufficiently ready
> > > > for the great majority of SolrCloud clusters.  A cluster with a
> > > > collection containing a thousand+ replicas might pose a performance
> > > > concern on start/restart events due to independent replica state
> > > > updates.  Of course, with such a change, there will be a section in
> > > > the upgrade page in the ref guide to advise users who may opt to
> > > > make an explicit choice.
> > > >
> > > > I don't love that the new mode doesn't have an elegant/clear way
> > > > to refer to it.  The best I've come up with is to say what it
> > > > *isn't* -- it *isn't* the Overseer.  "The Overseer is disabled".
> > > > Awkwardly there are two undocumented solr.xml booleans, both a
> > > > mouthful: distributedClusterStateUpdates and
> > > > distributedCollectionConfigSetExecution.  I propose instead that a
> > > > single boolean cluster property be defined named "overseer" or
> > > > "overseerEnabled".  FYI the existing known cluster properties are
> > > > defined here:
> > > >  org.apache.solr.common.cloud.ZkStateReader#KNOWN_CLUSTER_PROPS
> > > > Even if such a boolean is agreeable... it raises the question of
> > > > what should become of the "overseer" node role.
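> > > >
> > > > For example, setting it would go through the existing CLUSTERPROP
> > > > action of the Collections API (the property name here is only my
> > > > proposal, not something that exists yet):
> > > >
> > > >     import java.net.URI;
> > > >     import java.net.http.HttpClient;
> > > >     import java.net.http.HttpRequest;
> > > >     import java.net.http.HttpResponse;
> > > >
> > > >     class SetOverseerClusterProp {
> > > >       public static void main(String[] args) throws Exception {
> > > >         // CLUSTERPROP is the existing API action; "overseerEnabled" is
> > > >         // only the proposed property name.
> > > >         String url = "http://localhost:8983/solr/admin/collections"
> > > >             + "?action=CLUSTERPROP&name=overseerEnabled&val=false";
> > > >         HttpResponse<String> rsp = HttpClient.newHttpClient().send(
> > > >             HttpRequest.newBuilder(URI.create(url)).build(),
> > > >             HttpResponse.BodyHandlers.ofString());
> > > >         System.out.println(rsp.body());
> > > >       }
> > > >     }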
> > > >
> > > > ~ David Smiley
> > > > Apache Lucene/Solr Search Developer
> > > > http://www.linkedin.com/in/davidwsmiley
> > > >
> >
