(A) -- Then Salesforce's recent experience with a disabled Overseer isn't
relevant to technical problem (A), since both Salesforce's fork and PRS
solve that problem -- scoped narrowly to the replica state (an enum).

(B) A minor clarification to "Concurrent updates will exhibit issues" --
the issue is *many* concurrent changes to collection
geometry/structure/location (i.e., state.json changes), which doesn't
scale; it's not just the possibility of occasional races.
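To make that concrete, here is a minimal sketch of why many concurrent writers to a single versioned document (like one state.json) degrade under optimistic concurrency: every collision forces a full re-read, re-parse, and retry. This is purely illustrative -- the `VersionedDoc` class and its methods are a hypothetical in-memory stand-in, not Solr or ZooKeeper APIs.

```python
import threading

class VersionedDoc:
    """Stand-in for a versioned znode: a conditional write succeeds only
    if the caller's version matches the current one (optimistic CAS)."""
    def __init__(self):
        self._lock = threading.Lock()
        self.data = {}
        self.version = 0

    def read(self):
        with self._lock:
            return dict(self.data), self.version

    def cas(self, data, expected_version):
        with self._lock:
            if expected_version != self.version:
                return False  # like BadVersion: someone else wrote first
            self.data = data
            self.version += 1
            return True

def update_replica(doc, replica, state, retries):
    """One 'replica state change': read the whole doc, modify one entry,
    attempt a conditional write; on collision, re-read and retry."""
    while True:
        data, version = doc.read()  # this read+parse cost grows with doc size
        data[replica] = state
        if doc.cas(data, version):
            return
        retries[replica] = retries.get(replica, 0) + 1

doc = VersionedDoc()
retries = {}
threads = [threading.Thread(target=update_replica,
                            args=(doc, f"replica{i}", "ACTIVE", retries))
           for i in range(50)]
for t in threads: t.start()
for t in threads: t.join()

# Exactly 50 writes succeed (version == 50); any retries are wasted
# read/modify/write cycles caused purely by contention.
print(doc.version, sum(retries.values()))
```

In real ZooKeeper the analogous call is `setData(path, data, expectedVersion)` failing with `BadVersionException`; the retry loop's read-and-parse step is the expensive part when state.json is large, which is why contention compounds.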

On Mon, Oct 20, 2025 at 12:59 PM Ilan Ginzburg <[email protected]> wrote:

> (A) Salesforce removed most of the replica state changes and will
> eventually remove the concept of replica state altogether (Salesforce runs
> a variant of SIP-20
> <https://cwiki.apache.org/confluence/display/SOLR/SIP-20%3A+Separation+of+Compute+and+Storage+in+SolrCloud>
> in its fork, where any node can get the latest copy of any replica at any
> time).
>




> (B) Concurrent updates (not only of replica state) will exhibit issues.
> Deleting a collection can be done in one update to state.json, so that's
> indeed not a big deal. Creating a collection is ok (assuming no replica
> state updates, or updates handled by PRS). Moving multiple replicas
> concurrently will be an issue, as would splitting multiple shards
> concurrently. I think restoring a collection would also stress distributed
> mode.
>
> Doing conditional updates of state.json can be slow due to the time it
> takes to read and especially parse a large state.json. This increases the
> likelihood of collisions, retries, and eventual failures.
>
>
> On Mon, Oct 20, 2025 at 4:54 PM David Smiley <[email protected]> wrote:
>
> > I'm hearing two technical concerns with disabling the Overseer:
> >
> > (A) For many-replica collections, replica state changes don't scale well
> > to a single state.json that holds the state of all replicas.  SolrCloud
> > has a solution to _that_ problem today -- PRS.  Salesforce's SolrCloud
> > fork basically removed the state of replicas, notwithstanding examining
> > live nodes at runtime, so I don't think Salesforce sees this problem
> > today either, right?  (not that it matters to the community, but I want
> > to ensure I'm understanding the scope of this problem)
> >
> > (B) For many-replica collections, creation/deletion/moving of many
> > replicas at once doesn't scale well.  The example given was deleting a
> > collection, but I'm skeptical that's a good example, since my reading of
> > its code is that it deletes replicas in sequence, not in parallel.  But
> > let's imagine it did, or let's imagine many-replica collection creation.
> > PRS doesn't address this, as it only separates out the replica's state
> > enum, not the replica's very existence or location.  Near-term proposed
> > solution: the "Cmd" implementations generally use a ShardHandler, and
> > its concurrency could be capped to something reasonable.  Other cluster
> > replica rebalancing would likewise need to be throttled to a manageable
> > number at once; we can't move all replicas at once (for big collections,
> > anyway).  Users have solutions for this, just not within SolrCloud's
> > OOTB code.
> >
> > I don't mean to push against a redesign of cluster state, but that's not
> > in the short term.  I think there are reasonable short-term solutions.
> >
