(A) Then Salesforce's recent experience with a disabled Overseer isn't relevant to technical problem (A): both Salesforce's fork and PRS offer solutions to this technical problem, scoped narrowly to the replica state (an enum).
(B) A minor clarification to "Concurrent updates will exhibit issues" -- the problem is *many* concurrent collection geometry/structure/location changes (i.e. state.json changes), which don't scale; it's not just the possibility of some races. (Two illustrative sketches follow the quoted thread below.)

On Mon, Oct 20, 2025 at 12:59 PM Ilan Ginzburg <[email protected]> wrote:

> (A) Salesforce removed most of the replica state changes and will
> eventually remove the concept of replica state altogether (Salesforce runs
> a variant of SIP-20
> <https://cwiki.apache.org/confluence/display/SOLR/SIP-20%3A+Separation+of+Compute+and+Storage+in+SolrCloud>
> in its fork, where any node can get the latest copy of any replica at any
> time).
>
> (B) Concurrent updates (not only of replica state) will exhibit issues.
> Deleting a collection can be made in one update to state.json, so indeed
> not a big deal. Creating a collection is OK (assuming no replica state
> updates, or updates handled by PRS). Moving multiple replicas concurrently
> will be an issue, as would be splitting multiple shards concurrently. I
> think restoring a collection would also stress distributed mode.
>
> Doing conditional updates of state.json can be slow due to the time it
> takes to read and esp. parse a large state.json. This increases the
> likelihood of collisions, retries, and eventually failures.
>
> On Mon, Oct 20, 2025 at 4:54 PM David Smiley <[email protected]> wrote:
>
> > I'm hearing two technical concerns with disabling the Overseer:
> >
> > (A) For many-replica collections, replica state changes don't scale well
> > to a single state.json that has the state of all replicas. SolrCloud has
> > a solution to _that_ problem today -- PRS. Salesforce's SolrCloud fork
> > basically removed the state of replicas, notwithstanding examining live
> > nodes at runtime, so I don't think Salesforce sees this problem today
> > either, right? (Not that it matters to the community, but I want to
> > ensure I'm understanding the scope of this problem.)
> >
> > (B) For many-replica collections, creation/deletion/moving of many
> > replicas at once doesn't scale well. The example given was deleting a
> > collection, but I'm skeptical that's a good example, since my reading of
> > its code is that it deletes replicas in sequence, not in parallel. But
> > let's imagine it did, or let's imagine many-replica collection creation.
> > PRS doesn't address this, as it only separates the replica's state enum,
> > not the replica's very existence or location. Near-term proposed
> > solution: the "Cmd" implementations generally use a ShardHandler, and
> > its concurrency could be capped to something reasonable. And other
> > cluster replica rebalancing would likewise need to be throttled to do a
> > manageable number at once; you can't move all replicas at once (for big
> > collections, anyway). Users have solutions for this, though not within
> > SolrCloud's OOTB code.
> >
> > I don't mean to push against a redesign of cluster state, but that's not
> > in the short term. I think there are reasonable short-term solutions.
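
For concreteness, here's a minimal sketch of the read-modify-write cycle Ilan describes, assuming ZooKeeper's versioned setData as the conditional update. It's illustrative only (the class name and the modify callback are made up, not actual Solr code), but it shows why a large state.json widens the race window: the slower the read+parse step, the more likely the version check fails and the whole cycle repeats.

    import java.util.function.UnaryOperator;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class StateJsonCasSketch {
      /**
       * Hypothetical illustration, not actual Solr code: conditionally update
       * a collection's state.json znode. The modify function stands in for
       * parse + mutate + re-serialize of the JSON.
       */
      public static byte[] casUpdate(ZooKeeper zk, String path,
                                     UnaryOperator<byte[]> modify, int maxRetries)
          throws KeeperException, InterruptedException {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
          Stat stat = new Stat();
          byte[] current = zk.getData(path, false, stat); // read: slow for a large state.json
          byte[] updated = modify.apply(current);         // parse/mutate/serialize: also slow
          try {
            // Conditional write: succeeds only if no other writer touched the
            // znode since our read (the version check is a compare-and-set).
            zk.setData(path, updated, stat.getVersion());
            return updated;
          } catch (KeeperException.BadVersionException e) {
            // A concurrent writer won the race. The longer the read+parse
            // above took, the wider this window, hence more retries under
            // load -- and eventually failures once maxRetries is exhausted.
          }
        }
        throw new IllegalStateException(path + " update failed after " + maxRetries + " attempts");
      }
    }

With many writers, each retry re-pays the full read+parse cost, so contention compounds as state.json grows.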

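And a sketch of the kind of throttling meant for the Cmd implementations -- again purely illustrative (this is not ShardHandler's API; a plain java.util.concurrent Semaphore stands in): cap how many replica operations are in flight at once, so a many-replica create/delete/move doesn't flood the cluster.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    public class ReplicaOpThrottleSketch {
      // Hypothetical illustration: run replica operations (each Runnable is,
      // say, one replica delete or move) with at most maxInFlight concurrent.
      public static void runThrottled(List<Runnable> replicaOps, int maxInFlight)
          throws InterruptedException {
        Semaphore permits = new Semaphore(maxInFlight);
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
          for (Runnable op : replicaOps) {
            permits.acquire(); // blocks submission once maxInFlight ops are running
            pool.execute(() -> {
              try {
                op.run();
              } finally {
                permits.release();
              }
            });
          }
        } finally {
          pool.shutdown();
        }
      }
    }

A fixed-size thread pool achieves the same cap; the semaphore just makes the throttle explicit at submission time, which is closer to what "capped to something reasonable" would look like.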