Sounds like an experimental feature whose implementation could still change (to fix this replica-count issue). While we want feedback from real users, this discussion makes me think it's just not ready to be the default yet.
On Wed, Oct 1, 2025 at 12:44 PM Jason Gerlowski <[email protected]> wrote:
>
> > Let's imagine (quite realistically) that I'm updating our upgrade
> > notes and providing advice for users to choose the Overseer or not.
> > Given the benefits & risks, at what replica count threshold (for the
> > biggest collection) would you advise continued use of the Overseer?
>
> Even assuming we can figure out a value of 'N' that sounds reasonable,
> can be shown to be stable in load testing, etc. ... is that enough to
> recommend "non-overseer" mode?
>
> Sizing (which includes replica count) is notoriously a guess-and-check
> process in Solr. And even for users who have done everything right and
> dialed in their replica count with some benchmarking - what happens
> when their requirements change and they need to add replicas to (e.g.)
> accommodate a higher QPS? Is there an easy way for those users to
> switch back to the Overseer, or do they have to risk instability going
> forward?
>
> I guess I'm worried about basing recommendations on a factor like
> replica count, which has a tendency to drift over time, if the decision
> itself (i.e. Overseer or not) is difficult to reverse after the fact.
> I'm not 100% sure that's the case here, but that's my suspicion based
> on a hazy recollection of some past discussions.
>
> Best,
>
> Jason
>
> On Wed, Oct 1, 2025 at 10:10 AM Ilan Ginzburg <[email protected]> wrote:
> >
> > It's hard to provide a recommended threshold on collection size for
> > distributed mode.
> > I didn't run tests, and it obviously depends on the number of nodes
> > in the cluster and how fast everything (including ZooKeeper) runs,
> > but I'd say that below a couple hundred replicas total for a
> > collection it should be ok.
> > When a Solr node starts, it marks all its replicas DOWN before
> > marking them ACTIVE. If PRS is not used, this could take a long time
> > with distributed mode and be slower than the Overseer due to the lack
> > of batching of updates.
> >
> > Indexing and query performance is obviously not impacted by
> > distributed mode or Overseer performance, unless shard split
> > performance is considered part of indexing performance.
> >
> > Ilan
> >
> > On Tue, Sep 30, 2025 at 10:20 PM David Smiley <[email protected]> wrote:
> >
> > > Let's imagine (quite realistically) that I'm updating our upgrade
> > > notes and providing advice for users to choose the Overseer or not.
> > > Given the benefits & risks, at what replica count threshold (for
> > > the biggest collection) would you advise continued use of the
> > > Overseer?
> > >
> > > Sufficient stability is the characteristic I'm looking for in the
> > > above question, not performance. I understand that it'd be ideal if
> > > the cluster's state was structured differently to improve the
> > > performance of certain administrative operations, but performance
> > > is not a requirement for stability. The most important performance
> > > considerations our users have relate to index & query. There's a
> > > basic assumption that nodes can restart in a "reasonable time"...
> > > maybe you'd care to try to define that. I think your
> > > recommendations around restructuring the cluster state would
> > > ultimately impact the performance of restarts and some other
> > > administrative scenarios but shouldn't be a prerequisite for a
> > > stable system.
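A note on Ilan's restart point above: PRS keeps the per-replica
DOWN/ACTIVE churn out of the shared state.json body, and it's opt-in per
collection, so it's worth enabling before experimenting with distributed
mode. A rough SolrJ sketch, assuming CollectionAdminRequest.Create still
exposes setPerReplicaState (the collection name, config set, and
ZooKeeper address below are placeholders):

    import java.util.List;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreatePrsCollection {
      public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble address; adjust for your cluster.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
            List.of("localhost:2181"), Optional.empty()).build()) {

          // perReplicaState=true records each replica's state in small
          // per-replica entries instead of rewriting the shared state.json
          // for every DOWN/ACTIVE transition on node restart.
          CollectionAdminRequest.Create create =
              CollectionAdminRequest.createCollection(
                      "techproducts", "_default", 2, 2)
                  .setPerReplicaState(Boolean.TRUE);
          create.process(client);
        }
      }
    }

That doesn't answer the reversibility question, but it at least shrinks
the amount of state.json contention a restart generates.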
> > > On Tue, Sep 30, 2025 at 4:20 AM Ilan Ginzburg <[email protected]> wrote:
> > >
> > > > Distributed mode doesn't behave nicely when there are many
> > > > concurrent updates to a given collection's state.json.
> > > >
> > > > I'd recommend *against* making it the default at this time.
> > > >
> > > > The "root cause" is the presence of replica-specific information
> > > > in state.json. In addition to relatively rare cases of changes to
> > > > the sharding of the collection, state.json is updated when
> > > > replicas are created or destroyed or moved or have their
> > > > properties changed, and, when PRS is not used, also when replicas
> > > > change state (which happens a lot when a Solr node restarts, for
> > > > example).
> > > >
> > > > Therefore, before making distributed mode the default, something
> > > > has to be done.
> > > > As Pierre suggests, redesign Collection API operations that
> > > > require multiple updates to be more efficient, and group them when
> > > > executing in distributed mode. Also make sure that smaller
> > > > operations that happen concurrently are efficient enough.
> > > > Another option is to remove replica information from state.json
> > > > (keep collection metadata and shard definitions there) and create
> > > > a state-<shardname>.json for each shard with the replicas of that
> > > > shard. Contention on anything replica-related would then be
> > > > restricted to replicas of the same shard.
> > > > There would be more watches on ZooKeeper, but they would trigger
> > > > less often and less data would be read each time. There would also
> > > > be less data to compress/uncompress each time state.json is
> > > > written or read (when so configured).
> > > >
> > > > Throttling goes against making SolrCloud as fast as we can.
> > > >
> > > > SolrCloud started with a single clusterstate.json file describing
> > > > all collections (removed in 9.0), then moved to per-collection
> > > > state.json files for scalability reasons.
> > > > Maybe the time has come to split that big blob further?
> > > >
> > > > Ilan
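To make the per-shard idea concrete: state-<shardname>.json does not
exist in Solr today, so the paths below are hypothetical; the sketch
only contrasts the watch fan-out of the current single znode with the
proposed layout, using the plain ZooKeeper client (exists() is used for
the hypothetical paths so the sketch doesn't throw on a real cluster).
For anyone who wants to try distributed mode as it ships today, it's
opt-in via the distributedClusterStateUpdates flag in the <solrcloud>
section of solr.xml, if I remember the name correctly.

    import org.apache.zookeeper.ZooKeeper;

    public class StateWatchSketch {
      public static void main(String[] args) throws Exception {
        // Placeholder ensemble address and collection name.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});

        // Today: one znode per collection. Any replica change in any
        // shard fires this watch, and each trigger re-reads the whole
        // (possibly compressed) state.json blob.
        zk.getData("/collections/techproducts/state.json",
            event -> { /* re-read everything, re-register the watch */ },
            null);

        // Hypothetical per-shard layout from the proposal above: more
        // watches, but each fires only for its own shard's replicas and
        // reads far less data per trigger. These znodes don't exist in
        // today's Solr, hence exists() rather than getData().
        for (String shard : new String[] {"shard1", "shard2"}) {
          zk.exists("/collections/techproducts/state-" + shard + ".json",
              event -> { /* re-read only this shard's replicas */ });
        }

        zk.close();
      }
    }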
> > > > On Tue, Sep 30, 2025 at 12:40 AM Chris Hostetter
> > > > <[email protected]> wrote:
> > > >
> > > > > : I don't think this should prevent shipping a system that is
> > > > > : objectively way simpler than the Overseer. Solr 10 will have
> > > > > : both modes, no matter what the default is. Changing the
> > > > > : default makes it easier to remove it in Solr 11. The impact on
> > > > > : ease of understanding SolrCloud in 11 will be amazing!
> > > > >
> > > > > I'm not understanding your claim that changing a default from
> > > > > A(x) to A(y) in 10.0 makes removing A(x) in 11.0 easier?
> > > > >
> > > > > You could change the default in 10.1, 10.2, etc. and it would
> > > > > still be the same amount of effort to remove it in 11.0.
> > > > >
> > > > > No matter when you change the default, if the *option* to use
> > > > > A(x) still exists in all versions < 11.0, then any "removal" of
> > > > > the code implementing A(x) in 11.0 still needs to ensure that
> > > > > all versions >= 11.0 have some code/process/documentation
> > > > > enabling users to migrate their cluster to A(y).
> > > > >
> > > > > -Hoss
> > > > > http://www.lucidworks.com/

--
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
