So this is interesting. I'm assuming that you are running one SolrCloud resource per shard so that you can set system properties separately for autoscaling purposes. The Solr Operator assumes that each cloud it manages is independent. However, the rolling restart process really just kills as many pods as it can until the cluster state would be too unhealthy to kill more (the thresholds are configurable).
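
To make that selection rule concrete, here is a rough sketch in Python. It is not the operator's actual code (the operator is not written in Python), and the `Pod` type, the function name, and both threshold parameters are hypothetical names chosen for illustration. It assumes two configurable limits, one on unavailable pods and one on unavailable replicas per shard, and it folds in the rule that the overseer goes last, once every other node is upgraded and live.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    # Hypothetical, simplified view of a Solr pod for this sketch only.
    name: str
    up_to_date: bool
    ready: bool
    shards: frozenset = frozenset()  # shards with a replica on this pod
    is_overseer: bool = False


def pick_pods_to_restart(pods, max_pods_unavailable=1,
                         max_shard_replicas_unavailable=1):
    """Return names of out-of-date pods that can be killed right now."""
    down = [p for p in pods if not p.ready]
    budget = max_pods_unavailable - len(down)
    # Replicas already unavailable per shard because their pod is not ready.
    shard_down = Counter(s for p in down for s in p.shards)

    chosen = []
    # Stable sort on a bool key: non-overseer pods are considered first.
    for pod in sorted(pods, key=lambda p: p.is_overseer):
        if pod.up_to_date or not pod.ready or budget <= 0:
            continue
        if pod.is_overseer:
            # Overseer goes last: every other pod must be upgraded and live.
            others = [p for p in pods if p is not pod]
            if not all(p.up_to_date and p.ready for p in others):
                continue
        if any(shard_down[s] >= max_shard_replicas_unavailable
               for s in pod.shards):
            continue  # killing this pod would leave a shard too unhealthy
        chosen.append(pod.name)
        budget -= 1
        shard_down.update(pod.shards)
    return chosen
```

With one shard per SolrCloud resource, the per-shard check inside one resource never involves pods from another resource. Only the overseer check would need a view of nodes beyond the resource being reconciled.
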
In theory it should be fine to do a rolling restart on each SolrCloud resource at the same time. This is especially true because no two SolrCloud resources share a shard, so their restarts should not affect each other. (Actually, you have devised the only truly safe way of simultaneously upgrading multiple SolrCloud resources that are actually one large cloud.)

The only overlap in logic between the SolrCloud resources is the overseer. The logic in the Solr Operator is to restart the overseer last, and to wait for all nodes to be live and the cluster state to be healthy before killing it.

Are you seeing that all other node upgrades have succeeded, and the cluster is healthy, but the overseer is still not upgraded?

On Thu, Oct 14, 2021 at 1:50 PM Joel Bernstein <[email protected]> wrote:

> This is a followup to my last question with my findings thus far. In a
> scenario where there is one SolrCloud resource per-shard I'm seeing the
> overseer node get skipped entirely during rolling restarts. So, it appears
> the solr-operator can only manage rolling restarts when there is one
> SolrCloud object in the cluster.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Oct 12, 2021 at 6:44 PM Joel Bernstein <[email protected]> wrote:
>
> > Hi,
> >
> > I saw that the Solr operator takes into account collection topology when
> > performing rolling restarts. In a situation where there is one SolrCloud
> > object per-shard, I'm wondering how this will behave. In this case the Solr
> > Operator would receive a different CR for each shard which would kick off
> > the rolling restarts in parallel. Would the operator be able to understand
> > that it was operating on a single shard in each CR and not get tangled up
> > in the larger cluster state?
> >
> > Thanks,
> > Joel
