Justin,

Thanks for your reply. My understanding is that with PRS, DOWNNODE
processing on the Overseer still iterates over all collections and marks
the relevant replicas down. It may be faster, but it's not a no-op. Did I
miss something? We do not use PRS in our current setup.

I agree with the risk related to ZK session expiration. TBH we already have
quite a few issues around session expiration: we've observed that after an
expiration, the replicas on that node are no longer correctly seen by the
rest of the cluster and therefore no longer used. We haven't managed to
root-cause this yet, so I don't know whether it's related to local changes
(we run a fork of upstream SolrCloud) or whether it's a general problem.
In any case, if PRS is modified to use EPHEMERAL ZK nodes, ZK session
expiration is indeed something to be careful with.
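
For illustration, here is a rough sketch of what each node would have to do
after a session expiration, since all of its EPHEMERAL state znodes vanish
with the session. This is raw ZooKeeper client code; the class name, znode
paths, connect string and the localReplicaStates map are invented for this
sketch, not actual Solr code:

    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import org.apache.zookeeper.*;

    // Illustrative only: republish all local replica states after a ZK
    // session expiration, because the EPHEMERAL znodes vanished with it.
    class ReplicaStatePublisher implements Watcher {
      private volatile ZooKeeper zk;
      private final Map<String, String> localReplicaStates; // znode path -> state

      ReplicaStatePublisher(ZooKeeper zk, Map<String, String> localReplicaStates) {
        this.zk = zk;
        this.localReplicaStates = localReplicaStates;
      }

      @Override
      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.Expired) {
          try {
            // A new ZooKeeper handle means a new session (real code would wait
            // for SyncConnected and retry on connection loss).
            zk = new ZooKeeper("zk1:2181", 30000, this);
            for (Map.Entry<String, String> e : localReplicaStates.entrySet()) {
              zk.create(e.getKey(), e.getValue().getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            }
          } catch (Exception ex) {
            // real code would log and retry; omitted in this sketch
          }
        }
      }
    }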

Ilan

On Wed, Sep 27, 2023 at 1:24 AM Justin Sweeney <justin.sweene...@gmail.com>
wrote:

> Hey Ilan, curious if you have tried PRS in your implementation or not
> at this point and what your experience has been if you have tried it?
> I believe PRS currently publishes DOWNNODE messages to overseer, but
> they are essentially a no-op by the overseer so they have very little
> impact. We are running a cluster with many collections/shards and PRS
> has been a huge improvement for us in processing nodes going down/up.
>
> The idea of ephemeral nodes seems interesting, but there may be some
> added risk around Zookeeper session expiration and re-establishing
> replica state.
>
> Justin
>
> On Tue, Sep 26, 2023 at 6:14 PM Ilan Ginzburg <ilans...@gmail.com> wrote:
> >
> > *TL;DR: a way to track replica state using EPHEMERAL nodes that disappear
> > automatically when a node goes down.*
> >
> > Hi,
> >
> > When running a cluster with many collections and many replicas per node,
> > processing a DOWNNODE message takes a long time.
> > In a public cloud setup, the node that went down can come back quickly,
> > before that processing is finished. When that happens, replicas are marked
> > DOWN by DOWNNODE while they are marked ACTIVE by the node starting up, and
> > depending on how the two operations interleave, some replicas then stay
> > DOWN forever (forever meaning until the node is restarted).
> > We had to put in place K8s init containers to add a delay before nodes
> > restart. This delays rolling restarts, deployments and node crash recovery,
> > so it is not a desirable long-term solution.
> >
> > What do you think of a change that avoids the need for a DOWNNODE message
> > altogether:
> > - Each replica state is captured as an *EPHEMERAL* node in Zookeeper
> > - No such node implicitly means the replica state is DOWN
> > - If the node is present, it contains an encoding of the actual state
> >   (DOWN, ACTIVE, RECOVERING, RECOVERY_FAILED)
> > - When a node goes down (or when its ZK session expires) all its replica
> > state nodes automatically vanish.
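> >
> > A rough sketch of publishing and reading such a state node with the raw
> > ZooKeeper client (the znode path layout and the plain-string state
> > encoding are invented for illustration, not an existing Solr layout):
> >
> >   // needs: org.apache.zookeeper.*, java.nio.charset.StandardCharsets
> >
> >   // Publish this replica's state as an EPHEMERAL znode: it vanishes
> >   // automatically when the node's ZK session goes away.
> >   void publishState(ZooKeeper zk, String collection, String replica, String state)
> >       throws KeeperException, InterruptedException {
> >     String path = "/collections/" + collection + "/replica_states/" + replica;
> >     byte[] data = state.getBytes(StandardCharsets.UTF_8); // e.g. "ACTIVE"
> >     try {
> >       zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
> >     } catch (KeeperException.NodeExistsException e) {
> >       zk.setData(path, data, -1); // already created in this session: just update
> >     }
> >   }
> >
> >   // Readers treat a missing znode as DOWN, so no DOWNNODE message is needed.
> >   String readState(ZooKeeper zk, String collection, String replica)
> >       throws KeeperException, InterruptedException {
> >     String path = "/collections/" + collection + "/replica_states/" + replica;
> >     try {
> >       return new String(zk.getData(path, false, null), StandardCharsets.UTF_8);
> >     } catch (KeeperException.NoNodeException e) {
> >       return "DOWN";
> >     }
> >   }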
> >
> > This change is similar to the Per Replica State implementation (starting
> > point in the code:
> > https://github.com/apache/solr/blob/main/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/PerReplicaStatesOps.java#L99C17-L99C17)
> > but different:
> > - EPHEMERAL rather than PERSISTENT Zookeeper nodes
> > - No duplicate replica state nodes (and no node version to pick the
> >   right one)
> > - DOWNNODE not needed (if all collections are tracked in that way).
> > - Need to republish all replica states after Zookeeper session expiration
> > since they will disappear
> >
> > What do you think? Especially Noble and Ishan, the authors of PRS.
> > I have no detailed design and no code, just sharing an idea to solve
> > a real issue we're facing.
> >
> > Ilan
>