Hey Ilan, have you tried PRS in your implementation yet, and if so, what has your experience been? I believe PRS currently publishes DOWNNODE messages to the overseer, but the overseer treats them as essentially a no-op, so they have very little impact. We are running a cluster with many collections/shards, and PRS has been a huge improvement for us in processing nodes going down/up.
The idea of ephemeral nodes seems interesting, but there may be some added risk around Zookeeper session expiration and re-establishing replica state.

Justin

On Tue, Sep 26, 2023 at 6:14 PM Ilan Ginzburg <ilans...@gmail.com> wrote:
>
> *TL;DR; a way to track replica state using EPHEMERAL nodes that disappear
> automatically when a node goes down.*
>
> Hi,
>
> When running a cluster with many collections and replicas per node,
> processing of DOWNNODE messages takes more time.
> In a public cloud setup, the node that went down can come back quickly
> before that processing is finished. When that happens, replicas are marked
> DOWN by DOWNNODE while they are marked ACTIVE by the node starting, and
> depending on how the two operations intermesh, some replicas then stay DOWN
> forever (forever is until node is restarted).
> We had to put in place K8s init containers to add a delay before nodes
> restart. This delays rolling restarts, deployments and node crash recovery
> so not a desirable long term solution.
>
> What do you think of a change that avoids the need for a DOWNNODE message
> altogether:
> - Each replica state is captured as an *EPHEMERAL* node in Zookeeper
> - No such node implicitly means the replica state is DOWN
> - If the node is present, it contains an encoding of the actual state (DOWN,
> ACTIVE, RECOVERING, RECOVERY_FAILED)
> - When a node goes down (or when its ZK session expires) all its replica
> state nodes automatically vanish.
>
> This change is similar to the Per Replica State implementation (starting
> point
> <https://github.com/apache/solr/blob/main/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/PerReplicaStatesOps.java#L99C17-L99C17>
> in the code) but different:
> - EPHEMERAL rather than PERSISTENT Zookeeper nodes
> - No duplicate replica state nodes (and no node version to pick the right
> one)
> - DOWNNODE not needed (if all collections are tracked in that way).
> - Need to republish all replica states after Zookeeper session expiration
> since they will disappear
>
> What do you think? esp. Noble and Ishan the authors of PRS.
> I have no detailed design and no code, just sharing an idea to solve
> a real issue we're facing.
>
> Ilan
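
To make the proposed semantics concrete, here is a rough in-memory sketch in Java. It is not ZooKeeper or Solr code: the class `EphemeralReplicaStates`, its methods, and the session and replica names are all hypothetical, and it assumes that a restarted node replaces any stale state node left over from its old session. It only models the three rules from the proposal: a missing node means DOWN, a present node carries the actual state, and all nodes owned by a session vanish when that session ends.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the proposal; hypothetical names, not the ZooKeeper client API.
class EphemeralReplicaStates {
    enum State { DOWN, ACTIVE, RECOVERING, RECOVERY_FAILED }

    // One "ephemeral node": the replica state plus the session that owns it.
    private static final class StateNode {
        final String ownerSession;
        final State state;
        StateNode(String ownerSession, State state) {
            this.ownerSession = ownerSession;
            this.state = state;
        }
    }

    // Replica name -> its state node, if one currently exists.
    private final Map<String, StateNode> nodes = new ConcurrentHashMap<>();

    // Publish (or republish) a replica state under the given session.
    // Assumption: a new session simply replaces a stale node from an old one.
    void publish(String sessionId, String replica, State state) {
        nodes.put(replica, new StateNode(sessionId, state));
    }

    // No node implicitly means the replica is DOWN.
    State stateOf(String replica) {
        StateNode node = nodes.get(replica);
        return node == null ? State.DOWN : node.state;
    }

    // Session ends (node crash or expiration): every node it owns vanishes.
    void expireSession(String sessionId) {
        nodes.values().removeIf(n -> n.ownerSession.equals(sessionId));
    }

    public static void main(String[] args) {
        EphemeralReplicaStates zk = new EphemeralReplicaStates();
        zk.publish("session-1", "core_node1", State.ACTIVE);
        zk.expireSession("session-1");                        // node goes down
        System.out.println(zk.stateOf("core_node1"));         // DOWN, no DOWNNODE message
        zk.publish("session-2", "core_node1", State.ACTIVE);  // republish on restart
        System.out.println(zk.stateOf("core_node1"));         // ACTIVE
    }
}
```

In this model the late expiration of an old session cannot mark a replica DOWN once the restarted node has republished it under a new session. The session expiration risk raised in the reply shows up as the need to call `publish` again for every replica after a new session is established.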