>
> I wonder if ZK session expiration and re-establishment works nicely for
> others? The code handling this is in ZkController.onReconnect().
Answering my own question: the issue was specific to our fork, so I assume
ZK session expiration and re-establishment does work nicely in general.
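[For readers following along, a minimal sketch of the reconnect pattern that
ZkController.onReconnect() implements: an expired ZooKeeper session cannot be
revived, so the client must open a new session and re-create any ephemeral
nodes the old session owned. This uses the stock ZooKeeper client API; the
class name and path are illustrative, not Solr's.]

import java.io.IOException;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// An expired session cannot be revived: open a new one and re-create
// the ephemerals the old session owned (roughly what onReconnect()
// does for the live node and related state).
public class ReconnectingClient implements Watcher {
  private final String connectString;
  private final String liveNodePath; // e.g. "/live_nodes/host:8983_solr"
  private volatile ZooKeeper zk;

  public ReconnectingClient(String connectString, String liveNodePath)
      throws IOException {
    this.connectString = connectString;
    this.liveNodePath = liveNodePath;
    this.zk = new ZooKeeper(connectString, 30_000, this);
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getState() == Event.KeeperState.Expired) {
      try {
        // New session; the old session's ephemerals are gone.
        zk = new ZooKeeper(connectString, 30_000, this);
        // Re-register state the old session owned. (Waiting for
        // SyncConnected and retry/backoff are omitted in this sketch.)
        zk.create(liveNodePath, new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      } catch (Exception e) {
        // Real code needs retry/backoff here.
      }
    }
  }
}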
Oh, I’m not referring to your proposal, just what happens with the current
system and what I had done around the DOWN state to address it. The problem
there is the missing replica state when you come up, as the live node can’t
cover for it. How you cover for that state could be done in a lot of ways,
it’
Thanks for bringing up this topic, Ilan.
> We are also running a fork of Solr and in our
> fork we have made some optimizations to avoid processing DOWNNODE
> messages for nodes that only host PRS collections. Those optimizations
> have not made it upstream at this point. I can take a look at
> up
Not sure I totally follow what you mean, Mark.
We thought of making actual replica state = published replica state AND node
state, which would set the practical replica state to DOWN when the ephemeral
ZooKeeper node for a SolrCloud node disappears.
This works nicely for the going-down part, but still requ
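[A minimal sketch of that rule, assuming the proposal is "effective state =
published state gated by node liveness"; all names here are hypothetical,
not Solr's.]

import java.util.Set;

// Hypothetical names: the effective state of a replica is its last
// published state, gated by whether its node's ephemeral live node
// still exists. A vanished live node implies DOWN with no DOWNNODE
// message required.
enum ReplicaState { ACTIVE, RECOVERING, DOWN }

final class EffectiveState {
  static ReplicaState effective(
      ReplicaState published, String nodeName, Set<String> liveNodes) {
    return liveNodes.contains(nodeName) ? published : ReplicaState.DOWN;
  }
}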
Actually, I think what I did was move the DOWN state to startup. Since you
can’t count on it on shutdown (crash, killed process, state doesn’t get
published for a variety of reasons), it doesn’t do anything solid for the
holes where you are indexing and a node cycles. So it can come up in any
state
Yeah, I think a Jira issue or two was filed for it, but I didn't see
anything user-facing go in. You can do it for queries by asking the
overseer to publish a DOWN state, though. It won't drop indexing leadership
until you close the core, but it will prevent the temporary slow/hotspot
you get if you
Yeah, I’ve mentioned it a number of times, but it’s absolutely something we
should have. Give up leadership, don’t accept new replicas, don’t accept new
requests. Maybe remove live_node!
- Houston
On Thu, Sep 28, 2023 at 6:45 PM David Smiley wrote:
> Somewhat related to this, we don't yet have a wa
Somewhat related to this, we don't yet have a way to signal to the cluster
that a node will soon be shut down, but has not shut down yet. With such
information, we'd prefer to not send queries there, and perhaps could even
begin shard/overseer leadership changes. This was mentioned somewhere in
c
That did require some changes around live node handling, which is why a
different approach, as you suggest, would also be reasonable. You still do
want to solve for the original motivation of DOWN: stopping search traffic
to the node before things start closing.
Yeah, I took the DOWN state out altogether in shutdown, as it's problematic
and effectively sugar for the user view of the cluster state; as far as
the system goes, if the ephemeral live node is gone, that node is down,
regardless of the replica state. There is some value in being able to
remove a
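[For concreteness, that "the live node is authoritative" view is just an
existence test on the ephemeral znode. The /live_nodes path matches
SolrCloud's layout; the helper itself is illustrative.]

import org.apache.zookeeper.ZooKeeper;

// A node is down exactly when its ephemeral /live_nodes entry is gone,
// whatever replica state it last published.
final class Liveness {
  static boolean isNodeLive(ZooKeeper zk, String nodeName) throws Exception {
    return zk.exists("/live_nodes/" + nodeName, false) != null;
  }
}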
You are right, sorry. We are also running a fork of Solr and in our
fork we have made some optimizations to avoid processing DOWNNODE
messages for nodes that only host PRS collections. Those optimizations
have not made it upstream at this point. I can take a look at
upstreaming those changes or som
Justin,
Thanks for your reply. My understanding is that with PRS, DOWNNODE on
Overseer still iterates over all collections and marks the relevant
replicas down. It may be faster, but it's not a no-op. Did I miss something?
We do not use PRS in our current setup.
I agree with the risk related to Z
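[Roughly the per-DOWNNODE work being described here, plus where the fork's
PRS skip mentioned above would sit. All types are stand-ins for
illustration, not Solr's actual Overseer code.]

import java.util.List;
import java.util.Set;

// DOWNNODE handling walks every collection and flips that node's
// replicas to DOWN, so its cost grows with collection/replica counts.
// The fork optimization amounts to skipping collections whose
// per-replica state is tracked outside state.json.
final class DownNodeSketch {
  interface Collection {
    boolean isPerReplicaState();
    List<Replica> replicasOn(String nodeName);
  }
  interface Replica { void markDown(); }

  static void processDownNode(String nodeName, Set<Collection> collections) {
    for (Collection c : collections) {
      if (c.isPerReplicaState()) continue; // fork optimization: PRS-only, skip
      for (Replica r : c.replicasOn(nodeName)) {
        r.markDown(); // overseer publishes DOWN for each affected replica
      }
    }
  }
}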
Hey Ilan, curious whether you have tried PRS in your implementation at
this point, and what your experience has been if you have tried it?
I believe PRS currently publishes DOWNNODE messages to the overseer, but
they are essentially a no-op for the overseer, so they have very little
impact. We are runni
*TL;DR: a way to track replica state using EPHEMERAL nodes that disappear
automatically when a node goes down.*
Hi,
When running a cluster with many collections and replicas per node,
processing DOWNNODE messages takes more and more time.
In a public cloud setup, the node that went down can come back q
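[A sketch of the TL;DR, assuming one ephemeral znode per replica carrying
its state. The path layout is invented for illustration, and the persistent
parent paths are assumed to already exist.]

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Each replica advertises its state as an EPHEMERAL znode, so a dead
// node's replicas vanish on session expiry with no DOWNNODE processing.
final class EphemeralReplicaState {
  static void publish(ZooKeeper zk, String collection, String replica,
      String state) throws Exception {
    String path = "/replica_states/" + collection + "/" + replica + ":" + state;
    zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
        CreateMode.EPHEMERAL);
  }
}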