This is a cross-post from a user list.

We have faced this issue many times before, and many users have
complained about the whole cluster freezing. We can protect a cluster from
such a situation simply by dropping non-responsive nodes from the cluster.
Of course, we still need to get to the root cause, and killing
nodes may cause some data loss in the cluster, but I think that is better
than restarting the whole cluster from scratch.
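As a rough illustration of the idea (this is not actual Ignite code, and every class and method name below is hypothetical): whoever waits on the exchange could track which nodes have not responded yet and, once a timeout elapses, treat the remaining ones as candidates for removal. How a node is actually removed from the topology is deliberately left out of the sketch.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of Ignite: tracks which nodes still owe a response. */
class ExchangeAckTracker {
    /** IDs of nodes that have not responded yet. */
    private final Set<String> pending;

    /** Time the exchange started, in milliseconds. */
    private final long startMillis;

    /** How long to wait before a silent node counts as non-responsive. */
    private final long timeoutMillis;

    ExchangeAckTracker(Set<String> nodeIds, long timeoutMillis, long startMillis) {
        this.pending = new HashSet<>(nodeIds);
        this.timeoutMillis = timeoutMillis;
        this.startMillis = startMillis;
    }

    /** Called when a node responds; it is no longer a removal candidate. */
    synchronized void onAck(String nodeId) {
        pending.remove(nodeId);
    }

    /** True once every node has responded. */
    synchronized boolean isDone() {
        return pending.isEmpty();
    }

    /**
     * Returns the nodes that should be dropped at time nowMillis:
     * nothing before the timeout, all still-silent nodes after it.
     */
    synchronized Set<String> nonResponsive(long nowMillis) {
        if (nowMillis - startMillis < timeoutMillis)
            return Collections.emptySet();

        return new HashSet<>(pending);
    }
}
```

The caller would check nonResponsive(System.currentTimeMillis()) periodically while the exchange is pending and ask the cluster to drop whatever it returns. The timeout value itself would presumably have to be configurable.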

To summarize, I suggest 'killing' non-responsive nodes, i.e. removing them
from the topology, after some timeout in the exchange future.

Thoughts?
