This is a cross-post from the user list. We have hit this issue many times before, and many users have complained about the whole cluster freezing. We can protect a cluster from this situation simply by dropping non-responsive nodes from the cluster. Of course, we still need to get to the bottom of the root cause, and killing nodes may cause some data loss in the cluster, but I think that is better than restarting the whole cluster from scratch.
To summarize, I suggest we 'kill' non-responsive nodes, i.e. drop them from the topology, after some timeout in the exchange future. Thoughts?
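To make the proposal a bit more concrete, here is a rough sketch of the bookkeeping I have in mind. It is not actual Ignite code: the class name ExchangeWaitTracker, its methods, and the use of plain strings as node IDs are all made up for illustration, and the real exchange future would need to plug this into discovery to actually remove the nodes.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

/**
 * Hypothetical helper (not an Ignite class): remembers which nodes an
 * exchange is still waiting for and reports the ones to drop on timeout.
 */
final class ExchangeWaitTracker {
    /** Nodes that have not responded yet. */
    private final Set<String> pending;

    /** Time the exchange started, in milliseconds. */
    private final long startMillis;

    /** How long to wait before treating silent nodes as non-responsive. */
    private final long timeoutMillis;

    ExchangeWaitTracker(Collection<String> nodeIds, long startMillis, long timeoutMillis) {
        this.pending = new TreeSet<>(nodeIds);
        this.startMillis = startMillis;
        this.timeoutMillis = timeoutMillis;
    }

    /** Called when a node's exchange message arrives. */
    synchronized void onResponse(String nodeId) {
        pending.remove(nodeId);
    }

    /** True once every node has responded. */
    synchronized boolean isDone() {
        return pending.isEmpty();
    }

    /**
     * Returns the nodes that should be dropped from the topology:
     * empty before the timeout elapses, otherwise every node still pending.
     */
    synchronized Set<String> nodesToDrop(long nowMillis) {
        if (nowMillis - startMillis < timeoutMillis)
            return Collections.emptySet();

        return new TreeSet<>(pending);
    }
}
```

The exchange future would call onResponse for each message it receives and periodically check nodesToDrop with the current time; any node returned there would be the candidate to 'kill'. The current time is passed in explicitly only to keep the sketch easy to test.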