Losing Singleton leader on rolling restart of cluster

Dave Brosius Wed, 24 Jan 2024 09:19:13 -0800

Hi folks, we have a cluster of nodes using Pekko 1.0.1 (Scala 2.13,
JDK 17) and a bunch of singleton actors. We had been using the netty
transport forever (first with Akka, then Pekko) without issues, and we
just switched to Artery.


What we see is that when nodes get repaved in a rolling restart, the
singleton leader is lost for some reason; or, more precisely, the old
leader becoming unavailable is not noticed (so singleton messages are
not processed). From a release point of view, this came with the Artery
change (although perhaps something else explains it, and this is just
an unfortunate correlation).

Rummaging around the docs, we saw the notes about using the Split Brain
Resolver, so we tried that:

pekko.cluster.downing-provider-class = "org.apache.pekko.cluster.sbr.SplitBrainResolverProvider"

and got

####<2024-01-19T03:25:18,717> <ERROR>
<ParmClusterSystem-pekko.actor.default-dispatcher-13>
<org.acme.pekko.cluster.Cluster> <s > <tg > <t > <u > <tr > - Cluster
Node [[pekko://[email protected]:2552]] - Couldn't join
seed nodes because of incompatible cluster configuration. It's
recommended to perform a full cluster shutdown in order to deploy this
new version. If a cluster shutdown isn't an option, you may want to
disable this protection by setting
'pekko.cluster.configuration-compatibility-check.enforce-on-join =
off'. Note that disabling it will allow the formation of a cluster
with nodes having incompatible configuration settings. This node will
be shutdown!

Which makes us think that perhaps the two are related. Is there any
logging we can use to determine what the incompatibility is? Or any
other suggestions as to how to debug this further?
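
For reference, here is a rough sketch of how the two settings named
above would sit together in nested HOCON form. That they live in
application.conf is an assumption about our setup. The enforce-on-join
line is copied from the error text and is shown commented out, since
the message only offers it as an option:

    pekko.cluster {
      # Split Brain Resolver as the downing provider (the setting we tried)
      downing-provider-class = "org.apache.pekko.cluster.sbr.SplitBrainResolverProvider"

      # Suggested by the error message; would turn off the join-time
      # configuration compatibility check.
      # configuration-compatibility-check.enforce-on-join = off
    }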

thanks,
dave

