Re: Losing Singleton leader on rolling restart of cluster

PJ Fanning Wed, 24 Jan 2024 10:50:50 -0800

Is the only difference in the config the SplitBrainResolverProvider,
and are you restarting the servers one by one, meaning that
the SplitBrainResolverProvider is the diff that is causing the
mismatched config issues?


If so, are you stuck with having to set
`pekko.cluster.configuration-compatibility-check.enforce-on-join =
off` ?

Otherwise, can you check the configs on your different machines to see
if you can spot a diff?
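
One way to spot such a diff is to load the resolved config from two nodes and compare it key by key. This is only a minimal sketch: `ConfigDiff` and its `diff` method are hypothetical names, it uses the Typesafe Config library that Pekko already depends on, and it assumes both configs are fully resolved (for example obtained from `ConfigFactory.load()` on each node, or parsed from a dump that contains no unresolved substitutions).

```scala
import scala.jdk.CollectionConverters._
import com.typesafe.config.{ Config, ConfigRenderOptions }

// Hypothetical helper, not part of Pekko.
object ConfigDiff {

  // Flatten a resolved Config into path -> rendered value.
  private def flatten(config: Config): Map[String, String] =
    config
      .entrySet()
      .asScala
      .map(e => e.getKey -> e.getValue.render(ConfigRenderOptions.concise()))
      .toMap

  // One line per path whose value differs or is missing on one side.
  def diff(left: Config, right: Config): Seq[String] = {
    val l = flatten(left)
    val r = flatten(right)
    (l.keySet ++ r.keySet).toSeq.sorted.flatMap { key =>
      (l.get(key), r.get(key)) match {
        case (a, b) if a == b => None
        case (a, b) =>
          Some(s"$key: ${a.getOrElse("<missing>")} != ${b.getOrElse("<missing>")}")
      }
    }
  }
}
```

To keep the output short, you could pass only a subtree, for example `left.getConfig("pekko.cluster")` and `right.getConfig("pekko.cluster")`. The same caution about sensitive values applies to anything this prints.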

We discourage the use of `pekko.log-config-on-start = on` because it
can log sensitive info. Pekko 1.0.2 has a fix that tries to mask some
of the sensitive info.

Changing the config in a rolling way to enable this may run into the
same issue with the config compatibility check at startup.

Is there any chance that you are using a mix of Akka and Pekko cluster
nodes? This is not yet supported. You would get similar error messages
about mismatched configs.

It might be worthwhile adding a custom config compat checker that
maybe logs more details about why the compat check fails.

```
pekko.cluster.configuration-compatibility-check.checkers {
  receptionist = "org.apache.pekko.cluster.typed.internal.receptionist.ClusterReceptionistConfigCompatChecker"
}
```
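
A minimal sketch of what such a logging checker could look like follows. The class name `LoggingDowningProviderChecker` is hypothetical, and the choice of `pekko.cluster.downing-provider-class` as the key to watch is an assumption based on the setting changed in this thread. It extends `org.apache.pekko.cluster.JoinConfigCompatChecker`, delegates the actual decision to `JoinConfigCompatChecker.fullMatch`, and only adds output. It is assumed that checkers are instantiated through a no-argument constructor.

```scala
import scala.collection.immutable
import com.typesafe.config.{ Config, ConfigRenderOptions }
import org.apache.pekko.cluster.{ ConfigValidation, Invalid, JoinConfigCompatChecker, Valid }

// Hypothetical checker: same verdict as fullMatch, plus details on stderr.
class LoggingDowningProviderChecker extends JoinConfigCompatChecker {

  // Assumption: this is the key of interest; add others as needed.
  override val requiredKeys: immutable.Seq[String] =
    immutable.Seq("pekko.cluster.downing-provider-class")

  private def show(config: Config, key: String): String =
    if (config.hasPath(key)) config.getValue(key).render(ConfigRenderOptions.concise())
    else "<missing>"

  override def check(toCheck: Config, actualConfig: Config): ConfigValidation = {
    val result = JoinConfigCompatChecker.fullMatch(requiredKeys, toCheck, actualConfig)
    result match {
      case Invalid(messages) =>
        // Replace System.err with your own logger if you have one.
        messages.foreach(m => System.err.println(s"Config compat check failed: $m"))
        requiredKeys.foreach { key =>
          System.err.println(
            s"  $key: toCheck=${show(toCheck, key)} actualConfig=${show(actualConfig, key)}")
        }
      case Valid => // nothing to report
    }
    result
  }
}
```

It would be registered by adding an entry with its fully qualified class name under `pekko.cluster.configuration-compatibility-check.checkers`, in the same way as the `receptionist` entry above. Only print keys that you know are not sensitive.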

On Wed, 24 Jan 2024 at 18:19, Dave Brosius <[email protected]> wrote:
>
> Hi folks, we have a cluster of nodes using Pekko (2.13 1.0.1, jdk17),
> and a bunch of singleton actors. We had been using netty forever
> (first with akka, then pekko), w/o issues, and we just switched to
> using artery.
>
> What we see is when nodes get repaved in a rolling restart fashion for
> some reason the singleton leader is lost, or more clearly the old
> leader becoming unavailable is not noticed. (so singleton messages are
> not processed). This from a release point of view came with the artery
> change (although perhaps there is something else explaining it - and this
> is just an unfortunate correlation).
>
> Rummaging around the docs we saw the notes about using the SplitBrain
> resolver and so we tried that,
>
> pekko.cluster.downing-provider-class =
> "org.apache.pekko.cluster.sbr.SplitBrainResolverProvider"
>
> and got
>
> ####<2024-01-19T03:25:18,717> <ERROR>
> <ParmClusterSystem-pekko.actor.default-dispatcher-13>
> <org.acme.pekko.cluster.Cluster> <s > <tg > <t > <u > <tr > - Cluster
> Node [[pekko://[email protected]:2552]] - Couldn't join
> seed nodes because of incompatible cluster configuration. It's
> recommended to perform a full cluster shutdown in order to deploy this
> new version. If a cluster shutdown isn't an option, you may want to
> disable this protection by setting
> 'pekko.cluster.configuration-compatibility-check.enforce-on-join =
> off'. Note that disabling it will allow the formation of a cluster
> with nodes having incompatible configuration settings. This node will
> be shutdown!
>
> Which makes us think that perhaps the two are related. Is there any
> logging we can use to determine what the incompatibility is? Or any
> other suggestions as to how to debug this further?
>
> thanks,
> dave
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
