[
https://issues.apache.org/jira/browse/ARTEMIS-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Francesco Nigro updated ARTEMIS-3345:
-------------------------------------
Summary: Shared-Nothing Replication Master loose Node ID on failed
fail-back (was: Shared-Nothing Replication Master loose its Node ID on failed
fail-back)
> Shared-Nothing Replication Master loose Node ID on failed fail-back
> -------------------------------------------------------------------
>
> Key: ARTEMIS-3345
> URL: https://issues.apache.org/jira/browse/ARTEMIS-3345
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker
> Affects Versions: 2.17.0
> Reporter: Francesco Nigro
> Assignee: Francesco Nigro
> Priority: Major
>
> A failing-back master forgets its Node ID and, on broker restart, having a
> different Node ID, can become live without searching for any existing live
> broker with its previous Node ID.
> This happens because of the following mechanics in {{SharedNothingBackupActivation}}:
> # {{SharedNothingBackupActivation::init}} calls
> {{activeMQServer.resetNodeManager}}, which re-creates a {{NodeManager}} with
> {{replicatedBackup == true}}
> # {{SharedNothingBackupActivation::run}} then does:
> {code:java}
> // move all data away:
> activeMQServer.getNodeManager().stop();
>
> activeMQServer.moveServerData(replicaPolicy.getMaxSavedReplicatedJournalsSize());
> activeMQServer.getNodeManager().start();
> {code}
> The server data rotation simply cleans up everything on the data path,
> including the lock file.
> {{NodeManager::start}}, because {{replicatedBackup == true}}, skips
> setting up a new lock file (so no lock file exists at this point)
> # this broker sets an in-memory Node ID after a successful sync with
> the live, using {{NodeManager::setNodeID}}
> # *if* this broker fails over (or fails back, given that it is a master),
> {{activeMQServer.getNodeManager().stopBackup()}} sets up the lock
> file with the previously set Node ID, see
> {code:java}
> @Override
> public void stopBackup() throws NodeManagerException {
>    if (replicatedBackup && getNodeId() != null) {
>       try {
>          setUpServerLockFile();
>       } catch (IOException e) {
>          throw new NodeManagerException(e);
>       }
>    }
>    super.stopBackup();
> }
> {code}
> This process shows that if anything goes wrong before the Node ID is
> written to durable storage, either because the broker was unable to become
> live (no majority, or the live is still alive) or because of a restart with
> unlucky timing, the broker won't have any lock file and simply forgets its
> original Node ID.
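
The sequence described above can be sketched as a small toy model. This is not the Artemis code: {{ToyDataDir}}, {{ToyNodeManager}} and {{nodeIdAfterRestart}} are hypothetical names invented for illustration. The lock file is reduced to a single nullable field, and the restart is assumed to be a plain non-backup start that generates a fresh Node ID when no lock file exists.

```java
import java.util.UUID;

/** Toy model of the sequence described above; not the Artemis implementation. */
final class NodeIdLossSketch {

    /** Stand-in for the data directory: null means "no lock file on disk". */
    static final class ToyDataDir {
        String lockFileNodeId;

        /** Models the data rotation: everything is moved away, lock file included. */
        void moveServerData() {
            lockFileNodeId = null;
        }
    }

    /** Hypothetical node manager mirroring the behaviour the report describes. */
    static final class ToyNodeManager {
        private final ToyDataDir dir;
        private final boolean replicatedBackup;
        private String nodeId; // in-memory only until written to the lock file

        ToyNodeManager(ToyDataDir dir, boolean replicatedBackup) {
            this.dir = dir;
            this.replicatedBackup = replicatedBackup;
        }

        void start() {
            if (replicatedBackup) {
                return; // a replicated backup skips lock file setup
            }
            if (dir.lockFileNodeId == null) {
                // Assumption: with no lock file, a fresh Node ID is generated.
                dir.lockFileNodeId = UUID.randomUUID().toString();
            }
            nodeId = dir.lockFileNodeId;
        }

        void setNodeId(String id) {
            nodeId = id;
        }

        void stopBackup() {
            if (replicatedBackup && nodeId != null) {
                dir.lockFileNodeId = nodeId; // first durable write of the Node ID
            }
        }

        String getNodeId() {
            return nodeId;
        }
    }

    /** Returns the Node ID the broker ends up with after a restart. */
    static String nodeIdAfterRestart(String originalNodeId, boolean failBackCompleted) {
        ToyDataDir dir = new ToyDataDir();
        dir.lockFileNodeId = originalNodeId;

        ToyNodeManager backup = new ToyNodeManager(dir, true);
        dir.moveServerData();             // step 2: rotation removes the lock file
        backup.start();                   // step 2: no new lock file is created
        backup.setNodeId(originalNodeId); // step 3: Node ID held in memory only
        if (failBackCompleted) {
            backup.stopBackup();          // step 4: lock file finally written
        }

        // Restart, modelled as a plain (non-backup) start on the same directory.
        ToyNodeManager restarted = new ToyNodeManager(dir, false);
        restarted.start();
        return restarted.getNodeId();
    }

    public static void main(String[] args) {
        System.out.println("completed fail-back keeps ID: "
                + "node-A".equals(nodeIdAfterRestart("node-A", true)));
        System.out.println("failed fail-back keeps ID: "
                + "node-A".equals(nodeIdAfterRestart("node-A", false)));
    }
}
```

In this model the Node ID survives a restart only when {{stopBackup()}} has run. Any interruption between the data rotation and that call leaves the directory without a lock file, so the restarted broker comes up with a different Node ID.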
--
This message was sent by Atlassian Jira
(v8.3.4#803005)