[ 
https://issues.apache.org/jira/browse/ARTEMIS-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francesco Nigro updated ARTEMIS-3345:
-------------------------------------
    Summary: Shared-Nothing Replication Master loose Node ID on failed 
fail-back  (was: Shared-Nothing Replication Master loose its Node ID on failed 
fail-back)

> Shared-Nothing Replication Master loose Node ID on failed fail-back
> -------------------------------------------------------------------
>
>                 Key: ARTEMIS-3345
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3345
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.17.0
>            Reporter: Francesco Nigro
>            Assignee: Francesco Nigro
>            Priority: Major
>
> A failing-back master forgets its Node ID and, after a broker restart, comes 
> up with a different Node ID, so it can become live without searching for an 
> existing live that holds its previous Node ID.
> This happens because of the following sequence in 
> {{SharedNothingBackupActivation}}:
>  # {{SharedNothingBackupActivation::init}} calls 
> {{activeMQServer.resetNodeManager}}, which re-creates a {{NodeManager}} with 
> {{replicatedBackup == true}}
>  # {{SharedNothingBackupActivation::run}} then executes
> {code:java}
>          // move all data away:
>          activeMQServer.getNodeManager().stop();
>          activeMQServer.moveServerData(replicaPolicy.getMaxSavedReplicatedJournalsSize());
>          activeMQServer.getNodeManager().start();
> {code}
> The server data rotation cleans up everything on the data path, including 
> the lock file.
> {{NodeManager::start}}, because {{replicatedBackup == true}}, then skips 
> setting up a new lock file (no lock file exists at this point)
>  # this broker sets an in-memory Node ID after a successful sync with the 
> live, using {{NodeManager::setNodeID}}
>  # *if* this broker fails over (or fails back, given that it is a master), 
> {{activeMQServer.getNodeManager().stopBackup()}} sets up the lock file with 
> the previously set Node ID, see
> {code:java}
>    @Override
>    public void stopBackup() throws NodeManagerException {
>       if (replicatedBackup && getNodeId() != null) {
>          try {
>             setUpServerLockFile();
>          } catch (IOException e) {
>             throw new NodeManagerException(e);
>          }
>       }
>       super.stopBackup();
>    }
> {code}
> This sequence shows that if anything goes wrong before the Node ID is 
> written to durable storage, either because the broker was unable to become 
> live (no majority, or the live is still alive) or because of a restart with 
> unlucky timing, the broker is left without a lock file and forgets its 
> original Node ID.
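
The sequence described above can be reproduced with a small toy model. This is only a sketch under stated assumptions: {{ToyNodeManager}}, the {{server.lock}} file name, and {{restartAfterFailedFailback}} are hypothetical names invented for illustration and are not the Artemis classes or file layout. It also assumes that a master starting without a lock file simply generates a fresh Node ID.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Toy model of the reported sequence; none of these types are Artemis classes.
public class NodeIdLossSketch {

    // Hypothetical stand-in for a file-backed node manager.
    static final class ToyNodeManager {
        private final Path lockFile;
        private final boolean replicatedBackup;
        private String nodeId; // held in memory until persist() is called

        ToyNodeManager(Path dataDir, boolean replicatedBackup) {
            this.lockFile = dataDir.resolve("server.lock"); // assumed file name
            this.replicatedBackup = replicatedBackup;
        }

        // A replicated backup does not create a lock file on start.
        void start() throws IOException {
            if (Files.exists(lockFile)) {
                nodeId = new String(Files.readAllBytes(lockFile), StandardCharsets.UTF_8);
            } else if (!replicatedBackup) {
                nodeId = UUID.randomUUID().toString(); // assumed: fresh ID when no file
                persist();
            }
        }

        // Node ID learned from the live after sync: memory only.
        void setNodeId(String id) {
            nodeId = id;
        }

        // Only here does a replicated backup write the Node ID to disk.
        void stopBackup() throws IOException {
            if (replicatedBackup && nodeId != null) {
                persist();
            }
        }

        String getNodeId() {
            return nodeId;
        }

        private void persist() throws IOException {
            Files.write(lockFile, nodeId.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Returns the Node ID the broker has after restarting as a master.
    static String restartAfterFailedFailback(Path dataDir, String originalId,
                                             boolean failbackCompleted) throws IOException {
        Path lockFile = dataDir.resolve("server.lock");
        // Starting point: a master whose lock file holds its original Node ID.
        Files.write(lockFile, originalId.getBytes(StandardCharsets.UTF_8));

        // Steps 1 and 2: re-create the manager as a replicated backup,
        // rotate the data away (which removes the lock file), then start.
        ToyNodeManager backup = new ToyNodeManager(dataDir, true);
        Files.deleteIfExists(lockFile);
        backup.start();

        // Step 3: Node ID set in memory after syncing with the live.
        backup.setNodeId(originalId);

        // Step 4: reached only if the fail-back actually completes.
        if (failbackCompleted) {
            backup.stopBackup();
        }

        // Broker restart: comes back as a master reading whatever is on disk.
        ToyNodeManager restarted = new ToyNodeManager(dataDir, false);
        restarted.start();
        return restarted.getNodeId();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("node-id-sketch");
        String original = "original-node-id";
        System.out.println("fail-back completed:   "
                + restartAfterFailedFailback(dir, original, true));
        System.out.println("fail-back interrupted: "
                + restartAfterFailedFailback(dir, original, false));
    }
}
```

In this model the first call keeps the original Node ID because {{stopBackup}} wrote it to disk, while the second call returns a freshly generated ID because nothing was persisted before the restart.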



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
