[jira] [Updated] (ARTEMIS-3345) Shared-Nothing Replication Master loose Node ID on failed fail-back

Francesco Nigro (Jira) Mon, 14 Jun 2021 09:49:05 -0700


     [ https://issues.apache.org/jira/browse/ARTEMIS-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Francesco Nigro updated ARTEMIS-3345:
-------------------------------------
    Description: 
A failing-back master forgets its Node ID and, on broker restart, comes up with a 
different Node ID, so it can become live without searching for any existing live 
broker that uses its previous Node ID.

This happens because of the following mechanics in {{SharedNothingBackupActivation}}:
 # {{SharedNothingBackupActivation::init}} calls 
{{activeMQServer.resetNodeManager}}, which re-creates the {{NodeManager}} with 
{{replicatedBackup == true}}
 # {{SharedNothingBackupActivation::run}} then executes
{code:java}
         // move all data away:
         activeMQServer.getNodeManager().stop();
         activeMQServer.moveServerData(replicaPolicy.getMaxSavedReplicatedJournalsSize());
         activeMQServer.getNodeManager().start();
{code}
The server data rotation cleans up everything on the data path, including 
the lock file.
Because {{replicatedBackup == true}}, {{NodeManager::start}} skips 
setting up a new lock file (there is no lock file at this point and, as a 
consequence, *no durable Node ID*), see
{code:java}
   @Override
   public synchronized void start() throws Exception {
      if (isStarted()) {
         return;
      }
      if (!replicatedBackup) {
         setUpServerLockFile();
      }

      super.start();
   }
{code}
 # the broker sets an in-memory Node ID after a successful sync with the live, 
using {{NodeManager::setNodeID}}
 # *if* the broker is going to fail over (or fail back, given that it is a master), 
{{activeMQServer.getNodeManager().stopBackup()}} sets up a new lock file with 
the previously set Node ID, see
{code:java}
   @Override
   public void stopBackup() throws NodeManagerException {
      if (replicatedBackup && getNodeId() != null) {
         try {
            setUpServerLockFile();
         } catch (IOException e) {
            throw new NodeManagerException(e);
         }
      }
      super.stopBackup();
   }
{code}
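
The four steps above can be condensed into a toy model that tracks when a durable Node ID exists. This is only a sketch under stated assumptions, not the Artemis implementation: {{ToyNodeManager}}, {{hasDurableNodeId}}, {{runSequence}} and the {{server.lock}} file name are hypothetical stand-ins, and the lock file is reduced to a plain file holding the Node ID.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

/** Toy model of the lock-file lifecycle described above; NOT the Artemis NodeManager. */
public class LockFileLifecycleSketch {

   /** Hypothetical stand-in for a file-based node manager. */
   static final class ToyNodeManager {
      private final Path lockFile;
      private final boolean replicatedBackup;
      private String nodeId; // in-memory only until persist() runs

      ToyNodeManager(Path dataDir, boolean replicatedBackup) {
         this.lockFile = dataDir.resolve("server.lock"); // assumed file name
         this.replicatedBackup = replicatedBackup;
      }

      /** Mirrors the quoted start(): a replicated backup skips the lock file. */
      void start() {
         if (!replicatedBackup) {
            persist();
         }
      }

      /** Mirrors step 3: the Node ID is only held in memory. */
      void setNodeId(String nodeId) {
         this.nodeId = nodeId;
      }

      /** Mirrors the quoted stopBackup(): first point where the ID becomes durable. */
      void stopBackup() {
         if (replicatedBackup && nodeId != null) {
            persist();
         }
      }

      boolean hasDurableNodeId() {
         return Files.exists(lockFile);
      }

      private void persist() {
         try {
            if (nodeId == null) {
               nodeId = UUID.randomUUID().toString();
            }
            Files.write(lockFile, nodeId.getBytes(StandardCharsets.UTF_8));
         } catch (IOException e) {
            throw new UncheckedIOException(e);
         }
      }
   }

   /** Returns whether a durable Node ID exists after start(), after sync, after stopBackup(). */
   static boolean[] runSequence() {
      try {
         // An empty temp directory stands in for the data path after rotation.
         Path dataDir = Files.createTempDirectory("toy-broker");
         ToyNodeManager nodeManager = new ToyNodeManager(dataDir, true);

         nodeManager.start();
         boolean afterStart = nodeManager.hasDurableNodeId();

         nodeManager.setNodeId(UUID.randomUUID().toString());
         boolean afterSync = nodeManager.hasDurableNodeId();

         nodeManager.stopBackup();
         boolean afterStopBackup = nodeManager.hasDurableNodeId();

         return new boolean[] {afterStart, afterSync, afterStopBackup};
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
   }

   public static void main(String[] args) {
      boolean[] durable = runSequence();
      System.out.println("durable after start():      " + durable[0]);
      System.out.println("durable after setNodeId():  " + durable[1]);
      System.out.println("durable after stopBackup(): " + durable[2]);
   }
}
```

In this model the Node ID only reaches the disk in the last step, so any interruption between {{start()}} and {{stopBackup()}} leaves the data path without a lock file.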

This process shows that if anything goes wrong before the Node ID 
is written to durable storage, either because the broker was 
unable to become live (no majority, or the live is still alive) or because of a 
restart with unlucky timing, the broker won't have any lock file and simply 
forgets its original Node ID.
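
The consequence on restart can be illustrated with a small, self-contained sketch. It assumes, purely for illustration, that a restarting broker reuses the Node ID found in its lock file and otherwise generates a fresh one; {{loadOrCreateNodeId}}, {{failBackThenRestart}} and the {{server.lock}} file name are hypothetical and not taken from the Artemis code base.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

/** Toy illustration of the Node ID loss on restart; NOT the Artemis restart logic. */
public class NodeIdLossSketch {

   /** Assumed restart behaviour: reuse the persisted Node ID, else create a new one. */
   static String loadOrCreateNodeId(Path lockFile) {
      try {
         if (Files.exists(lockFile)) {
            return new String(Files.readAllBytes(lockFile), StandardCharsets.UTF_8);
         }
         String fresh = UUID.randomUUID().toString();
         Files.write(lockFile, fresh.getBytes(StandardCharsets.UTF_8));
         return fresh;
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
   }

   /**
    * Simulates a fail-back followed by a restart.
    * Returns {originalNodeId, nodeIdAfterRestart}.
    */
   static String[] failBackThenRestart(boolean persistedBeforeRestart) {
      try {
         Path dataDir = Files.createTempDirectory("toy-broker");
         Path lockFile = dataDir.resolve("server.lock"); // assumed file name

         // First boot as live: the Node ID is created and persisted.
         String original = loadOrCreateNodeId(lockFile);

         // Fail-back: data rotation removes the lock file.
         Files.delete(lockFile);

         // After syncing with the live, the Node ID is known again, in memory only.
         String inMemory = original;

         // Only a completed stopBackup()-like step writes it back to disk.
         if (persistedBeforeRestart) {
            Files.write(lockFile, inMemory.getBytes(StandardCharsets.UTF_8));
         }

         // Restart: the broker resolves its identity from the data path.
         String afterRestart = loadOrCreateNodeId(lockFile);
         return new String[] {original, afterRestart};
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
   }

   public static void main(String[] args) {
      String[] failed = failBackThenRestart(false);
      String[] completed = failBackThenRestart(true);
      System.out.println("failed fail-back keeps Node ID:    " + failed[0].equals(failed[1]));
      System.out.println("completed fail-back keeps Node ID: " + completed[0].equals(completed[1]));
   }
}
```

Under these assumptions, only the run where the Node ID was written back before the restart keeps its original identity; the other run comes up with a new Node ID, which matches the failure described above.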






> Shared-Nothing Replication Master loose Node ID on failed fail-back
> -------------------------------------------------------------------
>
>                 Key: ARTEMIS-3345
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3345
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.17.0
>            Reporter: Francesco Nigro
>            Assignee: Francesco Nigro
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
