Torsten Mielke created AMQ-5568:
-----------------------------------

             Summary: deleting lock file on broker shut down can take a master 
broker down
                 Key: AMQ-5568
                 URL: https://issues.apache.org/jira/browse/AMQ-5568
             Project: ActiveMQ
          Issue Type: Bug
          Components: Broker, Message Store
    Affects Versions: 5.11.0
            Reporter: Torsten Mielke


This problem can occur only in a shared file system master/slave setup. 
I can reproduce it reliably on an NFSv4 mount using a persistence adapter 
configuration like 

{code}
<levelDB directory="/nfs/activemq/data/leveldb" lockKeepAlivePeriod="5000">
  <locker>
    <shared-file-locker lockAcquireSleepInterval="10000"/>
  </locker>
</levelDB>
{code}

However, the problem is also reproducible using kahaDB.
Consider two broker instances competing for the lock on the shared storage 
(e.g. leveldb or kahadb). Let's say broker A becomes master and broker B slave.

If broker A loses access to the NFS share, it will shut down. As part of 
shutting down, it tries to delete the lock file of the persistence adapter. 
Since the NFS share is gone, all file I/O calls hang for a good while before 
returning errors. 

In the meantime the slave broker B (not affected by the NFS problem) grabs the 
lock and becomes master.

If the NFS mount is restored while broker A (the previous master) still hangs 
on the file I/O operations (as part of its shutdown routine), the attempt to 
delete the lock file will finally succeed and broker A shuts down. 

Deleting the lock file, however, also affects the new master broker B, which 
periodically runs a keepAlive() check on the lock. That check verifies that the 
file still exists and that the FileLock is still valid. Because the lock file 
was deleted, keepAlive() fails on broker B and that broker shuts down as well. 
The overall result is that both broker instances have shut down.
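
To illustrate the failure mode, here is a minimal sketch of a lock holder whose 
keepAlive() check tests both conditions mentioned above. It is not the actual 
ActiveMQ locker code; the class name LockKeepAliveSketch and its methods are 
hypothetical, and only standard java.io / java.nio calls are used.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

// Hypothetical sketch, NOT the ActiveMQ implementation.
class LockKeepAliveSketch {
    private final File lockFile;
    private RandomAccessFile handle;
    private FileLock lock;

    LockKeepAliveSketch(File lockFile) {
        this.lockFile = lockFile;
    }

    // Try to take an exclusive lock on the lock file.
    // Returns false if another process already holds the lock.
    boolean acquire() throws IOException {
        handle = new RandomAccessFile(lockFile, "rw");
        lock = handle.getChannel().tryLock();
        if (lock == null) {
            handle.close();
            handle = null;
            return false;
        }
        return true;
    }

    // The check described above: the FileLock must still be valid
    // AND the lock file must still exist on the shared storage.
    boolean keepAlive() {
        return lock != null && lock.isValid() && lockFile.exists();
    }

    // Release the lock and close the file without deleting it.
    void release() throws IOException {
        if (lock != null && lock.isValid()) {
            lock.release();
        }
        if (handle != null) {
            handle.close();
        }
        lock = null;
        handle = null;
    }
}
```

In this sketch, if some other party deletes the lock file while the holder still 
has a valid FileLock, lockFile.exists() returns false and keepAlive() fails even 
though the lock itself was never lost. That mirrors what happens to broker B 
when broker A's delayed delete finally goes through.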

Using restartAllowed=true is not an option either, as this can cause other 
problems in an NFS-based master/slave setup.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
