[ https://issues.apache.org/jira/browse/AMQ-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Torsten Mielke updated AMQ-5568:
--------------------------------
    Summary: Deleting lock file on broker shut down can take a master broker down  (was: deleting lock file on broker shut down can take a master broker down)

> Deleting lock file on broker shut down can take a master broker down
> --------------------------------------------------------------------
>
>                 Key: AMQ-5568
>                 URL: https://issues.apache.org/jira/browse/AMQ-5568
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Broker, Message Store
>    Affects Versions: 5.11.0
>            Reporter: Torsten Mielke
>              Labels: persistence
>
> This problem may occur only on a shared file system master/slave setup.
> I can reproduce it reliably on an NFSv4 mount using a persistence adapter
> configuration like
> {code}
> <levelDB directory="/nfs/activemq/data/leveldb" lockKeepAlivePeriod="5000">
>   <locker>
>     <shared-file-locker lockAcquireSleepInterval="10000"/>
>   </locker>
> </levelDB>
> {code}
> However, the problem is also reproducible using kahaDB.
> Two broker instances compete for the lock on the shared storage (e.g.
> leveldb or kahadb). Let's say broker A becomes master and broker B slave.
> If broker A loses access to the NFS share, it will shut down. As part of
> shutting down, it tries to delete the lock file of the persistence adapter.
> Since the NFS share is gone, all file i/o calls hang for a good while before
> returning errors, so the broker shut down gets delayed.
> In the meantime the slave broker B (not affected by the NFS problem) grabs
> the lock and becomes master.
> If the NFS mount is restored while broker A (the previous master) still hangs
> on the file i/o operations (as part of its shutdown routine), the attempt to
> delete the persistence adapter lock file will finally succeed and broker A
> shuts down.
> Deleting the lock file, however, also affects the new master broker B, which
> periodically runs a keepAlive() check on the lock. That check verifies that
> the file still exists and that the FileLock is still valid. As the lock file
> got deleted, keepAlive() fails on broker B and that broker shuts down as well.
> The overall result is that both broker instances have shut down despite an
> initially successful failover.
> Using restartAllowed=true is not an option either, as this can cause other
> problems in an NFS based master/slave setup.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
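
To illustrate the keepAlive() behaviour described in the issue, here is a minimal Java sketch of a lock holder whose periodic check requires both that the lock file still exists and that its FileLock is still valid. The class name LockFileSketch and its methods are hypothetical and are not taken from ActiveMQ's locker code; the sketch only assumes the two conditions named in the report.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

// Hypothetical illustration only; this is not ActiveMQ's actual locker.
final class LockFileSketch implements AutoCloseable {
    private final File file;
    private final RandomAccessFile raf;
    private final FileLock lock;

    // Opens the lock file and takes an exclusive lock on it, or fails.
    LockFileSketch(File file) throws IOException {
        this.file = file;
        this.raf = new RandomAccessFile(file, "rw");
        FileLock acquired;
        try {
            acquired = raf.getChannel().tryLock();
        } catch (IOException | RuntimeException e) {
            raf.close();
            throw e;
        }
        if (acquired == null) {
            raf.close();
            throw new IOException("lock is held by another process: " + file);
        }
        this.lock = acquired;
    }

    // Mirrors the check described in the report: the lock only counts as
    // alive if the file still exists AND the FileLock is still valid.
    boolean keepAlive() {
        return file.exists() && lock.isValid();
    }

    // Releases the lock without deleting the file.
    @Override
    public void close() throws IOException {
        if (lock.isValid()) {
            lock.release();
        }
        raf.close();
    }
}
```

On a POSIX file system, if another process deletes the file while this instance still holds the lock, isValid() keeps returning true but file.exists() turns false, so keepAlive() fails. That corresponds to the situation of broker B after broker A's delayed delete finally goes through.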