[jira] [Comment Edited] (AMQ-5568) deleting lock file on broker shut down can take a master broker down

Torsten Mielke (JIRA) Fri, 06 Feb 2015 05:06:02 -0800

    [ 
https://issues.apache.org/jira/browse/AMQ-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309092#comment-14309092
 ]


Torsten Mielke edited comment on AMQ-5568 at 2/6/15 1:05 PM:
-------------------------------------------------------------

The keepAlive() check is needed due to 
[AMQ-4705|https://issues.apache.org/jira/browse/AMQ-4705], otherwise you may 
get two master broker instances.


was (Author: tmielke):
The keepAlive ping is needed due to 
[AMQ-4705|https://issues.apache.org/jira/browse/AMQ-4705], otherwise you may 
get two master broker instances.

> deleting lock file on broker shut down can take a master broker down
> --------------------------------------------------------------------
>
>                 Key: AMQ-5568
>                 URL: https://issues.apache.org/jira/browse/AMQ-5568
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Broker, Message Store
>    Affects Versions: 5.11.0
>            Reporter: Torsten Mielke
>              Labels: persistence
>
> This problem may only occur on a shared file system master/slave setup. 
> I can reproduce it reliably on an NFSv4 mount using a persistence adapter 
> configuration like 
> {code}
> <levelDB directory="/nfs/activemq/data/leveldb" lockKeepAlivePeriod="5000">
>   <locker>
>     <shared-file-locker lockAcquireSleepInterval="10000"/>
>   </locker>
> </levelDB>
> {code}
> However, the problem is also reproducible using KahaDB.
> Two broker instances compete for the lock on the shared storage (e.g. 
> LevelDB or KahaDB). Let's say broker A becomes master and broker B slave.
> If broker A loses access to the NFS share, it will shut down. As part of 
> shutting down, it tries to delete the lock file of the persistence adapter. 
> Since the NFS share is gone, all file I/O calls hang for a good while before 
> returning errors, so the broker shutdown is delayed.
> In the meantime the slave broker B (not affected by the NFS problem) grabs 
> the lock and becomes master.
> If the NFS mount is restored while broker A (the previous master) still hangs 
> on the file i/o operations (as part of its shutdown routine), the attempt to 
> delete the persistence adapter lock file will finally succeed and broker A 
> shuts down. 
> Deleting the lock file, however, also affects the new master broker B, which 
> periodically runs a keepAlive() check on the lock. That check verifies that 
> the file still exists and that the FileLock is still valid. As the lock file 
> was deleted, keepAlive() fails on broker B and that broker shuts down as well. 
> The overall result is that both broker instances have shut down despite an 
> initially successful failover.
> Using restartAllowed=true is not an option either, as this can cause other 
> problems in an NFS-based master/slave setup.
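
For illustration, the keepAlive() check described above (the lock file must still exist and the FileLock must still be valid) can be sketched roughly as follows. This is a minimal sketch using only the standard java.nio API, not ActiveMQ's actual locker code; the class name LockKeepAliveSketch and the helper's signature are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

// Hypothetical sketch, not ActiveMQ code: models a keepAlive() check that
// fails as soon as the lock file disappears or the FileLock becomes invalid.
final class LockKeepAliveSketch {

    // Returns true only if the lock file still exists on disk AND the
    // FileLock held by this process is still valid.
    static boolean keepAlive(File lockFile, FileLock lock) {
        return lockFile.exists() && lock != null && lock.isValid();
    }

    public static void main(String[] args) throws IOException {
        File lockFile = File.createTempFile("broker", ".lock");
        try (RandomAccessFile raf = new RandomAccessFile(lockFile, "rw")) {
            // Plays the role of broker B, the new master holding the lock.
            FileLock lock = raf.getChannel().tryLock();
            System.out.println("before delete: " + keepAlive(lockFile, lock));

            // Plays the role of broker A deleting the lock file during its
            // delayed shutdown. Whether deleting an open file succeeds
            // depends on the platform and file system.
            boolean deleted = lockFile.delete();
            System.out.println("deleted: " + deleted);
            System.out.println("after delete: " + keepAlive(lockFile, lock));
        } finally {
            lockFile.delete(); // clean up if the delete above did not succeed
        }
    }
}
```

Where the delete succeeds, the FileLock object itself can still report as valid, so it is the existence test that makes the check fail on the new master.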



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
