[jira] [Commented] (AMQ-5568) Deleting lock file on broker shut down can take a master broker down

Erik Wramner (JIRA) Fri, 11 Sep 2015 12:45:13 -0700

    [ 
https://issues.apache.org/jira/browse/AMQ-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741440#comment-14741440
 ]


Erik Wramner commented on AMQ-5568:
-----------------------------------

I'm surprised that I haven't seen others report the same issue, but the unit 
test LockFileTest#testNoDeleteOnUnlockIfNotLocked always fails for me on 
Windows. It works on Linux. The problem is that Windows refuses to delete open 
files, and the lock file is still open when the test calls lockFile.delete(). 
The call returns false and does nothing, so the no-longer-valid check fails: 
the file still exists and the lock is still valid.

Am I the only one seeing this, or should I fix the test (check the return code 
from delete and skip the rest of the test on failure)?
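
A minimal sketch of what I mean, written as plain Java rather than as the 
actual LockFileTest code. The class, enum and method names are hypothetical 
and only illustrate the idea: try to delete the file while it is open and 
locked, and treat a refused delete as "skip" instead of as a failure. In a 
JUnit 4 test the same check could probably be expressed with 
Assume.assumeTrue(lockFile.delete()).

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.channels.FileLock;

// Hypothetical illustration, not part of ActiveMQ.
final class DeleteOpenFileSketch {

    /** Illustrative outcomes: the platform refused the delete, or the file is gone. */
    enum Outcome { SKIPPED_DELETE_REFUSED, FILE_GONE }

    /** Creates a scratch file to stand in for the lock file. */
    static File newTempFile() {
        try {
            File f = File.createTempFile("lock-sketch", ".tmp");
            f.deleteOnExit();
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Opens and locks the file, then tries to delete it while it is still open.
     * On platforms that refuse to delete open files, delete() returns false
     * and the caller should skip the "lock is no longer valid" assertions.
     */
    static Outcome deleteWhileOpen(File file) {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileLock lock = raf.getChannel().tryLock()) {
            if (!file.delete()) {
                // The file is still there, so there is nothing further to verify.
                return Outcome.SKIPPED_DELETE_REFUSED;
            }
            // The delete succeeded while the file was open; the rest of the
            // test (checking that the lock is no longer usable) can proceed.
            return Outcome.FILE_GONE;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling DeleteOpenFileSketch.deleteWhileOpen(DeleteOpenFileSketch.newTempFile()) 
should report which of the two branches the current platform takes.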

> Deleting lock file on broker shut down can take a master broker down
> --------------------------------------------------------------------
>
>                 Key: AMQ-5568
>                 URL: https://issues.apache.org/jira/browse/AMQ-5568
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Broker, Message Store
>    Affects Versions: 5.11.0
>            Reporter: Torsten Mielke
>            Assignee: Gary Tully
>              Labels: persistence
>             Fix For: 5.12.0
>
>
> This problem may only occur on a shared file system master/slave setup. 
> I can reproduce it reliably on an NFSv4 mount using a persistence adapter 
> configuration like 
> {code}
> <levelDB directory="/nfs/activemq/data/leveldb" lockKeepAlivePeriod="5000">
>   <locker>
>     <shared-file-locker lockAcquireSleepInterval="10000"/>
>   </locker>
> </levelDB>
> {code}
> However the problem is also reproducible using kahaDB.
> Two broker instances compete for the lock on the shared storage (e.g. 
> leveldb or kahadb). Let's say broker A becomes master and broker B slave.
> If broker A loses access to the NFS share, it will shut down. As part of 
> shutting down, it tries to delete the lock file of the persistence adapter. Now 
> since the NFS share is gone, all file i/o calls hang for a good while before 
> returning errors. As such the broker shutdown gets delayed.
> In the meantime the slave broker B (not affected by the NFS problem) grabs 
> the lock and becomes master.
> If the NFS mount is restored while broker A (the previous master) still hangs 
> on the file i/o operations (as part of its shutdown routine), the attempt to 
> delete the persistence adapter lock file will finally succeed and broker A 
> shuts down. 
> Deleting the lock file however also affects the new master broker B who 
> periodically runs a keepAlive() check on the lock. That check verifies the 
> file still exists and the FileLock is still valid. As the lock file got 
> deleted, keepAlive() fails on broker B and that broker shuts down as well. 
> The overall result is that both broker instances have shut down despite an 
> initially successful failover.
> Using restartAllowed=true is not an option either as this can cause other 
> problems in an NFS based master/slave setup.
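
For reference, the keepAlive() check described in the quoted issue (the file 
still exists and the FileLock is still valid) can be sketched roughly as 
below. This is an assumption-laden illustration, not the broker's actual 
implementation: the class and method names are hypothetical, and only 
java.io / java.nio calls are used.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.channels.FileLock;

// Hypothetical illustration, not the ActiveMQ locker code.
final class KeepAliveSketch {

    /** The lock counts as alive only if it is still valid AND the file still exists. */
    static boolean keepAlive(File lockFile, FileLock lock) {
        return lock != null && lock.isValid() && lockFile.exists();
    }

    /**
     * Simulates another process deleting the lock file while this one holds
     * the lock. Returns {aliveBefore, deleteSucceeded, aliveAfter}.
     */
    static boolean[] simulateExternalDelete() {
        try {
            File lockFile = File.createTempFile("keepalive-sketch", ".lock");
            lockFile.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(lockFile, "rw")) {
                FileLock lock = raf.getChannel().tryLock();
                boolean aliveBefore = keepAlive(lockFile, lock);
                // Stands in for the old master finally deleting the file on
                // shutdown; may be refused on platforms that do not allow
                // deleting open files.
                boolean deleted = lockFile.delete();
                boolean aliveAfter = keepAlive(lockFile, lock);
                return new boolean[] { aliveBefore, deleted, aliveAfter };
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Where the delete goes through, aliveAfter is expected to be false even 
though the FileLock object itself is untouched, which is the failure mode 
that takes the new master down.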



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
