[ 
https://issues.apache.org/jira/browse/AMQ-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296709#comment-14296709
 ] 

Heikki Manninen edited comment on AMQ-5549 at 1/29/15 12:30 PM:
----------------------------------------------------------------

The following combination seems to work IF the network outage between the NFS 
client(s) and the NFS server is short (a couple of seconds):

* Grace Time 90s, Lease Time 60s
* lockKeepAlivePeriod=5000
* lockAcquireSleepInterval=15000

In this case the master broker is able to renew the lock and continue operating 
and the slave broker fails to get the lock.
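
As a rough sketch, the lock settings above would look something like the 
following in activemq.xml. This assumes lockKeepAlivePeriod is set on the 
kahaDB element and lockAcquireSleepInterval on the shared-file-locker, and it 
reuses the directory from the issue description; the exact file used in the 
test is not shown in this comment.

{code:xml}
        <persistenceAdapter>
            <!-- Assumption: lockKeepAlivePeriod (ms) is an attribute of the
                 persistence adapter; it controls how often the master
                 re-checks that it still holds the lock. -->
            <kahaDB directory="${activemq.data}/kahadb"
                    lockKeepAlivePeriod="5000">
              <locker>
                <!-- Interval (ms) between the slave's attempts to acquire the lock. -->
                <shared-file-locker lockAcquireSleepInterval="15000"/>
              </locker>
            </kahaDB>
        </persistenceAdapter>
{code}

The NFS grace and lease times are configured on the NFS server, not in this 
file.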

However, if the network outage is significantly longer (tested with various 
durations between 30 and 300 seconds), both brokers are able to get the lock on 
the file and start working simultaneously, even though the master broker's 
dmesg shows the following message after the outage:

"NFS: nfs4_reclaim_open_state: Lock reclaim failed!"

It seems that this happens if the lock reclaiming (keepAlive?) operation on the 
master broker is blocked long enough for the NFS server lease timeout to pass. 
In this case the slave is able to claim the lock (if its NFS filesystem stops 
blocking earlier), and after the master stops blocking it continues to operate 
even though its NFS client reports "Lock reclaim failed!".

It also seems that the time it takes for an individual NFS client to recover 
from blocking I/O varies between clients.


was (Author: heikki_m):
The following combination seems to work IF the network outage between the NFS 
client(s) and the NFS server is short (a couple of seconds):

* Grace Time 90s, Lease Time 60s
* lockKeepAlivePeriod=5000
* lockAcquireSleepInterval=15000

In this case the master broker is able to renew the lock and continue operating 
and the slave broker fails to get the lock.

However, if the network outage is significantly longer (tested with various 
durations between 60 and 300 seconds), both brokers are able to get the lock on 
the file and start working simultaneously, even though the master broker's 
dmesg shows the following message after the outage:

"NFS: nfs4_reclaim_open_state: Lock reclaim failed!"

It seems that this happens if the lock reclaiming (keepAlive?) operation on the 
master broker is blocked long enough for the NFS server lease timeout to pass. 
In this case the slave is able to claim the lock (if its NFS filesystem stops 
blocking earlier), and after the master stops blocking it continues to operate 
even though its NFS client reports "Lock reclaim failed!".

It also seems that the time it takes for an individual NFS client to recover 
from blocking I/O varies between clients.

> Shared Filesystem Master/Slave using NFSv4 allows both brokers become active 
> at the same time
> ---------------------------------------------------------------------------------------------
>
>                 Key: AMQ-5549
>                 URL: https://issues.apache.org/jira/browse/AMQ-5549
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Broker, Message Store
>    Affects Versions: 5.10.1
>         Environment: - CentOS Linux 6
> - OpenJDK 1.7
> - ActiveMQ 5.10.1
>            Reporter: Heikki Manninen
>            Priority: Critical
>
> Identical ActiveMQ master and slave brokers are installed on CentOS Linux 6 
> virtual machines. There is a third virtual machine (also CentOS 6) providing 
> an NFSv4 share for the brokers' KahaDB.
> Both brokers are started; the master broker acquires the file lock on the 
> lock file and the slave broker sits in a loop and waits for the lock, as 
> expected. Changing brokers also works as expected.
> Once the network connection of the NFS server is disconnected, both master 
> and slave NFS mounts block and the slave broker stops logging file lock 
> retries. Shortly after the network connection is brought back, the mounts 
> come back and the slave broker is able to acquire the lock simultaneously. 
> Both brokers accept client connections.
> In this situation it is also possible to stop and start both individual 
> brokers many times and they are always able to acquire the lock even if the 
> other one is already running. Only after stopping both brokers and starting 
> them again is the situation back to normal.
> * NFS server:
> ** CentOS Linux 6
> ** NFS v4 export options: rw,sync
> ** NFS v4 grace time 45 seconds
> ** NFS v4 lease time 10 seconds
> * NFS client:
> ** CentOS Linux 6
> ** NFS mount options: nfsvers=4,proto=tcp,hard,wsize=65536,rsize=65536
> * ActiveMQ configuration (otherwise default):
> {code:xml}
>         <persistenceAdapter>
>             <kahaDB directory="${activemq.data}/kahadb">
>               <locker>
>                 <shared-file-locker lockAcquireSleepInterval="1000"/>
>               </locker>
>             </kahaDB>
>         </persistenceAdapter>
> {code}



