On 18.02.16 at 12:14, Michael Rasmussen wrote:
On Thu, 18 Feb 2016 07:13:36 +0100
Stephan Budach <stephan.bud...@jvm.de> wrote:

So, when I issue a simple ls -l on the folder of the vdisks while the
switchover is happening, the command sometimes completes in 18 to 20 seconds,
but sometimes ls will just sit there for minutes.

This is a known limitation in NFS. NFS was never intended to be
clustered, so what you experience is that the NFS process on the client
side keeps kernel locks for the now-unavailable NFS server, and any
request to the process hangs waiting for those locks to be resolved.
This can be compared to a situation where you hot-swap a drive in the
pool without notifying the pool.

The only way to resolve this is to forcefully kill all NFS client processes
and then restart the NFS client.



This is not the main issue, as this is not a clustered NFS setup, it's a failover one. Of course the client will have to reset its connection, but it seems that the NFS client is doing just that once the NFS share becomes available on the failover host. Looking at the tcpdump, I found that failing over from the primary NFS server to the secondary works straight away: the service only stalls for a few seconds more than RSF-1 needs to switch over the ZPOOL and the vip.

In my tests it was always the switchback that caused these issues. Looking at the tcpdump, I noticed that when the switchback occurred, the dump was swamped with DUP! ACKs. This indicated to me that the still-running NFS server on the primary was still sending some outstanding ACKs to the now returned client, which vigorously denied them. I think the outcome is that the server finally gave up sending those ACKs and the connection then resumed. So, what helped in this case was to restart the NFS server on the primary after the ZPOOL had been switched over to the secondary…
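For anyone who wants to reproduce the observation, a minimal capture sketch along these lines is what I would run against the service VIP during the switchback. The interface name, VIP and output file below are placeholders, not values from my setup:

#!/usr/bin/env python
"""Sketch: capture NFS traffic against the service VIP during a switchback,
so the duplicate ACKs can be inspected afterwards in the resulting pcap.
Interface, VIP and output file are placeholders."""

import subprocess

IFACE = "e1000g0"            # placeholder interface name
VIP = "192.0.2.10"           # placeholder NFS service VIP
OUTFILE = "switchback.pcap"  # placeholder capture file

# tcpdump: -n no name resolution, -s 0 full packets, -w write a pcap,
# filter on the VIP and the NFS port (2049).
subprocess.call([
    "tcpdump", "-i", IFACE, "-n", "-s", "0", "-w", OUTFILE,
    "host", VIP, "and", "port", "2049",
])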

I have just tried waiting at least 5 minutes before failing back from the secondary to the primary node, and this time it went as smoothly as it did when I initially failed over from the primary to the secondary. However, I think for sanity's sake the RSF-1 agent should also restart the NFS server on the host it just moved the ZPOOL away from.
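Something like the following could serve as that step on the node which has just released the pool. This is only a sketch under a couple of assumptions: the node runs the stock SMF-managed NFS server, and RSF-1 can be told to run an extra script after moving a service away (how it would be hooked into RSF-1 is left open here):

#!/usr/bin/env python
"""Hypothetical post-release step: bounce the NFS server on the node that
just gave up the ZPOOL, so it drops its stale client TCP state.
Assumes an illumos/OmniOS box with the SMF-managed NFS server."""

import subprocess
import sys

NFS_SERVER_FMRI = "svc:/network/nfs/server:default"

def restart_nfs_server():
    # Only act if the NFS server is actually online on this node.
    state = subprocess.check_output(
        ["/usr/bin/svcs", "-H", "-o", "state", NFS_SERVER_FMRI]
    ).decode().strip()
    if state != "online":
        print("nfs/server is %s, nothing to do" % state)
        return 0
    # Restarting drops the existing client connections, so the old primary
    # stops retransmitting to clients that have moved to the other node.
    return subprocess.call(["/usr/sbin/svcadm", "restart", NFS_SERVER_FMRI])

if __name__ == "__main__":
    sys.exit(restart_nfs_server())

Restarting nfs/server is heavy-handed, but it reliably gets rid of the stale sessions the old primary otherwise keeps retransmitting on.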

So, as far as I am concerned, this issue is resolved. Thanks, everybody, for chiming in on this.

Cheers,
Stephan