On 18.02.16 at 12:14, Michael Rasmussen wrote:
On Thu, 18 Feb 2016 07:13:36 +0100
Stephan Budach <stephan.bud...@jvm.de> wrote:

So, when I issue a simple ls -l on the folder of the vdisks while the
switchover is happening, the command sometimes completes in 18 to 20 seconds,
but sometimes ls will just sit there for minutes.

This is a known limitation in NFS. NFS was never intended to be
clustered, so what you experience is that the NFS process on the client
side keeps kernel locks for the now-unavailable NFS server, and any
request to the process hangs waiting for those locks to be resolved.
This can be compared to a situation where you hot-swap a drive in the
pool without notifying the pool.

The only way to resolve this is to forcefully kill all NFS client processes
and then restart the NFS client.



This is not the main issue, as this is not a clustered NFS setup, it's a failover one. Of course the client will have to reset its connection, but it seems that the NFS client is doing just that once the NFS share becomes available on the failover host. Looking at the tcpdump, I found that failing over from the primary NFS server to the secondary works straight away: the service only stalls for a few seconds more than RSF-1 needs to switch over the ZPOOL and the vip.

In my tests it was always the switchback that caused these issues. Looking at the tcpdump, I noticed that when the switchback occurred, the dump was swamped with DUP! ACKs. This indicated to me that the still-running NFS server on the primary was still sending some outstanding ACKs to the now returned client, which vigorously denied them. I think the outcome is that the server finally gave up sending those ACKs and the connection then resumed. So, what helped in this case was to restart the NFS server on the primary after the ZPOOL had been switched over to the secondary…
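For anyone who wants to reproduce the observation, a minimal capture sketch along these lines is what I would run against the service VIP during the switchback. The interface name, VIP and output file below are placeholders, not values from my setup:

#!/usr/bin/env python
"""Sketch: capture NFS traffic against the service VIP during a switchback,
so the duplicate ACKs can be inspected afterwards in the resulting pcap.
Interface, VIP and output file are placeholders."""

import subprocess

IFACE = "e1000g0"            # placeholder interface name
VIP = "192.0.2.10"           # placeholder NFS service VIP
OUTFILE = "switchback.pcap"  # placeholder capture file

# tcpdump: -n no name resolution, -s 0 full packets, -w write a pcap,
# filter on the VIP and the NFS port (2049).
subprocess.call([
    "tcpdump", "-i", IFACE, "-n", "-s", "0", "-w", OUTFILE,
    "host", VIP, "and", "port", "2049",
])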

I have just tried waiting at least 5 minutes before failing back from the secondary to the primary node, and this time it went as smoothly as it did when I initially failed over from the primary to the secondary. However, I think for sanity's sake the RSF-1 agent should also restart the NFS server on the host it just moved the ZPOOL away from.
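Something like the following could serve as that step on the node which has just released the pool. This is only a sketch under a couple of assumptions: the node runs the stock SMF-managed NFS server, and RSF-1 can be told to run an extra script after moving a service away (how it would be hooked into RSF-1 is left open here):

#!/usr/bin/env python
"""Hypothetical post-release step: bounce the NFS server on the node that
just gave up the ZPOOL, so it drops its stale client TCP state.
Assumes an illumos/OmniOS box with the SMF-managed NFS server."""

import subprocess
import sys

NFS_SERVER_FMRI = "svc:/network/nfs/server:default"

def restart_nfs_server():
    # Only act if the NFS server is actually online on this node.
    state = subprocess.check_output(
        ["/usr/bin/svcs", "-H", "-o", "state", NFS_SERVER_FMRI]
    ).decode().strip()
    if state != "online":
        print("nfs/server is %s, nothing to do" % state)
        return 0
    # Restarting drops the existing client connections, so the old primary
    # stops retransmitting to clients that have moved to the other node.
    return subprocess.call(["/usr/sbin/svcadm", "restart", NFS_SERVER_FMRI])

if __name__ == "__main__":
    sys.exit(restart_nfs_server())

Restarting nfs/server is heavy-handed, but it reliably gets rid of the stale sessions the old primary otherwise keeps retransmitting on.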

So, as far as I am concerned, this issue is resolved. Thanks, everybody, for chiming in on this.

Cheers,
Stephan