Dear list (sorry for the rather long e-mail),

I'm looking for someone who has successfully implemented the "exportfs" RA with 
NFSv4 over TCP (and is willing to share some information).

The final goal is to present NFS datastores to ESXi through two "head" nodes. Both 
nodes must be active in the sense that each runs an NFS server, but they export 
different file systems (via exportfs resources and floating IPaddr2 addresses).

When moving an export to another node, we move the entire 
"filesystem/export/ipaddr" stack, but we keep the NFS server running (as it may 
still be exporting other file systems via other IPs).

Both nodes share disks (a JBOD for the physical setup, shared VMDKs for testing). 
A disk is only accessed by a single "head" node at any given time, so a 
clustered file system is not required.

To my knowledge, this setup is best described by Florian Haas here:
https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha_techguides/book_sleha_techguides.html
(except that we're not using DRBD and LVM).

Before going into more detail, I should mention that I have already read the 
following posts and examples, as well as many of the NFS-related questions on 
this list over the past year or so:

http://wiki.linux-nfs.org/wiki/index.php/Nfsd4_server_recovery
http://wiki.linux-nfs.org/wiki/index.php/NFS_Recovery_and_Client_Migration
http://oss.clusterlabs.org/pipermail/pacemaker/2011-July/011000.html
https://access.redhat.com/solutions/42868

I'm forced to use TCP because of ESXi, and I'm willing to use NFSv4 because ESXi 
can use "session trunking" (a sort of multipathing) with version 4 (not tested 
yet).

The problem I see is the one a lot of people have already mentioned: failover 
works nicely, but failback takes a very long time. Many posts mention putting 
/var/lib/nfs on a shared disk, but this only makes sense when failing over an 
entire NFS server (as opposed to just the exports). Moreover, I don't see any 
relevant information written to /var/lib/nfs when a single Linux NFSv4 client 
mounts a folder.

The NFSv4 lease and grace times have been reduced to 10 seconds, and I'm using 
the exportfs RA parameter "wait_for_leasetime_on_stop=true".
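
For reference, the lease and grace times are reduced roughly like this before 
nfsd starts (treat this as a sketch; the /proc paths are what I believe recent 
kernels expose and may differ per distribution/kernel):

    # run before the NFS server is started; the nfsd filesystem must be
    # mounted on /proc/fs/nfsd (nfs-utils normally takes care of that)
    echo 10 > /proc/fs/nfsd/nfsv4leasetime
    echo 10 > /proc/fs/nfsd/nfsv4gracetime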

From my investigation, the problem actually happens at the TCP level. Let's 
describe the most basic scenario, i.e. a single file system moving from node1 to 
node2 and back.

I first start the NFS servers using a clone resource. Node1 then starts a group 
that mounts a file system, adds it to the export list (exportfs RA) and adds a 
floating IP.
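
For completeness, the configuration looks roughly like this (crm shell syntax; 
all names, devices and addresses are illustrative, not our exact values):

    # the NFS server itself runs everywhere (the exact service/RA used here
    # is illustrative; we only move exports and IPs, never the server)
    primitive p_nfsserver systemd:nfs-server \
        op monitor interval="30s"
    clone cl_nfsserver p_nfsserver

    # one "filesystem/export/ipaddr" stack, moved as a whole
    primitive p_fs_data1 ocf:heartbeat:Filesystem \
        params device="/dev/sdb1" directory="/srv/data1" fstype="xfs"
    primitive p_exportfs_data1 ocf:heartbeat:exportfs \
        params directory="/srv/data1" fsid="1" \
              clientspec="10.0.0.0/24" options="rw,no_root_squash" \
              wait_for_leasetime_on_stop="true"
    primitive p_ip_data1 ocf:heartbeat:IPaddr2 \
        params ip="10.0.0.100" cidr_netmask="24"
    group g_data1 p_fs_data1 p_exportfs_data1 p_ip_data1
    order o_nfs_before_data1 inf: cl_nfsserver g_data1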

I then mount this folder from a Linux NFS client.
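
The client mount is nothing special, something along these lines (the export 
path naturally depends on how the NFSv4 pseudo-root is set up):

    mount -t nfs4 -o proto=tcp 10.0.0.100:/srv/data1 /mnt/data1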

When I "migrate" my group out of node1, everything correctly moves to node2. 
IPAddr2:stop, then the exportfs "stop" action takes about 12 seconds (10 
seconds LEASE time plus the rest) and my file system gets unmounted. During 
that time, I see the NFS client trying to talk to the floating IP (on its node1 
MAC address). Once everything has moved to node2, the client sends TCP packets 
to the new MAC address and node2 replies with a TCP RESET. At this point, the 
client restarts a NEW TCP session and it works fine.
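
For the record, the migration and the client-side capture are nothing more 
elaborate than this (interface, IP and group name being examples):

    crm resource migrate g_data1 node2
    # on the client, watch the conversation with the floating IP
    tcpdump -n -i eth0 host 10.0.0.100 and port 2049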

However, on node1 I can still see an ESTABLISHED TCP session between the 
client and the floating IP on port 2049 (NFS), even though the IP is gone. 
After a short time, the session moves to FIN_WAIT1 and stays there for a while.
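
I'm checking this on node1 with something like:

    # connections still using the NFS port after the IP has moved away
    ss -tan state established '( sport = :2049 )'
    ss -tan state fin-wait-1  '( sport = :2049 )'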

When I then "unmigrate" my group to node1 I see the same behavior except that 
node1 is *not* sending TCP RESETS because it still has a TCP session with the 
client. I imagine that the sequence numbers do not match so node1 simply 
doesn't reply at all. It then takes several minutes for the client to give up 
and restart a new NFS session.

Does anyone have an idea about how to handle this problem? I have done this 
with iSCSI, where we can explicitly "kill" sessions, but I don't think NFS has 
anything similar. I also don't see anything in the IPaddr2 RA that would help 
in killing TCP sessions while removing a floating IP.
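
What I have in mind is something the stop action (or a small companion resource) 
could run right after removing the IP, along the lines of the sketch below. This 
is untested, and as far as I know it needs a kernel and iproute2 recent enough to 
support destroying sockets; otherwise something like tcpkill would be needed:

    # untested idea: forcibly close the server-side NFS sockets that were
    # bound to the floating IP we just removed (address is an example)
    ss -K src 10.0.0.100 '( sport = :2049 )'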

My next ideas would be either to tune the TCP stack in order to shorten the 
FIN_WAIT1 state, or to synchronize connection state between the nodes (using 
conntrackd). Both just seem like overkill.
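
For the first idea, I was thinking of something like the following (untested; as 
far as I understand, the FIN_WAIT1 lifetime of orphaned sockets is bounded by the 
number of FIN retransmissions rather than by tcp_fin_timeout, which only covers 
FIN_WAIT2):

    # untested: make orphaned sockets give up on FIN retransmissions sooner
    sysctl -w net.ipv4.tcp_orphan_retries=2
    # shorten FIN_WAIT2 as well, probably less relevant here
    sysctl -w net.ipv4.tcp_fin_timeout=10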

Thanks for any input! Patrick


