On Fri, Nov 06, 2015 at 07:47:35AM -0500, Soumya Koduri wrote:
> Hi,
> 
> In a 2-node nfs-ganesha cluster setup, we have noticed that after a
> couple of iterations of failover & failback of the IP between those
> nodes, client I/O gets stuck. We have observed this in RHEL 7.1
> environments (not sure about RHEL 6). While debugging I see that the
> node which takes over the Virtual IP (after a couple of iterations)
> doesn't respond to (acknowledge) the client's TCP SYN packet.
> 
> I found a couple of discussions about it in a few forums and tried
> tuning certain TCP parameters (tcp_timestamps, tcp_window_scaling) as
> mentioned there, but it did not work. The current work-arounds we
> are left with (to resume the I/O) are either to
> * restart the nfs-ganesha service on the node which has taken over the
> IP, to clear the existing established TCP connections, or
> * fail back the IP by bringing the original node back online to resume
> the I/O.
> 
> Any ideas on what could have been the reason for no TCP ACK being
> sent in response to a TCP SYN packet arriving on an existing connection
> in ESTABLISHED state? Any pointers on how to fix that?

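CTDB has a function to "tickle" connections: after taking over an IP, it
sends ACKs to the clients of that IP so they notice quickly that their
connections are stale. This facilitates a faster fail-over if the client
does not detect on its own that it needs to reconnect. We possibly need
to do something like this for pacemaker/ganesha too.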

Some details about it can be found here:

  https://github.com/samba-team/samba/commit/a104d1d8237979ae9fc5dd332e6624c0392be1d0
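As I understand the CTDB approach, a tickle is just a bare TCP ACK with
bogus sequence/ack numbers (0/0), sent from the taken-over IP to the
client. The client answers with an ACK carrying its real sequence
numbers; the new node has no socket for that 4-tuple, so its kernel
replies with RST and the client reconnects right away. Below is a rough
sketch in Python that only builds such a TCP header (with a correct
checksum). It does not send anything, since that needs a raw socket and
root. The function names, the window value and the addresses are
made up for illustration and are not CTDB's or pacemaker's code.

```python
import socket
import struct


def checksum16(data: bytes) -> int:
    """Internet checksum (RFC 1071): one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_tickle_ack(src_ip: str, src_port: int,
                     dst_ip: str, dst_port: int) -> bytes:
    """Hypothetical helper: 20-byte TCP header, ACK flag only, seq = ack = 0.

    src_* is the virtual IP/port of the server, dst_* is the client.
    """
    seq, ack = 0, 0          # deliberately bogus, provokes the client's ACK
    data_offset = 5 << 4     # header length: 5 x 32-bit words, no options
    flags = 0x10             # ACK only
    window = 1234            # arbitrary value for this sketch
    header = struct.pack("!HHIIBBHHH", src_port, dst_port, seq, ack,
                         data_offset, flags, window, 0, 0)
    # The TCP checksum covers an IPv4 pseudo-header plus the TCP segment.
    pseudo = struct.pack("!4s4sBBH", socket.inet_aton(src_ip),
                         socket.inet_aton(dst_ip), 0,
                         socket.IPPROTO_TCP, len(header))
    csum = checksum16(pseudo + header)
    # The checksum field sits at byte offset 16 of the TCP header.
    return header[:16] + struct.pack("!H", csum) + header[18:]
```

For example, build_tickle_ack("192.0.2.10", 2049, "198.51.100.7", 876)
gives the header for a tickle from an NFS server VIP to a client that
used source port 876; it would still need an IPv4 header in front of it
before it could be sent out.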

Pacemaker does seem to have something like that too (the portblock
resource agent), but documentation about it is difficult to find:

  http://linux-ha.org/doc/man-pages/re-ra-portblock.html

HTH,
Niels

------------------------------------------------------------------------------
_______________________________________________
Nfs-ganesha-devel mailing list
https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel
