Hi Lars,

1) wireshark ... really nice tool. wireshark and I are already well on our way to becoming close friends as I try to debug this situation.

2) this is a pure test environment, with everything I can do to keep the setup simple. Therefore no firewall is configured on these systems. (All firewalling is handled outside of my local environment.)

3) I've tried to manipulate "timeo" and "retrans". These are the current test values: timeo=20,retrans=4, which work great with NFSv3 reads over TCP.

4) This is the SLES11 HAE GA release. Kernel is 2.6.27.19-5-default.

5)
> analysing the network dump during a switchover/failover should be enough to
> trouble shoot your issue.

So thought I tooooo. But the best that I've done is to become suspicious about retries after the migration with streamed writes. But retries are a bit of a "duhhhh" ... as in an obvious culprit to the crime, and my manipulations of "timeo" and "retrans" have not solved the issue.

Does anyone have any ideas why NFSv3 over TCP reads should be successful across 100s of migrations and failovers, but writes bomb?

Thanks,
Bob Haxo
SGI

On Wed, 2009-05-20 at 18:39 +0200, Lars Ellenberg wrote:
> On Tue, May 19, 2009 at 03:15:17PM -0700, Bob Haxo wrote:
> > Greetings,
> >
> > I find that streamed writes fail with migration for NFS v3 over TCP.
> > Not every time, but almost every time.
> >
> > Streamed writes continue nicely across many migrations for NFS v3 over
> > UDP.
> >
> > With TCP, writes continue with migration back to the initial server.
> >
> > Does anyone have HA NFS migrations working for NFS over TCP?
> >
> > Suggestions?
>
> tcpdump/tshark dump nfs traffic during a switchover.
> analyse with wireshark.
>
> suspicions:
> timeo= mount option does a retry of failed requests every x seconds.
> maybe it just needs a long time to recognize the failover?
> do you find "NFS server not responding" in the client logs?
>
> connection tracking firewall on "new" server may drop tcp packets
> that do not fit into existing connections,
> so on retry you may run into much longer timeouts.
> if you have a firewall, and you only ACCEPT "new" or "established"
> connections, but DROP everything else, consider to instead REJECT
> with tcp-reset NFS traffic from internal clients that connection
> tracking does not know about.
>
> analysing the network dump during a switchover/failover should be enough
> to trouble shoot your issue.
>
> btw, what kernel you are on?
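
For reference on the "timeo"/"retrans" discussion above: nfs(5) gives timeo in tenths of a second and describes linear backoff for NFS over TCP (the timeout grows by timeo after each retransmission, capped at 600 seconds) and exponential backoff for UDP (the timeout doubles, capped at 60 seconds). The Python below is only a rough sketch of that description, not the kernel's actual logic, and the 2.6.27 client may well behave differently. The function names are made up for illustration, and the assumption that the client waits once for the original request plus once per retry is mine.

```python
def retry_timeouts(timeo_ds, retrans, exponential=False, cap_s=600.0):
    """Per-attempt wait times in seconds under a simplified backoff model.

    Assumption (not taken from kernel source): one wait for the original
    request plus one wait per retry, so retrans + 1 waits in total before
    the client reports a major timeout.
    """
    base = timeo_ds / 10.0  # timeo is given in tenths of a second
    waits = []
    current = base
    for _ in range(retrans + 1):
        waits.append(min(current, cap_s))
        # UDP-style: double each time; TCP-style: grow by timeo each time
        current = current * 2 if exponential else current + base
    return waits


def time_to_major_timeout(timeo_ds, retrans, exponential=False, cap_s=600.0):
    """Total seconds spent waiting before a major timeout in this model."""
    return sum(retry_timeouts(timeo_ds, retrans, exponential, cap_s))


if __name__ == "__main__":
    # Values from this thread: timeo=20,retrans=4
    tcp = retry_timeouts(20, 4, exponential=False, cap_s=600.0)
    udp = retry_timeouts(20, 4, exponential=True, cap_s=60.0)
    print("TCP (linear) waits:", tcp, "total:", sum(tcp))
    print("UDP (exponential) waits:", udp, "total:", sum(udp))
```

If this model is anywhere near right, timeo=20,retrans=4 over TCP means roughly 30 seconds of waiting before a major timeout, which can be compared against how long the migration actually takes in the capture.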
_______________________________________________
Pacemaker mailing list
[email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
