On 03/19/10 08:31, John Baldwin wrote:
On Friday 19 March 2010 7:34:23 am Steve Polyack wrote:
Hi, we use a FreeBSD 8-STABLE (from shortly after release) system as an
NFS server to provide user home directories which get mounted across a
few machines (all 6.3-RELEASE).  For the past few weeks we have been
running into problems where one particular client will go into an
infinite loop where it is repeatedly trying to write data which causes
the NFS server to return "reply ok 40 write ERROR: Input/output error
PRE: POST:".  This retry loop can cause between 20 Mbps and 500 Mbps of
constant traffic on our network, depending on the size of the data
associated with the failed write.

We spent some time on the issue and determined that something on one of
the clients is deleting a file as it is being written to by another NFS
client.  We were able to enable the NFS lockmgr and use lockf(1) to fix
most of these conditions, and the frequency of this problem has dropped
from once a night to once a week.  However, it's still a problem and we
can't necessarily force all of our users to "play nice" and use lockf/flock.
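
For readers who want to see what "playing nice" looks like inside a program rather than through lockf(1), here is a minimal sketch in Python. It is not taken from the original setup: the helper names (locked, append_line, remove_file) and the ".lock" sidecar file convention are assumptions made for illustration. It also assumes that fcntl-style locks on the NFS mount are forwarded to the server's lock manager, and it only helps if every writer and every deleter takes the same lock.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def locked(path):
    """Hold an exclusive advisory lock tied to `path` for the duration of the block.

    The lock lives on a sidecar file (path + ".lock", a hypothetical convention)
    so that the data file itself can be deleted while the lock is held.
    """
    fd = os.open(path + ".lock", os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            yield
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def append_line(path, text):
    """Append one line to `path` while holding the lock."""
    with locked(path):
        with open(path, "a") as f:
            f.write(text + "\n")


def remove_file(path):
    """Delete `path` while holding the lock; a missing file is not an error."""
    with locked(path):
        try:
            os.remove(path)
        except FileNotFoundError:
            pass
```

A writer would call append_line("/home/user/out.log", "data") and a cleanup job would call remove_file on the same path; since both go through the same lock, the delete cannot land in the middle of a write. The sidecar lock file is deliberately left in place after the data file is removed. Like lockf(1), this is purely advisory, so a program that skips the lock can still trigger the race.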

Has anyone seen this before?  No errors are being logged on the NFS
server itself, but the "Server Ret-Failed" counter begins to increase
rapidly whenever a client gets stuck in this infinite retry loop:
Server Ret-Failed
          224768961

I have a feeling that using NFS in such a manner may simply be prone to
such problems, but what confuses me is why the NFS client system is
infinitely retrying the write operation and causing itself so much grief.

Yes, your feeling is correct.  This sort of race is inherent to NFS if you do
not use some sort of locking protocol to resolve the race.  The infinite
retries sound like a client-side issue.  Have you been able to try a newer OS
version on a client to see if it still causes the same behavior?

I can't try a newer FreeBSD version on the client where we are seeing the problems, but I can recreate the problem fairly easily. Perhaps I'll try it with an 8.0 client. If I remember correctly, one of the strange things is that it doesn't seem to hit "critical mass" until a few hours after the operation first fails. I may be wrong, but I'll double-check that when I test against 8.0-RELEASE.

I forgot to add this in the first post, but these are all TCP NFS v3 mounts.

Thanks for the response.

_______________________________________________
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questions-unsubscr...@freebsd.org"
