We are constantly experiencing data corruption while writing (different)
files from a lot of clients (~100) to one ZFS pool, mounted via NFS v4 from a
single machine. Occasionally (about once in a few tens of thousands of writes)
a file has a few ASCII 0 (NUL) values in sequence in the middle of the file.
Tested with SXDE 1/08 (79b) and recent OpenSolaris dev 06.09 (111a). The host
is an X4600M2 with a 2x146GB SAS disk stripe (just for testing); the clients
are Ultra20 M1/M2, X4600, and Blade X8420. This has been tested with a script
(writing ASCII zeros to files to generate load)
#!/bin/sh
WORK=/myworkdir   # NFS-mounted dir somewhere
echo Begin: `date`
for i in `seq 8192`; do
dd if=/dev/zero of=$WORK/test_nfs.${HOSTNAME}_$$ obs=1b count=4
done
echo Done: `date`
and submitted via SGE (Grid Engine) as 120-150 jobs, about 100 running at the
same time. The log output of this script also goes to that NFS-mounted
directory (a run takes about 2 hours on a 1 GbE switched fiber network).
The problems appear in the log outputs (or in the data files of our other
applications). The logs also do not always record all 8192 dd instances; often
2 or 3 are missing, although the jobs have run without failure. None of the
servers involved in generating these errors shows anything in messages,
fmdump, or syslog, and there are no errors in netstat -i. nfsstat for
Server v4 is completely clean.
We also see this problem every week with other applications on different
servers, which is awkward to repair when running production jobs.
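A rough sketch of how the log files can be checked for both symptoms. It
assumes a tr that passes NUL bytes through unchanged and that every dd run
logs one "records in" line; the function names count_nul and count_dd_records
are made up for illustration.

```shell
#!/bin/sh
# count_nul FILE: print the number of NUL (ASCII 0) bytes in FILE.
count_nul() {
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# count_dd_records FILE: print how many dd runs were logged in FILE.
# NUL bytes are stripped first so that grep sees plain text.
count_dd_records() {
    tr -d '\000' < "$1" | grep -c 'records in'
}

# Check every log file given on the command line (no-op without arguments).
for f in "$@"; do
    nul=`count_nul "$f"`
    recs=`count_dd_records "$f"`
    echo "$f: $nul NUL bytes, $recs dd records (8192 expected)"
done
```

A clean log should report 0 NUL bytes and 8192 dd records.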
The NFS settings in /etc/default/nfs have been changed to adapt to the higher
load:
NFSD_SERVERS=256
NFSD_LISTEN_BACKLOG=256
LOCKD_LISTEN_BACKLOG=256
LOCKD_SERVERS=256
NFS_SERVER_DELEGATION=off
So:
1. ASCII 0 values are inserted under high NFS load
2. Some output lines are completely lost (drops?)
Any clues as to what's wrong here? My understanding of the semantics of NFS
is that at least 1. should never happen on clean hardware, regardless of
whether there are timeouts or drops or whatever. It looks to me like this is
a hidden bug somewhere in NFS or ZFS that is only visible under constant high
load.
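Since the load script writes zeros, inserted zeros in its data files cannot
be told apart from the payload. A possible variant, only a sketch, writes a
known non-zero pattern and compares it after the write. The function name
write_and_verify, the file naming, and the local reference file under /tmp
are assumptions for illustration. A read-back on the same client may be
served from its cache, so a comparison from a second client would probably be
more meaningful.

```shell
#!/bin/sh
# write_and_verify DIR N: write N files with a known 2048-byte text pattern
# into DIR, compare each against a local reference, and report mismatches.
write_and_verify() {
    dir=$1
    n=$2
    ref="${TMPDIR:-/tmp}/nfs_ref.$$"   # reference copy, assumed to be on local disk
    # 32 lines of 63 characters plus newline = 2048 bytes, no NUL bytes
    awk 'BEGIN { for (i = 0; i < 32; i++)
        print "012345678901234567890123456789012345678901234567890123456789abc" }' > "$ref"
    bad=0
    i=1
    while [ "$i" -le "$n" ]; do
        f="$dir/verify_$$_$i"
        cp "$ref" "$f"
        if cmp -s "$ref" "$f"; then
            rm -f "$f"                 # identical: clean up
        else
            echo "MISMATCH: $f"        # keep the bad file for inspection
            bad=`expr $bad + 1`
        fi
        i=`expr $i + 1`
    done
    rm -f "$ref"
    echo "mismatches: $bad"
}

# Example call (hypothetical path): write_and_verify /myworkdir 8192
```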
--
This message posted from opensolaris.org