We are constantly seeing data corruption when writing (different) files
from many clients (~100) to a single ZFS pool that is NFSv4-mounted from
one machine. Occasionally (roughly 1 in a few tens of thousands of writes)
a file has a few ASCII 0 (NUL) bytes in sequence in the middle. Tested
with SXDE 1/08 (79b) and the recent OpenSolaris dev 06.09 (111a); the host
is an X4600 M2 with a 2x146GB SAS disk stripe (just testing), and the
clients are Ultra20 M1/M2, X4600, and Blade X8420. This was tested with a
script (writing ASCII zeros to files to generate load):

#!/bin/sh
WORK=/myworkdir  # NFS-mounted directory somewhere
echo Begin: `date`
for i in `seq 8192`; do
      # write 4 x 512-byte blocks of zeros; dd reports its record counts on stderr
      dd if=/dev/zero of=$WORK/test_nfs.${HOSTNAME}_$$ obs=1b count=4
done
echo Done: `date`

and submitted via SGE (Grid Engine) as 120-150 jobs, about 100 of them
running at the same time. The log output of the script also goes to that
NFS-mounted directory (a run takes about 2 hours on a switched 1 GbE fiber
network).
The problems appear in the log output (or in the data files of our other
applications). In addition, the logs do not always record all 8192 dd
invocations; often 2 or 3 are missing, although the jobs ran without
failure. None of the servers involved in generating these errors shows
anything in messages, fmdump, or syslog, and netstat -i reports no errors.
nfsstat for the v4 server is completely clean.
We also see this problem every week with other applications on different
servers, which is awkward to repair when running production jobs.

The NFS settings in /etc/default/nfs have been changed to cope with the
higher load:
NFSD_SERVERS=256
NFSD_LISTEN_BACKLOG=256
LOCKD_LISTEN_BACKLOG=256
LOCKD_SERVERS=256
NFS_SERVER_DELEGATION=off

So:
1. ASCII 0 values are inserted under high NFS load
2. Some output lines are completely lost (drops?)
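
For anyone trying to reproduce this, the two symptoms above can be checked
for mechanically. The following is only a sketch, not part of the original
test setup: the function names are made up for illustration, and it assumes
that each job log contains dd's usual "records out" line once per dd
invocation.

```sh
#!/bin/sh
# Hypothetical helpers for scanning job logs; names are illustrative only.

# Print the number of NUL (ASCII 0) bytes in a file.
count_nul_bytes() {
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Print the number of dd completion records ("... records out") in a log.
# NUL bytes are stripped first so that grep sees plain text.
count_dd_records() {
    tr -d '\000' < "$1" | grep -c 'records out' || true
}

# Usage: check_logs EXPECTED_RECORDS LOGFILE...
# Prints one line for every log that contains NUL bytes or that has a
# different number of dd records than expected; clean logs print nothing.
check_logs() {
    expected=$1
    shift
    for log in "$@"; do
        nul=$(count_nul_bytes "$log")
        rec=$(count_dd_records "$log")
        if [ "$nul" -ne 0 ] || [ "$rec" -ne "$expected" ]; then
            echo "$log: nul_bytes=$nul dd_records=$rec (expected $expected)"
        fi
    done
}
```

With the functions loaded, something like check_logs 8192 followed by the
list of job log files would flag the affected logs; where those logs live
and how they are named depends on the local SGE setup.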

Any clues as to what's wrong here? My understanding of NFS semantics is
that at least 1. should never happen on clean hardware, regardless of
whether there are timeouts, drops, or whatever. It looks to me like this
is a hidden bug somewhere in NFS or ZFS that is only visible under
constant high load.
-- 
This message posted from opensolaris.org
