Hello,

I've been trying to find the answer to my server's NFS problems for about a month now, but the searching I have done on mailing list archives and HOWTOs so far hasn't helped. I don't think the two problems are related except that they both deal with NFS.
SITUATION:

I run a variety of Debian 2.1 and Debian 2.2 servers on my network:

Host                Debian  Kernel       rpc.nfsd   automount
file-server-one     2.1     2.2.14       2.2beta37  3.1.1
file-server-two     2.1     2.2.14       2.2beta37  3.1.1
file-server-three   2.1     2.2.14       2.2beta37  3.1.1
cpu-server-one      2.2     2.2.15pre20  2.2beta47  3.1.4
cpu-server-two      2.2     2.2.15pre20  2.2beta47  3.1.4
cpu-server-three    2.2     2.2.15pre20  2.2beta47  3.1.4
cpu-server-four     2.2     2.2.15pre20  2.2beta47  3.1.4

As you can see, the newer servers are used for running CPU-intensive jobs. They use their local hard drives to store the intermediate data created during such batch jobs, and that data is also used by the other cpu-servers. The original data set that the batch jobs use as input is stored on file-server-three. File-servers one and two hold the users' individual home directories.

All machines use NFS to export their data directories. Mounts that point to file-server-one and file-server-two are automounted (via the autofs package) under /h, whilst the other machines' exported filesystems are mounted under /d (eg /d/cpu-server-one-data1). All of this mount information is distributed via NIS, as is the user/group info.

PROBLEM ONE - NFS STABILITY PROBLEM:

Occasionally my users who are running these batch jobs on the cpu-servers find that their job grinds to a halt because file-server-three stops responding to all of the machines. For example, when file-server-three has stopped responding, running ls -l /d/file-server-three-data1 from the shell hangs indefinitely. To get it going again, I run "/etc/init.d/nfs-server restart" as root on file-server-three, and the batch jobs then continue on their merry way as that mount point starts responding again.
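In case it helps anyone trying to reproduce this, the hang can be checked for without tying up a shell. What follows is only a minimal sketch, assuming Python is available on the client; the function name mount_responds and the 5 second timeout are hypothetical choices made for illustration, not part of any NFS tool.

```python
import os
import threading


def mount_responds(path, timeout=5.0, probe=os.stat):
    """Return True only if probe(path) succeeds within `timeout` seconds.

    `probe` defaults to os.stat; it is a parameter purely so the check
    can be exercised without a real hung mount (an assumption of this sketch).
    """
    result = {"ok": False}

    def worker():
        try:
            probe(path)
            result["ok"] = True
        except OSError:
            # The call returned, but with an error (e.g. a missing path).
            result["ok"] = False

    # A thread blocked on a hung NFS mount cannot be cancelled, so it is
    # started as a daemon thread and simply left behind if it never returns.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        return False  # still blocked: treat the mount as not responding
    return result["ok"]
```

Something like mount_responds("/d/file-server-three-data1") could then be called from a cron job on a cpu-server to log when the mount stops answering. It only detects the hang; it does nothing to explain or fix it.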
I've included info about file-servers one and two as a comparison. These servers make available the users' home directories (eg file-server-one:/home/scottb is mounted under /h/scottb on all of the Linux servers on the network). There doesn't seem to be any stability problem with these servers at all, even though they make their data directories available to many more clients, either machines running Linux as a server or workstation, or win32 clients via samba. Additionally, the cpu-servers' exported filesystems don't suffer the same stability problems either. They share the intermediate results of the batch jobs, eg /d/cpu-server-one-data1 (cpu-server-one:/data1) is available on the other cpu-servers.

On the problematic file-server-three machine, I've tried to upgrade the nfs-server deb package but found I would also need to upgrade libraries on the server, so I have been reluctant to do so. For now I have lowered the rsize and wsize options that mount uses when mounting the drive on the other machines (from 8192 to 4096). Although the number of incidents where the drive stops responding has fallen a little, the problem has not gone away.

PROBLEM TWO - FILE OWNERSHIP FOR NFS MOUNTED FILE SHARES

This problem only affects the four cpu-servers. On each of these machines, I have an /etc/exports file that looks like:

    /data1 192.168.0.1/255.255.255.0(rw,root_squash,map_nis=syrinx)

and it is identical on each machine (since the servers are essentially identical except for the serial numbers on the servers themselves :-)

However, when I mount one of the cpu-server-xxx:/data1 filesystems on another machine (eg file-server-one or even another cpu-server-yyy machine), absolutely all of the files belong to "nobody, nogroup", even though if you look at the files on the local machine they belong to proper users of the network (eg scottb, users).
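To put numbers on the ownership symptom, the owners seen under a directory can be tallied once on the exporting machine and once on a client that has it mounted. This is only a rough sketch, assuming Python on both machines; the function name owner_summary is hypothetical, and the script reports what the local system sees without saying anything about why the mapping differs.

```python
import grp
import os
import pwd
from collections import Counter


def owner_summary(root):
    """Count regular directory entries under `root` by (user, group) name."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            # Fall back to the numeric id when no name is known locally.
            try:
                user = pwd.getpwuid(st.st_uid).pw_name
            except KeyError:
                user = str(st.st_uid)
            try:
                group = grp.getgrgid(st.st_gid).gr_name
            except KeyError:
                group = str(st.st_gid)
            counts[(user, group)] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage: pass the directory to summarise, e.g. /data1 locally or
    # /d/cpu-server-one-data1 on a client (paths taken from the post).
    for (user, group), n in sorted(owner_summary(sys.argv[1]).items()):
        print(user, group, n)
```

If the local run on a cpu-server shows a mix of real users and the run over NFS shows only ("nobody", "nogroup"), that matches the behaviour described above.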
Additionally, on the file-server-zzz machines, the /etc/exports file uses exactly the same options (rw,root_squash,map_nis=syrinx), and when their mount points are mounted on other Linux servers/workstations, the files on them show the right ownership. So, although I can share the files stored on the cpu-server machines, they can only ever be read-only at the moment, because the system accessing a file over NFS thinks it is owned by nobody when that is not correct.

I'm hoping that my problem is not isolated and that someone out there has had problems similar to mine and has successfully dealt with them. Any solutions, or even ideas, would be helpful. Apart from these two small issues, I have a great network and file sharing system that needs little maintenance and keeps going for long periods without breaking.

Regards,
Scott Bragg
Senior System Administrator
Syrinx Speech Systems