We got a new NFS server at work over the weekend, and it's been a bit flaky and causing my machines to hang left and right. In the course of poking about with ddb and crash I've found no less than three problems:

(1) The syncer holds syncer_mutex while calling VOP_FSYNC. If VOP_FSYNC goes off to pick daisies (perhaps for one of the reasons below, perhaps something else), syncer_mutex remains locked forever. Since unmounting requires syncer_mutex, any attempt to unmount anything, including umount -f on the offending volume, hangs forever in an uninterruptible sleep on "tstile". Note that while the syncer is not in an uninterruptible sleep (at least if you've done your nfs mounts correctly) to the best of my knowledge there's no way to interrupt or send signals to a kernel thread. (Also note that doing I/O while holding a mutex is contrary to rmind's repeated assertion that mutexes aren't supposed to be held for long periods of time.)

(2) nfs_receive contains a loop (actually three copies that are slightly different depending on the socket type) that continues calling so->so_receive as long as the result is EWOULDBLOCK, i.e., as long as the receive attempt timed out, i.e., forever. The proper way out of this loop on a server failure is that nfs_timer() sets R_SOFTTERM on the nfsreq structure. However, this is apparently not working. I don't know why yet.

(3) nfs_rcvlock contains an infinite loop waiting on nfs_rcvcv; while it uses cv_timedwait, the timeout logic seems to be a bizarre way of attempting to allow interruptible sleeps and does nothing to prevent the loop from looping forever. It is definitely possible for processes to get stuck in here because I've seen it. However, since it appears that this logic is a handrolled lock using a condvar (the existence of which is itself a bug), perhaps getting stuck here is actually a deadlock condition and a symptom of something else hanging elsewhere?

-- 
David A. Holland
dholl...@netbsd.org