Hi

I have encountered a bug with the NetBSD NFS client. Despite a mount with
-o intr,soft, we can hit a situation where a process remains hung in the
kernel because the NFS server is gone.

This happens when ioflush does its duty, with the following code path:
sync_fsync / nfs_sync / VOP_FSYNC / nfs_fsync / nfs_flush / VOP_PUTPAGES

VOP_PUTPAGES is called with flags = PGO_ALLPAGES|PGO_FREE. It then goes
through genfs_putpages and genfs_do_putpages, and gets stuck in:

        /* Wait for output to complete. */
        if (!wasclean && !async && vp->v_numoutput != 0) {
                while (vp->v_numoutput != 0)
                        cv_wait(&vp->v_cv, slock);
        }

This cv_wait() has no timeout and is uninterruptible. ioflush will
sleep there forever, holding the vnode lock. Any other process doing
I/O on the filesystem will sleep in tstile waiting for the vnode
lock, through this path:
sys_write / dofilewrite / vn_write / vn_lock / VOP_LOCK / rw_enter

This is another timeout-less and uninterruptible wait, this time for the
vnode lock, which means -o intr,soft is not honoured. If the NFS
server does not come back, the only way out is reboot -n. Even
umount -f -R will hang in tstile.

How can we fix it? 

1) ioflush should not sleep forever awaiting I/O completion on
an NFS mount if it was mounted with -o soft. A PGO_SOFT
flag could be added to VOP_PUTPAGES so that cv_timedwait() is used
instead of cv_wait(), but how can we get the timeout? Should we introduce
a VOP_PUTPAGES2 with an additional argument? Use a sane default? Get it
from the filesystem using a new VFS_GETTIMEOUT method? (or a more general
VFS_GETMNTINFO, which would be able to query different kinds of information).

2) Honouring -o intr seems to require either the introduction of a 
real nfs_lock (currently it is genfs_lock), or a change to genfs_lock.

The goal is to create an interruptible sleep for vp->v_lock. How can
this be achieved? We have no rw_(try)enter_sig, should we introduce it?
Or should we loop in an interruptible sleep, retrying at regular
intervals? And how could the -o soft timeout be honoured here?

Last question: is there any hope of getting this fixed in netbsd-7, or has
the VFS interface changed too much?

-- 
Emmanuel Dreyfus
m...@netbsd.org
