Re: 13-stable NFS server hang

Rick Macklem Wed, 28 Feb 2024 16:05:11 -0800

On Tue, Feb 27, 2024 at 9:30 PM Garrett Wollman <wollman@bimajorityorg> wrote:
>
> Hi, all,
>
> We've had some complaints of NFS hanging at unpredictable intervals.
> Our NFS servers are running a 13-stable from last December, and
> tonight I sat in front of the monitor watching `nfsstat -dW`.  I was
> able to clearly see that there were periods when NFS activity would
> drop *instantly* from 30,000 ops/s to flat zero, which would last
> for about 25 seconds before resuming exactly as it was before.
>
> I wrote a little awk script to watch for this happening and run
> `procstat -k` on the nfsd process, and I saw that all but two of the
> service threads were idle.  The three nfsd threads that had non-idle
> kstacks were:
>
>   PID    TID COMM                TDNAME              KSTACK
>   997 108481 nfsd                nfsd: master        mi_switch 
> sleepq_timedwait _sleep nfsv4_lock nfsrvd_dorpc nfssvc_program 
> svc_run_internal svc_run nfsrvd_nfsd nfssvc_nfsd sys_nfssvc amd64_syscall 
> fast_syscall_common
>   997 960918 nfsd                nfsd: service       mi_switch 
> sleepq_timedwait _sleep nfsv4_lock nfsrv_setclient nfsrvd_exchangeid 
> nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit 
> fork_trampoline
>   997 962232 nfsd                nfsd: service       mi_switch _cv_wait 
> txg_wait_synced_impl txg_wait_synced dmu_offset_next zfs_holey 
> zfs_freebsd_ioctl vn_generic_copy_file_range vop_stdcopy_file_range 
> VOP_COPY_FILE_RANGE vn_copy_file_range nfsrvd_copy_file_range nfsrvd_dorpc 
> nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
>
> I'm suspicious of two things: first, the copy_file_range RPC; second,
> the "master" nfsd thread is actually servicing an RPC which requires
> obtaining a lock.  The "master" getting stuck while performing client
> RPCs is, I believe, the reason NFS service grinds to a halt when a
> client tries to write into a near-full filesystem, so this problem
> would be more evidence that the dispatching function should not be
> mixed with actual operations.  I don't know what the clients are
> doing, but is it possible that nfsrvd_copy_file_range is holding a
> lock that is needed by one or both of the other two threads?
>
> Near-term I could change nfsrvd_copy_file_range to just
> unconditionally return NFSERR_NOTSUP and force the clients to fall
> back, but I figured I would ask if anyone else has seen this.

I have attached a little patch that should limit the server's Copy size
to vfs.nfsd.maxcopyrange (default of 10 Mbytes).
Hopefully this ensures that a single Copy does not take too long.
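
For illustration only (this is not the attached patch, whose contents are
not reproduced in this message): a minimal sketch of the general idea,
clamping the length of one Copy to a tunable maximum. The names
clamp_copy_len and MAXCOPYRANGE are hypothetical, and the sketch assumes
the client reissues Copy for whatever remains after a short reply.

```c
#include <stdint.h>

/* Hypothetical cap, mirroring the 10 Mbyte default mentioned above. */
#define MAXCOPYRANGE	((uint64_t)10 * 1024 * 1024)

/*
 * Return the number of bytes to copy for one request: the requested
 * length, but never more than maxcopy.
 */
static uint64_t
clamp_copy_len(uint64_t requested, uint64_t maxcopy)
{
	return (requested > maxcopy ? maxcopy : requested);
}
```

With this, a 1 Gbyte request would be served as a 10 Mbyte copy, so each
RPC does a bounded amount of work before replying.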


You could try this instead of disabling Copy. It would be nice to know if
this is sufficient. (If not, I'll probably add a sysctl to disable Copy.)

rick

>
> -GAWollman
>
>

copylen.patch
Description: Binary data