On Thu, Mar 7, 2024 at 7:59 PM Garrett Wollman <woll...@bimajority.org> wrote:
>
> <<On Sat, 2 Mar 2024 23:28:20 -0500, I wrote:
>
> > I believe this explains why vn_copy_file_range sometimes takes much
> > longer than a second: our servers often have lots of data waiting to
> > be written to disk, and if the file being copied was recently modified
> > (and so is dirty), this might take several seconds. I've set
> > vfs.zfs.dmu_offset_next_sync=0 on the server that was hurting the most
> > and am watching to see if we have more freezes.
>
> > If this does the trick, then I can delay deploying a new kernel until
> > April, after my upcoming vacation.
>
> Since zeroing dmu_offset_next_sync, I've seen about 8000 copy
> operations on the problematic server and no NFS work stoppages due to
> the copy. I have observed a few others in a similar posture, where
> one client wants to ExchangeID and is waiting for other requests to
> drain, but nothing long enough to cause a service problem.[1]
>
> I think in general this choice to prefer "accurate" but very slow hole
> detection is a poor choice on the part of the OpenZFS developers, but
> so long as we can disable it, I don't think we need to change anything
> in the NFS server itself.
So the question is... how should this be documented?
In the BUGS section of "man nfsd", maybe?
What do others think?
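
As an aside, for anyone who wants to see the stall in isolation, below is
a minimal userland sketch (mine, just for illustration, not from Garrett's
setup) that times a single lseek(SEEK_HOLE) probe. On ZFS this lands in
the same hole-detection code that vn_copy_file_range() reaches when it
looks for holes, so with vfs.zfs.dmu_offset_next_sync=1 and a file that
has dirty data, the probe can block waiting for a txg sync; with the
sysctl set to 0 it should return almost immediately.

/*
 * Sketch: time one SEEK_HOLE probe on the given file.  With
 * vfs.zfs.dmu_offset_next_sync=1 and dirty data in the file, ZFS
 * may force a txg sync here, so the lseek() can take seconds.
 */
#include <sys/time.h>
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	struct timeval t0, t1;
	off_t off;
	int fd;

	if (argc != 2)
		errx(1, "usage: %s file", argv[0]);
	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "open %s", argv[1]);

	gettimeofday(&t0, NULL);
	off = lseek(fd, 0, SEEK_HOLE);	/* may block on a txg sync */
	gettimeofday(&t1, NULL);
	if (off == -1)
		err(1, "lseek(SEEK_HOLE)");

	printf("first hole at %jd, probe took %.3f s\n", (intmax_t)off,
	    (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
	return (0);
}

Write a few megabytes into a test file and run this right away, before
the data has been synced out, to see the difference between the two
sysctl settings.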
> It would be a good idea longer term to
> figure out a lock-free or synchronization-free way of handling these
> client session accept/teardown operations, because it is still a
> performance degradation, just not disruptive enough for users to
> notice.
Yes, as I've noted, it is on my todo list to take a look at it.

Good sleuthing, rick

>
> -GAWollman
>
> [1] Saw one with a slow nfsrv_readdirplus and another with a bunch of
> threads blocked on an upcall to nfsuserd.