On 9/4/23, Alexander Leidinger <alexan...@leidinger.net> wrote:
> Am 2023-08-28 22:33, schrieb Alexander Leidinger:
>> Am 2023-08-22 18:59, schrieb Mateusz Guzik:
>>> On 8/22/23, Alexander Leidinger <alexan...@leidinger.net> wrote:
>>>> Am 2023-08-21 10:53, schrieb Konstantin Belousov:
>>>>> On Mon, Aug 21, 2023 at 08:19:28AM +0200, Alexander Leidinger wrote:
>>>>>> Am 2023-08-20 23:17, schrieb Konstantin Belousov:
>>>>>> > On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote:
>>>>>> > > On 8/20/23, Alexander Leidinger <alexan...@leidinger.net> wrote:
>>>>>> > > > Am 2023-08-20 22:02, schrieb Mateusz Guzik:
>>>>>> > > >> On 8/20/23, Alexander Leidinger <alexan...@leidinger.net> wrote:
>>>>>> > > >>> Am 2023-08-20 19:10, schrieb Mateusz Guzik:
>>>>>> > > >>>> On 8/18/23, Alexander Leidinger <alexan...@leidinger.net> wrote:
>>>>>> > > >>>
>>>>>> > > >>>>> I have a 51MB text file, compressed to about 1MB. Are you
>>>>>> > > >>>>> interested to get it?
>>>>>> > > >>>>>
>>>>>> > > >>>> Your problem is not the vnode limit, but nullfs.
>>>>>> > > >>>>
>>>>>> > > >>>> https://people.freebsd.org/~mjg/netchild-periodic-find.svg
>>>>>> > > >>>
>>>>>> > > >>> 122 nullfs mounts on this system. And every jail I setup has
>>>>>> > > >>> several null mounts. One basesystem mounted into every jail,
>>>>>> > > >>> and then shared ports (packages/distfiles/ccache) across all
>>>>>> > > >>> of them.
>>>>>> > > >>>
>>>>>> > > >>>> First, some of the contention is notorious VI_LOCK in order
>>>>>> > > >>>> to do anything.
>>>>>> > > >>>>
>>>>>> > > >>>> But more importantly the mind-boggling off-cpu time comes
>>>>>> > > >>>> from exclusive locking which should not be there to begin
>>>>>> > > >>>> with -- as in that xlock in stat should be a slock.
>>>>>> > > >>>>
>>>>>> > > >>>> Maybe I'm going to look into it later.
>>>>>> > > >>>
>>>>>> > > >>> That would be fantastic.
>>>>>> > > >>>
>>>>>> > > >> I did a quick test, things are shared locked as expected.
>>>>>> > > >>
>>>>>> > > >> However, I found the following:
>>>>>> > > >>         if ((xmp->nullm_flags & NULLM_CACHE) != 0) {
>>>>>> > > >>                 mp->mnt_kern_flag |= lowerrootvp->v_mount->mnt_kern_flag &
>>>>>> > > >>                     (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED |
>>>>>> > > >>                     MNTK_EXTENDED_SHARED);
>>>>>> > > >>         }
>>>>>> > > >>
>>>>>> > > >> are you using the "nocache" option? it has a side effect of
>>>>>> > > >> xlocking
>>>>>> > > >
>>>>>> > > > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache.
>>>>>> > > >
>>>>>> > > If you don't have "nocache" on null mounts, then I don't see how
>>>>>> > > this could happen.
>>>>>> >
>>>>>> > There is also MNTK_NULL_NOCACHE on lower fs, which is currently set
>>>>>> > for fuse and nfs at least.
>>>>>>
>>>>>> 11 of those 122 nullfs mounts are ZFS datasets which are also NFS
>>>>>> exported. 6 of those nullfs mounts are also exported via Samba. The
>>>>>> NFS exports shouldn't be needed anymore, I will remove them.
>>>>> By nfs I meant nfs client, not nfs exports.
>>>>
>>>> No NFS client mounts anywhere on this system. So where is this
>>>> exclusive lock coming from then...
>>>> This is a ZFS system. 2 pools: one for the root, one for anything I
>>>> need space for. Both pools reside on the same disks.
>>>> The root pool is a 3-way mirror, the "space-pool" is a 5-disk raidz2.
>>>> All jails are on the space-pool. The jails are all basejail-style
>>>> jails.
>>>>
>>> While I don't see why xlocking happens, you should be able to dtrace
>>> or printf your way into finding out.
>>
>> dtrace looks to me like a faster approach to get to the root than
>> printf... my first naive try is to detect exclusive locks. I'm not 100%
>> sure I got it right, but at least dtrace doesn't complain about it:
>> ---snip---
>> #pragma D option dynvarsize=32m
>>
>> fbt:nullfs:null_lock:entry
>> /args[0]->a_flags & 0x080000 != 0/
>> {
>>     stack();
>> }
>> ---snip---
>>
>> In which direction should I look with dtrace if this works in tonights
>> run of periodic? I don't have enough knowledge about VFS to come up
>> with some immediate ideas.
>
> After your sysctl fix for maxvnodes I increased the amount of vnodes 10
> times compared to the initial report. This has increased the speed of
> the operation, the find runs in all those jails finished today after ~5h
> (@~8am) instead of in the afternoon as before. Could this suggest that
> in parallel some null_reclaim() is running which does the exclusive
> locks and slows down the entire operation?
>
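For reference, 0x080000 is LK_EXCLUSIVE from sys/sys/lockmgr.h, so that
predicate does catch exclusive lock requests. One way to take it further
(a rough, untested sketch; the aggregation names are arbitrary) would be
to aggregate the kernel stacks instead of printing each one, and to count
shared vs. exclusive requests, so the most common xlock paths stand out
after the periodic run:
---snip---
/* kernel stacks that requested an exclusive nullfs vnode lock */
fbt:nullfs:null_lock:entry
/args[0]->a_flags & 0x080000/
{
        @xstacks[stack()] = count();
}

/* overall shared vs. exclusive ratio */
fbt:nullfs:null_lock:entry
{
        @kind[args[0]->a_flags & 0x080000 ? "exclusive" : "shared"] = count();
}

END
{
        trunc(@xstacks, 20);
        printa(@xstacks);
        printa(@kind);
}
---snip---
Left running over the periodic window, the top stacks should show who is
asking for the xlock.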
That may be a slowdown to some extent, but the primary problem is
exclusive vnode locking for stat lookup, which should not be happening.

-- 
Mateusz Guzik <mjguzik gmail.com>
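A quick way to tie those exclusive requests to the stat-heavy find runs
(same LK_EXCLUSIVE bit assumed, purely a sketch) is to count them per
process name:
---snip---
dtrace -n 'fbt:nullfs:null_lock:entry /args[0]->a_flags & 0x080000/ { @[execname] = count(); }'
---snip---
If find(1) dominates that output during the periodic run, the exclusive
locking is happening in the context of the find jobs on the null mounts
rather than in some background thread.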