On Mon, Jun 10, 2024 at 8:50 PM Kent Overstreet
<kent.overstr...@linux.dev> wrote:
>
> On Mon, Jun 10, 2024 at 08:44:37PM +0200, Mateusz Guzik wrote:
> > On Mon, Jun 10, 2024 at 8:13 PM Kent Overstreet
> > <kent.overstr...@linux.dev> wrote:
> > >
> > > On Sat, Jun 08, 2024 at 11:24:37AM +0200, Mateusz Guzik wrote:
> > > > On Fri, Jun 07, 2024 at 02:10:05PM -0400, Kent Overstreet wrote:
> > > > > Does the following patch help? I think the hammering on the key cache
> > > > > lock may be correlated with the key cache being mostly empty (and it
> > > > > looks like the shrinker code is behaving badly and trying very hard to
> > > > > free from a mostly empty cache)
> > > > >
> > > > > [snip the patch]
> > > >
> > > > I see you committed the patch along with some other stuff.
> > > >
> > > > I git pulled to 46eb7b6c7420c2313bde44ab8f74f303b042e754 ("bcachefs: add
> > > > might_sleep() annotations for fsck_err()").
> > > >
> > > > For me the problem is still there.
> > > >
> > > > This time around I populated the fs and *rebooted*, then did walktrees
> > > > test in a loop. So the first run has 0 state cached.
> > > >
> > > > It looked like this:
> > > > # while true; do sh walktrees /testfs 20; done
> > > > 3.17s user 605.22s system 1409% cpu 43.150 total
> > > > 2.77s user 887.92s system 693% cpu 2:08.34 total
> > > > 3.05s user 726.20s system 777% cpu 1:33.77 total
> > > > 3.19s user 600.90s system 1476% cpu 40.904 total
> > > > 3.10s user 956.80s system 599% cpu 2:40.19 total
> > > > 3.15s user 575.78s system 1331% cpu 43.464 total
> > > > 3.30s user 865.48s system 1386% cpu 1:02.64 total
> > > > 2.79s user 470.09s system 1404% cpu 33.666 total
> > > > 2.78s user 884.00s system 718% cpu 2:03.36 total
> > > > 2.95s user 568.63s system 1439% cpu 39.714 total
> > > > 3.02s user 964.90s system 729% cpu 2:12.72 total
> > > > 2.95s user 829.68s system 703% cpu 1:58.44 total
> > > > 2.77s user 597.92s system 1486% cpu 40.403 total
> > > >
> > > > So the time varies wildly and there is tons of off cpu time -- with 20
> > > > workers eating 100% cpu for the duration it would be 2000%.
> > > >
> > > > I also started seeing splats in dmesg (example at the end of the mail).
> > >
> > > this confirms that the SRCU splats we've been seeing are partly caused
> > > by this, as I suspected
> > >
> > > > That is to say I would suggest you also leave it running in a loop.
> > > > Expected setting is that it approaches complete CPU usage for the period
> > > > and takes in the ballpark of 40s (well shorter on your hw, but you get
> > > > the idea).
> > >
> > > Yeah, I do see it with more iterations.
> > >
> > > I'm redoing the key cache to get rid of the key cache lock - that solves
> > > it nicely. If you want to take a look at what I'm working on,
> > >
> > > https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-vfs-inodes-rhashtable
> > >
> > > this is unfinished - I'm going to be switching back to the polling srcu
> > > interface (so we can allocate from pending frees), and the patch to
> > > switch vfs inodes to an rhashtable currently breaks writeback - but it's
> > > much faster.
> >
> > So you are not patching up the vfs inode hash with rhashtables, just
> > migrating bcachefs away from it to your own rhashtables variant?
>
> Yeah, doing it for every filesystem will be more involved since the
> lifetime rules of fs/inode.c are tricky - and right now I just want to
> be able to profile without inode hash table lock contention blowing
> everything up.
>
> > I don't blame you, but this also means I'm going to do some touch ups
> > to my patch and send a v2. Was waiting for the bcachefs (and maybe
> > inode hash) situation to resolve itself.
>
> Well, we really should figure out some sort of a plan for fs/inode.c :/

My short-term plan is to get the cheap hack in; it already provides
the vast majority of whatever win one can expect from any inode hash
rework, for next to no price.

As for the long term, there is other stuff I intend to do in the VFS.
Should I manage to get that in and find this still unfinished, maybe
I'll do something about it.

Part of the problem is that a plain RCU conversion would require
reviewing every iget5_locked() and similar consumer out there, to see
whether their callbacks are safe to run under RCU. Not something I'm
signing up for anytime soon.

Anyhow, you are free to object on fsdevel with a patch of your own. :)

-- 
Mateusz Guzik <mjguzik gmail.com>
