On Fri, Jun 07, 2024 at 06:51:05PM +0200, Mateusz Guzik wrote:
> On Fri, Jun 7, 2024 at 6:28 PM Mateusz Guzik <mjgu...@gmail.com> wrote:
> >
> > On Fri, Jun 7, 2024 at 6:07 PM Kent Overstreet
> > <kent.overstr...@linux.dev> wrote:
> > >
> > > On Fri, Jun 07, 2024 at 12:10:34PM +0200, Mateusz Guzik wrote:
> > > > On Fri, Jun 7, 2024 at 11:13 AM Kent Overstreet
> > > > <kent.overstr...@linux.dev> wrote:
> > > > >
> > > > > On Fri, Jun 07, 2024 at 08:50:40AM +0200, Mateusz Guzik wrote:
> > > > > > On Fri, Jun 7, 2024 at 2:31 AM Kent Overstreet
> > > > > > <kent.overstr...@linux.dev> wrote:
> > > > > > >
> > > > > > > On Thu, Jun 06, 2024 at 08:40:50PM +0200, Mateusz Guzik wrote:
> > > > > > > > So I tried out bcachefs again and it once more fails to complete
> > > > > > > > parallel creation of 20 mln files -- processes doing the work use cpu
> > > > > > > > time in the kernel indefinitely.
> > > > > > >
> > > > > > > Hey thanks for the report - can you try my branch? It's behaving
> > > > > > > reasonably well in my testing now
> > > > > > >
> > > > > > > https://evilpiepirate.org/git/bcachefs.git bcachefs-for-upstream
> > > > > >
> > > > > > No dice, booted up top of your tree: 6.10.0-rc2-00013-gb7d8959f5ce9
> > > > >
> > > > > That commit completely fixes the lock contention on the btree key cache
> > > > > lock I was seeing before, are you sure that's the commit you were
> > > > > running?
> > > >
> > > > Yea, I have CONFIG_LOCALVERSION_AUTO=y and included part of uname -a
> > > > from the running kernel to indicate the commit. That's b7d8959f5ce9
> > > > "bcachefs: Fix reporting of freed objects from key cache shrinker".
> > > >
> > > > I also have CONFIG_BCACHEFS=y
> > > >
> > > > But to put this to bed I added this:
> > > > diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
> > > > index e73162f9af37..459f0871c9a4 100644
> > > > --- a/fs/bcachefs/btree_key_cache.c
> > > > +++ b/fs/bcachefs/btree_key_cache.c
> > > > @@ -821,8 +821,14 @@ static unsigned long bch2_btree_key_cache_scan(struct shrinker *shrink,
> > > >  	size_t scanned = 0, freed = 0, nr = sc->nr_to_scan;
> > > >  	unsigned start, flags;
> > > >  	int srcu_idx;
> > > > +	static bool printed;
> > > >  
> > > >  	mutex_lock(&bc->lock);
> > > > +	if (!printed) {
> > > > +		printk(KERN_EMERG "here\n");
> > > > +		printed = true;
> > > > +	}
> > > > +
> > > >  	bc->requested_to_free += sc->nr_to_scan;
> > > >  
> > > >  	srcu_idx = srcu_read_lock(&c->btree_trans_barrier);
> > > >
> > > > it shows up and my procs are still stuck in the kernel
> > > >
> > > > Is it a problem for you to get a similar setup to mine? It's not very
> > > > outlandish -- 24 cores and 24G of ram. You can probably get away with
> > > > both less cores and ram.
> > >
> > > I've tried with 4 GB of ram and with 24, and giving the vm 24 cores -
> > > your test passes for me, latest run was in 71 seconds.
> > >
> > > Host machine is a ryzen 5950x, 16 physical cores, 32 hyperthreaded - the
> > > lack of 24 physical cores is the only possible difference I can see. But
> > > I'm just not seeing the key cache lock contention anymore - I am seeing
> > > massive lock contention on the inode hash table lock, though.
> > >
> > > Here's a call graph profile of part of the last run. There's still a bit
> > > of key cache lock contention, but it's only 2% of cpu time now.
> > >
> >
> > Huh, it does appear fixed after all.
> >
> > I tried it out again, but this time without any of the debug.
> > It *did* finish, in 3 minutes. I ran it again just to be sure.
> > Also the stalls I was seeing while trying to poke around are gone (for
> > example bpftrace would hang).
> >
> > The kernel with your fixes saddled with all the debug probably needed
> > significantly more than said 3 minutes and I concluded things went bad
> > before it could legitimately finish. Kind of a PEBKAC here.
> >
> > tl;dr I confirm it is fixed, thanks
> >
>
> While things do *work*, the walktree part of the test is very slow
> compared to ext4 and btrfs, both of which do it in less than a
> minute.
> bcachefs repeatedly needs over two.
>
> mkfs.bcachefs /dev/vdb
> mount.bcachefs.sh -o noatime /dev/vdb /testfs
> sh createtrees /testfs 20
> while true; do sh walktrees /testfs 20; done
>
> I get times:
> 2.80s user 986.09s system 697% cpu 2:21.84 total
> 2.77s user 993.15s system 748% cpu 2:13.07 total
> 2.95s user 992.67s system 747% cpu 2:13.26 total
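(The createtrees/walktrees scripts referenced above aren't included in this
thread. For readers trying to follow along, here is a rough, hypothetical
sketch of what a createtrees-style workload might look like - the worker
count, per-worker file count and directory layout are all assumptions, not
the actual test:)

#!/bin/sh
# createtrees sketch: spawn $2 workers, each building its own tree of
# empty files under $1/tree$i. 20 workers x 1M files would match the
# "20 mln files" mentioned earlier; both numbers are assumptions.
FS=$1
NPROC=$2
FILES_PER_WORKER=1000000

i=0
while [ "$i" -lt "$NPROC" ]; do
	(
		mkdir -p "$FS/tree$i"
		j=0
		while [ "$j" -lt "$FILES_PER_WORKER" ]; do
			: > "$FS/tree$i/f$j"	# create an empty file
			j=$((j + 1))
		done
	) &
	i=$((i + 1))
done
wait

(A walktrees-style pass would then presumably stat every file back in
parallel, e.g. one find(1) walk per tree.)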
ext4 gives me

WATCHDOG 300
mke2fs 1.47.1-rc2 (01-May-2024)
Discarding device blocks: done
Creating filesystem with 8386560 4k blocks and 41943040 inodes
Filesystem UUID: 4e72e045-ef0d-4d49-82d8-bc83e853a492
Superblock backups stored on blocks:
	6552, 19656, 32760, 45864, 58968, 163800, 176904, 321048, 530712,
	819000, 1592136, 2247336, 4095000, 4776408

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

EXT4-fs (vdb): mounted filesystem 4e72e045-ef0d-4d49-82d8-bc83e853a492 r/w with ordered data mode. Quota mode: disabled.

createtrees
real	0m27.552s
user	0m6.678s
sys	4m17.175s

walktrees
real	0m34.830s
user	0m2.266s
sys	10m38.617s

and I get similar from bcachefs - except I did just have a run where
createtrees took ~3m with key cache lock contention spiking again, so it
appears that's not completely fixed.

How are you getting sub 3s?