On Fri, Jun 7, 2024 at 8:10 PM Kent Overstreet
<kent.overstr...@linux.dev> wrote:
>
> On Fri, Jun 07, 2024 at 07:59:10PM +0200, Mateusz Guzik wrote:
> > On Fri, Jun 7, 2024 at 7:53 PM Kent Overstreet
> > <kent.overstr...@linux.dev> wrote:
> > >
> > > On Fri, Jun 07, 2024 at 06:51:05PM +0200, Mateusz Guzik wrote:
> > > > On Fri, Jun 7, 2024 at 6:28 PM Mateusz Guzik <mjgu...@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jun 7, 2024 at 6:07 PM Kent Overstreet
> > > > > <kent.overstr...@linux.dev> wrote:
> > > > > >
> > > > > > On Fri, Jun 07, 2024 at 12:10:34PM +0200, Mateusz Guzik wrote:
> > > > > > > On Fri, Jun 7, 2024 at 11:13 AM Kent Overstreet
> > > > > > > <kent.overstr...@linux.dev> wrote:
> > > > > > > >
> > > > > > > > On Fri, Jun 07, 2024 at 08:50:40AM +0200, Mateusz Guzik wrote:
> > > > > > > > > On Fri, Jun 7, 2024 at 2:31 AM Kent Overstreet
> > > > > > > > > <kent.overstr...@linux.dev> wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Jun 06, 2024 at 08:40:50PM +0200, Mateusz Guzik 
> > > > > > > > > > wrote:
> > > > > > > > > > > So I tried out bcachefs again and it once more fails to
> > > > > > > > > > > complete parallel creation of 20 mln files -- processes
> > > > > > > > > > > doing the work use cpu time in the kernel indefinitely.
> > > > > > > > > >
> > > > > > > > > > Hey thanks for the report - can you try my branch? It's
> > > > > > > > > > behaving reasonably well in my testing now
> > > > > > > > > >
> > > > > > > > > > https://evilpiepirate.org/git/bcachefs.git bcachefs-for-upstream
> > > > > > > > >
> > > > > > > > > No dice, booted up top of your tree: 6.10.0-rc2-00013-gb7d8959f5ce9
> > > > > > > >
> > > > > > > > That commit completely fixes the lock contention on the btree
> > > > > > > > key cache lock I was seeing before, are you sure that's the
> > > > > > > > commit you were running?
> > > > > > >
> > > > > > > Yea, I have CONFIG_LOCALVERSION_AUTO=y and included part of
> > > > > > > uname -a from the running kernel to indicate the commit. That's
> > > > > > > b7d8959f5ce9 "bcachefs: Fix reporting of freed objects from key
> > > > > > > cache shrinker".
> > > > > > >
> > > > > > > I also have CONFIG_BCACHEFS=y
> > > > > > >
> > > > > > > But to put this to bed I added this:
> > > > > > > diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
> > > > > > > index e73162f9af37..459f0871c9a4 100644
> > > > > > > --- a/fs/bcachefs/btree_key_cache.c
> > > > > > > +++ b/fs/bcachefs/btree_key_cache.c
> > > > > > > @@ -821,8 +821,14 @@ static unsigned long bch2_btree_key_cache_scan(struct shrinker *shrink,
> > > > > > >         size_t scanned = 0, freed = 0, nr = sc->nr_to_scan;
> > > > > > >         unsigned start, flags;
> > > > > > >         int srcu_idx;
> > > > > > > +       static bool printed;
> > > > > > >
> > > > > > >         mutex_lock(&bc->lock);
> > > > > > > +       if (!printed) {
> > > > > > > +               printk(KERN_EMERG "here\n");
> > > > > > > +               printed = true;
> > > > > > > +       }
> > > > > > > +
> > > > > > >         bc->requested_to_free += sc->nr_to_scan;
> > > > > > >
> > > > > > >         srcu_idx = srcu_read_lock(&c->btree_trans_barrier);
> > > > > > >
> > > > > > >
> > > > > > > it shows up and my procs are still stuck in the kernel
> > > > > > >
> > > > > > > Is it a problem for you to get a similar setup to mine? It's not
> > > > > > > very outlandish -- 24 cores and 24G of ram. You can probably get
> > > > > > > away with both fewer cores and less ram.
> > > > > >
> > > > > > I've tried with 4 GB of ram and with 24, and giving the vm 24
> > > > > > cores - your test passes for me, latest run was in 71 seconds.
> > > > > >
> > > > > > Host machine is a ryzen 5950x, 16 physical cores, 32 hyperthreaded
> > > > > > - the lack of 24 physical cores is the only possible difference I
> > > > > > can see. But I'm just not seeing the key cache lock contention
> > > > > > anymore - I am seeing massive lock contention on the inode hash
> > > > > > table lock, though.
> > > > > >
> > > > > > Here's a call graph profile of part of the last run. There's
> > > > > > still a bit of key cache lock contention, but it's only 2% of cpu
> > > > > > time now.
> > > > > >
> > > > >
> > > > > Huh, it does appear fixed after all.
> > > > >
> > > > > I tried it out again, but this time without any of the debug. It *did*
> > > > > finish, in 3 minutes. I ran it again just to be sure.
> > > > > The stalls I was seeing while trying to poke around are also gone
> > > > > (for example bpftrace would hang).
> > > > >
> > > > > The kernel with your fixes, saddled with all the debug, probably needed
> > > > > significantly more than those 3 minutes, and I concluded things had gone
> > > > > bad before it could legitimately finish. Kind of a PEBKAC here.
> > > > >
> > > > > tl;dr I confirm it is fixed, thanks
> > > > >
> > > >
> > > > While things do *work*, the walktree part of the test is very slow
> > > > compared to ext4, which does it in less than a minute; bcachefs
> > > > repeatedly needs over two.
> > > >
> > > > mkfs.bcachefs /dev/vdb
> > > > mount.bcachefs.sh -o noatime /dev/vdb /testfs
> > > > sh createtrees /testfs 20
> > > > while true; do sh walktrees /testfs 20; done
> > > >
> > > > I get times:
> > > > 2.80s user 986.09s system 697% cpu 2:21.84 total
> > > > 2.77s user 993.15s system 748% cpu 2:13.07 total
> > > > 2.95s user 992.67s system 747% cpu 2:13.26 total
> > >
> > > ext4 gives me
> > > WATCHDOG 300
> > > mke2fs 1.47.1-rc2 (01-May-2024)
> > > Discarding device blocks: done
> > > Creating filesystem with 8386560 4k blocks and 41943040 inodes
> > > Filesystem UUID: 4e72e045-ef0d-4d49-82d8-bc83e853a492
> > > Superblock backups stored on blocks:
> > >         6552, 19656, 32760, 45864, 58968, 163800, 176904, 321048, 530712,
> > >         819000, 1592136, 2247336, 4095000, 4776408
> > >
> > > Allocating group tables: done
> > > Writing inode tables: done
> > > Creating journal (32768 blocks): done
> > > Writing superblocks and filesystem accounting information: done
> > >
> > > EXT4-fs (vdb): mounted filesystem 4e72e045-ef0d-4d49-82d8-bc83e853a492 r/w with ordered data mode. Quota mode: disabled.
> > >
> > > createtrees
> > > real    0m27.552s
> > > user    0m6.678s
> > > sys     4m17.175s
> > >
> > > walktrees
> > > real    0m34.830s
> > > user    0m2.266s
> > > sys     10m38.617s
> > >
> > > and I get similar numbers from bcachefs - except I did just have a run where
> > > createtrees took ~3m with key cache lock contention spiking again, so it
> > > appears that's not completely fixed.
> > >
> > > How are you getting sub 3s?
> >
> > There is no sub-3s anything except maybe user time.
> >
> > On my hw the total real time for walktree on bcachefs is about 2
> > minutes 13 seconds for repeated runs.
> > It is about 45 seconds on ext4.
> >
> > Looks like my hw runs into the key cache lock thing very easily.
>
> Ok, I misread.
>
> Does the following patch help? I think the hammering on the key cache
> lock may be correlated with the key cache being mostly empty (and it
> looks like the shrinker code is behaving badly and trying very hard to
> free from a mostly empty cache)
>

I'm going to bench tomorrow.

> re: your inode lock contention patch series - we really just need to
> convert fs/inode.c to an rhashtable.
>

I'm not familiar with rhashtable and don't have much interest in learning
the API; a big part of the point of the posted patch was how trivial it
is.

That is to say, if you are willing to hack it up yourself in the
foreseeable future, I'd have no complaints.
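
For reference, a minimal sketch of what using the kernel's rhashtable API
looks like -- the object and field names below are made up for illustration,
this is not the actual fs/inode.c conversion being suggested:

#include <linux/rhashtable.h>

/* hypothetical object; the real inode hash keys on more than one value */
struct demo_obj {
	unsigned long		key;
	struct rhash_head	node;	/* hash table linkage */
};

static const struct rhashtable_params demo_params = {
	.key_len	= sizeof(unsigned long),
	.key_offset	= offsetof(struct demo_obj, key),
	.head_offset	= offsetof(struct demo_obj, node),
	.automatic_shrinking = true,	/* table grows and shrinks on its own */
};

static struct rhashtable demo_table;

static int demo_init(void)
{
	return rhashtable_init(&demo_table, &demo_params);
}

static int demo_insert(struct demo_obj *obj)
{
	return rhashtable_insert_fast(&demo_table, &obj->node, demo_params);
}

static struct demo_obj *demo_lookup(unsigned long key)
{
	/* lockless lookup; rhashtable_lookup_fast takes the RCU read lock itself */
	return rhashtable_lookup_fast(&demo_table, &key, demo_params);
}

static int demo_remove(struct demo_obj *obj)
{
	return rhashtable_remove_fast(&demo_table, &obj->node, demo_params);
}

Presumably the appeal for the inode hash is that lookups become lockless and
the table resizes itself, instead of everything serializing on one spinlock
over a fixed-size table.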

> diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
> index e73162f9af37..24e88eebd23c 100644
> --- a/fs/bcachefs/btree_key_cache.c
> +++ b/fs/bcachefs/btree_key_cache.c
> @@ -875,24 +875,22 @@ static unsigned long bch2_btree_key_cache_scan(struct shrinker *shrink,
>
>                         if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
>                                 bc->skipped_dirty++;
> -                               goto next;
>                         } else if (test_bit(BKEY_CACHED_ACCESSED, &ck->flags)) {
>                                 clear_bit(BKEY_CACHED_ACCESSED, &ck->flags);
>                                 bc->skipped_accessed++;
> -                               goto next;
> -                       } else if (bkey_cached_lock_for_evict(ck)) {
> +                       } else if (!bkey_cached_lock_for_evict(ck)) {
> +                               bc->skipped_lock_fail++;
> +                       } else {
>                                 bkey_cached_evict(bc, ck);
>                                 bkey_cached_free(bc, ck);
>                                 bc->moved_to_freelist++;
>                                 freed++;
> -                       } else {
> -                               bc->skipped_lock_fail++;
>                         }
>
>                         scanned++;
>                         if (scanned >= nr)
>                                 break;
> -next:
> +
>                         pos = next;
>                 }
>
> @@ -915,7 +913,8 @@ static unsigned long bch2_btree_key_cache_count(struct shrinker *shrink,
>         struct bch_fs *c = shrink->private_data;
>         struct btree_key_cache *bc = &c->btree_key_cache;
>         long nr = atomic_long_read(&bc->nr_keys) -
> -               atomic_long_read(&bc->nr_dirty);
> +               atomic_long_read(&bc->nr_dirty) -
> +               128;
>
>         return max(0L, nr);
>  }



-- 
Mateusz Guzik <mjguzik gmail.com>
