Hi Frank,

On Sun, Aug 7, 2022 at 6:46 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Dhairya,
>
> I have some new results (below) and also some wishes as an operator that 
> might even help with the decision you mentioned in your e-mails:
>
> - Please implement both ways: the possibility to trigger an evaluation manually 
> via a "ceph tell|daemon" command, and a periodic evaluation.
> - For the periodic evaluation, please introduce a tuning parameter, for 
> example, mds_gc_interval (in seconds). If set to 0, disable periodic 
> evaluation (see the sketch below).
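>
> As a purely hypothetical sketch of what I mean (neither the option nor the 
> tell command exist today; the names are made up only to illustrate the 
> request):
>
>     # periodic evaluation, e.g. once per day; 0 would disable it
>     ceph config set mds mds_gc_interval 86400
>     # one-shot manual trigger, e.g. right after snapshot removal
>     ceph tell mds.<id> evaluate_strays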

FWIW, reintegration can be triggered with a filesystem scrub on pacific
ceph-mds (16.2.8+) daemons. A backport to octopus was planned

        https://github.com/ceph/ceph/pull/44657

but the PR didn't make it into the octopus release.
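
On such releases, a recursive scrub of the affected tree should kick off the
reintegration; something along these lines (rank and path are placeholders,
check the scrub documentation for your release for the exact options):

        ceph tell mds.<fsname>:0 scrub start / recursive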

>
> Reasons:
>
> - On most production systems, doing this once per 24 hours seems sufficient (my 
> benchmark is a special case; it needs to delete aggressively). The default for 
> mds_gc_interval could therefore be 86400 (24h).
> - On my production system I would probably disable periodic evaluation and 
> rather do a single-shot manual evaluation synchronised with snapshot removal 
> (which is when the "lost" entries are created), some time after the snapshots 
> are removed but before users start working.
>
> This follows a general software design principle: whenever there is a choice 
> like this to make, it is best to implement an API that can support all 
> use cases and to leave the choice of what fits their workloads best to 
> the operators. Try not to restrict operators by hard-coding decisions. Rather, 
> pick reasonable defaults, but also empower operators to tune things to special 
> needs. One-size-fits-all never works.
>
> Now to the results: Indeed, a restart triggers complete removal of all 
> orphaned stray entries:
>
> [root@rit-tceph bench]# ./mds-stray-num
> 962562
> [root@rit-tceph bench]# ceph mds fail 0
> failed mds gid 371425
> [root@rit-tceph bench]# ./mds-stray-num
> 767329
> [root@rit-tceph bench]# ./mds-stray-num
> 766777
> [root@rit-tceph bench]# ./mds-stray-num
> 572430
> [root@rit-tceph bench]# ./mds-stray-num
> 199172
> [root@rit-tceph bench]# ./mds-stray-num
> 0
> # ceph df
> --- RAW STORAGE ---
> CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd    2.4 TiB  2.4 TiB  896 MiB    25 GiB       0.99
> TOTAL  2.4 TiB  2.4 TiB  896 MiB    25 GiB       0.99
>
> --- POOLS ---
> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> device_health_metrics   1    1  205 KiB        9  616 KiB      0    785 GiB
> fs-meta1                2   64  684 MiB       44  2.0 GiB   0.09    785 GiB
> fs-meta2                3  128      0 B        0      0 B      0    785 GiB
> fs-data                 4  128      0 B        0      0 B      0    1.5 TiB
>
> Good to see that the bookkeeping didn't lose track of anything. I will add a 
> periodic mds fail to my benchmark and report back how all of this works under 
> heavy load.
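>
> Something like the following loop is what I have in mind (the interval is 
> arbitrary; mds-stray-num is just a thin wrapper around the perf dump query 
> quoted further down):
>
>     while true; do
>         sleep 3600          # fail the active MDS once per hour
>         ceph mds fail 0     # rank 0, as in the test above
>         ./mds-stray-num     # record the stray count after the failover
>     done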
>
> Best regards and thanks for your help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dhairya Parmar <dpar...@redhat.com>
> Sent: 05 August 2022 22:53:09
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] cephfs: num_stray growing without bounds (octopus)
>
> On Fri, Aug 5, 2022 at 9:12 PM Frank Schilder <fr...@dtu.dk> wrote:
> Hi Dhairya,
>
> thanks for pointing me to this tracker. I can try an MDS fail to see if it 
> clears the stray buckets or if there are still left-overs. Before doing so:
>
> > Thanks for the logs though. It will help me while writing the patch.
>
> I couldn't tell whether you were asking for logs. Do you want me to collect 
> something, or do you mean the session logs included in my e-mail? Also, was it 
> on purpose to leave the ceph-users list (e-mail address) out of the CC?
>
> Nah, the session logs included are good enough. I missed CCing ceph-users. 
> Done now.
>
> For my urgent needs, failing the MDS periodically during the benchmark might 
> be an interesting addition anyway - if this helps with the stray count.
>
> Yeah it might be helpful for now. Do let me know if that works for you.
>
> Thanks for your fast reply and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dhairya Parmar <dpar...@redhat.com>
> Sent: 05 August 2022 16:10
> To: Frank Schilder
> Subject: Re: [ceph-users] cephfs: num_stray growing without bounds (octopus)
>
> Hi Frank,
>
> This seems to be related to a tracker (https://tracker.ceph.com/issues/53724) 
> that I'm working on. I've got some rough ideas in mind: a simple solution 
> would be to run a single thread that regularly evaluates strays (maybe every 
> 1 or 2 minutes?), while a much better approach would be to evaluate strays 
> whenever snapshot removal takes place. That is not as easy as it looks, 
> though, so I'm currently going through the code to understand the whole 
> snapshot-removal process. I'll try my best to come up with something as soon 
> as possible. Thanks for the logs though. They will help me while writing the 
> patch.
>
> Regards,
> Dhairya
>
> On Fri, Aug 5, 2022 at 6:55 PM Frank Schilder <fr...@dtu.dk> wrote:
> Dear Gregory, Dan and Patrick,
>
> this is a reply to an older thread about num_stray growing without limits 
> (thread 
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2NT55RUMD33KLGQCDZ74WINPPQ6WN6CW,
>  message 
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/FYEN2W4HGMC6CGOCS2BS4PQDRPGUSNOO/).
>  I'm opening a new thread for a better matching subject line.
>
> I have now started testing octopus and I'm afraid I have come across a very 
> serious issue with unlimited growth of stray buckets. I'm running a test that 
> puts constant load on a file system by adding a blob of data, creating a 
> snapshot, deleting a blob of data and deleting a snapshot in a cyclic process. 
> A blob of data contains about 330K hard links to make it more interesting.
>
> The benchmark crashed after half a day in rm with "no space left on device", 
> which was due to the stray buckets being too full (old thread). OK, so I 
> increased mds_bal_fragment_size_max and cleaned out all data to start fresh. 
> However, this happened:
>
> [root@rit-tceph ~]# df -h /mnt/adm/cephfs
> Filesystem                             Size  Used Avail Use% Mounted on
> 10.41.24.13,10.41.24.14,10.41.24.15:/  2.5T   35G  2.5T   2% /mnt/adm/cephfs
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/
> /mnt/adm/cephfs/
> /mnt/adm/cephfs/data
> /mnt/adm/cephfs/data/blobs
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/.snap
> /mnt/adm/cephfs/.snap
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/data/.snap
> /mnt/adm/cephfs/data/.snap
>
> [root@rit-tceph ~]# find /mnt/adm/cephfs/data/blobs/.snap
> /mnt/adm/cephfs/data/blobs/.snap
>
> All snapshots were taken in /mnt/adm/cephfs/.snap. Snaptrimming finished a 
> long time ago. Now look at this:
>
> [root@rit-tceph ~]# ssh "tceph-03" "ceph daemon mds.tceph-03 perf dump | jq 
> .mds_cache.num_strays"
> 962562
>
> Whaaaaat?
>
> There is data left over in the fs pools and the stray buckets are clogged up.
>
> [root@rit-tceph ~]# ceph df
> --- RAW STORAGE ---
> CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd    2.4 TiB  2.4 TiB  1.4 GiB    35 GiB       1.38
> TOTAL  2.4 TiB  2.4 TiB  1.4 GiB    35 GiB       1.38
>
> --- POOLS ---
> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> device_health_metrics   1    1  170 KiB        9  509 KiB      0    781 GiB
> fs-meta1                2   64  2.2 GiB  160.25k  6.5 GiB   0.28    781 GiB
> fs-meta2                3  128      0 B  802.40k      0 B      0    781 GiB
> fs-data                 4  128      0 B  802.40k      0 B      0    1.5 TiB
>
> There is either a very serious bug with cleaning up stray entries when their 
> last snapshot is deleted, or I'm missing something important here when 
> deleting data. Just for completeness:
>
> [root@rit-tceph ~]# ceph status
>   cluster:
>     id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 10d)
>     mgr: tceph-01(active, since 10d), standbys: tceph-02, tceph-03
>     mds: fs:1 {0=tceph-03=up:active} 2 up:standby
>     osd: 9 osds: 9 up (since 4d), 9 in (since 4d)
>
>   data:
>     pools:   4 pools, 321 pgs
>     objects: 1.77M objects, 256 MiB
>     usage:   35 GiB used, 2.4 TiB / 2.4 TiB avail
>     pgs:     321 active+clean
>
> I would be most grateful for both an explanation of what happened here and a 
> way to get out of this. To me it looks very much like unlimited growth of 
> garbage that is never cleaned out.
>
> Many thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Gregory Farnum <gfar...@redhat.com>
> Sent: 08 February 2022 18:22
> To: Dan van der Ster
> Cc: Frank Schilder; Patrick Donnelly; ceph-users
> Subject: Re: [ceph-users] Re: cephfs: [ERR] loaded dup inode
>
> On Tue, Feb 8, 2022 at 7:30 AM Dan van der Ster <dvand...@gmail.com> wrote:
> >
> > On Tue, Feb 8, 2022 at 1:04 PM Frank Schilder <fr...@dtu.dk> wrote:
> > > The reason for this seemingly strange behaviour was an old static 
> > > snapshot taken in an entirely different directory. Apparently, ceph fs 
> > > snapshots are not local to an FS directory sub-tree but always global on 
> > > the entire FS despite the fact that you can only access the sub-tree in 
> > > the snapshot, which easily leads to the wrong conclusion that only data 
> > > below the directory is in the snapshot. As a consequence, the static 
> > > snapshot was accumulating the garbage from the rotating snapshots even 
> > > though these sub-trees were completely disjoint.
> >
> > So are you saying that if I do this I'll have 1M files in stray?
>
> No, happily.
>
> The thing that's happening here post-dates my main previous stretch on
> CephFS and I had forgotten it, but there's a note in the developer
> docs: https://docs.ceph.com/en/latest/dev/cephfs-snapshots/#hard-links
> (I fortuitously stumbled across this from an entirely different
> direction/discussion just after seeing this thread and put the pieces
> together!)
>
> Basically, hard links are *the worst*. For everything in filesystems.
> I spent a lot of time trying to figure out how to handle hard links
> being renamed across snapshots[1] and never managed it, and the
> eventual "solution" was to give up and do the degenerate thing:
> If there's a file with multiple hard links, that file is a member of
> *every* snapshot.
>
> Doing anything about this will take a lot of time. There's probably an
> opportunity to improve it for users of the subvolumes library, as
> those subvolumes do get tagged a bit, so I'll see if we can look into
> that. But for generic CephFS, I'm not sure what the solution will look
> like at all.
>
> Sorry folks. :/
> -Greg
>
> [1]: The issue is that, if you have a hard linked file in two places,
> you would expect it to be snapshotted whenever a snapshot covering
> either location occurs. But in CephFS the file can only live in one
> location, and the other location has to just hold a reference to it
> instead. So say you have inode Y at path A, and then hard link it in
> at path B. Given how snapshots work, when you open up Y from A, you
> would need to check all the snapshots that apply from both A and B's
> trees. But 1) opening up other paths is a challenge all on its own,
> and 2) without an inode and its backtrace to provide a lookup resolve
> point, it's impossible to maintain a lookup that scales and is
> possible to keep consistent.
> (Oh, I did just have one idea, but I'm not sure if it would fix every
> issue or just that scalable backtrace lookup:
> https://tracker.ceph.com/issues/54205)
>
> >
> > mkdir /a
> > cd /a
> > for i in {1..1000000}; do touch $i; done  # create 1M files in /a
> > cd ..
> > mkdir /b
> > mkdir /b/.snap/testsnap  # create a snap in the empty dir /b
> > rm -rf /a/
> >
> >
> > Cheers, Dan
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> --
> Dhairya Parmar
>
> He/Him/His
>
> Associate Software Engineer, CephFS
>
> Red Hat Inc. <https://www.redhat.com/>
>
> dpar...@redhat.com
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Cheers,
Venky

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
