[ceph-users] Re: Errors when scrub ~mdsdir and lots of num_strays

2022-03-05 Thread Arnaud M
Thanks to all

So I will wait for the update and see if it helps to resolve the issue.

All the best

Arnaud


[ceph-users] Re: Errors when scrub ~mdsdir and lots of num_strays

2022-03-01 Thread Dan van der Ster
Hi,

There was a recent (long) thread about this. It might give you some hints:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2NT55RUMD33KLGQCDZ74WINPPQ6WN6CW/

And about the crash, it could be related to
https://tracker.ceph.com/issues/51824
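If you want to check whether it is the same backtrace, the crash module (if
enabled) keeps the reports; something like this should let you compare with
the tracker (the crash id below is a placeholder):

ceph crash ls
ceph crash info <crash-id>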

Cheers, dan



[ceph-users] Re: Errors when scrub ~mdsdir and lots of num_strays

2022-03-01 Thread Arnaud M
Hello Dan

Thanks a lot for the answer

I do remove the snaps every day (I keep them for one month),
but "num_strays" never seems to decrease.
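(By "remove" I mean deleting the snapshot from a client mount, something
like:

rmdir /cephfs/.snap/snap-2022-02-01

where the mount point and snapshot name are just examples.)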

I know I can do a listing of the folder with "find . -ls".
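(From a client mount that would be something like:

find /cephfs/some/folder -ls > /dev/null

with the path being an example; the recursive listing should let the MDS
reintegrate the strays under that folder.)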

So my question is: is there a way to find the directories causing the
strays, so I can "find . -ls" them? I would prefer not to do it on my whole
cluster, as it will take time (several days, and more if I need to do it on
every snap as well) and will certainly overload the MDS.

Please let me know if there is a way to spot the source of the strays, so I
can find the folder/snap with the most strays.

And what about the scrub of ~mdsdir, which crashes every time with this error:

{
"damage_type": "dir_frag",
"id": 3776355973,
"ino": 1099567262916,
"frag": "*",
"path": "~mds0/stray3/1000350ecc4"
},
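(This entry is from the damage list, i.e. from something like "ceph tell
mds.0 damage ls".)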

Again, thanks for your help, that is really appreciated

All the best

Arnaud

On Tue, Mar 1, 2022 at 11:02 AM Dan van der Ster wrote:

> Hi,
>
> stray files are created when you have hardlinks to deleted files, or
> snapshots of deleted files.
> You need to delete the snapshots, or "reintegrate" the hardlinks by
> recursively listing the relevant files.
>
> BTW, in pacific there isn't a big problem with accumulating lots of
> stray files. (Before pacific there was a default limit of 1M strays,
> but that is now removed).
>
> Cheers, dan


[ceph-users] Re: Errors when scrub ~mdsdir and lots of num_strays

2022-03-01 Thread Arnaud M
I am using Ceph Pacific (16.2.5).
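(The version of all daemons can be checked cluster-wide with "ceph versions".)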

Does anyone have an idea about my issues?

Thanks again to everyone

All the best

Arnaud

On Tue, Mar 1, 2022 at 1:04 AM Arnaud M wrote:

> Hello to everyone
>
> Our Ceph cluster is healthy and everything seems to be going well, but we
> have a very high num_strays:
>
> ceph tell mds.0 perf dump | grep stray
> "num_strays": 1990574,
> "num_strays_delayed": 0,
> "num_strays_enqueuing": 0,
> "strays_created": 3,
> "strays_enqueued": 17,
> "strays_reintegrated": 0,
> "strays_migrated": 0,
>
> And num_strays doesn't seem to decrease whatever we do (scrub /, or scrub
> ~mdsdir).
> And when we scrub ~mdsdir (force,recursive,repair) we get these errors:
>
> {
> "damage_type": "dir_frag",
> "id": 3775653237,
> "ino": 1099569233128,
> "frag": "*",
> "path": "~mds0/stray3/100036efce8"
> },
> {
> "damage_type": "dir_frag",
> "id": 3776355973,
> "ino": 1099567262916,
> "frag": "*",
> "path": "~mds0/stray3/1000350ecc4"
> },
> {
> "damage_type": "dir_frag",
> "id": 3776485071,
> "ino": 1099559071399,
> "frag": "*",
> "path": "~mds0/stray4/10002d3eea7"
> },
>
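> (For reference, the scrub is started with something like:
>
> ceph tell mds.0 scrub start ~mdsdir recursive,repair,force
>
> i.e. the force, recursive and repair flags mentioned above.)
>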
> And just before the end of the ~mdsdir scrub the MDS crashes, and I have to
> run
>
> ceph mds repaired 0
>
> to bring the filesystem back online.
>
> There are a lot of these errors. Do you have any idea what they are and how
> I should handle them?
>
> We have a lot of data in our CephFS cluster (350 TB+), and we take a
> snapshot of / every day and keep them for 1 month (rolling).
>
> Here is our cluster state:
>
> ceph -s
>   cluster:
> id: 817b5736-84ae-11eb-bf7b-c9513f2d60a9
> health: HEALTH_WARN
> 78 pgs not deep-scrubbed in time
> 70 pgs not scrubbed in time
>
>   services:
> mon: 3 daemons, quorum ceph-r-112-1,ceph-g-112-3,ceph-g-112-2 (age 10d)
> mgr: ceph-g-112-2.ghcodb(active, since 4d), standbys:
> ceph-g-112-1.ksojnh
> mds: 1/1 daemons up, 1 standby
> osd: 67 osds: 67 up (since 14m), 67 in (since 7d)
>
>   data:
> volumes: 1/1 healthy
> pools:   5 pools, 609 pgs
> objects: 186.86M objects, 231 TiB
> usage:   351 TiB used, 465 TiB / 816 TiB avail
> pgs: 502 active+clean
>  82  active+clean+snaptrim_wait
>  20  active+clean+snaptrim
>  4   active+clean+scrubbing+deep
>  1   active+clean+scrubbing+deep+snaptrim_wait
>
>   io:
> client:   8.8 MiB/s rd, 39 MiB/s wr, 25 op/s rd, 54 op/s wr
>
> My questions are about the damage found by the ~mdsdir scrub: should I
> worry about it? What does it mean? It seems to be linked to my issue of the
> high number of strays, is that right? How can I fix it, and how can I
> reduce num_strays?
>
> Thanks to all
>
> All the best
>
> Arnaud
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io