[ceph-users] Re: Errors when scrub ~mdsdir and lots of num_strays
Thanks to all.

So then I will wait for the update and see if that helps to resolve the issue.

All the best

Arnaud

On Tue, Mar 1, 2022 at 11:39 AM, Dan van der Ster wrote:
> Hi,
>
> There was a recent (long) thread about this. It might give you some hints:
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2NT55RUMD33KLGQCDZ74WINPPQ6WN6CW/
>
> And about the crash, it could be related to https://tracker.ceph.com/issues/51824
>
> Cheers, dan
>
> On Tue, Mar 1, 2022 at 11:30 AM Arnaud M wrote:
> >
> > Hello Dan
> >
> > Thanks a lot for the answer.
> >
> > I do remove the snaps every day (I keep them for one month), but the "num_strays" never seems to reduce.
> >
> > I know I can do a listing of the folder with "find . -ls".
> >
> > So my question is: is there a way to find the directory causing the strays so I can "find . -ls" them? I would prefer not to do it on my whole cluster, as it will take time (several days, and more if I need to do it also on every snap) and will certainly overload the MDS.
> >
> > Please let me know if there is a way to spot the source of the strays, so I can find the folder/snap with the most strays.
> >
> > And what about the scrub of ~mdsdir, which crashes every time with the error:
> >
> > {
> >     "damage_type": "dir_frag",
> >     "id": 3776355973,
> >     "ino": 1099567262916,
> >     "frag": "*",
> >     "path": "~mds0/stray3/1000350ecc4"
> > },
> >
> > Again, thanks for your help, that is really appreciated.
> >
> > All the best
> >
> > Arnaud
> >
> > On Tue, Mar 1, 2022 at 11:02 AM, Dan van der Ster wrote:
> > >
> > > Hi,
> > >
> > > Stray files are created when you have hardlinks to deleted files, or snapshots of deleted files.
> > > You need to delete the snapshots, or "reintegrate" the hardlinks by recursively listing the relevant files.
> > >
> > > BTW, in Pacific there isn't a big problem with accumulating lots of stray files. (Before Pacific there was a default limit of 1M strays, but that is now removed.)
> > >
> > > Cheers, dan
> > >
> > > On Tue, Mar 1, 2022 at 1:04 AM Arnaud M wrote:
> > > >
> > > > Hello to everyone
> > > >
> > > > Our ceph cluster is healthy and everything seems to go well, but we have a lot of num_strays:
> > > >
> > > > ceph tell mds.0 perf dump | grep stray
> > > >     "num_strays": 1990574,
> > > >     "num_strays_delayed": 0,
> > > >     "num_strays_enqueuing": 0,
> > > >     "strays_created": 3,
> > > >     "strays_enqueued": 17,
> > > >     "strays_reintegrated": 0,
> > > >     "strays_migrated": 0,
> > > >
> > > > And num_strays doesn't seem to reduce whatever we do (scrub / or scrub ~mdsdir).
> > > > And when we scrub ~mdsdir (force,recursive,repair) we get these errors:
> > > >
> > > > {
> > > >     "damage_type": "dir_frag",
> > > >     "id": 3775653237,
> > > >     "ino": 1099569233128,
> > > >     "frag": "*",
> > > >     "path": "~mds0/stray3/100036efce8"
> > > > },
> > > > {
> > > >     "damage_type": "dir_frag",
> > > >     "id": 3776355973,
> > > >     "ino": 1099567262916,
> > > >     "frag": "*",
> > > >     "path": "~mds0/stray3/1000350ecc4"
> > > > },
> > > > {
> > > >     "damage_type": "dir_frag",
> > > >     "id": 3776485071,
> > > >     "ino": 1099559071399,
> > > >     "frag": "*",
> > > >     "path": "~mds0/stray4/10002d3eea7"
> > > > },
> > > >
> > > > There are a lot of these entries. And just before the end of the ~mdsdir scrub the MDS crashes, and I have to do a "ceph mds repaired 0" to get the filesystem back online. Do you have any idea what those errors are and how I should handle them?
> > > >
> > > > We have a lot of data in our CephFS cluster (350 TB+) and we take a snapshot of / every day and keep them for 1 month (rolling).
> > > >
> > > > Here is our cluster state:
> > > >
> > > > ceph -s
> > > >   cluster:
> > > >     id:     817b5736-84ae-11eb-bf7b-c9513f2d60a9
> > > >     health: HEALTH_WARN
> > > >             78 pgs not deep-scrubbed in time
> > > >             70 pgs not scrubbed in time
> > > >
> > > >   services:
> > > >     mon: 3 daemons, quorum ceph-r-112-1,ceph-g-112-3,ceph-g-112-2 (age 10d)
> > > >     mgr: ceph-g-112-2.ghcodb(active, since 4d), standbys: ceph-g-112-1.ksojnh
> > > >     mds: 1/1 daemons up, 1 standby
> > > >     osd: 67 osds: 67 up (since 14m), 67 in (since 7d)
> > > >
> > > >   data:
> > > >     volumes: 1/1 healthy
> > > >     pools:   5 pools, 609 pgs
> > > >     objects: 186.86M objects, 231 TiB
> > > >     usage:   351 TiB used, 465 TiB / 816 TiB avail
> > > >     pgs:     502 active+clean
> > > >              82  active+clean+snaptrim_wait
> > > >              20  active+clean+snaptrim
> > > >              4   active+clean+scrubbing+deep
> > > >              1   active+clean+scrubbing+deep+snaptrim_wait
> > > >
> > > >   io:
> > > >     client: 8.8 MiB/s rd, 39 MiB/s wr, 25 op/s rd, 54 op/s wr
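Until a fixed release is available, it may be worth confirming that the MDS crash really matches https://tracker.ceph.com/issues/51824 by comparing backtraces. A minimal sketch with standard commands (the crash ID below is only a placeholder taken from the "ceph crash ls" output):

    # running version, to compare against the releases mentioned in the tracker
    ceph versions

    # list recorded crashes, then dump the backtrace of the MDS crash
    ceph crash ls
    ceph crash info <crash-id>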
[ceph-users] Re: Errors when scrub ~mdsdir and lots of num_strays
Hi,

There was a recent (long) thread about this. It might give you some hints:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2NT55RUMD33KLGQCDZ74WINPPQ6WN6CW/

And about the crash, it could be related to https://tracker.ceph.com/issues/51824

Cheers, dan

On Tue, Mar 1, 2022 at 11:30 AM Arnaud M wrote:
>
> Hello Dan
>
> Thanks a lot for the answer.
>
> I do remove the snaps every day (I keep them for one month), but the "num_strays" never seems to reduce.
>
> I know I can do a listing of the folder with "find . -ls".
>
> So my question is: is there a way to find the directory causing the strays so I can "find . -ls" them? I would prefer not to do it on my whole cluster, as it will take time (several days, and more if I need to do it also on every snap) and will certainly overload the MDS.
>
> Please let me know if there is a way to spot the source of the strays, so I can find the folder/snap with the most strays.
>
> And what about the scrub of ~mdsdir, which crashes every time with the error:
>
> {
>     "damage_type": "dir_frag",
>     "id": 3776355973,
>     "ino": 1099567262916,
>     "frag": "*",
>     "path": "~mds0/stray3/1000350ecc4"
> },
>
> Again, thanks for your help, that is really appreciated.
>
> All the best
>
> Arnaud
>
> On Tue, Mar 1, 2022 at 11:02 AM, Dan van der Ster wrote:
> >
> > Hi,
> >
> > Stray files are created when you have hardlinks to deleted files, or snapshots of deleted files.
> > You need to delete the snapshots, or "reintegrate" the hardlinks by recursively listing the relevant files.
> >
> > BTW, in Pacific there isn't a big problem with accumulating lots of stray files. (Before Pacific there was a default limit of 1M strays, but that is now removed.)
> >
> > Cheers, dan
> >
> > On Tue, Mar 1, 2022 at 1:04 AM Arnaud M wrote:
> > >
> > > Hello to everyone
> > >
> > > Our ceph cluster is healthy and everything seems to go well, but we have a lot of num_strays:
> > >
> > > ceph tell mds.0 perf dump | grep stray
> > >     "num_strays": 1990574,
> > >     "num_strays_delayed": 0,
> > >     "num_strays_enqueuing": 0,
> > >     "strays_created": 3,
> > >     "strays_enqueued": 17,
> > >     "strays_reintegrated": 0,
> > >     "strays_migrated": 0,
> > >
> > > And num_strays doesn't seem to reduce whatever we do (scrub / or scrub ~mdsdir).
> > > And when we scrub ~mdsdir (force,recursive,repair) we get these errors:
> > >
> > > {
> > >     "damage_type": "dir_frag",
> > >     "id": 3775653237,
> > >     "ino": 1099569233128,
> > >     "frag": "*",
> > >     "path": "~mds0/stray3/100036efce8"
> > > },
> > > {
> > >     "damage_type": "dir_frag",
> > >     "id": 3776355973,
> > >     "ino": 1099567262916,
> > >     "frag": "*",
> > >     "path": "~mds0/stray3/1000350ecc4"
> > > },
> > > {
> > >     "damage_type": "dir_frag",
> > >     "id": 3776485071,
> > >     "ino": 1099559071399,
> > >     "frag": "*",
> > >     "path": "~mds0/stray4/10002d3eea7"
> > > },
> > >
> > > There are a lot of these entries. And just before the end of the ~mdsdir scrub the MDS crashes, and I have to do a "ceph mds repaired 0" to get the filesystem back online. Do you have any idea what those errors are and how I should handle them?
> > >
> > > We have a lot of data in our CephFS cluster (350 TB+) and we take a snapshot of / every day and keep them for 1 month (rolling).
> > >
> > > Here is our cluster state:
> > >
> > > ceph -s
> > >   cluster:
> > >     id:     817b5736-84ae-11eb-bf7b-c9513f2d60a9
> > >     health: HEALTH_WARN
> > >             78 pgs not deep-scrubbed in time
> > >             70 pgs not scrubbed in time
> > >
> > >   services:
> > >     mon: 3 daemons, quorum ceph-r-112-1,ceph-g-112-3,ceph-g-112-2 (age 10d)
> > >     mgr: ceph-g-112-2.ghcodb(active, since 4d), standbys: ceph-g-112-1.ksojnh
> > >     mds: 1/1 daemons up, 1 standby
> > >     osd: 67 osds: 67 up (since 14m), 67 in (since 7d)
> > >
> > >   data:
> > >     volumes: 1/1 healthy
> > >     pools:   5 pools, 609 pgs
> > >     objects: 186.86M objects, 231 TiB
> > >     usage:   351 TiB used, 465 TiB / 816 TiB avail
> > >     pgs:     502 active+clean
> > >              82  active+clean+snaptrim_wait
> > >              20  active+clean+snaptrim
> > >              4   active+clean+scrubbing+deep
> > >              1   active+clean+scrubbing+deep+snaptrim_wait
> > >
> > >   io:
> > >     client: 8.8 MiB/s rd, 39 MiB/s wr, 25 op/s rd, 54 op/s wr
> > >
> > > My questions are about the damage found by the ~mdsdir scrub: should I worry about it? What does it mean? It seems to be linked to my issue of the high number of strays, is that right? How do I fix it, and how do I reduce num_strays?
> > >
> > > Thanks for all
> > >
> > > All the best
> > >
> > > Arnaud
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to
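On the question of where the strays sit: one way to see how they are spread across the stray directories of rank 0, without walking the whole filesystem, is to count the dentries in the stray dirfrag objects directly. A rough sketch, assuming the metadata pool is named cephfs_metadata and relying on the rank-0 stray directories being inodes 0x600 through 0x609 (so their first fragments are objects 600.00000000 through 609.00000000; fragmented stray dirs will have additional objects not counted here):

    # count omap keys (dentries) in each rank-0 stray directory fragment
    for i in 600 601 602 603 604 605 606 607 608 609; do
        echo -n "$i.00000000: "
        rados -p cephfs_metadata listomapkeys "$i.00000000" | wc -l
    done

This only tells you which stray directory is the fullest, not which original folder or snapshot produced the entries, but it is cheap to run and does not load the MDS.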
[ceph-users] Re: Errors when scrub ~mdsdir and lots of num_strays
Hello Dan

Thanks a lot for the answer.

I do remove the snaps every day (I keep them for one month), but the "num_strays" never seems to reduce.

I know I can do a listing of the folder with "find . -ls".

So my question is: is there a way to find the directory causing the strays so I can "find . -ls" them? I would prefer not to do it on my whole cluster, as it will take time (several days, and more if I need to do it also on every snap) and will certainly overload the MDS.

Please let me know if there is a way to spot the source of the strays, so I can find the folder/snap with the most strays.

And what about the scrub of ~mdsdir, which crashes every time with the error:

{
    "damage_type": "dir_frag",
    "id": 3776355973,
    "ino": 1099567262916,
    "frag": "*",
    "path": "~mds0/stray3/1000350ecc4"
},

Again, thanks for your help, that is really appreciated.

All the best

Arnaud

On Tue, Mar 1, 2022 at 11:02 AM, Dan van der Ster wrote:
> Hi,
>
> Stray files are created when you have hardlinks to deleted files, or snapshots of deleted files.
> You need to delete the snapshots, or "reintegrate" the hardlinks by recursively listing the relevant files.
>
> BTW, in Pacific there isn't a big problem with accumulating lots of stray files. (Before Pacific there was a default limit of 1M strays, but that is now removed.)
>
> Cheers, dan
>
> On Tue, Mar 1, 2022 at 1:04 AM Arnaud M wrote:
> >
> > Hello to everyone
> >
> > Our ceph cluster is healthy and everything seems to go well, but we have a lot of num_strays:
> >
> > ceph tell mds.0 perf dump | grep stray
> >     "num_strays": 1990574,
> >     "num_strays_delayed": 0,
> >     "num_strays_enqueuing": 0,
> >     "strays_created": 3,
> >     "strays_enqueued": 17,
> >     "strays_reintegrated": 0,
> >     "strays_migrated": 0,
> >
> > And num_strays doesn't seem to reduce whatever we do (scrub / or scrub ~mdsdir).
> > And when we scrub ~mdsdir (force,recursive,repair) we get these errors:
> >
> > {
> >     "damage_type": "dir_frag",
> >     "id": 3775653237,
> >     "ino": 1099569233128,
> >     "frag": "*",
> >     "path": "~mds0/stray3/100036efce8"
> > },
> > {
> >     "damage_type": "dir_frag",
> >     "id": 3776355973,
> >     "ino": 1099567262916,
> >     "frag": "*",
> >     "path": "~mds0/stray3/1000350ecc4"
> > },
> > {
> >     "damage_type": "dir_frag",
> >     "id": 3776485071,
> >     "ino": 1099559071399,
> >     "frag": "*",
> >     "path": "~mds0/stray4/10002d3eea7"
> > },
> >
> > There are a lot of these entries. And just before the end of the ~mdsdir scrub the MDS crashes, and I have to do a "ceph mds repaired 0" to get the filesystem back online. Do you have any idea what those errors are and how I should handle them?
> >
> > We have a lot of data in our CephFS cluster (350 TB+) and we take a snapshot of / every day and keep them for 1 month (rolling).
> >
> > Here is our cluster state:
> >
> > ceph -s
> >   cluster:
> >     id:     817b5736-84ae-11eb-bf7b-c9513f2d60a9
> >     health: HEALTH_WARN
> >             78 pgs not deep-scrubbed in time
> >             70 pgs not scrubbed in time
> >
> >   services:
> >     mon: 3 daemons, quorum ceph-r-112-1,ceph-g-112-3,ceph-g-112-2 (age 10d)
> >     mgr: ceph-g-112-2.ghcodb(active, since 4d), standbys: ceph-g-112-1.ksojnh
> >     mds: 1/1 daemons up, 1 standby
> >     osd: 67 osds: 67 up (since 14m), 67 in (since 7d)
> >
> >   data:
> >     volumes: 1/1 healthy
> >     pools:   5 pools, 609 pgs
> >     objects: 186.86M objects, 231 TiB
> >     usage:   351 TiB used, 465 TiB / 816 TiB avail
> >     pgs:     502 active+clean
> >              82  active+clean+snaptrim_wait
> >              20  active+clean+snaptrim
> >              4   active+clean+scrubbing+deep
> >              1   active+clean+scrubbing+deep+snaptrim_wait
> >
> >   io:
> >     client: 8.8 MiB/s rd, 39 MiB/s wr, 25 op/s rd, 54 op/s wr
> >
> > My questions are about the damage found by the ~mdsdir scrub: should I worry about it? What does it mean? It seems to be linked to my issue of the high number of strays, is that right? How do I fix it, and how do I reduce num_strays?
> >
> > Thanks for all
> >
> > All the best
> >
> > Arnaud
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
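To make the "recursively list the relevant files" suggestion concrete, a minimal sketch from a CephFS client mount might look like the following. The mount point and directory are only placeholders for wherever the remaining hardlinks to the deleted files live:

    # walking the tree forces the MDS to look up each dentry, which gives it the
    # chance to reintegrate stray inodes that are still reachable via hardlinks
    find /mnt/cephfs/path/with/hardlinks -ls > /dev/null

    # afterwards, check whether the counters moved
    ceph tell mds.0 perf dump | grep -E 'num_strays|strays_reintegrated'

Strays that exist only because snapshots still reference the deleted files will not go away this way; those entries are purged only once the referencing snapshots have been removed and snaptrim has finished.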
[ceph-users] Re: Errors when scrub ~mdsdir and lots of num_strays
I am using Ceph Pacific (16.2.5).

Does anyone have an idea about my issues?

Thanks again to everyone

All the best

Arnaud

On Tue, Mar 1, 2022 at 1:04 AM, Arnaud M wrote:
> Hello to everyone
>
> Our ceph cluster is healthy and everything seems to go well, but we have a lot of num_strays:
>
> ceph tell mds.0 perf dump | grep stray
>     "num_strays": 1990574,
>     "num_strays_delayed": 0,
>     "num_strays_enqueuing": 0,
>     "strays_created": 3,
>     "strays_enqueued": 17,
>     "strays_reintegrated": 0,
>     "strays_migrated": 0,
>
> And num_strays doesn't seem to reduce whatever we do (scrub / or scrub ~mdsdir).
> And when we scrub ~mdsdir (force,recursive,repair) we get these errors:
>
> {
>     "damage_type": "dir_frag",
>     "id": 3775653237,
>     "ino": 1099569233128,
>     "frag": "*",
>     "path": "~mds0/stray3/100036efce8"
> },
> {
>     "damage_type": "dir_frag",
>     "id": 3776355973,
>     "ino": 1099567262916,
>     "frag": "*",
>     "path": "~mds0/stray3/1000350ecc4"
> },
> {
>     "damage_type": "dir_frag",
>     "id": 3776485071,
>     "ino": 1099559071399,
>     "frag": "*",
>     "path": "~mds0/stray4/10002d3eea7"
> },
>
> There are a lot of these entries. And just before the end of the ~mdsdir scrub the MDS crashes, and I have to do a "ceph mds repaired 0" to get the filesystem back online. Do you have any idea what those errors are and how I should handle them?
>
> We have a lot of data in our CephFS cluster (350 TB+) and we take a snapshot of / every day and keep them for 1 month (rolling).
>
> Here is our cluster state:
>
> ceph -s
>   cluster:
>     id:     817b5736-84ae-11eb-bf7b-c9513f2d60a9
>     health: HEALTH_WARN
>             78 pgs not deep-scrubbed in time
>             70 pgs not scrubbed in time
>
>   services:
>     mon: 3 daemons, quorum ceph-r-112-1,ceph-g-112-3,ceph-g-112-2 (age 10d)
>     mgr: ceph-g-112-2.ghcodb(active, since 4d), standbys: ceph-g-112-1.ksojnh
>     mds: 1/1 daemons up, 1 standby
>     osd: 67 osds: 67 up (since 14m), 67 in (since 7d)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   5 pools, 609 pgs
>     objects: 186.86M objects, 231 TiB
>     usage:   351 TiB used, 465 TiB / 816 TiB avail
>     pgs:     502 active+clean
>              82  active+clean+snaptrim_wait
>              20  active+clean+snaptrim
>              4   active+clean+scrubbing+deep
>              1   active+clean+scrubbing+deep+snaptrim_wait
>
>   io:
>     client: 8.8 MiB/s rd, 39 MiB/s wr, 25 op/s rd, 54 op/s wr
>
> My questions are about the damage found by the ~mdsdir scrub: should I worry about it? What does it mean? It seems to be linked to my issue of the high number of strays, is that right? How do I fix it, and how do I reduce num_strays?
>
> Thanks for all
>
> All the best
>
> Arnaud
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
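For anyone following along, a minimal sketch of the scrub-and-damage workflow described above, assuming a single active MDS at rank 0 (adjust the MDS target and filesystem name to your own cluster):

    # start a recursive repair scrub of the MDS-private directory, as in the report above
    ceph tell mds.0 scrub start '~mdsdir' recursive,repair,force

    # watch scrub progress
    ceph tell mds.0 scrub status

    # list the damage entries recorded by scrub (the dir_frag entries quoted above come from here)
    ceph tell mds.0 damage ls

    # if the rank was marked damaged after the crash, mark it repaired so the MDS can rejoin,
    # as was done in the report
    ceph mds repaired 0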