To use the objectstore tool within the container you don’t have to specify the cluster’s FSID, because the OSD directory is already mapped into the container. Running the objectstore tool may have changed the ownership of that directory; change it back to its previous state. The other OSDs will show you which uid/user and gid/group that should be.
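For example, something like this should restore the expected ownership (just a sketch; 167:167 is the ceph uid/gid reported in your OSD log below, and <fsid> is a placeholder for your cluster's FSID in the host path):

chown -R 167:167 /var/lib/ceph/<fsid>/osd.1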

Zitat von "GLE, Vivien" <vivien....@inist.fr>:

I'm sorry for the confusion !

I paste the wrong output.


ceph-objectstore-tool --data-path /var/lib/ceph/Id/osd.1 --op list --pgid 11.4 --no-mon-config

OSD.1 log

2025-07-31T12:06:56.273+0000 7a9c2bf47680  0 set uid:gid to 167:167 (ceph:ceph)
2025-07-31T12:06:56.273+0000 7a9c2bf47680  0 ceph version 19.2.2 (0eceb0defba60152a8182f7bd87d164b639885b8) squid (stable), process ceph-osd, pid 7
2025-07-31T12:06:56.273+0000 7a9c2bf47680  0 pidfile_write: ignore empty --pid-file
2025-07-31T12:06:56.274+0000 7a9c2bf47680  1 bdev(0x57bd64210e00 /var/lib/ceph/osd/ceph-1/block) open path /var/lib/ceph/osd/ceph-1/block
2025-07-31T12:06:56.274+0000 7a9c2bf47680 -1 bdev(0x57bd64210e00 /var/lib/ceph/osd/ceph-1/block) open open got: (13) Permission denied
2025-07-31T12:06:56.274+0000 7a9c2bf47680 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-1: (2) No such file or directory

----------------------

I retried on OSD.2 with PG 2.1, to see whether disabling OSD.2 (instead of just stopping it) before the objectstore-tool operation would change anything, but the same error occurred.



________________________________
From: Eugen Block <ebl...@nde.ag>
Sent: Thursday, 31 July 2025 13:27:51
To: GLE, Vivien
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Pgs troubleshooting

Why did you look at OSD.2? According to the query output you provided,
I would have looked at OSD.1 (the acting set). And you pasted the output
of PG 11.4, but now you’re trying to list PG 2.1, which is quite confusing.


Quoting "GLE, Vivien" <vivien....@inist.fr>:

I don't get why it is searching in this path, because there is nothing
there. This is the command I used to check BlueStore:


ceph-objectstore-tool --data-path /var/lib/ceph/"ID"/osd.2 --op list
--pgid 2.1 --no-mon-config

________________________________
From: GLE, Vivien
Sent: Thursday, 31 July 2025 09:38:25
To: Eugen Block
Cc: ceph-users@ceph.io
Subject: RE: [ceph-users] Re: Pgs troubleshooting


Hi,


Or could reducing min_size to 1 help here (Thanks, Anthony)? I’m not
entirely sure and am on vacation. 😅 It could be worth a try. But don’t
forget to reset min_size back to 2 afterwards.


I did it, but nothing really changed. How long should I wait to
see if it does something?


No, you use the ceph-objectstore-tool to export the PG from the intact
OSD (you need to stop it first and set the noout flag); make sure you have
enough disk space.


I stopped my OSD and set noout to check whether my PG is stored in
BlueStore (it is not), but when I tried to restart the OSD, the OSD
superblock was gone:


2025-07-31T08:33:14.696+0000 7f0c7c889680  1 bdev(0x60945520ae00
/var/lib/ceph/osd/ceph-2/block) open path
/var/lib/ceph/osd/ceph-2/block
2025-07-31T08:33:14.697+0000 7f0c7c889680 -1 bdev(0x60945520ae00
/var/lib/ceph/osd/ceph-2/block) open open got: (13) Permission denied
2025-07-31T08:33:14.697+0000 7f0c7c889680 -1  ** ERROR: unable to
open OSD superblock on /var/lib/ceph/osd/ceph-2: (2) No such file or
directory

Did I miss something?

Thanks
Vivien




________________________________
From: Eugen Block <ebl...@nde.ag>
Sent: Wednesday, 30 July 2025 16:56:50
To: GLE, Vivien
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Pgs troubleshooting

Or could reducing min_size to 1 help here (Thanks, Anthony)? I’m not
entirely sure and am on vacation. 😅 It could be worth a try. But don’t
forget to reset min_size back to 2 afterwards.
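For instance (just a sketch; <pool> is a placeholder for the pool that
PG 11.4 belongs to):

ceph osd pool set <pool> min_size 1
# and once the PG has recovered:
ceph osd pool set <pool> min_size 2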

Quoting "GLE, Vivien" <vivien....@inist.fr>:

Hi,


did the two replaced OSDs fail at the same time (before they were
completely drained)? This would most likely mean that both those
failed OSDs contained the other two replicas of this PG


Unfortunately yes


This would most likely mean that both those
failed OSDs contained the other two replicas of this PG. A pg query
should show which OSDs are missing.


If I understand correctly, I need to move my PG onto OSD 1?


ceph -w


 osd.1 [ERR] 11.4 has 2 objects unfound and apparently lost


ceph pg query 11.4



     "up": [
                    1,
                    4,
                    5
                ],
                "acting": [
                    1,
                    4,
                    5
                ],
                "avail_no_missing": [],
                "object_location_counts": [
                    {
                        "shards": "3,4,5",
                        "objects": 2
                    }
                ],
                "blocked_by": [],
                "up_primary": 1,
                "acting_primary": 1,
                "purged_snaps": []
            },



Thanks


Vivien

________________________________
From: Eugen Block <ebl...@nde.ag>
Sent: Tuesday, 29 July 2025 16:48:41
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Pgs troubleshooting

Hi,

did the two replaced OSDs fail at the same time (before they were
completely drained)? This would most likely mean that both those
failed OSDs contained the other two replicas of this PG. A pg query
should show which OSDs are missing.
You could try with objectstore-tool to export the PG from the
remaining OSD and import it on different OSDs. Or you mark the data as
lost if you don't care about the data and want a healthy state quickly.
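Roughly like this (a sketch only; the OSD ids and the export file path are
placeholders, and the source OSD has to be stopped with the noout flag set
before running the export):

ceph osd set noout
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<src> --pgid 11.4 --op export --file /tmp/pg11.4.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<dst> --op import --file /tmp/pg11.4.export

Or, to give up on the unfound objects instead:

ceph pg 11.4 mark_unfound_lost revert

(revert rolls the objects back to a previous version where possible,
delete forgets them entirely).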

Regards,
Eugen

Quoting "GLE, Vivien" <vivien....@inist.fr>:

Thanks for your help! This is my new pg stat, with no more peering
PGs (after restarting some OSDs):

ceph pg stat ->

498 pgs: 1 active+recovery_unfound+degraded, 3
recovery_unfound+undersized+degraded+remapped+peered, 14
active+clean+scrubbing+deep, 480 active+clean;

36 GiB data, 169 GiB used, 6.2 TiB / 6.4 TiB avail; 8.8 KiB/s rd, 0
B/s wr, 12 op/s; 715/41838 objects degraded (1.709%); 5/13946
objects unfound (0.036%)

ceph pg ls recovery_unfound -> shows that the PGs are replica 3; I tried
to repair them but nothing happened


ceph -w ->

osd.1 [ERR] 11.4 has 2 objects unfound and apparently lost



________________________________
From: Frédéric Nass <frederic.n...@clyso.com>
Sent: Tuesday, 29 July 2025 14:03:37
To: GLE, Vivien
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Pgs troubleshooting

Hi Vivien,

Unless you ran the 'ceph pg stat' command while peering was occurring,
the 37 peering PGs might indicate a temporary peering issue with one or
more OSDs. If that's the case, then restarting the associated OSDs could
help with the peering. You could list those PGs and the associated OSDs
with 'ceph pg ls peering' and trigger peering by either restarting one
common OSD or by using 'ceph pg repeer <pg_id>'.

Regarding the unfound object and its associated backfill_unfound PG,
you could identify this PG with 'ceph pg ls backfill_unfound' and
investigate this PG with 'ceph pg <pg_id> query'. Depending on the
output, you could try running a 'ceph pg repair <pg_id>'. Could you
confirm that this PG is not part of a size=2 pool?
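For example (the pool name is a placeholder):

ceph osd pool ls detail
# or, for a specific pool:
ceph osd pool get <pool> size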

Best regards,
Frédéric.

--
Frédéric Nass
Ceph Ambassador France | Senior Ceph Engineer @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | frederic.n...@clyso.com


On Tue, 29 Jul 2025 at 14:19, GLE, Vivien
<vivien....@inist.fr> wrote:
Hi,

After replacing 2 OSDs (data corruption), these are the stats of my
testing Ceph cluster:

ceph pg stat

498 pgs: 37 peering, 1 active+remapped+backfilling, 1
active+clean+remapped, 1 active+recovery_wait+undersized+remapped, 1
backfill_unfound+undersized+degraded+remapped+peered, 1
remapped+peering, 12 active+clean+scrubbing+deep, 1
active+undersized, 442 active+clean, 1
active+recovering+undersized+remapped

34 GiB data, 175 GiB used, 6.2 TiB / 6.4 TiB avail; 1.7 KiB/s rd, 1
op/s; 31/39768 objects degraded (0.078%); 6/39768 objects misplaced
(0.015%); 1/13256 objects unfound (0.008%)

ceph osd stat
7 osds: 7 up (since 20h), 7 in (since 20h); epoch: e427538; 4 remapped pgs

Does anyone have an idea of where to start to get a healthy cluster?

Thanks !

Vivien



_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
