Hello Roman, I am not sure if I can be of help, but perhaps these commands can help to find the objects in question...
ceph health detail
rados list-inconsistent-pg rbd
rados list-inconsistent-obj 2.10d

I guess it would also be interesting to know whether you use BlueStore or FileStore...

HTH
- Mehmet

On 4 October 2018 14:06:07 CEST, Roman Steinhart <ro...@aternos.org> wrote:
>Hi all,
>
>For some weeks we have had a small problem with one of the PGs on our
>Ceph cluster.
>Every time pg 2.10d is deep scrubbed, it fails because of this:
>2018-08-06 19:36:28.080707 osd.14 osd.14 *.*.*.110:6809/3935 133 :
>cluster [ERR] 2.10d scrub stat mismatch, got 397/398 objects, 0/0
>clones, 397/398 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0
>whiteouts, 2609281919/2609293215 bytes, 0/0 hit_set_archive bytes.
>2018-08-06 19:36:28.080905 osd.14 osd.14 *.*.*.110:6809/3935 134 :
>cluster [ERR] 2.10d scrub 1 errors
>As far as I understand, Ceph is missing an object on osd.14 which
>should be stored on that OSD. A quick "ceph pg repair 2.10d" fixes the
>problem, but as soon as a deep scrub runs on that PG again (manually
>or automatically) the problem is back.
>I tried to find out which object is missing, but a short search led
>me to the conclusion that there is no real way to find out which
>objects are stored in this PG or which object exactly is missing.
>That's why I resorted to some "unconventional" methods.
>I completely removed osd.14 from the cluster, waited until everything
>had rebalanced, and then added the OSD again.
>Unfortunately the problem is still there.
>
>Some weeks later we added a large number of OSDs to our cluster, which
>had a big impact on the CRUSH map.
>Since then PG 2.10d has been running on two other OSDs -> [119,93] (we
>have a replica count of 2).
>Still the same error message, but on another OSD:
>2018-10-03 03:39:22.776521 7f12d9979700 -1 log_channel(cluster) log
>[ERR] : 2.10d scrub stat mismatch, got 728/729 objects, 0/0 clones,
>728/729 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0
>whiteouts, 7281369687/7281381269 bytes, 0/0 hit_set_archive bytes.
>
>As a first step it would be enough for me to find out which object is
>the problematic one. Then I can check whether the object is critical,
>whether any recovery is required, or whether I can simply drop the
>object (which would cover 90% of the cases).
>I hope someone is able to help me get rid of this.
>It's not really a problem for us; Ceph runs without further issues
>despite this message.
>It's just a bit annoying that every time the error occurs our
>monitoring triggers a big alarm because Ceph is in ERROR status. :)
>
>Thanks in advance,
>Roman
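For reference, a rough sketch of how those commands could be run; this assumes the pool is named "rbd" and the PG is 2.10d, as in the thread above. When a scrub has recorded inconsistencies, the JSON output of the last command names the affected objects:

    # show which PGs are currently flagged inconsistent
    ceph health detail

    # list inconsistent PGs in the pool (assuming pool name "rbd")
    rados list-inconsistent-pg rbd

    # list the objects flagged in that PG, as pretty-printed JSON
    rados list-inconsistent-obj 2.10d --format=json-pretty

Note that list-inconsistent-obj reports what the most recent (deep) scrub found, so it should be run after the scrub logs the error and before a repair; a pure stat mismatch may produce an empty object list.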