FWIW, I saw a similar problem on a cluster ~1 year ago and noticed that the PG affected by the "stat mismatch" was the very last PG of the pool (4.1fff in my case, with pg_num = 8192). I recall thinking that it looked more like a bug than a hardware issue. PG ids are hexadecimal, so with pg_num = 1024 the last PG of your pool 1 would be 1.3ff - assuming your pool does have 1024 PGs, you may be hitting the same issue.
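If you want to double-check, something like this should show the pg_num (assuming pool 1 is the pool in question, as the 1.3ff PG id suggests):

   ceph osd pool ls detail | grep "^pool 1 "

It might also be worth dumping what the scrub actually flagged, e.g.:

   rados list-inconsistent-obj 1.3ff --format=json-pretty

though for a pure stat mismatch I'd expect that to come back empty, since the mismatch is in the PG's aggregate stats rather than in any particular object.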
It happened 2 or 3 times and then went away, possibly thanks to software updates (currently on 14.2.21).

Eric

> On 11 Oct 2021, at 18:44, Simon Ironside <sirons...@caffetine.org> wrote:
>
> Bump for any pointers here?
>
> tl;dr - I've got a single PG that keeps going inconsistent (stat mismatch). It always repairs OK but now comes back every day when it's scrubbed.
>
> If there are no suggestions I'll try upgrading to 14.2.22 and then reweighting the other OSDs that serve this PG to 0 (I've already done the primary) to try to force its recreation.
>
> Thanks,
> Simon.
>
> On 22/09/2021 18:50, Simon Ironside wrote:
>> Hi All,
>>
>> I have a recurring single PG that keeps going inconsistent. A scrub is enough to pick up the problem. The primary OSD log shows something like:
>>
>> 2021-09-22 18:08:18.502 7f5bdcb11700  0 log_channel(cluster) log [DBG] : 1.3ff scrub starts
>> 2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
>> 2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub 1 errors
>>
>> It always repairs OK when I run "ceph pg repair 1.3ff":
>>
>> 2021-09-22 18:08:47.533 7f5bdcb11700  0 log_channel(cluster) log [DBG] : 1.3ff repair starts
>> 2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
>> 2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair 1 errors, 1 fixed
>>
>> It's happened multiple times, always with the same PG; no other PG is doing this. It's a Nautilus v14.2.5 cluster using spinning disks with separate DB/WAL on SSDs. I don't believe there's an underlying hardware problem but, in a bid to make sure, I reweighted the primary OSD for this PG to 0 to move the PG to another disk. The backfilling is complete, but on manually scrubbing the PG again it showed inconsistent as above.
>>
>> In case it's relevant, the only major activity I've performed recently has been gradually adding new OSD nodes and disks to the cluster; prior to this it had been up without issue for well over a year. The primary OSD for this PG, when this issue first presented, was the first new OSD I'd added. The inconsistent PG issue didn't start happening immediately after adding it, though; it was some weeks later.
>>
>> Any suggestions as to how I can get rid of this problem? Should I try reweighting the other two OSDs for this PG to 0? Or is this a known bug that requires some specific work, or just an upgrade?
>>
>> Thanks,
>> Simon.
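P.S. If you do end up reweighting the other two OSDs, for reference that would just be (the osd id here is a placeholder):

   ceph osd reweight <osd-id> 0

and once backfill has finished you can re-test with a manual deep scrub:

   ceph pg deep-scrub 1.3ff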