FWIW, I saw a similar problem on a cluster ~1 year ago and noticed that the PG affected by the "stat mismatch" was the very last PG of the pool (4.1fff in my case, with pg_num = 8192). I recall thinking that it looked more like a bug than a hardware issue. PG ids are hexadecimal, so with pg_num = 1024 the last PG of your pool 1 would be 1.3ff - assuming your pool does have 1024 PGs, you may be hitting the same issue.
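If you want to double-check, something like this should show the pg_num (assuming pool 1 is the pool in question, as the 1.3ff PG id suggests):

   ceph osd pool ls detail | grep "^pool 1 "

It might also be worth dumping what the scrub actually flagged, e.g.:

   rados list-inconsistent-obj 1.3ff --format=json-pretty

though for a pure stat mismatch I'd expect that to come back empty, since the mismatch is in the PG's aggregate stats rather than in any particular object.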
It happened 2 or 3 times and then went away, possibly thanks to software updates (currently on 14.2.21).

Eric

> On 11 Oct 2021, at 18:44, Simon Ironside <sirons...@caffetine.org> wrote:
>
> Bump for any pointers here?
>
> tl;dr - I've got a single PG that keeps going inconsistent (stat mismatch). It always repairs OK but now comes back every day when it's scrubbed.
>
> If there are no suggestions I'll try upgrading to 14.2.22 and then reweighting the other OSDs that serve this PG to 0 (I've already done the primary) to try to force its recreation.
>
> Thanks,
> Simon.
>
> On 22/09/2021 18:50, Simon Ironside wrote:
>> Hi All,
>>
>> I have a recurring single PG that keeps going inconsistent. A scrub is enough to pick up the problem. The primary OSD log shows something like:
>>
>> 2021-09-22 18:08:18.502 7f5bdcb11700  0 log_channel(cluster) log [DBG] : 1.3ff scrub starts
>> 2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
>> 2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff scrub 1 errors
>>
>> It always repairs OK when I run "ceph pg repair 1.3ff":
>>
>> 2021-09-22 18:08:47.533 7f5bdcb11700  0 log_channel(cluster) log [DBG] : 1.3ff repair starts
>> 2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair : stat mismatch, got 3243/3244 objects, 67/67 clones, 3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
>> 2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 1.3ff repair 1 errors, 1 fixed
>>
>> It's happened multiple times, always with the same PG; no other PG is doing this. It's a Nautilus v14.2.5 cluster using spinning disks with separate DB/WAL on SSDs. I don't believe there's an underlying hardware problem but, in a bid to make sure, I reweighted the primary OSD for this PG to 0 to move the PG to another disk. The backfilling is complete, but on manually scrubbing the PG again it showed inconsistent as above.
>>
>> In case it's relevant, the only major activity I've performed recently has been gradually adding new OSD nodes and disks to the cluster; prior to this it had been up without issue for well over a year. The primary OSD for this PG, when this issue first presented, was the first new OSD I'd added. The inconsistent PG issue didn't start happening immediately after adding it, though; it was some weeks later.
>>
>> Any suggestions as to how I can get rid of this problem? Should I try reweighting the other two OSDs for this PG to 0? Or is this a known bug that requires some specific work, or just an upgrade?
>>
>> Thanks,
>> Simon.
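P.S. If you do end up reweighting the other two OSDs, for reference that would just be (the osd id here is a placeholder):

   ceph osd reweight <osd-id> 0

and once backfill has finished you can re-test with a manual deep scrub:

   ceph pg deep-scrub 1.3ff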