[ceph-users] Re: One PG keeps going inconsistent (stat mismatch)

2021-10-11 Thread Simon Ironside

Bump for any pointers here?

tl;dr - I've got a single PG that keeps going inconsistent (stat mismatch). It always repairs OK, but the inconsistency comes back every day now when the PG is scrubbed.


If there are no suggestions I'll try upgrading to 14.2.22 and then reweighting the other OSDs that serve this PG (I've already done the primary) to 0 to try to force its recreation.
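
For reference, this is roughly what I have in mind (the OSD ID below is a placeholder, and I'm assuming crush reweight; the acting set for the PG can be read from ceph pg map):

  ceph pg map 1.3ff                    # show the up/acting OSDs for this PG
  ceph osd crush reweight osd.<id> 0   # drain one of the remaining OSDs serving it
  ceph pg 1.3ff query                  # watch the backfill/recovery state afterwards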


Thanks,
Simon.

On 22/09/2021 18:50, Simon Ironside wrote:

Hi All,

I have a single PG that keeps going inconsistent. A scrub is enough to pick up the problem. The primary OSD log shows something like:


2021-09-22 18:08:18.502 7f5bdcb11700  0 log_channel(cluster) log [DBG] : 
1.3ff scrub starts
2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 
1.3ff scrub : stat mismatch, got 3243/3244 objects, 67/67 clones, 
3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 
whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 
hit_set_archive bytes.
2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 
1.3ff scrub 1 errors
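
In case it helps, this is roughly how I'm checking it between repairs (as I understand it, a pure stat mismatch is a PG-level stats discrepancy rather than a damaged object, so the per-object listing may well come back empty):

  ceph health detail                                       # flags the PG as inconsistent after the scrub
  rados list-inconsistent-obj 1.3ff --format=json-pretty   # per-object scrub errors, if any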


It always repairs ok when I run ceph pg repair 1.3ff:

2021-09-22 18:08:47.533 7f5bdcb11700  0 log_channel(cluster) log [DBG] : 
1.3ff repair starts
2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 
1.3ff repair : stat mismatch, got 3243/3244 objects, 67/67 clones, 
3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 
whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 
hit_set_archive bytes.
2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 
1.3ff repair 1 errors, 1 fixed
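
After the repair the error clears until the next scrub; I'm verifying with something like:

  ceph pg repair 1.3ff      # the repair itself, as above
  ceph pg ls inconsistent   # should list nothing once the repair completes
  ceph health detail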


It's happened multiple times and always with the same PG; no other PG is doing this. It's a Nautilus v14.2.5 cluster using spinning disks with separate DB/WAL on SSDs. I don't believe there's an underlying hardware problem, but to rule that out I reweighted the primary OSD for this PG to 0 to get it to move to another disk. The backfill has completed, but on manually scrubbing the PG again it showed inconsistent as above.
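
For completeness, the manual scrub is just forced on this one PG rather than waiting for the schedule, along the lines of:

  ceph pg deep-scrub 1.3ff   # deep scrub of just this PG
  ceph pg scrub 1.3ff        # or a regular (shallow) scrub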


In case it's relevant, the only major activity I've performed recently has been gradually adding new OSD nodes and disks to the cluster; prior to this it had been up without issue for well over a year. The primary OSD for this PG was on the first new OSD I added when this issue first presented. The inconsistency didn't start happening immediately after adding it, though; it was some weeks later.


Any suggestions as to how I can get rid of this problem?
Should I try reweighting the other two OSDs for this PG to 0?
Or is this a known bug that requires some specific work or just an upgrade?

Thanks,
Simon.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: One PG keeps going inconsistent (stat mismatch)

2021-10-11 Thread Eric Petit
FWIW, I saw a similar problem on a cluster ~1 ago and noticed that the PG affected by "stat mismatch" was the very last PG of the pool (4.1fff in my case, with pg_num = 8192). I recall thinking that it looked more like a bug than a hardware issue and, assuming your pool has 1024 PGs, you may be hitting the same issue.
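
A quick way to double-check: 1.3ff is PG 0x3ff = 1023, so it would be the last PG if the pool really does have pg_num = 1024, e.g.:

  ceph osd pool ls detail | grep '^pool 1 '   # shows pg_num for pool 1
  printf '%d\n' 0x3ff                         # 1023, i.e. the last PG of a 1024-PG pool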

It happened 2 or 3 times and then went away, possibly thanks to software 
updates (currently on 14.2.21).

Eric


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io