Thanks, Igor. That's super helpful!

I'm trying to temporarily disable the BLUESTORE_SPURIOUS_READ_ERRORS warning 
while we investigate the underlying cause. In short: I'm trying to keep this 
cluster from going into HEALTH_WARN every day or so. I'm afraid I'll become 
desensitized to the warnings and miss something more actionable (like a drive 
failure).

https://docs.ceph.com/en/latest/rados/operations/health-checks/#bluestore-spurious-read-errors

I found the above page in the documentation and ran "ceph config set osd 
bluestore_warn_on_spurious_read_errors false". However, the cluster is still 
reporting spurious read errors even after restarting all OSD, MON, and MGR 
services.

# ceph config dump | grep spur
  osd    advanced    bluestore_warn_on_spurious_read_errors    false
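For reference, here's roughly how I'm planning to double-check that the running 
daemons actually picked up the value (osd.117 is just an example id taken from 
the health output; this is sketched from the ceph config CLI, so please correct 
me if I'm misusing it):

```shell
# Ask one running OSD what value it actually has for the option
# (substitute a real OSD id from your cluster):
ceph config show osd.117 bluestore_warn_on_spurious_read_errors

# Check for a local override (e.g. in ceph.conf) that might be
# shadowing the centralized setting:
ceph config show-with-diff osd.117 | grep -i spurious
```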

Am I missing a step in how to fully apply this setting?

Thanks!

~Jay


-----Original Message-----
From: Igor Fedotov <ifedo...@suse.de> 
Sent: Tuesday, June 22, 2021 7:46 AM
To: Jay Sullivan <jps...@rit.edu>; ceph-users@ceph.io
Subject: Re: [ceph-users] Spurious Read Errors: 0x6706be76

Hi Jay,

this alert was indeed introduced in Pacific. That's probably why you haven't 
seen it before.

And it definitely implies read retries; the following output mentions that 
explicitly:

HEALTH_WARN 1 OSD(s) have spurious read errors
[WRN] BLUESTORE_SPURIOUS_READ_ERRORS: 1 OSD(s) have spurious read errors
      osd.117  reads with retries: 1

"reads with retries" is actually a replica of the "bluestore_reads_with_retries" 
perf counter on the corresponding OSD, so you can monitor it directly with the 
"ceph daemon osd.N perf dump" command.
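For example, something along these lines should pull out just that counter 
(assuming jq is installed on the host; the exact JSON path is from memory, so 
double-check it against your own perf dump output):

```shell
# On the OSD host, via the admin socket:
ceph daemon osd.117 perf dump | jq '.bluestore.bluestore_reads_with_retries'

# Or remotely, from an admin node:
ceph tell osd.117 perf dump | jq '.bluestore.bluestore_reads_with_retries'
```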


Additionally, you can increase the "debug bluestore" log level to 5 to get the 
relevant logging output in the OSD log; here is the code line that prints it:

     dout(5) << __func__ << " read at 0x" << std::hex << offset << "~" << length
             << " failed " << std::dec << retry_count << " times before succeeding"
             << dendl;
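If you want to try that, the debug level can be raised at runtime for a single 
OSD, e.g. (the OSD id and values here are just a suggestion):

```shell
# Bump bluestore debug logging to 5 on one OSD:
ceph tell osd.117 config set debug_bluestore 5/5

# Remember to restore the default (1/5) afterwards, since level 5
# can be chatty:
ceph tell osd.117 config set debug_bluestore 1/5
```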


Thanks,

Igor

On 6/22/2021 2:10 AM, Jay Sullivan wrote:
> In the week since upgrading one of our clusters from Nautilus 14.2.21 to 
> Pacific 16.2.4 I've seen four spurious read errors that always have the same 
> bad checksum of 0x6706be76. I've never seen this in any of our clusters 
> before. Here's an example of what I'm seeing in the logs:
>
> ceph-osd.132.log:2021-06-20T22:53:20.584-0400 7fde2e4fc700 -1 
> bluestore(/var/lib/ceph/osd/ceph-132) _verify_csum bad crc32c/0x1000 checksum 
> at blob offset 0x0, got 0x6706be76, expected 0xee74a56a, device location 
> [0x18c81b40000~1000], logical extent 0x200000~1000, object 
> #29:2d8210bf:::rbd_data.94f4232ae8944a.0000000000026c57:head#
>
> I'm not seeing any indication of inconsistent PGs, only the spurious read 
> error. I don't see an explicit indication of a retry in the logs following 
> the above message. Bluestore code to retry three times was introduced in 2018 
> following a similar issue with the same checksum: 
> https://tracker.ceph.com/issues/22464
>
> Here's an example of what my health detail looks like:
>
> HEALTH_WARN 1 OSD(s) have spurious read errors
> [WRN] BLUESTORE_SPURIOUS_READ_ERRORS: 1 OSD(s) have spurious read errors
>       osd.117  reads with retries: 1
>
> I followed this (unresolved) thread, too: 
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/DRBVFQLZ5ZYMNPKLAWS5AR4Z2MJQYLLC/
>
> I do have swap enabled, but I don't think memory pressure is an issue with 
> 30GB available out of 96GB (and no sign I've been close to summoning the 
> OOMkiller). The OSDs that have thrown the cluster into HEALTH_WARN with the 
> spurious read errors are busy 12TB rotational HDDs and I _think_ it's only 
> happening during a deep scrub. We're on Ubuntu 18.04; uname: 5.4.0-74-generic 
> #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021 x86_64 x86_64 x86_64 
> GNU/Linux.
>
> Does Pacific retry three times on a spurious read error? Would I see an 
> indication of a retry in the logs?
>
> Thanks!
>
> ~Jay
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
