Thanks, Igor. That's super helpful! I'm trying to temporarily disable the BLUESTORE_SPURIOUS_READ_ERRORS warning while we investigate the underlying cause. In short: I'm trying to keep this cluster from going into HEALTH_WARN every day or so. I'm afraid I'll become desensitized to the warnings and miss something more actionable (like a drive failure).
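If the config-level approach doesn't pan out, I may just mute the check itself. If I'm reading the Pacific docs right (untested on my end, so treat this as a sketch), something like this should silence only this one alert, and only for a week at a time:

    ceph health mute BLUESTORE_SPURIOUS_READ_ERRORS 1w

with "ceph health unmute BLUESTORE_SPURIOUS_READ_ERRORS" to bring it back early if we want it. That would keep the rest of our HEALTH_WARN monitoring meaningful while we dig.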
https://docs.ceph.com/en/latest/rados/operations/health-checks/#bluestore-spurious-read-errors

I found the above page in the documentation and ran "ceph config set osd
bluestore_warn_on_spurious_read_errors false". However, I'm still seeing the
cluster report spurious read errors even after restarting all OSD, MON, and
MGR services.

# ceph config dump | grep spur
osd   advanced   bluestore_warn_on_spurious_read_errors   false

Am I missing a step in how to fully apply this setting?

Thanks!
~Jay

-----Original Message-----
From: Igor Fedotov <ifedo...@suse.de>
Sent: Tuesday, June 22, 2021 7:46 AM
To: Jay Sullivan <jps...@rit.edu>; ceph-users@ceph.io
Subject: Re: [ceph-users] Spurious Read Errors: 0x6706be76

Hi Jay,

this alert was introduced in Pacific indeed. That's probably why you haven't
seen it before. And it definitely implies read retries; the following output
mentions that explicitly:

HEALTH_WARN 1 OSD(s) have spurious read errors
[WRN] BLUESTORE_SPURIOUS_READ_ERRORS: 1 OSD(s) have spurious read errors
    osd.117 reads with retries: 1

"reads with retries" is actually a replica of the "bluestore_reads_with_retries"
perf counter at the corresponding OSD, hence one can monitor it directly with
the "ceph daemon osd.N perf dump" command.

Additionally, one can increase the "debug bluestore" log level to 5 to get
relevant logging output in the OSD log; here is the code line that prints it:

dout(5) << __func__ << " read at 0x" << std::hex << offset << "~" << length
        << " failed " << std::dec << retry_count
        << " times before succeeding" << dendl;

Thanks,
Igor

On 6/22/2021 2:10 AM, Jay Sullivan wrote:
> In the week since upgrading one of our clusters from Nautilus 14.2.21 to
> Pacific 16.2.4, I've seen four spurious read errors, always with the same
> bad checksum of 0x6706be76. I've never seen this in any of our clusters
> before. Here's an example of what I'm seeing in the logs:
>
> ceph-osd.132.log:2021-06-20T22:53:20.584-0400 7fde2e4fc700 -1
> bluestore(/var/lib/ceph/osd/ceph-132) _verify_csum bad crc32c/0x1000 checksum
> at blob offset 0x0, got 0x6706be76, expected 0xee74a56a, device location
> [0x18c81b40000~1000], logical extent 0x200000~1000, object
> #29:2d8210bf:::rbd_data.94f4232ae8944a.0000000000026c57:head#
>
> I'm not seeing any indication of inconsistent PGs, only the spurious read
> error, and I don't see an explicit indication of a retry in the logs
> following the message above. BlueStore code to retry three times was
> introduced in 2018 following a similar issue with the same checksum:
> https://tracker.ceph.com/issues/22464
>
> Here's an example of what my health detail looks like:
>
> HEALTH_WARN 1 OSD(s) have spurious read errors
> [WRN] BLUESTORE_SPURIOUS_READ_ERRORS: 1 OSD(s) have spurious read errors
>     osd.117 reads with retries: 1
>
> I followed this (unresolved) thread, too:
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/DRBVFQLZ5ZYMNPKLAWS5AR4Z2MJQYLLC/
>
> I do have swap enabled, but I don't think memory pressure is an issue with
> 30GB available out of 96GB (and no sign I've been close to summoning the
> OOM killer). The OSDs that have thrown the cluster into HEALTH_WARN with the
> spurious read errors are busy 12TB rotational HDDs, and I _think_ it's only
> happening during a deep scrub. We're on Ubuntu 18.04; uname:
> 5.4.0-74-generic #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021
> x86_64 x86_64 x86_64 GNU/Linux.
>
> Does Pacific retry three times on a spurious read error? Would I see an
> indication of a retry in the logs?
>
> Thanks!
>
> ~Jay
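P.S. Following Igor's pointer to the perf counter, here's the quick loop I'm
planning to use to eyeball retries on our suspect OSDs. The OSD IDs are just
our examples, "ceph daemon" has to run on the host that holds each OSD's admin
socket, and I'm grepping rather than assuming the exact JSON path of the
counter in the perf dump output:

    # run on the OSD host; shows the reads-with-retries perf counter
    for id in 117 132; do
        echo "osd.$id:"
        ceph daemon osd.$id perf dump | grep reads_with_retries
    done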