Hi Frank,

Glad to hear the testing went well and the Kingston SSDs behaved! Fingers 
crossed your issue was just a corner case...

Cheers,

A.

Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

From: Frank Schilder<mailto:fr...@dtu.dk>
Sent: 13 May 2021 10:15
To: Andrew Walker-Brown<mailto:andrew_jbr...@hotmail.com>; 
ceph-users@ceph.io<mailto:ceph-users@ceph.io>
Subject: Re: OSD lost: firmware bug in Kingston SSDs?

Hi Andrew,

I did a few power-out tests by pulling the power cord of a server several 
times. This server contains a mix of disks, including the Kingston SSDs (also 
the one that failed before). Every time, all OSDs recovered, and a subsequently 
initiated deep scrub did not find any silent corruption either. The test was 
done under production load.
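
In case it is useful to others, the test boiled down to something like the 
following (a sketch only; setting noout beforehand is just my usual practice, 
and osd.581 is an example id):

# avoid rebalancing while the node is down
ceph osd set noout
# pull the power cord, reconnect, let the node boot and all OSDs rejoin,
# then force a deep scrub on the affected OSDs and check for inconsistencies
ceph osd deep-scrub osd.581
ceph health detail
ceph osd unset noout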

It looks like the OSD crash I observed was caused by unusual and hopefully rare 
circumstances.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <fr...@dtu.dk>
Sent: 06 May 2021 15:27:14
To: Andrew Walker-Brown; ceph-users@ceph.io
Subject: [ceph-users] Re: OSD lost: firmware bug in Kingston SSDs?

Hi Andrew,

thanks, that is reassuring. To be sure, I plan to do a few power-out tests with 
this server. I've never had any issues with that so far; it's the first time I 
have seen a corrupted OSD.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Andrew Walker-Brown <andrew_jbr...@hotmail.com>
Sent: 06 May 2021 15:23:30
To: Frank Schilder; ceph-users@ceph.io
Subject: RE: OSD lost: firmware bug in Kingston SSDs?

Hi Frank,

I’m running the same SSDs (approx. 20) in Dell servers on HBA330s. I haven’t 
had any issues and have suffered at least one power outage. I just checked the 
wcache setting and it shows as enabled.

Running Octopus 15.1.9 in Docker containers. Originally part of a Proxmox 
cluster, but now standalone Ceph.

Cheers,

A

Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

From: Frank Schilder<mailto:fr...@dtu.dk>
Sent: 06 May 2021 10:11
To: ceph-users@ceph.io<mailto:ceph-users@ceph.io>
Subject: [ceph-users] OSD lost: firmware bug in Kingston SSDs?

Hi all,

I lost 2 OSDs deployed on a single Kingston SSD in a rather strange way and am 
wondering if anyone has made similar observations or is aware of a firmware bug 
with these disks.

Disk model: KINGSTON SEDC500M3840G (it ought to be a DC-grade model with 
supercapacitors).
Smartctl does not report any drive errors.
Performance per TB is as expected; the OSDs are "ceph-volume lvm batch" 
bluestore-deployed, everything collocated.
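
For completeness, the deployment was done with something along these lines (a 
sketch only; the device path is a placeholder, and we put 2 OSDs on each of 
these 3.84 TB disks):

ceph-volume lvm batch --bluestore --osds-per-device 2 /dev/sdX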

Short version: I disable the volatile write cache on all OSD disks, but the 
Kingston disks seem to behave as if this cache is *not* disabled. Smartctl and 
hdparm report wcache=off, though. The OSD loss looks like the result of an 
unflushed write cache during a power loss. I'm afraid now that our cluster 
might be vulnerable to power loss.
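
For reference, this is roughly how the cache is disabled and verified on these 
(SATA) drives; a sketch only, the device path is a placeholder:

# disable the volatile write cache, then verify the setting
smartctl -s wcache,off /dev/sdX
smartctl -g wcache /dev/sdX
hdparm -W /dev/sdX     # should report write-caching = 0 (off)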

Long version:

Our disks are on Dell HBA330 Mini controllers and are in state "non-raid". The 
controller itself has no cache and is HBA-mode only.

Log entry:

The iDRAC log shows that the disk was removed from a drive group:

---
PDR5     Disk 6 in Backplane 2 of Integrated Storage Controller 1 is removed.
Detailed Description: A physical disk has been removed from the disk group. 
This alert can also be caused by loose or defective cables or by problems with 
the enclosure.
---

The iDRAC reported the disk neither as failed nor as "removed from drive bay". 
I reseated the disk and it came back as healthy. I assume it was a problem with 
connectivity to the backplane (chassis). If I now try to start up the OSDs on 
this disk, I get the following error:

starting osd.581 at - osd_data /var/lib/ceph/osd/ceph-581 
/var/lib/ceph/osd/ceph-581/journal
starting osd.580 at - osd_data /var/lib/ceph/osd/ceph-580 
/var/lib/ceph/osd/ceph-580/journal
2021-05-06 09:23:47.160 7fead5a1fb80 -1 bluefs mount failed to replay log: (5) 
Input/output error
2021-05-06 09:23:47.160 7fead5a1fb80 -1 bluestore(/var/lib/ceph/osd/ceph-581) 
_open_db failed bluefs mount: (5) Input/output error
2021-05-06 09:23:47.630 7fead5a1fb80 -1 osd.581 0 OSD:init: unable to mount 
object store
2021-05-06 09:23:47.630 7fead5a1fb80 -1  ** ERROR: osd init failed: (5) 
Input/output error
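
One obvious next step to probe how badly bluefs/rocksdb is damaged would 
presumably be an offline fsck of the OSD (a sketch, using the path from the log 
above):

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-581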

I have removed disks of active OSDs before without any bluestore corruption 
happening. While it is entirely possible that this particular "disconnect" 
event can lead to a broken OSD, there is also another observation where the 
Kingston disks stick out compared with other SSD OSDs, which makes me 
suspicious of this being a disk-cache firmware problem:

The I/O indicator LED lights up with significantly lower frequency than on all 
other SSD types in the same pool, even though we have 2 instead of 1 OSD 
deployed on the Kingstons (the other disks are 2 TB Micron Pro). While this 
could be due to a wiring difference, I'm starting to suspect that it might be 
an indication of volatile caching.

Does anyone using Kingston DC-M-SSDs have similar or contradicting experience?
How did these disks handle power outages?
Any recommendations?

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io