Hi there,
we recently had a problem with two OSDs failing because of I/O errors on the underlying disks. We run a small Ceph cluster with 3 nodes and 18 OSDs in total. All 3 nodes are Dell PowerEdge R515 servers with PERC H700 (MegaRAID SAS 2108) RAID controllers, and all disks are configured as single-disk RAID 0 arrays.

A disk on each of two separate nodes started showing I/O errors reported by SMART, with one of the disks reporting a pre-failure SMART error. The node with the failing disk also reported XFS I/O errors. In both cases the OSD daemons kept running, although Ceph reported that they were slow to respond. When we started to look into this we first tried restarting the OSDs; they then failed straight away, and we ended up with data loss.

We are running Ceph 0.80.5 on Scientific Linux 6.6 with a replication level of 2. We had hoped that losing disks due to hardware failure would be recoverable.
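For reference, this is roughly how we now poll SMART health on each node, since the H700 hides the physical drives from the OS and plain smartctl on the logical device sees nothing. It is only a minimal sketch: the pass-through device path (/dev/sda) and the megaraid device IDs 0-5 are examples for our layout and will differ per box.

    #!/usr/bin/env python
    # Sketch: query SMART health for each physical disk behind a
    # PERC H700 (MegaRAID) controller via smartctl pass-through.
    # /dev/sda and IDs 0-5 are assumptions for our particular layout.
    import subprocess

    PASSTHROUGH_DEV = "/dev/sda"  # any block device exported by the controller
    MEGARAID_IDS = range(6)       # one entry per physical slot on this node

    for dev_id in MEGARAID_IDS:
        cmd = ["smartctl", "-H", "-d", "megaraid,%d" % dev_id, PASSTHROUGH_DEV]
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        out = proc.communicate()[0]
        # SATA drives report "PASSED", SAS drives "SMART Health Status: OK";
        # anything else means the disk needs a closer look.
        healthy = ("PASSED" in out) or ("SMART Health Status: OK" in out)
        print("megaraid,%d => %s" % (dev_id, "healthy" if healthy else "CHECK"))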

Is this a known issue with these RAID controllers, or with this version of Ceph?

Regards,
Magnus


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
