We experienced a Ceph failure in which the cluster became unresponsive, with no
IOPS or throughput, because of a problematic OSD process on one node. The fault
caused slow operations and zero IOPS on all other OSDs in the cluster. The
incident timeline is as follows:

1. An alert was triggered for an OSD problem.
2. 6 of the 12 OSDs on the node were down.
3. We attempted a soft restart, but the smartmontools process hung while the
server was shutting down.
4. We performed a hard restart, after which service resumed as usual.

Our Ceph cluster has 19 nodes and 218 OSDs, and runs version 15.2.17 Octopus
(stable).

Questions:
1. What is Ceph's failure-detection mechanism? Why couldn't Ceph detect the
faulty node and automatically abandon its resources?
2. Did we miss any patches or bug fixes?
3. What improvements would you suggest so that we can quickly detect and avoid
similar issues in the future?
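
Regarding question 3, one check that could be run outside Ceph is to count down
OSDs per host and flag any host above a threshold, as in the "6 out of 12"
situation above. The following is only a minimal sketch under stated
assumptions: it expects the parsed JSON from `ceph osd tree -f json`, and it
assumes that output has a `nodes` list whose entries carry `id`, `name`, `type`,
`children` (for hosts) and `status` (for OSDs). The function names, the sample
data, and the 25% threshold are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical sample shaped like the assumed `ceph osd tree -f json` output.
SAMPLE_TREE = {
    "nodes": [
        {"id": -1, "name": "default", "type": "root", "children": [-2, -3]},
        {"id": -2, "name": "node-a", "type": "host", "children": [0, 1]},
        {"id": -3, "name": "node-b", "type": "host", "children": [2, 3]},
        {"id": 0, "name": "osd.0", "type": "osd", "status": "up"},
        {"id": 1, "name": "osd.1", "type": "osd", "status": "down"},
        {"id": 2, "name": "osd.2", "type": "osd", "status": "up"},
        {"id": 3, "name": "osd.3", "type": "osd", "status": "up"},
    ]
}


def down_osds_per_host(tree: dict) -> Dict[str, Tuple[int, int]]:
    """Return {host_name: (down_osds, total_osds)} for every host node."""
    nodes = {n["id"]: n for n in tree.get("nodes", [])}
    result: Dict[str, Tuple[int, int]] = {}
    for node in nodes.values():
        if node.get("type") != "host":
            continue
        # Keep only children that are known OSD entries.
        osds = [
            nodes[child]
            for child in node.get("children", [])
            if child in nodes and nodes[child].get("type") == "osd"
        ]
        # Anything not reported as "up" is counted as down.
        down = sum(1 for osd in osds if osd.get("status") != "up")
        result[node["name"]] = (down, len(osds))
    return result


def hosts_over_threshold(tree: dict, max_down_fraction: float = 0.25) -> List[str]:
    """Return sorted host names whose down fraction exceeds the threshold."""
    return sorted(
        host
        for host, (down, total) in down_osds_per_host(tree).items()
        if total and down / total > max_down_fraction
    )
```

Against a live cluster, the input would come from something like
`json.loads(subprocess.check_output(["ceph", "osd", "tree", "-f", "json"]))`.
Note that this only reports what the monitors already consider down; it would
not by itself catch an OSD process that is hung but still marked up.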
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io