Hello,

You might consider checking iowait (while the problem is occurring) and
dmesg (after it recovers). Perhaps an issue with the SATA/SAS/NVMe port
in question?
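
A rough sketch of what that could look like (device names are
placeholders; substitute the OSD's backing disk):

    # while the OSD is misbehaving: per-device utilisation and waits
    iostat -x /dev/sdX 5

    # after it recovers: look for link resets, ATA errors, etc.
    dmesg | grep -iE 'sdX|ata|reset|error'

    # SMART data can also point at a flaky cable or port
    smartctl -a /dev/sdX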


Regards,

Denes


On 11/29/2017 06:24 PM, Matthew Vernon wrote:
Hi,

We have a 3,060-OSD Ceph cluster (running Jewel
10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
which I mean it keeps spinning at ~100% CPU (cf. ~5% for the other OSDs
on that host) and has ops blocking on it for some time. It will then
behave for a bit, and then go back to doing this.
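
For reference, the blocked requests can be inspected via the admin
socket on that host (the OSD in question is osd.2054, per the commands
below):

    ceph daemon osd.2054 dump_ops_in_flight
    ceph daemon osd.2054 dump_historic_ops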

It's always the same OSD, and we've tried replacing the underlying disk.

The logs have lots of entries of the form

2017-11-29 17:18:51.097230 7fcc06919700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7fcc29fec700' had timed out after 15

I've had a brief poke through the collectd metrics for this OSD
(comparing them with other OSDs on the same host), but other than
spikes in latency for that OSD (iostat et al. show no issues with the
underlying disk), there's nothing obviously explanatory.
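
For a quick cluster-wide comparison, the per-OSD commit/apply latencies
can also be pulled with:

    # columns: osd, fs_commit_latency(ms), fs_apply_latency(ms)
    ceph osd perf | grep -w 2054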

I tried ceph tell osd.2054 injectargs --osd-op-thread-timeout 90 (which
is what googling for the above message suggests), but that just said
"unchangeable", and didn't seem to make any difference.

Any ideas? Other metrics to consider? ...

Thanks,

Matthew


