Hi All,
  We have a jewel cluster (10.2.1) that we built up in a POC state (2 clients 
also acting as mons, 12 SSD OSDs on 3 hosts, 20 SATA OSDs on 3 hosts).  We have 
connected our "prod" environment to it and migrated all the OSDs across, so it 
is now 114 OSDs (36 SSD, 78 NL-SAS, with another 26 waiting on the replacement 
of 2 DOA journal devices).

The migration was relatively clean, except for when I removed the old hosts 
from the crush map - even though the OSDs were already out and removed, the 
host entries still had weight, and the rebalancing that kicked in when those 
buckets were removed caused some sort of lockup with the kernel RBD clients.  
Once the clients were kicked the main issue settled down and all was good.
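
In hindsight, I think a removal order along these lines would have avoided the 
surprise rebalance (just a rough sketch - osd.NN and the old host name are 
placeholders):

  # drain the OSD by dropping its crush weight first, then wait for backfill
  ceph osd crush reweight osd.NN 0
  # once it is empty, take it out and strip it from crush, auth and the osdmap
  ceph osd out osd.NN
  ceph osd crush remove osd.NN
  ceph auth del osd.NN
  ceph osd rm osd.NN
  # the host bucket is now weight 0, so removing the empty bucket should not
  # move any data
  ceph osd crush remove ceph-poc-sata-01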

We then had one SAS OSD machine suffer a single network flap for 8 seconds 
during which it lost all network (unknown to me until later).  By the time I 
was looking at the issue it was like a ghostly silence - no OSDs down, none 
out, no failed PGs, no repairing - just blocked requests and nothing else.  
After jumping between multiple machines and finding nothing of note I got 
desperate and restarted all OSDs on a host-by-host basis.  That resulted in a 
number of PGs becoming inactive, but no recovery or any other sort of 
improvement, and the blocked requests continued.

I tried digging blocked requests out of random OSDs - most seemed to just say 
"waiting for peered" with no other real information.  After a while I started 
to find a general pattern pointing to one machine as being a peer for some of 
the PGs that were affected (I wish I had run ceph health detail earlier, 
because it would have clued me in to the list of PGs owned by that host much 
faster).  Finding nothing wrong, and having already seen the OSDs restarted 
(the monitors etc. saw them go down and come back up fine, but the blocked 
requests remained), I chose to reboot the server.
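
For anyone hitting the same thing, the commands that surface that information 
are roughly these (the OSD id is just an example):

  ceph health detail                      # health summary incl. blocked requests and stuck PGs
  ceph daemon osd.98 dump_ops_in_flight   # per-OSD blocked ops and their "waiting for ..." state
  ceph daemon osd.98 dump_historic_ops    # recently completed slow ops
  ceph pg dump_stuck inactive             # PGs that are stuck inactive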

As soon as that was done it was as if the cluster jumped back to life, 
discovered it was broken and started repairing.  Once the down host was back 
online it rejoined and the cluster quickly recovered, as if nothing had 
happened.

When I later dug into the logs I found the network flap, followed by the 
blocked requests building up and most of the OSDs on that host complaining of 
heartbeat_check failures.

Nov 21 16:38:31 ceph-glb-sata-06 kernel: enic 0000:0a:00.0 enp10s0: Link DOWN
Nov 21 16:38:31 ceph-glb-sata-06 kernel: enic 0000:0b:00.0 enp11s0: Link DOWN
Nov 21 16:38:31 ceph-glb-sata-06 kernel: enic 0000:0c:00.0 enp12s0: Link DOWN
Nov 21 16:38:39 ceph-glb-sata-06 kernel: enic 0000:0a:00.0 enp10s0: Link UP
Nov 21 16:38:39 ceph-glb-sata-06 kernel: enic 0000:0b:00.0 enp11s0: Link UP
Nov 21 16:38:39 ceph-glb-sata-06 kernel: enic 0000:0c:00.0 enp12s0: Link UP

Nov 21 16:38:45 ceph-glb-sata-06 ceph-osd: 2016-11-21 16:38:45.900016 
7f66a918c700 -1 osd.103 69295 heartbeat_check: no reply from osd.68 since back 
2016-11-21 16:38:25.707045 front 2016-11-21 16:38:43.908106 (cutoff 2016-11-21 
16:38:25.900010)
Nov 21 16:38:45 ceph-glb-sata-06 ceph-osd: 2016-11-21 16:38:45.900038 
7f66a918c700 -1 osd.103 69295 heartbeat_check: no reply from osd.70 since back 
2016-11-21 16:38:25.707045 front 2016-11-21 16:38:43.908106 (cutoff 2016-11-21 
16:38:25.900010)

2016-11-21 16:38:49.751243 7f138a653700 -1 osd.98 69295 heartbeat_check: no 
reply from osd.74 since back 2016-11-21 16:38:29.391091 front 2016-11-21 
16:38:47.293162 (cutoff 2016-11-21 16:38:29.751238)
2016-11-21 16:38:49.751264 7f138a653700 -1 osd.98 69295 heartbeat_check: no 
reply from osd.77 since back 2016-11-21 16:38:29.391091 front 2016-11-21 
16:38:47.293162 (cutoff 2016-11-21 16:38:29.751238)

In some of the blocked ops I found references to waiting on rw locks:

2016-11-21 16:43:42.787670 7f138a653700  0 log_channel(cluster) log [WRN] : 4 
slow requests, 4 included below; oldest blocked for > 120.843670 secs
2016-11-21 16:43:42.787673 7f138a653700  0 log_channel(cluster) do_log log to 
syslog
2016-11-21 16:43:42.787707 7f138a653700  0 log_channel(cluster) log [WRN] : 
slow request 120.843670 seconds old, received at 2016-11-21 16:41:41.943943: 
osd_op(client.2714100.1:69471470 1.3140a3e7 
rbd_data.159a26238e1f29.0000000000018502 [set-alloc-hint object_size 4194304 
write_size 4194304,write 3683840~510464] snapc 0=[] 
ondisk+write+ignore_cache+ignore_overlay+known_if_redirected e69295) currently 
waiting for subops from 89
2016-11-21 16:43:42.787710 7f138a653700  0 log_channel(cluster) do_log log to 
syslog
2016-11-21 16:43:42.787725 7f138a653700  0 log_channel(cluster) log [WRN] : 
slow request 60.181839 seconds old, received at 2016-11-21 16:42:42.605773: 
osd_op(client.2714100.1:69528164 1.3140a3e7 
rbd_data.159a26238e1f29.0000000000018502 [set-alloc-hint object_size 4194304 
write_size 4194304,write 3683840~510464] snapc 0=[] 
ondisk+write+ignore_cache+ignore_overlay+known_if_redirected e69295) currently 
waiting for subops from 89
2016-11-21 16:43:42.787740 7f138a653700  0 log_channel(cluster) do_log log to 
syslog
2016-11-21 16:43:42.787811 7f138a653700  0 log_channel(cluster) log [WRN] : 
slow request 60.181151 seconds old, received at 2016-11-21 16:42:42.606462: 
osd_op(osd.22.51681:2352097 1.3140a3e7 rbd_data.159a26238e1f29.0000000000018502 
[list-snaps] snapc 0=[] 
ack+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e69295) currently waiting for rw locks
2016-11-21 16:43:42.787813 7f138a653700  0 log_channel(cluster) do_log log to 
syslog
2016-11-21 16:43:42.787875 7f138a653700  0 log_channel(cluster) log [WRN] : 
slow request 60.181123 seconds old, received at 2016-11-21 16:42:42.606490: 
osd_op(osd.22.51681:2352098 1.3140a3e7 rbd_data.159a26238e1f29.0000000000018502 
[copy-get max 8388608] snapc 0=[] 
ack+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
 e69295) currently waiting for rw locks
2016-11-21 16:43:42.787877 7f138a653700  0 log_channel(cluster) do_log log to 
syslog

I am assuming there is something underlying here that broke network 
connectivity, although at the time I was able to connect to the machine fine, 
restart the services and have the monitors see that activity, and also ping 
across both the replication and public networks.  Granted, it's probably not 
Ceph's fault that connectivity was broken.  However...

Why, when the OSDs were restarted, would they go straight back into that 
failed state where requests were still blocked, even though they came back up 
and in?
Why would the other OSDs not have marked them as bad or unresponsive at that 
point, given many were waiting on blocked operations?  Is the fact that they 
could connect enough, or are there other conditions that would prevent them 
from being marked out by their peers?
Is there any recommended configuration that would help detect the case where 
an OSD is unresponsive and more quickly mark it out of the osdmap to keep the 
cluster operational (timeout tuning, flags to be set, etc.)?
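
For what it's worth, the knobs I have been looking at so far are below (the 
values are just the defaults as I understand them for jewel, not a 
recommendation - I would appreciate any guidance on whether tightening these 
is sane):

  [osd]
  osd heartbeat interval = 6        # how often an OSD pings its peers
  osd heartbeat grace = 20          # seconds without a reply before a peer is reported failed

  [mon]
  mon osd min down reporters = 2    # distinct OSDs that must report a failure before the mon marks it down
  mon osd down out interval = 600   # seconds a down OSD waits before being automatically marked out
  mon osd report timeout = 900      # mark OSDs down if they stop reporting to the mons for this long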

Any help appreciated.  It's a little scary kicking into production and having 
an outage where I can't explain why Ceph's redundancy didn't kick in.

Cheers,
 Adrian




