Hi All,
 We have a Jewel (10.2.1) cluster on CentOS 7, running an elrepo 4.4.1
kernel on all machines, and we have an issue where some of the machines hang -
I'm not sure yet whether it's hardware or the OS, but the host (console
included) becomes unresponsive and can only be recovered with a hardware
reset.  Unfortunately nothing useful is logged, so I am still trying to work
out what is causing it.  The consequence for Ceph is that when an OSD host
goes down like this, only some of its OSDs get marked down.  In the incident
on the weekend the host had 8 OSDs and only 5 were marked down - this led to
the kRBD devices jamming up trying to send IO to unresponsive OSDs that
stayed marked up.
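
For reference, the quickest way to see this state is 'ceph osd tree' or
'ceph osd dump'.  A rough python-rados equivalent (untested sketch; the JSON
field names are from my reading of the Jewel output, so worth double-checking)
that lists the up/down state of the OSDs on the hung host:

#!/usr/bin/env python
# Untested sketch: ask the mons for "osd dump" in JSON via librados and
# print which OSDs on the affected host are still considered up.
import json
import rados

HOST_ADDR = '10.145.2.15'   # the host that hung

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ret, out, errs = cluster.mon_command(
        json.dumps({'prefix': 'osd dump', 'format': 'json'}), b'')
    if ret != 0:
        raise RuntimeError(errs)
    # each "osds" entry carries the id, up/in flags and the public address
    for osd in json.loads(out)['osds']:
        if osd['public_addr'].startswith(HOST_ADDR + ':'):
            print('osd.%d %s %s' % (osd['osd'], osd['public_addr'],
                                    'up' if osd['up'] else 'down'))
finally:
    cluster.shutdown()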

The machine went into a slow death - lots of reports of slow or blocked 
requests:

2016-06-19 09:37:49.070810 osd.36 10.145.2.15:6802/31359 65 : cluster [WRN] 2 
slow requests, 2 included below; oldest blocked for > 30.297258 secs
2016-06-19 09:37:54.071542 osd.36 10.145.2.15:6802/31359 82 : cluster [WRN] 112 
slow requests, 5 included below; oldest blocked for > 35.297988 secs
2016-06-19 09:37:54.071737 osd.6 10.145.2.15:6801/21836 221 : cluster [WRN] 253 
slow requests, 5 included below; oldest blocked for > 35.325155 secs
2016-06-19 09:37:59.072570 osd.6 10.145.2.15:6801/21836 251 : cluster [WRN] 262 
slow requests, 5 included below; oldest blocked for > 40.325986 secs

And when the monitors did eventually mark them down, the OSDs disputed it:

2016-06-19 09:38:35.821716 mon.0 10.145.2.13:6789/0 244970 : cluster [INF] 
osd.6 10.145.2.15:6801/21836 failed (2 reporters from different host after 
20.000365 >= grace 20.000000)
2016-06-19 09:38:36.950556 mon.0 10.145.2.13:6789/0 244978 : cluster [INF] 
osd.22 10.145.2.15:6806/21826 failed (2 reporters from different host after 
21.613336 >= grace 20.000000)
2016-06-19 09:38:36.951133 mon.0 10.145.2.13:6789/0 244980 : cluster [INF] 
osd.31 10.145.2.15:6812/21838 failed (2 reporters from different host after 
21.613781 >= grace 20.836511)
2016-06-19 09:38:36.951636 mon.0 10.145.2.13:6789/0 244982 : cluster [INF] 
osd.36 10.145.2.15:6802/31359 failed (2 reporters from different host after 
21.614259 >= grace 20.000000)

2016-06-19 09:38:37.156088 osd.36 10.145.2.15:6802/31359 346 : cluster [WRN] 
map e28730 wrongly marked me down
2016-06-19 09:38:36.002076 osd.6 10.145.2.15:6801/21836 473 : cluster [WRN] map 
e28729 wrongly marked me down
2016-06-19 09:38:37.046885 osd.22 10.145.2.15:6806/21826 374 : cluster [WRN] 
map e28730 wrongly marked me down
2016-06-19 09:38:37.050635 osd.31 10.145.2.15:6812/21838 351 : cluster [WRN] 
map e28730 wrongly marked me down

But shortly after, they were marked out anyway:

2016-06-19 09:43:39.940985 mon.0 10.145.2.13:6789/0 245305 : cluster [INF] 
osd.6 out (down for 303.951251)
2016-06-19 09:43:39.941061 mon.0 10.145.2.13:6789/0 245306 : cluster [INF] 
osd.22 out (down for 302.908528)
2016-06-19 09:43:39.941099 mon.0 10.145.2.13:6789/0 245307 : cluster [INF] 
osd.31 out (down for 302.908527)
2016-06-19 09:43:39.941152 mon.0 10.145.2.13:6789/0 245308 : cluster [INF] 
osd.36 out (down for 302.908527)

2016-06-19 10:09:10.648924 mon.0 10.145.2.13:6789/0 247076 : cluster [INF] 
osd.23 10.145.2.15:6814/21852 failed (2 reporters from different host after 
20.000378 >= grace 20.000000)
2016-06-19 10:09:10.887220 osd.23 10.145.2.15:6814/21852 176 : cluster [WRN] 
map e28848 wrongly marked me down
2016-06-19 10:14:15.160513 mon.0 10.145.2.13:6789/0 247422 : cluster [INF] 
osd.23 out (down for 304.288018)

By the time the issue was eventually escalated and I was able to do something
about it, I manually marked the remaining OSDs on that host down, which seemed
to unclog RBD (the equivalent call via librados is sketched after the log
below):

2016-06-19 15:25:06.171395 mon.0 10.145.2.13:6789/0 267212 : cluster [INF] 
osd.7 10.145.2.15:6808/21837 failed (2 reporters from different host after 
22.000367 >= grace 20.000000)
2016-06-19 15:25:06.171905 mon.0 10.145.2.13:6789/0 267214 : cluster [INF] 
osd.24 10.145.2.15:6800/21813 failed (2 reporters from different host after 
22.000748 >= grace 20.710981)
2016-06-19 15:25:06.172426 mon.0 10.145.2.13:6789/0 267216 : cluster [INF] 
osd.37 10.145.2.15:6810/31936 failed (2 reporters from different host after 
22.001167 >= grace 20.000000)
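
What I ran was the equivalent of "ceph osd down 7 24 37".  A librados version
of the same call would look something like this (untested sketch; assumes the
client keyring has the caps to run mon commands):

#!/usr/bin/env python
# Untested sketch: the librados equivalent of `ceph osd down 7 24 37`.
import json
import rados

STUCK_OSDS = ['7', '24', '37']   # ids still marked up on the dead host

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ret, out, errs = cluster.mon_command(
        json.dumps({'prefix': 'osd down', 'ids': STUCK_OSDS}), b'')
    print('ret=%d %s' % (ret, errs))
finally:
    cluster.shutdown()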

The question I have is: why did these 3 OSDs stay in the cluster despite being
unresponsive for over 5 hours?  The CRUSH map for all pools uses hosts as the
failure domain, so I would have expected OSDs on other hosts to be reporting
them as unresponsive.  Nothing was logged by the host's OSDs in the hour prior
to the failure, and the OSDs on other hosts seem to have noticed the rest of
that host's OSDs timing out, but with the 3 that stayed up they appeared to be
actively attempting backfills instead.
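
In case it helps: the condition I want to catch is a host where only some of
its OSDs have been marked down while the rest stay up.  A rough sketch of a
check for that state via python-rados (untested; the "osd tree" JSON field
names are from my reading of the Jewel output, so worth verifying):

#!/usr/bin/env python
# Untested sketch: walk the CRUSH tree from `ceph osd tree -f json` and
# flag any host bucket where some, but not all, of its OSDs are down.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ret, out, errs = cluster.mon_command(
        json.dumps({'prefix': 'osd tree', 'format': 'json'}), b'')
    if ret != 0:
        raise RuntimeError(errs)
    nodes = {n['id']: n for n in json.loads(out)['nodes']}
    for node in nodes.values():
        if node['type'] != 'host':
            continue
        osds = [nodes[c] for c in node.get('children', []) if c in nodes]
        down = [o['name'] for o in osds if o.get('status') == 'down']
        if down and len(down) < len(osds):
            print('%s: %d of %d OSDs down (%s) - possible hung host'
                  % (node['name'], len(down), len(osds), ', '.join(down)))
finally:
    cluster.shutdown()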

Any ideas on how I can improve detection of this condition?

Cheers,
 Adrian

