Hi list,

since our upgrade from 14.2.9 to 14.2.10 we have been observing flapping OSDs:
* The mons claim every few minutes:
2020-08-07 09:49:09.783648 osd.243 (osd.243) 246 : cluster [WRN] Monitor daemon 
marked osd.243 down, but it is still running
2020-08-07 10:04:40.753704 osd.243 (osd.243) 248 : cluster [WRN] Monitor daemon 
marked osd.243 down, but it is still running
2020-08-07 10:07:21.187945 osd.253 (osd.253) 469 : cluster [WRN] Monitor daemon 
marked osd.253 down, but it is still running

2020-08-07 10:04:35.440547 mon.cephmon01 (mon.0) 390132 : cluster [DBG] osd.243 
reported failed by osd.33
2020-08-07 10:04:35.508412 mon.cephmon01 (mon.0) 390133 : cluster [DBG] osd.243 
reported failed by osd.187
2020-08-07 10:04:35.508529 mon.cephmon01 (mon.0) 390134 : cluster [INF] osd.243 
failed (root=default,datacenter=of,row=row-of-02,host=cephosd16) (2 reporters 
from different host after 44.000150 >= grace 25.935545)
2020-08-07 10:04:35.695171 mon.cephmon01 (mon.0) 390135 : cluster [DBG] osd.243 
reported failed by osd.203
2020-08-07 10:04:35.771704 mon.cephmon01 (mon.0) 390136 : cluster [DBG] osd.243 
reported failed by osd.163
2020-08-07 10:04:41.588530 mon.cephmon01 (mon.0) 390148 : cluster [INF] osd.243 
[v2:10.198.10.16:6882/6611,v1:10.198.10.16:6885/6611] boot
2020-08-07 10:04:40.753704 osd.243 (osd.243) 248 : cluster [WRN] Monitor daemon 
marked osd.243 down, but it is still running
2020-08-07 10:04:40.753712 osd.243 (osd.243) 249 : cluster [DBG] map e2683535 
wrongly marked me down at e2683534

osd.33 says:
2020-08-07 10:04:35.437 7fcaaa4f3700 -1 osd.33 2683533 heartbeat_check: no 
reply from 10.198.10.16:6802 osd.243 since back 2020-08-07 10:03:51.223911 
front 2020-08-07 10:03:51.224322 (oldest deadline 2020-08-07 10:04:35.322704)

osd.243 says:
2020-08-07 10:03:55.065 7f0d33911700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f0d13acb700' had timed out after 15
2020-08-07 10:03:55.065 7f0d34112700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f0d13acb700' had timed out after 15
[.. ~3000(!) Lines ..]
2020-08-07 10:04:33.644 7f0d33110700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f0d13acb700' had timed out after 15
2020-08-07 10:04:33.688 7f0d13acb700  0 bluestore(/var/lib/ceph/osd/ceph-243) 
log_latency_fn slow operation observed for upper_bound, latency = 20.9013s, 
after =  omap_iterator(cid = 19.58a_head, oid = #19:51a21a27:::
.dir.default.223091333.1.3:head#)
2020-08-07 10:04:33.688 7f0d13acb700  1 heartbeat_map reset_timeout 
'OSD::osd_op_tp thread 0x7f0d13acb700' had timed out after 15
2020-08-07 10:04:40.748 7f0d2279b700  0 log_channel(cluster) log [WRN] : 
Monitor daemon marked osd.243 down, but it is still running
2020-08-07 10:04:40.748 7f0d2279b700  0 log_channel(cluster) log [DBG] : map 
e2683535 wrongly marked me down at e2683534


* As a consequence, old deep-scrubs do not finish, because they keep getting 
interrupted -> 'pgs not deep-scrubbed in time' (see below).
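
For reference, the PGs behind that warning can be listed with something like 
(ceph health detail prints one line per affected PG):

ceph health detail | grep 'not deep-scrubbed since'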

For the latter, I increased the OSD op thread timeout back to its pre-12(!).2.11 
value of 30.
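
In case it helps: assuming the setting in question is osd_op_thread_timeout 
(the one behind the "timed out after 15" messages above), this is roughly how 
it can be applied:

# at runtime, for all OSDs, via the centralized config (option name assumed)
ceph config set osd osd_op_thread_timeout 30

# or persistently in ceph.conf on the OSD hosts
[osd]
osd_op_thread_timeout = 30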

I'm not sure if we really have a problem, but it does not look healthy.

Any ideas, thoughts?

regards,
Ingo

-- 
Ingo Reimann 
Teamleiter Technik
[ https://www.dunkel.de/ ] 
Dunkel GmbH 
Philipp-Reis-Straße 2 
65795 Hattersheim 
Fon: +49 6190 889-100 
Fax: +49 6190 889-399 
eMail: supp...@dunkel.de 
https://www.Dunkel.de/  Amtsgericht Frankfurt/Main 
HRB: 37971 
Geschäftsführer: Axel Dunkel 
Ust-ID: DE 811622001