I didn't find anything obvious in the release notes about this issue we seem to 
be having, but I don't really understand it.

We have seen logs indicating some kind of heartbeat issue with the OSDs, but we 
don't believe there are any issues with the networking between the nodes, which 
are mostly idle as well:

2018-07-17 17:41:32.903871 I | osd12: 2018-07-17 17:41:32.903793 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6866 osd.219 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903875 I | osd12: 2018-07-17 17:41:32.903795 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6922 osd.220 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903878 I | osd12: 2018-07-17 17:41:32.903798 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6901 osd.221 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903880 I | osd12: 2018-07-17 17:41:32.903800 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6963 osd.222 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
2018-07-17 17:41:32.903884 I | osd12: 2018-07-17 17:41:32.903803 7fffef198700 
-1 osd.12 4296 heartbeat_check: no reply from 7.129.220.44:6907 osd.224 ever on 
either front or back, first ping sent 2018-07-17 17:41:09.893761 (cutoff 
2018-07-17 17:41:12.903604)
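For anyone wondering where the "cutoff" in those lines comes from: it appears to 
be the moment the heartbeat_check sampled the clock (a fraction before the log 
timestamp) minus the heartbeat grace period, 20 s being the default 
osd_heartbeat_grace (assuming defaults on this cluster). A peer whose first 
unanswered ping is older than the cutoff gets reported. With GNU date:

```shell
# Reproduce the "cutoff" from osd.12's log lines above, assuming the
# default osd_heartbeat_grace of 20 seconds: cutoff = check time - grace.
date -u -d "2018-07-17 17:41:32.903604 UTC - 20 seconds" "+%Y-%m-%d %H:%M:%S.%6N"
# → 2018-07-17 17:41:12.903604, matching the cutoff in the log.
# The first ping (17:41:09.893761) is older than this cutoff, so osd.219
# etc. are reported as unreachable on both front and back networks.
```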

Is there a way to resolve this issue? It seems to be the root cause of the OSDs 
being marked as failed.

Thanks in advance for any help,
-Bryan
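
P.S. Would temporarily setting the noout/nodown flags be a reasonable stopgap 
while we investigate? My understanding (possibly wrong) is that it only masks 
the symptom and assumes the OSD daemons themselves are healthy, e.g.:

```shell
# Temporarily stop the monitors from marking OSDs down/out, which
# pauses the fail -> down -> boot churn while the network is checked.
ceph osd set nodown
ceph osd set noout

# ... investigate heartbeats on the front/back networks between OSD hosts ...

# Unset the flags once the root cause is fixed, so normal failure
# handling and recovery can resume.
ceph osd unset nodown
ceph osd unset noout
```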

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Bryan 
Banister
Sent: Tuesday, July 17, 2018 12:08 PM
To: Tom W <to...@ukfast.co.uk>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs 
failed, then marked down, then booted, then failed again

Note: External Email
________________________________
Hi Tom,

We're apparently running ceph version 12.2.5 on a Rook-based cluster.  We have 
EC pools on large 8TB HDDs and metadata on bluestore OSDs on NVMe drives.

I'll look at the release notes.

Thanks!
-Bryan

From: Tom W [mailto:to...@ukfast.co.uk]
Sent: Tuesday, July 17, 2018 12:05 PM
To: Bryan Banister 
<bbanis...@jumptrading.com<mailto:bbanis...@jumptrading.com>>; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: Cluster in bad shape, seemingly endless cycle of OSDs failed, then 
marked down, then booted, then failed again

________________________________

Hi Bryan,



What version of Ceph are you currently running on, and do you run any erasure 
coded pools or bluestore OSDs? Might be worth having a quick glance over the 
recent changelogs:



http://docs.ceph.com/docs/master/releases/luminous/



Tom

________________________________
From: ceph-users 
<ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of Bryan Banister 
<bbanis...@jumptrading.com<mailto:bbanis...@jumptrading.com>>
Sent: 17 July 2018 18:00:05
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] Cluster in bad shape, seemingly endless cycle of OSDs 
failed, then marked down, then booted, then failed again

Hi all,

We're still very new to managing Ceph and seem to have a cluster that is in an 
endless loop of OSDs failing, then being marked down, then booting, then failing again:

Here are some example logs:
2018-07-17 16:48:28.976673 mon.rook-ceph-mon7 [INF] osd.83 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491973 >= grace 20.010293)
2018-07-17 16:48:28.976730 mon.rook-ceph-mon7 [INF] osd.84 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491916 >= grace 20.010293)
2018-07-17 16:48:28.976785 mon.rook-ceph-mon7 [INF] osd.85 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491870 >= grace 20.011151)
2018-07-17 16:48:28.976843 mon.rook-ceph-mon7 [INF] osd.86 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
61.491828 >= grace 20.010293)
2018-07-17 16:48:28.976890 mon.rook-ceph-mon7 [INF] Marking osd.1 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976913 mon.rook-ceph-mon7 [INF] Marking osd.2 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976933 mon.rook-ceph-mon7 [INF] Marking osd.3 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976954 mon.rook-ceph-mon7 [INF] Marking osd.4 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.976979 mon.rook-ceph-mon7 [INF] Marking osd.9 out (has been 
down for 605 seconds)
2018-07-17 16:48:28.977000 mon.rook-ceph-mon7 [INF] Marking osd.10 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977020 mon.rook-ceph-mon7 [INF] Marking osd.11 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977040 mon.rook-ceph-mon7 [INF] Marking osd.12 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977059 mon.rook-ceph-mon7 [INF] Marking osd.13 out (has 
been down for 605 seconds)
2018-07-17 16:48:28.977079 mon.rook-ceph-mon7 [INF] Marking osd.14 out (has 
been down for 605 seconds)
2018-07-17 16:48:30.889316 mon.rook-ceph-mon7 [INF] osd.55 
7.129.218.12:6920/90761 boot
2018-07-17 16:48:31.113052 mon.rook-ceph-mon7 [WRN] Health check update: 
4946/8854434 objects misplaced (0.056%) (OBJECT_MISPLACED)
2018-07-17 16:48:31.113087 mon.rook-ceph-mon7 [WRN] Health check update: 
Degraded data redundancy: 7951/8854434 objects degraded (0.090%), 88 pgs 
degraded, 273 pgs undersized (PG_DEGRADED)
2018-07-17 16:48:32.763546 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10439 pgs inactive, 8994 pgs down, 1639 pgs peering, 
88 pgs incomplete, 3430 pgs stale (PG_AVAILABILITY)
2018-07-17 16:48:32.763578 mon.rook-ceph-mon7 [WRN] Health check update: 29 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-17 16:48:34.096178 mon.rook-ceph-mon7 [INF] osd.88 failed 
(root=default,host=carg-kubelet-osd04) (3 reporters from different host after 
66.612054 >= grace 20.010283)
2018-07-17 16:48:34.108020 mon.rook-ceph-mon7 [WRN] Health check update: 112 
osds down (OSD_DOWN)
2018-07-17 16:48:38.736108 mon.rook-ceph-mon7 [WRN] Health check update: 
4946/8843715 objects misplaced (0.056%) (OBJECT_MISPLACED)
2018-07-17 16:48:38.736140 mon.rook-ceph-mon7 [WRN] Health check update: 
Reduced data availability: 10415 pgs inactive, 9000 pgs down, 1635 pgs peering, 
88 pgs incomplete, 3418 pgs stale (PG_AVAILABILITY)
2018-07-17 16:48:38.736166 mon.rook-ceph-mon7 [WRN] Health check update: 
Degraded data redundancy: 7949/8843715 objects degraded (0.090%), 86 pgs 
degraded, 267 pgs undersized (PG_DEGRADED)
2018-07-17 16:48:40.430146 mon.rook-ceph-mon7 [WRN] Health check update: 111 
osds down (OSD_DOWN)
2018-07-17 16:48:40.812579 mon.rook-ceph-mon7 [INF] osd.117 
7.129.217.10:6833/98090 boot
2018-07-17 16:48:42.427204 mon.rook-ceph-mon7 [INF] osd.115 
7.129.217.10:6940/98114 boot
2018-07-17 16:48:42.427297 mon.rook-ceph-mon7 [INF] osd.100 
7.129.217.10:6899/98091 boot
2018-07-17 16:48:42.427502 mon.rook-ceph-mon7 [INF] osd.95 
7.129.217.10:6901/98092 boot

Not sure this is going to fix itself. Any ideas on how to handle this 
situation?

Thanks in advance!
-Bryan


________________________________

Note: This email is for the confidential use of the named addressee(s) only and 
may contain proprietary, confidential, or privileged information and/or 
personal data. If you are not the intended recipient, you are hereby notified 
that any review, dissemination, or copying of this email is strictly 
prohibited, and requested to notify the sender immediately and destroy this 
email and any attachments. Email transmission cannot be guaranteed to be secure 
or error-free. The Company, therefore, does not make any guarantees as to the 
completeness or accuracy of this email or any attachments. This email is for 
informational purposes only and does not constitute a recommendation, offer, 
request, or solicitation of any kind to buy, sell, subscribe, redeem, or 
perform any type of transaction of a financial product. Personal data, as 
defined by applicable data privacy laws, contained in this email may be 
processed by the Company, and any of its affiliated or related companies, for 
potential ongoing compliance and/or business-related purposes. You may have 
rights regarding your personal data; for information on exercising these rights 
or the Company's treatment of personal data, please email 
datareque...@jumptrading.com<mailto:datareque...@jumptrading.com>.

________________________________

NOTICE AND DISCLAIMER
This e-mail (including any attachments) is intended for the above-named 
person(s). If you are not the intended recipient, notify the sender 
immediately, delete this email from your system and do not disclose or use for 
any purpose. We may monitor all incoming and outgoing emails in line with 
current legislation. We have taken steps to ensure that this email and 
attachments are free from any virus, but it remains your responsibility to 
ensure that viruses do not adversely affect you

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
