Hi Jared,
    did you find a solution to your problem? It appears I have the same OSD problem, and my tcpdump captures don't point to a cause.

All OSD nodes produced logs like

2017-12-14 11:25:11.756552 7f0cc5905700 -1 osd.49 29546 heartbeat_check: no reply from 172.16.5.155:6817 osd.46 since back 2017-12-14 11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 11:24:51.756201)
2017-12-14 11:25:11.756558 7f0cc5905700 -1 osd.49 29546 heartbeat_check: no reply from 172.16.5.155:6815 osd.48 since back 2017-12-14 11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 11:24:51.756201)
2017-12-14 11:25:11.756564 7f0cc5905700 -1 osd.49 29546 heartbeat_check: no reply from 172.16.5.156:6805 osd.50 since back 2017-12-14 11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 11:24:51.756201)

Sometimes the OSD process shut down and respawned, sometimes it just shut down.

We are running Ubuntu 14.04 (one node is on 16.04) and Ceph version 10.2.10.
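For anyone reading along, the arithmetic behind these log lines is simple: the cutoff is the current time minus the heartbeat grace period (osd_heartbeat_grace, 20 seconds by default), and a peer whose last reply is older than the cutoff gets reported. A minimal sketch using the timestamps from the log above (the small microsecond difference from the logged cutoff is just because the message is timestamped slightly after the cutoff is computed):

```python
from datetime import datetime, timedelta

# osd_heartbeat_grace defaults to 20 seconds (adjust if your cluster overrides it).
GRACE = timedelta(seconds=20)

# Timestamps taken from the osd.49 log line above.
now = datetime(2017, 12, 14, 11, 25, 11, 756552)        # log message time
last_reply = datetime(2017, 12, 14, 11, 24, 44, 252310)  # "since back/front"

cutoff = now - GRACE
print(cutoff)               # -> 2017-12-14 11:24:51.756552
print(last_reply < cutoff)  # -> True, so "heartbeat_check: no reply" is logged
```

So osd.46/48/50 had been silent for roughly 27 seconds here, well past the 20-second grace window.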

Thanks
Tristan





On Fri, Jul 28, 2017 at 6:06 AM, Jared Watts <Jared.Watts at quantum.com> wrote:
> I've got a cluster where a bunch of OSDs are down/out (only 6/21 are up/in).
> ceph status and ceph osd tree output can be found at:
> https://gist.github.com/jbw976/24895f5c35ef0557421124f4b26f6a12
>
> In osd.4 log, I see many of these:
>
> 2017-07-27 19:38:53.468852 7f3855c1c700 -1 osd.4 120 heartbeat_check: no
> reply from 10.32.0.3:6807 osd.15 ever on either front or back, first ping
> sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
>
> 2017-07-27 19:38:53.468881 7f3855c1c700 -1 osd.4 120 heartbeat_check: no
> reply from 10.32.0.3:6811 osd.16 ever on either front or back, first ping
> sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
>
> From osd.4, those endpoints look reachable:
>
> # nc -vz 10.32.0.3 6807
> 10.32.0.3 (10.32.0.3:6807) open
> # nc -vz 10.32.0.3 6811
> 10.32.0.3 (10.32.0.3:6811) open
>
> What else can I look at to determine why most of the OSDs cannot
> communicate? http://tracker.ceph.com/issues/16092 indicates this behavior
> is a networking or hardware issue, what else can I check there? I can turn
> on extra logging as needed. Thanks!
Do a packet capture on both machines at the same time and verify the
packets are arriving as expected.
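If it helps while comparing captures, the `nc -vz` probes quoted above can be scripted so every peer's heartbeat port is checked from every node. A minimal Python sketch of the same TCP connect test (hosts and ports are whatever your cluster uses; the 10.32.0.3 addresses are from Jared's mail):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds, like `nc -vz`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

One caveat: a successful connect only proves the listener is reachable on that path. OSD heartbeats go over both the public (front) and cluster (back) networks, so probe the addresses on both interfaces before concluding the network is fine.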



_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
