Hi Jared, did you find a solution to your problem? It appears that I have the same OSD problem, and tcpdump captures haven't pointed to a cause.
All OSD nodes produced logs like:

2017-12-14 11:25:11.756552 7f0cc5905700 -1 osd.49 29546 heartbeat_check: no reply from 172.16.5.155:6817 osd.46 since back 2017-12-14 11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 11:24:51.756201)
2017-12-14 11:25:11.756558 7f0cc5905700 -1 osd.49 29546 heartbeat_check: no reply from 172.16.5.155:6815 osd.48 since back 2017-12-14 11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 11:24:51.756201)
2017-12-14 11:25:11.756564 7f0cc5905700 -1 osd.49 29546 heartbeat_check: no reply from 172.16.5.156:6805 osd.50 since back 2017-12-14 11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 11:24:51.756201)
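When many of these lines scroll past, it helps to summarize which peers each OSD considers unreachable. A minimal sketch that parses the log format above (the regex and function name are my own, not part of any Ceph tooling):

```python
import re
from collections import defaultdict

# Hand-written pattern for the osd heartbeat_check lines shown above;
# it captures the peer address and the peer osd id.
PATTERN = re.compile(r"heartbeat_check: no reply from (\S+:\d+) (osd\.\d+)")

def unreachable_peers(log_text):
    """Return {peer_osd: set of addresses} for every 'no reply' line."""
    peers = defaultdict(set)
    for addr, osd in PATTERN.findall(log_text):
        peers[osd].add(addr)
    return dict(peers)

sample = (
    "2017-12-14 11:25:11.756552 7f0cc5905700 -1 osd.49 29546 "
    "heartbeat_check: no reply from 172.16.5.155:6817 osd.46 since back ...\n"
    "2017-12-14 11:25:11.756558 7f0cc5905700 -1 osd.49 29546 "
    "heartbeat_check: no reply from 172.16.5.155:6815 osd.48 since back ...\n"
)

print(unreachable_peers(sample))
# → {'osd.46': {'172.16.5.155:6817'}, 'osd.48': {'172.16.5.155:6815'}}
```

The same pattern also matches the "ever on either front or back" variant of the message, since both start with "no reply from ADDR osd.N".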
Sometimes the OSD process shut down and respawned, sometimes it just shut down. We use Ubuntu 14.04 (one node is on 16.04) and Ceph version 10.2.10.

Thanks,
Tristan

On Fri, Jul 28, 2017 at 6:06 AM, Jared Watts <Jared.Watts at quantum.com> wrote:
> I’ve got a cluster where a bunch of OSDs are down/out (only 6/21 are up/in).
>
> ceph status and ceph osd tree output can be found at:
> https://gist.github.com/jbw976/24895f5c35ef0557421124f4b26f6a12
>
> In osd.4 log, I see many of these:
>
> 2017-07-27 19:38:53.468852 7f3855c1c700 -1 osd.4 120 heartbeat_check: no reply from 10.32.0.3:6807 osd.15 ever on either front or back, first ping sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
> 2017-07-27 19:38:53.468881 7f3855c1c700 -1 osd.4 120 heartbeat_check: no reply from 10.32.0.3:6811 osd.16 ever on either front or back, first ping sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
>
> From osd.4, those endpoints look reachable:
>
> # nc -vz 10.32.0.3 6807
> 10.32.0.3 (10.32.0.3:6807) open
> # nc -vz 10.32.0.3 6811
> 10.32.0.3 (10.32.0.3:6811) open
>
> What else can I look at to determine why most of the OSDs cannot communicate? http://tracker.ceph.com/issues/16092 indicates this behavior is a networking or hardware issue; what else can I check there? I can turn on extra logging as needed. Thanks!

Do a packet capture on both machines at the same time and verify the packets are arriving as expected.
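Note that the nc checks only prove the TCP port accepts connections; OSD heartbeats are their own messages, so a successful connect does not guarantee heartbeat replies. Still, for scripting the same reachability check across many peers, here is a minimal Python equivalent of `nc -vz` (function name and defaults are my own):

```python
import socket

def port_open(host, port, timeout=2.0):
    """TCP connect test, roughly what `nc -vz host port` does."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: sweep the heartbeat ports that appear in the osd.4 log.
for port in (6807, 6811):
    print(port, port_open("10.32.0.3", port))
```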
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com