I’ve got a cluster where a bunch of OSDs are down/out (only 6/21 are up/in).  
ceph status and ceph osd tree output can be found at:
https://gist.github.com/jbw976/24895f5c35ef0557421124f4b26f6a12

In the osd.4 log, I see many of these:
2017-07-27 19:38:53.468852 7f3855c1c700 -1 osd.4 120 heartbeat_check: no reply 
from 10.32.0.3:6807 osd.15 ever on either front or back, first ping sent 
2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
2017-07-27 19:38:53.468881 7f3855c1c700 -1 osd.4 120 heartbeat_check: no reply 
from 10.32.0.3:6811 osd.16 ever on either front or back, first ping sent 
2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
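To double-check which daemons those addresses actually belong to, I was going to
map the ids back to their hosts and full front/back bindings with something like
this (assuming I'm reading the CLI docs right):

/ # ceph osd find 15
/ # ceph osd find 16
/ # ceph osd dump | grep 'osd.1[56]'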

From osd.4, those endpoints look reachable:
/ # nc -vz 10.32.0.3 6807
10.32.0.3 (10.32.0.3:6807) open
/ # nc -vz 10.32.0.3 6811
10.32.0.3 (10.32.0.3:6811) open
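One thing I realize nc only proves is a plain TCP connect to those ports, while the
heartbeats are checked on both the front (public) and back (cluster) interfaces, and
the tracker issue below points at networking problems like MTU mismatches. So I was
also planning to confirm which networks the OSDs think they should use and whether
large, unfragmented packets survive the path, roughly like this (assuming the admin
socket is reachable from the osd.4 container and a non-busybox ping; the 8972 size
assumes MTU 9000, use 1472 for a standard 1500 MTU):

/ # ceph daemon osd.4 config get cluster_network
/ # ceph daemon osd.4 config get public_network
/ # ip link show
/ # ping -M do -s 8972 10.32.0.3

The thought being that an MTU mismatch could let small connects like nc succeed
while larger messages are silently dropped.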

What else can I look at to determine why most of the OSDs cannot communicate?  
http://tracker.ceph.com/issues/16092 indicates this behavior is a networking or 
hardware issue; what else can I check on that front?  I can turn on extra logging as 
needed.  Thanks!
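
P.S. If extra logging would help, this is roughly what I had in mind (assuming
injectargs with debug_ms/debug_osd is still the preferred way to raise it on the fly):

/ # ceph tell osd.4 injectargs '--debug-ms 1 --debug-osd 20'

My understanding is that debug_ms 1 should at least show the heartbeat pings on the
wire, so I can see whether they are being sent and dropped or never sent at all.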