Re: [ceph-users] how to troubleshoot "heartbeat_check: no reply" in OSD log

2017-12-14 Thread Tristan Le Toullec

Hi Jared,
    did you ever find a solution to your problem? It appears that I 
have the same OSD problem, and tcpdump captures don't reveal anything conclusive.


All OSD nodes produced logs like

2017-12-14 11:25:11.756552 7f0cc5905700 -1 osd.49 29546 heartbeat_check: 
no reply from 172.16.5.155:6817 osd.46 since back 2017-12-14 
11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 
11:24:51.756201)
2017-12-14 11:25:11.756558 7f0cc5905700 -1 osd.49 29546 heartbeat_check: 
no reply from 172.16.5.155:6815 osd.48 since back 2017-12-14 
11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 
11:24:51.756201)
2017-12-14 11:25:11.756564 7f0cc5905700 -1 osd.49 29546 heartbeat_check: 
no reply from 172.16.5.156:6805 osd.50 since back 2017-12-14 
11:24:44.252310 front 2017-12-14 11:24:44.252310 (cutoff 2017-12-14 
11:24:51.756201)
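
The cutoff in these lines is the check time minus osd_heartbeat_grace 
(default 20 s, which matches the timestamps above: 11:25:11 minus 20 s is 
11:24:51). To confirm the value in effect on a node, assuming the admin 
socket is in its default location:

ceph daemon osd.49 config get osd_heartbeat_grace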


Sometimes the OSD process was shut down and respawned, sometimes it just shut down.

We use Ubuntu 14.04 (one node is on 16.04) and Ceph version 10.2.10.

Thanks
Tristan






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to troubleshoot "heartbeat_check: no reply" in OSD log

2017-07-27 Thread Brad Hubbard
On Fri, Jul 28, 2017 at 6:06 AM, Jared Watts wrote:
> I’ve got a cluster where a bunch of OSDs are down/out (only 6/21 are up/in).
> ceph status and ceph osd tree output can be found at:
>
> https://gist.github.com/jbw976/24895f5c35ef0557421124f4b26f6a12
>
>
>
> In osd.4 log, I see many of these:
>
> 2017-07-27 19:38:53.468852 7f3855c1c700 -1 osd.4 120 heartbeat_check: no
> reply from 10.32.0.3:6807 osd.15 ever on either front or back, first ping
> sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
>
> 2017-07-27 19:38:53.468881 7f3855c1c700 -1 osd.4 120 heartbeat_check: no
> reply from 10.32.0.3:6811 osd.16 ever on either front or back, first ping
> sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
>
>
>
> From osd.4, those endpoints look reachable:
>
> / # nc -vz 10.32.0.3 6807
>
> 10.32.0.3 (10.32.0.3:6807) open
>
> / # nc -vz 10.32.0.3 6811
>
> 10.32.0.3 (10.32.0.3:6811) open
>
>
>
> What else can I look at to determine why most of the OSDs cannot
> communicate?  http://tracker.ceph.com/issues/16092 indicates this behavior
> is a networking or hardware issue; what else can I check there?  I can turn
> on extra logging as needed.  Thanks!

Do a packet capture on both machines at the same time and verify the
packets are arriving as expected.
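
A minimal sketch, assuming the heartbeat ports from the log above and the
"any" pseudo-interface (adjust both to your environment):

# on osd.4's host: do the pings go out and do replies come back?
tcpdump -i any -nn 'host 10.32.0.3 and (port 6807 or port 6811)'

# on 10.32.0.3: do the pings arrive at osd.15/osd.16 at all?
tcpdump -i any -nn 'port 6807 or port 6811'

If pings leave one side but never appear on the other, suspect the network
path (firewall rules, MTU, routing); if they arrive but no reply goes out,
look at the OSD daemon itself.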

>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how to troubleshoot "heartbeat_check: no reply" in OSD log

2017-07-27 Thread Jared Watts
I’ve got a cluster where a bunch of OSDs are down/out (only 6/21 are up/in).  
ceph status and ceph osd tree output can be found at:
https://gist.github.com/jbw976/24895f5c35ef0557421124f4b26f6a12

In osd.4 log, I see many of these:
2017-07-27 19:38:53.468852 7f3855c1c700 -1 osd.4 120 heartbeat_check: no reply 
from 10.32.0.3:6807 osd.15 ever on either front or back, first ping sent 
2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)
2017-07-27 19:38:53.468881 7f3855c1c700 -1 osd.4 120 heartbeat_check: no reply 
from 10.32.0.3:6811 osd.16 ever on either front or back, first ping sent 
2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850)

From osd.4, those endpoints look reachable:
/ # nc -vz 10.32.0.3 6807
10.32.0.3 (10.32.0.3:6807) open
/ # nc -vz 10.32.0.3 6811
10.32.0.3 (10.32.0.3:6811) open

What else can I look at to determine why most of the OSDs cannot communicate?  
http://tracker.ceph.com/issues/16092 indicates this behavior is a networking or 
hardware issue; what else can I check there?  I can turn on extra logging as 
needed.  Thanks!
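
For reference, one way to turn on that extra logging, assuming admin access
to the cluster (debug_ms covers the messenger that carries the heartbeat
pings):

ceph tell osd.4 injectargs '--debug-ms 1 --debug-osd 10'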
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com