[ceph-users] Re: Can't join new mon - lossy channel, failing

2023-08-17 Thread Josef Johansson

Hi,

I'm running ceph version 15.2.16 
(a6b69e817d6c9e6f02d0a7ac3043ba9cdbda1bdf) octopus (stable), that would 
mean I am not running the fix.


Glad to know that an upgrade will solve the issue!

Kind regards
Josef Johansson

On 8/16/23 12:05, Konstantin Shalygin wrote:

Hi,


On 16 Aug 2023, at 11:30, Josef Johansson  wrote:

Let's do some serious necromancy here.

I just had this exact problem. It turns out that after rebooting all nodes (one 
at a time, of course), the monitor could join perfectly.

Why? You tell me. We did not see any trace of the IP address in any dumps that 
we could get hold of. I restarted all ceph-mgr daemons beforehand as well.

What is your release?
This deadlock may be fixed via [1]


[1] https://tracker.ceph.com/issues/55355
k

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can't join new mon - lossy channel, failing

2023-08-17 Thread Josef Johansson

Hi,

Let's do some serious necromancy here.

I just had this exact problem. It turns out that after rebooting all nodes 
(one at a time, of course), the monitor could join perfectly.


Why? You tell me. We did not see any trace of the IP address in any 
dumps that we could get hold of. I restarted all ceph-mgr daemons beforehand 
as well.


Kind regards
Josef Johansson

On 10/5/21 15:37, Konstantin Shalygin wrote:

As a last resort we changed the IP address of this host, and the mon successfully 
joined the quorum. When we reverted the IP address, the mon could no longer join, 
so we think there is something on the switch side or on the old mons' side. From 
the old mons I checked connectivity to the new mon process via telnet, and it 
all works.
It would be good to build a reproducer for this network problem, to find out 
exactly which message of the Ceph protocol is broken.



k



[ceph-users] Re: Can't join new mon - lossy channel, failing

2023-08-16 Thread Konstantin Shalygin


> On 16 Aug 2023, at 13:23, Josef Johansson  wrote:
> 
> I'm running ceph version 15.2.16 (a6b69e817d6c9e6f02d0a7ac3043ba9cdbda1bdf) 
> octopus (stable), that would mean I am not running the fix.
> 
> Glad to know that an upgrade will solve the issue!

I'm not 100% sure that this tracker issue fixes exactly this [ipaddr deadlock 
"somewhere"] problem, but it looks very similar.


Thanks!
k


[ceph-users] Re: Can't join new mon - lossy channel, failing

2023-08-16 Thread Konstantin Shalygin
Hi,

> On 16 Aug 2023, at 11:30, Josef Johansson  wrote:
> 
> Let's do some serious necromancy here.
> 
> I just had this exact problem. It turns out that after rebooting all nodes (one 
> at a time, of course), the monitor could join perfectly.
> 
> Why? You tell me. We did not see any trace of the IP address in any dumps 
> that we could get hold of. I restarted all ceph-mgr daemons beforehand as well.

What is your release?
This deadlock may be fixed via [1]


[1] https://tracker.ceph.com/issues/55355
k


[ceph-users] Re: Can't join new mon - lossy channel, failing

2021-10-05 Thread Konstantin Shalygin
As a last resort we changed the IP address of this host, and the mon successfully 
joined the quorum. When we reverted the IP address, the mon could no longer join, 
so we think there is something on the switch side or on the old mons' side. From 
the old mons I checked connectivity to the new mon process via telnet, and it 
all works.
It would be good to build a reproducer for this network problem, to find out 
exactly which message of the Ceph protocol is broken.



k



[ceph-users] Re: Can't join new mon - lossy channel, failing

2021-10-04 Thread Stefan Kooman

On 10/4/21 15:58, Konstantin Shalygin wrote:



On 4 Oct 2021, at 16:38, Stefan Kooman wrote:


What procedure are you following to add the mon?


# ceph mon dump
epoch 10
fsid 677f4be1-cd98-496d-8b50-1f99df0df670
last_changed 2021-09-11 10:04:23.890922
created 2018-05-18 20:43:43.260897
min_mon_release 14 (nautilus)
0: [v2:10.40.0.81:3300/0,v1:10.40.0.81:6789/0] mon.ceph-01
1: [v2:10.40.0.83:3300/0,v1:10.40.0.83:6789/0] mon.ceph-03
2: [v2:10.40.0.86:3300/0,v1:10.40.0.86:6789/0] mon.ceph-06
dumped monmap epoch 10


sudo -u ceph ceph mon getmap -o /tmp/monmap.map
got monmap epoch 10
sudo -u ceph ceph-mon -i mon2 --mkfs --monmap /tmp/monmap.map
chown -R ceph:ceph /var/lib/ceph/mon/ceph-mon2
systemctl start ceph-mon@mon2


I'm missing the part where keyring is downloaded and used:

ceph auth get mon. -o /tmp/keyring
ceph mon getmap -o /tmp/monmap
chown -R ceph:ceph /var/lib/ceph/mon/ceph-mon2
ceph-mon -i mon2 --mkfs --monmap /tmp/monmap --keyring /tmp/keyring \
    --setuser ceph --setgroup ceph


Gr. Stefan

P.S. I have noticed that sometimes you first need to reboot the node 
before you are able to start the monitor with systemd.



[ceph-users] Re: Can't join new mon - lossy channel, failing

2021-10-04 Thread Stefan Kooman

On 10/4/21 15:27, Konstantin Shalygin wrote:

Hi,

I ran mkfs for a new mon, but the mon is stuck in probing. In the debug log I 
see: fault on lossy channel, failing. Does this mean a bad (lossy) network 
(crc mismatch)?


What procedure are you following to add the mon?

Is this physical hardware? Or a (cloned) VM?

Have you checked that you can connect with the other monitors on port 
6789 / 3300 (and vice versa)?


Gr. Stefan


[ceph-users] Re: Can't join new mon - lossy channel, failing

2021-10-04 Thread Konstantin Shalygin
This cluster doesn't use cephx. The global settings in ceph.conf disable it.


k

Sent from my iPhone

> On 4 Oct 2021, at 17:46, Stefan Kooman  wrote:
> 
> I'm missing the part where keyring is downloaded and used:
> 
> ceph auth get mon. -o /tmp/keyring
> ceph mon getmap -o /tmp/monmap
> chown -R ceph:ceph /var/lib/ceph/mon/ceph-mon2
> ceph-mon -i mon2 --mkfs --monmap /tmp/monmap --keyring /tmp/keyring --setuser 
> ceph --setgroup ceph



[ceph-users] Re: Can't join new mon - lossy channel, failing

2021-10-04 Thread Konstantin Shalygin
After this I see only logs on stderr. What exactly should I be looking for? Is 
there some grep keyword?


k

Sent from my iPhone

> On 4 Oct 2021, at 17:37, Vladimir Bashkirtsev  
> wrote:
> 
> I guess:
> 
> strace ceph-mon -d --id mon2 --setuser ceph --setgroup ceph
> 
> should do.
> 
> 
> 
> If you are overwhelmed with output, try -f instead of -d so that the mon debug 
> output goes to the log file.
> 
> 
> 
> Regards,
> 
> Vladimir
> 
> On 5/10/21 01:27, Konstantin Shalygin wrote:
>> 
>>> On 4 Oct 2021, at 17:07, Vladimir Bashkirtsev  
>>> wrote:
>>> 
>>> This line bothers me:
>>> 
>>> [v2:10.40.0.81:6898/2507925,v1:10.40.0.81:6899/2507925] conn(0x560287e4 
>>> 0x560287e56000 crc :-1 s=READY pgs=16872 cs=0 l=1 rev1=1 rx=0 
>>> tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) 
>>> Operation not permitted)
>>> 
>>> Maybe it is a good idea to run the mon under strace and see why your network 
>>> does not permit the frame read? msgr2 will show the message you referred 
>>> to if no data is actually received from the network.
>> 
>> Do you know exactly how to start strace to determine this?
>> 
>> 
>> 
>> k


[ceph-users] Re: Can't join new mon - lossy channel, failing

2021-10-04 Thread Vladimir Bashkirtsev

I guess:

strace ceph-mon -d --id mon2 --setuser ceph --setgroup ceph

should do.


If you are overwhelmed with output, try -f instead of -d so that the mon debug 
output goes to the log file.
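
As a rough illustration of what to look for in the strace output, the 
following minimal Python sketch keeps only failed network-related system 
calls. It assumes strace's usual line format (name(args) = ret ERRNO (message), 
optionally prefixed with a pid when strace is run with its own -f option to 
follow threads). The function name, the syscall list and the list of benign 
errors are illustrative assumptions, not something taken from this cluster.

```python
import re

# One strace line, for example:
#   [pid 4242] recvfrom(23, 0x7ffc0, 4096, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
STRACE_LINE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"   # optional pid prefix (strace -f, with or without -o)
    r"(?P<syscall>\w+)\(.*\)\s+=\s+"   # syscall name and its arguments
    r"(?P<ret>-?\d+)"                  # return value
    r"(?:\s+(?P<errno>[A-Z][A-Z0-9]*)\s+\(.*\))?\s*$"  # errno name, if the call failed
)

# Assumption: the calls worth watching when a socket misbehaves.
NETWORK_SYSCALLS = {
    "connect", "accept", "accept4", "shutdown",
    "read", "readv", "recv", "recvfrom", "recvmsg",
    "write", "writev", "send", "sendto", "sendmsg",
}

# Assumption: errors that are normal on non-blocking sockets and mostly noise.
BENIGN = {"EAGAIN", "EINPROGRESS", "EINTR"}


def failed_network_calls(lines):
    """Return a list of (syscall, errno) for failed network-related calls."""
    hits = []
    for line in lines:
        m = STRACE_LINE.match(line.strip())
        if not m or m["syscall"] not in NETWORK_SYSCALLS:
            continue
        if int(m["ret"]) >= 0 or m["errno"] is None or m["errno"] in BENIGN:
            continue
        hits.append((m["syscall"], m["errno"]))
    return hits
```

One possible use is to save the trace to a file with strace's -o option and 
pass the file's lines to failed_network_calls(); whatever remains is a short 
list of candidates to compare against the "read frame preamble failed" message.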



Regards,

Vladimir

On 5/10/21 01:27, Konstantin Shalygin wrote:


On 4 Oct 2021, at 17:07, Vladimir Bashkirtsev 
 wrote:


This line bothers me:

[v2:10.40.0.81:6898/2507925,v1:10.40.0.81:6899/2507925] 
conn(0x560287e4 0x560287e56000 crc :-1 s=READY pgs=16872 cs=0 l=1 
rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble 
failed r=-1 ((1) Operation not permitted)


Maybe it is a good idea to run the mon under strace and see why your 
network does not permit the frame read? msgr2 will show the message 
you referred to if no data is actually received from the 
network.


Do you know exactly how to start strace to determine this?



k



[ceph-users] Re: Can't join new mon - lossy channel, failing

2021-10-04 Thread Konstantin Shalygin


> On 4 Oct 2021, at 17:07, Vladimir Bashkirtsev  
> wrote:
> 
> This line bothers me:
> 
> [v2:10.40.0.81:6898/2507925,v1:10.40.0.81:6899/2507925] conn(0x560287e4 
> 0x560287e56000 crc :-1 s=READY pgs=16872 cs=0 l=1 rev1=1 rx=0 
> tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) 
> Operation not permitted)
> 
> Maybe it is a good idea to run the mon under strace and see why your network 
> does not permit the frame read? msgr2 will show the message you referred to 
> if no data is actually received from the network.

Do you know exactly how to start strace to determine this?



k


[ceph-users] Re: Can't join new mon - lossy channel, failing

2021-10-04 Thread Vladimir Bashkirtsev
This line bothers me:

[v2:10.40.0.81:6898/2507925,v1:10.40.0.81:6899/2507925] conn(0x560287e4 
0x560287e56000 crc :-1 s=READY pgs=16872 cs=0 l=1 rev1=1 rx=0 
tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) 
Operation not permitted)

Maybe it is a good idea to run the mon under strace and see why your network 
does not permit the frame read? msgr2 will show the message you referred to 
if no data is actually received from the network.
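
To make the log line above easier to read, here is a small Python sketch that 
splits it into its parts. The field meanings are assumptions based on how the 
messenger usually prints a connection: l=1 is taken to mark a lossy connection 
and r is the return code of the failed read. The mon ports 3300 and 6789 come 
from the monmap in this thread; the function names are made up for 
illustration.

```python
import re

# "v2:10.40.0.81:6898/2507925" -> protocol, IP, port (the nonce is dropped)
ADDR = re.compile(r"(v[12]):(\d{1,3}(?:\.\d{1,3}){3}):(\d+)/\d+")
# key=value pairs such as s=READY, l=1, rx=0
FIELD = re.compile(r"\b(\w+)=(-?\w+)")
# ").handle_read_frame_preamble_main read frame preamble failed r=-1"
FAILURE = re.compile(r"\)\.(\w+)\s+(.*?)\s+r=(-?\d+)")

MON_PORTS = {3300, 6789}  # ports the mons in this thread listen on


def parse_conn_line(line):
    """Split one msgr2 connection log line into a small dict."""
    fields = dict(FIELD.findall(line))
    failure = FAILURE.search(line)
    return {
        "peers": [(proto, ip, int(port)) for proto, ip, port in ADDR.findall(line)],
        "state": fields.get("s"),
        "lossy": fields.get("l") == "1",  # assumption: l=1 means lossy
        "where": failure.group(1) if failure else None,
        "what": failure.group(2) if failure else None,
        "r": int(failure.group(3)) if failure else None,
    }


def peer_is_mon_port(parsed):
    """True if any peer address in the parsed line uses a mon port."""
    return any(port in MON_PORTS for _, _, port in parsed["peers"])
```

Applied to the line above, the peer ports come out as 6898 and 6899 rather 
than 3300 or 6789, which suggests (but does not prove) that the failing 
connection is to another daemon on 10.40.0.81 and not to the mon itself.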

Regards,
Vladimir

On 5 October 2021 12:27:10 am AEDT, Konstantin Shalygin  wrote:
>Hi,
>
>I ran mkfs for a new mon, but the mon is stuck in probing. In the debug log I 
>see: fault on lossy channel, failing. Does this mean a bad (lossy) network 
>(crc mismatch)?
>
>
>2021-10-04 16:22:24.707 7f5952761700 10 mon.mon2@-1(probing) e10 probing other 
>monitors
>2021-10-04 16:22:24.707 7f5952761700  1 -- 
>[v2:10.40.0.82:3300/0,v1:10.40.0.82:6789/0] send_to--> mon 
>[v2:10.40.0.81:3300/0,v1:10.40.0.81:6789/0] -- mon_probe(probe 
>677f4be1-cd98-496d-8b50-1f99df0df670 name mon2 new mon_release 14) v7 -- ?+0 
>0x5602864cd480
>2021-10-04 16:22:24.707 7f5952761700  1 -- 
>[v2:10.40.0.82:3300/0,v1:10.40.0.82:6789/0] --> 
>[v2:10.40.0.81:3300/0,v1:10.40.0.81:6789/0] -- mon_probe(probe 
>677f4be1-cd98-496d-8b50-1f99df0df670 name mon2 new mon_release 14) v7 -- 
>0x5602864cd480 con 0x560285455600
>2021-10-04 16:22:24.707 7f5952761700  1 -- 
>[v2:10.40.0.82:3300/0,v1:10.40.0.82:6789/0] send_to--> mon 
>[v2:10.40.0.83:3300/0,v1:10.40.0.83:6789/0] -- mon_probe(probe 
>677f4be1-cd98-496d-8b50-1f99df0df670 name mon2 new mon_release 14) v7 -- ?+0 
>0x5602893ffc00
>2021-10-04 16:22:24.707 7f5952761700  1 -- 
>[v2:10.40.0.82:3300/0,v1:10.40.0.82:6789/0] --> 
>[v2:10.40.0.83:3300/0,v1:10.40.0.83:6789/0] -- mon_probe(probe 
>677f4be1-cd98-496d-8b50-1f99df0df670 name mon2 new mon_release 14) v7 -- 
>0x5602893ffc00 con 0x560285455a80
>2021-10-04 16:22:24.707 7f5952761700  1 -- 
>[v2:10.40.0.82:3300/0,v1:10.40.0.82:6789/0] send_to--> mon 
>[v2:10.40.0.86:3300/0,v1:10.40.0.86:6789/0] -- mon_probe(probe 
>677f4be1-cd98-496d-8b50-1f99df0df670 name mon2 new mon_release 14) v7 -- ?+0 
>0x560288e98a00
>2021-10-04 16:22:24.707 7f5952761700  1 -- 
>[v2:10.40.0.82:3300/0,v1:10.40.0.82:6789/0] --> 
>[v2:10.40.0.86:3300/0,v1:10.40.0.86:6789/0] -- mon_probe(probe 
>677f4be1-cd98-496d-8b50-1f99df0df670 name mon2 new mon_release 14) v7 -- 
>0x560288e98a00 con 0x5602862d8000
>2021-10-04 16:22:24.707 7f594ff5c700  1 -- 
>[v2:10.40.0.82:3300/0,v1:10.40.0.82:6789/0] <== mon.1 v2:10.40.0.83:3300/0 581 
> mon_probe(reply 677f4be1-cd98-496d-8b50-1f99df0df670 name ceph-03 quorum 
>0,1,2 paxos( fc 127723108 lc 127723840 ) mon_release 14) v7  504+0+0 (crc 
>0 0 0) 0x560287a94f00 con 0x560285455a80
>2021-10-04 16:22:24.707 7f594ff5c700 10 mon.mon2@-1(probing) e10 handle_probe 
>mon_probe(reply 677f4be1-cd98-496d-8b50-1f99df0df670 name ceph-03 quorum 0,1,2 
>paxos( fc 127723108 lc 127723840 ) mon_release 14) v7
>2021-10-04 16:22:24.707 7f594ff5c700 10 mon.mon2@-1(probing) e10 
>handle_probe_reply mon.1 v2:10.40.0.83:3300/0 mon_probe(reply 
>677f4be1-cd98-496d-8b50-1f99df0df670 name ceph-03 quorum 0,1,2 paxos( fc 
>127723108 lc 127723840 ) mon_release 14) v7
>2021-10-04 16:22:24.707 7f594ff5c700 10 mon.mon2@-1(probing) e10  monmap is 
>e10: 3 mons at 
>{ceph-01=[v2:10.40.0.81:3300/0,v1:10.40.0.81:6789/0],ceph-03=[v2:10.40.0.83:3300/0,v1:10.40.0.83:6789/0],ceph-06=[v2:10.40.0.86:3300/0,v1:10.40.0.86:6789/0]}
>2021-10-04 16:22:24.707 7f594ff5c700 10 mon.mon2@-1(probing) e10  peer name is 
>ceph-03
>2021-10-04 16:22:24.707 7f594ff5c700 10 mon.mon2@-1(probing) e10  existing 
>quorum 0,1,2
>2021-10-04 16:22:24.707 7f594ff5c700 10 mon.mon2@-1(probing) e10  peer paxos 
>version 127723840 vs my version 127723835 (ok)
>2021-10-04 16:22:24.707 7f594ff5c700 10 mon.mon2@-1(probing) e10  ready to 
>join, but i'm not in the monmap or my addr is blank, trying to join
>2021-10-04 16:22:24.707 7f594ff5c700  1 -- 
>[v2:10.40.0.82:3300/0,v1:10.40.0.82:6789/0] send_to--> mon 
>[v2:10.40.0.81:3300/0,v1:10.40.0.81:6789/0] -- mon_join(mon2 
>[v2:10.40.0.82:3300/0,v1:10.40.0.82:6789/0]) v2 -- ?+0 0x5602864001c0
>2021-10-04 16:22:24.707 7f594ff5c700  1 -- 
>[v2:10.40.0.82:3300/0,v1:10.40.0.82:6789/0] --> 
>[v2:10.40.0.81:3300/0,v1:10.40.0.81:6789/0] -- mon_join(mon2 
>[v2:10.40.0.82:3300/0,v1:10.40.0.82:6789/0]) v2 -- 0x5602864001c0 con 
>0x560285455600
>2021-10-04 16:22:24.707 7f594ff5c700  1 -- 
>[v2:10.40.0.82:3300/0,v1:10.40.0.82:6789/0] <== mon.2 v2:10.40.0.86:3300/0 574 
> mon_probe(reply 677f4be1-cd98-496d-8b50-1f99df0df670 name ceph-06 quorum 
>0,1,2 paxos( fc 127723108 lc 127723840 ) mon_release 14) v7  504+0+0 (crc 
>0 0 0) 0x56028aa25480 con 0x5602862d8000
>2021-10-04 16:22:24.707 7f594ff5c700 10 mon.mon2@-1(probing) e10 handle_probe 
>mon_probe(reply 677f4be1-cd98-496d-8b50-1f99df0df670 name ceph-06 quorum 0,1,2 
>paxos( fc 127723108 lc 127723840 ) mon_release 14) v7
>2021-10-04 16:22:24.707 7f594ff5c700 

[ceph-users] Re: Can't join new mon - lossy channel, failing

2021-10-04 Thread Konstantin Shalygin



> On 4 Oct 2021, at 16:38, Stefan Kooman  wrote:
> 
> What procedure are you following to add the mon?

# ceph mon dump
epoch 10
fsid 677f4be1-cd98-496d-8b50-1f99df0df670
last_changed 2021-09-11 10:04:23.890922
created 2018-05-18 20:43:43.260897
min_mon_release 14 (nautilus)
0: [v2:10.40.0.81:3300/0,v1:10.40.0.81:6789/0] mon.ceph-01
1: [v2:10.40.0.83:3300/0,v1:10.40.0.83:6789/0] mon.ceph-03
2: [v2:10.40.0.86:3300/0,v1:10.40.0.86:6789/0] mon.ceph-06
dumped monmap epoch 10


sudo -u ceph ceph mon getmap -o /tmp/monmap.map
got monmap epoch 10
sudo -u ceph ceph-mon -i mon2 --mkfs --monmap /tmp/monmap.map
chown -R ceph:ceph /var/lib/ceph/mon/ceph-mon2
systemctl start ceph-mon@mon2
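
Before starting the new mon it can help to confirm what the monmap really 
contains, for instance whether the address the new mon will bind to is already 
claimed by another entry. The sketch below parses the plain-text `ceph mon dump` 
output shown above; the function names are hypothetical, and on a live cluster 
the JSON form of the dump would probably be sturdier to parse.

```python
import re

# "0: [v2:10.40.0.81:3300/0,v1:10.40.0.81:6789/0] mon.ceph-01"
MON_ROW = re.compile(r"^(\d+):\s+\[([^\]]*)\]\s+mon\.(\S+)\s*$")
# Captures only the IP out of "v2:10.40.0.81:3300/0"
ADDR_IP = re.compile(r"v[12]:(\d{1,3}(?:\.\d{1,3}){3}):\d+/\d+")


def mons_from_dump(text):
    """Map mon name -> set of IP addresses, from `ceph mon dump` text."""
    mons = {}
    for line in text.splitlines():
        m = MON_ROW.match(line.strip())
        if m:
            mons[m.group(3)] = set(ADDR_IP.findall(m.group(2)))
    return mons


def mon_using_ip(text, ip):
    """Return the name of the mon that owns `ip`, or None if no mon does."""
    for name, ips in mons_from_dump(text).items():
        if ip in ips:
            return name
    return None
```

With the dump above, mon_using_ip(dump, "10.40.0.82") returns None: no 
existing monmap entry uses the address that the new mon binds to in the debug 
log.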

> 
> Is this physical hardware? Or a (cloned) VM?

This is hardware

> 
> Have you checked that you can connect with the other monitors on port 6789 / 
> 3300 (and vice versa)?

Yes, of course:

root@mon2:/var/lib/ceph
# telnet 10.40.0.81 3300
Trying 10.40.0.81...
Connected to 10.40.0.81.
Escape character is '^]'.
ceph v2
Terminated
root@mon2:/var/lib/ceph
# telnet 10.40.0.83 3300
Trying 10.40.0.83...
Connected to 10.40.0.83.
Escape character is '^]'.
ceph v2
Terminated
root@mon2:/var/lib/ceph
# telnet 10.40.0.86 3300
Trying 10.40.0.86...
Connected to 10.40.0.86.
Escape character is '^]'.
ceph v2
Terminated
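
The same check can be scripted. This is a minimal Python sketch that connects 
to each mon on port 3300 and looks for the "ceph v2" banner seen in the telnet 
sessions above; the function names and the timeout are assumptions chosen for 
illustration.

```python
import socket

MSGR2_BANNER_PREFIX = b"ceph v2"  # what telnet printed on port 3300 above


def read_banner(host, port=3300, timeout=3.0):
    """Connect and read until the banner prefix could be present, or EOF."""
    data = b""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while len(data) < len(MSGR2_BANNER_PREFIX):
            chunk = sock.recv(64)
            if not chunk:  # peer closed the connection
                break
            data += chunk
    return data


def speaks_msgr2(banner):
    """True if the bytes received start with the msgr2 banner prefix."""
    return banner.startswith(MSGR2_BANNER_PREFIX)


def check_mons(hosts, port=3300):
    """Return host -> 'ok', 'no banner' or 'error: ...' for each host."""
    results = {}
    for host in hosts:
        try:
            ok = speaks_msgr2(read_banner(host, port))
            results[host] = "ok" if ok else "no banner"
        except OSError as exc:  # refused, unreachable, timed out, ...
            results[host] = f"error: {exc}"
    return results
```

For this cluster the call would be check_mons(["10.40.0.81", "10.40.0.83", 
"10.40.0.86"]). Note that a passing result only shows that TCP connects and the 
first bytes arrive; as this thread shows, the mon can still fail to join.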




k
