[ceph-users] Re: Monitor unable to join existing cluster, stuck at probing

2019-10-18 Thread Mathijs Smit
Thank you for taking the time to reply to my issue.

I have increased the log level to 10/10 for both the messenger (debug_ms) and 
the monitor (debug_mon) and see the following pattern repeat in the logs. 
However, the output is too verbose for me to deduce the problem from it.
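
For reference, I raised the levels at runtime through the local admin socket 
on the new mon, roughly like this (daemon name per my setup):

    # on mon4, via the daemon's admin socket
    ceph daemon mon.mon4 config set debug_mon 10/10
    ceph daemon mon.mon4 config set debug_ms 10/10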

May I ask again for advice?

Log output:

2019-10-18 10:58:28.962 7fd81fc02700  4 mon.mon4@-1(probing) e0 probe_timeout 
0x55de1e9c51a0
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 bootstrap
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 
sync_reset_requester
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 
unregister_cluster_logger - not registered
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 
cancel_probe_timeout (none scheduled)
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 _reset
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 
cancel_probe_timeout (none scheduled)
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 timecheck_finish
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 
scrub_event_cancel
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 scrub_reset
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 
cancel_probe_timeout (none scheduled)
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 
reset_probe_timeout 0x55de1e9c5260 after 2 seconds
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 probing other 
monitors
2019-10-18 10:58:28.962 7fd81fc02700  1 -- 10.200.1.104:6789/0 _send_message--> 
mon.0 10.200.1.101:6789/0 -- mon_probe(probe 
aaf1547b-8944-4f48-b354-93659202c6fe name mon4 new) v6 -- ?+0 0x55de1e9e5400
2019-10-18 10:58:28.962 7fd81fc02700  1 -- 10.200.1.104:6789/0 --> 
10.200.1.101:6789/0 -- mon_probe(probe aaf1547b-8944-4f48-b354-93659202c6fe 
name mon4 new) v6 -- 0x55de1e9e5400 con 0
2019-10-18 10:58:28.962 7fd81fc02700  1 -- 10.200.1.104:6789/0 _send_message--> 
mon.1 10.200.1.102:6789/0 -- mon_probe(probe 
aaf1547b-8944-4f48-b354-93659202c6fe name mon4 new) v6 -- ?+0 0x55de1e9e5680
2019-10-18 10:58:28.962 7fd81fc02700  1 -- 10.200.1.104:6789/0 --> 
10.200.1.102:6789/0 -- mon_probe(probe aaf1547b-8944-4f48-b354-93659202c6fe 
name mon4 new) v6 -- 0x55de1e9e5680 con 0
2019-10-18 10:58:28.962 7fd81fc02700  1 -- 10.200.1.104:6789/0 _send_message--> 
mon.2 10.200.1.103:6789/0 -- mon_probe(probe 
aaf1547b-8944-4f48-b354-93659202c6fe name mon4 new) v6 -- ?+0 0x55de1e9e5900
2019-10-18 10:58:28.962 7fd81fc02700  1 -- 10.200.1.104:6789/0 --> 
10.200.1.103:6789/0 -- mon_probe(probe aaf1547b-8944-4f48-b354-93659202c6fe 
name mon4 new) v6 -- 0x55de1e9e5900 con 0
2019-10-18 10:58:28.962 7fd81abf8700 10 -- 10.200.1.104:6789/0 >> 
10.200.1.101:6789/0 conn(0x55de1e7d3e00 :-1 s=STATE_OPEN pgs=2274435 cs=1 
l=0).handle_write
2019-10-18 10:58:28.962 7fd819bf6700 10 -- 10.200.1.104:6789/0 >> 
10.200.1.102:6789/0 conn(0x55de1e7d4400 :-1 s=STATE_OPEN pgs=2284339 cs=1 
l=0).handle_write
2019-10-18 10:58:28.962 7fd81a3f7700 10 -- 10.200.1.104:6789/0 >> 
10.200.1.103:6789/0 conn(0x55de1e7d4a00 :-1 s=STATE_OPEN pgs=2288108 cs=1 
l=0).handle_write
2019-10-18 10:58:28.963 7fd81abf8700 10 -- 10.200.1.104:6789/0 >> 
10.200.1.101:6789/0 conn(0x55de1e7d3e00 :-1 s=STATE_OPEN pgs=2274435 cs=1 
l=0)._try_send sent bytes 136 remaining bytes 0
2019-10-18 10:58:28.963 7fd81abf8700 10 -- 10.200.1.104:6789/0 >> 
10.200.1.101:6789/0 conn(0x55de1e7d3e00 :-1 s=STATE_OPEN pgs=2274435 cs=1 
l=0).write_message sending 0x55de1e9e5400 done.
2019-10-18 10:58:28.963 7fd81a3f7700 10 -- 10.200.1.104:6789/0 >> 
10.200.1.103:6789/0 conn(0x55de1e7d4a00 :-1 s=STATE_OPEN pgs=2288108 cs=1 
l=0)._try_send sent bytes 136 remaining bytes 0
2019-10-18 10:58:28.963 7fd81a3f7700 10 -- 10.200.1.104:6789/0 >> 
10.200.1.103:6789/0 conn(0x55de1e7d4a00 :-1 s=STATE_OPEN pgs=2288108 cs=1 
l=0).write_message sending 0x55de1e9e5900 done.
2019-10-18 10:58:28.963 7fd819bf6700 10 -- 10.200.1.104:6789/0 >> 
10.200.1.102:6789/0 conn(0x55de1e7d4400 :-1 s=STATE_OPEN pgs=2284339 cs=1 
l=0)._try_send sent bytes 136 remaining bytes 0
2019-10-18 10:58:28.963 7fd819bf6700 10 -- 10.200.1.104:6789/0 >> 
10.200.1.102:6789/0 conn(0x55de1e7d4400 :-1 s=STATE_OPEN pgs=2284339 cs=1 
l=0).write_message sending 0x55de1e9e5680 done.
2019-10-18 10:58:28.963 7fd81abf8700 10 -- 10.200.1.104:6789/0 >> 
10.200.1.101:6789/0 conn(0x55de1e7d3e00 :-1 s=STATE_OPEN_TAG_ACK pgs=2274435 
cs=1 l=0).handle_ack got ack seq 20 >= 20 on 0x55de1e9e5400 mon_probe(probe 
aaf1547b-8944-4f48-b354-93659202c6fe name mon4 new) v6
2019-10-18 10:58:28.963 7fd81a3f7700 10 -- 10.200.1.104:6789/0 >> 
10.200.1.103:6789/0 conn(0x55de1e7d4a00 :-1 s=STATE_OPEN_TAG_ACK pgs=2288108 
cs=1 l=0).handle_ack got ack seq 20 >= 20 on 0x55de1e9e5900 mon_probe(probe 
aaf1547b-8944-4f48-b354-93659202c6fe name mon4 new) v6
2019-10-18 10:58:28.963 7fd819bf6700 10 -- 10.200.1.104:6789/0 >> 
10.200.1.102:6789/0 conn(0x55de1e7d4400 :-1 s=STATE_OPEN_TAG_ACK pgs=2284339 
cs

[ceph-users] Re: Monitor unable to join existing cluster, stuck at probing

2019-10-16 Thread Paul Emmerich
This does sound like a network problem. Try increasing the log levels for
the mon (debug_mon = 10/10) and maybe the messenger (debug_ms = 5/5, or
10/10, which is very noisy) to see where it is stuck.
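
For a mon that is not yet in quorum, one simple way is to set these in
ceph.conf on that host and restart the daemon, for example:

    [mon.mon4]
        debug mon = 10/10
        debug ms = 10/10

(or inject them at runtime via the local admin socket).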


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Oct 16, 2019 at 1:42 PM  wrote:
>
> Hi,
>
> I'm currently working on upgrading the existing monitors within my cluster. 
> During the first deployment of this production cluster I made some choices 
> that in hindsight were not the best. But it worked, I learned, and now I 
> wish to remediate my previous bad choices.
>
> The cluster consists of three monitors that are currently in quorum, and I 
> wish to upgrade each of them by fully removing it from the cluster and 
> rejoining it after a complete reinstall of the OS (new hostname, new IP). To 
> maintain quorum during this process I want to temporarily add a fourth 
> monitor, but this does not go as planned: the monitor is added with 
> `ceph-deploy mon add mon4` but never leaves the probing state (see log below).
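>
> In outline, the replace cycle I'm attempting looks roughly like this 
> (hostnames per my setup):
>
>     # add the temporary fourth monitor
>     ceph-deploy mon add mon4
>
>     # check whether it reached quorum
>     ceph quorum_status --format json-pretty
>     ceph daemon mon.mon4 mon_status        # on mon4 itself
>
>     # once quorum holds with four mons, remove and reinstall an old one
>     ceph mon remove mon1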
>
> I have verified all networking and firewall settings and don't notice any 
> connection errors, nor do I see any odd hostnames or IP addresses in the 
> existing monmap on any of the hosts. I have also manually confirmed that all 
> the keys on the cluster are the same, so I don't suspect an authentication 
> error.
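>
> For reference, the checks were along these lines (addresses and paths per 
> my setup):
>
>     # connectivity from mon4 to an existing mon
>     nc -zv 10.200.1.101 6789
>
>     # dump the monmap an existing mon is using
>     ceph mon getmap -o /tmp/monmap
>     monmaptool --print /tmp/monmap
>
>     # compare the mon keyrings across hosts
>     md5sum /var/lib/ceph/mon/ceph-*/keyring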
>
> I hope someone can offer some guidance.
>
> Thx.
>
> Log from mon4 (/var/log/ceph/ceph-mon.mon4.log):
>
> 2019-10-16 11:21:51.960 7fc709c73a00  0 mon.mon4 does not exist in monmap, 
> will attempt to join an existing cluster
> 2019-10-16 11:21:51.962 7fc709c73a00  0 using public_addr 10.200.1.104:0/0 -> 
> 10.200.1.104:6789/0
> 2019-10-16 11:21:51.963 7fc709c73a00  0 starting mon.mon4 rank -1 at public 
> addr 10.200.1.104:6789/0 at bind addr 10.200.1.104:6789/0 mon_data 
> /var/lib/ceph/mon/ceph-mon4 fsid aaf1547b-8944-4f48-b354-93659202c6fe
> 2019-10-16 11:21:51.964 7fc709c73a00  0 starting mon.mon4 rank -1 at 
> 10.200.1.104:6789/0 mon_data /var/lib/ceph/mon/ceph-mon4 fsid 
> aaf1547b-8944-4f48-b354-93659202c6fe
> 2019-10-16 11:21:51.965 7fc709c73a00  1 mon.mon4@-1(probing) e0 preinit fsid 
> aaf1547b-8944-4f48-b354-93659202c6fe
> 2019-10-16 11:21:51.965 7fc709c73a00  1 mon.mon4@-1(probing) e0  
> initial_members mon1,mon2,mon3,mon4, filtering seed monmap
> 2019-10-16 11:21:51.965 7fc709c73a00  1 mon.mon4@-1(probing).mds e0 Unable to 
> load 'last_metadata'
> 2019-10-16 11:21:51.967 7fc709c73a00  0 mon.mon4@-1(probing) e0  my rank is 
> now 3 (was -1)
> 2019-10-16 11:21:54.054 7fc6f934b700  0 log_channel(audit) log [DBG] : 
> from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
> 2019-10-16 11:21:54.054 7fc6f934b700  0 log_channel(audit) log [DBG] : 
> from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
> 2019-10-16 11:21:54.300 7fc6f934b700  0 log_channel(audit) log [DBG] : 
> from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
> 2019-10-16 11:21:54.300 7fc6f934b700  0 log_channel(audit) log [DBG] : 
> from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
> 2019-10-16 11:22:26.967 7fc6f5ad5700 -1 mon.mon4@3(probing) e0 
> get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 
> at 2019-10-16 11:21:54.055387)
> 2019-10-16 11:22:31.967 7fc6f5ad5700 -1 mon.mon4@3(probing) e0 
> get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 
> at 2019-10-16 11:21:54.055387)
> 2019-10-16 11:22:36.967 7fc6f5ad5700 -1 mon.mon4@3(probing) e0 
> get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 
> at 2019-10-16 11:21:54.055387)
> 2019-10-16 11:22:37.478 7fc6f934b700  0 log_channel(audit) log [DBG] : 
> from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
> 2019-10-16 11:22:37.478 7fc6f934b700  0 log_channel(audit) log [DBG] : 
> from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
> 2019-10-16 11:22:41.968 7fc6f5ad5700 -1 mon.mon4@3(probing) e0 
> get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 
> at 2019-10-16 11:21:54.055387)
> 2019-10-16 11:22:46.968 7fc6f5ad5700 -1 mon.mon4@3(probing) e0 
> get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 
> at 2019-10-16 11:21:54.055387)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io