[ceph-users] Re: Monitor unable to join existing cluster, stuck at probing
Thank you for taking the time to reply to my issue. I have increased the log level to 10/10 for both the messenger and the monitor and see the following pattern repeating in the logs. However, I cannot make enough sense of this very verbose output to deduce the problem. May I again ask for advice?

Log output:

2019-10-18 10:58:28.962 7fd81fc02700 4 mon.mon4@-1(probing) e0 probe_timeout 0x55de1e9c51a0
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 bootstrap
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 sync_reset_requester
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 unregister_cluster_logger - not registered
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 cancel_probe_timeout (none scheduled)
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 _reset
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 cancel_probe_timeout (none scheduled)
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 timecheck_finish
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 scrub_event_cancel
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 scrub_reset
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 cancel_probe_timeout (none scheduled)
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 reset_probe_timeout 0x55de1e9c5260 after 2 seconds
2019-10-18 10:58:28.962 7fd81fc02700 10 mon.mon4@-1(probing) e0 probing other monitors
2019-10-18 10:58:28.962 7fd81fc02700 1 -- 10.200.1.104:6789/0 _send_message--> mon.0 10.200.1.101:6789/0 -- mon_probe(probe aaf1547b-8944-4f48-b354-93659202c6fe name mon4 new) v6 -- ?+0 0x55de1e9e5400
2019-10-18 10:58:28.962 7fd81fc02700 1 -- 10.200.1.104:6789/0 --> 10.200.1.101:6789/0 -- mon_probe(probe aaf1547b-8944-4f48-b354-93659202c6fe name mon4 new) v6 -- 0x55de1e9e5400 con 0
2019-10-18 10:58:28.962 7fd81fc02700 1 -- 10.200.1.104:6789/0 _send_message--> mon.1 10.200.1.102:6789/0 -- mon_probe(probe aaf1547b-8944-4f48-b354-93659202c6fe name mon4 new) v6 -- ?+0 0x55de1e9e5680
2019-10-18 10:58:28.962 7fd81fc02700 1 -- 10.200.1.104:6789/0 --> 10.200.1.102:6789/0 -- mon_probe(probe aaf1547b-8944-4f48-b354-93659202c6fe name mon4 new) v6 -- 0x55de1e9e5680 con 0
2019-10-18 10:58:28.962 7fd81fc02700 1 -- 10.200.1.104:6789/0 _send_message--> mon.2 10.200.1.103:6789/0 -- mon_probe(probe aaf1547b-8944-4f48-b354-93659202c6fe name mon4 new) v6 -- ?+0 0x55de1e9e5900
2019-10-18 10:58:28.962 7fd81fc02700 1 -- 10.200.1.104:6789/0 --> 10.200.1.103:6789/0 -- mon_probe(probe aaf1547b-8944-4f48-b354-93659202c6fe name mon4 new) v6 -- 0x55de1e9e5900 con 0
2019-10-18 10:58:28.962 7fd81abf8700 10 -- 10.200.1.104:6789/0 >> 10.200.1.101:6789/0 conn(0x55de1e7d3e00 :-1 s=STATE_OPEN pgs=2274435 cs=1 l=0).handle_write
2019-10-18 10:58:28.962 7fd819bf6700 10 -- 10.200.1.104:6789/0 >> 10.200.1.102:6789/0 conn(0x55de1e7d4400 :-1 s=STATE_OPEN pgs=2284339 cs=1 l=0).handle_write
2019-10-18 10:58:28.962 7fd81a3f7700 10 -- 10.200.1.104:6789/0 >> 10.200.1.103:6789/0 conn(0x55de1e7d4a00 :-1 s=STATE_OPEN pgs=2288108 cs=1 l=0).handle_write
2019-10-18 10:58:28.963 7fd81abf8700 10 -- 10.200.1.104:6789/0 >> 10.200.1.101:6789/0 conn(0x55de1e7d3e00 :-1 s=STATE_OPEN pgs=2274435 cs=1 l=0)._try_send sent bytes 136 remaining bytes 0
2019-10-18 10:58:28.963 7fd81abf8700 10 -- 10.200.1.104:6789/0 >> 10.200.1.101:6789/0 conn(0x55de1e7d3e00 :-1 s=STATE_OPEN pgs=2274435 cs=1 l=0).write_message sending 0x55de1e9e5400 done.
2019-10-18 10:58:28.963 7fd81a3f7700 10 -- 10.200.1.104:6789/0 >> 10.200.1.103:6789/0 conn(0x55de1e7d4a00 :-1 s=STATE_OPEN pgs=2288108 cs=1 l=0)._try_send sent bytes 136 remaining bytes 0
2019-10-18 10:58:28.963 7fd81a3f7700 10 -- 10.200.1.104:6789/0 >> 10.200.1.103:6789/0 conn(0x55de1e7d4a00 :-1 s=STATE_OPEN pgs=2288108 cs=1 l=0).write_message sending 0x55de1e9e5900 done.
2019-10-18 10:58:28.963 7fd819bf6700 10 -- 10.200.1.104:6789/0 >> 10.200.1.102:6789/0 conn(0x55de1e7d4400 :-1 s=STATE_OPEN pgs=2284339 cs=1 l=0)._try_send sent bytes 136 remaining bytes 0
2019-10-18 10:58:28.963 7fd819bf6700 10 -- 10.200.1.104:6789/0 >> 10.200.1.102:6789/0 conn(0x55de1e7d4400 :-1 s=STATE_OPEN pgs=2284339 cs=1 l=0).write_message sending 0x55de1e9e5680 done.
2019-10-18 10:58:28.963 7fd81abf8700 10 -- 10.200.1.104:6789/0 >> 10.200.1.101:6789/0 conn(0x55de1e7d3e00 :-1 s=STATE_OPEN_TAG_ACK pgs=2274435 cs=1 l=0).handle_ack got ack seq 20 >= 20 on 0x55de1e9e5400 mon_probe(probe aaf1547b-8944-4f48-b354-93659202c6fe name mon4 new) v6
2019-10-18 10:58:28.963 7fd81a3f7700 10 -- 10.200.1.104:6789/0 >> 10.200.1.103:6789/0 conn(0x55de1e7d4a00 :-1 s=STATE_OPEN_TAG_ACK pgs=2288108 cs=1 l=0).handle_ack got ack seq 20 >= 20 on 0x55de1e9e5900 mon_probe(probe aaf1547b-8944-4f48-b354-93659202c6fe name mon4 new) v6
2019-10-18 10:58:28.963 7fd819bf6700 10 -- 10.200.1.104:6789/0 >> 10.200.1.102:6789/0 conn(0x55de1e7d4400 :-1 s=STATE_OPEN_TAG_ACK pgs=2284339 cs
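For anyone debugging a similar probe loop, the stuck monitor's own view can be queried through its admin socket and compared against the quorum's monmap. A sketch of the commands (assuming default admin-socket paths, and that mon4 is the stuck daemon):

```shell
# On the stuck node: ask mon4 directly for its state; while stuck it
# should report state "probing" and list the peers it can(not) see
ceph daemon mon.mon4 mon_status

# On a quorum member: export the cluster's current monmap and print it,
# to verify mon4's name and address match what the cluster expects
ceph mon getmap -o /tmp/monmap
monmaptool --print /tmp/monmap
```

These only read state, so they are safe to run on a production cluster.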
[ceph-users] Re: Monitor unable to join existing cluster, stuck at probing
This does sound like a network problem. Try increasing the log level for the mon (debug_mon = 10/10) and maybe the messenger (debug_ms = 5/5 or 10/10; very noisy, but it shows where it is stuck).

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Oct 16, 2019 at 1:42 PM wrote:
>
> Hi,
>
> I'm currently working on upgrading the existing monitors in my cluster.
> During the first deployment of this production cluster I made some choices
> that in hindsight were not the best. But it worked, I learned, and now I
> wish to remediate those earlier choices.
>
> The cluster consists of three monitors that are currently in quorum. I wish
> to upgrade each of them by fully removing it from the cluster and rejoining
> it after a complete reinstall of the OS (new hostname, new IP). To maintain
> quorum during this process I want to temporarily add a fourth monitor, but
> this does not go as planned: the monitor is added with `ceph-deploy mon add
> mon4` but never leaves the probing state (see log below).
>
> I have verified all networking and firewall settings and don't notice any
> connection errors, nor do I see any wrong hostnames or IP addresses in the
> existing monmap on any of the hosts. I have also manually confirmed that
> all keys on the cluster are the same, so I don't suspect an authentication
> error.
>
> Hope someone can offer guidance.
>
> Thx.
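The suggested debug levels can be raised on the running daemon without a restart. A sketch via the admin socket (option names as in recent Ceph releases; the reset values assume the shipped defaults):

```shell
# Raise monitor and messenger debugging on the stuck mon at runtime
ceph daemon mon.mon4 config set debug_mon 10/10
ceph daemon mon.mon4 config set debug_ms 5/5

# Reset to the usual defaults once the logs have been captured
ceph daemon mon.mon4 config set debug_mon 1/5
ceph daemon mon.mon4 config set debug_ms 0/5
```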
>
> Log from mon4
> /var/log/ceph/ceph-mon.mon4.log
>
> 2019-10-16 11:21:51.960 7fc709c73a00 0 mon.mon4 does not exist in monmap, will attempt to join an existing cluster
> 2019-10-16 11:21:51.962 7fc709c73a00 0 using public_addr 10.200.1.104:0/0 -> 10.200.1.104:6789/0
> 2019-10-16 11:21:51.963 7fc709c73a00 0 starting mon.mon4 rank -1 at public addr 10.200.1.104:6789/0 at bind addr 10.200.1.104:6789/0 mon_data /var/lib/ceph/mon/ceph-mon4 fsid aaf1547b-8944-4f48-b354-93659202c6fe
> 2019-10-16 11:21:51.964 7fc709c73a00 0 starting mon.mon4 rank -1 at 10.200.1.104:6789/0 mon_data /var/lib/ceph/mon/ceph-mon4 fsid aaf1547b-8944-4f48-b354-93659202c6fe
> 2019-10-16 11:21:51.965 7fc709c73a00 1 mon.mon4@-1(probing) e0 preinit fsid aaf1547b-8944-4f48-b354-93659202c6fe
> 2019-10-16 11:21:51.965 7fc709c73a00 1 mon.mon4@-1(probing) e0 initial_members mon1,mon2,mon3,mon4, filtering seed monmap
> 2019-10-16 11:21:51.965 7fc709c73a00 1 mon.mon4@-1(probing).mds e0 Unable to load 'last_metadata'
> 2019-10-16 11:21:51.967 7fc709c73a00 0 mon.mon4@-1(probing) e0 my rank is now 3 (was -1)
> 2019-10-16 11:21:54.054 7fc6f934b700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
> 2019-10-16 11:21:54.054 7fc6f934b700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
> 2019-10-16 11:21:54.300 7fc6f934b700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
> 2019-10-16 11:21:54.300 7fc6f934b700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
> 2019-10-16 11:22:26.967 7fc6f5ad5700 -1 mon.mon4@3(probing) e0 get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 at 2019-10-16 11:21:54.055387)
> 2019-10-16 11:22:31.967 7fc6f5ad5700 -1 mon.mon4@3(probing) e0 get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 at 2019-10-16 11:21:54.055387)
> 2019-10-16 11:22:36.967 7fc6f5ad5700 -1 mon.mon4@3(probing) e0 get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 at 2019-10-16 11:21:54.055387)
> 2019-10-16 11:22:37.478 7fc6f934b700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
> 2019-10-16 11:22:37.478 7fc6f934b700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
> 2019-10-16 11:22:41.968 7fc6f5ad5700 -1 mon.mon4@3(probing) e0 get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 at 2019-10-16 11:21:54.055387)
> 2019-10-16 11:22:46.968 7fc6f5ad5700 -1 mon.mon4@3(probing) e0 get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 at 2019-10-16 11:21:54.055387)

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io