> I did not mean to have a back network configured but it is taken down. Of
> course this won't work. What I mean is that you:
>
> 1. remove the cluster network definition from the cluster config (ceph.conf
>    and/or ceph config ...)
> 2. restart OSDs to apply the change
> 3. remove the physical network
>
> Step 2 will most likely require downtime as you write, because during the
> transition some OSDs will think all OSDs listen on 2 networks while other
> OSDs think everyone is listening on 1 network. If you can afford to take
> all clients down and do a full cluster restart, this is doable. If you set
> noout,nodown,pause and maybe some other flags
> (norebalance,nobackfill,norecover), and wait for all client *and* recovery
> I/O to complete, it is probably possible to do this transition without
> disconnecting clients by just restarting all OSDs failure domain by
> failure domain.
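For reference, a minimal sketch of what steps 1 and 2 plus the suggested flags could look like on the CLI. This assumes a release with the centralized config database (`ceph config`); the `ceph-osd.target` unit name and the exact flag set are assumptions to adapt to your own deployment:

```shell
# Sketch only -- adapt to your deployment before running anything.

# Quiesce I/O first: set the flags mentioned above
for f in noout nodown pause norebalance nobackfill norecover; do
  ceph osd set "$f"
done

# 1. Remove the cluster network definition (config db; also check ceph.conf)
ceph config rm global cluster_network

# 2. Restart OSDs, one failure domain (e.g. one host) at a time,
#    after client and recovery I/O has drained
systemctl restart ceph-osd.target

# Once all OSDs are back up and stable, clear the flags again
for f in noout nodown pause norebalance nobackfill norecover; do
  ceph osd unset "$f"
done
```

Between the flag-setting and the restarts you would watch `ceph -s` until client and recovery I/O has actually stopped, as described above.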
Perhaps temporarily setting mon_osd_min_down_reporters to a large number
would help avoid flapping. I fear at least some [RBD] clients would still
experience timeouts / kernel panics, though.

> After the transition things should work fine with just 1 network.
>
> In any case, my recommendation would be to keep both networks if they are
> on different VLAN IDs. Then nothing special is required to do the
> transition, and this is what I did to simplify the physical networking
> (two logical networks, identical physical networking).
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Stefan Kooman <ste...@bit.nl>
> Sent: 13 May 2020 07:40
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: Cluster network and public network
>
> On 2020-05-12 18:59, Anthony D'Atri wrote:
>>
>>> I think, however, that a disappearing back network has no real
>>> consequences as the heartbeats always go over both.
>>
>> FWIW this has not been my experience, at least through Luminous.
>>
>> What I’ve seen is that when the cluster/replication net is configured but
>> unavailable, OSD heartbeats fail and peers report them to the mons as
>> down. The mons send out a map accordingly, and the affected OSDs report
>> “I’m not dead yet!”. Flap flap flap.
>
> +1. This has also been my experience. And it's quite hard to debug as
> well (confusing / seemingly contradictory messages).
>
> It uses the back network to replicate data ... and as long as it can't,
> (client) I/O won't go through.
>
> Gr.
> Stefan
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io