On Fri, Aug 24, 2018 at 08:58:12AM +0200, Claudio Jeker wrote:
> On Wed, Aug 22, 2018 at 12:12:10AM +0200, Remi Locherer wrote:
> > On Tue, Aug 21, 2018 at 05:54:18PM +0100, Stuart Henderson wrote:
> > > On 2018/08/21 17:16, Remi Locherer wrote:
> > > > Hi tech,
> > > > 
> > > > recently we had a short outage in our network. A script started an
> > > > additional ospfd instance because the -n flag for config test was
> > > > missing.
> > > 
> > > This is a problem with bgpd as well, last time I did this it killed one
> > > of the *other* routers on the network (i.e. not just the one where I
> > > accidentally ran 2x bgpd...).
> > > 
> > > > What then happened was not nice:
> > > > - The new ospfd unlinked the control socket of the first ospfd
> > > > - The new ospfd removed all routes from the first ospfd
> > > > - The new ospfd was not able to build up an adjacency and therefore
> > > >   could not install the routes needed for a recovery.
> > > > - Both ospfd instances were running but non-functional.
> > > > 
> > > > Of course the faulty script is fixed by now. ;-)
> > > > 
> > > > It would be nice if ospfd could prevent such a situation.
> > > > 
> > > > Below diff does these things:
> > > > - Detect a running ospfd by first doing a connect on the control socket.
> > > > - Do not delete the control socket on exit.
> > > >   - This could delete the socket of another instance.
> > > >   - Unlinking the socket on shutdown will be in the way once we add
> > > >     pledge to the main process. It was removed recently from various
> > > >     daemons.
> > > 
> > > This all sounds very sensible.
> > > 
> > > > - Do not delete routes added by another process even if they have
> > > >   prio RTP_OSPF. Without this the new ospfd will remove all the routes
> > > >   of the first one.
> > > 
> > > I'm unsure about this, the above changes stop the new ospfd from running
> > > don't they, so that shouldn't be a problem?
> > 
> > It stops too late. kr_init happens before that and kills all existing
> > routes with priority 32. The same happens again in the shutdown function
> > of ospfd.
> > > 
> > > If an ospfd blows up for whatever reason, it would be quite inconvenient
> > > if it needs manual route tweaks rather than just 'rcctl start ospfd' to
> > > fix it ..
> > 
> > Yes, this is not optimal.
> > 
> > The new diff below defers kr_init until the ospf engine notifies the parent
> > that the control socket is ready. In case the ospf engine exits because the
> > control socket is already in use no routes are known that could be removed.
> > 
> > With this ospfd keeps the behaviour of removing foreign routes with
> > priority 32.
> > 
> > Better?
> > 
> 
> Why are we not checking the control socket in the parent?
> Also it may be better to create the control socket in the parent and pass
> it to the ospfe. This is what bgpd is doing and it allows changing the
> path during runtime with a config reload.

This makes sense to me. I'll come up with a new diff once I find some
time for it.

But I'm not sure about changing the socket path with a reload. I plan to
pledge (stdio rpath sendfd wroute) and eventually unveil (read ospfd.conf)
the main process.

> 
> Could there be a case where this causes ospfd to hang on start in the
> connect? Not sure if we can sleep doing a connect() to an AF_UNIX socket.
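
For reference, a minimal sketch of the connect-based check discussed in
this thread. This is not the diff itself: control_in_use() is a
hypothetical helper name, and putting the socket into non-blocking mode
is only an assumption about how a possible hang in connect() could be
avoided. How a full listen queue is reported differs between systems,
so that branch is a guess as well.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the ospfd sources.
 * Returns 1 if something accepts connections on the socket at path,
 * 0 if nothing is listening there (no socket, or a stale one),
 * and -1 on any other error.
 */
int
control_in_use(const char *path)
{
	struct sockaddr_un sun;
	int fd, flags, ret;

	memset(&sun, 0, sizeof(sun));
	sun.sun_family = AF_UNIX;
	if (strlen(path) >= sizeof(sun.sun_path)) {
		errno = ENAMETOOLONG;
		return (-1);
	}
	memcpy(sun.sun_path, path, strlen(path) + 1);

	if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
		return (-1);

	/* Assumption: non-blocking mode keeps connect() from sleeping. */
	if ((flags = fcntl(fd, F_GETFL)) == -1 ||
	    fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
		close(fd);
		return (-1);
	}

	if (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) == 0)
		ret = 1;	/* another instance answered */
	else if (errno == ENOENT || errno == ECONNREFUSED)
		ret = 0;	/* no socket, or nobody listening on it */
	else if (errno == EAGAIN || errno == EINPROGRESS)
		ret = 1;	/* guess: listener exists, queue is full */
	else
		ret = -1;

	close(fd);
	return (ret);
}
```

A caller would refuse to start when this returns 1 and only touch the
routing table afterwards, which matches the ordering concern about
kr_init raised earlier in the thread.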
