On Fri, Aug 24, 2018 at 08:58:12AM +0200, Claudio Jeker wrote:
> On Wed, Aug 22, 2018 at 12:12:10AM +0200, Remi Locherer wrote:
> > On Tue, Aug 21, 2018 at 05:54:18PM +0100, Stuart Henderson wrote:
> > > On 2018/08/21 17:16, Remi Locherer wrote:
> > > > Hi tech,
> > > >
> > > > recently we had a short outage in our network. A script started an
> > > > additional ospfd instance because the -n flag for config test was
> > > > missing.
> > >
> > > This is a problem with bgpd as well, last time I did this it killed one
> > > of the *other* routers on the network (i.e. not just the one where I
> > > accidentally ran 2x bgpd...).
> > >
> > > > What then happened was not nice:
> > > > - The new ospfd unlinked the control socket of the first ospfd
> > > > - The new ospfd removed all routes from the first ospfd
> > > > - The new ospfd was not able to build up an adjacency and therefore
> > > >   could not install the routes needed for a recovery.
> > > > - Both ospfd instances were running but non-functional.
> > > >
> > > > Of course the faulty script is fixed by now. ;-)
> > > >
> > > > It would be nice if ospfd could prevent such a situation.
> > > >
> > > > Below diff does these things:
> > > > - Detect a running ospfd by first doing a connect on the control socket.
> > > > - Do not delete the control socket on exit.
> > > >   - This could delete the socket of another instance.
> > > >   - Unlinking the socket on shutdown will be in the way once we add
> > > >     pledge to the main process. It was removed recently from various
> > > >     daemons.
> > >
> > > This all sounds very sensible.
> > >
> > > > - Do not delete routes added by another process even if they have
> > > >   prio RTP_OSPF. Without this the new ospfd will remove all the routes
> > > >   of the first one.
> > >
> > > I'm unsure about this, the above changes stop the new ospfd from running
> > > don't they, so that shouldn't be a problem?
> >
> > It stops too late. kr_init happens before and kills all existing routes
> > with priority 32. And again in the shutdown function of ospfd.
> >
> > > If an ospfd blows up for whatever reason, it would be quite inconvenient
> > > if it needs manual route tweaks rather than just 'rcctl start ospfd' to
> > > fix it ..
> >
> > Yes, this is not optimal.
> >
> > The new diff below defers kr_init until the ospf engine notifies the parent
> > that the control socket is ready. In case the ospf engine exits because the
> > control socket is already in use, no routes are known that could be removed.
> >
> > With this ospfd keeps the behaviour of removing foreign routes with
> > priority 32.
> >
> > Better?
> >
>
> Why are we not checking the control socket in the parent?
> Also it may be better to create the control socket in the parent and pass
> it to the ospfe. This is what bgpd is doing and it allows changing the path
> during runtime with a config reload.

This makes sense to me. I'll come up with a new diff once I find some time
for it.

But I'm not sure about changing the socket path with a reload. I plan to
pledge (stdio rpath sendfd wroute) and eventually unveil (read ospfd.conf)
the main process.

>
> Could there be a case where this causes ospfd to hang on start in the
> connect? Not sure if we can sleep doing a connect() to an AF_UNIX socket.
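
For illustration only, here is a minimal sketch of the connect-based probe
discussed above. It is not the diff from this thread: the helper name
control_socket_in_use() is hypothetical, and the choice to make the socket
non-blocking is an assumption made to sidestep the hang question, since on
some systems (Linux, for example) a blocking connect() can wait while the
listener's backlog is full.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from ospfd.
 * Returns 1 if something is listening on the AF_UNIX socket at path,
 * 0 if nothing is (no socket file, or a stale one), -1 on other errors.
 */
int
control_socket_in_use(const char *path)
{
	struct sockaddr_un	 sun;
	int			 fd, flags, ret;

	assert(path != NULL);

	memset(&sun, 0, sizeof(sun));
	sun.sun_family = AF_UNIX;
	if (strlen(path) >= sizeof(sun.sun_path)) {
		errno = ENAMETOOLONG;
		return (-1);
	}
	memcpy(sun.sun_path, path, strlen(path) + 1);

	if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
		return (-1);

	/* Assumption: go non-blocking so the probe can never stall startup. */
	if ((flags = fcntl(fd, F_GETFL)) == -1 ||
	    fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
		close(fd);
		return (-1);
	}

	if (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) == 0)
		ret = 1;	/* a listener accepted the connection */
	else if (errno == EAGAIN || errno == EINPROGRESS)
		ret = 1;	/* a listener exists but is busy */
	else if (errno == ENOENT || errno == ECONNREFUSED)
		ret = 0;	/* no socket file, or nobody listening on it */
	else
		ret = -1;

	close(fd);
	return (ret);
}
```

A caller would run this before touching the routing table and exit early on
a return value of 1. A stale socket file left behind by a crashed daemon
yields 0 (ECONNREFUSED), so a plain restart still works without unlinking
the socket on shutdown.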