Package: ifupdown
Version: 0.8.36
Severity: important
Tags: patch

Hi!

Systemd has a class of boot-time races which can result in deadlock
while ifupdown-pre.service is waiting for udevadm settle.  In most of
the cases where that occurs, ifupdown is an innocent victim of the
interactions between other things with poorly specified or insufficient
dependency and ordering relationships - but when those get trapped on
either side of ifupdown (which is, reasonably enough, waiting for the
initial set of network devices to become available), people get locked
out of their remote machines: udevadm settle times out, ifupdown-pre
'fails', and then networking.service is simply not started.


It seems there have been many instances, and many permutations, of
people affected by this class of systemd race-to-deadlock bugs.  They
can be intermittent and very hard to get to the bottom of, and in
almost all of the reported cases I've found so far, people just gave
up trying to diagnose them and masked ifupdown-pre.service as a
workaround.

But in almost all of those cases that's the wrong kludge: nothing had
actually failed about waiting for the network devices to become
available, and nothing would have subsequently prevented
networking.service from starting successfully ...

So I'd like to suggest a much better workaround, which should instead
be the default in ifupdown - simply change:

diff --git a/debian/networking.service b/debian/networking.service
index 593172b..b645409 100644
--- a/debian/networking.service
+++ b/debian/networking.service
@@ -2,7 +2,7 @@
 Description=Raise network interfaces
 Documentation=man:interfaces(5)
 DefaultDependencies=no
-Requires=ifupdown-pre.service
+Wants=ifupdown-pre.service
 Wants=network.target
 After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service ifupdown-pre.service
 Before=network.target shutdown.target network-online.target

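For anyone who wants to try that locally before it lands in the package:
as far as I know systemd doesn't let a drop-in remove a dependency
directive, so the simplest route is to shadow the whole unit.  A sketch,
assuming the stock unit is installed as
/lib/systemd/system/networking.service:

  # Shadow the packaged unit with a locally edited copy
  cp /lib/systemd/system/networking.service /etc/systemd/system/networking.service
  sed -i 's/^Requires=ifupdown-pre.service$/Wants=ifupdown-pre.service/' \
      /etc/systemd/system/networking.service
  systemctl daemon-reload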

With this, networking.service will still wait for ifupdown-pre to
complete, either normally or via systemd's "bug fixing" timeout
when other services deadlock around it - and then in either case
networking.service will independently either succeed (in the
probable case where the network devices were not part of the race
that deadlocked), or fail to bring up only the network devices
that were affected by that problem.  But it will be *much* less
likely for people to get locked out of remote access to fix the
real problem when the next dist-upgrade brings some change to the
set of unit files on their system which introduces this race in a
way their machine will lose (which was how I hit this on Buster
to Bullseye upgrades).
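For anyone trying to confirm whether a machine actually hit this race,
the chain of events should be visible with the usual systemd tools
(nothing here is ifupdown-specific):

  # A timed-out ifupdown-pre plus a networking.service that was never
  # started is the signature of this class of problem.
  systemctl status ifupdown-pre.service networking.service
  journalctl -b -u ifupdown-pre.service -u networking.service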


As a side note to all that, the TimeoutSec=180 in ifupdown-pre
is a bit misleading, as udevadm settle will itself time out after
120 seconds unless it is explicitly told otherwise.
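If the intent really is to allow up to 180 seconds, udevadm would need
to be told so explicitly; a sketch of what that would look like,
assuming ifupdown-pre's ExecStart invokes udevadm settle directly:

  # udevadm settle defaults to a 120 second timeout, so to actually
  # honour the unit's TimeoutSec=180 it would need something like:
  udevadm settle --timeout=180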

  Cheers,
  Ron


As a postscript for anyone who might be interested, here are the
details of the particular race instance that first bit me and
got me digging into this:

 The BitBabbler package has udev rules and configuration for assigning
 hardware RNG devices directly to VM instances instead of to the host.

 It does this with a call to virsh, which in normal use (or prior to
 Bullseye) will 'immediately' either:

  - succeed
  - fail because the desired VM is not active
  - or fail because libvirtd has not yet started (or is not running)
    and its communication socket is not present.

 In no case was that operation ever expected to block for any extended
 duration, nor does it have any reason to.
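 To make the shape of that concrete, here is a hypothetical sketch of
 the pattern (not the actual BitBabbler rule or helper; the device
 match, guest name, and XML path are all made up for illustration):

   # /etc/udev/rules.d/99-example-rng-passthrough.rules (hypothetical)
   ACTION=="add", SUBSYSTEM=="usb", ENV{EXAMPLE_RNG}=="1", RUN+="/usr/local/sbin/example-attach-rng %k"

   # /usr/local/sbin/example-attach-rng (hypothetical helper)
   #!/bin/sh
   # Hand the device to the guest if it is running; otherwise just bail,
   # it will be attached when the guest is eventually started.
   virsh attach-device exampleguest /etc/example/rng-hostdev.xml || exit 0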

 But in Bullseye, libvirt changed from managing its control socket
 itself to using a "socket activation" unit, which is created (as I
 understand it, on the naive advice of systemd advocates) very early
 in the boot process - long before it would be able to start the
 service, as the service's own dependencies are not yet satisfied,
 and those are not applied transitively to the .socket unit which
 would be requesting the (as yet unstartable) service.
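 Roughly what that looks like (the directive names are systemd's, but
 the exact contents of Debian's libvirt units here are an assumption
 from memory, so treat this as illustrative only):

   # libvirtd.socket -- created very early, with no ordering of its own
   [Socket]
   ListenStream=/run/libvirt/libvirt-sock

   # libvirtd.service -- ordered after things that are nowhere near ready
   [Unit]
   After=network.target dbus.service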

 So now we have a race where the kernel, or a udev cold-plug trigger
 for an already attached BitBabbler, results in a call to virsh which
 is racing with the creation of the libvirtd.socket.  If the socket
 unit has not yet created it, that call fails (as expected) and
 everything runs normally, with the device being attached to the VM
 later when it is eventually started.

 But if the socket unit has already created a zombie socket, virsh
 will send its request to it and then wait for a response, which is
 never going to come because starting libvirtd is trapped on the
 other side of network.target being reached.

 And then ifupdown-pre innocently stumbles into this crime scene,
 because calling udevadm settle at this point will in turn block
 until the call to virsh completes.  Even though the network device
 events have probably been processed normally - probably before this
 whole chain of events even started - we now have a Mexican standoff
 that has brought the whole show to a halt until systemd pulls its
 timeout trigger, and everyone loses in the resulting carnage.

The problem is fixable, but it requires fixes and mitigations in many
different places (at least while the systemd folk continue to insist
that "starting sockets as early as possible magically resolves all
dependencies" and don't make the dependencies of the service units
that sockets ultimately want to start be automatically transitive).
As long as there are zombie sockets for things to block on, this sort
of circular race will always continue to exist.  No amount of
"deprecating" the use of udevadm settle, or other workarounds for
deadlocking, will actually change that; they just sweep the problem
under a different rug that someone will eventually lift again.

ifupdown can make itself more resilient to this by using Wants to wait
for ifupdown-pre, so that a failure of ifupdown-pre no longer prevents
networking.service from even trying to start, as it does when Requires
is used.

I've tried to narrow the window for this race by testing earlier (in the
BitBabbler udev rule) for the presence of the libvirtd control socket,
instead of waiting until virsh gets to the point of checking for it
itself.  That alone can't fix the problem, but it makes it harder and
rarer to lose on the slower machines where this was first seen.
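The sort of check meant there looks something like this (a sketch only;
the socket path is libvirt's default on Debian as far as I know, and the
guest name and XML path are placeholders):

  # Bail out early, the same way virsh would eventually have failed, if
  # libvirtd's control socket isn't there yet - without touching the
  # socket at all, so a zombie socket can't make us block here.
  [ -S /run/libvirt/libvirt-sock ] || exit 0
  virsh attach-device exampleguest /etc/example/rng-hostdev.xml || exit 0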

And the next bug I'll file is for libvirt to defer the creation of its
.socket until the daemon's dependencies can be met, so that the time
for which this could block becomes finite instead of an indefinite
deadlock - which will (aside from *very* slow machines still timing out,
which will always be a problem as long as systemd relies on timeouts to
resolve design and implementation bugs) actually fix this for this
particular permutation of participants ...
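In drop-in form, that mitigation could be as small as giving the .socket
the same ordering the service itself declares.  A sketch (the target
chosen here is an assumption; the point is only that the socket should
not exist before its service could actually be started):

  # /etc/systemd/system/libvirtd.socket.d/defer.conf (sketch)
  [Unit]
  # Don't create the listening socket until the point in boot where
  # libvirtd itself could be started to answer it.
  After=network.target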


But until we've found and worked through all the possible permutations
of things that can create this situation, having ifupdown assume that
a timeout failure of ifupdown-pre is unlikely to mean networking.service
will also fail after that 2 minute delay will give people the best
chance of still being able to access affected machines until the real
problem can be traced and debugged in their particular case.

I have tested the patch above, prior to taking further actions to
prevent the race entirely, and after waiting for the timeout to fire
the network does come up normally on all the machines that I've had
subjected to this problem after a Bullseye upgrade.
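A quick way to check the resulting state after such a boot (standard
systemctl queries, nothing specific to this patch):

  systemctl is-failed ifupdown-pre.service   # reports "failed" once the timeout has fired
  systemctl is-active networking.service     # should still report "active" with the patch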
