On Wed, 17 Jun 2026 09:25:07 -0400
Aaron Conole <[email protected]> wrote:

> Timothy Redaelli via dev <[email protected]> writes:
> 
> > When ovsdb-server or ovs-vswitchd fails and auto-restarts
> > (Restart=on-failure), it briefly passes through the failed/inactive
> > state.  This causes a cascade: the umbrella service (which Requires
> > both) sees the failure and stops, which in turn stops the other
> > service via PartOf.  When the failed service comes back, the other
> > does not automatically restart.
> >
> > RestartMode=direct (systemd v254+, PR systemd/systemd#27584) makes
> > the service transition directly to the activating state during
> > auto-restart, skipping the failed/inactive state.  Dependents never
> > see the failure, so the cascade does not happen.
> >
> > On older systemd versions the directive is silently ignored with a
> > harmless journal warning ("Unknown key name 'RestartMode'"), so
> > this change is safe for all supported platforms.  Tested with
> > containers:
> >
> >   systemd 252 (CentOS Stream 9, Debian 12): warning, ignored
> >   systemd 255 (Ubuntu 24.04): recognized, clean
> >   systemd 256 (CentOS Stream 10): recognized, clean
> >   systemd 257 (Debian 13): recognized, clean
> 
> I didn't check, but we should probably make sure that any systems where
> we apply this also have:
> 
> https://github.com/goenkam/systemd/commit/7f85fc2c31f074badcf4d517a4f84a1fd72cf909
> 
> applied, right?  Otherwise, I think there's some kind of looped
> dependency restarts when this is triggered.

That commit (upstream 7a13937007, in v257+) fixes stop-job propagation
to BindsTo= dependents during direct-mode restarts.
OVS doesn't use BindsTo=: openvswitch.service uses Requires= on the
sub-services, and the sub-services use PartOf=openvswitch.service.
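
Roughly, the dependency layout looks like this (a simplified sketch,
not the actual unit files; everything except the Requires= and PartOf=
lines is omitted):

```ini
# openvswitch.service (umbrella unit)
[Unit]
Requires=ovsdb-server.service
Requires=ovs-vswitchd.service

# ovsdb-server.service and ovs-vswitchd.service (sub-services)
[Unit]
PartOf=openvswitch.service
```

No BindsTo= appears anywhere in that chain.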

The cascade we're preventing happens because Requires= reacts to
the sub-service entering the failed/inactive state.
RestartMode=direct prevents that by skipping the state transition
entirely, and that code path has been there since v254.
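
For reference, the relevant part of a sub-service would look something
like this (a sketch only; the other [Service] options are left out):

```ini
[Service]
Restart=on-failure
# Needs systemd v254+; older versions log an "Unknown key name"
# warning and ignore the line.
RestartMode=direct
```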

> But actually, this mode should only be on Type=one-shot services I
> think.  If ovsdb-server experiences failure, the RestartMode=direct
> shouldn't have any effect.  I'm guessing based on this:
> 
> * i.e. unit_process_job -> job_finish_and_invalidate is never called,
> * and the previous job might still be running (especially for
> * Type=oneshot services).
> 
> Which seems to imply that if there's a weird failure propagated, we
> might end up with too many instances of vswitchd/db-server running.

RestartMode=direct is not restricted to Type=oneshot; it works with
any service type.
The comment you quoted says "especially for Type=oneshot services"
because those can have long-running ExecStart= commands that might
still be in progress when a restart is attempted.

Our services are Type=forking with PIDFile=. This means the restart only
triggers when the main process exits (that's what Restart=on-failure
reacts to), so by the time service_enter_restart() runs, the old
process is already gone.
There's no window where two instances coexist.
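
In unit-file terms, that is the following combination (a sketch; the
PIDFile= path here is illustrative, not copied from the packaged
units):

```ini
[Service]
Type=forking
# Illustrative path (assumption); the packaged units define the real one.
PIDFile=/run/openvswitch/ovs-vswitchd.pid
Restart=on-failure
```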

Re-reading the systemd service files made me think about migrating
from Type=forking to Type=notify, to avoid the unnecessary forking and
PID checking and to get proper readiness signaling (sd_notify), but
I'll do that as a follow-up series (since RestartMode=direct will still
be needed).

> Perhaps I'm misunderstanding something.
> 
> > Timothy Redaelli (2):
> >   rhel: Add RestartMode=direct to service units.
> >   debian: Add RestartMode=direct to service units.
> >
> >  debian/openvswitch-switch.ovs-vswitchd.service      | 1 +
> >  debian/openvswitch-switch.ovsdb-server.service      | 1 +
> >  rhel/usr_lib_systemd_system_ovs-vswitchd.service.in | 1 +
> >  rhel/usr_lib_systemd_system_ovsdb-server.service    | 1 +
> >  4 files changed, 4 insertions(+)
> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
