Bug#1022724: systemd restarts prometheus forever, leading to disk usage death spiral

Antoine Beaupre Mon, 24 Oct 2022 09:30:21 -0700

Package: prometheus
Version: 2.24.1+ds-1+b7
Severity: important

In the systemd unit provided by this Debian package (in
`debian/service`), we have this:


[Service]
Restart=on-failure

This makes it so that systemd will try to restart prometheus if it
exits for whatever reason other than exit code 0. This may be nice:
you may want it to retry if it crashes or something.

But I had a situation where I mistakenly pushed a broken config
(broken rules, more accurately) to prometheus. The gory details are in
this incident report:

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40939

... but the gist of it is that systemd repeatedly tried to restart the
service and failed. And retried, and retried... this would fill the
disk not only with logs of those attempts, but it would also grow the
WAL every time (which is a separate issue).

(Normally, prometheus fails to start and doesn't mess with its data if
the config file is broken, since 2.20:

https://github.com/prometheus/prometheus/pull/7399

... but it seems this to be a courtesy that is not extended to
rules. will file an upstream bug on that one.)

Anyways, point is maybe we shouldn't restart so aggressively. Maybe
`Restart=on-abnormal` or `Restart=on-abort` would be better? That way
systemd wouldn't try to restart prometheus on syntax errors, and
correctly fail instead of retrying the service forever.

For now, I added a local override (`Restart=no`) to get through my
day, but I'd be happy to have a discussion on the best way to deal
with this.

(I first considered limiting the number of retries to something more
decent than the current "infinity", but there isn't a setting directly
for that in systemd. There *are* things like
`StartLimitIntervalSec=interval` and `StartLimitBurst=burst` but those
were not triggered by my incident, because prometheus would take about
3 seconds to startup, which is above the default 5 restarts in 10
seconds default. So maybe that's another way to fix this, ie. raise
the StartLimitIntervalSec (to, say 30 seconds) or lower the
StartLimitBurst.)

Either way, I think we can expect prometheus to return proper exit
statuses and, in those case *not* restart prometheus, so I would
propose `Restart=on-abort` instead of the `on-failure`.

(Interestingly, the Restart=on-failure was introduced explicitly to
handle situations like this:

https://salsa.debian.org/go-team/packages/prometheus/-/commit/1a61bbb194

Excerpt:

> Subject: Change systemd service Restart directive from always to on-failure
> 
> The always value is unusual, as it ignores successful exits. The
> prometheus daemon can also be requested to exit from its API, that
> should be honored.

... but it didn't take into account non-transient failures like
configuration errors.)

I'll probably followup with a MR on the package as well.

-- System Information:
Debian Release: 11.5
  APT prefers stable-security
  APT policy: (500, 'stable-security'), (500, 'stable-debug'), (500, 'stable'), 
(1, 'unstable'), (1, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-19-amd64 (SMP w/4 CPU threads)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_FIRMWARE_WORKAROUND, 
TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=fr_CA.UTF-8, LC_CTYPE=fr_CA.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages prometheus depends on:
ii  adduser                                  3.118
ii  fonts-glyphicons-halflings               1.009~3.4.1+dfsg-2
ii  init-system-helpers                      1.60
ii  libc6                                    2.31-13+deb11u4
pn  libjs-bootstrap4                         <none>
pn  libjs-eonasdan-bootstrap-datetimepicker  <none>
ii  libjs-jquery                             3.5.1+dfsg+~3.5.5-7
ii  libjs-jquery-hotkeys                     0~20130707+git2d51e3a9+dfsg-2.1
pn  libjs-moment                             <none>
pn  libjs-moment-timezone                    <none>
pn  libjs-mustache                           <none>
pn  libjs-popper.js                          <none>
pn  libjs-rickshaw                           <none>

Versions of packages prometheus recommends:
pn  prometheus-node-exporter  <none>

prometheus suggests no packages.

Bug#1022724: systemd restarts prometheus forever, leading to disk usage death spiral

Reply via email to