Package: prometheus
Version: 2.24.1+ds-1+b7
Severity: important

In the systemd unit provided by this Debian package (in `debian/service`), we have this:

    [Service]
    Restart=on-failure

This makes systemd try to restart prometheus whenever it exits with anything other than exit code 0. That can be useful: you may want it to retry after a crash.

But I had a situation where I mistakenly pushed a broken config (broken rules, more accurately) to prometheus. The gory details are in this incident report:

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40939

... but the gist of it is that systemd repeatedly tried to restart the service and failed. And retried, and retried... This would fill the disk not only with logs of those attempts, it would also grow the WAL every time (which is a separate issue).

(Normally, prometheus fails to start and doesn't touch its data if the config file is broken, since 2.20:

https://github.com/prometheus/prometheus/pull/7399

... but this seems to be a courtesy that is not extended to rules. I will file an upstream bug on that one.)

Anyway, the point is that maybe we shouldn't restart so aggressively. Maybe `Restart=on-abnormal` or `Restart=on-abort` would be better? That way systemd wouldn't try to restart prometheus on syntax errors, and the service would correctly fail instead of being retried forever.

For now, I added a local override (`Restart=no`) to get through my day, but I'd be happy to have a discussion on the best way to deal with this.

(I first considered limiting the number of retries to something more decent than the current "infinity", but there isn't a setting directly for that in systemd. There *are* things like `StartLimitIntervalSec=interval` and `StartLimitBurst=burst`, but those were not triggered by my incident: prometheus takes about 3 seconds to start up, so the restarts come too slowly to reach the default limit of 5 starts within 10 seconds. So maybe that's another way to fix this, i.e. raise `StartLimitIntervalSec` (to, say, 30 seconds) or lower `StartLimitBurst`.)

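To make the start-limit option concrete, here is a rough sketch of what a local drop-in could look like. The unit name `prometheus.service`, the drop-in path, and the exact values are assumptions for illustration, not something taken from the package; the 3-second figure is simply what I observed during the incident.

```ini
# Hypothetical drop-in, for example:
#   /etc/systemd/system/prometheus.service.d/override.conf
# (unit name, path and values are assumptions)

[Unit]
# Start rate limiting is configured in [Unit], not in [Service].
# At roughly 3 seconds per failed start, only about 3 to 4 attempts
# fit into the default 10-second window, so the default burst of 5
# is never reached. Within 30 seconds, about 10 attempts would fit,
# so the limit is hit and the unit is left in the failed state.
StartLimitIntervalSec=30
StartLimitBurst=5

[Service]
# Alternative: restart only on abnormal termination (unclean signal,
# timeout, watchdog), not on a plain non-zero exit code.
#Restart=on-abnormal
```

A drop-in like this can be created with `systemctl edit prometheus`; if the file is written by hand, `systemctl daemon-reload` is needed before it takes effect.
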
Either way, I think we can expect prometheus to return proper exit statuses and, in those cases, systemd should *not* restart it, so I would propose `Restart=on-abort` instead of `on-failure`.

(Interestingly, `Restart=on-failure` was introduced explicitly to handle situations like this:

https://salsa.debian.org/go-team/packages/prometheus/-/commit/1a61bbb194

Excerpt:

> Subject: Change systemd service Restart directive from always to on-failure
>
> The always value is unusual, as it ignores successful exits. The
> prometheus daemon can also be requested to exit from its API, that
> should be honored.

... but it didn't take into account non-transient failures like configuration errors.)

I'll probably follow up with an MR on the package as well.

-- System Information:
Debian Release: 11.5
  APT prefers stable-security
  APT policy: (500, 'stable-security'), (500, 'stable-debug'), (500, 'stable'), (1, 'unstable'), (1, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-19-amd64 (SMP w/4 CPU threads)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_FIRMWARE_WORKAROUND, TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=fr_CA.UTF-8, LC_CTYPE=fr_CA.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages prometheus depends on:
ii  adduser                                  3.118
ii  fonts-glyphicons-halflings               1.009~3.4.1+dfsg-2
ii  init-system-helpers                      1.60
ii  libc6                                    2.31-13+deb11u4
pn  libjs-bootstrap4                         <none>
pn  libjs-eonasdan-bootstrap-datetimepicker  <none>
ii  libjs-jquery                             3.5.1+dfsg+~3.5.5-7
ii  libjs-jquery-hotkeys                     0~20130707+git2d51e3a9+dfsg-2.1
pn  libjs-moment                             <none>
pn  libjs-moment-timezone                    <none>
pn  libjs-mustache                           <none>
pn  libjs-popper.js                          <none>
pn  libjs-rickshaw                           <none>

Versions of packages prometheus recommends:
pn  prometheus-node-exporter  <none>

prometheus suggests no packages.