[prometheus-developers] Is this alerting architecture crazy?

Tony Di Nucci Sat, 20 Nov 2021 02:02:52 -0800

Cross-posted from 
https://discuss.prometheus.io/t/is-this-alerting-architecture-crazy/610

For alerting, I’m looking for a way to get strong alert delivery
guarantees, and to find out quickly whenever delivery is not possible.

Unless I’m mistaken, Alertmanager only offers best-effort delivery. What
puzzles me, though, is that I’ve not found anyone else talking about this,
so I worry I’m missing something obvious. Am I?

Assuming I’m not mistaken, I’ve been thinking of building a system with the
architecture shown below.

[image: alertmanager-alertrouting.png]

Basically, rather than having Alertmanager try to push to destinations, I’d
have an AlertRouter which polls Alertmanager. On each polling cycle the
steps would be (neglecting any optimisations):

- All active alerts are fetched from Alertmanager.
- The last known set of active alerts is read from the Alert Event Store.
- The set of active alerts is compared with the last known state.
- New alerts are added to an “active” partition in the Alert Event Store.
- Resolved alerts are removed from the “active” partition and added to a
  “resolved” partition.
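
A minimal sketch of one polling cycle, to make the comparison step
concrete. It assumes alerts are keyed by a stable fingerprint and that the
Alert Event Store can be modelled as two in-memory dicts; the names
AlertEventStore and poll_cycle are hypothetical, and fetching from
Alertmanager is left out so the sketch stays self-contained. Resetting
"delivered" when an alert moves to the "resolved" partition is also an
assumption, made so that the resolution is sent as its own notification.

```python
from dataclasses import dataclass, field


@dataclass
class AlertEventStore:
    """Hypothetical in-memory stand-in for the Alert Event Store."""
    active: dict = field(default_factory=dict)    # fingerprint -> record
    resolved: dict = field(default_factory=dict)  # fingerprint -> record


def poll_cycle(store, fetched_alerts):
    """Reconcile one polling cycle.

    fetched_alerts maps fingerprint -> alert payload, as fetched from
    Alertmanager by the caller. Returns (added, removed) fingerprints.
    """
    current = set(fetched_alerts)
    known = set(store.active)
    added = current - known    # firing now, not in the last known state
    removed = known - current  # in the last known state, no longer firing

    for fp in added:
        store.active[fp] = {"alert": fetched_alerts[fp], "delivered": False}

    for fp in removed:
        record = store.active.pop(fp)
        record["delivered"] = False  # assumption: resolution is sent separately
        store.resolved[fp] = record

    return added, removed
```

In a real store the two partition updates would need to happen atomically
per alert, otherwise a crash between them could drop a resolution.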

A secondary process within AlertRouter would:

- Check for alerts in the “active” partition which do not have a state
  of “delivered = true”.
- Attempt to send each of these alerts and, on success, set the
  “delivered” flag.
- Check for alerts in the “resolved” partition which do not have a state
  of “delivered = true”.
- Attempt to send each of these resolved alerts and, on success, set the
  “delivered” flag.
- Move all alerts in the “resolved” partition where “delivered = true” to
  a “completed” partition.
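
The secondary process could look roughly like the sketch below. The
partitions are plain dicts here, and send is a hypothetical callable that
returns True when the destination accepted the notification; neither is
part of any existing tool.

```python
def delivery_pass(active, resolved, completed, send):
    """Run one pass of the secondary process over the partitions.

    Each partition maps fingerprint -> record dict. send(kind, record) is a
    hypothetical callable returning True on successful delivery.
    """
    for kind, partition in (("firing", active), ("resolved", resolved)):
        for record in partition.values():
            if record.get("delivered"):
                continue
            try:
                record["delivered"] = bool(send(kind, record))
            except Exception:
                # Leave the flag unset so the next pass retries this record.
                record["delivered"] = False

    # Move delivered resolutions to the "completed" partition.
    for fp in [fp for fp, rec in resolved.items() if rec.get("delivered")]:
        completed[fp] = resolved.pop(fp)
```

Because the flag is only set after a successful send, a crash between the
send and the flag update would cause a repeat, so this gives at-least-once
delivery and destinations would need to tolerate duplicates.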

Among other metrics, the AlertRouter would emit one called
“undelivered_alert_lowest_timestamp_in_seconds”, which could be used to
alert me whenever any alert is not delivered quickly enough. Since the
alert is still held in the Alert Event Store, it should be possible for me
to resolve whatever is blocking delivery without losing the alert.
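
As a sketch of how that metric's value might be computed: the function
below scans both partitions for undelivered records and returns the oldest
timestamp. The "seen_at" field (Unix seconds at which the record entered
its partition) and the function name are assumptions, as is returning None
when nothing is pending.

```python
def undelivered_lowest_timestamp(active, resolved):
    """Return the oldest "seen_at" among undelivered records, or None.

    active and resolved map fingerprint -> record dict. "seen_at" is a
    hypothetical field holding Unix seconds.
    """
    pending = [
        record["seen_at"]
        for partition in (active, resolved)
        for record in partition.values()
        if not record.get("delivered")
    ]
    return min(pending) if pending else None
```

An alerting rule could then compare
time() - undelivered_alert_lowest_timestamp_in_seconds against a threshold.
How the metric behaves when nothing is pending (absent, or set to the
current time) would need deciding so that the rule does not fire spuriously.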

I think there are other benefits to this architecture too, e.g. similar to
the way Prometheus scrapes, natural back-pressure is a property of the
system.

Anyway, as mentioned I’ve not found anyone else doing something like this
and this makes me wonder if there’s a very good reason not to. If anyone
knows that this design is crazy I’d love to hear!

Thanks
