Cross-posted from 
https://discuss.prometheus.io/t/is-this-alerting-architecture-crazy/610

I’m looking for a way to get strong alert delivery guarantees, and to 
find out quickly whenever delivery is not possible.

Unless I’m mistaken, AlertManager only offers best-effort delivery. What 
puzzles me, though, is that I’ve not found anyone else talking about this, 
so I worry I’m missing something obvious. Am I?

Assuming I’m not mistaken, I’ve been thinking of building a system with the 
architecture shown below.

[image: alertmanager-alertrouting.png]

Basically, rather than having AlertManager try to push to destinations, I’d 
have an AlertRouter which polls AlertManager. On each polling cycle the 
steps would be (ignoring any optimisations):

   - All active alerts are fetched from AlertManager.
   - The last known set of active alerts is read from the Alert Event Store.
   - The set of active alerts is compared with the last known state.
   - New alerts are added to an “active” partition in the Alert Event Store.
   - Resolved alerts are removed from the “active” partition and added to a 
   “resolved” partition.
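
To make the polling cycle concrete, here is a minimal Python sketch of the 
compare-and-update step. It is only an illustration under assumptions that 
are not part of the design above: the Alert Event Store is an in-memory 
stand-in, each alert is a dict identified by a "fingerprint" key, and the 
"delivered" flag is reset when an alert moves to the "resolved" partition so 
that the resolution is sent as its own notification. The names 
AlertEventStore and poll_cycle are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AlertEventStore:
    """In-memory stand-in for the Alert Event Store (hypothetical)."""
    active: dict = field(default_factory=dict)    # fingerprint -> record
    resolved: dict = field(default_factory=dict)  # fingerprint -> record


def poll_cycle(fetched_alerts, store):
    """Run one polling cycle over alerts already fetched from AlertManager.

    fetched_alerts: iterable of dicts, each with a unique "fingerprint" key
    (an assumption about how alerts are identified).
    Returns (new_fingerprints, resolved_fingerprints) as sorted lists.
    """
    current = {a["fingerprint"]: a for a in fetched_alerts}
    known = set(store.active)

    new = sorted(set(current) - known)   # firing now, not seen before
    gone = sorted(known - set(current))  # seen before, no longer firing

    for fp in new:
        store.active[fp] = {"alert": current[fp], "delivered": False}
    for fp in gone:
        record = store.active.pop(fp)
        # Assumption: the resolution is a separate notification, so the
        # flag is reset when the record enters the "resolved" partition.
        record["delivered"] = False
        store.resolved[fp] = record
    return new, gone
```

Fetching is kept outside the function so the comparison can be tested 
without a running AlertManager; a real store would also need to make the 
move between partitions atomic.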

A secondary process within AlertRouter would:

   - Check for alerts in the “active” partition which do not have a state 
   of “delivered = true”.
   - Attempt to send each of these alerts and set the “delivered” flag.
   - Check for alerts in the “resolved” partition which do not have a state 
   of “delivered = true”.
   - Attempt to send each of these resolved alerts and set the “delivered” 
   flag.
   - Move all alerts in the “resolved” partition where “delivered=true” to 
   a “completed” partition.
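
The delivery steps above could be sketched roughly as follows. Again this 
is an assumption-laden illustration, not a finished implementation: the 
partitions are plain dicts of fingerprint -> record, and send is a 
hypothetical callable that raises an exception when delivery fails.

```python
def delivery_pass(active, resolved, completed, send):
    """One pass of the delivery process over the three partitions.

    Each partition maps fingerprint -> {"alert": ..., "delivered": bool}.
    send(kind, alert) is a hypothetical callable that raises on failure.
    """
    for kind, partition in (("firing", active), ("resolved", resolved)):
        for record in partition.values():
            if record["delivered"]:
                continue
            try:
                send(kind, record["alert"])
            except Exception:
                # Leave the flag unset so the next pass retries this alert.
                continue
            record["delivered"] = True

    # Move resolved alerts whose resolution was delivered to "completed".
    for fp in [fp for fp, r in resolved.items() if r["delivered"]]:
        completed[fp] = resolved.pop(fp)
```

Because the flag is only set after send returns, a crash between sending 
and setting the flag would cause a repeat on the next pass. In other words 
this sketch gives at-least-once delivery, so destinations would need to 
tolerate duplicates.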

Among other metrics, the AlertRouter would emit one called 
“undelivered_alert_lowest_timestamp_in_seconds”, which could be used to 
alert me whenever an alert has not been delivered quickly enough. Since the 
alert is still held in the Alert Event Store, I should be able to fix 
whatever is blocking delivery without losing the alert.
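
As a rough sketch, the value behind that metric could be computed like 
this. It assumes each record carries a "first_seen" Unix timestamp, which 
is a field I am inventing for illustration, and the function name is 
hypothetical too.

```python
def undelivered_lowest_timestamp(*partitions):
    """Return the oldest "first_seen" Unix timestamp among undelivered
    records, or None when everything has been delivered.

    Each partition maps fingerprint -> record; "first_seen" is an assumed
    field holding the time the record entered its partition.
    """
    stamps = [
        record["first_seen"]
        for partition in partitions
        for record in partition.values()
        if not record["delivered"]
    ]
    return min(stamps) if stamps else None
```

An alerting rule could then compare the gauge against the current time, 
for example time() - undelivered_alert_lowest_timestamp_in_seconds > 300, 
where the 300 second threshold is an arbitrary example. What the gauge 
should report when nothing is pending is left open here.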

I think there are other benefits to this architecture too, e.g. because 
the AlertRouter pulls alerts in much the same way that Prometheus scrapes, 
natural back-pressure is a property of the system.

Anyway, as mentioned, I’ve not found anyone else doing something like this, 
which makes me wonder if there’s a very good reason not to. If anyone 
knows that this design is crazy, I’d love to hear!

Thanks
