Thanks for the feedback, Stuart. I really appreciate you taking the time, 
and you've given me reason to pause and reconsider my options.

I fully understand your concerns over introducing a new data store.  I'm 
not sure that AlertManager and Prometheus contain the state I need though, 
and I don't think I should attempt to use Prometheus as the store for this 
state: tracking per-alert latencies would produce a metric with unbounded 
cardinality, each series would only ever contain a single data point, and 
analysing this data would be tricky.
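
For illustration, here's a rough Go sketch of what I mean (the metric 
name and label are hypothetical):

    // A minimal sketch of why tracking per-alert latency as a Prometheus
    // metric runs into unbounded cardinality: the fingerprint label is
    // unique per alert, so every alert creates a brand-new series that
    // receives essentially one observation and then goes stale.
    package main

    import "github.com/prometheus/client_golang/prometheus"

    // Hypothetical metric; the name and label are illustrative only.
    var alertDeliveryLatency = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "alert_delivery_latency_seconds",
            Help: "Time from an alert firing to its notification being delivered.",
        },
        []string{"fingerprint"}, // one label value (so one series) per alert
    )

    func main() {
        prometheus.MustRegister(alertDeliveryLatency)

        // Every new fingerprint adds a series.  Over time the set of
        // fingerprints grows without bound, and each series only ever
        // holds a single meaningful data point.
        alertDeliveryLatency.WithLabelValues("8fc8bd57f9d87216").Set(42.0)
        alertDeliveryLatency.WithLabelValues("1b2a3c4d5e6f7a8b").Set(17.5)
    }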

On the "guaranteeing" delivery front.  You of course have a point that the 
more moving parts there are the more that can go wrong.  From the sounds of 
things though I don't think we're debating the need for another system 
(since this is what a webhook receiver would be?).  
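
For concreteness, this is roughly the kind of service I understand a 
webhook receiver to be.  A minimal, hypothetical Go sketch (the payload 
struct only covers the fields relevant here):

    // A small HTTP service that AlertManager pushes notification groups
    // to.  The payload shape follows the documented webhook format.
    package main

    import (
        "encoding/json"
        "log"
        "net/http"
    )

    type webhookPayload struct {
        Status   string `json:"status"`
        GroupKey string `json:"groupKey"`
        Alerts   []struct {
            Status      string            `json:"status"`
            Labels      map[string]string `json:"labels"`
            Fingerprint string            `json:"fingerprint"`
        } `json:"alerts"`
    }

    func main() {
        http.HandleFunc("/alerts", func(w http.ResponseWriter, r *http.Request) {
            var p webhookPayload
            if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            // Enrichment / business routing would happen here.  A non-2xx
            // response makes AlertManager retry the whole group later.
            log.Printf("group %s: %d alerts (%s)", p.GroupKey, len(p.Alerts), p.Status)
            w.WriteHeader(http.StatusOK)
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }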

Unless I'm mistaken, to hit the following requirements there'll need to be 
a system external to AlertManager, and this will have to maintain some 
state:
* support complex alert enrichment (in ways that cannot be defined in 
alerting rules)
* support business specific alert routing rules (which are defined outside 
of alerting rules)
* support detailed alert analysis (which includes per alert latencies)

I think this means that the question is limited to: is it better in my 
case to push or pull from AlertManager?  BTW, I'm sorry for the way I 
worded my original post, because I now realise how important it was to 
make explicit the requirements that (I think) necessitate the majority of 
the complexity.

As I still see it, the problems with the push approach (which are not 
present with the pull approach) are:
* It's only possible to know that an alert cannot be delivered after 
waiting for *group_interval* (typically many minutes)
* At a given moment it's not possible to determine whether a specific 
active alert has been delivered (at least I'm not aware of a way to 
determine this)
* It is possible for alerts to be dropped (e.g. 
https://github.com/prometheus/alertmanager/blob/b2a4cacb95dfcf1cc2622c59983de620162f360b/cluster/delegate.go#L277)

The tradeoffs of the pull approach are:
* I'd need to discover the AlertManager instances.  This is pretty 
straightforward in k8s.
* I may need to dedupe alert groups across AlertManager instances.  I 
think this would be pretty straightforward too, especially since 
AlertManager already populates fingerprints (see the sketch below).
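
Something like this rough Go sketch is what I have in mind for the pull 
side (instance discovery is assumed to have already produced the list of 
base URLs, and the addresses below are made up):

    // Query each discovered AlertManager instance over its v2 API and
    // deduplicate the results by fingerprint.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    // Only the fields needed for deduplication; the real payload is richer.
    type alert struct {
        Fingerprint string            `json:"fingerprint"`
        Labels      map[string]string `json:"labels"`
    }

    func pullAndDedupe(instances []string) map[string]alert {
        deduped := make(map[string]alert)
        for _, base := range instances {
            resp, err := http.Get(base + "/api/v2/alerts")
            if err != nil {
                continue // tolerate one instance being down; others still answer
            }
            var alerts []alert
            if err := json.NewDecoder(resp.Body).Decode(&alerts); err == nil {
                for _, a := range alerts {
                    // Clustered AlertManagers gossip the same alerts, so
                    // the fingerprint collapses duplicates across instances.
                    deduped[a.Fingerprint] = a
                }
            }
            resp.Body.Close()
        }
        return deduped
    }

    func main() {
        alerts := pullAndDedupe([]string{
            "http://alertmanager-0:9093", // hypothetical instance addresses
            "http://alertmanager-1:9093",
        })
        fmt.Printf("%d unique active alerts\n", len(alerts))
    }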

On Sunday, November 21, 2021 at 10:28:49 PM UTC Stuart Clark wrote:

> On 20/11/2021 23:42, Tony Di Nucci wrote:
> > Yes, the diagram is a bit of a simplification but not hugely.
> >
> > There may be multiple instances of AlertRouter however they will share 
> > a database.  Most likely things will be kept simple (at least 
> > initially) where each instance holds no state of its own.  Each active 
> > alert in the DB will be uniquely identified by the alert fingerprint 
> > (which the AlertManager API provides, i.e. a hash of the alert group's 
> > labels).  Each non-active alert will have a composite key (where one 
> > element is the alert group fingerprint).
> >
> > In this architecture I see AlertManager having the responsibilities of 
> > capturing, grouping, inhibiting and silencing alerts.  The AlertRouter 
> > will have the responsibilities of enriching alerts, routing based on 
> > business rules, monitoring/guaranteeing delivery and enabling analysis 
> > of alert history.
> >
> > Due to my requirements, I think I need something like the 
> > AlertRouter.  The question is really, am I better to push from 
> > AlertManager to AlertRouter, or to have AlertRouter pull from 
> > AlertManager.  My current opinion is that pulling comes with more 
> > benefits but since I've not seen anyone else doing this I'm concerned 
> > there could be good reasons (I'm not aware of) for not doing this.
>
> If you really must have another system connected to Alertmanager having 
> it respond to webhook notifications would be the much simpler option. 
> You'd still need to run multiple copies of your application behind a load 
> balancer (and have a clustered database) for HA, but at least you'd not 
> have the complexity of each instance having to discover all the 
> Alertmanager instances, query them and then deduplicate amongst the 
> different instances (again something that Alertmanager does itself 
> already).
>
> I'm still struggling to see why you need an extra system at all - it 
> feels very much like you'd be increasing complexity significantly which 
> naturally decreases reliability (more bits to break, have bugs or act in 
> unexpected ways) and slows things down (as there is another "hop" for an 
> alert to pass through). All of the things you mention can be done 
> already through Alertmanager, or could be done pretty simply with a 
> webhook receiver (without the need for any additional state storage, etc.)
>
> * Adding data to an alert could be done with a simple webhook receiver, 
> that accepts an alert and then forwards it on to another API with extra 
> information added (no need for any state)
> * Routing can be done within Alertmanager, or for more complex cases 
> could again be handled by a stateless webhook receiver
> * With regards to "guaranteeing" delivery I don't see your suggestion in 
> allowing that (I believe it would actually make that less likely overall 
> due to the added complexity and likelihood of bugs/unhandled cases). 
> Alertmanager already does a good job of retrying on errors (and updating 
> metrics if that happens) but not much can be done if the final system is 
> totally down for long periods of time (and for many systems if that 
> happens old alerts aren't very useful once it is back, as they may have 
> already resolved).
> * Alertmanager and Prometheus already expose a number of useful metrics 
> (make sure your Prometheus is scraping itself & all the connected 
> Alertmanagers) which should give you lots of useful information about 
> alert history (with the advantage of that data being with the monitoring 
> system you already know [with whatever you have connected like 
> dashboards, alerts, etc.])
>
> -- 
> Stuart Clark
>
>
