Re: [prometheus-developers] Is this alerting architecture crazy?

2021-11-22 Thread Tony Di Nucci
> Honestly, most of what you want is stuff we could support in Alertmanager 
without a lot of trouble. And are things that other users would want as 
well. Rather than build a whole new system, why not contribute improvements 
directly to the Alertmanager.

That's a very good point and something I think would be great to do.  
Something I will have to keep in mind though is how things may play out in 
the world of hosted "Prometheus" solutions - if I were to go with one of 
these solutions then I'd have no control over when new features would be 
made available.

FWIW the custom routing that I'm talking about is very business specific 
and involves consulting (yet another!) system to determine the final alert 
severity and where it gets routed to.  I guess this could be supported in 
AlertManager (by having hooks or plugins), though whether the maintainers 
of AM would like this is obviously its own question.
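
To make that concrete, the kind of hook I have in mind would do something 
roughly like the following (just a sketch - the business-rules service, its 
URL and its response shape are all hypothetical):

// A rough sketch of the business-specific routing step: the final severity
// and destination come from consulting an external rules service. The
// service URL and response shape below are made up for illustration.
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
)

type routingDecision struct {
    Severity    string `json:"severity"`    // e.g. "page", "ticket", "ignore"
    Destination string `json:"destination"` // e.g. a PagerDuty service or Slack channel
}

// decideRoute asks the (hypothetical) business-rules API where an alert
// should go and how severe it really is.
func decideRoute(alertname, service string) (routingDecision, error) {
    q := url.Values{"alertname": {alertname}, "service": {service}}
    resp, err := http.Get("http://routing-rules.example.internal/decide?" + q.Encode())
    if err != nil {
        return routingDecision{}, err
    }
    defer resp.Body.Close()
    var d routingDecision
    err = json.NewDecoder(resp.Body).Decode(&d)
    return d, err
}

func main() {
    d, err := decideRoute("HighErrorRate", "checkout")
    if err != nil {
        fmt.Println("rules service unavailable, falling back to default route:", err)
        return
    }
    fmt.Printf("route to %s with severity %s\n", d.Destination, d.Severity)
}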

I'll discuss this with my colleagues to see whether we can consider 
contributing to AlertManager.

Thanks for the help!

On Monday, November 22, 2021 at 3:29:08 PM UTC sup...@gmail.com wrote:

> On Mon, Nov 22, 2021 at 4:03 PM Tony Di Nucci  wrote:
>
>> Thanks for the feedback Stuart, I really appreciate you taking the time 
>> and you've given me reason to pause and reconsider my options.
>>
>> I fully understand your concerns over having a new data store.  I'm not 
>> sure that AlertManager and Prometheus contain the state I need though and 
>> I'm not sure I should attempt to use Prometheus as the store for this state 
>> (tracking per alert latencies would end up with a metric with unbounded 
>> cardinality, each series would just contain a single data point and it 
>> would be tricky to analyse this data).
>>
>> On the "guaranteeing" delivery front.  You of course have a point that 
>> the more moving parts there are the more that can go wrong.  From the 
>> sounds of things though I don't think we're debating the need for another 
>> system (since this is what a webhook receiver would be?).  
>>
>> Unless I'm mistaken, to hit the following requirements there'll need to 
>> be a system external to AlertManager and this will have to maintain some state:
>> * supporting complex alert enrichment (in ways that cannot be defined in 
>> alerting rules)
>>
>
> We actually are interested in adding this to the alertmanager, there are a 
> few open proposals for this. Basically the idea is that you can make an 
> enrichment call at alert time to do things like grab metrics/dashboard 
> snapshots, other system state, etc.
>  
>
>> * support business specific alert routing rules (which are defined 
>> outside of alerting rules)
>>
>
> The alertmanager routing rules are pretty powerful already. Depending on 
> what you're interested in adding, this is something we could support 
> directly.
>  
>
>> * support detailed alert analysis (which includes per alert latencies)
>>
>
> This is, IMO, more of a logging problem. I think this is something we 
> could add. You ship the alert notifications to any kind of BI system you 
> like, ELK, etc. 
>
> Maybe something to integrate into 
> https://github.com/yakshaving-art/alertsnitch.
>  
>
>>
>> I think this means that the question is limited to: is it better in my 
>> case to push or pull from AlertManager?  BTW, I'm sorry for the way I 
>> worded my original post because I now realise how important it was to make 
>> explicit the requirements that (I think) necessitate the majority of the 
>> complexity.
>>
>
> Honestly, most of what you want is stuff we could support in Alertmanager 
> without a lot of trouble. And are things that other users would want as 
> well. Rather than build a whole new system, why not contribute improvements 
> directly to the Alertmanager.
>  
>
>>
>> As I still see it, the problems with the push approach (which are not 
>> present with the pull approach) are:
>> * It's only possible to know that an alert cannot be delivered after 
>> waiting for *group_interval* (typically many minutes)
>> * At a given moment it's not possible to determine whether a specific 
>> active alert has been delivered (at least I'm not aware of a way to 
>> determine this)
>> * It is possible for alerts to be dropped (e.g. 
>> https://github.com/prometheus/alertmanager/blob/b2a4cacb95dfcf1cc2622c59983de620162f360b/cluster/delegate.go#L277
>> ) 
>>
>> The tradeoffs for this are:
>> * I'd need to discover the AlertManager instances.  This is pretty 
>> straightforward in k8s.
>> * I may need to dedupe alert groups across AlertManager instances.  I 
>> think this would be pretty straightforward too, esp. since AlertManager 
>> already populates fingerprints.
>>
>>
>>  
>>
>> On Sunday, November 21, 2021 at 10:28:49 PM UTC Stuart Clark wrote:
>>
>>> On 20/11/2021 23:42, Tony Di Nucci wrote: 
>>> > Yes, the diagram is a bit of a simplification but not hugely. 
>>> > 
>>> > There may be multiple instances of AlertRouter however they will share 
>>> > a database.  Most likely 

Re: [prometheus-developers] Is this alerting architecture crazy?

2021-11-22 Thread Ben Kochie
On Mon, Nov 22, 2021 at 4:03 PM Tony Di Nucci  wrote:

> Thanks for the feedback Stuart, I really appreciate you taking the time
> and you've given me reason to pause and reconsider my options.
>
> I fully understand your concerns over having a new data store.  I'm not
> sure that AlertManager and Prometheus contain the state I need though and
> I'm not sure I should attempt to use Prometheus as the store for this state
> (tracking per alert latencies would end up with a metric with unbounded
> cardinality, each series would just contain a single data point and it
> would be tricky to analyse this data).
>
> On the "guaranteeing" delivery front.  You of course have a point that the
> more moving parts there are the more that can go wrong.  From the sounds of
> things though I don't think we're debating the need for another system
> (since this is what a webhook receiver would be?).
>
> Unless I'm mistaken, to hit the following requirements there'll need to be
> a system external to AlertManager and this will have to maintain some state:
> * supporting complex alert enrichment (in ways that cannot be defined in
> alerting rules)
>

We actually are interested in adding this to the alertmanager, there are a
few open proposals for this. Basically the idea is that you can make an
enrichment call at alert time to do things like grab metrics/dashboard
snapshots, other system state, etc.


> * support business specific alert routing rules (which are defined outside
> of alerting rules)
>

The alertmanager routing rules are pretty powerful already. Depending on
what you're interested in adding, this is something we could support
directly.


> * support detailed alert analysis (which includes per alert latencies)
>

This is, IMO, more of a logging problem. I think this is something we could
add. You ship the alert notifications to any kind of BI system you like,
ELK, etc.

Maybe something to integrate into
https://github.com/yakshaving-art/alertsnitch.
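
As a rough illustration of that (a sketch only - the struct fields follow
the documented webhook payload, the port is arbitrary, and the latency here
is just receipt time minus startsAt), a receiver that writes one JSON line
per alert for a log shipper to pick up:

// A sketch of treating alert analysis as a logging problem: a webhook
// receiver that writes one JSON line per alert notification, ready for a
// log shipper to feed into ELK or any other BI system.
package main

import (
    "encoding/json"
    "log"
    "net/http"
    "os"
    "time"
)

type webhookAlert struct {
    Status      string            `json:"status"`
    Labels      map[string]string `json:"labels"`
    StartsAt    time.Time         `json:"startsAt"`
    EndsAt      time.Time         `json:"endsAt"`
    Fingerprint string            `json:"fingerprint"`
}

type webhookPayload struct {
    Receiver string         `json:"receiver"`
    Status   string         `json:"status"`
    Alerts   []webhookAlert `json:"alerts"`
}

func main() {
    enc := json.NewEncoder(os.Stdout)
    http.HandleFunc("/log", func(w http.ResponseWriter, r *http.Request) {
        var p webhookPayload
        if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        now := time.Now().UTC()
        for _, a := range p.Alerts {
            // One structured log line per alert in the notification.
            enc.Encode(map[string]interface{}{
                "receivedAt":        now,
                "receiver":          p.Receiver,
                "status":            a.Status,
                "fingerprint":       a.Fingerprint,
                "labels":            a.Labels,
                "startsAt":          a.StartsAt,
                "secondsSinceStart": now.Sub(a.StartsAt).Seconds(),
            })
        }
        w.WriteHeader(http.StatusOK)
    })
    log.Fatal(http.ListenAndServe(":9099", nil)) // port is arbitrary
}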


>
> I think this means that the question is limited to: is it better in my
> case to push or pull from AlertManager?  BTW, I'm sorry for the way I
> worded my original post because I now realise how important it was to make
> explicit the requirements that (I think) necessitate the majority of the
> complexity.
>

Honestly, most of what you want is stuff we could support in Alertmanager
without a lot of trouble. And are things that other users would want as
well. Rather than build a whole new system, why not contribute improvements
directly to the Alertmanager.


>
> As I still see it, the problems with the push approach (which are not
> present with the pull approach) are:
> * It's only possible to know that an alert cannot be delivered after
> waiting for *group_interval* (typically many minutes)
> * At a given moment it's not possible to determine whether a specific
> active alert has been delivered (at least I'm not aware of a way to
> determine this)
> * It is possible for alerts to be dropped (e.g.
> https://github.com/prometheus/alertmanager/blob/b2a4cacb95dfcf1cc2622c59983de620162f360b/cluster/delegate.go#L277
> )
>
> The tradeoffs for this are:
> * I'd need to discover the AlertManager instances.  This is pretty
> straightforward in k8s.
> * I may need to dedupe alert groups across AlertManager instances.  I
> think this would be pretty straightforward too, esp. since AlertManager
> already populates fingerprints.
>
>
>
>
> On Sunday, November 21, 2021 at 10:28:49 PM UTC Stuart Clark wrote:
>
>> On 20/11/2021 23:42, Tony Di Nucci wrote:
>> > Yes, the diagram is a bit of a simplification but not hugely.
>> >
>> > There may be multiple instances of AlertRouter however they will share
>> > a database.  Most likely things will be kept simple (at least
>> > initially) where each instance holds no state of its own.  Each active
>> > alert in the DB will be uniquely identified by the alert fingerprint
>> > (which the AlertManager API provides, i.e. a hash of the alert group's
>> > labels).  Each non-active alert will have a composite key (where one
>> > element is the alert group fingerprint).
>> >
>> > In this architecture I see AlertManager having the responsibilities of
>> > capturing, grouping, inhibiting and silencing alerts.  The AlertRouter
>> > will have the responsibilities of enriching alerts, routing based on
>> > business rules, monitoring/guaranteeing delivery and enabling analysis
>> > of alert history.
>> >
>> > Due to my requirements, I think I need something like the
>> > AlertRouter.  The question is really, am I better to push from
>> > AlertManager to AlertRouter, or to have AlertRouter pull from
>> > AlertManager.  My current opinion is that pulling comes with more
>> > benefits but since I've not seen anyone else doing this I'm concerned
>> > there could be good reasons (I'm not aware of) for not doing this.
>>
>> If you really must have another system connected to Alertmanager having
>> it respond to webhook notifications would be the much si

Re: [prometheus-developers] Is this alerting architecture crazy?

2021-11-22 Thread Tony Di Nucci
Thanks for the feedback Stuart, I really appreciate you taking the time and 
you've given me reason to pause and reconsider my options.

I fully understand your concerns over having a new data store.  I'm not 
sure that AlertManager and Prometheus contain the state I need though and 
I'm not sure I should attempt to use Prometheus as the store for this state 
(tracking per alert latencies would end up with a metric with unbounded 
cardinality, each series would just contain a single data point and it 
would be tricky to analyse this data).

On the "guaranteeing" delivery front.  You of course have a point that the 
more moving parts there are the more that can go wrong.  From the sounds of 
things though I don't think we're debating the need for another system 
(since this is what a webhook receiver would be?).  

Unless I'm mistaken, to hit the following requirements there'll need to be 
a system external to AlertManager and this will have to maintain some state:
* supporting complex alert enrichment (in ways that cannot be defined in 
alerting rules)
* support business specific alert routing rules (which are defined outside 
of alerting rules)
* support detailed alert analysis (which includes per alert latencies)

I think this means that the question is limited to: is it better in my case 
to push or pull from AlertManager?  BTW, I'm sorry for the way I worded my 
original post because I now realise how important it was to make explicit 
the requirements that (I think) necessitate the majority of the complexity.

As I still see it, the problems with the push approach (which are not 
present with the pull approach) are:
* It's only possible to know that an alert cannot be delivered after 
waiting for *group_interval* (typically many minutes)
* At a given moment it's not possible to determine whether a specific 
active alert has been delivered (at least I'm not aware of a way to 
determine this)
* It is possible for alerts to be dropped 
(e.g. 
https://github.com/prometheus/alertmanager/blob/b2a4cacb95dfcf1cc2622c59983de620162f360b/cluster/delegate.go#L277)
 

The tradeoffs for this are:
* I'd need to discover the AlertManager instances.  This is pretty 
straightforward in k8s.
* I may need to dedupe alert groups across AlertManager instances.  I think 
this would be pretty straightforward too, esp. since AlertManager already 
populates fingerprints.
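
For what it's worth, the dedupe step I have in mind is not much more than 
keeping the first copy of each fingerprint seen across the replicas - a 
sketch (the amAlert struct carries only the fields this example needs):

// Deduplicate alerts pulled from several Alertmanager replicas by
// fingerprint; amAlert is a cut-down struct for illustration only.
package main

import "fmt"

type amAlert struct {
    Fingerprint string
    Labels      map[string]string
}

// dedupe keeps the first occurrence of each fingerprint across replicas.
func dedupe(perReplica ...[]amAlert) []amAlert {
    seen := make(map[string]bool)
    var out []amAlert
    for _, alerts := range perReplica {
        for _, a := range alerts {
            if seen[a.Fingerprint] {
                continue
            }
            seen[a.Fingerprint] = true
            out = append(out, a)
        }
    }
    return out
}

func main() {
    am1 := []amAlert{{Fingerprint: "1a2b", Labels: map[string]string{"alertname": "HighErrorRate"}}}
    am2 := []amAlert{{Fingerprint: "1a2b"}, {Fingerprint: "3c4d"}}
    fmt.Println(len(dedupe(am1, am2))) // prints 2
}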


 

On Sunday, November 21, 2021 at 10:28:49 PM UTC Stuart Clark wrote:

> On 20/11/2021 23:42, Tony Di Nucci wrote:
> > Yes, the diagram is a bit of a simplification but not hugely.
> >
> > There may be multiple instances of AlertRouter however they will share 
> > a database.  Most likely things will be kept simple (at least 
> > initially) where each instance holds no state of its own.  Each active 
> > alert in the DB will be uniquely identified by the alert fingerprint 
> > (which the AlertManager API provides, i.e. a hash of the alert group's 
> > labels).  Each non-active alert will have a composite key (where one 
> > element is the alert group fingerprint).
> >
> > In this architecture I see AlertManager having the responsibilities of 
> > capturing, grouping, inhibiting and silencing alerts.  The AlertRouter 
> > will have the responsibilities of enriching alerts, routing based on 
> > business rules, monitoring/guaranteeing delivery and enabling analysis 
> > of alert history.
> >
> > Due to my requirements, I think I need something like the 
> > AlertRouter.  The question is really, am I better to push from 
> > AlertManager to AlertRouter, or to have AlertRouter pull from 
> > AlertManager.  My current opinion is that pulling comes with more 
> > benefits but since I've not seen anyone else doing this I'm concerned 
> > there could be good reasons (I'm not aware of) for not doing this.
>
> If you really must have another system connected to Alertmanager, having 
> it respond to webhook notifications would be the much simpler option. 
> You'd still need to run multiple copies of your application behind a load 
> balancer (and have a clustered database) for HA, but at least you'd not 
> have the complexity of each instance having to discover all the 
> Alertmanager instances, query them and then deduplicate amongst the 
> different instances (again something that Alertmanager does itself 
> already).
>
> I'm still struggling to see why you need an extra system at all - it 
> feels very much like you'd be increasing complexity significantly which 
> naturally decreases reliability (more bits to break, have bugs or act in 
> unexpected ways) and slows things down (as there is another "hop" for an 
> alert to pass through). All of the things you mention can be done 
> already through Alertmanager, or could be done pretty simply with a 
> webhook receiver (without the need for any additional state storage, etc.)
>
> * Adding data to an alert could be done with a simple webhook receiver, 
> that accepts an alert and then forwards it on to another API with e

Re: [prometheus-developers] Is this alerting architecture crazy?

2021-11-21 Thread Stuart Clark

On 20/11/2021 23:42, Tony Di Nucci wrote:

Yes, the diagram is a bit of a simplification but not hugely.

There may be multiple instances of AlertRouter however they will share 
a database.  Most likely things will be kept simple (at least 
initially) where each instance holds no state of its own.  Each active 
alert in the DB will be uniquely identified by the alert fingerprint 
(which the AlertManager API provides, i.e. a hash of the alert group's 
labels).  Each non-active alert will have a composite key (where one 
element is the alert group fingerprint).


In this architecture I see AlertManager having the responsibilities of 
capturing, grouping, inhibiting and silencing alerts.  The AlertRouter 
will have the responsibilities of enriching alerts, routing based on 
business rules, monitoring/guaranteeing delivery and enabling analysis 
of alert history.


Due to my requirements, I think I need something like the 
AlertRouter.  The question is really, am I better to push from 
AlertManager to AlertRouter, or to have AlertRouter pull from 
AlertManager.  My current opinion is that pulling comes with more 
benefits but since I've not seen anyone else doing this I'm concerned 
there could be good reasons (I'm not aware of) for not doing this.


If you really must have another system connected to Alertmanager, having 
it respond to webhook notifications would be the much simpler option. 
You'd still need to run multiple copies of your application behind a load 
balancer (and have a clustered database) for HA, but at least you'd not 
have the complexity of each instance having to discover all the 
Alertmanager instances, query them and then deduplicate amongst the 
different instances (again something that Alertmanager does itself already).


I'm still struggling to see why you need an extra system at all - it 
feels very much like you'd be increasing complexity significantly which 
naturally decreases reliability (more bits to break, have bugs or act in 
unexpected ways) and slows things down (as there is another "hop" for an 
alert to pass through). All of the things you mention can be done 
already through Alertmanager, or could be done pretty simply with a 
webhook receiver (without the need for any additional state storage, etc.)


* Adding data to an alert could be done with a simple webhook receiver 
that accepts an alert and then forwards it on to another API with extra 
information added (no need for any state) - a minimal sketch of such a 
receiver follows after this list
* Routing can be done within Alertmanager, or for more complex cases 
could again be handled by a stateless webhook receiver
* With regards to "guaranteeing" delivery I don't see your suggestion in 
allowing that (I believe it would actually make that less likely overall 
due to the added complexity and likelihood of bugs/unhandled cases). 
Alertmanager already does a good job of retrying on errors (and updating 
metrics if that happens) but not much can be done if the final system is 
totally down for long periods of time (and for many systems if that 
happens old alerts aren't very useful once it is back, as they may have 
already resolved).
* Alertmanager and Prometheus already expose a number of useful metrics 
(make sure your Prometheus is scraping itself & all the connected 
Alertmanagers) which should give you lots of useful information about 
alert history (with the advantage of that data being with the monitoring 
system you already know [with whatever you have connected like 
dashboards, alerts, etc.])
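
For the first bullet, a minimal sketch of such a stateless receiver (the 
downstream URL and the extra field it adds are placeholders; the useful 
property is that returning a non-2xx status on failure leaves redelivery 
to Alertmanager's existing retry logic, so the receiver needs no state):

// A minimal sketch of a stateless enrich-and-forward webhook receiver.
// The downstream URL and the enrichment it adds are placeholders.
package main

import (
    "bytes"
    "encoding/json"
    "io"
    "log"
    "net/http"
)

const downstreamURL = "http://alert-consumer.example.internal/alerts" // hypothetical

type notification struct {
    Receiver string                   `json:"receiver"`
    Status   string                   `json:"status"`
    Alerts   []map[string]interface{} `json:"alerts"`
}

func handle(w http.ResponseWriter, r *http.Request) {
    var n notification
    if err := json.NewDecoder(r.Body).Decode(&n); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    // Enrichment: a static annotation here; in practice a lookup against a
    // CMDB, on-call schedule or similar.
    for _, a := range n.Alerts {
        a["enrichment"] = map[string]string{"runbook": "https://wiki.example.internal/runbooks"}
    }
    body, err := json.Marshal(n)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    resp, err := http.Post(downstreamURL, "application/json", bytes.NewReader(body))
    if err != nil || resp.StatusCode/100 != 2 {
        if resp != nil {
            io.Copy(io.Discard, resp.Body)
            resp.Body.Close()
        }
        // A non-2xx response tells Alertmanager the notification failed,
        // so its normal retry behaviour takes over.
        http.Error(w, "forward failed", http.StatusBadGateway)
        return
    }
    resp.Body.Close()
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/webhook", handle)
    log.Fatal(http.ListenAndServe(":9098", nil)) // port is arbitrary
}

On the Alertmanager side all that is needed is a webhook receiver entry 
pointing at this service.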


--
Stuart Clark



Re: [prometheus-developers] Is this alerting architecture crazy?

2021-11-20 Thread Tony Di Nucci
Yes, the diagram is a bit of a simplification but not hugely.

There may be multiple instances of AlertRouter however they will share a 
database.  Most likely things will be kept simple (at least initially) 
where each instance holds no state of its own.  Each active alert in the DB 
will be uniquely identified by the alert fingerprint (which the 
AlertManager API provides, i.e. a hash of the alert group's labels).  Each 
non-active alert will have a composite key (where one element is the alert 
group fingerprint).

In this architecture I see AlertManager having the responsibilities of 
capturing, grouping, inhibiting and silencing alerts.  The AlertRouter will 
have the responsibilities of enriching alerts, routing based on business 
rules, monitoring/guaranteeing delivery and enabling analysis of alert 
history.

Due to my requirements, I think I need something like the AlertRouter.  The 
question is really, am I better to push from AlertManager to AlertRouter, 
or to have AlertRouter pull from AlertManager.  My current opinion is that 
pulling comes with more benefits but since I've not seen anyone else doing 
this I'm concerned there could be good reasons (I'm not aware of) for not 
doing this.

On Saturday, November 20, 2021 at 5:38:06 PM UTC Stuart Clark wrote:

> It sounds like you are planning on creating a fairly complex system that 
> duplicates a reasonable amount of what Alertmanager already does. I'm 
> presuming your diagram is a simplification and that the application is 
> itself a cluster, so each instance would be querying each instance of 
> Alertmanager? Would your storage be part of the clustering system (similar 
> to Alertmanager) or another cluster of something like a relational 
> database? 
>
> On 20 November 2021 11:28:30 GMT, Tony Di Nucci  
> wrote:
>>
>> There are other things I need to do as well, alert enrichment, complex 
>> routing, etc.  which means that I think some additional system is needed 
>> between AlertManager and the final destination in any case.
>>
>> The main question in my mind is really: are there reasons why I should 
>> prefer to have AlertManager push to this new system over having this new 
>> system pull?  
>>
>> My reasons for preferring a pull based architecture are:
>> * Just by looking at the AlertRouter we can get a reasonable 
>> understanding of overall health.  If alerts are pushed to the router then 
>> it alone can't tell the difference between no alerts firing and it not 
>> receiving alerts that have fired.
>> * Backpressure is a natural property of the system.
>>
>> With this extra context, what do you think?
>>
>> On Saturday, November 20, 2021 at 11:08:58 AM UTC Tony Di Nucci wrote:
>>
>>> Thanks for the feedback.
>>>
>>> > What gives you the impression that the Alertmanager is "best effort"?
>>> Sorry, best-effort probably wasn't the right term to use.  I am aware of 
>>> there being retries however these could still all fail and I'm thinking I 
>>> wouldn't be made aware of the issue for potentially quite a long time.
>>>
>>> My understanding is that an 
>>> *alertmanager_notification_requests_failed_total* counter will be 
>>> incremented each time there is a failed send attempt however from this 
>>> alone I can't tell the difference between a single alert that's 
>>> consistently failing and a small number of alerts which are all failing.  I 
>>> think this means that I've got to wait until 
>>> *alertmanager_notifications_failed_total* 
>>> is incremented before considering an alert to have failed (and this 
>>> can take many minutes) and then a bit of exploration is needed to figure 
>>> out which alert(s) failed.  Depending on the criticality of the alert it 
>>> may be fine for it to take some minutes before we're made aware of a 
>>> delivery problem, in other cases though it won't be.
>>>
>>> A couple of things I didn't really touch on originally which will also 
>>> help explain where my head is:
>>> * I have a requirement to be able to measure accurate latency per alert 
>>> through the alerting pipeline, i.e. for each alert I need to know the 
>>> amount of time it was known to AlertManager before it was successfully 
>>> written to the destination.
>>> * I have a requirement to be able to analyse historic alerts.
>>>
>>>
>>>
>>> On Saturday, November 20, 2021 at 10:33:12 AM UTC sup...@gmail.com 
>>> wrote:
>>>
 Also, the alertmanager does have an "event store", it's a shared state 
 between all instances.

 If you're interested in changing some of the behavior of the retry 
 mechanisms or how this works, feel free to open specific issues. You don't 
 need to build an entirely new system, we can add new features to the 
 existing Alertmanager clustering framework.

 On Sat, Nov 20, 2021 at 11:29 AM Ben Kochie  wrote:

> What gives you the impression that the Alertmanager is "best effort"?
>
> The alertmanager provides a reasonably robust HA solution (gossip 
> cluster

Re: [prometheus-developers] Is this alerting architecture crazy?

2021-11-20 Thread Stuart Clark
It sounds like you are planning on creating a fairly complex system that 
duplicates a reasonable amount of what Alertmanager already does. I'm presuming 
your diagram is a simplification and that the application is itself a cluster, 
so each instance would be querying each instance of Alertmanager? Would your 
storage be part of the clustering system (similar to Alertmanager) or another 
cluster of something like a relational database? 

On 20 November 2021 11:28:30 GMT, Tony Di Nucci  wrote:
>There are other things I need to do as well, alert enrichment, complex 
>routing, etc.  which means that I think some additional system is needed 
>between AlertManager and the final destination in any case.
>
>The main question in my mind is really: are there reasons why I should 
>prefer to have AlertManager push to this new system over having this new 
>system pull?  
>
>My reasons for preferring a pull based architecture are:
>* Just by looking at the AlertRouter we can get a reasonable understanding 
>of overall health.  If alerts are pushed to the router then it alone can't 
>tell the difference between no alerts firing and it not receiving alerts 
>that have fired.
>* Backpressure is a natural property of the system.
>
>With this extra context, what do you think?
>
>On Saturday, November 20, 2021 at 11:08:58 AM UTC Tony Di Nucci wrote:
>
>> Thanks for the feedback.
>>
>> > What gives you the impression that the Alertmanager is "best effort"?
>> Sorry, best-effort probably wasn't the right term to use.  I am aware of 
>> there being retries however these could still all fail and I'm thinking I 
>> wouldn't be made aware of the issue for potentially quite a long time.
>>
>> My understanding is that an 
>> *alertmanager_notification_requests_failed_total* counter will be 
>> incremented each time there is a failed send attempt however from this 
>> alone I can't tell the difference between a single alert that's 
>> consistently failing and a small number of alerts which are all failing.  I 
>> think this means that I've got to wait until 
>> *alertmanager_notifications_failed_total* 
>> is incremented before considering an alert to have failed (and this can 
>> take many minutes) and then a bit of exploration is needed to figure out 
>> which alert(s) failed.  Depending on the criticality of the alert it may be 
>> fine for it to take some minutes before we're made aware of a delivery 
>> problem, in other cases though it won't be.
>>
>> A couple of things I didn't really touch on originally which will also 
>> help explain where my head is:
>> * I have a requirement to be able to measure accurate latency per alert 
>> through the alerting pipeline, i.e. for each alert I need to know the 
>> amount of time it was known to AlertManager before it was successfully 
>> written to the destination.
>> * I have a requirement to be able to analyse historic alerts.
>>
>>
>>
>> On Saturday, November 20, 2021 at 10:33:12 AM UTC sup...@gmail.com wrote:
>>
>>> Also, the alertmanager does have an "event store", it's a shared state 
>>> between all instances.
>>>
>>> If you're interested in changing some of the behavior of the retry 
>>> mechanisms or how this works, feel free to open specific issues. You don't 
>>> need to build an entirely new system, we can add new features to the 
>>> existing Alertmanager clustering framework.
>>>
>>> On Sat, Nov 20, 2021 at 11:29 AM Ben Kochie  wrote:
>>>
 What gives you the impression that the Alertmanager is "best effort"?

 The alertmanager provides a reasonably robust HA solution (gossip 
 clustering). The only thing best-effort here is actually deduplication. 
 The 
 Alertmanager design is "at least once" delivery, so it's robust against 
 network split-brain issues. So in the event of a failure, you may get 
 duplicate alerts, not none.

 When it comes to delivery, the Alertmanager does have retries. If a 
 connection to PagerDuty or other receivers has an issue, it will retry. 
 There are also metrics for this, so you can alert on failures to alternate 
 channels.

 What you likely need is a heartbeat setup. Because services like 
 PagerDuty and Slack do have outages, you can't guarantee delivery if 
 they're down.

 The method here is to have an end-to-end "always firing heartbeat" 
 alert, which goes to a system/service like healthchecks.io or 
 deadmanssnitch.com. These will trigger an alert in the absence of your 
 heartbeat. Letting you know that some part of the pipeline has failed.

 On Sat, Nov 20, 2021 at 11:02 AM Tony Di Nucci  
 wrote:

> Cross-posted from 
> https://discuss.prometheus.io/t/is-this-alerting-architecture-crazy/610
>
> In relation to alerting, I’m looking for a way to get strong alert 
> delivery guarantees (and if delivery is not possible I want to know about 
> it quickly).
>
> Unless I’m mistaken AlertManager only offer

Re: [prometheus-developers] Is this alerting architecture crazy?

2021-11-20 Thread Tony Di Nucci
There are other things I need to do as well, alert enrichment, complex 
routing, etc., which means that I think some additional system is needed 
between AlertManager and the final destination in any case.

The main question in my mind is really: are there reasons why I should 
prefer to have AlertManager push to this new system over having this new 
system pull?  

My reasons for preferring a pull based architecture are:
* Just by looking at the AlertRouter we can get a reasonable understanding 
of overall health.  If alerts are pushed to the router then it alone can't 
tell the difference between no alerts firing and it not receiving alerts 
that have fired.
* Backpressure is a natural property of the system.

With this extra context, what do you think?
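
To show what I mean by the first point, a sketch of the polling side 
(assuming the Alertmanager v2 HTTP API at GET /api/v2/alerts; the address 
and interval are made up) - a failed poll is immediately distinguishable 
from "no alerts firing":

// A sketch of the pull model, assuming the Alertmanager v2 HTTP API
// (GET /api/v2/alerts). The address and interval are made up; the point
// is that a failed poll is not the same thing as "no alerts firing".
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

type amAlert struct {
    Fingerprint string            `json:"fingerprint"`
    Labels      map[string]string `json:"labels"`
    StartsAt    time.Time         `json:"startsAt"`
}

func pollActiveAlerts(baseURL string) ([]amAlert, error) {
    resp, err := http.Get(baseURL + "/api/v2/alerts?active=true&silenced=false&inhibited=false")
    if err != nil {
        return nil, err // Alertmanager unreachable - a signal in itself
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
    }
    var alerts []amAlert
    if err := json.NewDecoder(resp.Body).Decode(&alerts); err != nil {
        return nil, err
    }
    return alerts, nil
}

func main() {
    for range time.Tick(30 * time.Second) {
        alerts, err := pollActiveAlerts("http://alertmanager.example.internal:9093")
        if err != nil {
            fmt.Println("poll failed - pipeline unhealthy, not merely quiet:", err)
            continue
        }
        fmt.Printf("%d active alerts known to Alertmanager\n", len(alerts))
    }
}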

On Saturday, November 20, 2021 at 11:08:58 AM UTC Tony Di Nucci wrote:

> Thanks for the feedback.
>
> > What gives you the impression that the Alertmanager is "best effort"?
> Sorry, best-effort probably wasn't the right term to use.  I am aware of 
> there being retries however these could still all fail and I'm thinking I 
> wouldn't be made aware of the issue for potentially quite a long time.
>
> My understanding is that an 
> *alertmanager_notification_requests_failed_total* counter will be 
> incremented each time there is a failed send attempt however from this 
> alone I can't tell the difference between a single alert that's 
> consistently failing and a small number of alerts which are all failing.  I 
> think this means that I've got to wait until 
> *alertmanager_notifications_failed_total* 
> is incremented before considering an alert to have failed (and this can 
> take many minutes) and then a bit of exploration is needed to figure out 
> which alert(s) failed.  Depending on the criticality of the alert it may be 
> fine for it to take some minutes before we're made aware of a delivery 
> problem, in other cases though it won't be.
>
> A couple of things I didn't really touch on originally which will also 
> help explain where my head is:
> * I have a requirement to be able to measure accurate latency per alert 
> through the alerting pipeline, i.e. for each alert I need to know the 
> amount of time it was known to AlertManager before it was successfully 
> written to the destination.
> * I have a requirement to be able to analyse historic alerts.
>
>
>
> On Saturday, November 20, 2021 at 10:33:12 AM UTC sup...@gmail.com wrote:
>
>> Also, the alertmanager does have an "event store", it's a shared state 
>> between all instances.
>>
>> If you're interested in changing some of the behavior of the retry 
>> mechanisms or how this works, feel free to open specific issues. You don't 
>> need to build an entirely new system, we can add new features to the 
>> existing Alertmanager clustering framework.
>>
>> On Sat, Nov 20, 2021 at 11:29 AM Ben Kochie  wrote:
>>
>>> What gives you the impression that the Alertmanager is "best effort"?
>>>
>>> The alertmanager provides a reasonably robust HA solution (gossip 
>>> clustering). The only thing best-effort here is actually deduplication. The 
>>> Alertmanager design is "at least once" delivery, so it's robust against 
>>> network split-brain issues. So in the event of a failure, you may get 
>>> duplicate alerts, not none.
>>>
>>> When it comes to delivery, the Alertmanager does have retries. If a 
>>> connection to PagerDuty or other receivers has an issue, it will retry. 
>>> There are also metrics for this, so you can alert on failures to alternate 
>>> channels.
>>>
>>> What you likely need is a heartbeat setup. Because services like 
>>> PagerDuty and Slack do have outages, you can't guarantee delivery if 
>>> they're down.
>>>
>>> The method here is to have an end-to-end "always firing heartbeat" 
>>> alert, which goes to a system/service like healthchecks.io or 
>>> deadmanssnitch.com. These will trigger an alert in the absence of your 
>>> heartbeat. Letting you know that some part of the pipeline has failed.
>>>
>>> On Sat, Nov 20, 2021 at 11:02 AM Tony Di Nucci  
>>> wrote:
>>>
 Cross-posted from 
 https://discuss.prometheus.io/t/is-this-alerting-architecture-crazy/610

 In relation to alerting, I’m looking for a way to get strong alert 
 delivery guarantees (and if delivery is not possible I want to know about 
 it quickly).

 Unless I’m mistaken AlertManager only offers best-effort delivery. 
 What’s puzzled me though is that I’ve not found anyone else speaking about 
 this, so I worry I’m missing something obvious. Am I?

 Assuming I’m not mistaken I’ve been thinking of building a system with 
 the architecture shown below.

 [image: alertmanager-alertrouting.png]

 Basically rather than having AlertManager try and push to destinations 
 I’d have an AlertRouter which polls AlertManager. On each polling cycle 
 the 
 steps would be (neglecting any optimisations):

- All active alerts are fetched f

Re: [prometheus-developers] Is this alerting architecture crazy?

2021-11-20 Thread Tony Di Nucci
Thanks for the feedback.

> What gives you the impression that the Alertmanager is "best effort"?
Sorry, best-effort probably wasn't the right term to use.  I am aware of 
there being retries however these could still all fail and I'm thinking I 
wouldn't be made aware of the issue for potentially quite a long time.

My understanding is that an 
*alertmanager_notification_requests_failed_total* counter will be 
incremented each time there is a failed send attempt however from this 
alone I can't tell the difference between a single alert that's 
consistently failing and a small number of alerts which are all failing.  I 
think this means that I've got to wait until 
*alertmanager_notifications_failed_total* 
is incremented before considering an alert to have failed (and this can 
take many minutes) and then a bit of exploration is needed to figure out 
which alert(s) failed.  Depending on the criticality of the alert it may be 
fine for it to take some minutes before we're made aware of a delivery 
problem, in other cases though it won't be.

A couple of things I didn't really touch on originally which will also help 
explain where my head is:
* I have a requirement to be able to measure accurate latency per alert 
through the alerting pipeline, i.e. for each alert I need to know the 
amount of time it was known to AlertManager before it was successfully 
written to the destination.
* I have a requirement to be able to analyse historic alerts.



On Saturday, November 20, 2021 at 10:33:12 AM UTC sup...@gmail.com wrote:

> Also, the alertmanager does have an "event store", it's a shared state 
> between all instances.
>
> If you're interested in changing some of the behavior of the retry 
> mechanisms or how this works, feel free to open specific issues. You don't 
> need to build an entirely new system, we can add new features to the 
> existing Alertmanager clustering framework.
>
> On Sat, Nov 20, 2021 at 11:29 AM Ben Kochie  wrote:
>
>> What gives you the impression that the Alertmanager is "best effort"?
>>
>> The alertmanager provides a reasonably robust HA solution (gossip 
>> clustering). The only thing best-effort here is actually deduplication. The 
>> Alertmanager design is "at least once" delivery, so it's robust against 
>> network split-brain issues. So in the event of a failure, you may get 
>> duplicate alerts, not none.
>>
>> When it comes to delivery, the Alertmanager does have retries. If a 
>> connection to PagerDuty or other receivers has an issue, it will retry. 
>> There are also metrics for this, so you can alert on failures to alternate 
>> channels.
>>
>> What you likely need is a heartbeat setup. Because services like 
>> PagerDuty and Slack do have outages, you can't guarantee delivery if 
>> they're down.
>>
>> The method here is to have an end-to-end "always firing heartbeat" alert, 
>> which goes to a system/service like healthchecks.io or deadmanssnitch.com. 
>> These will trigger an alert in the absence of your heartbeat. Letting you 
>> know that some part of the pipeline has failed.
>>
>> On Sat, Nov 20, 2021 at 11:02 AM Tony Di Nucci  
>> wrote:
>>
>>> Cross-posted from 
>>> https://discuss.prometheus.io/t/is-this-alerting-architecture-crazy/610
>>>
>>> In relation to alerting, I’m looking for a way to get strong alert 
>>> delivery guarantees (and if delivery is not possible I want to know about 
>>> it quickly).
>>>
>>> Unless I’m mistaken AlertManager only offers best-effort delivery. 
>>> What’s puzzled me though is that I’ve not found anyone else speaking about 
>>> this, so I worry I’m missing something obvious. Am I?
>>>
>>> Assuming I’m not mistaken I’ve been thinking of building a system with 
>>> the architecture shown below.
>>>
>>> [image: alertmanager-alertrouting.png]
>>>
>>> Basically rather than having AlertManager try and push to destinations 
>>> I’d have an AlertRouter which polls AlertManager. On each polling cycle the 
>>> steps would be (neglecting any optimisations):
>>>
>>>- All active alerts are fetched from AlertManager.
>>>- The last known set of active alerts is read from the Alert Event 
>>>Store.
>>>- The set of active alerts is compared with the last known state.
>>>- New alerts are added to an “active” partition in the Alert Event 
>>>Store.
>>>- Resolved alerts are removed from the “active” partition and added 
>>>to a “resolved” partition.
>>>
>>> A secondary process within AlertRouter would:
>>>
>>>- Check for alerts in the “active” partition which do not have a 
>>>state of “delivered = true”.
>>>- Attempt to send each of these alerts and set the “delivered” flag.
>>>- Check for alerts in the “resolved” partition which do not have a 
>>>state of “delivered = true”.
>>>- Attempt to send each of these resolved alerts and set the 
>>>“delivered” flag.
>>>- Move all alerts in the “resolved” partition where “delivered=true” 
>>>to a “completed” partition.
>>>
>>> Among other 

Re: [prometheus-developers] Is this alerting architecture crazy?

2021-11-20 Thread Ben Kochie
Also, the alertmanager does have an "event store", it's a shared state
between all instances.

If you're interested in changing some of the behavior of the retry
mechanisms or how this works, feel free to open specific issues. You don't
need to build an entirely new system, we can add new features to the
existing Alertmanager clustering framework.

On Sat, Nov 20, 2021 at 11:29 AM Ben Kochie  wrote:

> What gives you the impression that the Alertmanager is "best effort"?
>
> The alertmanager provides a reasonably robust HA solution (gossip
> clustering). The only thing best-effort here is actually deduplication. The
> Alertmanager design is "at least once" delivery, so it's robust against
> network split-brain issues. So in the event of a failure, you may get
> duplicate alerts, not none.
>
> When it comes to delivery, the Alertmanager does have retries. If a
> connection to PagerDuty or other receivers has an issue, it will retry.
> There are also metrics for this, so you can alert on failures to alternate
> channels.
>
> What you likely need is a heartbeat setup. Because services like PagerDuty
> and Slack do have outages, you can't guarantee delivery if they're down.
>
> The method here is to have an end-to-end "always firing heartbeat" alert,
> which goes to a system/service like healthchecks.io or deadmanssnitch.com.
> These will trigger an alert in the absence of your heartbeat. Letting you
> know that some part of the pipeline has failed.
>
> On Sat, Nov 20, 2021 at 11:02 AM Tony Di Nucci 
> wrote:
>
>> Cross-posted from
>> https://discuss.prometheus.io/t/is-this-alerting-architecture-crazy/610
>>
>> In relation to alerting, I’m looking for a way to get strong alert
>> delivery guarantees (and if delivery is not possible I want to know about
>> it quickly).
>>
>> Unless I’m mistaken AlertManager only offers best-effort delivery. What’s
>> puzzled me though is that I’ve not found anyone else speaking about this,
>> so I worry I’m missing something obvious. Am I?
>>
>> Assuming I’m not mistaken I’ve been thinking of building a system with
>> the architecture shown below.
>>
>> [image: alertmanager-alertrouting.png]
>>
>> Basically rather than having AlertManager try and push to destinations
>> I’d have an AlertRouter which polls AlertManager. On each polling cycle the
>> steps would be (neglecting any optimisations):
>>
>>- All active alerts are fetched from AlertManager.
>>- The last known set of active alerts is read from the Alert Event
>>Store.
>>- The set of active alerts is compared with the last known state.
>>- New alerts are added to an “active” partition in the Alert Event
>>Store.
>>- Resolved alerts are removed from the “active” partition and added
>>to a “resolved” partition.
>>
>> A secondary process within AlertRouter would:
>>
>>- Check for alerts in the “active” partition which do not have a
>>state of “delivered = true”.
>>- Attempt to send each of these alerts and set the “delivered” flag.
>>- Check for alerts in the “resolved” partition which do not have a
>>state of “delivered = true”.
>>- Attempt to send each of these resolved alerts and set the
>>“delivered” flag.
>>- Move all alerts in the “resolved” partition where “delivered=true”
>>to a “completed” partition.
>>
>> Among other metrics, the AlertRouter would emit one called
>> “undelivered_alert_lowest_timestamp_in_seconds” and this could be used to
>> alert me to cases where any alert could not be delivered quickly enough.
>> Since the alert is still held in the Alert Event Store it should be
>> possible for me to resolve whatever issue is blocking and not lose the
>> alert.
>>
>> I think there are other benefits to this architecture too, e.g. similar
>> to the way Prometheus scrapes, natural back-pressure is a property of the
>> system.
>>
>> Anyway, as mentioned I’ve not found anyone else doing something like this
>> and this makes me wonder if there’s a very good reason not to. If anyone
>> knows that this design is crazy I’d love to hear!
>>
>> Thanks
>>
>


Re: [prometheus-developers] Is this alerting architecture crazy?

2021-11-20 Thread Ben Kochie
What gives you the impression that the Alertmanager is "best effort"?

The alertmanager provides a reasonably robust HA solution (gossip
clustering). The only thing best-effort here is actually deduplication. The
Alertmanager design is "at least once" delivery, so it's robust against
network split-brain issues. So in the event of a failure, you may get
duplicate alerts, not none.

When it comes to delivery, the Alertmanager does have retries. If a
connection to PagerDuty or other receivers has an issue, it will retry.
There are also metrics for this, so you can alert on failures to alternate
channels.

What you likely need is a heartbeat setup. Because services like PagerDuty
and Slack do have outages, you can't guarantee delivery if they're down.

The method here is to have an end-to-end "always firing heartbeat" alert,
which goes to a system/service like healthchecks.io or deadmanssnitch.com.
These will trigger an alert in the absence of your heartbeat. Letting you
know that some part of the pipeline has failed.
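
If a hosted service isn't an option, the mechanism itself is simple enough 
to sketch (a rough example - the port, window and log-line "page" are 
placeholders, and the watcher has to live outside the failure domain it is 
watching, which is exactly why an external service is usually the better 
choice):

// A rough sketch of the dead man's switch behind services like
// healthchecks.io: the always-firing heartbeat alert is routed here via a
// webhook, and silence for too long triggers an out-of-band escalation.
package main

import (
    "log"
    "net/http"
    "sync/atomic"
    "time"
)

func main() {
    var lastPing int64 = time.Now().Unix()

    // Alertmanager's webhook for the heartbeat alert posts here.
    http.HandleFunc("/heartbeat", func(w http.ResponseWriter, r *http.Request) {
        atomic.StoreInt64(&lastPing, time.Now().Unix())
        w.WriteHeader(http.StatusOK)
    })

    // Escalate if the heartbeat has been silent for too long.
    go func() {
        const maxSilence = 5 * time.Minute // assumes the heartbeat repeats well inside this window
        for range time.Tick(30 * time.Second) {
            if time.Since(time.Unix(atomic.LoadInt64(&lastPing), 0)) > maxSilence {
                log.Println("alerting pipeline broken: no heartbeat received") // stand-in for paging
            }
        }
    }()

    log.Fatal(http.ListenAndServe(":9097", nil)) // port is arbitrary
}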

On Sat, Nov 20, 2021 at 11:02 AM Tony Di Nucci 
wrote:

> Cross-posted from
> https://discuss.prometheus.io/t/is-this-alerting-architecture-crazy/610
>
> In relation to alerting, I’m looking for a way to get strong alert
> delivery guarantees (and if delivery is not possible I want to know about
> it quickly).
>
> Unless I’m mistaken AlertManager only offers best-effort delivery. What’s
> puzzled me though is that I’ve not found anyone else speaking about this,
> so I worry I’m missing something obvious. Am I?
>
> Assuming I’m not mistaken I’ve been thinking of building a system with the
> architecture shown below.
>
> [image: alertmanager-alertrouting.png]
>
> Basically rather than having AlertManager try and push to destinations I’d
> have an AlertRouter which polls AlertManager. On each polling cycle the
> steps would be (neglecting any optimisations):
>
>- All active alerts are fetched from AlertManager.
>- The last known set of active alerts is read from the Alert Event
>Store.
>- The set of active alerts is compared with the last known state.
>- New alerts are added to an “active” partition in the Alert Event
>Store.
>- Resolved alerts are removed from the “active” partition and added to
>a “resolved” partition.
>
> A secondary process within AlertRouter would:
>
>- Check for alerts in the “active” partition which do not have a state
>of “delivered = true”.
>- Attempt to send each of these alerts and set the “delivered” flag.
>- Check for alerts in the “resolved” partition which do not have a
>state of “delivered = true”.
>- Attempt to send each of these resolved alerts and set the
>“delivered” flag.
>- Move all alerts in the “resolved” partition where “delivered=true”
>to a “completed” partition.
>
> Among other metrics, the AlertRouter would emit one called
> “undelivered_alert_lowest_timestamp_in_seconds” and this could be used to
> alert me to cases where any alert could not be delivered quickly enough.
> Since the alert is still held in the Alert Event Store it should be
> possible for me to resolve whatever issue is blocking and not lose the
> alert.
>
> I think there are other benefits to this architecture too, e.g. similar to
> the way Prometheus scrapes, natural back-pressure is a property of the
> system.
>
> Anyway, as mentioned I’ve not found anyone else doing something like this
> and this makes me wonder if there’s a very good reason not to. If anyone
> knows that this design is crazy I’d love to hear!
>
> Thanks
>
>
