> Honestly, most of what you want is stuff we could support in Alertmanager 
without a lot of trouble. And are things that other users would want as 
well. Rather than build a whole new system, why not contribute improvements 
directly to the Alertmanager.

That's a very good point and something I think would be great to do.  
One thing I'll have to keep in mind, though, is how things may play out in 
the world of hosted "Prometheus" solutions - if I were to go with one of 
these solutions then I'd have no control over when new features would be 
made available.

FWIW, the custom routing I'm talking about is very business-specific and 
involves consulting (yet another!) system to determine the final alert 
severity and where it gets routed to.  I guess this could be supported in 
AlertManager (by having hooks or plugins), though whether the maintainers 
of AM would want this is obviously its own question.
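
To make that concrete, here's a very rough sketch (Go) of the kind of 
webhook receiver I have in mind.  The "severity service" URL, the 
routingDecision shape and the forwarding target are all hypothetical 
placeholders for our internal business system, and the payload struct only 
covers the handful of Alertmanager webhook fields the sketch actually uses 
- it's purely illustrative, not a concrete design.

package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"os"
)

// Subset of the Alertmanager webhook payload that this sketch cares about.
type webhookPayload struct {
	Status       string            `json:"status"`
	GroupLabels  map[string]string `json:"groupLabels"`
	CommonLabels map[string]string `json:"commonLabels"`
	Alerts       []struct {
		Labels      map[string]string `json:"labels"`
		Fingerprint string            `json:"fingerprint"`
	} `json:"alerts"`
}

// Hypothetical response from the business system deciding severity/routing.
type routingDecision struct {
	Severity string `json:"severity"`
	Target   string `json:"target"` // e.g. a team- or channel-specific endpoint
}

func handle(w http.ResponseWriter, r *http.Request) {
	var payload webhookPayload
	if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}

	// Consult the (hypothetical) business system for the final severity
	// and routing target, based on the alert group's common labels.
	reqBody, _ := json.Marshal(payload.CommonLabels)
	resp, err := http.Post(os.Getenv("SEVERITY_SVC_URL"), "application/json",
		bytes.NewReader(reqBody))
	if err != nil {
		// A 5xx response lets Alertmanager retry the notification later.
		http.Error(w, "severity service unavailable", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	var decision routingDecision
	if err := json.NewDecoder(resp.Body).Decode(&decision); err != nil {
		http.Error(w, "bad severity response", http.StatusBadGateway)
		return
	}

	// Forward the enriched alert group to wherever the decision points.
	out, _ := json.Marshal(map[string]interface{}{
		"severity": decision.Severity,
		"payload":  payload,
	})
	fwdResp, err := http.Post(decision.Target, "application/json",
		bytes.NewReader(out))
	if err != nil {
		http.Error(w, "forwarding failed", http.StatusBadGateway)
		return
	}
	fwdResp.Body.Close()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/webhook", handle)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

The one deliberate choice worth calling out: any failure talking to the 
downstream systems returns a non-2xx status, so Alertmanager's existing 
retry behaviour still applies rather than the receiver having to track 
delivery state itself.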

I'll discuss this with my colleagues to see whether we can consider 
contributing to AlertManager.

Thanks for the help!

On Monday, November 22, 2021 at 3:29:08 PM UTC sup...@gmail.com wrote:

> On Mon, Nov 22, 2021 at 4:03 PM Tony Di Nucci <tonyd...@gmail.com> wrote:
>
>> Thanks for the feedback Stuart, I really appreciate you taking the time 
>> and you've given me reason to pause and reconsider my options.
>>
>> I fully understand your concerns over having a new data store.  I'm not 
>> sure that AlertManager and Prometheus contain the state I need though, and 
>> I'm not sure I should attempt to use Prometheus as the store for this state 
>> (tracking per-alert latencies would produce a metric with unbounded 
>> cardinality, each series would contain just a single data point, and it 
>> would be tricky to analyse this data).
>>
>> On the "guaranteeing" delivery front, you of course have a point that 
>> the more moving parts there are, the more that can go wrong.  From the 
>> sounds of things though, I don't think we're debating the need for another 
>> system (since this is what a webhook receiver would be?).  
>>
>> Unless I'm mistaken, to hit the following requirements there'll need to 
>> be a system external to AlertManager, and this will have to maintain some 
>> state:
>> * supporting complex alert enrichment (in ways that cannot be defined in 
>> alerting rules)
>>
>
> We actually are interested in adding this to the alertmanager; there are a 
> few open proposals for this. Basically the idea is that you can make an 
> enrichment call at alert time to do things like grab metrics/dashboard 
> snapshots, other system state, etc.
>  
>
>> * support business specific alert routing rules (which are defined 
>> outside of alerting rules)
>>
>
> The alertmanager routing rules are pretty powerful already. Depending on 
> what you're interested in adding, this is something we could support 
> directly.
>  
>
>> * support detailed alert analysis (which includes per alert latencies)
>>
>
> This is, IMO, more of a logging problem. I think this is something we 
> could add. You ship the alert notifications to any kind of BI system you 
> like, ELK, etc. 
>
> Maybe something to integrate into 
> https://github.com/yakshaving-art/alertsnitch.
>  
>
>>
>> I think this means that the question is limited to: is it better in my 
>> case to push or to pull from AlertManager?  BTW, I'm sorry for the way I 
>> worded my original post because I now realise how important it was to make 
>> explicit the requirements that (I think) necessitate the majority of the 
>> complexity.
>>
>
> Honestly, most of what you want is stuff we could support in Alertmanager 
> without a lot of trouble. And are things that other users would want as 
> well. Rather than build a whole new system, why not contribute improvements 
> directly to the Alertmanager.
>  
>
>>
>> As I still see it, the problems with the push approach (which are not 
>> present with the pull approach) are:
>> * It's only possible to know that an alert cannot be delivered after 
>> waiting for *group_interval* (typically many minutes)
>> * At a given moment it's not possible to determine whether a specific 
>> active alert has been delivered (at least I'm not aware of a way to 
>> determine this)
>> * It is possible for alerts to be dropped (e.g. 
>> https://github.com/prometheus/alertmanager/blob/b2a4cacb95dfcf1cc2622c59983de620162f360b/cluster/delegate.go#L277
>> ) 
>>
>> The tradeoffs for this are:
>> * I'd need to discover the AlertManager instances.  This is pretty 
>> straightforward in k8s.
>> * I may need to dedupe alert groups across AlertManager instances.  I 
>> think this would be pretty straightforward too, esp. since AlertManager 
>> already populates fingerprints.
>>
>>
>>  
>>
>> On Sunday, November 21, 2021 at 10:28:49 PM UTC Stuart Clark wrote:
>>
>>> On 20/11/2021 23:42, Tony Di Nucci wrote: 
>>> > Yes, the diagram is a bit of a simplification but not hugely. 
>>> > 
>>> > There may be multiple instances of AlertRouter however they will share 
>>> > a database.  Most likely things will be kept simple (at least 
>>> > initially) where each instance holds no state of its own.  Each active 
>>> > alert in the DB will be uniquely identified by the alert fingerprint 
>>> > (which the AlertManager API provides, i.e. a hash of the alert group's 
>>> > labels).  Each non-active alert will have a composite key (where one 
>>> > element is the alert group fingerprint). 
>>> > 
>>> > In this architecture I see AlertManager having the responsibilities of 
>>> > capturing, grouping, inhibiting and silencing alerts.  The AlertRouter 
>>> > will have the responsibilities of; enriching alerts, routing based on 
>>> > business rules, monitoring/guaranteeing delivery and enabling analysis 
>>> > of alert history. 
>>> > 
>>> > Due to my requirements, I think I need something like the 
>>> > AlertRouter.  The question is really, am I better to push from 
>>> > AlertManager to AlertRouter, or to have AlertRouter pull from 
>>> > AlertManager.  My current opinion is that pulling comes with more 
>>> > benefits but since I've not seen anyone else doing this I'm concerned 
>>> > there could be good reasons (I'm not aware of) for not doing this. 
>>>
>>> If you really must have another system connected to Alertmanager having 
>>> it respond to webhook notifications would be the much simpler option. 
>>> You'd still need to run multiple copies of your application behind a load 
>>> balancer (and have a clustered database) for HA, but at least you'd not 
>>> have the complexity of each instance having to discover all the 
>>> Alertmanager instances, query them and then deduplicate amongst the 
>>> different instances (again something that Alertmanager does itself 
>>> already). 
>>>
>>> I'm still struggling to see why you need an extra system at all - it 
>>> feels very much like you'd be increasing complexity significantly which 
>>> naturally decreases reliability (more bits to break, have bugs or act in 
>>> unexpected ways) and slow things down (as there is another "hop" for an 
>>> alert to pass through). All of the things you mention can be done 
>>> already through Alertmanager, or could be done pretty simply with a 
>>> webhook receiver (without the need for any additional state storage, 
>>> etc.) 
>>>
>>> * Adding data to an alert could be done with a simple webhook receiver 
>>> that accepts an alert and then forwards it on to another API with extra 
>>> information added (no need for any state) 
>>> * Routing can be done within Alertmanager, or for more complex cases 
>>> could again be handled by a stateless webhook receiver 
>>> * With regard to "guaranteeing" delivery, I don't see how your suggestion 
>>> allows that (I believe it would actually make that less likely overall 
>>> due to the added complexity and likelihood of bugs/unhandled cases). 
>>> Alertmanager already does a good job of retrying on errors (and updating 
>>> metrics if that happens) but not much can be done if the final system is 
>>> totally down for long periods of time (and for many systems if that 
>>> happens old alerts aren't very useful once it is back, as they may have 
>>> already resolved). 
>>> * Alertmanager and Prometheus already expose a number of useful metrics 
>>> (make sure your Prometheus is scraping itself & all the connected 
>>> Alertmanagers) which should give you lots of useful information about 
>>> alert history (with the advantage of that data being with the monitoring 
>>> system you already know [with whatever you have connected like 
>>> dashboards, alerts, etc.]) 
>>>
>>> -- 
>>> Stuart Clark 
>>>
>
