Re: [prometheus-users] Re: Spreading single alertmanager cluster nodes over multiple geographical regions

hartfordfive Wed, 12 Mar 2025 08:36:08 -0700

Great, thank you for the feedback. Should any of the flags be set to
custom values when deploying across a wide WAN, or should the default values
still suffice?



On Monday, March 3, 2025 at 7:03:30 AM UTC-5 Ben Kochie wrote:

> Part of the Prometheus/Alertmanager design is to better survive WAN 
> split-brain.
>
> IMO, running a wide Alertmanager cluster is a good idea when you have a
> wide network. The AM gossip protocol and deduplication are designed to fail
> open in the event of a split brain.
>
> The only thing you have to be aware of is that Prometheus-to-Alertmanager
> communication is all-to-all. All Prometheus instances need to send to all
> Alertmanagers.
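
(Illustrative note on the all-to-all point above. This is a minimal sketch, not
something taken from the thread: the hostnames, the ALERTMANAGERS mapping and
the helper names are hypothetical. It assumes two Alertmanager peers per region
and shows that every Prometheus instance, whatever its region, would get the
same `alerting` section listing every peer.)

```python
# Hypothetical Alertmanager peers per region; replace with real hostnames.
ALERTMANAGERS = {
    "AMER": ["am1.amer.example.com:9093", "am2.amer.example.com:9093"],
    "EMEA": ["am1.emea.example.com:9093", "am2.emea.example.com:9093"],
    "APAC": ["am1.apac.example.com:9093", "am2.apac.example.com:9093"],
}


def all_targets(alertmanagers_by_region):
    """Flatten the per-region mapping into one sorted list of every peer."""
    return sorted(
        target
        for peers in alertmanagers_by_region.values()
        for target in peers
    )


def render_alerting(targets):
    """Render the `alerting` section of prometheus.yml as YAML text."""
    lines = [
        "alerting:",
        "  alertmanagers:",
        "    - static_configs:",
        "        - targets:",
    ]
    # Each target is emitted as a single-quoted YAML string.
    lines += [f"            - '{target}'" for target in targets]
    return "\n".join(lines)


if __name__ == "__main__":
    # The same output would be used by every Prometheus instance,
    # regardless of which region that instance runs in.
    print(render_alerting(all_targets(ALERTMANAGERS)))
```

The point of generating one shared list is that no Prometheus is configured
with only its "local" Alertmanagers, so each peer still receives every alert
directly if gossip between regions is interrupted.
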
>
> On Thu, Feb 27, 2025 at 5:38 PM 'Brian Candler' via Prometheus Users <
> [email protected]> wrote:
>
>> On Thursday, 27 February 2025 at 15:37:54 UTC hartfordfive wrote:
>>
>> With this approach, multiple AZs, which are typically each hosted within a
>> single DC, still run the risk of being inaccessible should the link to the
>> DC go down. So let's say you have datacenters in 3 regions (AMER, EMEA
>> and APAC) and you've chosen to have a single AM cluster in EMEA. Should the
>> link between AMER and EMEA and/or EMEA and APAC go down, then Prometheus
>> instances located in AMER or APAC won't be able to send alert
>> notifications. If you instead had 2 or 3 Alertmanager instances in each of
>> these regions, wouldn't that still allow alerts to be received and actioned
>> within each of those regions?
>>
>>
>> Only you know what the meaningful failure modes are for your environment. 
>> It seems to me that you expect key DC-to-DC connectivity to go down, but 
>> you are still able to send alerts (presumably via Internet or some other 
>> out-of-band means).  You could get Prometheus to talk to alertmanager over 
>> the Internet too, using https, if you felt that was more reliable.
>>
>> Also, if DC-to-DC communication is unreliable, then personally I would 
>> not want to run any sort of distributed application across it (alertmanager 
>> or otherwise), due to problems with partitioning / split brain.
>>
>> However, you need to make your own call as to what works best for you, 
>> and what is the optimum tradeoff between cost, complexity, and 
>> reliability.  My gut feeling is towards simplicity and reliability, which 
>> for me means either a single global alertmanager cluster, or a separate AM 
>> cluster per region, but you can build whatever you're comfortable with.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion visit 
https://groups.google.com/d/msgid/prometheus-users/e0d30be0-0dfb-421a-a457-ebef81b4d1d9n%40googlegroups.com.
