Hello!

This is somewhere between a feature request and a number of questions to 
help me understand some of the design decisions made in Alertmanager.

When Alertmanager cannot expand a template, for example because the 
operator has made a mistake in the template:

receivers:
- name: test
  email_configs:
  - to: exam...@example.com
    from: nore...@example.com
    smarthost: 127.0.0.1:8585
    require_tls: false
    text: "{{ $labels.foo }}"
route:
  receiver: test
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 1m

it logs an error similar to the following:

ts=2023-02-07T13:28:04.815Z caller=dispatch.go:352 level=error 
component=dispatcher msg="Notify for alerts failed" num_alerts=1 
err="test/email[0]: notify retry canceled due to unrecoverable error after 
1 attempts: execute text template: template: :1: undefined variable 
\"$labels\""
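
For reference, the tail of that message has the shape of an error returned by Go's text/template parser. The minimal sketch below reproduces it outside Alertmanager. The helper name parseText is hypothetical, and the sketch assumes nothing about Alertmanager's own code path beyond its use of Go templates.

```go
package main

import (
	"fmt"
	"text/template"
)

// parseText is a hypothetical helper: it only parses the template text
// and returns whatever error text/template reports.
func parseText(text string) error {
	_, err := template.New("").Parse(text)
	return err
}

func main() {
	// $labels is never declared inside the template, so parsing fails.
	fmt.Println(parseText(`{{ $labels.foo }}`))
	// Field references are only resolved against data at execution time,
	// so this text parses without error.
	fmt.Println(parseText(`{{ .CommonLabels.foo }}`))
}
```

An undeclared variable is rejected before any alert data is involved, so retrying with the same template text cannot succeed.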

I understand that, following this error, Alertmanager will begin the retry 
stage of the notification until the next group_interval or repeat_interval. 
To resolve the issue, the user must fix their template and reload 
Alertmanager.

However, it seems to me that it's not uncommon to have quite complex 
templates, with for loops, if statements, and sub-templates. It can be 
quite difficult to verify the correctness of these templates at 
"compile-time", and if using amtool, you need to test all possible branches 
in the template.
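
To illustrate why every branch needs testing, here is a small sketch using Go's text/template directly. The notification type and the template text are hypothetical stand-ins, not Alertmanager's real data model. The template parses cleanly, and only the "else" branch fails, so the error shows up only for some inputs.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// notification is a hypothetical stand-in for the data passed to a template.
type notification struct {
	Status string
	Alerts []string
}

// The "else" branch refers to a field that notification does not have.
const text = `{{ if eq .Status "firing" }}{{ len .Alerts }} firing{{ else }}{{ .ResolvedAt }}{{ end }}`

// render parses text and executes it against n.
func render(n notification) (string, error) {
	t, err := template.New("").Parse(text)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, n); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// The same template succeeds or fails depending on which branch runs.
	for _, status := range []string{"firing", "resolved"} {
		out, err := render(notification{Status: status, Alerts: []string{"a"}})
		fmt.Printf("status=%s out=%q err=%v\n", status, out, err)
	}
}
```

A check that only renders the "firing" case would pass, and the mistake would surface only when an alert resolves.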

While I appreciate that the responsibility for writing correct templates 
lies with the user, I have also been considering whether Alertmanager should 
be more tolerant of template errors and attempt to send some kind of 
notification when this happens. For example, it could fall back to the 
default template, which we have high confidence is correct.
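
As a rough sketch of what such a fallback could look like (this is not how Alertmanager behaves today, and every name below is hypothetical), the user's template is tried first, and a simple default is rendered if it fails. The original error is kept so it can still be logged.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// defaultText is a hypothetical, deliberately simple fallback template.
const defaultText = `{{ len .Alerts }} alert(s), status {{ .Status }}`

// notification is a hypothetical stand-in for the template data.
type notification struct {
	Status string
	Alerts []string
}

// result carries the rendered body and, if the fallback was used,
// the error produced by the user's template.
type result struct {
	Body    string
	UserErr error
}

// execute parses text and executes it against data.
func execute(text string, data any) (string, error) {
	t, err := template.New("").Parse(text)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// renderWithFallback tries userText first and falls back to defaultText.
// It returns an error only if the default template fails as well.
func renderWithFallback(userText string, data any) (result, error) {
	body, userErr := execute(userText, data)
	if userErr == nil {
		return result{Body: body}, nil
	}
	body, err := execute(defaultText, data)
	if err != nil {
		return result{}, err
	}
	return result{Body: body, UserErr: userErr}, nil
}

func main() {
	n := notification{Status: "firing", Alerts: []string{"a", "b"}}
	r, err := renderWithFallback(`{{ $labels.foo }}`, n)
	fmt.Printf("body=%q userErr=%v err=%v\n", r.Body, r.UserErr, err)
}
```

One trade-off of this design is that a broken template no longer blocks the notification, so the logged error becomes the only signal that the custom template needs fixing.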

However, before discussing the issue further, I would like to first 
understand whether there is a conscious design choice behind how 
Alertmanager operates under such failures, or whether it came to be perhaps 
due to ease of implementation.

Thank you, and I'm very interested to hear your opinions.

Kind regards,

George
