On 19/02/2021 19:43, Badreddin Aboubakr wrote:
Hello,
We use Prometheus to monitor our infrastructure (hypervisors,
gateways, storage servers, etc). Scrape targets are sourced from a
Postgres database, which contains additional information about the “in
production” state of the target. In the beginning we used to have a
metadata metric which indicated the state of the server as an `enum`
metric.
By joining the state metric in each alerting rule and then dropping
the alerts which have a specific state, we were able to suppress
unneeded alerts.
As the number of alerting rules and states grew, joining on these
metrics in every alerting rule became so expensive that we wrote
recording rules which continuously evaluate the enum metric and
produce an enum metric with lower cardinality: production (where
alerts pass to their receivers) and everything else (dropped at the
Alertmanager step). So again we join on these metrics and drop alerts
for non-production hosts.
That was only a temporary solution, and it is not going to scale as
our alerting rules keep growing.
So we discussed some solutions:
* We can set silences and remove them on state change using the
Alertmanager API:
This approach may be too dynamic, however. I don’t know if the
Alertmanager API was designed for this purpose, or whether it will
scale with the number of silences and hosts.
* We can develop a kind of proxy, deployed between Prometheus and
Alertmanager, which drops alerts for hosts in a non-production state:
This approach is dangerous, because if the proxy fails, no alerts will
reach Alertmanager.
* We can put the proxy on the notification path instead: This makes
things more complicated, as the proxy has to understand receivers, etc.
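
For the first option, a minimal sketch of what setting a silence on a state change might look like. It assumes Alertmanager's v2 API (`POST /api/v2/silences`) and that alerts carry an `instance` label identifying the host. The function names, the `state-sync` author string and the 24-hour default are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_silence(instance, state, start, hours=24):
    """Build a silence body that matches every alert for one host."""
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            # Exact (non-regex) match on the assumed `instance` label.
            {"name": "instance", "value": instance, "isRegex": False, "isEqual": True}
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": "state-sync",  # hypothetical author name
        "comment": f"host is in state {state!r}",
    }


def post_silence(base_url, silence):
    """Send the silence to Alertmanager and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/silences",
        data=json.dumps(silence).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A sync job would call `build_silence` when a host leaves production and expire the silence when it returns. Whether this scales with the number of hosts is exactly the open question above.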
PS: We still want to scrape and monitor the servers which are not in
production state.
We will be really thankful for any suggestions or ideas.
Couldn't you run two sets of Prometheus servers, monitoring the
production infrastructure separately from the non-production? Then just
don't define alerting rules on the non-production servers, or don't
connect them to any Alertmanager.
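
As a rough sketch of how the split could be fed from the existing database: assuming the Postgres rows can be read as `(address, state)` pairs and that each Prometheus set uses file-based service discovery, the targets might be partitioned like this. The `split_targets` name, the `state` label and the `"production"` value are hypothetical.

```python
import json


def split_targets(rows, production_state="production"):
    """Partition (address, state) rows into two file_sd target lists."""
    prod, other = [], []
    for address, state in rows:
        # One file_sd entry per host, keeping the state as a label.
        entry = {"targets": [address], "labels": {"state": state}}
        (prod if state == production_state else other).append(entry)
    return prod, other


rows = [("hv01:9100", "production"), ("hv02:9100", "maintenance")]
prod, other = split_targets(rows)
# Each list would be written to the target file of one Prometheus set.
print(json.dumps(prod, indent=2))
```

Hosts that change state simply move from one file to the other, so they are still scraped, just by the set that has no alerting attached.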
--
Stuart Clark