[prometheus-developers] Proposal: Alertmanager Log Receiver

2021-06-09 Thread Levi Harrison
Hi everyone,

I'd like to share a design doc for a log receiver that supports logging to 
stdout, a file, and syslog.

https://docs.google.com/document/d/1Oevu2stHVGAupzmc9C7_wW5nTb_CJ6Ut72viXfve6zI/edit?usp=sharing



Re: [prometheus-developers] Add metric for scrape timeout

2021-06-09 Thread Bjoern Rabenstein
On 06.06.21 09:56, Christian Galsterer wrote:
> There are metrics for the actual scrape duration, but currently there are no 
> metrics for the scrape timeouts. Adding a metric for the scrape timeout 
> would make it possible to monitor and alert on scrape timeouts without 
> hard-coding the timeouts in the PromQL queries; the new metric could be 
> used instead.
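
For illustration, the difference would be roughly the following. This is only
a sketch: `scrape_timeout_seconds` stands for the proposed metric and does not
exist today, and the 10s timeout, the 90% threshold, and the rule names are
made up for the example.

  groups:
    - name: scrape-timeouts
      rules:
        # Today: the configured timeout has to be hard-coded in the expression.
        - alert: ScrapeApproachingTimeoutHardcoded
          expr: scrape_duration_seconds > 0.9 * 10
        # With the proposed metric, the configured timeout is available in PromQL.
        - alert: ScrapeApproachingTimeout
          expr: scrape_duration_seconds > 0.9 * scrape_timeout_seconds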

Sounds like a good idea at first glance, but note that this would be
yet another metric that gets automatically added to every single
target. I think we have to be careful when doing so.

Your proposal mirrors a part of the configuration into metrics. That
is sometimes a neat thing to do, but it has to be enjoyed responsibly.

In this case, you want to specifically alert on scrape timeouts (or, I
guess, approaching them). The same argument could be made to alert on
exceeding (or approaching) the sample limit. So we need a new scrape
metric for the `sample_limit` configuration setting, too. The same is
true for all the other limits: `label_limit`,
`label_name_length_limit`, `label_value_length_limit`,
`target_limit`. So we have to add _six_ new metrics. Also, I had a
bunch of situations where I would have liked to know the intended
scrape interval of a series (rather than guessing it from the spacing
I could see in the samples of the series). So yet another metric for
the configured scrape interval. Things are getting out of control
here...

The question is, of course, why you would like to alert on scrape
timeout specifically. There are many possible reasons why a scrape
fails. Generally, I would recommend just alerting on `up` being zero
too often. If that alert fires, you can then check out the Prometheus
server in question and investigate _why_ the scrapes are failing.
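
A minimal sketch of what such an alert could look like (the window, threshold,
and rule name here are arbitrary examples, adjust to taste):

  groups:
    - name: target-health
      rules:
        - alert: TargetScrapesFailing
          expr: avg_over_time(up[10m]) < 0.9
          for: 5m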

Interestingly, we have a metric
`prometheus_rule_group_interval_seconds` for the configured evaluation
interval of a rule group. Note, however, that this is not a synthetic
metric injected alongside the evaluation result of the rule, but only
exposed by the `/metrics` endpoint of Prometheus itself. That's only
one metric per rule group, and it's exposed for meta-monitoring, which
could be on a separate server, so it doesn't "pollute" the normal
metrics.
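
For example, on the meta-monitoring side the configured interval can be
compared with the observed evaluation duration. A sketch, assuming the
`prometheus_rule_group_last_duration_seconds` gauge that is exposed on the
same `/metrics` endpoint:

  groups:
    - name: prometheus-meta
      rules:
        - alert: RuleGroupEvaluationSlow
          expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
          for: 10m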

In summary, I'm pretty sure we shouldn't add half a dozen synthetic
metrics for each target to mirror its configuration into metrics. But
perhaps we could add more metrics for meta-monitoring. Have a look at
the already existing metrics beginning with
`prometheus_target_...`. There is for example
`prometheus_target_scrapes_exceeded_sample_limit_total`, but note that
this is just one metric for the whole server. It's mostly meant to get
a specific alert if _any_ targets run into the sample limit. Perhaps
the same could be done for timeouts as
`prometheus_target_scrapes_exceeded_scrape_timeout`.
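
A rough sketch of the kind of alert that metric is meant for (the 15m window
and the rule name are arbitrary):

  groups:
    - name: prometheus-meta
      rules:
        - alert: TargetsExceededSampleLimit
          expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[15m]) > 0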

-- 
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in



Re: [prometheus-developers] Alerting rule for gauge metric with new label value

2021-06-09 Thread Bjoern Rabenstein
[Redirecting this to prometheus-users@ and bcc'ing
prometheus-developers@ because it is about using Prometheus, not
developing it.]

On 05.06.21 10:00, karthik reddy wrote:
> 
> 
> Hello Developers,
> 
> Please let me know how to create an alerting rule such that, whenever 
> Prometheus scrapes a gauge metric with a new label value from Pushgateway, 
> its value is checked against a range and an alert is raised if it is out of 
> range.
> 
> For example, I want to alert if file_size > 100 for newly added files; the 
> id is different and random for each file:
> file_size{job="pushgateway", id="F234"} 80 (in GB)
> file_size{job="pushgateway", id="F129"} 40 (in GB)
> 
> Whenever a new series file_size{job="pushgateway", id="F787"} 23 is added to 
> Prometheus, I need to check whether 23 > 100 and send an alert mail along 
> the lines of "file with id F787 size exceeded".

I think you could craft something with `absent` and `offset` so that
the alert only fires if the corresponding time series wasn't there a
certain amount of time ago.
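
An untested sketch of one possible variant, using `unless` with `offset` to
approximate "this series did not exist an hour ago" per series (the 1h
lookback and the 100 threshold are just taken from your example):

  groups:
    - name: pushgateway-files
      rules:
        - alert: NewFileTooLarge
          expr: file_size > 100 unless file_size offset 1h
          annotations:
            summary: "file with id {{ $labels.id }} size exceeded"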

However, this all smells quite event-driven: Pushing something like an
event to the Pushgateway, then creating a one-shot alert based on that
"event"... Perhaps you are shoehorning Prometheus into something it's
not good at? A Prometheus alert is usually something that keeps firing
for as long as the alerting condition persists. Are files larger than
100GiB suddenly fine once they have been around for a while? (And how
long is that "a while"?)

-- 
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in
