[prometheus-users] Re: Remote write dying after some time.

2023-01-18 Thread Christian Oelsner
Hi Brian,
Thanks for your input, it is as always appreciated.
I will try to have the observability team enable some debug logging on the 
agent and see if I can spot something.

Regards

Christian Oelsner

On Wednesday, 18 January 2023 at 14:14:23 UTC+1, Brian Candler wrote:

> Looks to me like a problem at the receiver end (i.e. the middleware 
> Elastic agent, or Elasticsearch itself): that side has stopped 
> accepting data.
>
> Try looking at logs of these to determine why they are no longer accepting 
> data.
>
> On Wednesday, 18 January 2023 at 11:29:05 UTC christia...@gmail.com wrote:
>
>> Hello guys.
>>
>> I am scraping some metrics which are then shipped off to an Elastic agent 
>> to be ingested into Elasticsearch. All seems fine to start with, but after 
>> some time, metrics stop coming in, and the Prometheus logs show a lot of 
>> entries like this:
>>
>> ts=2023-01-18T10:51:46.125Z caller=dedupe.go:112 component=remote 
>> level=warn remote_name=010ca8 url=
>> http://agent-svc.observability.svc.cluster.local:9201/write msg="Failed 
>> to send batch, retrying" err="Post \"
>> http://agent-svc.observability.svc.cluster.local:9201/write\": context 
>> deadline exceeded"
>>
>> ts=2023-01-18T10:51:20.364Z caller=dedupe.go:112 component=remote 
>> level=debug remote_name=010ca8 url=
>> http://agent-svc.observability.svc.cluster.local:9201/write msg="Not 
>> downsharding due to being too far behind"
>>
>> I am guessing that Prometheus is trying to tell me something, but I just 
>> don't know what.
>>
>> Checking the TSDB status on the prom UI, it tells me that Number of series 
>> is 8439, which does not sound like a lot.
>> Any help would be much appreciated.
>>
>> Best regards
>> Christian Oelsner
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/4cb58947-e401-4213-bd34-8e00096f3807n%40googlegroups.com.


Re: [prometheus-users] Time series with change interval much less than scrape interval

2023-01-18 Thread Stuart Clark

On 18/01/2023 00:15, Mark Selby wrote:

I am struggling with PromQL over an issue dealing with a metric that
changes less frequently than the scrape interval. I am trying to use
Prometheus as a pseudo event tracker and hoping to get some advice on
how to best try and accomplish my goal.
I think this is the fundamental issue you are facing. Prometheus isn't 
an event system. It is designed for metrics, which are pretty different 
to events. It sounds like you should look at a system like Loki, 
Elasticsearch or a general purpose SQL or key/value database, as they 
are likely to be a much better fit for you than a timeseries database 
and ecosystem that is designed for handling metrics.


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/35ad9d4d-fc3f-2d6a-9f48-4cddb96b9fe9%40Jahingo.com.


Re: [prometheus-users] Prometheus metrics where change interval and scrape interval are quite different

2023-01-18 Thread Mark Selby
Thanks very much for taking the time to reply. Indeed, this is an example of 
having a hammer and seeing everything as a nail. I do need a system to deal 
with event data and I will probably go with a Postgres solution. Luckily I 
have other needs for Postgres, so this is not as heavyweight as it would be 
just for this use.

On Wednesday, January 18, 2023 at 6:01:30 AM UTC-8 juliu...@promlabs.com 
wrote:

> Hi Mark,
>
> That is indeed not directly possible with PromQL (though you could pull 
> the data out of course), since Prometheus and PromQL are very decidedly 
> about metrics and not about tracking individual events. So you'll either 
> need an event processing system for this, or formulate the problem in a 
> different way so that it works better with metrics. What is it that you 
> want to do based on the data in the end (e.g. alert on some condition)? 
> Maybe there's a better, Prometheus-compatible pattern that we can suggest.
>
> Also, given your current data, is it actually possible for two runs of the 
> same job to produce the same sample value, so you wouldn't even be able to 
> distinguish them anyway?
>
> Regards,
> Julius
>
> On Wed, Jan 18, 2023 at 2:00 PM Mark Selby  wrote:
>
>> I am struggling with PromQL over an issue dealing with a metric that 
>> changes less frequently than the scrape interval. I am trying to use 
>> Prometheus as a pseudo event tracker and hoping to get some advice on how 
>> to best try and accomplish my goal.
>>
>> I have a random job that runs at different intervals depending on the 
>> situation. Some instances of the job run every five minutes and some run 
>> only once an hour or once a day. The job creates a node_exporter textfile 
>> snippet that gets scraped on a 30 second interval.
>>
>> Below is an example of a metric that changes only every five minutes with 
>> the shorter scrape interval. In this scenario all the points with the same 
>> value are from the same job run. I really only care about one of those.
>>
>> I have no way to know what the interval is between sets for all my 
>> different jobs. All I know is that when the value changes, a new set is in 
>> play.
>>
>> What I want to do is "reduce" my dataset to deal with only distinct 
>> values. I want to collapse these 27 entries into 3 by taking either the 
>> first or last value of each "set".
>>
>> I cannot find a PromQL function/operator that does what I want. Maybe I 
>> need to use recording rules?
>>
>> Any and all help is greatly appreciated.
>>
>> metric_name{instance="hostname.example.net", job="external/generic", 
>> mode="pull", name="snafu"}
>>
>> 9973997301 @1673997343.774
>> 9973997301 @1673997373.764
>> 9973997301 @1673997403.764
>> 9973997301 @1673997433.764
>> 9973997301 @1673997463.764
>> 9973997301 @1673997493.764
>> 9973997301 @1673997523.764
>> 9973997301 @1673997553.764
>> 9973997301 @1673997583.764
>>
>> 9973997601 @1673997613.764
>> 9973997601 @1673997643.764
>> 9973997601 @1673997673.764
>> 9973997601 @1673997703.774
>> 9973997601 @1673997733.764
>> 9973997601 @1673997763.764
>> 9973997601 @1673997793.764
>> 9973997601 @1673997823.764
>> 9973997601 @1673997853.863
>>
>> 9973997901 @1673997913.764
>> 9973997901 @1673997943.767
>> 9973997901 @1673997973.764
>> 9973997901 @1673998003.764
>> 9973997901 @1673998033.764
>> 9973997901 @1673998063.764
>> 9973997901 @1673998093.764
>> 9973997901 @1673998123.764
>> 9973997901 @1673998153.764

Re: [prometheus-users] alert rules: multiple rules for general & special cases

2023-01-18 Thread Brian Candler
If you refactor the rules a bit, you may find them easier to maintain:

alert1:
  expr: probe_success{somelabel="XYZ"} == 0
  labels:
someswitch: foo

alert2:
  expr: probe_success{somelabel="ABC"} == 0
  labels:
someswitch: bar

alert3:
  expr: |
probe_success == 0
unless probe_success{somelabel="XYZ"} == 0
unless probe_success{somelabel="ABC"} == 0

The 'special cases' alert1 and alert2 have particular rules; alert3 has the 
generic catch-all rule with 'unless' blocks to suppress the alert1 and 
alert2 cases, but using identical expressions.

I think this approach is easier to reason about than having to generate new 
expressions with inverted logic. The expressions do have to return the same 
label sets (which, in the case of the same metric, should be true).
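
For illustration, a complete rule group built around this pattern might look 
like the sketch below (the alert names and "for" durations are placeholders, 
not taken from the thread):

groups:
  - name: probe-alerts
    rules:
      - alert: ProbeDownSpecialXYZ
        expr: probe_success{somelabel="XYZ"} == 0
        for: 30m            # longer duration for this special case (placeholder)
        labels:
          someswitch: foo
      - alert: ProbeDownSpecialABC
        expr: probe_success{somelabel="ABC"} == 0
        for: 30m
        labels:
          someswitch: bar
      - alert: ProbeDownGeneral
        expr: |
          probe_success == 0
          unless probe_success{somelabel="XYZ"} == 0
          unless probe_success{somelabel="ABC"} == 0
        for: 5m             # shorter duration for the catch-all case (placeholder)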

On Wednesday, 18 January 2023 at 13:48:59 UTC juliu...@promlabs.com wrote:

> If your special cases have a *longer* "for" duration than the general 
> ones, then I guess they won't be useful for inhibiting the general ones, 
> since the special cases will start firing too late relative to the general 
> ones to inhibit them. I guess you could introduce a copy of each special 
> case alert without any "for" duration (or a shorter one) that you don't 
> route anywhere and that is only used for inhibitions. And then you have a 
> second version of it that's actually routed, with a longer "for" duration?
>
> Whether that's more maintainable than going for !~ and =~ regex matchers 
> as you described is a good question though. Rather than distinguishing 
> each special case in the alerting rules themselves, maybe you can attach a 
> special new (single) label to your targets that differentiates the general 
> ones from the longer "for" duration ones, so you can just use that one 
> label for filtering in the rules?
>
> On Wed, Jan 18, 2023 at 2:00 PM Mario Cornaccini  
> wrote:
>
>> Hi,
>>
>> For the same metric, I want to have multiple rules in Alertmanager, to 
>> have longer for: times for some special cases.
>>
>> The way I do this now is:
>> alert1 # general
>> probe_success{somelabel!~"specialcase1|specialcase2"}
>> alert2 # special
>> probe_success{somelabel=~"specialcase1|specialcase2"}
>> ... which is obviously hard to maintain, ugly, and won't scale.
>>
>> I've seen this: 
>> https://www.robustperception.io/using-time-series-as-alert-thresholds/
>> but it looks a bit, well, hard to maintain.
>>
>>
>>
>> So I got this idea. What would happen if I did this:
>>
>> in Prometheus rules:
>> alert1 # handles the special case
>> i.e. probe_success{somelabel="XYZ"}
>> labels:
>>   someswitch: "true"
>>
>> alert2 # handles the general case
>> probe_success{}
>>
>> and in Alertmanager:
>> define an inhibit rule which mutes the general alert if there is also a 
>> special-case one, based on the someswitch label. Would that work?
>>
>> any help/pointers/comments greatly appreciated,
>> cheers,
>> mario
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to prometheus-use...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/prometheus-users/5fe28eec-a91c-4619-af41-128966ad08d9n%40googlegroups.com
>>  
>> 
>> .
>>
>
>
> -- 
> Julius Volz
> PromLabs - promlabs.com
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/d33055a4-2b6d-4dca-8d3e-e4a52ef4b18an%40googlegroups.com.


Re: [prometheus-users] Prometheus metrics where change interval and scrape interval are quite different

2023-01-18 Thread Julius Volz
Hi Mark,

That is indeed not directly possible with PromQL (though you could pull the
data out of course), since Prometheus and PromQL are very decidedly about
metrics and not about tracking individual events. So you'll either need an
event processing system for this, or formulate the problem in a different
way so that it works better with metrics. What is it that you want to do
based on the data in the end (e.g. alert on some condition)? Maybe there's
a better, Prometheus-compatible pattern that we can suggest.

Also, given your current data, is it actually possible for two runs of the
same job to produce the same sample value, so you wouldn't even be able to
distinguish them anyway?

Regards,
Julius

On Wed, Jan 18, 2023 at 2:00 PM Mark Selby  wrote:

> I am struggling with PromQL over an issue dealing with a metric that
> changes less frequently than the scrape interval. I am trying to use
> Prometheus as a pseudo event tracker and hoping to get some advice on how
> to best try and accomplish my goal.
>
> I have a random job that runs at different intervals depending on the
> situation. Some instances of the job run every five minutes and some run
> only once an hour or once a day. The job creates a node_exporter textfile
> snippet that gets scraped on a 30 second interval.
>
> Below is an example of a metric that changes only every five minutes with
> the shorter scrape interval. In this scenario all the points with the same
> value are from the same job run. I really only care about one of those.
>
> I have no way to know what the interval is between sets for all my
> different jobs. All I know is that when the value changes, a new set is in
> play.
>
> What I want to do is "reduce" my dataset to deal with only distinct
> values. I want to collapse these 27 entries into 3 by taking either the
> first or last value of each "set".
>
> I cannot find a PromQL function/operator that does what I want. Maybe I
> need to use recording rules?
>
> Any and all help is greatly appreciated.
>
> metric_name{instance="hostname.example.net", job="external/generic",
> mode="pull", name="snafu"}
>
> 9973997301 @1673997343.774
> 9973997301 @1673997373.764
> 9973997301 @1673997403.764
> 9973997301 @1673997433.764
> 9973997301 @1673997463.764
> 9973997301 @1673997493.764
> 9973997301 @1673997523.764
> 9973997301 @1673997553.764
> 9973997301 @1673997583.764
>
> 9973997601 @1673997613.764
> 9973997601 @1673997643.764
> 9973997601 @1673997673.764
> 9973997601 @1673997703.774
> 9973997601 @1673997733.764
> 9973997601 @1673997763.764
> 9973997601 @1673997793.764
> 9973997601 @1673997823.764
> 9973997601 @1673997853.863
>
> 9973997901 @1673997913.764
> 9973997901 @1673997943.767
> 9973997901 @1673997973.764
> 9973997901 @1673998003.764
> 9973997901 @1673998033.764
> 9973997901 @1673998063.764
> 9973997901 @1673998093.764
> 9973997901 @1673998123.764
> 9973997901 @1673998153.764
>
> I have tried many of the PromQL functions/operators to try and reduce my
> sets. The count_values() operator is the closest I have come, but that works
> only with instant vectors, not range vectors.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to prometheus-users+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> 

Re: [prometheus-users] alert rules: multiple rules for general & special cases

2023-01-18 Thread Julius Volz
If your special cases have a *longer* "for" duration than the general ones,
then I guess they won't be useful for inhibiting the general ones, since
the special cases will start firing too late relative to the general ones
to inhibit them. I guess you could introduce a copy of each special case
alert without any "for" duration (or a shorter one) that you don't route
anywhere and that is only used for inhibitions. And then you have a second
version of it that's actually routed, with a longer "for" duration?

Whether that's more maintainable than going for !~ and =~ regex matchers as
you described is a good question though. Rather than distinguishing each
special case in the alerting rules themselves, maybe you can attach a
special new (single) label to your targets that differentiates the general
ones from the longer "for" duration ones, so you can just use that one
label for filtering in the rules?
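
A rough sketch of that idea (the alert names, label, and durations below are 
invented for illustration): the inhibition-only copy fires immediately and is 
never routed to a receiver, while the routed copy keeps the longer "for".

groups:
  - name: special-case-alerts
    rules:
      # Copy used only as an inhibition source: no "for", so it starts
      # firing as soon as the condition is true.
      - alert: ProbeDownSpecialInhibitor
        expr: probe_success{somelabel="specialcase1"} == 0
        labels:
          purpose: inhibit_only
      # The real special-case alert, routed normally, with the longer "for".
      - alert: ProbeDownSpecial
        expr: probe_success{somelabel="specialcase1"} == 0
        for: 1h

On the Alertmanager side, a route matching purpose="inhibit_only" could send 
those alerts to an empty receiver, and an inhibit rule could then use them as 
the source for muting the general alert.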

On Wed, Jan 18, 2023 at 2:00 PM Mario Cornaccini 
wrote:

> Hi,
>
> For the same metric, I want to have multiple rules in Alertmanager, to
> have longer for: times for some special cases.
>
> The way I do this now is:
> alert1 # general
> probe_success{somelabel!~"specialcase1|specialcase2"}
> alert2 # special
> probe_success{somelabel=~"specialcase1|specialcase2"}
> ... which is obviously hard to maintain, ugly, and won't scale.
>
> I've seen this:
> https://www.robustperception.io/using-time-series-as-alert-thresholds/
> but it looks a bit, well, hard to maintain.
>
>
>
> So I got this idea. What would happen if I did this:
>
> in Prometheus rules:
> alert1 # handles the special case
> i.e. probe_success{somelabel="XYZ"}
> labels:
>   someswitch: "true"
>
> alert2 # handles the general case
> probe_success{}
>
> and in Alertmanager:
> define an inhibit rule which mutes the general alert if there is also a
> special-case one, based on the someswitch label. Would that work?
>
> any help/pointers/comments greatly appreciated,
> cheers,
> mario
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to prometheus-users+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/5fe28eec-a91c-4619-af41-128966ad08d9n%40googlegroups.com
> 
> .
>


-- 
Julius Volz
PromLabs - promlabs.com

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CAObpH5zLPC9zmUenJyiEGH%2B2tLd9f-PNRctnC3pOGagmLgFJUA%40mail.gmail.com.


[prometheus-users] Re: Remote write dying after some time.

2023-01-18 Thread Brian Candler
Looks to me like a problem at the receiver end (i.e. the middleware Elastic 
agent, or Elasticsearch itself): that side has stopped accepting data.

Try looking at logs of these to determine why they are no longer accepting 
data.
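
For context: "context deadline exceeded" means the write request did not 
finish within Prometheus's remote_timeout, and the "not downsharding" debug 
line indicates the remote-write queue has fallen behind. If the receiver turns 
out to be healthy but merely slow, the remote_write queue can also be tuned; a 
sketch, with purely illustrative values:

remote_write:
  - url: http://agent-svc.observability.svc.cluster.local:9201/write
    remote_timeout: 30s          # time allowed per write request
    queue_config:
      capacity: 10000            # samples buffered per shard
      max_shards: 50             # upper bound on parallel senders
      max_samples_per_send: 2000 # batch size per request
      batch_send_deadline: 5s    # flush incomplete batches after this long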

On Wednesday, 18 January 2023 at 11:29:05 UTC christia...@gmail.com wrote:

> Hello guys.
>
> I am scraping some metrics which are then shipped off to an Elastic agent to 
> be ingested into Elasticsearch. All seems fine to start with, but after 
> some time, metrics stop coming in, and the Prometheus logs show a lot of 
> entries like this:
>
> ts=2023-01-18T10:51:46.125Z caller=dedupe.go:112 component=remote 
> level=warn remote_name=010ca8 url=
> http://agent-svc.observability.svc.cluster.local:9201/write msg="Failed 
> to send batch, retrying" err="Post \"
> http://agent-svc.observability.svc.cluster.local:9201/write\": context 
> deadline exceeded"
>
> ts=2023-01-18T10:51:20.364Z caller=dedupe.go:112 component=remote 
> level=debug remote_name=010ca8 url=
> http://agent-svc.observability.svc.cluster.local:9201/write msg="Not 
> downsharding due to being too far behind"
>
> I am guessing that Prometheus is trying to tell me something, but I just 
> don't know what.
>
> Checking the TSDB status on the prom UI, it tells me that Number of series 
> is 8439, which does not sound like a lot.
> Any help would be much appreciated.
>
> Best regards
> Christian Oelsner
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/16e03bd3-eefb-4c68-87d4-a3d181315ab5n%40googlegroups.com.


[prometheus-users] alert rules: multiple rules for general & special cases

2023-01-18 Thread Mario Cornaccini
Hi,

For the same metric, I want to have multiple rules in Alertmanager, to 
have longer for: times for some special cases.

The way I do this now is:
alert1 # general
probe_success{somelabel!~"specialcase1|specialcase2"}
alert2 # special
probe_success{somelabel=~"specialcase1|specialcase2"}
... which is obviously hard to maintain, ugly, and won't scale.

I've seen this: 
https://www.robustperception.io/using-time-series-as-alert-thresholds/
but it looks a bit, well, hard to maintain.



So I got this idea. What would happen if I did this:

in Prometheus rules:
alert1 # handles the special case
i.e. probe_success{somelabel="XYZ"}
labels:
   someswitch: "true"

alert2 # handles the general case
probe_success{}

and in Alertmanager:
define an inhibit rule which mutes the general alert if there is also a 
special-case one, based on the someswitch label. Would that work?

any help/pointers/comments greatly appreciated,
cheers,
mario
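
A minimal sketch of such an inhibit rule, for reference (the alertname, the 
string form of the label value, and the matcher syntax are assumptions; 
*_matchers needs Alertmanager 0.22 or newer):

inhibit_rules:
  - source_matchers:
      - someswitch = "true"            # carried by the special-case alert
    target_matchers:
      - alertname = "GeneralProbeDown" # the general catch-all alert
    equal: ["instance"]                # only mute the general alert for the same target

Note the caveat raised elsewhere in this thread: if the special-case alert has 
a longer "for" than the general one, it may start firing too late to inhibit it.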

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/5fe28eec-a91c-4619-af41-128966ad08d9n%40googlegroups.com.


[prometheus-users] Prometheus metrics where change interval and scrape interval are quite different

2023-01-18 Thread Mark Selby
I am struggling with PromQL over an issue dealing with a metric that 
changes less frequently than the scrape interval. I am trying to use 
Prometheus as a pseudo event tracker and hoping to get some advice on how 
to best try and accomplish my goal.

I have a random job that runs at different intervals depending on the 
situation. Some instances of the job run every five minutes and some run 
only once an hour or once a day. The job creates a node_exporter textfile 
snippet that gets scraped on a 30 second interval.

Below is an example of a metric that changes only every five minutes with 
the shorter scrape interval. In this scenario all the points with the same 
value are from the same job run. I really only care about one of those.

I have no way to know what the interval is between sets for all my different 
jobs. All I know is that when the value changes, a new set is in play.

What I want to do is "reduce" my dataset to deal with only distinct values. 
I want to collapse these 27 entries into 3 by taking either the first or 
last value of each "set".

I cannot find a PromQL function/operator that does what I want. Maybe I 
need to use recording rules?

Any and all help is greatly appreciated.

metric_name{instance="hostname.example.net", job="external/generic", 
mode="pull", name="snafu"}

9973997301 @1673997343.774
9973997301 @1673997373.764
9973997301 @1673997403.764
9973997301 @1673997433.764
9973997301 @1673997463.764
9973997301 @1673997493.764
9973997301 @1673997523.764
9973997301 @1673997553.764
9973997301 @1673997583.764

9973997601 @1673997613.764
9973997601 @1673997643.764
9973997601 @1673997673.764
9973997601 @1673997703.774
9973997601 @1673997733.764
9973997601 @1673997763.764
9973997601 @1673997793.764
9973997601 @1673997823.764
9973997601 @1673997853.863

9973997901 @1673997913.764
9973997901 @1673997943.767
9973997901 @1673997973.764
9973997901 @1673998003.764
9973997901 @1673998033.764
9973997901 @1673998063.764
9973997901 @1673998093.764
9973997901 @1673998123.764
9973997901 @1673998153.764

I have tried many of the PromQL functions/operators to try and reduce my 
sets. The count_values() operator is the closest I have come, but that works 
only with instant vectors, not range vectors.
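
One partial workaround, sketched under the assumption that a new "set" starts 
exactly when the sample value changes: changes() counts value transitions over 
a range, so a recording rule evaluated at a coarser interval can track how many 
distinct runs occurred (the group, rule, and window below are illustrative):

groups:
  - name: distinct-job-runs
    interval: 5m                       # evaluate less often than the 30s scrape
    rules:
      - record: metric_name:distinct_runs:1h
        # changes() returns how many times the value changed inside the window;
        # adding 1 gives the number of distinct "sets" seen, provided consecutive
        # runs never repeat the same value.
        expr: changes(metric_name[1h]) + 1

last_over_time(metric_name[5m]) is another building block worth knowing here: 
it returns the most recent sample in each window, which can approximate taking 
the last value of each set when the window lines up with the run length.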

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/b3882dc1-32ea-42ff-9264-3dcbb72f662dn%40googlegroups.com.


[prometheus-users] Time series with change interval much less than scrape interval

2023-01-18 Thread Mark Selby
I am struggling with PromQL over an issue dealing with a metric that
changes less frequently than the scrape interval. I am trying to use
Prometheus as a pseudo event tracker and hoping to get some advice on
how to best try and accomplish my goal.

I have a random job that runs at different intervals depending on the
situation. Some instances of the job run every five minutes and some run
only once an hour or once a day. The job creates a node_exporter
textfile snippet that gets scraped on a 30 second interval.

Below is an example of a metric that changes only every five minutes with
the shorter scrape interval. In this scenario all the points with the same
value are from the same job run. I really only care about one of those.

I have no way to know what the interval is between sets for all my
different jobs. All I know is that when the value changes, a new set is
in play.

What I want to do is "reduce" my dataset to deal with only distinct
values. I want to collapse these 27 entries below into 3 by taking either
the first or last value of each "set".

I cannot find a PromQL function/operator that does what I want. Maybe I
need to use recording rules?

Any and all help is greatly appreciated.

metric_name{instance="hostname.example.net", job="external/generic", 
mode="pull", name="snafu"}

9973997301 @1673997343.774
9973997301 @1673997373.764
9973997301 @1673997403.764
9973997301 @1673997433.764
9973997301 @1673997463.764
9973997301 @1673997493.764
9973997301 @1673997523.764
9973997301 @1673997553.764
9973997301 @1673997583.764

9973997601 @1673997613.764
9973997601 @1673997643.764
9973997601 @1673997673.764
9973997601 @1673997703.774
9973997601 @1673997733.764
9973997601 @1673997763.764
9973997601 @1673997793.764
9973997601 @1673997823.764
9973997601 @1673997853.863

9973997901 @1673997913.764
9973997901 @1673997943.767
9973997901 @1673997973.764
9973997901 @1673998003.764
9973997901 @1673998033.764
9973997901 @1673998063.764
9973997901 @1673998093.764
9973997901 @1673998123.764
9973997901 @1673998153.764

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/99bec6df-8c64-4cb2-95e7-f7673418ce25n%40googlegroups.com.


[prometheus-users] Remote write dying after some time.

2023-01-18 Thread Christian Oelsner
Hello guys.

I am scraping some metrics which are then shipped off to an Elastic agent to 
be ingested into Elasticsearch. All seems fine to start with, but after 
some time, metrics stop coming in, and the Prometheus logs show a lot of 
entries like this:

ts=2023-01-18T10:51:46.125Z caller=dedupe.go:112 component=remote 
level=warn remote_name=010ca8 
url=http://agent-svc.observability.svc.cluster.local:9201/write msg="Failed 
to send batch, retrying" err="Post 
\"http://agent-svc.observability.svc.cluster.local:9201/write\": context 
deadline exceeded"

ts=2023-01-18T10:51:20.364Z caller=dedupe.go:112 component=remote 
level=debug remote_name=010ca8 
url=http://agent-svc.observability.svc.cluster.local:9201/write msg="Not 
downsharding due to being too far behind"

I am guessing that Prometheus is trying to tell me something, but I just 
don't know what.

Checking the TSDB status on the prom UI, it tells me that Number of series 
is 8439, which does not sound like a lot.
Any help would be much appreciated.

Best regards
Christian Oelsner

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/776fa548-b38e-47dc-bd54-b08cef7be93cn%40googlegroups.com.


Re: [prometheus-users] AlertManager rules examples

2023-01-18 Thread Stuart Clark
Grafana does have its own alerting solution, but that's not something to do 
with Prometheus. You'd need to ask the Grafana lists about how to do it with 
that option. 

On 17 January 2023 21:11:45 GMT, Eulogio Apelin  
wrote:
>Thanks for the info. It helps.
>
>It would be nice if there were examples on web pages or YouTube vids. We also 
>have Grafana, but it sounds like the engineers are trying to pick 
>Alertmanager over Grafana, as it currently is a mix and it's not 
>straightforward for us when configuring both. Mainly because we don't have a 
>dedicated person working on alerts. It tends to be the lower 10-20% on the 
>priority list for us, and other companies I've been with deal with this in 
>the same way. Just my 2 cents on this.
>
>The lazy in me just wants to click, click, click and be done.
>
>
>On Friday, January 13, 2023 at 1:53:14 AM UTC-10 Stuart Clark wrote:
>
>> On 11/01/2023 19:58, Eulogio Apelin wrote:
>> > I'm looking for information, primarily examples, of various ways to 
>> > configure alert rules.
>> >
>> > Specifically, scenarios like:
>> >
>> > In a single rule group:
>> > Regular expression that determines a TLS cert expires in 60 days, send 
>> > 1 alert
>> > Regular expression that determines a TLS cert expires in 40 days, send 
>> > 1 alert
>> > Regular expression that determines a TLS cert expires in 30 days, send 
>> > 1 alert
>> > Regular expression that determines a TLS cert expires in 20 days, send 
>> > 1 alert
>> > Regular expression that determines a TLS cert expires in 10 days, send 
>> > 1 alert
>> > Regular expression that determines a TLS cert expires in 5 days, send 
>> > 1 alert
>> > Regular expression that determines a TLS cert expires in 0 days, send 
>> > 1 alert
>> >
>> > Another scenario is to
>> > send an alert once a day to an email address.
>> > send an alert if it's the 3rd day in a row, send the alert to another 
>> > set of addresses, and stop alerting.
>> >
>> > Can Alertmanager send alerts to Teams like it does Slack?
>> >
>> > And any other general examples of Alertmanager rules.
>> >
>> I think it is best not to think of alerts as moment in time events but 
>> as being a time period where a certain condition is true. Separate to 
>> the actual alert firing are then rules (in Alertmanager) of how to route 
>> it (e.g. to Slack, email, etc.), what to send (email body template) and 
>> how often to remind people that the alert is happening.
>>
>> So for example with your TLS expiry example you might have an alert 
>> which starts firing once a certificate is within 60 days of expiry. It 
>> would continue to fire continuously until either the certificate is 
>> renewed (i.e. it is over 60 days again) or stops existing (because 
>> you've reconfigured Prometheus to no longer monitor that certificate). 
>> Then within Alertmanager you can set rules to send you a message every 
>> 10 days that alert is firing, meaning you'd get a message at 60, 50, 40, 
>> etc days until expiry.
>>
>> More complex alerting routing decisions are generally out of scope for 
>> Alertmanager and would be expected to be managed by a more complex 
>> system (such as PagerDuty, OpsGenie, Grafana On-Call, etc.). This would 
>> cover your example of wanting to escalate an alert after a period of 
>> time, but would also cover things like having on-call rotas where 
>> different people would be contacted by looking at a rota calendar.
>>
>> -- 
>> Stuart Clark
>>
>>
>
>-- 
>You received this message because you are subscribed to the Google Groups 
>"Prometheus Users" group.
>To unsubscribe from this group and stop receiving emails from it, send an 
>email to prometheus-users+unsubscr...@googlegroups.com.
>To view this discussion on the web visit 
>https://groups.google.com/d/msgid/prometheus-users/e0e6a8bd-3d65-4573-b524-7a9af1578e95n%40googlegroups.com.
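
To make the TLS example above concrete, a minimal sketch, assuming the expiry 
timestamp is exposed by the blackbox exporter as probe_ssl_earliest_cert_expiry 
(the alert name, threshold, and labels are illustrative):

groups:
  - name: tls-expiry
    rules:
      - alert: TLSCertExpiringSoon
        # Keeps firing for as long as the certificate is within 60 days of expiry.
        expr: probe_ssl_earliest_cert_expiry - time() < 60 * 24 * 3600
        labels:
          severity: warning
        annotations:
          summary: "Certificate on {{ $labels.instance }} expires in less than 60 days"

An Alertmanager route with repeat_interval: 240h would then re-send the 
notification roughly every 10 days for as long as the alert keeps firing.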

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/AE1AB352-B549-47C5-BEBE-4C3A9E0F881E%40Jahingo.com.