[prometheus-users] Need a query function which returns an actual increase in counter values within a time interval and handles counter reset cases

2022-10-17 Thread anantha sai ram


Is there a function we can use to get the difference between the values of 
counter samples within a time interval, while handling counter resets and 
without the extrapolation?

We tried using the increase() function, but it returns an extrapolated 
result. We have observed that the difference between the actual increase 
and the extrapolated result is considerably high.

*Example:*
If we are calculating the increase in the metric "node_disk_read_bytes_total" 
every 5 mins, with the Prometheus scrape interval set to 1 min:

   - Consider the following sample values for the metric 
   "node_disk_read_bytes_total" within a 5-min interval:
   [23758450955264 (F), 23758499419136, 23758518625280, 23758519292928, 
   23758519870464 (L)]

*Result of the increase function:*

   - Extrapolated value returned by increase function over 5 mins: 86144000

*Our requirement is a function which:*

   - Handles counter resets in the same way as the increase() function
   - Returns the actual difference between the values of the first 
   sample (F) and the last sample (L) within the specified interval: 68915200
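
For comparison, one approximate workaround we are aware of (a sketch only, not 
exactly what we need) subtracts an offset sample and falls back to the raw 
value after a reset; it uses the sample from ~5 minutes ago rather than the 
first sample inside the window, and it undercounts across a reset:

(node_disk_read_bytes_total - node_disk_read_bytes_total offset 5m) >= 0
  or
node_disk_read_bytes_total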

Thanks!

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/5272c37c-2089-4370-9c40-419c77ba0063n%40googlegroups.com.


[prometheus-users] Re: PromQL: multiple queries with dependent values

2022-10-17 Thread marc koser
Thanks for the pointer Brian.

From what you suggested, I updated my query to include `service` rather 
than `job` to cover the different values (representing either redis service 
on each `instance`); however, I'm still not getting the results I expect:

query: 
redis_cluster_known_nodes != on (instance, service, group) count by 
(instance, service, group) (up{service=~"exporter-redis-.*"})

result:
{group="group-a", instance="node-1", service="exporter-redis-6379"} 10
{group="group-a", instance="node-1", service="exporter-redis-6380"} 10
{group="group-a", instance="node-2", service="exporter-redis-6379"} 11
{group="group-a", instance="node-2", service="exporter-redis-6380"} 16
{group="group-a", instance="node-3", service="exporter-redis-6379"} 16
{group="group-a", instance="node-3", service="exporter-redis-6380"} 16
{group="group-a", instance="node-4", service="exporter-redis-6379"} 16
{group="group-a", instance="node-4", service="exporter-redis-6380"} 16
{group="group-a", instance="node-5", service="exporter-redis-6379"} 16
{group="group-a", instance="node-5", service="exporter-redis-6380"} 16

I would expect only those whose count is != 10 to be included in the result.


Here's a metric sample of those used in the query:
``` 
up{group="group-a", instance="node-1", job="redis-cluster", 
service="exporter-redis-6379", team="sre"} 1
up{group="group-a", instance="node-1", job="redis-cluster", 
service="exporter-redis-6380", team="sre"} 1
up{group="group-a", instance="node-2", job="redis-cluster", 
service="exporter-redis-6379"} 1
up{group="group-a", instance="node-2", job="redis-cluster", 
service="exporter-redis-6380"} 1
up{group="group-a", instance="node-3", job="redis-cluster", 
service="exporter-redis-6379"} 1
up{group="group-a", instance="node-3", job="redis-cluster", 
service="exporter-redis-6380"} 1
up{group="group-a", instance="node-4", job="redis-cluster", 
service="exporter-redis-6379"} 1
up{group="group-a", instance="node-4", job="redis-cluster", 
service="exporter-redis-6380"} 1
up{group="group-a", instance="node-5", job="redis-cluster", 
service="exporter-redis-6379"} 1
up{group="group-a", instance="node-5", job="redis-cluster", 
service="exporter-redis-6380"} 1

redis_cluster_known_nodes{group="group-a", instance="node-1", 
job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
redis_cluster_known_nodes{group="group-a", instance="node-1", 
job="redis-cluster", service="exporter-redis-6380", team="sre"} 10
redis_cluster_known_nodes{group="group-a", instance="node-2", 
job="redis-cluster", service="exporter-redis-6379"} 11
redis_cluster_known_nodes{group="group-a", instance="node-2", 
job="redis-cluster", service="exporter-redis-6380"} 16
redis_cluster_known_nodes{group="group-a", instance="node-3", 
job="redis-cluster", service="exporter-redis-6379"} 16
redis_cluster_known_nodes{group="group-a", instance="node-3", 
job="redis-cluster", service="exporter-redis-6380"} 16
redis_cluster_known_nodes{group="group-a", instance="node-4", 
job="redis-cluster", service="exporter-redis-6379"} 16
redis_cluster_known_nodes{group="group-a", instance="node-4", 
job="redis-cluster", service="exporter-redis-6380"} 16
redis_cluster_known_nodes{group="group-a", instance="node-5", 
job="redis-cluster", service="exporter-redis-6379"} 16
redis_cluster_known_nodes{group="group-a", instance="node-5", 
job="redis-cluster", service="exporter-redis-6380"} 16
```
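
If it helps to show what I'm after, a variant I'm considering (a sketch, 
untested) compares each exporter's reported node count against the number of 
redis exporters that are up in the same group:

```
redis_cluster_known_nodes
  != on (group) group_left
count by (group) (up{service=~"exporter-redis-.*"} == 1)
```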
On Thursday, October 13, 2022 at 9:17:55 AM UTC-4 Brian Candler wrote:

> Sorry, second to last sentence was unclear.  What I meant was:
>
>
> *If the LHS vector contains N metrics with a particular value of the 
> "group" label, which correspond to exactly 1 metric on the RHS with the 
> matching label value, or vice versa, then you can use N:1 matching.*
> On Thursday, 13 October 2022 at 14:13:42 UTC+1 Brian Candler wrote:
>
>> > Is it possible to have one side of a query limit the results of another 
>> part of the same query?
>>
>> Yes, but it depends on exactly what you mean. The details are here:
>> https://prometheus.io/docs/prometheus/latest/querying/operators/
>> It depends on whether you can construct vectors for the LHS and RHS which 
>> have corresponding labels.
>>
>> If you can give some specific examples of the metrics themselves - 
>> including all their labels - then we can see whether it's possible to do 
>> what you want in PromQL.  Right now the requirements are unclear.
>>
>>
>> *> redis_cluster_known_nodes != 
>> scalar(count(up{service=~"redis-exporter"}))*
>> > 
>> > The shared label value would be something like, *group="cluster-a" *and 
>> should not evaluate metrics where *group="cluster-b"*
>>
>> You need to arrange both LHS and RHS to have some corresponding labels 
>> before you can combine them with any operator such as !=.  The RHS has no 
>> "group" label at the moment, in fact it's not even a vector, but you could 
>> do:
>>
>> count by (group) (up{service="redis-exporter"})
>>
>> Then, assuming that 

[prometheus-users] Re: how to predict a date in a future on a threshold from the predict_linear function

2022-10-17 Thread Brian Candler
Spotted typo:

expr: |
node_filesystem_avail_bytes / (node_filesystem_avail_bytes -

(predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[12h], 
604800) < 0)) * 604800

And of course, I was wrong to say that disk_usage_percentage trends down to 
zero - it will presumably trend up to 100 as the disk fills.
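
For reference, here is a worked example of the ratio logic in the quoted rule 
below, using assumed numbers (not taken from the thread):

# V1 = node_filesystem_avail_bytes now                   = 100 GiB
# V2 = predict_linear(V[12h], 604800), kept only if < 0  = -50 GiB
# The alert expression evaluates to
#   V1 / (V1 - V2) * 604800
#     = 100 / (100 - (-50)) * 604800
#     = 100 / 150 * 604800
#     ≈ 403200 seconds ≈ 4.7 days until the filesystem is predicted to be full.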

On Thursday, 18 August 2022 at 10:11:57 UTC+1 Brian Candler wrote:

> I had a very similar requirement.  It was a tricky query to build from 
> scratch, but simple when you've worked it out, so I'm happy to share :-)
>
> - name: DiskRate12h
>   interval: 1h
>   rules:
>   # Warn if rate of growth over last 12 hours means filesystem will fill 
> in 7 days
>   - alert: DiskFilling7d
> expr: |
> node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
> 
> (predict_linear(node_filesystem_avail_bytes,fstype!~"fuse.*|nfs.*"}[12h], 
> 604800) < 0)) * 604800
> for: 24h
> labels:
>   severity: warning
> annotations:
>   summary: 'Filesystem will be full in {{ $value | humanizeDuration }} 
> at current 12h growth rate'
>
> I'm using "node_filesystem_avail_bytes" rather than 
> "disk_usage_percentage", but as they both trend down to zero, you should be 
> able to replace it.  Replace the time periods as appropriate.
>
> The logic goes something like this: say V is the variable you're 
> interested in (node_filesystem_avail_bytes in this case)
>
> * we take the current value of V; call it V1
> * predict_linear(V[12h], 604800) is the expected value in 7 days time 
> based on the trend over the last 12 hours; call it V2
> * filter that with < 0, so we get no value unless it's predicted to be 
> below 0 in 7 days
>
>  ^  V1
>  |\
>  | \
>  +--0---x--7> time
>  \
>   V2
>
> To find where the cut is on the time axis, you note that V1 is to (V1 + 
> (-V2)) as x is to 7 days.  That is, V1/(V1-V2) is the ratio of the lines 
> V1...x and V1...V2.  And therefore that's also the fraction of 604800 
> seconds to the zero crossing point x.
>
> Your problem is slightly different: you want to know when the free space 
> percentage will fall below 20, not when it falls below zero.  I'll leave 
> that as an exercise :-)  I think just substituting 
> (disk_space_percentage-20) everywhere in place of the variable is a good 
> starting point, but you have to be careful what happens if the current 
> value is already below 20.
>
> HTH,
>
> Brian.
>
> On Thursday, 18 August 2022 at 06:31:08 UTC+1 jer...@loyet.net wrote:
>
>> Hello all,
>>
>> I have the percentage of disk usage on a metric. I can use 
>> predict_linear(disk_usage_percentage[30d], 30*24*60*60) to give me a 
>> prediction in 1 month from the past month of metrics. fine
>>
>> but how could I retrieve the date on which the predict_linear function 
>> will reach 80% for instance ? if that's possible :-)
>>
>> Thank you 
>>
>> regards
>> ++ Jerome
>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/bcf02d00-bc79-4217-8cd0-b3b825a3e67dn%40googlegroups.com.


[prometheus-users] Re: PromQL

2022-10-17 Thread Brian Candler
The expression you've written doesn't really make much sense.  If you have 
a metric "disk_used_percent", which runs between 0 and 100 (presumably), 
why are you summing it by host?  This means that if one host had three 
disks, each 40% used, that the result would be "120% used" and trigger an 
alert unnecessarily.

I would expect the expression to be simply:

expr: disk_used_percent > 85

> Now for that I need to create 2 rules, one for each severity. My question is: 
can we create one query for both severities, like a range of 85-95 for warning 
and 95 and up for critical? 

No, you were right the first time: you need one rule for 85%+ and one for 
95%+

You can then use inhibit rules in Alertmanager so that if the 95%+ alert is 
firing, it inhibits sending the 85%+ one.  To do this you'll need to add 
labels to your alerts, and set up the inhibit rules appropriately.
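
As a minimal sketch (assuming your alerts carry a "severity" label and share 
"alertname" and "instance", and using the matcher syntax of recent Alertmanager 
releases):

inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname', 'instance']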

Personally though, I find such rules difficult to maintain and irritating. 
Suppose you have one machine which is sitting at 88% disk full, but is 
working perfectly normally.  Do you want it to be continuously alerting?  
Suppose you've already done all the free space tidying you can.  Are you 
*really* going to add more disk space to this machine, just to bring the 
usage under 85% to silence the alert?  Probably not (unless it's a VM and 
can be grown easily). However, once you start to accept continuously firing 
alerts, then you'll find that everyone ignores them, and then *real* 
problems get lost amongst the noise.

You might decide you want to have different thresholds for each 
filesystem.  But then either you end up with lots of alerting rules, or you 
need to put the thresholds in their own timeseries, as described here:
https://www.robustperception.io/using-time-series-as-alert-thresholds
- and this is a pain to maintain.
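
A sketch of that approach, assuming you maintain a series such as 
"disk_threshold_percent" with labels matching your disk metric (the metric and 
label names here are purely illustrative):

disk_used_percent
  > on (host_name, mountpoint) group_left
disk_threshold_percent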

Personally, I've ditched all static alerting thresholds on disk space.  
Instead I have rules for when the filesystem is completely full(*), plus 
rules which look at how fast the filesystem is growing, and predict when 
they will be full if they continue to grow at the current rate.  Examples:

- name: DiskRate10m
  interval: 1m
  rules:
  # Warn if rate of growth over last 10 minutes means filesystem will fill 
in 2 hours
  - alert: DiskFilling10m
expr: |
node_filesystem_avail_bytes / (node_filesystem_avail_bytes -

(predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[10m], 
7200) < 0)) * 7200
for: 20m
labels:
  severity: warning
annotations:
  summary: 'Filesystem will be full in {{ $value | humanizeDuration }} 
at current 10m growth rate'

- name: DiskRate3h
  interval: 10m
  rules:
  # Warn if rate of growth over last 3 hours means filesystem will fill in 
2 days
  - alert: DiskFilling3h
expr: |
node_filesystem_avail_bytes / (node_filesystem_avail_bytes -

(predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 
172800) < 0)) * 172800
for: 6h
labels:
  severity: warning
annotations:
  summary: 'Filesystem will be full in {{ $value | humanizeDuration }} 
at current 3h growth rate'

- name: DiskRate12h
  interval: 1h
  rules:
  # Warn if rate of growth over last 12 hours means filesystem will fill in 
7 days
  - alert: DiskFilling12h
expr: |
node_filesystem_avail_bytes / (node_filesystem_avail_bytes -

(predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[12h], 
604800) < 0)) * 604800
for: 24h
labels:
  severity: warning
annotations:
  summary: 'Filesystem will be full in {{ $value | humanizeDuration }} 
at current 12h growth rate'


For an explanation of how these rules work 
see https://groups.google.com/g/prometheus-users/c/PCT4MJjFFgI/m/kVfOW069BQAJ

(*) In practice I also alert at *just below* full, e.g.

- name: DiskSpace
  interval: 1m
  rules:
  # Alert if any filesystem has less than 100MB available space (except for 
filesystems which are smaller than 150MB)
  - alert: DiskFull
expr: |
  node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"} < 100000000 
unless node_filesystem_size_bytes{fstype!~"fuse.*|nfs.*"} < 150000000
for: 10m
labels:
  severity: critical
annotations:
  summary: 'Filesystem full or less than 100MB free space'

I find this helpful for /boot partitions where if they do get completely 
full with partially-installed kernel updates, it's tricky to fix.  But I 
still wouldn't "alert" in the sense of getting someone out of bed at 3am - 
unless the system is failing in a way that your users or customers would 
notice (which is something you should be checking and alerting on 
separately), this is something that can be fixed at leisure.

Finally, I can strongly recommend this "philosophy on alerting":
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/
You might want to consider whether some of these 

[prometheus-users] PromQL

2022-10-17 Thread ritesh patel
Hello Team,

I need your help with a PromQL query.

I want to create an alert rule in Prometheus/Alertmanager, and I have this
Prometheus query:

Expr: sum by (host_name) (disk_used_percent) > 85 for the warning ⚠️ level, and
the same query with a threshold of > 95 for the critical level.

Now for that I need to create 2 rules, one for each severity. My question is:
can we create one query for both severities, like a range of 85-95 for warning
and 95 and up for critical? If yes, can someone share one sample query as an
example, so I can set up the alert rule using a single query?

Thanks and regards
Ritesh patel

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CAPxUNF97HMKwLYmu3qEksLwY9--S5gM7T_AUP0NR4k3%2BVH%2Bgpw%40mail.gmail.com.


Re: [prometheus-users] Use Case Assessment

2022-10-17 Thread Rishabh Arora
Thank you for this perspective. We're currently looking at other systems 
for our more granular, functional monitoring needs, or perhaps the idea of 
building one which caters to our requirements.



On Monday, 17 October 2022 at 14:18:46 UTC+5:30 sayf.eddi...@gmail.com 
wrote:

> Hello,
> Monitoring the health of the system with Prometheus is fine, but I think 
> you are trying to include it as a functional brick in the application, 
> which I am not very keen on. IMO the monitoring system should not be coupled 
> with the functioning of your system (i.e. your system should continue to 
> work fine if Prometheus is down, for example).
> You need something else, like issuing events and alerting on them (there you 
> are free to focus on the paymentID info).
>
> On Monday, October 17, 2022 at 9:35:17 AM UTC+2 Rishabh Arora wrote:
>
>> Thank you for the clarification, Stuart.
>>
>> On Monday, 17 October 2022 at 12:50:57 UTC+5:30 Stuart Clark wrote:
>>
>>> On 17/10/2022 07:26, Rishabh Arora wrote:
>>>
>>> Hello!
>>>
>>> I'm currently in the process of implementing Prometheus along with 
>>> Alertmanager as our de facto solution for node health monitoring. We have a 
>>> kubernetes, kafka, mqtt setup and for monitoring our infrastructure, 
>>> prometheus is an obvious good fit.
>>>
>>> We have an application / business case, where I'm wondering whether 
>>> Prometheus may be a reasonable solution. Our application needs to meet 
>>> certain SLAs. In case those SLAs are not being met, some alerts need to 
>>> fire. For example, consider the following case which bears close 
>>> resemblance to our real business case:
>>>
>>> An *Order* schema in our system has a *payment* field which can be one 
>>> of ['COMPLETED','FAILED','PENDING']. In our HA real time system, we need to 
>>> fire alerts for Orders which are in a PENDING state. Rows in our 
>>> *Orders* collection will be in the order of potentially millions. An 
>>> order also has a *paymentEngine* field, which represents the entity 
>>> responsible for processing the payment for the order.
>>>
>>> Now, with Prometheus, finding the total count of PENDING Orders would be 
>>> a simple metric, but what we're interested in is also the Order IDs. For 
>>> instance, is there a way I could capture the PENDING order IDs in the 
>>> "metadata"(???) or "payload" of the metric? Downstream in the alertmanager, 
>>> I'd also like to group by *paymentEngine* so I could potentially 
>>> inhibit alerts for an unstable engine.
>>>
>>> Can anyone please help me out? Apologies in advance for my naivety :)
>>>
>>> What you are asking for isn't really the job of Prometheus.
>>>
>>> Having a metric detailing the number of pending orders & alerting on 
>>> that is completely within the normal area for Prometheus & Alertmanager - 
>>> observing the system and alerting if there are issues that need 
>>> investigation. However the next step of dealing with the individual 
>>> events/orders is the job for a different system. If paymentEngine could be 
>>> a small number of options (e.g. PayPal, Swipe, Cash) then it would be 
>>> reasonable to have that as a label to the pending orders metric (which then 
>>> would allow you to alert if one method stops working), but order ID isn't 
>>> something you should ever put in the metrics. Instead once you were alerted 
>>> about a potential issue you might query your order database directly or 
>>> look at log files to dig into the detail and figure out what is happening.
>>>
>>> -- 
>>> Stuart Clark
>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c93af5f4-4335-4a71-9bf6-d5e7032a0074n%40googlegroups.com.


Re: [prometheus-users] Use Case Assessment

2022-10-17 Thread sayf.eddi...@gmail.com
Hello,
Monitoring the health of the system with Prometheus is fine, but I think 
you are trying to include it as a functional brick in the application, 
which I am not very keen on. IMO the monitoring system should not be coupled 
with the functioning of your system (i.e. your system should continue to 
work fine if Prometheus is down, for example).
You need something else, like issuing events and alerting on them (there you 
are free to focus on the paymentID info).

On Monday, October 17, 2022 at 9:35:17 AM UTC+2 Rishabh Arora wrote:

> Thank you for the clarification, Stuart.
>
> On Monday, 17 October 2022 at 12:50:57 UTC+5:30 Stuart Clark wrote:
>
>> On 17/10/2022 07:26, Rishabh Arora wrote:
>>
>> Hello!
>>
>> I'm currently in the process of implementing Prometheus along with 
>> Alertmanager as our de facto solution for node health monitoring. We have a 
>> kubernetes, kafka, mqtt setup and for monitoring our infrastructure, 
>> prometheus is an obvious good fit.
>>
>> We have an application / business case, where I'm wondering whether 
>> Prometheus may be a reasonable solution. Our application needs to meet 
>> certain SLAs. In case those SLAs are not being met, some alerts need to 
>> fire. For example, consider the following case which bears close 
>> resemblance to our real business case:
>>
>> An *Order* schema in our system has a *payment* field which can be one 
>> of ['COMPLETED','FAILED','PENDING']. In our HA real time system, we need to 
>> fire alerts for Orders which are in a PENDING state. Rows in our *Orders* 
>> collection 
>> will be in the order of potentially millions. An order also has a 
>> *paymentEngine* field, which represents the entity responsible for 
>> processing the payment for the order.
>>
>> Now, with Prometheus, finding the total count of PENDING Orders would be 
>> a simple metric, but what we're interested in is also the Order IDs. For 
>> instance, is there a way I could capture the PENDING order IDs in the 
>> "metadata"(???) or "payload" of the metric? Downstream in the alertmanager, 
>> I'd also like to group by *paymentEngine* so I could potentially inhibit 
>> alerts for an unstable engine.
>>
>> Can anyone please help me out? Apologies in advance for my naivety :)
>>
>> What you are asking for isn't really the job of Prometheus.
>>
>> Having a metric detailing the number of pending orders & alerting on that 
>> is completely within the normal area for Prometheus & Alertmanager - 
>> observing the system and alerting if there are issues that need 
>> investigation. However the next step of dealing with the individual 
>> events/orders is the job for a different system. If paymentEngine could be 
>> a small number of options (e.g. PayPal, Swipe, Cash) then it would be 
>> reasonable to have that as a label to the pending orders metric (which then 
>> would allow you to alert if one method stops working), but order ID isn't 
>> something you should ever put in the metrics. Instead once you were alerted 
>> about a potential issue you might query your order database directly or 
>> look at log files to dig into the detail and figure out what is happening.
>>
>> -- 
>> Stuart Clark
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/91f5d8c2-d1f4-42df-bf9e-ccfae3f6d9b0n%40googlegroups.com.


[prometheus-users] Re: snmp_exporter drop metrics by value or label value

2022-10-17 Thread Brian Candler
Duplicate of
https://groups.google.com/g/prometheus-users/c/6X9Hc6jH1v4

On Monday, 17 October 2022 at 08:01:27 UTC+1 yngwi...@gmail.com wrote:

> Hi everyone. I want to drop some specific metrics by its value or label 
> value, for example:
>
> 1. the temperature metrics which values are 65535 meaning it's invalid
> 2. the power metrics which "entPhysicalClass" label value are not "6" and 
> "9"
>
> How can I write the configuration? I write the "snmp.yml" without 
> generator, where can I find the specification of its syntax?
>
> Appreciate for any help :)
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/cfbf0adf-4197-4880-8cdf-4002c63609c7n%40googlegroups.com.


[prometheus-users] Re: snmp_exporter drop metrics by metric value or label value

2022-10-17 Thread Brian Candler
> Hi, everyone. I want to drop some specific metrics by its value or its 
label value. For example:
> 1. the temperature metrics which values are 65535 meaning it's invalid
> 2. the power metrics which "entPhysicalClass" label value are not "6" and 
"9"

Metric relabeling can be used to drop specific timeseries in the scrape 
response by label value, but not by metric value.
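
For instance, a sketch of a metric_relabel_configs entry for your second case 
(the metric name "entPhysicalPower" is only a placeholder - adjust it to 
whatever your snmp.yml actually emits):

metric_relabel_configs:
  - source_labels: [__name__, entPhysicalClass]
    separator: ';'
    regex: 'entPhysicalPower;(?:[0-578]|1[0-9])'  # classes other than 6 and 9
    action: drop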

The only way I can think of dropping by metric value (without changing the 
exporter output) is to use a recording rule to make a modified version of the 
timeseries, e.g.

expr: some_temperature != 65535
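
Packaged as a recording rule, that might look like this (a sketch; the group 
and record names are just examples):

groups:
  - name: snmp_filtered
    rules:
      - record: some_temperature:valid
        expr: some_temperature != 65535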

If you are happy to hack snmp.yml, you could try something like this 
(untested):

regex_extracts:
  "":
  - value: NaN
regex: ^65535$
  - value: $1
regex: ^(.+)$

I don't know if it's allowed to use "NaN" as a value here: the source code 
suggests it should work. However, a time series consisting of NaNs is not the 
same as an empty/missing timeseries. So depending on your requirements, it may 
be better to do

regex_extracts:
  "":
  - value: INVALID
regex: ^65535$
  - value: $1
regex: ^(.+)$

although this will cause snmp_exporter to generate noisy logs at debug 
level.

Aside: if you look through the examples you can see regex being used to 
divide a value by 10 (or by 100), e.g.:

regex_extracts:
  "":
  - value: $1.$2
regex: ^(?:(.*)(.))$

> I write the snmp.yml without generator, didn't find a specification of 
its syntax, does somebody know where it is?

https://github.com/prometheus/snmp_exporter/blob/v0.20.0/generator/FORMAT.md

Having said that, you may just want to run the generator and look at its 
output to see what it emits :-)

On Monday, 17 October 2022 at 08:01:08 UTC+1 yngwi...@gmail.com wrote:

> Hi, everyone. I want to drop some specific metrics by its value or its 
> label value. For example:
> 1. the temperature metrics which values are 65535 meaning it's invalid
> 2. the power metrics which "entPhysicalClass" label value are not "6" and 
> "9"
>
> I write the snmp.yml without generator, didn't find a specification of its 
> syntax, does somebody know where it is?
>
> Appreciate for any help :)
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/0a4a3423-487e-423b-8ab4-02096a49ace9n%40googlegroups.com.


[prometheus-users] Re: mute_time_intervals over night: start time cannot be equal or greater than end time

2022-10-17 Thread Brian Candler
Try this:



times:
  - start_time: '22:00'
    end_time: '24:00'
  - start_time: '00:00'
    end_time: '06:00'
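
i.e. in your snippet, as a sketch (assuming the interval definition is named 
"nighttime"; on older Alertmanager versions the top-level key is 
mute_time_intervals: instead of time_intervals:):

time_intervals:
  - name: nighttime
    time_intervals:
      - times:
          - start_time: '22:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '06:00'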

On Monday, 17 October 2022 at 08:00:41 UTC+1 smaxxx 1337 wrote:

> Hello,
>
> I'm trying to implement mute_time_intervals for muting alerts overnight 
> (22:00 UTC - 06:00 UTC), but when trying to do so I receive the error "*start 
> time cannot be equal or greater than end time*"
>
> Question is, how do I implement this so the alerts are muted over night? 
> Configuration snippet:
>
>
>
>
>
>   - name: nighttime
>     time_intervals:
>       - months: ['january', 'february', 'march', 'april', 'may', 'june',
>                  'july', 'august', 'september', 'october', 'november', 'december']
>         times:
>           - start_time: '22:00'
>             end_time: '06:00'
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6ad30ba3-9721-410b-bf49-b70613778dd9n%40googlegroups.com.


Re: [prometheus-users] Use Case Assessment

2022-10-17 Thread Rishabh Arora
Thank you for the clarification, Stuart.

On Monday, 17 October 2022 at 12:50:57 UTC+5:30 Stuart Clark wrote:

> On 17/10/2022 07:26, Rishabh Arora wrote:
>
> Hello!
>
> I'm currently in the process of implementing Prometheus along with 
> Alertmanager as our de facto solution for node health monitoring. We have a 
> kubernetes, kafka, mqtt setup and for monitoring our infrastructure, 
> prometheus is an obvious good fit.
>
> We have an application / business case, where I'm wondering whether 
> Prometheus may be a reasonable solution. Our application needs to meet 
> certain SLAs. In case those SLAs are not being met, some alerts need to 
> fire. For example, consider the following case which bears close 
> resemblance to our real business case:
>
> An *Order* schema in our system has a *payment* field which can be one of 
> ['COMPLETED','FAILED','PENDING']. In our HA real time system, we need to 
> fire alerts for Orders which are in a PENDING state. Rows in our *Orders* 
> collection 
> will be in the order of potentially millions. An order also has a 
> *paymentEngine* field, which represents the entity responsible for 
> processing the payment for the order.
>
> Now, with Prometheus, finding the total count of PENDING Orders would be a 
> simple metric, but what we're interested in is also the Order IDs. For 
> instance, is there a way I could capture the PENDING order IDs in the 
> "metadata"(???) or "payload" of the metric? Downstream in the alertmanager, 
> I'd also like to group by *paymentEngine* so I could potentially inhibit 
> alerts for an unstable engine.
>
> Can anyone please help me out? Apologies in advance for my naivety :)
>
> What you are asking for isn't really the job of Prometheus.
>
> Having a metric detailing the number of pending orders & alerting on that 
> is completely within the normal area for Prometheus & Alertmanager - 
> observing the system and alerting if there are issues that need 
> investigation. However the next step of dealing with the individual 
> events/orders is the job for a different system. If paymentEngine could be 
> a small number of options (e.g. PayPal, Swipe, Cash) then it would be 
> reasonable to have that as a label to the pending orders metric (which then 
> would allow you to alert if one method stops working), but order ID isn't 
> something you should ever put in the metrics. Instead once you were alerted 
> about a potential issue you might query your order database directly or 
> look at log files to dig into the detail and figure out what is happening.
>
> -- 
> Stuart Clark
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/a45ef22c-d1e4-4155-aede-e5c2cae8d696n%40googlegroups.com.


Re: [prometheus-users] Use Case Assessment

2022-10-17 Thread Stuart Clark

On 17/10/2022 07:26, Rishabh Arora wrote:

Hello!

I'm currently in the process of implementing Prometheus along with 
Alertmanager as our de facto solution for node health monitoring. We 
have a kubernetes, kafka, mqtt setup and for monitoring our 
infrastructure, prometheus is an obvious good fit.


We have an application / business case, where I'm wondering whether 
Prometheus may be a reasonable solution. Our application needs to meet 
certain SLAs. In case those SLAs are not being met, some alerts need to 
fire. For example, consider the following case which bears close 
resemblance to our real business case:


An /Order/ schema in our system has a /payment/ field which can be one 
of ['COMPLETED','FAILED','PENDING']. In our HA real time system, we 
need to fire alerts for Orders which are in a PENDING state. Rows in 
our /Orders/ collection will be in the order of potentially millions. 
An order also has a /paymentEngine/ field, which represents the entity 
responsible for processing the payment for the order.


Now, with Prometheus, finding the total count of PENDING Orders would 
be a simple metric, but what we're interested in is also the Order 
IDs. For instance, is there a way I could capture the PENDING order 
IDs in the "metadata"(???) or "payload" of the metric? Downstream in 
the alertmanager, I'd also like to group by /paymentEngine/ so I 
could potentially inhibit alerts for an unstable engine.


Can anyone please help me out? Apologies in advance for my naivety :)


What you are asking for isn't really the job of Prometheus.

Having a metric detailing the number of pending orders & alerting on 
that is completely within the normal area for Prometheus & Alertmanager 
- observing the system and alerting if there are issues that need 
investigation. However the next step of dealing with the individual 
events/orders is the job for a different system. If paymentEngine could 
be a small number of options (e.g. PayPal, Swipe, Cash) then it would be 
reasonable to have that as a label to the pending orders metric (which 
then would allow you to alert if one method stops working), but order ID 
isn't something you should ever put in the metrics. Instead once you 
were alerted about a potential issue you might query your order database 
directly or look at log files to dig into the detail and figure out what 
is happening.
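
As a sketch, such a metric might be exposed along these lines (the metric 
name and numbers are invented for illustration):

orders_pending{paymentEngine="paypal"} 132
orders_pending{paymentEngine="swipe"} 7
orders_pending{paymentEngine="cash"} 0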


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/43479ddc-5970-194e-4779-97b6fc6e1e32%40Jahingo.com.


[prometheus-users] Re: Kube-State-Metrics not being scraped by Prometheus

2022-10-17 Thread Brian Candler
> When I do:
> curl stateMetricsIP:8080/metrics
> it displays all the state-metrics. But these metrics are not present 
within the /metrics endpoint of Prometheus.

That's correct and expected.  The /metrics endpoint of prometheus does 
*not* show all scraped metrics; it just returns metrics about prometheus 
itself, such as the status of the timeseries database.  The idea is so that 
you can get prometheus to scrape itself to get a history of the TSDB status.

If you want to check whether the kube-state-metrics have been ingested, you 
need to send a PromQL query to prometheus, either using the built-in PromQL 
query browser (the "Graph" tab in the web interface), or using the HTTP 
query API.
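
For example (a sketch; substitute your own Prometheus host/port, and note that 
kube_pod_info is just one well-known kube-state-metrics series):

curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=kube_pod_info{job="kube-state-metrics"}'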

If you don't know what the metric names are, then you can do
{__name__=~".+",job="kube-state-metrics"}
although I would never recommend that on a production system because it 
could return many thousands of results.

On Monday, 17 October 2022 at 08:00:09 UTC+1 rani...@gmail.com wrote:

> Hey guys my issue here is as stated, the kube-state-metrics are not being 
> scraped by Prometheus, therefore my Grafana dashboards tell me that I have 
> no data when displaying my pod metrics.
> When I do:
> curl stateMetricsIP:8080/metrics
> it displays all the state-metrics. But these metrics are not present 
> within the /metrics endpoint of Prometheus.
> I suspect that the problem is within the prometheus.yaml 
> kube-state-metrics job but I'm not sure.
> My configuration on Prometheus side for the job is
>   
> - job_name: 'kube-state-metrics'
>   honor_timestamps: true
>   scrape_interval: 5s
>   scrape_timeout: 5s
>   metrics_path: /metrics
>   scheme: http
>   static_configs:
>   - targets:
> - kube-state-metrics.ops.svc.cluster.local:8080
>
> Am I missing something here? Both my services are working fine, prometheus 
> gives me the 'OK' for kube-state-metrics target as well. 
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/1336567a-7cc0-40d9-b002-dbda0a4ddd53n%40googlegroups.com.


[prometheus-users] Use Case Assessment

2022-10-17 Thread Rishabh Arora
Hello!

I'm currently in the process of implementing Prometheus along with 
Alertmanager as our de facto solution for node health monitoring. We have a 
kubernetes, kafka, mqtt setup and for monitoring our infrastructure, 
prometheus is an obvious good fit.

We have an application / business case, where I'm wondering whether 
Prometheus may be a reasonable solution. Our application needs to meet 
certain SLAs. In case those SLAs are not being met, some alerts need to 
fire. For example, consider the following case which bears close 
resemblance to our real business case:

An *Order* schema in our system has a *payment* field which can be one of 
['COMPLETED','FAILED','PENDING']. In our HA real time system, we need to 
fire alerts for Orders which are in a PENDING state. Rows in our *Orders* 
collection 
will be in the order of potentially millions. An order also has a 
*paymentEngine* field, which represents the entity responsible for 
processing the payment for the order.

Now, with Prometheus, finding the total count of PENDING Orders would be a 
simple metric, but what we're interested in is also the Order IDs. For 
instance, is there a way I could capture the PENDING order IDs in the 
"metadata"(???) or "payload" of the metric? Downstream in the alertmanager, 
I'd also like to group by *paymentEngine* so I could potentially inhibit 
alerts for an unstable engine.

Can anyone please help me out? Apologies in advance for my naivety :)

Best,

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/dd57c63c-5e33-4103-9d3b-7968b26a4a59n%40googlegroups.com.


[prometheus-users] snmp_exporter drop metrics by value or label value

2022-10-17 Thread Wang Yngwie
Hi everyone. I want to drop some specific metrics by its value or label 
value, for example:

1. the temperature metrics which values are 65535 meaning it's invalid
2. the power metrics which "entPhysicalClass" label value are not "6" and 
"9"

How can I write the configuration? I write the "snmp.yml" without 
generator, where can I find the specification of its syntax?

Appreciate for any help :)

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2980ec27-194f-4428-b96d-0543d4fc6647n%40googlegroups.com.


[prometheus-users] snmp_exporter drop metrics by metric value or label value

2022-10-17 Thread Wang Yngwie
Hi, everyone. I want to drop some specific metrics by its value or its 
label value. For example:
1. the temperature metrics which values are 65535 meaning it's invalid
2. the power metrics which "entPhysicalClass" label value are not "6" and "9"

I write the snmp.yml without generator, didn't find a specification of its 
syntax, does somebody know where it is?

Appreciate for any help :)

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/e4de8483-e208-4af6-a7d0-1dbe44ddd8a4n%40googlegroups.com.


[prometheus-users] mute_time_intervals over night: start time cannot be equal or greater than end time

2022-10-17 Thread 'smaxxx 1337' via Prometheus Users
Hello,

I'm trying to implement mute_time_intervals for muting alerts overnight 
(22:00 UTC - 06:00 UTC), but when trying to do so I receive the error "*start 
time cannot be equal or greater than end time*"

Question is, how do I implement this so the alerts are muted over night? 
Configuration snippet:





  - name: nighttime
    time_intervals:
      - months: ['january', 'february', 'march', 'april', 'may', 'june',
                 'july', 'august', 'september', 'october', 'november', 'december']
        times:
          - start_time: '22:00'
            end_time: '06:00'

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/85d550bc-18fb-417d-be00-81ff284b2745n%40googlegroups.com.


[prometheus-users] Kube-State-Metrics not being scraped by Prometheus

2022-10-17 Thread Ranindu
Hey guys my issue here is as stated, the kube-state-metrics are not being 
scraped by Prometheus, therefore my Grafana dashboards tell me that I have 
no data when displaying my pod metrics.
When I do:
curl stateMetricsIP:8080/metrics
it displays all the state-metrics. But these metrics are not present within 
the /metrics endpoint of Prometheus.
I suspect that the problem is within the prometheus.yaml kube-state-metrics 
job but I'm not sure.
My configuration on Prometheus side for the job is
  
- job_name: 'kube-state-metrics'
  honor_timestamps: true
  scrape_interval: 5s
  scrape_timeout: 5s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
- kube-state-metrics.ops.svc.cluster.local:8080

Am I missing something here? Both my services are working fine, prometheus 
gives me the 'OK' for kube-state-metrics target as well. 

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2fb48c35-ca3d-4d1d-9e40-faf760e49e4bn%40googlegroups.com.