Re: [prometheus-users] How to specify Tags for timeSeries that represent url with parameters

2021-01-29 Thread Debashish Ghosh
Yes, the label values will contain the real values. But the number of time 
series will be limited, since only a few configured values will be used to 
build the URL.

Sent from my iPhone

> On Jan 29, 2021, at 2:42 PM, Julius Volz  wrote:
> 
> 
> Are you saying that the time series itself only has those placeholders in the 
> "uri" label value? Or does it actually have all the real values for the 
> placeholders in there (which may end up being a lot of time series then)?
> 
>> On Fri, Jan 29, 2021 at 6:05 PM Debashish Ghosh 
>>  wrote:
>> I have a time series (HTTP requests) that has the URI as a tag, and the URI 
>> has parameters. How do I specify that in a Prometheus query? I am using the 
>> tag as follows:
>> http_server_requests_seconds_bucket{le="1.0",uri="/interopapi/v1/org/{orgId}/facility/{facility}/api/FHIR/{fhirVersion}/{resourceName}"}
>> 
>> Here parameters like orgId, facility and fhirVersion are dynamic. I don't see 
>> any data showing up for this time series. Is there any other way of 
>> specifying this in a Prometheus query?
>> 
>> Thanks
>> Debashish
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to prometheus-users+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/prometheus-users/ddecf4c5-01be-4059-b8a5-8c2bd0f02fdan%40googlegroups.com.
> 
> 
> -- 
> Julius Volz
> PromLabs - promlabs.com



[prometheus-users] How to specify Tags for timeSeries that represent url with parameters

2021-01-29 Thread Debashish Ghosh
I have a time series (HTTP requests) that has the URI as a tag, and the URI 
has parameters. How do I specify that in a Prometheus query? I am using the 
tag as follows:
http_server_requests_seconds_bucket{le="1.0",uri="/interopapi/v1/org/{orgId}/facility/{facility}/api/FHIR/{fhirVersion}/{resourceName}"}

Here parameters like orgId, facility and fhirVersion are dynamic. I don't 
see any data showing up for this time series. Is there any other way of 
specifying this in a Prometheus query?

Thanks
Debashish
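
If the uri label carries the real values, as the follow-up in this thread says, an equality matcher on the literal template will not match anything; a regex matcher (=~) is the usual tool for this. The sketch below is only an illustration: it turns the template into a pattern in which every {placeholder} matches exactly one path segment, and prints the resulting selector. The helper name template_to_pattern is made up for this example. Prometheus anchors regex matchers at both ends, so no ^ or $ is needed.

```python
import re

def template_to_pattern(template: str) -> str:
    """Replace each {placeholder} with a pattern matching one path segment."""
    literal_parts = re.split(r"\{[^{}]+\}", template)
    # re.escape protects regex metacharacters in the literal parts. Any
    # backslash it adds would have to be doubled inside a double-quoted
    # PromQL string; this particular template produces none (Python 3.7+).
    return "[^/]+".join(re.escape(part) for part in literal_parts)

template = "/interopapi/v1/org/{orgId}/facility/{facility}/api/FHIR/{fhirVersion}/{resourceName}"
pattern = template_to_pattern(template)
selector = 'http_server_requests_seconds_bucket{le="1.0",uri=~"%s"}' % pattern
print(selector)
```

If the label instead holds the literal template text, the plain uri="..." matcher from the question is the right one and the regex is unnecessary.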



[prometheus-users] Prometheus query using variables from grafana templates having spaces doesn't work

2020-07-06 Thread Debashish Ghosh
Hi,
   I have created a Grafana template variable to extract the names of 
organizations from the Prometheus DB using the query 
label_values(custom_message_volume_endpoint_organization_total,Organization).

This yields organization names that contain spaces in some cases. For example:
OrgA
OrgB product1
OrgC product2

I use the following query to extract the time series corresponding to an 
organization:
custom_latency_endpoint_organization_total{job="MyJob",Organization=~"$Organization"}

It works only when I select the variable value OrgA; it doesn't work for 
the other two. Apparently any organization name containing a space is discarded.
Is there a workaround for this?

Thanks
Debashish



[prometheus-users] want to eliminate -ve values in my prometheus query

2020-05-04 Thread Debashish Ghosh
Hi,
I am plotting a Prometheus timeline using the query 
(increase(metricA[5m]) - increase(metricB[5m])) to derive another timeline, 
metric C. But, probably due to some odd edge case, this value is sometimes 
negative. Is there a way in PromQL to express something like a conditional 
operator, e.g. metricA - metricB < 0 ? 0 : metricA - metricB?
So if the value is negative, use 0; otherwise use the actual value.

Thanks
Debashish
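
PromQL has no ternary operator, but its built-in clamp_min(v, min) function raises every sample below min up to min, so something like clamp_min(increase(metricA[5m]) - increase(metricB[5m]), 0) should give the behaviour asked for here (metricA and metricB stand in for the real metric names). The Python sketch below only mimics what that does to a list of sample values; it is not how Prometheus implements it, and the sample numbers are made up.

```python
def clamp_min(samples, minimum):
    """Mimic PromQL clamp_min on a plain list of sample values."""
    return [max(value, minimum) for value in samples]

# Hypothetical per-step values of increase(metricA[5m]) - increase(metricB[5m]).
metric_c = [4.0, -1.5, 0.0, 2.5]

# The negative sample is raised to 0; the others pass through unchanged.
print(clamp_min(metric_c, 0))
```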



[prometheus-users] Calculating Availability SLA over multiple VMs

2020-03-16 Thread Debashish Ghosh
Hi,
  I am currently using Spring's Actuator/Micrometer to emit metrics 
that are scraped by Prometheus.
The framework generates a metric called *process_uptime_seconds*, which is 
the number of seconds my app has been running in a VM. I have *2 VMs* where my 
app is running, to provide high availability of 99.95%.

I am using the following formula to calculate the SLA:
100-(((30*24*60*60) - increase(process_uptime_seconds{job="Interop-InboundApi"}[30d]))/(30*24*60*60))*100

30*24*60*60 represents the number of seconds in 30 days, and the difference 
between that and the increase in process_uptime_seconds gives the number of 
seconds the app was down in a VM.
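
To make the numbers concrete, here is the same arithmetic as a small Python sketch. It only restates the formula for a single VM; the names WINDOW_SECONDS and sla_percent are invented for the example.

```python
WINDOW_SECONDS = 30 * 24 * 60 * 60  # seconds in 30 days

def sla_percent(uptime_increase_seconds: float) -> float:
    """Same arithmetic as the formula above, for one VM."""
    downtime = WINDOW_SECONDS - uptime_increase_seconds
    return 100 - (downtime / WINDOW_SECONDS) * 100

# Downtime budget implied by a 99.95 % target over 30 days.
allowed_downtime = WINDOW_SECONDS * (100 - 99.95) / 100

print(WINDOW_SECONDS)            # 2592000
print(round(allowed_downtime))   # 1296 seconds, i.e. about 21.6 minutes
print(round(sla_percent(WINDOW_SECONDS - 1296), 2))  # 99.95
```

So a single restart that keeps one VM down for more than roughly 21.6 minutes in the window would push that VM's series below the target.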

But the problem with this approach is that periodically we have to *restart* 
the service to apply patches, and we do so one VM at a time so that 
there is no downtime.

But since the above formula creates one time series per VM instance, the 
SLA goes down, because both servers are restarted one after another.

Is there a way to take this into account and calculate the SLA based on 
the time *when both servers were down together*?

Thanks
Debashish
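
One way to think about it: the service is down only at instants when no VM is up. Assuming Prometheus's automatically generated up metric (1 when a scrape succeeds, 0 when it fails) is an acceptable stand-in for "the app is running", a subquery such as avg_over_time((max by (job) (up{job="Interop-InboundApi"}))[30d:1m]) * 100 might express that; it is untested against this setup. The Python sketch below shows the same idea on made-up sample lists, with hypothetical instance names vm1 and vm2.

```python
def service_availability(up_by_instance: dict) -> float:
    """Percentage of time steps at which at least one instance was up.

    Each value is a list of 0/1 samples; all lists are assumed to have the
    same length and to be aligned on the same timestamps.
    """
    steps = list(zip(*up_by_instance.values()))
    service_up = [max(step) for step in steps]
    return 100.0 * sum(service_up) / len(service_up)

# Rolling restart: the VMs go down one after the other, never together.
rolling = {"vm1": [1, 0, 1, 1], "vm2": [1, 1, 0, 1]}
# Real outage: both VMs are down at the third step.
outage = {"vm1": [1, 0, 0, 1], "vm2": [1, 1, 0, 1]}

print(service_availability(rolling))  # 100.0
print(service_availability(outage))   # 75.0
```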
  



[prometheus-users] Prometheus alerting rules test for counters requiring multiple day span

2020-03-10 Thread Debashish Ghosh
Hi,
I have a metric regarding SLA that needs to be 99.95% or above. I am
using the formula
100-(((30*24*60*60) - increase(process_uptime_seconds{job="Interop-InboundApi"}[30d]))/(30*24*60*60))*100
that runs for 15 minutes, which means that the gap between
the total number of seconds in 30 days and the number of seconds the
server was up in the last 30 days should be less than 0.05%. I
am having difficulty writing a test for this, since I see that the alert rules
test doesn't allow '1d' as interval. So should I use something like 1m as
interval with values: '0+60x43200', which would give a number of entries equal
to the number of minutes in 30 days? Also, what eval_time should I
use in this case? I am using 15m, but that doesn't yield the required
result.

I have a similar problem for the latency SLA. I am using a histogram for that and
am trying to get the percentage of messages in the 1-second bucket. I am
using the formula below:
sum(rate(http_server_requests_seconds_bucket{le="1.0",uri="/inboundapi/message/v2"}[30d])) by (job)
/ sum(rate(http_server_requests_seconds_count{uri="/inboundapi/message/v2"}[30d])) by (job) * 100
To test this too, I need something similar to the case above.

Thanks
Debashish
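
For reference, the promtool unit-test documentation describes the expanding notation 'a+bxn' as a, a+b, a+2b, ..., a+n*b, i.e. n+1 samples spaced one interval apart. The sketch below expands the string from the question to show what it covers. expand_series is a hypothetical helper, not part of promtool, and it handles only the plain a+bxn / a-bxn form.

```python
import re

def expand_series(notation: str) -> list:
    """Expand 'a+bxn' or 'a-bxn' into its n+1 sample values."""
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)([+-])(\d+(?:\.\d+)?)x(\d+)", notation)
    if match is None:
        raise ValueError("unsupported notation: " + notation)
    start, sign, step, count = match.groups()
    delta = float(step) if sign == "+" else -float(step)
    return [float(start) + i * delta for i in range(int(count) + 1)]

samples = expand_series("0+60x43200")
print(len(samples))  # 43201 samples, one per minute at interval: 1m
print(samples[-1])   # 2592000.0, i.e. 30 days of uptime in seconds
```

With interval: 1m the last sample lands at 43200m, so an eval_time of 15m sees only the first 16 samples. Evaluating at 43200m would let the [30d] range cover the whole series. Whether increase() then reports exactly 2592000 depends on its extrapolation, so the expected value is something to check rather than assume.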



Fwd: [prometheus-users] Re: Prometheus alerting rules test for counters

2020-03-09 Thread Debashish Ghosh
This makes perfect sense for alerting, and it is really helpful. I was just
using the queries I have for my Grafana dashboard, where I wanted
plots that are more fine-grained and evaluated every 2 minutes.
I have another metric regarding SLA that needs to be 99.95% or above. I
am using the formula
100-(((30*24*60*60) - increase(process_uptime_seconds{job="Interop-InboundApi"}[30d]))/(30*24*60*60))*100
which means that the gap between the total number of
seconds in 30 days and the number of seconds the server was up in the
last 30 days should be less than 0.05%. I am having difficulty
writing a test for this, since I see that it doesn't allow '1d' as interval.
So should I use something like 24*60m instead of 1d?

I have a similar problem for the latency SLA. I am using a histogram for that and
am trying to get the percentage of messages in the 1-second bucket. I am
using the formula below:
sum(rate(http_server_requests_seconds_bucket{le="1.0",uri="/inboundapi/message/v2"}[30d])) by (job)
/ sum(rate(http_server_requests_seconds_count{uri="/inboundapi/message/v2"}[30d])) by (job) * 100
To test this too, I need to use days in the interval.

Let me know your thoughts .

Thanks
Debashish

On Mon, Mar 9, 2020 at 11:18 AM Brian Candler  wrote:

> On Monday, 9 March 2020 14:47:28 UTC, Debashish Ghosh wrote:
>>
>> Thanks Brian, that answers most of my questions. Regarding using
>> [15m] in the increase: we have purposely kept it at [2m], running for 15
>> minutes, since we are really checking that something is continuously true
>> the whole time before triggering an alert, as opposed to only once.
>>
>>
> Yes, but think about it.  You are evaluating the rule every minute.
>
> In one case you are saying:
>
> b-a == 0  (*)
> c-b == 0
> d-c == 0
> ... must be true 15 times in a row
>
> What I'm recommending is you do the rate over 15 minutes, which means
>
> q-a == 0
>
> You can still evaluate this rule every 1 minute, and it will first trigger
> once the counter has been flat for 15 minutes.
>
> I think you can see that in both cases, the counter must be continuously
> non-incrementing over 15 minutes to alert.  However, the second formulation
> is more stable in the face of any missed data collection.  metric[2m] will
> return no value if there are not two points within a 2 minute window.
>
> (*) It's not exactly "b-a==0", because rate() or increase() will skip
> cases where the counter resets.
>



Re: [prometheus-users] Re: Prometheus alerting rules test for counters

2020-03-09 Thread Debashish Ghosh
Thanks Brian, that answers most of my questions. Regarding using [15m]
in the increase: we have purposely kept it at [2m], running for 15 minutes,
since we are really checking that something is continuously true the whole
time before triggering an alert, as opposed to only once.

On Mon, Mar 9, 2020 at 5:07 AM Brian Candler  wrote:

> BTW, I think that rule would be more robust against missing values by using
>
> expr: increase(metric_name[15m]) == 0
>
> instead of using "for:".  If you use "for:" then the condition must be
> true for every single evaluation, and a single missed sample may reset the
> alert.
>



[prometheus-users] Prometheus alerting rules test for counters

2020-03-08 Thread Debashish Ghosh
Hi ,
   I have a few alerts created for some counter time series in Prometheus. 
I went through the basic alerting test examples on the Prometheus website, 
but they don't seem to work well with the counters I use for alerting. 
I use expressions on counters such as increase(), rate() and sum(), and 
want to create test rules for these. I have attached my alerts file 
as well as my test file. 
I am trying the most basic test of the counter 
custom_message_volume_endpoint_organization_total, where I set all 
the values to 0 so that when my alert with the expr 
increase(custom_message_volume_endpoint_organization_total[2m]) == 0 runs, 
it is zero for the full 15 minutes and the alert should then fire. 
But it keeps returning blank. 
Can you please help me with this?

Also, I have one question regarding the difference between interval and 
evaluation_interval in the test file. Are they the same, and if not, what is 
the difference? I now understand the meaning of eval_time.

Thanks
Debashish



interop_alert_rule_test.yml
Description: Binary data


Prometheus_alerting_rules.yml
Description: Binary data


[prometheus-users] Scraping different metrics at different intervals

2020-02-29 Thread Debashish Ghosh


Hi,

   I have an application that uses the Spring Actuator/Micrometer framework to 
expose a myriad of metrics.

Some are volume and latency related and need to be scraped 
every 10 seconds. Others are resource-monitoring metrics that are very 
expensive to compute and need to be scraped only every 2 minutes.

How do I ensure these two sets of metrics are scraped separately, 
since I don't want my resource metrics to be scraped every 10 seconds?

One solution could be to expose two different endpoints, with a separate 
Prometheus job for each.

Is there a better way of handling this?

Thanks 
Debashish 




[prometheus-users] Prometheus query with conditional operation

2020-02-27 Thread Debashish Ghosh
Hi,
I have a tricky problem to resolve: I am trying to get the percentage of all 
the messages flowing through our system that take more than 1 second, 
over a span of 30 days.

I have a volumeCounter that gives the total volume of messages, so I believe 
the total number of messages will be increase(volumeCounter[30d]).

I have another counter, latencyCounter, that adds up the latency of each 
message. So to get the latency per message I can use 
increase(latencyCounter[30d])/increase(volumeCounter[30d]).

Now in the time series generated there are some data points where the value 
is > 1 second. I want to get the percentage of those out of the overall number 
of data points.

So, for example, if in the last 30 days I sent 10000 messages, of which 1000 
took more than 1 second, the result of the query at that point should 
be 10.

Is there a way of achieving this in Prometheus?

Thanks
Debashish
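
A latency-sum counter and a volume counter can only give the average latency; they cannot say how many individual messages exceeded 1 second. A histogram can, because its le="1.0" bucket counts the messages that finished within 1 second. The sketch below shows the arithmetic with assumed numbers (10000 messages in total, which is what a result of 10 for 1000 slow messages implies). In PromQL the two inputs would come from increase() over the _bucket{le="1.0"} and _count series of a histogram, whose names depend on the instrumentation; the function name here is made up.

```python
def percent_slower_than_bucket(total_count: float, bucket_count: float) -> float:
    """Percentage of observations that fell outside the le bucket."""
    if total_count == 0:
        raise ValueError("no observations in the window")
    return 100.0 * (total_count - bucket_count) / total_count

# Assumed numbers: 10000 messages in 30 days, 9000 of them within 1 second.
print(percent_slower_than_bucket(10000, 9000))  # 10.0
```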
