[prometheus-users] Re: PromQL redirection

2024-09-06 Thread 'Brian Candler' via Prometheus Users
Have a look at https://github.com/jacksontj/promxy

But I don't think it's yet clever enough to avoid querying servers that 
couldn't possibly match the query. (Presumably it could only do that if 
your PromQL were specific enough with its labels.)

On Friday 6 September 2024 at 16:28:10 UTC+1 Samit Jain wrote:

>
> We have multiple segregated data systems which support PromQL, each storing 
> metrics for a different class of applications, infra, etc.
>
> We would like to explore the possibility of abstracting PromQL over these 
> systems, such that a user can run a query without knowing about the different 
> backends. The options we considered below use something of a brute-force 
> approach and won't scale:
>
>1. support remote read API in all backends and configure Prometheus to 
>remote read from all of them.
>2. send PromQL query to all backends and merge the results.
>
> I think a system where there is an external 'router' component which knows 
> where different time series are stored (some sort of an index table) and 
> uses it to query the right backend would be worth exploring. We can presume 
> for now that the time series are unique across all backends. Do you know 
> whether something like this exists in some form, or of some literature on 
> this that we could build upon?
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/7e65f7de-cab5-48b5-a4f6-ab810ea5823bn%40googlegroups.com.


[prometheus-users] Re: PromQL - Check for specific value in the past

2024-03-06 Thread 'Brian Candler' via Prometheus Users
You can use a subquery which will sample the data, something like this:

bgp_state_info != 3 and present_over_time((bgp_state_info == 3)[60d:1h])

You can reduce the sampling interval from 1h to reduce the risk of missing 
times when BGP was up, but then the query becomes increasingly expensive.

It would be nice if PromQL allowed you to do filtering and arithmetic 
expressions between range vectors and scalars, e.g. 
present_over_time(bgp_state_info[60d] == 3), but it doesn't.

Another approach is to use a recording rule, where you can combine the 
current value with a new value, e.g.

- record: bgp_seen
  expr: bgp_seen or bgp_state_info == 3

Temporarily set the expression to the subquery to prime it from historical 
data.  With a bit of tweaking you could make the value of this expression 
be the timestamp when bgp_state_info == 3 was first seen.
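The timestamp-latching tweak described above might look something like this
(a sketch, untested; the group name is an assumption, and the priming step
from the subquery still has to be done manually as described):

```yaml
groups:
  - name: bgp_history   # hypothetical group name
    rules:
      # Latch the first time bgp_state_info == 3 was observed.
      # Once bgp_seen exists, the left-hand side of "or" wins, so the
      # original timestamp is carried forward unchanged on later evaluations.
      - record: bgp_seen
        expr: bgp_seen or timestamp(bgp_state_info == 3)
```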

The alert then becomes:

bgp_state_info != 3 and bgp_seen

On Wednesday 6 March 2024 at 11:48:23 UTC fiala...@gmail.com wrote:

> Hi,
>
> I have a metric bgp_state_info. The OK state is when the metric has value 3; 
> other values (1 to 7) are considered errors.
>
> I want to fire an alert only for metrics that have had value 3 at least once. 
> In other words, I don't want to fire an alert for a BGP session that never worked.
>
> Is it possible to do this via PromQL? I have a data retention of 60 days and 
> I'm aware of this limitation.
>
> Thank you.
>
>



[prometheus-users] Re: PromQL: understanding the and operator

2024-02-23 Thread 'Brian Candler' via Prometheus Users
On Saturday 24 February 2024 at 01:00:57 UTC+7 Alexander Wilke wrote:

Another possibility could be

QueryA + queryB == 0  #both down


No, that doesn't work, for exactly the same reason that "QueryA and QueryB" 
doesn't work.

With a binary expression like "foo + bar", each side is a vector, and each 
element of the vector has a different label set.

The result only combines values from the left and right hand sides with 
*exactly* matching label sets.  Therefore, an element in the LHS with 
{HOSTNAME="server1"} does not match an element in the RHS with 
{hostname="server2"}.  Elements in the LHS which don't match any element in 
the RHS (and vice versa) are dropped.

But you can modify that logic, using for example "foo + ignoring(HOSTNAME) 
bar"

In this case, the HOSTNAME label is ignored when matching the LHS and RHS. 
But if an element on the LHS then matches multiple on the RHS, or vice 
versa, there will be an error.  N:1 or 1:N matches can be made to work by 
adding group_left or group_right clauses. If multiple elements on LHS match 
multiple elements on the RHS, then that doesn't work.
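As a sketch, with hypothetical metrics foo (which has a HOSTNAME label) and
bar (which doesn't):

```promql
# 1:1 match, with the HOSTNAME label excluded from matching:
foo + ignoring(HOSTNAME) bar

# N:1 match: several foo series (differing only in HOSTNAME) each match
# the same bar series; group_left marks the left side as the "many" side.
foo + ignoring(HOSTNAME) group_left bar
```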



[prometheus-users] Re: PromQL: understanding the and operator

2024-02-23 Thread 'Brian Candler' via Prometheus Users
On Friday 23 February 2024 at 02:28:52 UTC+7 Puneet Singh wrote:

Now i tried to find the time duration where both these service were 
simultaneously down / 0 on both server1 and server2 :
(sum without (USER) (go_service_status{HOSTNAME="server1",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)
and
(sum without (USER) (go_service_status{HOSTNAME="server2",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)


I was expecting a graph similar to the one for server2, but I got:
[image: Untitled.png]

I think I need to ignore the HOSTNAME label, but I am unable to figure out 
how to ignore the HOSTNAME label in combination with the "sum without" clause.


You've got exactly the right idea.  It's not the "sum without" that needs 
modifying, it's the "and":

() and ignoring (HOSTNAME) ()

(Note that label names are case-sensitive, so it must be HOSTNAME here, not 
hostname.)

 See: 
https://prometheus.io/docs/prometheus/latest/querying/operators/#vector-matching-keywords
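Applied to the full query, that would look something like the sketch below.
Note that the instance label (server1:7878 vs server2:7878) also differs
between the two sides, so it likely needs to be ignored as well:

```promql
(sum without (USER) (go_service_status{HOSTNAME="server1",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)
and ignoring (HOSTNAME, instance)
(sum without (USER) (go_service_status{HOSTNAME="server2",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)
```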

In this particular example, there are other ways to do this which might end 
up with a more compact expression. You could have an outer sum over the 
inner sums, but then I think the whole expression simplifies to just

sum without (USER) (go_service_status{HOSTNAME=~"server1|server2",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1



[prometheus-users] Re: PromQL: understanding the and operator

2024-02-23 Thread Alexander Wilke
Another possibility could be

QueryA + queryB == 0  #both down

Or the other way
QueryA + queryB == 2 # both up



Alexander Wilke wrote on Friday, 23 February 2024 at 17:45:28 UTC+1:

> In Grafana I create query A and query B and then an expression C with 
> "Math", and then I can compare like $A > 0 && $B > 0.
> Maybe there is a "Transform Data" and then a calculation option.
>
> Puneet Singh wrote on Thursday, 22 February 2024 at 21:58:08 UTC+1:
>
>> OK, so I think this should be the correct way to perform the and operation:
>> (sum without (USER, HOSTNAME, instance) (go_service_status{HOSTNAME="server1",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)
>> and
>> (sum without (USER, HOSTNAME, instance) (go_service_status{HOSTNAME="server2",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)
>>
>> Regards
>> P
>>
>>
>> On Friday 23 February 2024 at 00:58:52 UTC+5:30 Puneet Singh wrote:
>>
>>> Hi All,
>>> I have a metric called go_service_status where I use the "sum without" 
>>> operator to determine whether a service is up or down on a server. Now 
>>> there can be a situation where the service is down simultaneously on 2 
>>> master servers, and I am unable to figure out a PromQL query to detect 
>>> that situation. Example -
>>>
>>>
>>> *go_service_status{SERVICETYPE="grade1",SERVER_CATEGORY="db1",instance=~"server1:7878"}*
>>> and it can have 2 possible series -
>>> go_service_status{HOSTNAME="server1", SERVER_CATEGORY="db1", 
>>> SERVICETYPE="grade1", USER="admin", instance="server1:7878", 
>>> job="customprocessexporter01"} 0
>>> go_service_status{HOSTNAME="server1", SERVER_CATEGORY="db1", 
>>> SERVICETYPE="grade1", USER="root", instance="server1:7878", 
>>> job="customprocessexporter01"} 1
>>>
>>> and in the same way
>>>
>>> *go_service_status{SERVICETYPE="grade1",SERVER_CATEGORY="db1",instance=~"server2:7878"}*
>>> and it can have 2 possible series -
>>> go_service_status{HOSTNAME="server2", SERVER_CATEGORY="db1", 
>>> SERVICETYPE="grade1", USER="admin", instance="server2:7878", 
>>> job="customprocessexporter01"} 0
>>> go_service_status{HOSTNAME="server2", SERVER_CATEGORY="db1", 
>>> SERVICETYPE="grade1", USER="root", instance="server2:7878", 
>>> job="customprocessexporter01"} 0  
>>>
>>>
>>> Here's the query with which I figure out the status of the service on 
>>> server1.  Example - 
>>>
>>> (sum without (USER) (go_service_status{HOSTNAME="server1",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)
>>> [image: Untitled.png]
>>>
>>> so the server1's service is momentarily 0
>>>
>>>
>>> and server2's service is always down , example - 
>>> (sum without (USER) (go_lsf_service_status{HOSTNAME="server2",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)
>>> [image: Untitled.png]
>>>
>>>
>>> Now i tried to find the time duration where both these service were 
>>> simultaneously down / 0 on both server1 and server2 :
>>> (sum without (USER) (go_service_status{HOSTNAME="server1",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)
>>> and
>>> (sum without (USER) (go_service_status{HOSTNAME="server2",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)
>>>
>>>
>>> I was expecting a graph similar to the one for server2, but I got:
>>> [image: Untitled.png]
>>>
>>> I think I need to ignore the HOSTNAME label, but I am unable to figure 
>>> out how to ignore the HOSTNAME label in combination with the "sum 
>>> without" clause.
>>>
>>> Any help/hint to improve this query would be very useful for me to 
>>> understand the "and" condition in the context of the "sum without" clause.
>>>
>>> Thanks,
>>> Puneet
>>
>>



[prometheus-users] Re: PromQL: understanding the and operator

2024-02-23 Thread Alexander Wilke
In Grafana I create query A and query B and then an expression C with 
"Math", and then I can compare like $A > 0 && $B > 0.
Maybe there is a "Transform Data" and then a calculation option.

Puneet Singh wrote on Thursday, 22 February 2024 at 21:58:08 UTC+1:

> OK, so I think this should be the correct way to perform the and operation:
> (sum without (USER, HOSTNAME, instance) (go_service_status{HOSTNAME="server1",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)
> and
> (sum without (USER, HOSTNAME, instance) (go_service_status{HOSTNAME="server2",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)
>
> Regards
> P
>
>
> On Friday 23 February 2024 at 00:58:52 UTC+5:30 Puneet Singh wrote:
>
>> [quoted text trimmed]
>
>



[prometheus-users] Re: PromQL: understanding the and operator

2024-02-22 Thread Puneet Singh
OK, so I think this should be the correct way to perform the and operation:

(sum without (USER, HOSTNAME, instance) (go_service_status{HOSTNAME="server1",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)
and
(sum without (USER, HOSTNAME, instance) (go_service_status{HOSTNAME="server2",SERVER_CATEGORY="db1",SERVICETYPE="grade1"}) < 1)

Regards
P


On Friday 23 February 2024 at 00:58:52 UTC+5:30 Puneet Singh wrote:

> [quoted text trimmed]



[prometheus-users] Re: PromQL: understanding the and operator

2024-02-22 Thread Puneet Singh
*Correction*: "I was expecting a graph similar to the one for server2, but I 
got:" should be "I was expecting a graph similar to the one for server1, but 
I got:".

On Friday 23 February 2024 at 00:58:52 UTC+5:30 Puneet Singh wrote:

> [quoted text trimmed]



[prometheus-users] Re: PromQL filter based on current date

2024-02-12 Thread 'Brian Candler' via Prometheus Users
The only ways I know are to use the Prometheus API and set the evaluation 
time, or to use the @ timestamp PromQL modifier. But in either case you have 
to work out the timestamp of the end of the 24 hours of interest, and insert 
it yourself.
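For example, the end-of-day timestamp could be computed outside PromQL and
substituted into the query string (a sketch; the metric name and label are
taken from the question below, and "today" is assumed to mean the UTC day):

```python
from datetime import datetime, timezone, timedelta

# Unix timestamp for the end of the current UTC day. Anchoring [24h] there
# via the "@" modifier makes the range cover exactly today's data
# (samples later than "now" simply don't exist yet).
now = datetime.now(timezone.utc)
midnight = datetime(now.year, now.month, now.day, tzinfo=timezone.utc)
end_of_day = int((midnight + timedelta(days=1)).timestamp())

query = f'my_metric{{node="ABC"}}[24h] @ {end_of_day}'
print(query)
```

The resulting string can then be sent to the HTTP query API as usual.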

On Friday 9 February 2024 at 14:29:39 UTC Dipesh J wrote:

> Hi, 
>
> Is there a way to get metrics only for the current date, instead of using a 
> range like [24h], which would probably include metrics from the day before too?
>
> Last 24 hours:
> my_metric{node="ABC"} [24h]
>
> Something like the below, to give metrics for the current date only:
> my_metric{node="ABC"} [TODAYS_DATE]
>
>
> Thanks
>



[prometheus-users] Re: PROMQL replace values

2023-12-15 Thread Bryan Boreham
I think what you are looking for is a "join", which in PromQL terms 
is done with a binary expression.
See for instance this 
article: https://www.robustperception.io/left-joins-in-promql/

You'd need to do a label_replace from "DBInstanceIdentifier" to "instance".

Also, you can use a `group by` aggregation to reduce one of the values to 1, 
so that you can use * as the binary operator.
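Putting those pieces together, a sketch (untested; the regex assumes the
instance label holds the bare DBInstanceIdentifier, which may need adjusting
if instance is in host:port form):

```promql
mysql_global_status_uptime
* on (instance) group_left ()
  label_replace(
    # reduce the master-selector to a single series with value 1
    group by (DBInstanceIdentifier) (
      WriteLatency{DBInstanceIdentifier=~"aurora-mysql.*"} > 0
    ),
    # copy DBInstanceIdentifier into an "instance" label for matching
    "instance", "$1", "DBInstanceIdentifier", "(.*)"
  )
```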

Bryan

On Friday 15 December 2023 at 15:34:44 UTC marqu...@gmail.com wrote:

> Has anyone ever done any logic like this, or would anyone have an idea how 
> I can solve this?
>
> I use WriteLatency{DBInstanceIdentifier=~"aurora-mysql.*"} > 0 to find 
> which instance is the *master* of my database.
>
> The result of *Query 1* will be: 
> WriteLatency{DBInstanceIdentifier="aurora-mysql-2"} where *aurora-mysql-2* is 
> the *master* database.
>
> To validate the uptime, I use the following query:
> mysql_global_status_uptime{instance=" "}.
>
> I need the result of the *DBInstanceIdentifier* label in *Query 1* to be 
> inserted as a value in *Query 2* in the *instance* field. That way, I can 
> have a query in which I check the uptime of the master database only.
>



Re: [prometheus-users] Re: PromQL queries not supporting OR and | logical operator

2023-06-17 Thread Brian Candler
If you group by a label, and the label is missing or empty (which are the 
same thing in Prometheus), you get no results. e.g.

sum by (foo) (node_filesystem_avail_bytes)

I think what you have to do is to use label_replace to synthesize a dummy 
value for the missing label:

sum by (foo) (label_replace(node_filesystem_avail_bytes, "foo", "unset", "foo", ""))

This converts
    mymetric{bar="baz"}
into
    mymetric{bar="baz",foo="unset"}

On Saturday, 17 June 2023 at 16:20:23 UTC Moksha Reddy G wrote:

> Hi Brian,
>
> Thanks for your response.
> Now I am clear about my case1, but I still have questions on case2 and 
> case3. I prefer to go one by one; please find my question on Case2 below:
>
> *Case2: *Below queries are responding with results but it is giving *Empty 
> query results *when I choose the date/time before I introduced 
> *clusterName* as a label. Label value should be common as* $clustername *in 
> both label keys.
> sum(kube_pod_info{k8s_cluster=~"dd-stg*|*", clusterName=~"dd-stg*|*"}) by 
> (clusterName, k8s_cluster)
> sum(irate(pilot_xds_pushes{type=~"rds", clusterName=~"dd-stg|", 
> k8s_cluster=~"dd-stg|"}[5m])) by (clusterName, k8s_cluster)
> *For example:* If I choose a timeline 1 month old (*2023-05-01*), when 
> there was no label called *clusterName*, then NO results are displayed. If 
> I choose the current timeframe then it works. The *clusterName* label was 
> introduced only one month ago, so this is because of the availability of 
> the label at that point in time. *BUT we need the old data as well, with 
> both (clusterName, k8s_cluster) labels, so what is the correct way of 
> defining the query in such cases?*
> [image: image.png]
> [image: image.png]
> [image: image.png]
>
> Best,
> Reddy
>
>
>
> On Sat, Jun 17, 2023 at 3:28 AM Brian Candler wrote:
>
>> [quoted text trimmed]

[prometheus-users] Re: PromQL queries not supporting OR and | logical operator

2023-06-17 Thread Brian Candler
In your case 1, 
*clusterName=" $clustername|"*
should be
*clusterName=~" $clustername|"*
Your screenshot shows this mistake as well.

You stated "Below query is not at all working as it contains other 
condition in beginning."  I think you need to clarify both parts of that 
statement:

(1) in what way is it not working? Show the input metrics, the result of 
the query, and what result you're actually looking for.

If the problem is that it returns an empty result set (as per your 
screenshot), that's because you used the wrong label match operator, "=" 
instead of "=~".  It will only match a clusterName which has the exact 
literal value "d3-prd-w2|" (including the vertical bar).

(2) "as it contains other condition in beginning" doesn't mean anything to 
me. What conditions? Do you mean the label filter type="cds"?
Then clearly it will only match metrics that have that label. Is that not 
what you want?

In your case 3, I think you have another bug:
(kube_pod_info{k8s_cluster=~"$clustername"} or {clusterName=~"$clustername"})
should be
(kube_pod_info{k8s_cluster=~"$clustername"} or kube_pod_info{clusterName=~"$clustername"})

>  OR is not working in all cases due to binary expressions in queries

That doesn't mean anything. The semantics of the OR operator are clearly 
defined. Show examples of what metrics you are feeding into this query, 
what you get as the output, and what you would *like* to see instead, and 
we may be able to help you formulate a query that does what you want.

In other words, the problem is not that "OR is not working" - the problem 
is that you haven't formulated your PromQL query in a way which gets you 
the results you're looking for.

> Could you please look into this issue on high priority

I refer you to http://www.catb.org/~esr/faqs/smart-questions.html#urgent
(however, the whole of that document is well worth reading)

On Saturday, 17 June 2023 at 02:39:50 UTC Moksha Reddy G wrote:

> Hey there,
>
> I am struggling to find a simple and efficient OR logic operator for my 
> Prometheus queries, which run through Grafana dashboards. 
>
> *Problem Statement: *
> *Case1*: Below query is not at all working as it contains other condition 
> in beginning. We have many queries like that and need the correct syntax to 
> use logic OR condition. Label value should be common as* $clustername *in 
> both label keys
> sum(irate(pilot_xds_pushes{type="cds",*clusterName=" $clustername|", 
> k8s_cluster=" $clustername|"*}[5m]))
>
> *Case2*: Below query is responding with results but it is giving *Empty 
> query result *when I choose the date/time before I introduced 
> *clusterName* as a label. Label value should be common as* $clustername *in 
> both label keys.
> sum(kube_pod_info{k8s_cluster=~"$clustername*|*", 
> clusterName=~"$clustername*|*"}) by (clusterName, k8s_cluster)
> *For example:* If I choose 1 month old timeframe when there is no label 
> called *clusterName *then NO results displayed*. *If I choose current 
> timeframe then it works. I guess it is because of the availability of the 
> label in the Prometheus database. What is the logic on the Prometheus backend?
>
> *Case3**: *If I put *OR* operator in between these two conditions then 
> its working but OR is not working in all cases due to binary expressions in 
> queries. Below is the query:
> sum(kube_pod_info{k8s_cluster=~"$clustername"} or {clusterName=~"$
> clustername"}) by (clusterName, k8s_cluster)
>
> Could you please look into this issue on high priority and kindly share 
> your inputs to use this OR operator without any issues in all these 
> scenarios?
>
> Best,
> Reddy
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/54816f0f-ffe5-4684-9ed3-489eb0debcd8n%40googlegroups.com.


Re: [prometheus-users] Re: promql - what is promql for calcuate percetile

2023-03-27 Thread Bjoern Rabenstein
On 20.03.23 03:28, Brian Candler wrote:
> > Note - I have no bucket metrics for histogram. 
> 
> What you say doesn't make sense to me.  What you showed *is* a histogram, 
> and the metrics *prometheus_rule_evaluation_duration_seconds* *are* the 
> buckets.

Strictly speaking, it's a summary, and the metrics labeled with
"quantile" are precalculated
quantiles. Cf. https://prometheus.io/docs/practices/histograms/ 

> Therefore, if those are the metrics you have, then the 50th percentile is 
> simply
> prometheus_rule_evaluation_duration_seconds{quantile="0.5"}
> and the 90th percentile is simply
> prometheus_rule_evaluation_duration_seconds{quantile="0.9"}
> 
> There is no need to "calculate" the p50/p90/p99 latencies because you 
> already have them.

That's correct. Note that there is no way to further aggregate the
pre-calculated quantile (or change them for example to a different
quantile or to a different time interval).

If you need aggregatability or more flexibility for ad-hoc queries,
you have to use an actual histogram in your instrumentation of the
monitored target (either the classic histograms or the new
experimental native histograms).
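
To illustrate why bucketed histograms aggregate while precomputed quantiles
don't, here is a rough Python sketch of the linear interpolation that
histogram_quantile() performs over cumulative "le" buckets (hypothetical data
and simplified logic; not the actual Prometheus implementation):

```python
# Sketch of histogram_quantile()-style estimation over cumulative buckets.
# Buckets from several instances can be summed per upper bound *before*
# estimating a quantile; precomputed quantiles offer no such operation.
def histogram_quantile(q, buckets):
    """buckets: ascending list of (upper_bound, cumulative_count)."""
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:  # empty bucket: nothing to interpolate
                return bound
            # linear interpolation within the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Two instances' buckets, summed per "le" bound, then p90 of the aggregate:
inst_a = [(0.1, 80), (0.5, 95), (1.0, 100)]
inst_b = [(0.1, 10), (0.5, 60), (1.0, 100)]
combined = [(le, ca + cb) for (le, ca), (_, cb) in zip(inst_a, inst_b)]
p90 = histogram_quantile(0.9, combined)  # ~0.78 for this data
```

Averaging the two per-instance p90 values instead would give a different (and
statistically meaningless) number, which is the aggregation problem with
summary quantiles.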

-- 
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in



[prometheus-users] Re: promql - what is promql for calcuate percetile

2023-03-20 Thread Brian Candler
> Note - I have no bucket metrics for histogram. 

What you say doesn't make sense to me.  What you showed *is* a histogram, 
and the metrics *prometheus_rule_evaluation_duration_seconds* *are* the 
buckets.

Therefore, if those are the metrics you have, then the 50th percentile is 
simply
prometheus_rule_evaluation_duration_seconds{quantile="0.5"}
and the 90th percentile is simply
prometheus_rule_evaluation_duration_seconds{quantile="0.9"}

There is no need to "calculate" the p50/p90/p99 latencies because you 
already have them.

On Monday, 20 March 2023 at 08:47:46 UTC Prashant Singh wrote:

> Hi , 
>
> I need to know the PromQL to calculate the p50, p90, and p99 latency 
> percentiles for the metrics below.
>
> Note - I have no bucket metrics for histogram. 
>
>
> # HELP prometheus_rule_evaluation_duration_seconds The duration for a rule 
> to execute.
> # TYPE prometheus_rule_evaluation_duration_seconds summary
> prometheus_rule_evaluation_duration_seconds{quantile="0.5"} 6.4853e-05
> prometheus_rule_evaluation_duration_seconds{quantile="0.9"} 0.00010102
> prometheus_rule_evaluation_duration_seconds{quantile="0.99"} 0.000177367
> prometheus_rule_evaluation_duration_seconds_sum 1.623860968846092e+06
> prometheus_rule_evaluation_duration_seconds_count 1.112293682e+09
>
> Thanks
> Prashant
> Thanks,
> Prashant
>



[prometheus-users] Re: PromQL: multiple queries with dependent values

2022-10-26 Thread marc koser
To close the loop on this, I was able to get this working using this query:

max by (group) (redis_cluster_known_nodes) != count by (group) 
(up{service=~"exporter-redis-.*"})

Thanks for your insight on this Brian.
On Wednesday, October 19, 2022 at 4:24:41 AM UTC-4 Brian Candler wrote:

> Or even:
> redis_cluster_known_nodes != redis_cluster_known_nodes offset 5m
>
> On Tuesday, 18 October 2022 at 20:12:27 UTC+1 marc.k...@gmail.com wrote:
>
>> Perhaps an easier option would be to compare redis_cluster_known_nodes 
>> against what it was n-time_interval_ago:
>> redis_cluster_known_nodes != avg_over_time 
>> (redis_cluster_known_nodes[1d:4h])
>>
>> It's less than ideal since it's not using a static, expected value of 
>> total cluster nodes, and it would match when the cluster nodes become what 
>> is expected, but I can deal with that for now.
>>
>> Thanks for your help! 
>>
>> On Tuesday, October 18, 2022 at 7:27:10 AM UTC-4 marc koser wrote:
>>
>>> > So really it boils down to, what's a "node" and how do you count them? 
>>>  Is a single "node" a whole cluster, or is a cluster a collection of nodes?
>>>
>>> A node is a redis service that is part of a cluster (id'ed by the 
>>> `group` label), so a cluster is a collection of nodes. The sum of all nodes 
>>> is a determinate and, under normal circumstances, a static value but since 
>>> a redis 'node' is never forgotten unless told to I want to alert on this 
>>> case since it can skew the interpolation of other metrics.
>>>
>>> > In particular, what do these metrics mean?
>>> > 
>>> > redis_cluster_known_nodes{group="group-a", instance="node-1", 
>>> job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
>>> > redis_cluster_known_nodes{group="group-a", instance="node-2", 
>>> job="redis-cluster", service="exporter-redis-6379"} 11
>>> > redis_cluster_known_nodes{group="group-a", instance="node-3", 
>>> job="redis-cluster", service="exporter-redis-6379"} 16
>>> > redis_cluster_known_nodes{group="group-a", instance="node-4", 
>>> job="redis-cluster", service="exporter-redis-6379"} 16
>>> > redis_cluster_known_nodes{group="group-a", instance="node-5", 
>>> job="redis-cluster", service="exporter-redis-6379"} 16
>>>
>>> This represents the state of all known redis nodes belonging to a single 
>>> cluster relative to a running node.
>>>
>>> > They are all the same "service", but how come instance "node-1" 
>>> contains or sees 10 "nodes", but instance "node-2" contains or sees 11 
>>> "nodes", and the other instances contain or see 16 "nodes"?  Perhaps this 
>>> inconsistency is the error you're trying to detect - in which case, what do 
>>> you think is the correct number of nodes?
>>>
>>> This is indeed the scenario I'm attempting to query for. In this case; 
>>> when a node is joined to the cluster but is unreachable for any reason (ie: 
>>> redis is uninstalled / re-installed and the node rejoins the cluster) the 
>>> node's ID changes (the new ID is valid and reachable, the old ID is no 
>>> longer valid and unreachable).
>>>
>>> The correct value is 10: 5 `instance`'s x 2 `service`'s
>>>
>>> > Let's say 16 is the correct answer for group="group-a" and 
>>> service="exporter-redis-6379".  Perhaps you didn't show the full set of 
>>> "up" metrics.  In which case, I'd first try to build an "up" query which 
>>> gives the expected answer 16 on the right-hand side.  Maybe something like 
>>> this:
>>> >
>>> > count by (service, group) (up{service=~"exporter-redis-.*"})
>>> >
>>> > What does that expression show?
>>>
>>> {group="group-a", service="exporter-redis-6379"} 5
>>> {group="group-a", service="exporter-redis-6380"} 5
>>>
>>> > When you have that part working, then we can work on matching the LHS. 
>>>  Since each *instance* seems to have its own distinct idea of the total 
>>> number of nodes, then I expect this requires an N:1 match on 
>>> (group,service).  That is, there is 1 "should be" value for a given 
>>> (service,group) on the RHS, and multiple nodes each with their own count of 
>>> (service,group) on the LHS.
>>>
>>> That sounds accurate
>>>
>>> > If that's the case, it might end up something like this:
>>> > 
>>> > redis_cluster_known_nodes != on (service, group) group left() 
>>> count by (service, group) (up{service=~"exporter-redis-.*"})
>>> > 
>>> > but at this point I'm just speculating.
>>>
>>> This gives the same result as before.
>>>
>>> I'll keep plugging away at this to see what I can come up with.
>>>
>>> On Tuesday, October 18, 2022 at 3:36:49 AM UTC-4 Brian Candler wrote:
>>>
 Sorry, I missed an underscore there.

redis_cluster_known_nodes != on (service, group) *group_left*() 
 count by (service, group) (up{service=~"exporter-redis-.*"})




[prometheus-users] Re: PromQL: multiple queries with dependent values

2022-10-19 Thread Brian Candler
Or even:
redis_cluster_known_nodes != redis_cluster_known_nodes offset 5m

On Tuesday, 18 October 2022 at 20:12:27 UTC+1 marc.k...@gmail.com wrote:

> Perhaps an easier option would be to compare redis_cluster_known_nodes 
> against what it was n-time_interval_ago:
> redis_cluster_known_nodes != avg_over_time 
> (redis_cluster_known_nodes[1d:4h])
>
> It's less than ideal since it's not using a static, expected value of total 
> cluster nodes, and it would match when the cluster nodes become what is 
> expected, but I can deal with that for now.
>
> Thanks for your help! 
>
> On Tuesday, October 18, 2022 at 7:27:10 AM UTC-4 marc koser wrote:
>
>> > So really it boils down to, what's a "node" and how do you count them? 
>>  Is a single "node" a whole cluster, or is a cluster a collection of nodes?
>>
>> A node is a redis service that is part of a cluster (id'ed by the `group` 
>> label), so a cluster is a collection of nodes. The sum of all nodes is a 
>> determinate and, under normal circumstances, a static value but since a 
>> redis 'node' is never forgotten unless told to I want to alert on this case 
>> since it can skew the interpolation of other metrics.
>>
>> > In particular, what do these metrics mean?
>> > 
>> > redis_cluster_known_nodes{group="group-a", instance="node-1", 
>> job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
>> > redis_cluster_known_nodes{group="group-a", instance="node-2", 
>> job="redis-cluster", service="exporter-redis-6379"} 11
>> > redis_cluster_known_nodes{group="group-a", instance="node-3", 
>> job="redis-cluster", service="exporter-redis-6379"} 16
>> > redis_cluster_known_nodes{group="group-a", instance="node-4", 
>> job="redis-cluster", service="exporter-redis-6379"} 16
>> > redis_cluster_known_nodes{group="group-a", instance="node-5", 
>> job="redis-cluster", service="exporter-redis-6379"} 16
>>
>> This represents the state of all known redis nodes belonging to a single 
>> cluster relative to a running node.
>>
>> > They are all the same "service", but how come instance "node-1" 
>> contains or sees 10 "nodes", but instance "node-2" contains or sees 11 
>> "nodes", and the other instances contain or see 16 "nodes"?  Perhaps this 
>> inconsistency is the error you're trying to detect - in which case, what do 
>> you think is the correct number of nodes?
>>
>> This is indeed the scenario I'm attempting to query for. In this case; 
>> when a node is joined to the cluster but is unreachable for any reason (ie: 
>> redis is uninstalled / re-installed and the node rejoins the cluster) the 
>> node's ID changes (the new ID is valid and reachable, the old ID is no 
>> longer valid and unreachable).
>>
>> The correct value is 10: 5 `instance`'s x 2 `service`'s
>>
>> > Let's say 16 is the correct answer for group="group-a" and 
>> service="exporter-redis-6379".  Perhaps you didn't show the full set of 
>> "up" metrics.  In which case, I'd first try to build an "up" query which 
>> gives the expected answer 16 on the right-hand side.  Maybe something like 
>> this:
>> >
>> > count by (service, group) (up{service=~"exporter-redis-.*"})
>> >
>> > What does that expression show?
>>
>> {group="group-a", service="exporter-redis-6379"} 5
>> {group="group-a", service="exporter-redis-6380"} 5
>>
>> > When you have that part working, then we can work on matching the LHS. 
>>  Since each *instance* seems to have its own distinct idea of the total 
>> number of nodes, then I expect this requires an N:1 match on 
>> (group,service).  That is, there is 1 "should be" value for a given 
>> (service,group) on the RHS, and multiple nodes each with their own count of 
>> (service,group) on the LHS.
>>
>> That sounds accurate
>>
>> > If that's the case, it might end up something like this:
>> > 
>> > redis_cluster_known_nodes != on (service, group) group left() count 
>> by (service, group) (up{service=~"exporter-redis-.*"})
>> > 
>> > but at this point I'm just speculating.
>>
>> This gives the same result as before.
>>
>> I'll keep plugging away at this to see what I can come up with.
>>
>> On Tuesday, October 18, 2022 at 3:36:49 AM UTC-4 Brian Candler wrote:
>>
>>> Sorry, I missed an underscore there.
>>>
>>>redis_cluster_known_nodes != on (service, group) *group_left*() 
>>> count by (service, group) (up{service=~"exporter-redis-.*"})
>>>
>>>



[prometheus-users] Re: PromQL: multiple queries with dependent values

2022-10-18 Thread marc koser
Perhaps an easier option would be to compare redis_cluster_known_nodes 
against what it was n-time_interval_ago:
redis_cluster_known_nodes != avg_over_time 
(redis_cluster_known_nodes[1d:4h])

It's less than ideal since it's not using a static, expected value of total 
cluster nodes, and it would match when the cluster nodes become what is 
expected, but I can deal with that for now.

Thanks for your help! 

On Tuesday, October 18, 2022 at 7:27:10 AM UTC-4 marc koser wrote:

> > So really it boils down to, what's a "node" and how do you count them? 
>  Is a single "node" a whole cluster, or is a cluster a collection of nodes?
>
> A node is a redis service that is part of a cluster (id'ed by the `group` 
> label), so a cluster is a collection of nodes. The sum of all nodes is a 
> determinate and, under normal circumstances, a static value but since a 
> redis 'node' is never forgotten unless told to I want to alert on this case 
> since it can skew the interpolation of other metrics.
>
> > In particular, what do these metrics mean?
> > 
> > redis_cluster_known_nodes{group="group-a", instance="node-1", 
> job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
> > redis_cluster_known_nodes{group="group-a", instance="node-2", 
> job="redis-cluster", service="exporter-redis-6379"} 11
> > redis_cluster_known_nodes{group="group-a", instance="node-3", 
> job="redis-cluster", service="exporter-redis-6379"} 16
> > redis_cluster_known_nodes{group="group-a", instance="node-4", 
> job="redis-cluster", service="exporter-redis-6379"} 16
> > redis_cluster_known_nodes{group="group-a", instance="node-5", 
> job="redis-cluster", service="exporter-redis-6379"} 16
>
> This represents the state of all known redis nodes belonging to a single 
> cluster relative to a running node.
>
> > They are all the same "service", but how come instance "node-1" contains 
> or sees 10 "nodes", but instance "node-2" contains or sees 11 "nodes", and 
> the other instances contain or see 16 "nodes"?  Perhaps this inconsistency 
> is the error you're trying to detect - in which case, what do you think is 
> the correct number of nodes?
>
> This is indeed the scenario I'm attempting to query for. In this case; 
> when a node is joined to the cluster but is unreachable for any reason (ie: 
> redis is uninstalled / re-installed and the node rejoins the cluster) the 
> node's ID changes (the new ID is valid and reachable, the old ID is no 
> longer valid and unreachable).
>
> The correct value is 10: 5 `instance`'s x 2 `service`'s
>
> > Let's say 16 is the correct answer for group="group-a" and 
> service="exporter-redis-6379".  Perhaps you didn't show the full set of 
> "up" metrics.  In which case, I'd first try to build an "up" query which 
> gives the expected answer 16 on the right-hand side.  Maybe something like 
> this:
> >
> > count by (service, group) (up{service=~"exporter-redis-.*"})
> >
> > What does that expression show?
>
> {group="group-a", service="exporter-redis-6379"} 5
> {group="group-a", service="exporter-redis-6380"} 5
>
> > When you have that part working, then we can work on matching the LHS. 
>  Since each *instance* seems to have its own distinct idea of the total 
> number of nodes, then I expect this requires an N:1 match on 
> (group,service).  That is, there is 1 "should be" value for a given 
> (service,group) on the RHS, and multiple nodes each with their own count of 
> (service,group) on the LHS.
>
> That sounds accurate
>
> > If that's the case, it might end up something like this:
> > 
> > redis_cluster_known_nodes != on (service, group) group left() count 
> by (service, group) (up{service=~"exporter-redis-.*"})
> > 
> > but at this point I'm just speculating.
>
> This gives the same result as before.
>
> I'll keep plugging away at this to see what I can come up with.
>
> On Tuesday, October 18, 2022 at 3:36:49 AM UTC-4 Brian Candler wrote:
>
>> Sorry, I missed an underscore there.
>>
>>redis_cluster_known_nodes != on (service, group) *group_left*() count 
>> by (service, group) (up{service=~"exporter-redis-.*"})
>>
>>



[prometheus-users] Re: PromQL: multiple queries with dependent values

2022-10-18 Thread marc koser
> So really it boils down to, what's a "node" and how do you count them? 
 Is a single "node" a whole cluster, or is a cluster a collection of nodes?

A node is a redis service that is part of a cluster (id'ed by the `group` 
label), so a cluster is a collection of nodes. The sum of all nodes is a 
determinate and, under normal circumstances, a static value but since a 
redis 'node' is never forgotten unless told to I want to alert on this case 
since it can skew the interpolation of other metrics.

> In particular, what do these metrics mean?
> 
> redis_cluster_known_nodes{group="group-a", instance="node-1", 
job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
> redis_cluster_known_nodes{group="group-a", instance="node-2", 
job="redis-cluster", service="exporter-redis-6379"} 11
> redis_cluster_known_nodes{group="group-a", instance="node-3", 
job="redis-cluster", service="exporter-redis-6379"} 16
> redis_cluster_known_nodes{group="group-a", instance="node-4", 
job="redis-cluster", service="exporter-redis-6379"} 16
> redis_cluster_known_nodes{group="group-a", instance="node-5", 
job="redis-cluster", service="exporter-redis-6379"} 16

This represents the state of all known redis nodes belonging to a single 
cluster relative to a running node.

> They are all the same "service", but how come instance "node-1" contains 
or sees 10 "nodes", but instance "node-2" contains or sees 11 "nodes", and 
the other instances contain or see 16 "nodes"?  Perhaps this inconsistency 
is the error you're trying to detect - in which case, what do you think is 
the correct number of nodes?

This is indeed the scenario I'm attempting to query for. In this case; when 
a node is joined to the cluster but is unreachable for any reason (ie: 
redis is uninstalled / re-installed and the node rejoins the cluster) the 
node's ID changes (the new ID is valid and reachable, the old ID is no 
longer valid and unreachable).

The correct value is 10: 5 `instance`'s x 2 `service`'s

> Let's say 16 is the correct answer for group="group-a" and 
service="exporter-redis-6379".  Perhaps you didn't show the full set of 
"up" metrics.  In which case, I'd first try to build an "up" query which 
gives the expected answer 16 on the right-hand side.  Maybe something like 
this:
>
> count by (service, group) (up{service=~"exporter-redis-.*"})
>
> What does that expression show?

{group="group-a", service="exporter-redis-6379"} 5
{group="group-a", service="exporter-redis-6380"} 5

> When you have that part working, then we can work on matching the LHS. 
 Since each *instance* seems to have its own distinct idea of the total 
number of nodes, then I expect this requires an N:1 match on 
(group,service).  That is, there is 1 "should be" value for a given 
(service,group) on the RHS, and multiple nodes each with their own count of 
(service,group) on the LHS.

That sounds accurate

> If that's the case, it might end up something like this:
> 
> redis_cluster_known_nodes != on (service, group) group left() count 
by (service, group) (up{service=~"exporter-redis-.*"})
> 
> but at this point I'm just speculating.

This gives the same result as before.

I'll keep plugging away at this to see what I can come up with.

On Tuesday, October 18, 2022 at 3:36:49 AM UTC-4 Brian Candler wrote:

> Sorry, I missed an underscore there.
>
>redis_cluster_known_nodes != on (service, group) *group_left*() count 
> by (service, group) (up{service=~"exporter-redis-.*"})
>
>



[prometheus-users] Re: PromQL: multiple queries with dependent values

2022-10-18 Thread Brian Candler
Sorry, I missed an underscore there.

   redis_cluster_known_nodes != on (service, group) *group_left*() count by 
(service, group) (up{service=~"exporter-redis-.*"})



[prometheus-users] Re: PromQL: multiple queries with dependent values

2022-10-18 Thread Brian Candler
If you run the two halves of the query separately:

redis_cluster_known_nodes

and

count by (instance, service, group) (up{service=~"exporter-redis-.*"})

then I think the reason will become clear.

If that set of "up" metrics is complete, then I'd expect the "count by" 
results for node-1 to be to be

{group="group-a",instance="node-1",service="exporter-redis-6379"} 1
{group="group-a",instance="node-1",service="exporter-redis-6380"} 1

and these values (of 1) are clearly different to

redis_cluster_known_nodes{group="group-a", instance="node-1", 
job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
redis_cluster_known_nodes{group="group-a", instance="node-1", 
job="redis-cluster", service="exporter-redis-6380", team="sre"} 10

Aside: the "count by" seems superfluous here, since every "up" metric has a 
distinct combination of (instance,service,group).  I guess it ensures that 
up values of 0 are turned into 1.
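
That aside can be made concrete with a small Python sketch (hypothetical data,
not Prometheus internals): "count by" tallies series per group regardless of
their sample values, while "sum by" adds the values, so a failed scrape
(up == 0) still counts as one series.

```python
# Sketch of "count by" vs "sum by" over up: counting series ignores their
# values, so an instance with up == 0 still contributes 1 to the count.
from collections import defaultdict

def aggregate(series, by, how):
    groups = defaultdict(list)
    for labels, value in series:
        groups[tuple(labels[k] for k in by)].append(value)
    agg = len if how == "count" else sum
    return {key: agg(values) for key, values in groups.items()}

up = [
    ({"group": "group-a", "instance": "node-1", "service": "exporter-redis-6379"}, 1),
    ({"group": "group-a", "instance": "node-2", "service": "exporter-redis-6379"}, 0),  # failed scrape
]
counts = aggregate(up, ("service", "group"), "count")  # counts both series
sums = aggregate(up, ("service", "group"), "sum")      # counts only the success
```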

Without knowing more about what you're trying to do and what these metrics 
represent, I can't really help.  A value of redis_cluster_known_nodes of 10 
suggests there are 10 "nodes" of some sort, whatever they are.  But the 
"up" metric will only be 1 or 0 (success or fail on scrape).  If you had a 
separate scrape target for each node then you could count or sum these to 
get the number of nodes, but the list of "up" metrics you showed suggests 
there's only one scrape job for each instance+service combination.

So really it boils down to, what's a "node" and how do you count them?  Is 
a single "node" a whole cluster, or is a cluster a collection of nodes?

In particular, what do these metrics mean?

redis_cluster_known_nodes{group="group-a", instance="node-1", 
job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
redis_cluster_known_nodes{group="group-a", instance="node-2", 
job="redis-cluster", service="exporter-redis-6379"} 11
redis_cluster_known_nodes{group="group-a", instance="node-3", 
job="redis-cluster", service="exporter-redis-6379"} 16
redis_cluster_known_nodes{group="group-a", instance="node-4", 
job="redis-cluster", service="exporter-redis-6379"} 16
redis_cluster_known_nodes{group="group-a", instance="node-5", 
job="redis-cluster", service="exporter-redis-6379"} 16

They are all the same "service", but how come instance "node-1" contains or 
sees 10 "nodes", but instance "node-2" contains or sees 11 "nodes", and the 
other instances contain or see 16 "nodes"?  Perhaps this inconsistency is 
the error you're trying to detect - in which case, what do you think is the 
correct number of nodes?

Let's say 16 is the correct answer for group="group-a" and 
service="exporter-redis-6379".  Perhaps you didn't show the full set of 
"up" metrics.  In which case, I'd first try to build an "up" query which 
gives the expected answer 16 on the right-hand side.  Maybe something like 
this:

count by (service, group) (up{service=~"exporter-redis-.*"})

What does that expression show?

When you have that part working, then we can work on matching the LHS.  
Since each *instance* seems to have its own distinct idea of the total 
number of nodes, then I expect this requires an N:1 match on 
(group,service).  That is, there is 1 "should be" value for a given 
(service,group) on the RHS, and multiple nodes each with their own count of 
(service,group) on the LHS.

If that's the case, it might end up something like this:

redis_cluster_known_nodes != on (service, group) group left() count by 
(service, group) (up{service=~"exporter-redis-.*"})

but at this point I'm just speculating.
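
The N:1 matching being proposed can be sketched in Python (hypothetical helper
and data, not Prometheus internals): with "on (service, group) group_left()",
each left-hand series is compared against the single right-hand series sharing
its (service, group) labels, and "!=" keeps only the mismatches.

```python
# Rough sketch of "lhs != on (service, group) group_left() rhs":
# many LHS series match one RHS series per (service, group) key; the
# comparison keeps LHS series whose value differs from that RHS value.
def n_to_one_neq(lhs, rhs, on=("service", "group")):
    rhs_index = {tuple(labels[k] for k in on): value for labels, value in rhs}
    kept = []
    for labels, value in lhs:
        key = tuple(labels[k] for k in on)
        if key in rhs_index and value != rhs_index[key]:
            kept.append((labels, value))
    return kept

known = [
    ({"instance": "node-1", "service": "exporter-redis-6379", "group": "group-a"}, 10),
    ({"instance": "node-2", "service": "exporter-redis-6379", "group": "group-a"}, 11),
]
expected = [({"service": "exporter-redis-6379", "group": "group-a"}, 10)]
alerts = n_to_one_neq(known, expected)  # only node-2 (11 != 10) remains
```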

On Monday, 17 October 2022 at 21:12:49 UTC+1 marc.k...@gmail.com wrote:

> Thanks for the pointer Brian.
>
> From what you suggested; I updated my query to include `service` rather 
> than `job` to cover the different values (representing either redis service 
> on each `instance`), however I'm still not getting the results I expect:
>
> query: 
> redis_cluster_known_nodes != on (instance, service, group) count by 
> (instance, service, group) (up{service=~"exporter-redis-.*"})
>
> result:
> {group="group-a", instance="node-1", service="exporter-redis-6379"} 10
> {group="group-a", instance="node-1", service="exporter-redis-6380"} 10
> {group="group-a", instance="node-2", service="exporter-redis-6379"} 11
> {group="group-a", instance="node-2", service="exporter-redis-6380"} 16
> {group="group-a", instance="node-3", service="exporter-redis-6379"} 16
> {group="group-a", instance="node-3", service="exporter-redis-6380"} 16
> {group="group-a", instance="node-4", service="exporter-redis-6379"} 16
> {group="group-a", instance="node-4", service="exporter-redis-6380"} 16
> {group="group-a", instance="node-5", service="exporter-redis-6379"} 16
> {group="group-a", instance="node-5", service="exporter-redis-6380"} 16
>
> I would expect only those whose count is != 10 to be included in the result.
>
>
> Here's a metric sample of those used in the query:
> ``` 
> up{group="group-a", instance="node

[prometheus-users] Re: PromQL: multiple queries with dependent values

2022-10-17 Thread marc koser
Thanks for the pointer Brian.

From what you suggested, I updated my query to include `service` rather 
than `job` to cover the different values (representing either redis service 
on each `instance`); however, I'm still not getting the results I expect:

query: 
redis_cluster_known_nodes != on (instance, service, group) count by 
(instance, service, group) (up{service=~"exporter-redis-.*"})

result:
{group="group-a", instance="node-1", service="exporter-redis-6379"} 10
{group="group-a", instance="node-1", service="exporter-redis-6380"} 10
{group="group-a", instance="node-2", service="exporter-redis-6379"} 11
{group="group-a", instance="node-2", service="exporter-redis-6380"} 16
{group="group-a", instance="node-3", service="exporter-redis-6379"} 16
{group="group-a", instance="node-3", service="exporter-redis-6380"} 16
{group="group-a", instance="node-4", service="exporter-redis-6379"} 16
{group="group-a", instance="node-4", service="exporter-redis-6380"} 16
{group="group-a", instance="node-5", service="exporter-redis-6379"} 16
{group="group-a", instance="node-5", service="exporter-redis-6380"} 16

I would expect only those whose count is != 10 to be included in the result.


Here's a metric sample of those used in the query:
``` 
up{group="group-a", instance="node-1", job="redis-cluster", 
service="exporter-redis-6379", team="sre"} 1
up{group="group-a", instance="node-1", job="redis-cluster", 
service="exporter-redis-6380", team="sre"} 1
up{group="group-a", instance="node-2", job="redis-cluster", 
service="exporter-redis-6379"} 1
up{group="group-a", instance="node-2", job="redis-cluster", 
service="exporter-redis-6380"} 1
up{group="group-a", instance="node-3", job="redis-cluster", 
service="exporter-redis-6379"} 1
up{group="group-a", instance="node-3", job="redis-cluster", 
service="exporter-redis-6380"} 1
up{group="group-a", instance="node-4", job="redis-cluster", 
service="exporter-redis-6379"} 1
up{group="group-a", instance="node-4", job="redis-cluster", 
service="exporter-redis-6380"} 1
up{group="group-a", instance="node-5", job="redis-cluster", 
service="exporter-redis-6379"} 1
up{group="group-a", instance="node-5", job="redis-cluster", 
service="exporter-redis-6380"} 1

redis_cluster_known_nodes{group="group-a", instance="node-1", 
job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
redis_cluster_known_nodes{group="group-a", instance="node-1", 
job="redis-cluster", service="exporter-redis-6380", team="sre"} 10
redis_cluster_known_nodes{group="group-a", instance="node-2", 
job="redis-cluster", service="exporter-redis-6379"} 11
redis_cluster_known_nodes{group="group-a", instance="node-2", 
job="redis-cluster", service="exporter-redis-6380"} 16
redis_cluster_known_nodes{group="group-a", instance="node-3", 
job="redis-cluster", service="exporter-redis-6379"} 16
redis_cluster_known_nodes{group="group-a", instance="node-3", 
job="redis-cluster", service="exporter-redis-6380"} 16
redis_cluster_known_nodes{group="group-a", instance="node-4", 
job="redis-cluster", service="exporter-redis-6379"} 16
redis_cluster_known_nodes{group="group-a", instance="node-4", 
job="redis-cluster", service="exporter-redis-6380"} 16
redis_cluster_known_nodes{group="group-a", instance="node-5", 
job="redis-cluster", service="exporter-redis-6379"} 16
redis_cluster_known_nodes{group="group-a", instance="node-5", 
job="redis-cluster", service="exporter-redis-6380"} 16
```
On Thursday, October 13, 2022 at 9:17:55 AM UTC-4 Brian Candler wrote:

> Sorry, second to last sentence was unclear.  What I meant was:
>
>
> *If the LHS vector contains N metrics with a particular value of the 
> "group" label, which correspond to exactly 1 metric on the RHS with the 
> matching label value, or vice versa, then you can use N:1 matching.*
> On Thursday, 13 October 2022 at 14:13:42 UTC+1 Brian Candler wrote:
>
>> > Is it possible to have one side of a query limit the results of another 
>> part of the same query?
>>
>> Yes, but it depends on exactly what you mean. The details are here:
>> https://prometheus.io/docs/prometheus/latest/querying/operators/
>> It depends on whether you can construct vectors for the LHS and RHS which 
>> have corresponding labels.
>>
>> If you can give some specific examples of the metrics themselves - 
>> including all their labels - then we can see whether it's possible to do 
>> what you want in PromQL.  Right now the requirements are unclear.
>>
>>
>> *> redis_cluster_known_nodes != 
>> scalar(count(up{service=~"redis-exporter"}))*
>> > 
>> > The shared label value would be something like, *group="cluster-a" *and 
>> should not evaluate metrics where *group="cluster-b"*
>>
>> You need to arrange both LHS and RHS to have some corresponding labels 
>> before you can combine them with any operator such as !=.  The RHS has no 
>> "group" label at the moment, in fact it's not even a vector, but you could 
>> do:
>>
>> count by (group) (up{service="redis-exporter"})
>>
>> Then, assuming that redis_cluster_known_nodes also has a "group" label, 
>> you can do:
>>
>> redis_cluster_known_nodes != on (group) count by (group) 
>> (up{service="redis-exporter"})

[prometheus-users] Re: PromQL

2022-10-17 Thread Brian Candler
The expression you've written doesn't really make much sense.  If you have 
a metric "disk_used_percent", which runs between 0 and 100 (presumably), 
why are you summing it by host?  This means that if one host had three 
disks, each 40% used, that the result would be "120% used" and trigger an 
alert unnecessarily.

I would expect the expression to be simply:

expr: disk_used_percent > 85

> Now for that I need to create 2 rules, one for each severity. Now I have a 
question: can we create one query for both severities, like range between 85-95 
warning and 95 up critical? 

No, you were right the first time: you need one rule for 85%+ and one for 
95%+

You can then use inhibit rules in Alertmanager so that if the 95%+ alert is 
firing, it inhibits sending the 85%+ one.  To do this you'll need to add 
labels to your alerts, and set up the inhibit rules appropriately.
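A minimal inhibit-rule sketch for alertmanager.yml (the alert and label names 
here are illustrative assumptions, not from the original post):

```yaml
inhibit_rules:
  # If the critical (95%+) alert fires for a filesystem, suppress the
  # warning (85%+) alert for the same filesystem.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['instance', 'device', 'mountpoint']
```

The labels in `equal` must match exactly between the two alerts, so both 
rules need to carry the same identifying labels for the filesystem.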

Personally though, I find such rules difficult to maintain and irritating. 
Suppose you have one machine which is sitting at 88% disk full, but is 
working perfectly normally.  Do you want it to be continuously alerting?  
Suppose you've already done all the free space tidying you can.  Are you 
*really* going to add more disk space to this machine, just to bring the 
usage under 85% to silence the alert?  Probably not (unless it's a VM and 
can be grown easily). However, once you start to accept continuously firing 
alerts, then you'll find that everyone ignores them, and then *real* 
problems get lost amongst the noise.

You might decide you want to have different thresholds for each 
filesystem.  But then either you end up with lots of alerting rules, or you 
need to put the thresholds in their own timeseries, as described here:
https://www.robustperception.io/using-time-series-as-alert-thresholds
- and this is a pain to maintain.

Personally, I've ditched all static alerting thresholds on disk space.  
Instead I have rules for when the filesystem is completely full(*), plus 
rules which look at how fast the filesystem is growing, and predict when 
they will be full if they continue to grow at the current rate.  Examples:

- name: DiskRate10m
  interval: 1m
  rules:
  # Warn if rate of growth over last 10 minutes means filesystem will fill
  # in 2 hours
  - alert: DiskFilling10m
    expr: |
      node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
        (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[10m], 7200) < 0)) * 7200
    for: 20m
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 10m growth rate'

- name: DiskRate3h
  interval: 10m
  rules:
  # Warn if rate of growth over last 3 hours means filesystem will fill
  # in 2 days
  - alert: DiskFilling3h
    expr: |
      node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
        (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 172800) < 0)) * 172800
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 3h growth rate'

- name: DiskRate12h
  interval: 1h
  rules:
  # Warn if rate of growth over last 12 hours means filesystem will fill
  # in 7 days
  - alert: DiskFilling12h
    expr: |
      node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
        (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[12h], 604800) < 0)) * 604800
    for: 24h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 12h growth rate'


For an explanation of how these rules work 
see https://groups.google.com/g/prometheus-users/c/PCT4MJjFFgI/m/kVfOW069BQAJ

(*) In practice I also alert at *just below* full, e.g.

- name: DiskSpace
  interval: 1m
  rules:
  # Alert if any filesystem has less than 100MB available space (except for
  # filesystems which are smaller than 150MB)
  - alert: DiskFull
    expr: |
      node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"} < 100e6
        unless node_filesystem_size_bytes{fstype!~"fuse.*|nfs.*"} < 150e6
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: 'Filesystem full or less than 100MB free space'

I find this helpful for /boot partitions where if they do get completely 
full with partially-installed kernel updates, it's tricky to fix.  But I 
still wouldn't "alert" in the sense of getting someone out of bed at 3am - 
unless the system is failing in a way that your users or customers would 
notice (which is something you should be checking and alerting on 
separately), this is something that can be fixed at leisure.

Finally, I can strongly recommend this "philosophy on alerting":
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/
You might want to consider whether some of these system

[prometheus-users] Re: PromQL: multiple queries with dependent values

2022-10-13 Thread Brian Candler
Sorry, second to last sentence was unclear.  What I meant was:


*If the LHS vector contains N metrics with a particular value of the 
"group" label, which correspond to exactly 1 metric on the RHS with the 
matching label value, or vice versa, then you can use N:1 matching.*
On Thursday, 13 October 2022 at 14:13:42 UTC+1 Brian Candler wrote:

> > Is it possible to have one side of a query limit the results of another 
> part of the same query?
>
> Yes, but it depends on exactly what you mean. The details are here:
> https://prometheus.io/docs/prometheus/latest/querying/operators/
> It depends on whether you can construct vectors for the LHS and RHS which 
> have corresponding labels.
>
> If you can give some specific examples of the metrics themselves - 
> including all their labels - then we can see whether it's possible to do 
> what you want in PromQL.  Right now the requirements are unclear.
>
>
> *> redis_cluster_known_nodes != 
> scalar(count(up{service=~"redis-exporter"}))*
> > 
> > The shared label value would be something like, *group="cluster-a" *and 
> should not evaluate metrics where *group="cluster-b"*
>
> You need to arrange both LHS and RHS to have some corresponding labels 
> before you can combine them with any operator such as !=.  The RHS has no 
> "group" label at the moment, in fact it's not even a vector, but you could 
> do:
>
> count by (group) (up{service="redis-exporter"})
>
> Then, assuming that redis_cluster_known_nodes also has a "group" label, 
> you can do:
>
> redis_cluster_known_nodes != on (group) count by (group) 
> (up{service="redis-exporter"})
>
> This will work as long as the LHS and RHS both have exactly *one* metric 
> for a given value of the "group" label.
>
> If the LHS has N values of "group" for 1 on the RHS, or vice versa, then 
> you can use N:1 matching as described in the documentation ("group left" or 
> "group right").
>
> If there are multiple matches on both LHS and RHS for the same value of 
> group, then the query will fail.  You will have to include some more labels 
> in the on(...) list to get a unique match.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/5f17c0a4-e1aa-447c-acdc-561b0a807d9an%40googlegroups.com.


[prometheus-users] Re: PromQL: multiple queries with dependent values

2022-10-13 Thread Brian Candler
> Is it possible to have one side of a query limit the results of another 
part of the same query?

Yes, but it depends on exactly what you mean. The details are here:
https://prometheus.io/docs/prometheus/latest/querying/operators/
It depends on whether you can construct vectors for the LHS and RHS which 
have corresponding labels.

If you can give some specific examples of the metrics themselves - 
including all their labels - then we can see whether it's possible to do 
what you want in PromQL.  Right now the requirements are unclear.


*> redis_cluster_known_nodes != 
scalar(count(up{service=~"redis-exporter"}))*
> 
> The shared label value would be something like, *group="cluster-a" *and 
should not evaluate metrics where *group="cluster-b"*

You need to arrange both LHS and RHS to have some corresponding labels 
before you can combine them with any operator such as !=.  The RHS has no 
"group" label at the moment, in fact it's not even a vector, but you could 
do:

count by (group) (up{service="redis-exporter"})

Then, assuming that redis_cluster_known_nodes also has a "group" label, you 
can do:

redis_cluster_known_nodes != on (group) count by (group) 
(up{service="redis-exporter"})

This will work as long as the LHS and RHS both have exactly *one* metric 
for a given value of the "group" label.

If the LHS has N values of "group" for 1 on the RHS, or vice versa, then 
you can use N:1 matching as described in the documentation ("group left" or 
"group right").

If there are multiple matches on both LHS and RHS for the same value of 
group, then the query will fail.  You will have to include some more labels 
in the on(...) list to get a unique match.
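
As a sketch of the N:1 case described above (assuming multiple 
redis_cluster_known_nodes series per "group", e.g. one per exporter port, 
with a single per-group count on the RHS):

```
redis_cluster_known_nodes
  != on (group) group_left ()
count by (group) (up{service="redis-exporter"})
```

Each LHS series is compared against the single per-group count, and the 
series that survive the filter keep their full label sets.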



[prometheus-users] Re: Promql

2022-07-24 Thread Brian Candler
Try this as a starting point:

some_metric * scalar(hour() < bool 12)

On Friday, 22 July 2022 at 19:42:47 UTC+1 hamidd...@gmail.com wrote:

> Hi everyone 
> i m looking for a formula that can make a query just for nights for 
> example from 12 pm to 12 am.
> can anyone help me,
> cheers.
>
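
Note that hour() uses UTC. The multiplication above zeroes the metric 
outside the window rather than dropping it; if you'd rather drop the samples 
entirely, a variant sketch (taking "12 pm to 12 am" as hours 12-23):

```
some_metric and on() (hour() >= 12)
```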



[prometheus-users] Re: promql optimization

2021-11-24 Thread Brian Candler
Generate additional static timeseries such as:

es_exporter_percentiles_values_95_0_threshold{httppath="path1",app="app1"} 
10
es_exporter_percentiles_values_95_0_threshold{httppath="path2",app="app2"} 5
es_exporter_percentiles_values_95_0_threshold{httppath="path3",app="app3"} 3

Then your alerting rule becomes just:

es_exporter_percentiles_values_95_0 > on(httppath,app) 
es_exporter_percentiles_values_95_0_threshold

Those static timeseries can be put onto a webserver that you scrape, or 
using node_exporter textfile_collector.  You may need to add extra labels 
to your threshold timeseries and on(...) clause if there is overlap, e.g. 
between different environments, to ensure there's always a 1:1 match 
between value and threshold.

Original idea here: 
https://www.robustperception.io/using-time-series-as-alert-thresholds
which also shows how you can have a default threshold for those which 
aren't explicitly set.
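
A sketch of that default-threshold pattern applied here (the fallback value 
5 and the label list are assumptions, not from the original post):

```
  es_exporter_percentiles_values_95_0
> on (httppath, app)
  (
      es_exporter_percentiles_values_95_0_threshold
   or (0 * count by (httppath, app) (es_exporter_percentiles_values_95_0) + 5)
  )
```

The `or` keeps explicit thresholds where they exist and falls back to 5 
elsewhere; `0 * count by (...)` just manufactures a series with the right 
labels for every (httppath, app) pair present in the data.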

On Wednesday, 24 November 2021 at 10:58:21 UTC ishu...@gmail.com wrote:

> Hi Team,
>
> Any suggestion is very much appreciated.
>
> I have an alert for breach of the 95th-percentile threshold on API requests. 
> The problem is that different API requests have different thresholds, so I 
> cannot hard-code the alerts: I would have to add 10 alerts for 10 different 
> APIs with their thresholds. 
>
> The alerts would like something like this
> es_exporter_percentiles_values_95_0{httppath="path1",app="app1"} > 10
> es_exporter_percentiles_values_95_0{httppath="path2",app="app2"} > 5
> es_exporter_percentiles_values_95_0{httppath="path3",app="app3"} > 3
>
> I don't want to use `or`, as again I wanted to do this in a smarter way. 
> I tried using recording rules, but that would result in adding recording 
> rules for 20 apps * 4 envs * 5 http paths. 
>
> Is there a better way, so that I have only one alert definition that can be 
> applied against all apps, paths and envs? 
>
> Thanks
> Eswar
>
>
>



[prometheus-users] Re: Promql JOIN many-to-many matching

2021-10-07 Thread Brian Candler
(Grr: I gave a detailed reply yesterday and Google Groups deleted it.  I 
will try one more time)

Let's just think about kubelet_version.  The nodes of a cluster may have 
different kubelet versions.  The deployment is deployed to a cluster, and 
hence could be deployed to any or all nodes of that cluster.  Therefore, 
there is not necessarily a single value for "kubelet_version" that you can 
associate with a deployment.

What you *can* do is to group together the kubelet versions:

count by (instance, kubelet_version) (kube_node_info) # I'm 
assuming that "instance" is the cluster name

This will give a unique value for each (instance, kubelet_version) pair.  
If all the nodes in a given cluster have identical kubelet versions, then 
you'll just get 1 metric per cluster.  At that point you can do a N:1 join:

kube_deployment_labels * on (instance) group_left(kubelet_version) 
(count by (instance, kubelet_version) (kube_node_info))

That will annotate each deployment with the kubelet_version, as you 
wanted.  But if the cluster has inconsistent kubelet versions then it will 
fail, as expected, because it can't associate each cluster with a single 
kubelet version.



[prometheus-users] Re: Promql JOIN many-to-many matching

2021-10-07 Thread Oltion Alimerko
Hello,

The relationship between both metrics is that they have 2 fields in 
common, namespace and project. What I want to achieve is a JOIN of both 
metrics with these labels:

   1. label_chart (from kube_deployment_labels)
   2. namespace (from kube_deployment_labels) 
   3. project (from kube_deployment_labels) 
   4. kernel_version (from kube_node_info) 
   5. kubelet_version (from kube_node_info) 
   6. os_image (from kube_node_info) 

I want to visualize this result in Grafana. But this combination leads to 
N:N matching because I have N rows on each side. The solution would be to 
add some filter on at least one side so that I can get 1:N matching, but 
that is not what I wanted. The main idea was, for every project, to list 
all environments (namespaces) and then join the k8s version. But since on 
the other side I have more than one node, it is very hard to match all these labels.
On Wednesday, October 6, 2021 at 4:44:45 PM UTC+2 Brian Candler wrote:

> You can't do a many-to-many join.  Even if you could it's unclear what the 
> semantics would be.  (Would it be a cross-product, and what labels would 
> the results have?)
>
> Usually the solution is to summarise one side, or to add more fields to 
> the on (...) clause, so that there is a one-to-many relationship.
>
> After reformatting, I think the metrics you posted are these:
>
> 
> kube_deployment_labels{deployment="sdc", instance="cfor-aks-dev", 
> job="metrics-forwarder", label_app="sdc", 
> label_app_kubernetes_io_managed_by="Helm", label_chart="sdc-5.17.2-HF01", 
> label_heritage="Helm", label_release="sdc", namespace="dev-workloads", 
> project="C4R"} 1
>
> kube_deployment_labels{deployment="sdc", instance="sop-aks-dev", 
> job="metrics-forwarder", label_app="sdc", 
> label_app_kubernetes_io_managed_by="Helm", label_chart="sdc-5.17.1-b03", 
> label_heritage="Helm", label_release="sdc", namespace="dev-workloads", 
> project="SOP"} 1
>
> kube_deployment_labels{deployment="sdc", instance="sop-aks-dev", 
> job="metrics-forwarder", label_app="sdc", 
> label_app_kubernetes_io_managed_by="Helm", label_chart="sdc-5.17.1-b03", 
> label_heritage="Helm", label_release="sdc", namespace="test-workloads", 
> project="SOP"} 1
>
> kube_deployment_labels{deployment="sdc", instance="stu-aks-dev", 
> job="metrics-forwarder", label_app="sdc", 
> label_app_kubernetes_io_managed_by="Helm", 
> label_chart="sdc-5.17.2-HF04-b01", label_heritage="Helm", 
> label_release="sdc", namespace="dev-workloads", project="STU"} 1
> 
>
> 
> kube_node_info{container_runtime_version="containerd://1.4.4+azure", 
> instance="cfor-aks-dev", job="metrics-forwarder", 
> kernel_version="5.4.0-1051-azure", kubelet_version="v1.20.7", 
> kubeproxy_version="v1.20.7", node="aks-default-13254112-vmss01", 
> os_image="Ubuntu 18.04.5 LTS", pod_cidr="10.244.2.0/24", project="C4R", 
> provider_id="azure:///subscriptions/693c9868-a960-4590-b23d-7220a5a8ba04/resourceGroups/mc_rg_aks-c4r-aks-dev_c4r-aks-dev_westeurope/providers/Microsoft.Compute/virtualMachineScaleSets/aks-default-13254112-vmss/virtualMachines/1"}
>  
> 1
>
> kube_node_info{container_runtime_version="containerd://1.4.4+azure", 
> instance="cfor-aks-dev", job="metrics-forwarder", 
> kernel_version="5.4.0-1051-azure", kubelet_version="v1.20.7", 
> kubeproxy_version="v1.20.7", node="aks-default-13254112-vmss04", 
> os_image="Ubuntu 18.04.5 LTS", pod_cidr="10.244.3.0/24", project="C4R", 
> provider_id="azure:///subscriptions/693c9868-a960-4590-b23d-7220a5a8ba04/resourceGroups/mc_rg_aks-c4r-aks-dev_c4r-aks-dev_westeurope/providers/Microsoft.Compute/virtualMachineScaleSets/aks-default-13254112-vmss/virtualMachines/4"}
>  
> 1
>
> kube_node_info{container_runtime_version="containerd://1.4.4+azure", 
> instance="stu-aks-dev", job="metrics-forwarder", 
> kernel_version="5.4.0-1047-azure", kubelet_version="v1.20.7", 
> kubeproxy_version="v1.20.7", node="aks-default-36930916-vmss02", 
> os_image="Ubuntu 18.04.5 LTS", pod_cidr="10.244.3.0/24", project="STU", 
> provider_id="azure:///subscriptions/3fb69224-6feb-4c6c-9f55-0b233b82d4a2/resourceGroups/mc_rg_aks-stu-aks-dev_stu-aks-dev_westeurope/providers/Microsoft.Compute/virtualMachineScaleSets/aks-default-36930916-vmss/virtualMachines/2"}
>  
> 1
> 
>
> The question here is, is there *any* relationship between 
> kube_deployment_labels and kube_node_info, and if so, what is it?  It looks 
> like "instance" is a common label - does the instance refer to the 
> cluster?  But then a cluster can have many deployments, and a cluster can 
> also have many nodes.
>
> In Kubernetes, a "deployment" is a cluster-level object, and the nodes are 
> associated with the cluster.  But there's no relationship between a 
> deployment and a node that I can think of.
>
> Where it's more typical is to do joins between *pods* and *nodes*, because 
> each pod is running on a node. There is an N:1 relationship between pod and 
> node, and that lets you do a group_left or group_right

[prometheus-users] Re: Promql JOIN many-to-many matching

2021-10-06 Thread Brian Candler
You can't do a many-to-many join.  Even if you could it's unclear what the 
semantics would be.  (Would it be a cross-product, and what labels would 
the results have?)

Usually the solution is to summarise one side, or to add more fields to the 
on (...) clause, so that there is a one-to-many relationship.

After reformatting, I think the metrics you posted are these:


kube_deployment_labels{deployment="sdc", instance="cfor-aks-dev", 
job="metrics-forwarder", label_app="sdc", 
label_app_kubernetes_io_managed_by="Helm", label_chart="sdc-5.17.2-HF01", 
label_heritage="Helm", label_release="sdc", namespace="dev-workloads", 
project="C4R"} 1

kube_deployment_labels{deployment="sdc", instance="sop-aks-dev", 
job="metrics-forwarder", label_app="sdc", 
label_app_kubernetes_io_managed_by="Helm", label_chart="sdc-5.17.1-b03", 
label_heritage="Helm", label_release="sdc", namespace="dev-workloads", 
project="SOP"} 1

kube_deployment_labels{deployment="sdc", instance="sop-aks-dev", 
job="metrics-forwarder", label_app="sdc", 
label_app_kubernetes_io_managed_by="Helm", label_chart="sdc-5.17.1-b03", 
label_heritage="Helm", label_release="sdc", namespace="test-workloads", 
project="SOP"} 1

kube_deployment_labels{deployment="sdc", instance="stu-aks-dev", 
job="metrics-forwarder", label_app="sdc", 
label_app_kubernetes_io_managed_by="Helm", 
label_chart="sdc-5.17.2-HF04-b01", label_heritage="Helm", 
label_release="sdc", namespace="dev-workloads", project="STU"} 1



kube_node_info{container_runtime_version="containerd://1.4.4+azure", 
instance="cfor-aks-dev", job="metrics-forwarder", 
kernel_version="5.4.0-1051-azure", kubelet_version="v1.20.7", 
kubeproxy_version="v1.20.7", node="aks-default-13254112-vmss01", 
os_image="Ubuntu 18.04.5 LTS", pod_cidr="10.244.2.0/24", project="C4R", 
provider_id="azure:///subscriptions/693c9868-a960-4590-b23d-7220a5a8ba04/resourceGroups/mc_rg_aks-c4r-aks-dev_c4r-aks-dev_westeurope/providers/Microsoft.Compute/virtualMachineScaleSets/aks-default-13254112-vmss/virtualMachines/1"}
 
1

kube_node_info{container_runtime_version="containerd://1.4.4+azure", 
instance="cfor-aks-dev", job="metrics-forwarder", 
kernel_version="5.4.0-1051-azure", kubelet_version="v1.20.7", 
kubeproxy_version="v1.20.7", node="aks-default-13254112-vmss04", 
os_image="Ubuntu 18.04.5 LTS", pod_cidr="10.244.3.0/24", project="C4R", 
provider_id="azure:///subscriptions/693c9868-a960-4590-b23d-7220a5a8ba04/resourceGroups/mc_rg_aks-c4r-aks-dev_c4r-aks-dev_westeurope/providers/Microsoft.Compute/virtualMachineScaleSets/aks-default-13254112-vmss/virtualMachines/4"}
 
1

kube_node_info{container_runtime_version="containerd://1.4.4+azure", 
instance="stu-aks-dev", job="metrics-forwarder", 
kernel_version="5.4.0-1047-azure", kubelet_version="v1.20.7", 
kubeproxy_version="v1.20.7", node="aks-default-36930916-vmss02", 
os_image="Ubuntu 18.04.5 LTS", pod_cidr="10.244.3.0/24", project="STU", 
provider_id="azure:///subscriptions/3fb69224-6feb-4c6c-9f55-0b233b82d4a2/resourceGroups/mc_rg_aks-stu-aks-dev_stu-aks-dev_westeurope/providers/Microsoft.Compute/virtualMachineScaleSets/aks-default-36930916-vmss/virtualMachines/2"}
 
1


The question here is, is there *any* relationship between 
kube_deployment_labels and kube_node_info, and if so, what is it?  It looks 
like "instance" is a common label - does the instance refer to the 
cluster?  But then a cluster can have many deployments, and a cluster can 
also have many nodes.

In Kubernetes, a "deployment" is a cluster-level object, and the nodes are 
associated with the cluster.  But there's no relationship between a 
deployment and a node that I can think of.

Where it's more typical is to do joins between *pods* and *nodes*, because 
each pod is running on a node. There is an N:1 relationship between pod and 
node, and that lets you do a group_left or group_right join.
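
A typical pod-to-node join, as a sketch (assuming kube-state-metrics, where 
kube_pod_info carries a "node" label; add "instance" to the on(...) clause 
if node names repeat across clusters):

```
kube_pod_info * on (node) group_left (kubelet_version) kube_node_info
```

This annotates each pod's info series with the kubelet_version of the node 
it runs on.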

On Wednesday, 6 October 2021 at 14:04:49 UTC+1 oltion@gmail.com wrote:

> Hi all,
>
> is it possible to use in promql the JOIN with many-to-many matching?I 
> merged 2 metrics using group_left but it looks like the many-to-many 
> matching is not supported because i get the error:
>
> *"Error executing query: found duplicate series for the match group {} on 
> the left hand-side of the operation. Many-to-many matching not allowed: 
> matching labels must be unique on one side"*
>
> On both sides of JOIN i have multitiple rows 
>
> Metric 1
>
> kube_deployment_labels{label_chart=~".*sdc-5.17.*",project!=""}
>
>
> Result of metric 1 
>
> kube_deployment_labels{deployment="sdc", instance="cfor-aks-dev", job=
> "metrics-forwarder", label_app="sdc", label_app_kubernetes_io_managed_by=
> "Helm", label_chart="sdc-5.17.2-HF01", label_heritage="Helm", 
> label_release="sdc", namespace="dev-workloads", project="C4R"} 1 
> kube_deployment_labels{deployment="sdc", instance="sop-aks-dev", job=
> "metrics-forwarder", label_app="sdc", label_app_kubernetes_io_managed_by=
> "Helm", label_char

[prometheus-users] Re: Promql query

2021-09-06 Thread Brian Candler
Mixing two different metrics in the results of the same PromQL query doesn't 
make sense, because those values represent different things.  For example 
you might get back 4 values from the query, 2 of which are numbers of CPU 
cores and 2 of which are numbers of bytes, which are not comparable and 
indeed have very different magnitude.  For this reason it's intentionally 
not straightforward to do.

What you really want is to overlay two *separate* PromQL queries onto the 
same graph, where the separate queries can be displayed with different 
colours, different scales and axes etc.  That will be a function of 
whatever dashboarding tool you are using - e.g. grafana is a common 
choice.  In grafana it's easy to add multiple queries to a panel.  Note 
that if you have questions about grafana, it has its own separate 
discussion forum.

On Tuesday, 7 September 2021 at 06:10:06 UTC+1 kshiti...@gmail.com wrote:

> Hi all,
> I wanted to ask a question , I am trying to perform cpu_cores and 
> memory_bytes query on the same graph for a particular namespace.How can i 
> write a query for this .I am new to promql programming.Would appreciate it 
> if anyone could help.
> Thanks 
>



[prometheus-users] Re: PromQL Query Using Hour Function

2021-02-09 Thread Chad Thielen
I managed to figure this out. The key is to use the on() vector-matching 
modifier with an empty label list, which ignores all labels. The query now 
looks like this:

up{job="calculator"} == 0 and on() count(hour(vector(time())) >= 1 and 
hour(vector(time())) < 13)

On Monday, February 8, 2021 at 3:33:51 PM UTC-6 Chad Thielen wrote:

> Hello,
>
> I'm having some trouble writing a query for an alerting rule that we only 
> want to run at certain times of the day. This is what I original had, which 
> works, but not in the way we now need:
>
> > count(up{job="calculator"} == 0) and count(hour(vector(time())) >= 1 and 
> hour(vector(time())) < 13)
>
> Now our problem is that we need the instance label from the up metric to 
> persist to the alert so we can include it in the pager duty incident, 
> otherwise our engineers don't know what instance is down without checking 
> prometheus first. This is where I'm having problems, using count strips all 
> those labels and just basically gives a true or false of if the alert is 
> firing.
>
> Any suggestions?
>



Re: [prometheus-users] Re: promql.(*Engine).execEvalStmt very high cpu

2020-11-10 Thread Laurent Dumont
How many nodes do you have as targets? Are all of the 2000 rules the same?

With a large number of nodes and a rule that is returning a large number of
metrics, it might be a lot to process.

On Tue, Nov 10, 2020 at 1:18 AM 万锐  wrote:

> On Tuesday, November 10, 2020 at 2:16:02 PM UTC+8 万锐 wrote:
>
>> *What did you do?*
>>
>> add about 2000 rules, all of them like regexp.
>> (node_network_up{env="prod",idc!="some-idc*",interface=~"bond.*|em.*|p.*|eth.*",maintain_department="*some-depart*",server_type="aws"}
>>
>>
>> *What did you expect to see?*
>>
>> cpu load very low.
>>
>> *What did you see instead? Under which circumstances?*
>>
>> cpu load very high. Scraped a 30s cpu profile; Total samples = 290.73s
>> (962.84%).
>>
>> File: prometheus  Type: cpu  Time: Nov 6, 2020 at 4:53pm (CST)
>> Duration: 30.20s, Total samples = 290.73s (962.84%)
>> Active filters: focus=tsdb.querier.Select
>> Showing nodes accounting for 108.54s, 37.33% of 290.73s total
>> --+-
>>   flat  flat%  sum%  cum  cum%  calls calls% + context
>> --+-
>>   0.62s 100% | github.com/prometheus/prometheus/storage/tsdb.(*querier).Select :1
>>   0  0%  0%  0.62s  0.21% | github.com/prometheus/prometheus/storage/tsdb.querier.Select /go/src/github.com/prometheus/prometheus/storage/tsdb/tsdb.go:195
>>   0.62s 100% | github.com/prometheus/prometheus/storage/tsdb.convertMatcher /go/src/github.com/prometheus/prometheus/storage/tsdb/tsdb.go:270
>> --+-
>>   107.78s 99.87% | github.com/prometheus/prometheus/storage/tsdb.(*querier).Select :1
>>   0  0%  0%  107.92s  37.12% | github.com/prometheus/prometheus/storage/tsdb.querier.Select /go/src/github.com/prometheus/prometheus/storage/tsdb/tsdb.go:197
>>   107.92s 100% | github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*querier).Select /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/querier.go:87
>> --+-
>>
>> File: prometheus  Type: cpu  Time: Nov 6, 2020 at 4:53pm (CST)
>> Duration: 30.20s, Total samples = 290.73s (962.84%)
>> Active filters: focus=execEvalStmt
>> Showing nodes accounting for 225.83s, 77.68% of 290.73s total
>> --+-
>>   flat  flat%  sum%  cum  cum%  calls calls% + context
>> --+-
>>   121.45s 99.93% | github.com/prometheus/prometheus/promql.(*Engine).exec /go/src/github.com/prometheus/prometheus/promql/engine.go:366
>>   0  0%  0%  121.54s  41.81% | github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt /go/src/github.com/prometheus/prometheus/promql/engine.go:385
>>   121.52s 100% | github.com/prometheus/prometheus/promql.(*Engine).populateSeries /go/src/github.com/prometheus/prometheus/promql/engine.go:506
>>   0.02s 0.016% | github.com/prometheus/prometheus/promql.(*Engine).populateSeries /go/src/github.com/prometheus/prometheus/promql/engine.go:479
>> --+-
>>   0.01s 100% | github.com/prometheus/prometheus/promql.(*Engine).exec /go/src/github.com/prometheus/prometheus/promql/engine.go:366
>>   0  0%  0%  0.01s  0.0034% | github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt /go/src/github.com/prometheus/prometheus/promql/engine.go:399
>>   0.01s 100% | github.com/prometheus/prometheus/util/stats.(*QueryTimers).GetSpanTimer /go/src/github.com/prometheus/prometheus/util/stats/query_stats.go:156
>> --+-
>>   104.28s 100% | github.com/prometheus/prometheus/promql.(*Engine).exec /go/src/github.com/prometheus/prometheus/promql/engine.go:366
>>   0  0%  0%  104.28s  35.87% | github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt /go/src/github.com/prometheus/prometheus/promql/engine.go:411
>>   104.28s 100% | github.com/prometheus/prometheus/promql.(*evaluator).Eval /go/src/github.com/prometheus/prometheus/promql/engine.go:639
>> --+-
>>
>> *[image: 98496497-b7cebb00-227c-11eb-89d9-5962a579f756.png]*
>>
>> *Environment*
>> PowerEdge R730xd 256G 24 core
>>
>>- System information:
>>
>> Linux 3.10.0-327.el7.x86_64 x86_64
>>
>>- Prometheus version:
>>
>> prometheus, version 2.5.0 (branch: HEAD, revision:
>> 67dc912ac8b24f94a1fc478f352d25179c94ab9b) build user: root@578ab108d0b9
>> build date: 20181106-11:40:44 go version: go1.11.1
>>
>>- Alertmanager version:
>>
>> alertmanager, version 0

[prometheus-users] Re: promQL: getting data from multiple metrics in single query

2020-11-07 Thread Brian Candler
On Saturday, 7 November 2020 14:12:53 UTC, kiran wrote:
>
> In my use case I have custom metrics that I will be sending to victoria 
> metrics and will be using PromQL. 
>

Using remote_write from prometheus, or your application directly writing to 
VictoriaMetrics?  VM supports import in a bunch of different formats, such 
as influxdb line protocol and CSV, so you can choose whatever you find most 
convenient.

Since I have control over the data structure, I was thinking instead of 
> sending vertical data (one time series per metric per function), I could 
> send a flat structure, e.g. metrics as labels. Is it recommended this way, or 
> would this increase cardinality?
>

Definitely not.  The metrics need to be the metrics, not the labels.  The 
entire storage mechanism depends on this: it's the bag of labels which 
defines "what is a timeseries", and all the metrics for that timeseries are 
stored adjacent to each other, for efficient compression and retrieval.  
Plus: all the PromQL functions which operate on numbers, like sum() and 
rate() and so on, operate on the value and not the labels.
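To make that last point concrete, here is a hedged sketch (the metric names app_requests_total and the "requests" label are illustrative, not from any real exporter): PromQL functions only ever see sample values, never label values.

```promql
# Numbers stored as sample values can be aggregated and rated:
sum by (appname) (rate(app_requests_total[5m]))

# But if a number is stored in a label, e.g.
#   app_meta{appname="app1", requests="12345"} 1
# no PromQL function can sum or rate that "requests" label value.
```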

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/3ac8d6ce-91ed-4010-a517-b0a6e6c2d71bo%40googlegroups.com.


[prometheus-users] Re: promQL: getting data from multiple metrics in single query

2020-11-07 Thread kiran
Thank you Brian.
In my use case I have custom metrics that I will be sending to victoria
metrics and will be using PromQL.
The metrics constitute application level metadata and some high level
metrics for each application.
I may have 20-25 such data points for each application.

Ultimately I need to present all the data for each application in tables
more than graphs/charts.
Since I have control over the data structure, I was thinking instead of
sending vertical data (one time series per metric per function), I could
send a flat structure, e.g. metrics as labels. Is it recommended this way, or
would this increase cardinality? Or is there a better way I am not able
to see?

*Typical way*:
Metric1{appname='app1'} value timepoint
Metric2{appname='app1'} value timepoint
Metric3{appname='app1'} value timepoint
.
.
Metric25{appname='app1'} value timepoint

Metric1{appname='app2'} value timepoint
Metric2{appname='app2'} value timepoint
Metric3{appname='app2'} value timepoint
.
.
Metric25{appname='app2'} value timepoint

*Alternative way I am thinking of(one time series per function)*:
Application metadata does not change that often, so I am thinking to send
only once in 24 hrs and other metrics in 1 min interval.
I am thinking I can have 2 metrics(not really metrics, but one for
application metadata and one for actual application metrics).

app_meta{appname="app1", metric1="value", metric2="value",
metric3="value", ..., metric25="value"} 1 timepoint *[Here 1 is just a dummy value]*
app_meta{appname="app2", metric1="value", metric2="value",
metric3="value", ..., metric25="value"} 1 timepoint
app_meta{appname="app3", metric1="value", metric2="value",
metric3="value", ..., metric25="value"} 1 timepoint

app_metric{appname="app1", metric1="value", metric2="value",
metric3="value", ..., metric25="value"} 1 timepoint *[Here 1 is just a dummy value]*
app_metric{appname="app2", metric1="value", metric2="value",
metric3="value", ..., metric25="value"} 1 timepoint
app_metric{appname="app3", metric1="value", metric2="value",
metric3="value", ..., metric25="value"} 1 timepoint

I may be completely wrong with my approach. Please suggest.


On Saturday, November 7, 2020, Brian Candler  wrote:

> You can do a PromQL query like {__name__=~"foo|bar"} but that's messy if
> you also want to filter on different labels for metrics foo and bar.
>
> If you're only interested in the current values of each metric, then you
> can query the /federate endpoint where you can provide the match[]
> parameter multiple times with multiple queries.
>
> https://prometheus.io/docs/prometheus/latest/federation/#configuring-federation
>
> e.g.
>
> curl -g
> 'prometheus:9090/federate?match[]=up&match[]={__name__=~"scrape_.*"}'
>
> In any case, I wouldn't worry too much about inefficiency of separate
> queries. The main time taken is reading out the data and formatting it;
> whether that's done in a single query or spread over two queries isn't
> going to make much difference.  If you want to ensure that the table data
> lines up to the same instant in time, then pick an instant and provide the
> same time=xxx parameter to each separate query.
>



[prometheus-users] Re: promQL: getting data from multiple metrics in single query

2020-11-07 Thread Brian Candler
You can do a PromQL query like {__name__=~"foo|bar"} but that's messy if you 
also want to filter on different labels for metrics foo and bar.

If you're only interested in the current values of each metric, then you 
can query the /federate endpoint where you can provide the match[] 
parameter multiple times with multiple queries.
https://prometheus.io/docs/prometheus/latest/federation/#configuring-federation

e.g.

curl -g 
'prometheus:9090/federate?match[]=up&match[]={__name__=~"scrape_.*"}'

In any case, I wouldn't worry too much about inefficiency of separate 
queries. The main time taken is reading out the data and formatting it; 
whether that's done in a single query or spread over two queries isn't 
going to make much difference.  If you want to ensure that the table data 
lines up to the same instant in time, then pick an instant and provide the 
same time=xxx parameter to each separate query.
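For instance, a sketch of that last suggestion (assumes the standard HTTP API on port 9090; metric_one, metric_two and the timestamp are placeholders, not from the thread):

```shell
# Pin two separate table queries to the same evaluation instant so that
# their results line up. 1604750400 is an arbitrary example timestamp.
TS=1604750400
URL1="http://prometheus:9090/api/v1/query?query=metric_one&time=${TS}"
URL2="http://prometheus:9090/api/v1/query?query=metric_two&time=${TS}"
echo "$URL1"
echo "$URL2"
# Then fetch each with: curl -g "$URL1" ; curl -g "$URL2"
```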



[prometheus-users] Re: promql

2020-07-01 Thread Brian Candler
Prometheus is not a general-purpose database.

You cannot populate data using PromQL - there is no "insert" statement 
equivalent.

In fact, you cannot populate data into Prometheus' internal time series 
database (TSDB) in any way at all, except by having Prometheus scrape the 
data from an exporter.  You cannot backfill historical data, for instance.

You *can* get Prometheus to write data to a remote storage system, and read 
it back again.  There are some integrations here 
.
  
I don't see mongodb listed, so you might end up having to write that 
yourself.

It could be that some other system will suit your needs better - 
TimescaleDB, InfluxDB, VictoriaMetrics etc.



[prometheus-users] Re: Promql for filtering pods restart for the last workload deployed

2020-05-27 Thread Sally Lehman
Rules are based on the data you have piped to prometheus, usually via 
exporters. What data do you have in prometheus to create rules from? Do you 
need to send it there first? 

On Friday, May 22, 2020 at 1:34:37 PM UTC-7, Neha Gupta wrote:
>
> Hi,
>
> I have a specific requirement to set rules for pods restart notifications 
> only for the most recent workload deployed.
>
> Any help would be highly appreciated.
>
> Thanks
> Neha
>



[prometheus-users] Re: PromQL: comparison to custom metric and/or static threshold

2020-05-11 Thread Brian Candler
Here are a couple of things to help understand why the original expression 
doesn't work as expected.

1. The comparison operators are filters, returning a vector of 0 or more 
elements which is subset of all the available timeseries, not a boolean 
value.

(foo > my_custom_threshold) is comparing an instant vector with another 
instant vector.  It only gives results where foo and my_custom_threshold 
have exactly matching label sets *and* the value of foo is greater than the 
corresponding value of my_custom_threshold.  The values in the result 
vector are the values of the LHS.

(foo > 0.7) is comparing an instant vector with a scalar.  It gives results 
for *every* metric foo whose value is > 0.7.


2. (foo > x) OR (foo > y) isn't a boolean expression, it's a set (union) 
operation.

(foo > x) gives all those timeseries whose metric name is foo and value is 
> x

(foo > y) gives all those timeseries whose metric name is foo and value is 
> y

(foo > x) OR (foo > y)
will give you all the metrics from the LHS, *plus* all metrics on the RHS 
which don't have any matching label set on the LHS.

The behaviour of this effectively is "foo > min(x,y)" [although that is not 
valid PromQL].  This is why you'll always get an alert for value over 0.7; 
if you set my_custom_threshold higher, meaning that the LHS doesn't give a 
result, the RHS will fill in the gap.
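A small worked example of that union behaviour (the series and values are invented for illustration):

```promql
# Suppose: foo{a="1"} = 0.8  and  my_custom_threshold{a="1"} = 0.9
#
# (foo > my_custom_threshold)  =>  empty: 0.8 is not greater than 0.9
# (foo > 0.7)                  =>  foo{a="1"} 0.8
#
# (foo > my_custom_threshold) or (foo > 0.7)
#                              =>  foo{a="1"} 0.8
# The RHS fills the gap left by the LHS, so the alert still fires even
# though the custom threshold was not exceeded.
```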

I find that using the PromQL query interface in the prometheus web 
interface is very useful for graphing the expressions or subexpressions.  
Once you realise that "foo > 0.7" is basically just showing you the graph 
of foo, but with gaps where its value falls below 0.7, suddenly things 
become a lot clearer.



[prometheus-users] Re: PromQL: comparison to custom metric and/or static threshold

2020-05-11 Thread Brian Candler
There's an example of this in
https://www.robustperception.io/using-time-series-as-alert-thresholds


*"You could also provide a default, so only those teams wishing to override 
it need to configure a threshold. Here the default is 42:"*
The trick is to get some other timeseries which has the same labels as the 
set you're alerting on, and use that to provide a fixed default if the 
custom threshold doesn't exist.
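Adapted to the names used earlier in this thread, the pattern looks something like this (a sketch only, with a default of 0.7 rather than the article's 42; check the label matching against your actual series):

```promql
# Alert when foo exceeds its per-series custom threshold, falling back
# to a default of 0.7 wherever no my_custom_threshold series exists.
# `foo * 0 + 0.7` manufactures a series with exactly foo's label set,
# so the right-hand side always has something for foo to match against.
foo > (my_custom_threshold or (foo * 0 + 0.7))
```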
