Hmm. I do care about the status. Maybe by simplifying the question I 
oversimplified the problem too much.
I got it to pretty much work by doing this:

(sum without (dst_pod) (
     route_response_total{
       direction="outbound", grpc_status!="0", grpc_status!="",
       rt_route!="", dst="bar"}))
/ on (rt_route, pod, workload_ns)
(sum without (dst_pod) (
     route_response_total{
       direction="outbound", grpc_status="0", rt_route!="", dst="bar"}))
> 10

dst_pod denotes a specific Kubernetes pod in the "bar" service, dst denotes 
the service name, and direction="outbound" marks a counter for requests sent 
from a pod.

This gives the correct answer, but only if the denominator series is present 
(i.e. not "absent"). So if the pod has made at least one successful request, 
this works. But for a pod that has never made a successful request, the 
denominator series is missing entirely. Then the Prometheus console returns 
"no data", and no alert can be triggered in my rule group. Dividing by zero 
seems like a separate problem, but I'd still appreciate any input there.
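One idiom I'm considering for the missing-denominator case (just a sketch, 
untested, using the same labels as above): add an "or ... unless ..." branch 
that returns the error series whenever no matching success series exists, so 
an absent denominator no longer suppresses the alert:

(
  sum without (dst_pod) (route_response_total{
    direction="outbound", grpc_status!="0", grpc_status!="",
    rt_route!="", dst="bar"})
  / on (rt_route, pod, workload_ns)
  sum without (dst_pod) (route_response_total{
    direction="outbound", grpc_status="0", rt_route!="", dst="bar"})
  > 10
)
or
(
  sum without (dst_pod) (route_response_total{
    direction="outbound", grpc_status!="0", grpc_status!="",
    rt_route!="", dst="bar"})
  unless on (rt_route, pod, workload_ns)
  sum without (dst_pod) (route_response_total{
    direction="outbound", grpc_status="0", rt_route!="", dst="bar"})
)

The "unless" half only produces results for (rt_route, pod, workload_ns) 
combinations where errors exist but no success counter does, which is exactly 
the "no data" case.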

On Wednesday, December 30, 2020 at 3:55:32 AM UTC-5 [email protected] wrote:

> Since you don't care about the status, the typical thing to do is use a 
> sum() aggregator to remove the label.
>
> sum without (status) (increase(response_total{status!="200"}[10m])) / sum 
> without (status) (increase(response_total{status="200"}[10m]))
>
> On Tue, Dec 29, 2020 at 11:36 PM Alex K <[email protected]> wrote:
>
>> I have a counter metric called response_total. It has labels source, 
>> status, and service, plus a few more, but those are the important ones for 
>> this question.
>>
>> response_total{status="200", source="foo", service="bar"} is the counter 
>> for successful requests from a service or job called "foo" to a service 
>> called "bar". 
>> response_total{status!="200", source="foo", service="bar"} is the counter 
>> for failed requests from a service or job called "foo" to a service called 
>> "bar". 
>>
>> I'm trying to define an alert that will trigger if there's a sudden 
>> increase of non-200 requests from a specific source to a specific service 
>> relative the increase of 200 requests for the same (source, service). E.g., 
>> if the increase of non-200 requests over the last 10 minutes is 10x greater 
>> than the increase of 200 requests, trigger an alert.
>>
>> I'm a bit stuck on how to define this as an expression. So far I've 
>> converged on something along these lines: 
>>
>> increase(response_total{status!="200"}[10m]) / 
>> increase(response_total{status="200"}[10m]) > 10
>>
>> This doesn't seem to work, and it's not particularly surprising. I'm not 
>> sure how prometheus should "know" that it should be comparing 
>> response_total{status!="200", source="foo", service="bar"} to 
>> response_total{status="200", source="foo", service="bar"}.
>>
>> I could define the service up-front, but the sources are defined by our 
>> cluster manager, so I can't enumerate them all up-front.
>>
>> I appreciate any help!
>>
>> Thanks,
>> Alex
>>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/prometheus-users/d5f90eb5-f7f5-4049-bbd6-50d530edc545n%40googlegroups.com
>> .
>>
>
