Hmm, I do care about the status. Maybe when I simplified the question, my
labeling oversimplified the problem too much.
I got it to pretty much work by doing this:
(sum without (dst_pod) (
    route_response_total{
      direction="outbound", grpc_status!="0", grpc_status!="",
      rt_route!="", dst="bar"}))
/ on (rt_route, pod, workload_ns)
(sum without (dst_pod) (
    route_response_total{
      direction="outbound", grpc_status="0", rt_route!="", dst="bar"}))
> 10
dst_pod denotes a specific Kubernetes pod in the "bar" service, and dst
denotes the service name. direction="outbound" marks a counter for requests
sent from a pod.
This gives the correct answer, but only if the denominator is present (i.e.
not "absent"). So if the pod has made at least one successful request, this
works. But for a pod that has never made a successful request, the
denominator is missing: the Prometheus console returns "no data", and no
alert can be triggered in my rule group. Dividing by zero seems like a
separate problem, but I'd appreciate any input there too.
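One idiom that might handle the missing denominator (a hedged sketch, not tested against your setup) is to `or` in a zero-valued fallback vector built from the numerator itself: multiplying the numerator by 0 yields series with the same labels and value 0, so the division always has a right-hand side. Since PromQL evaluates x / 0 as +Inf for positive x, a pod with errors but no successes would then exceed the > 10 threshold instead of vanishing. Note this version also drops grpc_status in both sums so the fallback's label set lines up with the real denominator:

```promql
(
  sum without (dst_pod, grpc_status) (
    route_response_total{direction="outbound", grpc_status!="0",
                         grpc_status!="", rt_route!="", dst="bar"})
)
/ on (rt_route, pod, workload_ns)
(
  # real denominator: successful requests
  sum without (dst_pod, grpc_status) (
    route_response_total{direction="outbound", grpc_status="0",
                         rt_route!="", dst="bar"})
  # fallback: numerator * 0 supplies a zero-valued series with
  # matching labels when no success series exists
  or on (rt_route, pod, workload_ns)
  sum without (dst_pod, grpc_status) (
    route_response_total{direction="outbound", grpc_status!="0",
                         grpc_status!="", rt_route!="", dst="bar"}) * 0
)
> 10
```

The trade-off: any pod with errors and zero successes yields +Inf, which always fires; if that is too aggressive, you could additionally gate the alert on a minimum error count.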
On Wednesday, December 30, 2020 at 3:55:32 AM UTC-5 [email protected] wrote:
> Since you don't care about the status, the typical thing to do is use a
> sum() aggregator to remove the label.
>
> sum without (status) (increase(response_total{status!="200"}[10m])) / sum
> without (status) (increase(response_total{status="200"}[10m]))
>
> On Tue, Dec 29, 2020 at 11:36 PM Alex K <[email protected]> wrote:
>
>> I have a counter metric called response_total. It has labels source,
>> status, and service, plus a few more, but those are the important ones for
>> this question.
>>
>> response_total{status="200", source="foo", service="bar"} is the counter
>> for successful requests from a service or job called "foo" to a service
>> called "bar".
>> response_total{status!="200", source="foo", service="bar"} is the counter
>> for failed requests from a service or job called "foo" to a service called
>> "bar".
>>
>> I'm trying to define an alert that will trigger if there's a sudden
>> increase of non-200 requests from a specific source to a specific service
>> relative to the increase of 200 requests for the same (source, service). E.g.,
>> if the increase of non-200 requests over the last 10 minutes is 10x greater
>> than the increase of 200 requests, trigger an alert.
>>
>> I'm a bit stuck on how to define this as an expression. So far I've
>> converged on something along these lines:
>>
>> increase(response_total{status!="200"}[10m]) /
>> increase(response_total{status="200"}[10m]) > 10
>>
>> This doesn't seem to work, and it's not particularly surprising. I'm not
>> sure how prometheus should "know" that it should be comparing
>> response_total{status!="200", source="foo", service="bar"} to
>> response_total{status="200", source="foo", service="bar"}.
>>
>> I could define the service up-front, but the sources are defined by our
>> cluster manager, so I can't enumerate them all up-front.
>>
>> I appreciate any help!
>>
>> Thanks,
>> Alex
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/prometheus-users/d5f90eb5-f7f5-4049-bbd6-50d530edc545n%40googlegroups.com.
>>
>