Re: [prometheus-users] Re: up query

Brian Candler Sat, 27 Aug 2022 05:33:40 -0700

That's a different thing.

node_boot_time_seconds is a metric that says when the host itself thinks it 
booted. That is not necessarily the same as whether the host has been "up" or 
"down" from the point of view of Prometheus, which classes "up" as a 
successful scrape.  For example, the host could have been running fine while 
the network was down: you'll get up == 0 during the network outage, but 
node_boot_time_seconds will not have changed.


Question: are you generating alerts when these machines go down?  If you 
are, then the answer is easy: there's a metric ALERTS_FOR_STATE where the 
value is the time that the alert started. See:
https://jaanhio.me/blog/visualizing-alerts-metrics-grafana/

(You could always add alerting rules which send out no notifications: add a 
label that identifies them as silent alerts, and match this label in your 
alertmanager routing rules to route them to an empty receiver)

Otherwise, assuming the node is currently down (i.e. up == 0), I think you 
are looking for either:
* the last time at which up == 1
* the last time at which up changed from 1 to 0

However, getting this answer directly through a prometheus query is not 
easy. You can graph the transitions from "up" to "down":

    up == 0 and up offset 5m == 1

But you want the timestamp of the last transition. There is a function 
last_over_time(...) which gets you the last available value, but 
timestamp(last_over_time(...)) gives you the evaluation time, not the 
timestamp of that sample.

To the best of my knowledge, you need a trick like:

    timestamp(up) and up==1

or more simply, since we know up=0 or 1 only:

    time() * up

Then you can sweep this over a range and pick the maximum value, which must 
be the most recent, since time increases monotonically (and it will give 
zero if the machine has been down over the whole period):

    max_over_time((time() * up)[24h:])
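
To see why taking the maximum picks out the last successful scrape, here is 
a small Python sketch of the same arithmetic. It is only an illustration, not 
how Prometheus evaluates the query; the sample timestamps and the helper 
name last_seen_up are made up.

```python
def last_seen_up(samples):
    """Mimic max_over_time((time() * up)[range:]) for a single series.

    samples: list of (eval_time_seconds, up_value) pairs, up_value is 0 or 1.
    Returns the latest eval time at which up was 1, or 0 if it was never up.
    """
    # time() * up is the timestamp when up == 1, and 0 when up == 0
    products = [t * up for t, up in samples]
    # Timestamps only increase, so the maximum is the most recent "up" point
    return max(products, default=0)


# Hypothetical series: up for three steps, then down for two
samples = [(1000, 1), (1060, 1), (1120, 1), (1180, 0), (1240, 0)]
print(last_seen_up(samples))                  # 1120
print(last_seen_up([(1000, 0), (1060, 0)]))   # 0: down for the whole window
```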

Note: This is a fairly expensive query, so make sure you only evaluate it 
at a single instant.  If you're doing this in the Prometheus web interface, 
select "Table", not "Graph".  If you're doing this in Grafana, turn on the 
"Instant" switch.
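
If you are scripting this rather than using a UI, the equivalent is to call 
the instant query endpoint of the HTTP API (/api/v1/query) instead of the 
range one (/api/v1/query_range). A minimal sketch that only builds the URL; 
the server address http://localhost:9090 and the helper name 
instant_query_url are assumptions for illustration:

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql, at=None):
    """Build a URL for a single-instant evaluation of a PromQL expression."""
    params = {"query": promql}
    if at is not None:
        # Optional evaluation timestamp (Unix seconds); the server uses
        # its current time when this is omitted
        params["time"] = str(at)
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


# Assumed server address; replace with your own Prometheus
print(instant_query_url("http://localhost:9090",
                        "max_over_time((time() * up)[24h:])"))
```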

Want to limit the result to just machines which are down *now*?

    max_over_time((time() * up)[24h:]) unless up == 1

Want to know how long they have been down? Do the same as you did with 
node_boot_time_seconds:

    time() - max_over_time((time() * up)[24h:]) unless up == 1

This query gets more expensive as you increase the time range covered. If 
you're not too worried about full accuracy, e.g. the approximate number of 
hours that the machine has been down is OK, then you can use a larger 
evaluation step in the subquery: 

    time() - max_over_time((time() * up)[30d:1h]) unless up == 1
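
The cost of the larger step is that the result can overstate the downtime by 
up to one step. A rough Python sketch of why, assuming the host was up 
continuously until it went down, that subquery evaluation points fall on 
multiples of the step, and ignoring the lookback window; the function name 
and the numbers are made up for illustration:

```python
def estimated_downtime(now, went_down_at, step):
    """Downtime as seen when 'up' is only sampled every `step` seconds."""
    # Most recent evaluation point at or before now (multiples of step)
    last_eval = now - (now % step)
    # Walk back to the latest evaluation point at which the host was still up
    while last_eval >= went_down_at:
        last_eval -= step
    return now - last_eval


now, went_down_at = 100000, 90500                  # true downtime: 9500s
print(estimated_downtime(now, went_down_at, 60))    # 9520
print(estimated_downtime(now, went_down_at, 3600))  # 10000
```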

Hopefully, this has given you some ideas about how flexible and powerful 
PromQL is.  Here are some links about PromQL I've bookmarked over time, in 
case they are useful (I haven't tested that they all still work):

* <https://prometheus.io/docs/prometheus/latest/querying/basics/>
* <https://github.com/infinityworks/prometheus-example-queries>
* <https://timber.io/blog/promql-for-humans/>
* <https://www.weave.works/blog/promql-queries-for-the-rest-of-us/>
* <https://www.slideshare.net/weaveworks/promql-deep-dive-the-prometheus-query-language>
* <https://medium.com/@valyala/promql-tutorial-for-beginners-9ab455142085>
* <https://www.robustperception.io/common-query-patterns-in-promql>
* <https://www.robustperception.io/booleans-logic-and-math>
* <https://www.robustperception.io/composing-range-vector-functions-in-promql>
* <https://www.robustperception.io/rate-then-sum-never-sum-then-rate>
* <https://www.robustperception.io/using-group_left-to-calculate-label-proportions>
* <https://www.robustperception.io/extracting-raw-samples-from-prometheus>
* <https://www.robustperception.io/prometheus-query-results-as-csv/>
* <https://www.robustperception.io/existential-issues-with-metrics>
* <https://www.robustperception.io/left-joins-in-promql>

On Thursday, 25 August 2022 at 13:33:44 UTC+1 chembakay...@gmail.com wrote:

> Thanks, Brian. It really helped me. 
>
> I want to find the Downtime of the instance in a similar way to how we 
> will find the up time of the instance.
>
> Up time : time() - node_boot_time_seconds{instance=~"$instance"}
>
> Is there any metric in node exporter so that we can find the downtime of 
> the instance?
>
> On Wednesday, 24 August 2022 at 16:57:32 UTC+5:30 Brian Candler wrote:
>
>> On Wednesday, 24 August 2022 at 11:43:15 UTC+1 chembakay...@gmail.com 
>> wrote:
>>
>>> (max_over_time(up[60s]) == bool 0) * ((up offset 61s == bool 1) * 
>>> count(up[60s]) OR vector(1)) ---> query
>>>
>>> But the above query threw me an error as below:
>>>
>>> bad_data: 1:73: parse error: expected type instant vector in aggregation 
>>> expression, got range vector
>>>
>> That expression is junk, and you didn't say where you got it from apart 
>> from "some blog".
>>
>>> What I am missing here... How I can achieve this solution like "find the 
>>> instances that have been completely in down state for last X days"
>>>
>>
>> Can you explain why the answer I gave before is not usable for you?  I 
>> have already told you that:
>>
>>     max_over_time(up[30d]) == 0
>>
>> will give you a list of all instances which have been down continuously for 
>> the last 30 days, and that seems to be what you keep asking for.  I have 
>> tested it, it works:
>>
>> [image: img1.png]
>> That is a table of machines which have been down for 30 days continuously.
>>
>> Note that this is a query that you should run at a single instant (the 
>> current time), not one that you make a graph from.  In Grafana, turn the 
>> "instant" toggle on to get this behaviour.
>>
>> [image: img2.png]
>>
>> You'll just get a set of single data points, which is a list of all the 
>> machines that have been down continuously from (now - 30 days) to (now).
>>
>> You probably want to change the visualisation to a table, or some other 
>> panel type. Graph isn't what you want here, since the query only returns 
>> data for a single point in time.  That is: those machines, which *at the current time* 
>> have been down for 30 days before *the current time*.  The reference point 
>> is the current time only; you don't want to sweep this query over previous 
>> times.
>>
>>

