Re: [prometheus-users] Grafana data mismatch in different time frames.

Brian Candler Mon, 28 Mar 2022 04:40:56 -0700

Your problem is this: suppose you're recording blackbox_exporter output, 
and for simplicity I'll choose probe_success, which looks something like 
this (1 for OK, 0 when there's a problem):

---------_-------_----------------------_-------------------------_----------------

You're then viewing it in Grafana across a very wide time range, which 
picks out individual data points for each pixel:

-    -    -    -    -    -    -    -    _    -    -    -    -    -    -    -    -

If you zoom out a long way, you can see it is likely to skip over points 
where the value was zero.  This is bound to happen when taking samples in 
this way.
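
To make the skipping concrete, here is a rough Python sketch. The data is invented (one sample every 30 seconds for a day, with a single 5-minute outage), and sample_at_steps is a hypothetical helper that simplifies what a range query does: it just reads the sample at each step time.

```python
# Hypothetical data: one probe_success sample every 30 s for a day,
# with a single 5-minute outage (values are invented for illustration).
SCRAPE_INTERVAL = 30
samples = {t: 1 for t in range(0, 86400, SCRAPE_INTERVAL)}
for t in range(36030, 36330, SCRAPE_INTERVAL):
    samples[t] = 0  # 10 failed probes

def sample_at_steps(series, start, end, step):
    """Pick one sample per step, roughly as a coarse range query does."""
    return [series[t] for t in range(start, end, step)]

coarse = sample_at_steps(samples, 0, 86400, 3600)  # one point per hour
fine = sample_at_steps(samples, 0, 86400, 30)      # one point per scrape

print(min(coarse))  # 1: the outage falls between the hourly points
print(min(fine))    # 0: every sample is visited
```

With an hourly step the outage is invisible; it only shows up once the step is small enough to land on one of the failed samples.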

In an ideal world, you'd make each failure event increment a counter:

                                                                  _________________
                 _______________________--------------------------
_________--------

Then when you look over any time period, you can see how many failures 
occurred within that window.  I think that's the best way to approach the 
problem.  Since blackbox_exporter doesn't expose a counter like this, you'd 
have to synthesise one, e.g. using a recording rule.
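
As a rough illustration of why a counter survives coarse sampling, here is a small Python sketch. The sample values and the failures_between helper are made up for the example; this is not how a recording rule is written, only the arithmetic behind the idea.

```python
from itertools import accumulate

# Hypothetical probe_success samples (1 = OK, 0 = failure), one per scrape.
probe_success = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1]

# Running total of failures: the synthetic counter never decreases.
failures_total = list(accumulate(1 - v for v in probe_success))

def failures_between(counter, first, last):
    """Failures recorded after index `first`, up to and including `last`."""
    return counter[last] - counter[first]

print(failures_total)                           # [0, 0, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3]
print(failures_between(failures_total, 0, 11))  # 3
```

Only the two end points are read, so it doesn't matter how far apart they are: every failure in between is still counted.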

Assuming you only have the existing timeseries, then as a workaround for 
probe_success, you could try using something like this:

min_over_time(probe_success[$__interval])

$__interval is the time span in Grafana covered by one data point (it 
changes with the graph resolution).  With this query, Prometheus "looks 
back" over that span before each point: if *any* of the samples in it is 
zero, the result for that point is zero; if they are all 1, the result is 1. 
But you may find that if you zoom in too close, you get gaps in your graph, 
because once $__interval is shorter than the scrape interval a window can 
contain no samples at all.

Or you can use:

avg_over_time(probe_success[$__interval])

In this case, if one point covers 4 samples, and the samples were 1 1 0 1, 
then you will get a data point showing 0.75 as the availability.
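
Here is a small Python sketch of what the two functions compute, assuming one graph point covers 4 samples. The over_time and avg helpers and the sample values are invented for illustration; they only mimic the windowing, not Prometheus itself.

```python
def over_time(samples, points, window, agg):
    """Apply `agg` to the `window` samples ending at each graph point."""
    return [agg(samples[max(0, p - window + 1): p + 1]) for p in points]

def avg(values):
    return sum(values) / len(values)

# Hypothetical probe_success samples; the middle window is 1 1 0 1.
probe_success = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
points = [3, 7, 11]  # index of the last sample behind each graph point

print(over_time(probe_success, points, 4, min))  # [1, 0, 1]
print(over_time(probe_success, points, 4, avg))  # [1.0, 0.75, 1.0]
```

Either way the single failure is no longer skipped: min turns the whole point into 0, while avg shows it as a dip to 0.75.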

Now, that isn't going to work for probe_http_status_code, which has values 
like 200 or 404 or 503; an "average" of these isn't helpful.  But you could 
do:

max_over_time(probe_http_status_code{instance="https://xxxxxxx",job="blackbox-generic-endpoints"}[$__interval])

Then you'll get whatever is the highest status code over that time range.  
That is, if the results for the time window covered by one point in the 
graph were 200 200 404 200 503 200, then you'll see 503 for that point.  
That may be good enough for what you need.
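
The same example worked through in Python, just to show the arithmetic (the list is the window quoted above, nothing more):

```python
codes = [200, 200, 404, 200, 503, 200]  # the example window from the text

print(sum(codes) / len(codes))  # 284.5, which is not a real status code
print(max(codes))               # 503
```

Bear in mind that max only reports the numerically highest code, so a window containing both a 404 and a 503 shows just the 503.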

On Monday, 28 March 2022 at 12:15:10 UTC+1 [email protected] wrote:

> Hi Brian,
>
> Thanks for your reply. Could you share a sample config / query to fix this 
> issue if possible ? I am a beginner and did not understand your reply 
> fully. 
>
> Thanks and regards
> Sreehari
>
>
> On Tue, Mar 8, 2022 at 12:00 AM Brian Candler <[email protected]> wrote:
>
>> My guess is: when the plugin queries over a large time range, it is 
>> sending a large step time to the prometheus API 
>> <https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries>, 
>> which is skipping over the times of interest.
>>
>> Now, you can argue that this is a problem with the way that the panel 
>> queries Prometheus. However, querying a 1 month range with a 30 second step 
>> would be extremely inefficient (returning ~86,000 data points).  So really, 
>> it would be better if you were to have a *counter* of how many times a 502 
>> status code is returned, and then the plugin can calculate a rate over each 
>> step.
>>
>> You can use a recording rule, running at the same interval as your 
>> blackbox scrapes, to increment a counter for each 502 response from 
>> blackbox_exporter.
>>
>> (Incidentally, the query that you've posted is syntactically invalid - it 
>> has mismatched quotes)
>>
>> On Monday, 7 March 2022 at 14:20:13 UTC [email protected] wrote:
>>
>>> Hi Team,
>>>
>>> We use a discrete plugin(panel) in Grafana to display the data from 
>>> blackbox_exporter to track the end-point(URL) availability and prometheus 
>>> data retention period is 50 days. This panel shows URL available and  
>>> unavailable time in percentage.
>>>
>>> Issue is smaller down time (E.g: 502 return code for 1hr )  is getting  
>>> ignored when we select a larger time range in Grafana (above 1 month) and 
>>> the panel is showing 100% URL available.  But if we select a smaller time 
>>> frame in Grafana, the URL unavailable time is displayed.
>>>
>>> Suspecting issue with  below query mentioned in panel. Can somebody 
>>> please provide a solution for this issue ?
>>>
>>>
>>> *Prom query Used in Grafana discrete plugin *
>>> probe_httpd_status_code{instance="https://xxxxxxx
>>> ",job=blackbox-generic-endpoints"}
>>>
>>>
>>> Prometheus Version - 2.31.0
>>> Blackbox exporter - 0.13.0
>>> Grafana Version - 6.7.4
>>> Scrape_interval: 30s
>>>
>>> Thanks and regards
>>> SreeHari 
>>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/prometheus-users/5a281000-2be7-4f9b-8e9d-2062be33d99fn%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/prometheus-users/5a281000-2be7-4f9b-8e9d-2062be33d99fn%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

