> Could you please suggest another solution or provide a fix for this issue.

Not really, I'm afraid, because your question is really about Grafana, not 
about Prometheus.

All the raw data is present and correct in Prometheus, but Grafana isn't 
querying it in the correct way; the fix is on the Grafana side.

If it were me, I'd connect to the Prometheus API, run a query to collect the 
raw data, analyse it in the way that I want, and generate the result.  A 
range vector query sent to the instant query endpoint will return all the 
data points in the given time window, with their actual collection 
timestamps, e.g.

probe_http_status_code{instance="https://xxxxxxxxxx",job="blackbox-generic-endpoints"}[30d]

On Thursday, 31 March 2022 at 10:56:40 UTC+1 [email protected] wrote:

> Thank you, Brian.
>
> I have tested the mentioned queries and results are attached.
>
> Query 1 (old query):
>   probe_http_status_code{instance="https://xxxxxxxxxx",job="blackbox-generic-endpoints"}
>     Total down time in month, summing one-day time frames: 26 minutes
>     Down time in a 1-month time frame: 4 hrs
>
> Query 2:
>   max_over_time(probe_http_status_code{instance="https://xxxxxxxxxx",job="blackbox-generic-endpoints"}[$__interval])
>     Total down time in month, summing one-day time frames: 34 minutes
>     Down time in a 1-month time frame: 12 hrs
>
> Query 3:
>   min_over_time(probe_http_status_code{instance="https://xxxxxxxxxx",job="blackbox-generic-endpoints"}[$__interval])
>     Total down time in month, summing one-day time frames: 18 minutes
>     Down time in a 1-month time frame: 2 hrs (URL unavailable)
>
> Actual down time for this endpoint/URL is around 30 minutes, and that 
> almost matches the first two queries when we take the sum of the one-day 
> downtime values over one month (details attached).
>
> However, in a one-month time frame the first two queries do not give the 
> exact down time, reporting more than 4 hrs of down time (results attached).
>
> Could you please suggest another solution or provide a fix for this issue.
>
> Regards.
> Sreehari 
>
> On Mon, Mar 28, 2022 at 5:10 PM Brian Candler <[email protected]> wrote:
>
>> Your problem is this: suppose you're recording blackbox_exporter output, 
>> and for simplicity I'll choose probe_success, which looks something like 
>> this (1 for OK, 0 when there's a problem):
>>
>>
>> ---------_-------_----------------------_-------------------------_----------------
>>
>> You're then viewing it in Grafana across a very wide time range, which 
>> picks out individual data points for each pixel:
>>
>> -    -    -    -    -    -    -    -    _    -    -    -    -    -    -    -    -
>>
>> If you zoom out a long way, you can see it is likely to skip over points 
>> where the value was zero.  This is bound to happen when taking samples in 
>> this way.
>>
>> In an ideal world, you'd make each failure event increment a counter:
>>                                           __________________
>>                  ________________________|
>> ________________|
>>
>> Then when you look over any time period, you can see how many failures 
>> occurred within that window.  I think that's the best way to approach the 
>> problem.  Since blackbox_exporter doesn't expose a counter like this, you'd 
>> have to synthesise one, e.g. using a recording rule.
>>
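Since a Prometheus recording rule cannot maintain a true monotonic counter, one common approximation (my sketch; the group and metric names are my own) is to record the failure state on every evaluation and sum it over the window later:

```yaml
# rules.yml -- evaluated at the same cadence as the blackbox scrapes
groups:
  - name: blackbox_failures
    interval: 30s               # match your scrape_interval
    rules:
      - record: probe_failure
        expr: 1 - probe_success   # 1 when the probe failed, 0 otherwise
```

With that in place, `sum_over_time(probe_failure[30d]) * 30 / 60` gives the approximate minutes of downtime in a month (at a 30 s interval), and no failure sample can be skipped by the graph resolution.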
>> Assuming you only have the existing timeseries, then as a workaround for 
>> probe_success, you could try using something like this:
>>
>> min_over_time(probe_success[$__interval])
>>
>> $__interval is the time span in Grafana of one data point (and changes 
>> with the graph resolution).  With this query, each point "looks back" in 
>> time over that span, and if *any* of the samples is zero, the result will 
>> be zero for that point; if they are all 1, the result will be 1.  But 
>> you may find that if you zoom in too close, you get gaps in your graph.
>>
>> Or you can use:
>>
>> avg_over_time(probe_success[$__interval])
>>
>> In this case, if one point covers 4 samples, and the samples were 1 1 0 
>> 1, then you will get a data point showing 0.75 as the availability.
>>
>> Now, that isn't going to work for probe_http_status_code, which has 
>> values like 200 or 404 or 503; an "average" of these isn't helpful.  But 
>> you could do:
>>
>> max_over_time(probe_http_status_code{instance="https://xxxxxxx",job="blackbox-generic-endpoints"}[$__interval])
>>
>> Then you'll get whatever is the highest status code over that time 
>> range.  That is, if the results for the time window covered by one point in 
>> the graph were 200 200 404 200 503 200, then you'll see 503 for that 
>> point.  That may be good enough for what you need.
>>
>> On Monday, 28 March 2022 at 12:15:10 UTC+1 [email protected] wrote:
>>
>>> Hi Brian,
>>>
>>> Thanks for your reply. Could you share a sample config / query to fix 
>>> this issue, if possible? I am a beginner and did not fully understand 
>>> your reply. 
>>>
>>> Thanks and regards
>>> Sreehari
>>>
>>>
>>> On Tue, Mar 8, 2022 at 12:00 AM Brian Candler <[email protected]> wrote:
>>>
>>>> My guess is: when the plugin queries over a large time range, it is 
>>>> sending a large step time to the Prometheus API 
>>>> <https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries>, 
>>>> which is skipping over the times of interest.
>>>>
>>>> Now, you can argue that this is a problem with the way that the panel 
>>>> queries Prometheus. However, querying a 1-month range with a 30-second 
>>>> step would be extremely inefficient (returning ~86,000 data points).  
>>>> So really, it would be better if you were to have a *counter* of how 
>>>> many times a 502 status code is returned, and then the plugin can 
>>>> calculate a rate over each step.
>>>>
>>>> You can use a recording rule, running at the same interval as your 
>>>> blackbox scrapes, to increment a counter for each 502 response from 
>>>> blackbox_exporter.
>>>>
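One way such a recording rule might look (a sketch only; the group and metric names are mine):

```yaml
groups:
  - name: blackbox_502
    interval: 30s                        # same cadence as the blackbox scrapes
    rules:
      - record: probe_http_502
        expr: probe_http_status_code == bool 502   # 1 for each scrape that saw a 502
```

The panel could then use e.g. `sum_over_time(probe_http_502[$__interval])` to count 502 responses per graph step, instead of sampling the raw status code.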
>>>> (Incidentally, the query that you've posted is syntactically invalid - 
>>>> it has mismatched quotes)
>>>>
>>>> On Monday, 7 March 2022 at 14:20:13 UTC [email protected] wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> We use the Discrete plugin (panel) in Grafana to display data from 
>>>>> blackbox_exporter to track end-point (URL) availability; the Prometheus 
>>>>> data retention period is 50 days. This panel shows URL available and 
>>>>> unavailable time as percentages.
>>>>>
>>>>> The issue is that shorter down times (e.g. a 502 return code for 1 hr) 
>>>>> get ignored when we select a larger time range in Grafana (above 1 
>>>>> month), and the panel shows the URL as 100% available.  But if we 
>>>>> select a smaller time frame in Grafana, the URL unavailable time is 
>>>>> displayed.
>>>>>
>>>>> We suspect an issue with the query below, used in the panel. Can 
>>>>> somebody please provide a solution for this issue?
>>>>>
>>>>>
>>>>> *PromQL query used in the Grafana Discrete plugin*
>>>>> probe_httpd_status_code{instance="https://xxxxxxx
>>>>> ",job=blackbox-generic-endpoints"}
>>>>>
>>>>>
>>>>> Prometheus Version - 2.31.0
>>>>> Blackbox exporter - 0.13.0
>>>>> Grafana Version - 6.7.4
>>>>> Scrape_interval: 30s
>>>>>
>>>>> Thanks and regards
>>>>> SreeHari 
>>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "Prometheus Users" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/prometheus-users/5a281000-2be7-4f9b-8e9d-2062be33d99fn%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/prometheus-users/5a281000-2be7-4f9b-8e9d-2062be33d99fn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
