Are you collecting prometheus' own metrics? Something like this:
  - job_name: prometheus
    scrape_interval: 1m
    static_configs:
      - targets: ['localhost:9090']
If you are, then there are various metrics you should check, including:
prometheus_rule_evaluations_total
prometheus_rule_evaluation_failures_total
prometheus_rule_group_iterations_total
prometheus_rule_group_iterations_missed_total
For the rule / rule group in question, check which of these are
incrementing during the problem period. If the 'failures' or 'missed'
counters are incrementing, that points to a problem. Similarly, it points
to a problem if 'evaluations_total' or 'iterations_total' *isn't*
incrementing.
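For example, queries like these in the PromQL browser should show a steady
increase for a healthy rule group, and stay at zero for the failure/missed
counters (on recent versions each of these series carries a rule_group
label of the form "<rule file>;<group name>", so you can filter down to
just your ServerDown group if you like):

  increase(prometheus_rule_group_iterations_total[15m])
  increase(prometheus_rule_group_iterations_missed_total[15m])
  increase(prometheus_rule_evaluation_failures_total[15m])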
Also, have a look at error output from prometheus while the problem is
occurring:
journalctl -fu prometheus
On Monday, 19 September 2022 at 21:53:46 UTC+1 [email protected] wrote:
> Correct. Restarting prometheus does fix it.
>
> On Mon, Sep 19, 2022 at 3:44 PM Brian Candler <[email protected]> wrote:
>
>> "Restarting prometheus, alertmanager and blackbox-exports fixes the issue"
>>
>> Which one of these fixes the issue? From what you've said, I am guessing
>> that restarting only prometheus would do it - since you're saying you see
>> no alerts in the Prometheus UI, not even in "pending" state.
>>
>> On Monday, 19 September 2022 at 21:39:11 UTC+1 [email protected] wrote:
>>
>>> Prometheus : 2.38.0
>>> Alertmanager : 0.24.0
>>> Blackbox: 0.22.0
>>>
>>> probe_success{job="blackbox_icmp-server"} returns 0. I see it.
>>>
>>> Thanks
>>> Paras.
>>>
>>> On Mon, Sep 19, 2022 at 3:32 PM Brian Candler <[email protected]> wrote:
>>>
>>>> Prometheus version? Alertmanager version?
>>>>
>>>> What if you enter the query
>>>> probe_success{job="blackbox_icmp-server"} == 0
>>>> in the prometheus web interface (PromQL browser) while the problem is
>>>> happening? Does it show any results?
>>>>
>>>> On Monday, 19 September 2022 at 19:21:29 UTC+1 [email protected]
>>>> wrote:
>>>>
>>>>> Hello Julius
>>>>>
>>>>> * The rule is something like this:
>>>>>
>>>>>   - name: ServerDown
>>>>>     rules:
>>>>>       - alert: Server-InstanceDown
>>>>>         expr: probe_success{job="blackbox_icmp-server"} == 0
>>>>>         for: 1m
>>>>>
>>>>> * When alerting is not working, hosts stay down for hours until I
>>>>> restart prometheus and the blackbox exporters. After restarting,
>>>>> everything is normal.
>>>>>
>>>>> * The underlying metric (probe_success) goes to 0 when a host is down,
>>>>> but the alert never changes to Pending/Firing.
>>>>>
>>>>> Thanks
>>>>> Paras.
>>>>>
>>>>> On Mon, Sep 19, 2022 at 2:35 AM Julius Volz <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Paras,
>>>>>>
>>>>>> Could you share more information about your setup:
>>>>>>
>>>>>> * What's the alerting rule that isn't working as intended?
>>>>>> * For how long were the hosts down without getting alerted on?
>>>>>> * What did the underlying metrics (e.g. "up" for the exporter's own
>>>>>> scrape health and "probe_success" for the backend probe health)
>>>>>> collected by the Blackbox Exporter look like at the time when the
>>>>>> alert should have been firing, but didn't?
>>>>>>
>>>>>> One possibility is that your Blackbox exporter itself couldn't be
>>>>>> scraped anymore, in which case its "up" metric would be 0 and the
>>>>>> "probe_success" metric would be absent (and thus any alerts based on
>>>>>> that metric would never fire).
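>>>>>>
>>>>>> As a rough sketch of how to catch that case (the job name matches the
>>>>>> rule quoted above; the "for" duration is just a placeholder), an extra
>>>>>> rule along these lines would fire when the exporter scrape itself
>>>>>> fails rather than the probe:
>>>>>>
>>>>>>   - alert: BlackboxExporterDown
>>>>>>     # "up" is 0 whenever Prometheus fails to scrape the blackbox exporter
>>>>>>     expr: up{job="blackbox_icmp-server"} == 0
>>>>>>     for: 1m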
>>>>>>
>>>>>> Regards,
>>>>>> Julius
>>>>>>
>>>>>> On Thu, Sep 15, 2022 at 6:33 PM Paras pradhan <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We use prometheus, alertmanager and blackbox-exporter to check
>>>>>>> whether hosts respond to ICMP. Host counts are 1K+. We noticed that
>>>>>>> sometimes, and randomly, the alerts are not generated (prometheus
>>>>>>> dashboard --> alerts) when the hosts/targets are actually down.
>>>>>>> Restarting prometheus, alertmanager and blackbox-exporters fixes the
>>>>>>> issue. I don't see anything that stands out in the logs. How do I
>>>>>>> troubleshoot, and is there anything like cached data in prometheus
>>>>>>> that needs to be cleared?
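>>>>>>>
>>>>>>> For context, the usual shape of such a setup is roughly the sketch
>>>>>>> below (host names and the exporter address are placeholders; the job
>>>>>>> name matches the one quoted above). In blackbox.yml:
>>>>>>>
>>>>>>>   modules:
>>>>>>>     icmp:
>>>>>>>       prober: icmp
>>>>>>>       timeout: 5s
>>>>>>>
>>>>>>> and in prometheus.yml, under scrape_configs:
>>>>>>>
>>>>>>>   - job_name: blackbox_icmp-server
>>>>>>>     metrics_path: /probe
>>>>>>>     params:
>>>>>>>       module: [icmp]
>>>>>>>     static_configs:
>>>>>>>       - targets: ['host1.example.com', 'host2.example.com']
>>>>>>>     relabel_configs:
>>>>>>>       - source_labels: [__address__]
>>>>>>>         target_label: __param_target
>>>>>>>       - source_labels: [__param_target]
>>>>>>>         target_label: instance
>>>>>>>       - target_label: __address__
>>>>>>>         replacement: 127.0.0.1:9115  # the blackbox exporter itself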
>>>>>>>
>>>>>>> Thanks
>>>>>>> Paras.
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Julius Volz
>>>>>> PromLabs - promlabs.com
>>>>>>