Thanks, Matthias. Sure, I will troubleshoot each component and get back to
you if there are any issues.



Thanks,
Shiva


On Fri, Jan 14, 2022 at 2:31 AM Matthias Rampke <matth...@prometheus.io>
wrote:

> From these logs, it's not clear. Try increasing the log level
> (--log.level=debug) on Alertmanager and Prometheus.
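>
> For example, if they run as plain binaries (adjust to however your pods
> pass their arguments; the config file names below are placeholders):
>
>   alertmanager --config.file=alertmanager.yml --log.level=debug
>   prometheus --config.file=prometheus.yml --log.level=debug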
>
> We do not know enough about your setup and the receiving service to solve
> this for you. You will have to systematically troubleshoot every part of
> the chain.
>
> It seems that there are multiple issues at once – Alertmanager is falling
> behind on sending notifications, Prometheus is timing out sending alerts to
> Alertmanager. Make sure the node is not overloaded, make sure the webhook
> receiver is working correctly and quickly, and that Alertmanager can reach
> it (send a webhook by hand using curl from the Alertmanager host).
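>
> For example, a rough test from the Alertmanager host (the receiver URL is a
> placeholder, and your service may expect more of the webhook payload fields
> than this minimal one):
>
>   curl -v -X POST http://<your-webhook-receiver>:<port>/<path> \
>     -H 'Content-Type: application/json' \
>     -d '{"version":"4","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"TestAlert"},"annotations":{}}]}'
>
> A fast 2xx response is what you want to see; anything slow or failing points
> at the receiver or the network path to it.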
>
> I hope this gives you some pointers to find out more yourself!
>
> /MR
>
> On Fri, Jan 7, 2022, 06:27 shivakumar sajjan <shivusajjan...@gmail.com>
> wrote:
>
>> Hi Matthias,
>>
>> Thanks for responding to my questions.
>>
>> It is a service to which I added an API endpoint; Alertmanager posts alert
>> information (firing/resolved) to it whenever alerts are triggered.
>>
>> *Below are the warnings in the Alertmanager pod logs:*
>>
>> level=warn ts=2022-01-06T20:27:41.726Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4097 limit=4096
>> level=warn ts=2022-01-06T20:42:41.726Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4121 limit=4096
>> level=warn ts=2022-01-06T21:27:41.726Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4097 limit=4096
>> level=warn ts=2022-01-06T21:42:41.726Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4098 limit=4096
>> level=warn ts=2022-01-06T21:57:41.727Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4098 limit=4096
>> level=warn ts=2022-01-06T22:42:41.727Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4123 limit=4096
>> level=warn ts=2022-01-06T22:57:41.727Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4155 limit=4096
>> level=warn ts=2022-01-06T23:12:41.727Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4100 limit=4096
>> level=warn ts=2022-01-06T23:27:41.728Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4097 limit=4096
>> level=warn ts=2022-01-06T23:42:41.728Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4099 limit=4096
>> level=warn ts=2022-01-06T23:57:41.728Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4097 limit=4096
>> level=warn ts=2022-01-07T00:27:41.728Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4124 limit=4096
>> level=warn ts=2022-01-07T00:42:41.729Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4124 limit=4096
>> level=warn ts=2022-01-07T00:57:41.729Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4097 limit=4096
>> level=warn ts=2022-01-07T01:42:41.729Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4099 limit=4096
>> level=warn ts=2022-01-07T01:57:41.730Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4098 limit=4096
>> level=warn ts=2022-01-07T02:42:41.730Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4098 limit=4096
>> level=warn ts=2022-01-07T02:57:41.730Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4155 limit=4096
>> level=warn ts=2022-01-07T03:12:41.730Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4098 limit=4096
>> level=warn ts=2022-01-07T03:27:41.731Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4098 limit=4096
>> level=warn ts=2022-01-07T03:42:41.731Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4099 limit=4096
>> level=warn ts=2022-01-07T03:57:41.731Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4098 limit=4096
>> level=warn ts=2022-01-07T04:42:41.732Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4098 limit=4096
>> level=warn ts=2022-01-07T04:57:41.732Z caller=delegate.go:272
>> component=cluster msg="dropping messages because too many are queued"
>> current=4097 limit=4096
>>
>>
>> *And these are the errors in the Prometheus server pod logs:*
>>
>> level=error ts=2021-09-06T10:11:22.754Z caller=notifier.go:528
>> component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts
>> count=0 msg="Error sending alert" err="Post
>> http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
>> level=error ts=2021-09-07T23:36:27.753Z caller=notifier.go:528
>> component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts
>> count=0 msg="Error sending alert" err="Post
>> http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
>> level=error ts=2021-09-07T23:36:52.755Z caller=notifier.go:528
>> component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts
>> count=0 msg="Error sending alert" err="Post
>> http://10.64.87.17:9093/api/v1/alerts: dial tcp 127.0.0.1:9093: i/o
>> timeout"
>> level=error ts=2021-09-07T23:37:02.756Z caller=notifier.go:528
>> component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts
>> count=64 msg="Error sending alert" err="Post
>> http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
>> level=error ts=2021-09-07T23:37:12.757Z caller=notifier.go:528
>> component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts
>> count=11 msg="Error sending alert" err="Post
>> http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
>> level=error ts=2021-09-07T23:37:27.755Z caller=notifier.go:528
>> component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts
>> count=0 msg="Error sending alert" err="Post
>> http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
>> level=error ts=2021-09-07T23:37:42.754Z caller=notifier.go:528
>> component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts
>> count=0 msg="Error sending alert" err="Post
>> http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
>> level=error ts=2021-09-07T23:37:56.967Z caller=notifier.go:528
>> component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts
>> count=2 msg="Error sending alert" err="Post
>> http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
>> level=error ts=2021-09-07T23:38:06.968Z caller=notifier.go:528
>> component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts
>> count=18 msg="Error sending alert" err="Post
>> http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded"
>>
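>> For reference, one rough check of whether the Alertmanager API is reachable
>> at all from the Prometheus side is to post a throwaway test alert to the
>> same endpoint by hand:
>>
>>   curl -v -X POST http://127.0.0.1:9093/api/v1/alerts \
>>     -H 'Content-Type: application/json' \
>>     -d '[{"labels":{"alertname":"ManualTestAlert","severity":"none"}}]'
>>
>> If that also hangs or times out, the problem is on the Alertmanager (or
>> node) side rather than in Prometheus.
>>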
>> *May I know what could be the cause?*
>>
>> Thanks,
>> Shiva
>>
>>
>> On Fri, Jan 7, 2022 at 2:45 AM Matthias Rampke <matth...@prometheus.io>
>> wrote:
>>
>>> What is your webhook receiver? Are any of the resolve messages getting
>>> through? Are the requests succeeding?
>>>
>>> I think Alertmanager will retry failed webhooks, not sure for how long.
>>> This would keep them in the queue, leading to what you observe in
>>> Alertmanager.
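>>>
>>> One rough way to check whether the webhook notifications themselves are
>>> failing (assuming Alertmanager's metrics are reachable on localhost:9093)
>>> is to watch the sent and failed counters:
>>>
>>>   curl -s http://localhost:9093/metrics | grep -E 'alertmanager_notifications_(failed_)?total'
>>>
>>> A failed counter for integration="webhook" that keeps climbing would point
>>> at the receiver side.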
>>>
>>> /MR
>>>
>>> On Thu, Jan 6, 2022, 07:14 shivakumar sajjan <shivusajjan...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a single-instance cluster for Alertmanager and I see the warning
>>>> below in the Alertmanager container:
>>>>
>>>> *level=warn ts=2021-11-03T08:50:44.528Z caller=delegate.go:272
>>>> component=cluster msg="dropping messages because too many are queued"
>>>> current=4125 limit=4096*
>>>>
>>>> *Alert Manager Version information:*
>>>> Branch: HEAD
>>>> BuildDate: 20190708-14:31:49
>>>> BuildUser: root@868685ed3ed0
>>>> GoVersion: go1.12.6
>>>> Revision: 1ace0f76b7101cccc149d7298022df36039858ca
>>>> Version: 0.18.0
>>>>
>>>> *AlertManager metrics*
>>>> # HELP alertmanager_cluster_members Number indicating current number of
>>>> members in cluster.
>>>> # TYPE alertmanager_cluster_members gauge
>>>> alertmanager_cluster_members 1
>>>> # HELP alertmanager_cluster_messages_pruned_total Total number of
>>>> cluster messages pruned.
>>>> # TYPE alertmanager_cluster_messages_pruned_total counter
>>>> alertmanager_cluster_messages_pruned_total 23020
>>>> # HELP alertmanager_cluster_messages_queued Number of cluster messages
>>>> which are queued.
>>>> # TYPE alertmanager_cluster_messages_queued gauge
>>>> alertmanager_cluster_messages_queued 4125
>>>>
>>>> I am new to alerting. Could you please answer the questions below?
>>>>
>>>>
>>>>    - Why are messages queueing up, because of which Alertmanager is not
>>>>    sending alert-resolve information to the webhook instance?
>>>>    - What is the solution for the above issue?
>>>>    - How do we see those queued messages in Alertmanager? (A rough way to
>>>>    watch the queue size follows this list.)
>>>>    - Do we lose alerts when messages are dropped because too many are
>>>>    queued?
>>>>    - Why are messages queued even though there is logic to prune messages
>>>>    at a regular interval, i.e. every 15 minutes?
>>>>    - Do we lose alerts when Alertmanager prunes messages at that regular
>>>>    interval?
>>>>
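>>>> For reference, a rough way to watch the size of that queue over time (this
>>>> only shows the count, not the message contents, and assumes Alertmanager's
>>>> metrics are served on localhost:9093):
>>>>
>>>>   curl -s http://localhost:9093/metrics | grep alertmanager_cluster_messages_queued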
>>>>
>>>> Thanks,
>>>>
>>>> Shiva
>>>>
>>>
