Hi, we noticed a strange error message in the logfiles:
The alert-manager deployed with cephadm receives an HTTP 500 error from the inactive MGRs when trying to call the URI /api/prometheus_receiver:

Jul 25 09:35:25 alert-manager conmon[2426]: level=error ts=2023-07-25T07:35:25.171Z caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed" num_alerts=45 err="ceph-dashboard/webhook[0]: notify retry canceled after 7 attempts: unexpected status code 500: https://mgr001.example.net:8443/api/prometheus_receiver; ceph-dashboard/webhook[2]: notify retry canceled after 8 attempts: unexpected status code 500: https://mgr003.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:25 alert-manager conmon[2426]: level=warn ts=2023-07-25T07:35:25.175Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[2] msg="Notify attempt failed, will retry later" attempts=1 err="unexpected status code 500: https://mgr003.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:25 alert-manager conmon[2426]: level=warn ts=2023-07-25T07:35:25.177Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="unexpected status code 500: https://mgr001.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:35 alert-manager conmon[2426]: level=error ts=2023-07-25T07:35:35.171Z caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed" num_alerts=45 err="ceph-dashboard/webhook[2]: notify retry canceled after 7 attempts: unexpected status code 500: https://mgr003.example.net:8443/api/prometheus_receiver; ceph-dashboard/webhook[0]: notify retry canceled after 8 attempts: unexpected status code 500: https://mgr001.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:35 alert-manager conmon[2426]: level=warn ts=2023-07-25T07:35:35.176Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[2] msg="Notify attempt failed, will retry later" attempts=1 err="unexpected status code 500: https://mgr003.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:35 alert-manager conmon[2426]: level=warn ts=2023-07-25T07:35:35.176Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="unexpected status code 500: https://mgr001.example.net:8443/api/prometheus_receiver"

The following is from the logfile of mgr002, which was passive at first and then became active. Once it was active, the errors on this MGR were gone, but they showed up on the newly passive MGR.

Jul 25 09:25:25 mgr002 ceph-mgr[1841]: [dashboard INFO request] [::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B] [581dce66-9c65-4e84-a41a-8d72b450791e] /api/prometheus_receiver
Jul 25 09:25:25 mgr002 ceph-mgr[1841]: [dashboard ERROR request] [::ffff:10.54.226.222:49904] [POST] [500] [0.001s] [513.0B] [26e1854a-3b93-49c4-8afc-1a96426a3dab] /api/prometheus_receiver
Jul 25 09:25:25 mgr002 ceph-mgr[1841]: [dashboard ERROR request] [b'{"status": "500 Internal Server Error", "detail": "The server encountered an unexpected condition which prevented it from fulfilling the request.", "request_id": "26e1854a-3b93-49c4-8afc-1a96426a3dab"} ']
Jul 25 09:25:25 mgr002 ceph-mgr[1841]: [dashboard INFO request] [::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B] [26e1854a-3b93-49c4-8afc-1a96426a3dab] /api/prometheus_receiver
Jul 25 09:25:26 mgr002 ceph-mgr[1841]: [dashboard ERROR request] [::ffff:10.54.226.222:49904] [POST] [500] [0.001s] [513.0B] [46d7e78c-49d5-4652-9877-973129ad3977] /api/prometheus_receiver
Jul 25 09:25:26 mgr002 ceph-mgr[1841]: [dashboard ERROR request] [b'{"status": "500 Internal Server Error", "detail": "The server encountered an unexpected condition which prevented it from fulfilling the request.", "request_id": "46d7e78c-49d5-4652-9877-973129ad3977"} ']
Jul 25 09:25:26 mgr002 ceph-mgr[1841]: [dashboard INFO request] [::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B] [46d7e78c-49d5-4652-9877-973129ad3977] /api/prometheus_receiver
Jul 25 09:25:27 mgr002 ceph-mgr[1841]: [dashboard ERROR request] [::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B] [a9b25e54-f1e1-42eb-90b2-af5aa22769cf] /api/prometheus_receiver
Jul 25 09:25:27 mgr002 ceph-mgr[1841]: [dashboard ERROR request] [b'{"status": "500 Internal Server Error", "detail": "The server encountered an unexpected condition which prevented it from fulfilling the request.", "request_id": "a9b25e54-f1e1-42eb-90b2-af5aa22769cf"} ']
Jul 25 09:25:27 mgr002 ceph-mgr[1841]: [dashboard INFO request] [::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B] [a9b25e54-f1e1-42eb-90b2-af5aa22769cf] /api/prometheus_receiver
Jul 25 09:25:28 mgr002 ceph-mgr[1841]: mgr handle_mgr_map Activating!
Jul 25 09:25:28 mgr002 ceph-mgr[1841]: mgr handle_mgr_map I am now activating

We have a test cluster, also running version 17.2.6, where this does not happen. In this test cluster the passive MGRs return an HTTP code 204 when the alert-manager tries to request /api/prometheus_receiver.

What is happening here?

Regards
-- 
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
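
P.S. For anyone who wants to compare against their own cluster: the sketch below tallies which MGR endpoints rejected a webhook attempt, based on alert-manager log lines in the format quoted above. It is only an illustration. The names ATTEMPT_RE, count_failed_attempts and SAMPLE are made up for this example, and the regular expression assumes the exact "Notify attempt failed, will retry later" wording from our log excerpt.

```python
import re
from collections import Counter

# Matches the per-attempt warning lines from the alert-manager log excerpt.
# Assumption: the message wording and err="..." layout are as quoted above.
ATTEMPT_RE = re.compile(
    r'msg="Notify attempt failed, will retry later".*?'
    r'err="unexpected status code (?P<status>\d{3}): (?P<url>https?://[^"\s;]+)"'
)


def count_failed_attempts(lines):
    """Count failed webhook attempts per (URL, HTTP status) pair."""
    counts = Counter()
    for line in lines:
        match = ATTEMPT_RE.search(line)
        if match:
            counts[(match.group("url"), int(match.group("status")))] += 1
    return counts


# Two warning lines copied from the excerpt above, used as sample input.
SAMPLE = [
    'Jul 25 09:35:25 alert-manager conmon[2426]: level=warn '
    'ts=2023-07-25T07:35:25.175Z caller=notify.go:724 component=dispatcher '
    'receiver=ceph-dashboard integration=webhook[2] '
    'msg="Notify attempt failed, will retry later" attempts=1 '
    'err="unexpected status code 500: '
    'https://mgr003.example.net:8443/api/prometheus_receiver"',
    'Jul 25 09:35:25 alert-manager conmon[2426]: level=warn '
    'ts=2023-07-25T07:35:25.177Z caller=notify.go:724 component=dispatcher '
    'receiver=ceph-dashboard integration=webhook[0] '
    'msg="Notify attempt failed, will retry later" attempts=1 '
    'err="unexpected status code 500: '
    'https://mgr001.example.net:8443/api/prometheus_receiver"',
]

if __name__ == "__main__":
    for (url, status), n in sorted(count_failed_attempts(SAMPLE).items()):
        print(f"{status} x{n} {url}")
```

On a cluster like ours, only the passive MGRs should show up in the tally, since the active one accepts the request.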