[prometheus-users] How to sense disk read/write errors
Had two users come at me with "why didn't you...?" because a machine had disk hardware failures but no alerts before the device died. They pointed at these messages in the kernel dmesg:

> [Wed May 17 06:07:05 2023] nvme nvme3: async event result 00010300
> [Wed May 17 06:07:25 2023] nvme nvme3: controller is down; will reset: CSTS=0x2, PCI_STATUS=0x10
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Thu May 18 08:06:04 2023] Buffer I/O error on dev nvme3n1, logical block 390703424, async page read
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 0
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 256

I didn't find an "errors" counter in iostats [1], so I'm guessing node_exporter won't have one either. I did find node_filesystem_device_error, but it stayed zero the whole time.

What would be the Prometheus-y way to sense these errors so my users can have their alerts? I'm hoping to avoid a "logtail | grep -c 'error'" counter.

[1] https://www.kernel.org/doc/html/latest/admin-guide/iostats.html

-- You received this message because you are subscribed to the Google Groups "Prometheus Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CACDZGiKxT-kKodJQe44TL5-DRKwZ5fpazPhvkb4FijGS8iWjsQ%40mail.gmail.com.
Re: [prometheus-users] EOF errors waiting on consul_sd_config requests
> which version prometheus

prometheus, version 2.19.0 (branch: HEAD, revision: 5d7e3e970602c755855340cb190a972cebdd2ebf)
go version: go1.14.4

Does it go away in a newer version?
[prometheus-users] EOF errors waiting on consul_sd_config requests
I have a Prometheus server using consul_sd_config for target discovery. Some services aren't always present, but when they appear I want Prometheus to scrape them. consul_sd_config does this nicely, but it complains a lot:

level=error ts=2020-10-06T16:52:58.990Z caller=consul.go:503 component="discovery manager scrape" discovery=consul msg="Error refreshing service" service=nginx tags= err="Get \"http://consul-cluster:8500/v1/health/service/nginx?dc=sandbox=9==12ms\": EOF"

I don't currently have any nginx services, but that's okay; they come and go, as cloud stuff does. While there are no nginx services published in the consul catalog, trying that URL myself gives:

$ time curl 'http://consul-cluster:8500/v1/health/service/apache?dc=sandbox=9==12ms'
curl: (52) Empty reply from server
real 0m59.077s

So the server drops the connection after only 59s, but Prometheus waits 2 * time.Minute for the server to compose a non-empty answer that will never come. This also ties up a TCP connection for 59 seconds (intending 120s) for an answer that will not come. If I try the URL without the "wait=" parameter, I get the correct answer immediately: an empty list. And waiting longer than the scrape_interval seems nonsensical, since the next scrape could start before the current wait completes.

Looking in prometheus/discovery/consul/consul.go, this two-minute wait seems to be hardcoded, but maybe I'm wrong. I want Prometheus not to fill up the logs with these insignificant error messages. I believe I could get there by telling it to stop sending a "wait=" parameter longer than the server is willing to wait, but that seems impossible without editing the source and recompiling Prometheus.

How do I get Prometheus to stop reporting these insignificant error messages? And how do I tell Prometheus not to hold a TCP socket to consul open longer than necessary (especially longer than scrape_interval)?