[prometheus-users] How to sense disk read/write errors

2023-05-18 Thread M Moore
Had two users come at me with "why didn't you...?" because of a machine
that had disk hardware failures, but no alerts before the device died.
They pointed at these messages in the kernel dmesg:
> [Wed May 17 06:07:05 2023] nvme nvme3: async event result 00010300
> [Wed May 17 06:07:25 2023] nvme nvme3: controller is down; will reset: CSTS=0x2, PCI_STATUS=0x10
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Thu May 18 08:06:04 2023] Buffer I/O error on dev nvme3n1, logical block 390703424, async page read
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 0
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 256

I didn't find an "errors" counter in iostats [1], so I can guess
node_exporter won't have it.  I did find node_filesystem_device_error,
but that was zero the whole time.  What would be the Prometheus-y way to
sense these errors so my users can have their alerts?  I'm hoping to
avoid something like "logtail | grep -c 'error'" feeding a counter.

[1] https://www.kernel.org/doc/html/latest/admin-guide/iostats.html
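The usual escape hatch for metrics node_exporter doesn't ship is its textfile collector: a cron job writes a .prom file into the directory given by --collector.textfile.directory, and node_exporter exposes whatever it finds there. A minimal sketch of the parsing step, assuming smartctl-style NVMe smart-log output (stubbed here with a sample so it runs standalone; the metric names and device label are made up for illustration, and field names vary by smartctl version):

```python
# Sketch: turn NVMe SMART error counters into node_exporter
# textfile-collector format.  In real use, replace SAMPLE with the
# output of something like:  smartctl -A /dev/nvme3n1
# (check your smartctl's actual field names first).
import re

SAMPLE = """\
Critical Warning:                   0x00
Media and Data Integrity Errors:    3
Error Information Log Entries:      17
"""

# Hypothetical metric names; *_total values only ever increase on the
# device, so alerts can use increase() / rate() over them.
FIELDS = {
    "Media and Data Integrity Errors": "nvme_media_errors_total",
    "Error Information Log Entries": "nvme_error_log_entries_total",
}

def to_textfile(smart_output: str, device: str) -> str:
    """Render matching 'Name:   <integer>' lines as Prometheus samples."""
    lines = []
    for raw in smart_output.splitlines():
        m = re.match(r"^(.+?):\s+(\d+)$", raw)
        if m and m.group(1) in FIELDS:
            lines.append(f'{FIELDS[m.group(1)]}{{device="{device}"}} {m.group(2)}')
    return "\n".join(lines) + "\n"

print(to_textfile(SAMPLE, "nvme3n1"), end="")
```

Write the result to a temp file and rename() it into the textfile directory so node_exporter never reads a half-written file; an alert along the lines of increase(nvme_media_errors_total[1h]) > 0 would then have fired well before the device died.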

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CACDZGiKxT-kKodJQe44TL5-DRKwZ5fpazPhvkb4FijGS8iWjsQ%40mail.gmail.com.


Re: [prometheus-users] EOF errors waiting on consul_sd_config requests

2020-10-08 Thread M Moore
> which version
prometheus, version 2.19.0 (branch: HEAD, revision:
5d7e3e970602c755855340cb190a972cebdd2ebf)
  go version:   go1.14.4

does it go away in a newer version?



[prometheus-users] EOF errors waiting on consul_sd_config requests

2020-10-06 Thread M Moore
I have a Prometheus server using consul_sd_config for target discovery.
Some services aren't always present, but when they appear I want Prometheus
to scrape them.  consul_sd_config does this nicely.  But it complains a lot.

level=error ts=2020-10-06T16:52:58.990Z caller=consul.go:503
component="discovery manager scrape" discovery=consul msg="Error refreshing
service" service=nginx tags= err="Get \"
http://consul-cluster:8500/v1/health/service/nginx?dc=sandbox=9==12ms\":
EOF"

I don't currently have any nginx services, but that's okay, they come and
go as cloud stuff does.  While there are no nginx services published in the
consul catalog, when I try that URL myself I get

time curl 'http://consul-cluster:8500/v1/health/service/apache?dc=sandbox=9==12ms'
curl: (52) Empty reply from server
real 0m59.077s

So the server drops the connection after 59 seconds, but Prometheus
waits 2 * time.Minute for the server to compose a non-empty answer that
will never come, tying up a TCP connection for the whole 59 seconds
(120s intended) in the process.

If I try the URL without the "wait=" parameter I get the correct answer
immediately: an empty list.  And waiting longer than the scrape_interval
seems nonsensical, since the next scrape could start before the previous
wait completes.

Looking in prometheus/discovery/consul/consul.go it seems this two-minute
wait is hardcoded, but maybe I'm wrong.
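The behavior described above can be reproduced without a Consul cluster. A small local stub (this is NOT Consul; it only mimics the observed semantics, and takes "wait" in plain seconds rather than Consul's "2m"-style durations) shows the difference between asking plainly and asking with a long "wait=":

```python
# Stub server: answers /v1/health/service/<name> with an empty JSON list.
# With no "wait" parameter it answers immediately; with "wait=N" it holds
# the request open N seconds first, like a blocking query with no change
# to report.
import json
import threading
import time
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubConsul(BaseHTTPRequestHandler):
    def do_GET(self):
        query = urllib.parse.parse_qs(urllib.parse.urlparse(self.path).query)
        if "wait" in query:
            time.sleep(float(query["wait"][0]))  # hold the connection open
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b"[]")  # no healthy instances registered

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), StubConsul)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def fetch(path):
    """Return (decoded body, elapsed seconds) for a GET to the stub."""
    start = time.monotonic()
    with urllib.request.urlopen(f"http://127.0.0.1:{port}{path}") as resp:
        body = json.load(resp)
    return body, time.monotonic() - start

body, fast = fetch("/v1/health/service/nginx")          # instant "[]"
body2, slow = fetch("/v1/health/service/nginx?wait=2")  # same "[]", 2s later
print(f"no wait: {body} in {fast:.1f}s; wait=2: {body2} in {slow:.1f}s")
server.shutdown()
```

Both requests yield the same empty list; the only thing the long wait buys is an open TCP connection, which is exactly the complaint: the answer is identical, just delayed until someone gives up.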

I want Prometheus to stop filling the logs with these insignificant
error messages.  I believe I could get there by telling it to stop
sending a "wait=" parameter longer than the server is willing to wait,
but that seems impossible without editing the source and recompiling
Prometheus.

How do I get Prometheus to stop reporting these insignificant errors?
And how do I tell it not to hold a TCP socket to Consul open longer than
necessary (especially longer than scrape_interval)?
