Had two users come at me with "why didn't you...?" about a machine that
had disk hardware failures but no alerts before the device died. They
pointed at these messages in the kernel dmesg:

> [Wed May 17 06:07:05 2023] nvme nvme3: async event result 00010300
> [Wed May 17 06:07:25 2023] nvme nvme3: controller is down; will reset: CSTS=0x2, PCI_STATUS=0x10
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Thu May 18 08:06:04 2023] Buffer I/O error on dev nvme3n1, logical block 390703424, async page read
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 0
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 256

I didn't find an "errors" counter in iostats [1], so I guess
node_exporter won't have one. I did find node_filesystem_device_error,
but that was zero the whole time. What would be the prometheus-y way to
sense these errors so my users can have their alerts? I'm hoping to
avoid "logtail | grep -c 'error'" feeding a counter.

[1: https://www.kernel.org/doc/html/latest/admin-guide/iostats.html ]
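
The closest I've come up with so far is a cron job feeding
node_exporter's textfile collector, roughly like the sketch below.
Everything in it is an assumption on my part: that nvme-cli is
installed, that the plain "nvme smart-log" output has "media_errors"
and "critical_warning" lines (it does on the boxes I checked, but field
names may vary by version), and that node_exporter runs with
--collector.textfile.directory=/var/lib/node_exporter/textfile. The
metric names nvme_media_errors and nvme_critical_warning are made up by
me:

    #!/bin/sh
    # Untested sketch: dump NVMe SMART counters into a .prom file for the
    # textfile collector. The output path, metric names, and smart-log
    # field names are all assumptions; check your own hardware/version.
    OUT=/var/lib/node_exporter/textfile/nvme_smart.prom
    TMP="$OUT.$$"
    {
      echo '# HELP nvme_media_errors Media and data integrity errors from the NVMe SMART log.'
      echo '# TYPE nvme_media_errors counter'
      for dev in /dev/nvme[0-9]; do
        n=$(nvme smart-log "$dev" 2>/dev/null |
            awk -F: '/^media_errors/ {gsub(/[ ,]/, "", $2); print $2}')
        [ -n "$n" ] && echo "nvme_media_errors{device=\"${dev#/dev/}\"} $n"
      done
      echo '# HELP nvme_critical_warning Critical warning field from the NVMe SMART log.'
      echo '# TYPE nvme_critical_warning gauge'
      for dev in /dev/nvme[0-9]; do
        n=$(nvme smart-log "$dev" 2>/dev/null |
            awk -F: '/^critical_warning/ {gsub(/[ ,]/, "", $2); print $2}')
        [ -n "$n" ] && echo "nvme_critical_warning{device=\"${dev#/dev/}\"} $n"
      done
    } > "$TMP" && mv "$TMP" "$OUT"  # rename so node_exporter never reads a half-written file

Then something like increase(nvme_media_errors[1h]) > 0 or
nvme_critical_warning > 0 could drive the alert, and (assuming the SMART
counters actually move before the controller dies) my users would have
had their warning. Is there a more standard exporter or collector for
this, or is the textfile route the usual answer?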
