I just noticed that one of my oVirt physical hosts has been rebooting due to an apparent hardware voltage fault. It's a Dell, and I've got their tools installed and am monitoring status, but the fault clears itself. It has apparently been doing this for a while now, and we didn't catch it for two reasons: (a) there weren't any VMs on it (there probably were the first time, but they were restarted elsewhere fast enough that nobody noticed), and (b) it reboots fast enough that at most it shows up in our monitoring system for one pass and then clears, so our NOC either didn't see it or assumed it was okay since it cleared.
oVirt has been logging alerts each time it happens, but seeing them requires someone to log in and check the logs. We have a bunch of different systems to manage, including multiple oVirt clusters, so nobody does that on a regular basis. We monitor most things with SNMP and/or CLI checks (we use PRTG, Nagios, and LibreNMS for various things).

What are people doing to monitor the health of their oVirt systems? Is it possible to get alerts emailed to admins? Is there any SNMP support in oVirt that would let external systems monitor its health? This setup is on 4.3.10, if that matters.

--
Chris Adams <c...@cmadams.net>