Prelude: I'm ~15 years out of date on dealing with datacenters, and I've consciously avoided big commercial apps like SolarWinds, so the only tooling I know is the kind that looks like a small woodland creature cobbled it together.

On 2024-03-01 21:36, Ben Koenig wrote:
[snip]
I'm basically looking for tools to help me set up the following
infrastructure:

- server documentation. Type, hardware configuration, and parts compatibility

Big Google spreadsheet with the basics, per-model info and per-server exceptions on other tabs. Or if it doesn't change often, all of that stuffed onto a wiki/Confluence page. You can cross-check it against your server monitoring agent data periodically. I think of this as "prescriptive" - what the systems _should_ be; monitoring (see below) is "descriptive" - what the systems _are_.
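
A rough sketch of that periodic cross-check, in Python. Everything here is an assumption for illustration: the hostnames, the attribute names, and the idea that you have exported both the spreadsheet and the agent data into dicts keyed by hostname. Adapt it to whatever your tools actually emit.

```python
# Hypothetical data shape: both dicts map hostname -> {attribute: value}.
# "expected" is the prescriptive side (spreadsheet/wiki export);
# "observed" is the descriptive side (monitoring agent data).

def find_drift(expected, observed):
    """Return {host: {attr: (expected_value, observed_value)}} for every mismatch."""
    drift = {}
    for host, spec in expected.items():
        actual = observed.get(host)
        if actual is None:
            # Documented, but the agent has never reported it.
            drift[host] = {"_host": ("documented", "not reporting")}
            continue
        diffs = {
            attr: (want, actual.get(attr))
            for attr, want in spec.items()
            if actual.get(attr) != want
        }
        if diffs:
            drift[host] = diffs
    # Hosts that report in but were never documented.
    for host in observed.keys() - expected.keys():
        drift[host] = {"_host": ("undocumented", "reporting")}
    return drift


# Made-up example data.
expected = {
    "gpu01": {"model": "ExampleModel-A", "gpu_count": 4, "ram_gb": 256},
    "gpu02": {"model": "ExampleModel-A", "gpu_count": 2, "ram_gb": 128},
}
observed = {
    "gpu01": {"model": "ExampleModel-A", "gpu_count": 3, "ram_gb": 256},
    "gpu03": {"model": "ExampleModel-B", "gpu_count": 2, "ram_gb": 128},
}

for host, diffs in sorted(find_drift(expected, observed).items()):
    print(host, diffs)
```

With the sample data this flags gpu01 (a GPU has gone missing), gpu02 (documented but silent) and gpu03 (reporting but undocumented), which are the three kinds of surprise you want to hear about.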

- temp monitoring. Many of the servers are running CUDA applications
on Dual/Quad GPU systems. They get toasty.

I'm shocked they don't already have a monitoring agent on those guys. There are SaaS infra monitoring products from Datadog, Scout, Splunk etc., but be mindful of cost ($25/mo/server?). For self-hosted, Telegraf/InfluxDB/Grafana is a good combo. That gets you alerting as well for those delicious 3AM wake-up calls.
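
If you want to eyeball GPU temps before a full agent is rolled out, here's a minimal Python sketch built on nvidia-smi's CSV query output. The 85 C limit and the sample text are placeholders I made up; check the rated limits for your actual cards. It assumes nvidia-smi is on the PATH of the box you run it on.

```python
import subprocess

# nvidia-smi query flags for machine-readable output.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_temps(csv_text):
    """Parse 'index, name, temp' lines into (index, name, temp_c) tuples."""
    readings = []
    for line in csv_text.strip().splitlines():
        index, rest = line.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        try:
            readings.append((int(index), name.strip(), int(temp)))
        except ValueError:
            # Some cards report a non-numeric value; skip those lines.
            continue
    return readings


def hot_gpus(readings, limit_c=85):
    """Return the readings at or above limit_c (placeholder threshold)."""
    return [r for r in readings if r[2] >= limit_c]


def read_local_gpus():
    """Run nvidia-smi on this host and parse its output."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_temps(out)


# Made-up sample output, so the parsing can be tried on a box without GPUs.
sample = "0, Example GPU, 67\n1, Example GPU, 91\n"
print(hot_gpus(parse_gpu_temps(sample)))
```

On a real GPU host you'd call `hot_gpus(read_local_gpus())` from cron or similar. Once a proper agent is collecting the same numbers, retire the script and alert from there instead.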

- power consumption monitoring. Our PDUs are able to report usage via
a network interface, but nobody ever bothered to set it up. Would be
nice to have a dashboard that shows when one of the servers freaks out
and trips the breaker.

The server monitoring agent can snag that info so you can graph, dashboard and alert on it alongside your servers, ambient temp/humidity sensors and whatever else you can find to measure.
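
For the "freaks out and trips the breaker" case, the alert rule boils down to something like the Python sketch below: a circuit that was carrying real load suddenly reads near zero. How you get the readings off the PDU (SNMP, vendor API, whatever it speaks) depends on the model and isn't shown; the sample data and both thresholds are made up.

```python
def find_trips(samples, min_load_w=200.0, dead_w=5.0):
    """Find points where a loaded circuit suddenly reads (near) zero.

    samples: list of (timestamp, watts) tuples in time order.
    Returns a list of (timestamp_of_drop, watts_just_before_drop).
    Both thresholds are placeholder assumptions.
    """
    trips = []
    for (_, prev_w), (t, cur_w) in zip(samples, samples[1:]):
        if prev_w >= min_load_w and cur_w <= dead_w:
            trips.append((t, prev_w))
    return trips


# Made-up readings for one PDU circuit: a spike, then nothing.
samples = [
    ("03:00", 1450.0),
    ("03:01", 1480.0),
    ("03:02", 2310.0),
    ("03:03", 0.0),
    ("03:04", 0.0),
]
print(find_trips(samples))
```

Keeping the last reading before the drop is the useful part: it tells you how much the circuit was pulling when it went, which helps you work out which server to blame.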

I hope you have some kind of configuration management in place, such as Ansible, Puppet or SaltStack.

How's your HVAC looking? That's a good %age of your OpEx, and whether or not your predecessor documented technician visits, you probably want to get someone trustworthy to check things over while the weather's still cool.

Good luck and catch a nap when you can,
  Aaron
