Prelude: So I'm ~15 yrs out of date on dealing with datacenters, and
I've consciously avoided big commercial apps like SolarWinds, so the
only tooling I know is the kind that looks like a small woodland
creature cobbled it together.
On 2024-03-01 21:36, Ben Koenig wrote:
[snip]
I'm basically looking for tools to help me set up the following
infrastructure:
- server documentation. Type, hardware configuration, and parts
compatibility
A big Google spreadsheet with the basics on the main tab, plus per-model
info and per-server exceptions on other tabs. Or, if it doesn't change
often, stuff all of that onto a wiki/Confluence page. You can cross-check it against
your server monitoring agent data periodically. I think of this as
"prescriptive" - what the systems _should_ be; monitoring (see below) is
"descriptive" - what the systems _are_.
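
For what it's worth, the periodic cross-check can be a very small script.
Here's a rough sketch in Python, assuming you can export the spreadsheet
and the agent data into plain dicts keyed by hostname. The hostnames,
field names and the find_drift helper are all made up for illustration;
use whatever your own export actually contains.

```python
from typing import Dict, List

# Hypothetical shape: hostname -> {field: value}. Not from any real tool.
Inventory = Dict[str, Dict[str, object]]


def find_drift(expected: Inventory, actual: Inventory) -> List[str]:
    """List where the documented ("should be") record and the
    monitored ("is") record disagree."""
    problems = []
    for host, spec in sorted(expected.items()):
        seen = actual.get(host)
        if seen is None:
            problems.append(f"{host}: documented but not reporting")
            continue
        for field, want in sorted(spec.items()):
            got = seen.get(field)
            if got != want:
                problems.append(f"{host}: {field} should be {want!r}, is {got!r}")
    # Anything the agent sees that nobody wrote down.
    for host in sorted(set(actual) - set(expected)):
        problems.append(f"{host}: reporting but not documented")
    return problems


# Made-up example data.
expected = {
    "gpu01": {"model": "ExampleServer 4U", "gpu_count": 4, "ram_gb": 256},
    "gpu02": {"model": "ExampleServer 4U", "gpu_count": 2, "ram_gb": 128},
}
actual = {
    "gpu01": {"model": "ExampleServer 4U", "gpu_count": 3, "ram_gb": 256},
    "gpu03": {"model": "ExampleServer 2U", "gpu_count": 2, "ram_gb": 128},
}

for line in find_drift(expected, actual):
    print(line)
```

Run it from cron once a week and mail yourself the output; an empty
report means the docs and reality still agree.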
- temp monitoring. Many of the servers are running CUDA applications
on Dual/Quad GPU systems. They get toasty.
I'm shocked they don't already have a monitoring agent on those guys.
There are SaaS infra monitoring products from Datadog, Scout, Splunk
etc., but be mindful of cost ($25/mo/server?). For self-hosted,
Telegraf/InfluxDB/Grafana is a good combo. That gets you alerting as
well, for those delicious 3AM wake-up calls.
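
If you want a quick look at GPU temps before any of that is installed,
something like the Python sketch below would do. It leans on nvidia-smi's
CSV query output; the 85 C limit and the helper names are my own
assumptions, so check the rated limits for your cards. In a real Telegraf
setup, its nvidia_smi input plugin would normally do the collecting
instead.

```python
import subprocess
from typing import List, Tuple

# Real nvidia-smi query flags; output is one "index, name, temp" row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]

LIMIT_C = 85  # assumed alert threshold, not a vendor figure


def read_gpu_temps() -> str:
    """Run nvidia-smi on the local box and return its raw CSV output."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return result.stdout


def parse_gpu_temps(csv_text: str) -> List[Tuple[int, str, int]]:
    """Turn the CSV rows into (index, name, temp_c) tuples.

    Assumes every GPU reports a numeric temperature.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [part.strip() for part in line.split(",")]
        rows.append((int(parts[0]), ", ".join(parts[1:-1]), int(parts[-1])))
    return rows


def hot_gpus(rows, limit_c=LIMIT_C):
    """Keep only the GPUs at or above the limit."""
    return [row for row in rows if row[2] >= limit_c]


# Canned sample so this runs on a machine without a GPU;
# on a real server, pass read_gpu_temps() instead.
sample = "0, Example GPU A, 71\n1, Example GPU A, 88\n"
for index, name, temp_c in hot_gpus(parse_gpu_temps(sample)):
    print(f"GPU {index} ({name}) is at {temp_c} C")
```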
- power consumption monitoring. Our PDUs are able to report usage via
a network interface, but nobody ever bothered to set it up. Would be
nice to have a dashboard that shows when one of the servers freaks out
and trips the breaker.
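The server monitoring agent can usually snag that info from the PDUs
over the network (SNMP is the common route, but check what your model
speaks), so you can graph, dashboard and alert on it alongside your
servers, ambient temp/humidity sensors and whatever else you can find to
measure.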
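
Once the readings are flowing, the alert logic itself is simple. A rough
Python sketch, assuming you already have per-PDU current in amps from
wherever your agent stores it: the PDU names, the 30 A breaker size and
the 80% warning line are all assumptions of mine, so substitute your real
ratings.

```python
from typing import Dict, List


def pdu_alerts(
    amps_by_pdu: Dict[str, float],
    breaker_amps: Dict[str, float],
    warn_fraction: float = 0.8,  # assumed rule-of-thumb headroom
) -> List[str]:
    """Flag PDUs that are close to their breaker rating, or drawing
    nothing at all (which may mean the breaker already tripped)."""
    alerts = []
    for pdu, amps in sorted(amps_by_pdu.items()):
        limit = breaker_amps[pdu]
        if amps == 0:
            alerts.append(f"{pdu}: 0 A drawn, breaker may have tripped")
        elif amps >= warn_fraction * limit:
            alerts.append(
                f"{pdu}: {amps:.1f} A is {amps / limit:.0%} of the {limit} A breaker"
            )
    return alerts


# Made-up readings for three hypothetical PDUs on 30 A breakers.
readings = {"pdu-a": 12.0, "pdu-b": 26.4, "pdu-c": 0.0}
breakers = {"pdu-a": 30, "pdu-b": 30, "pdu-c": 30}

for line in pdu_alerts(readings, breakers):
    print(line)
```

In Grafana you'd express the same two conditions as alert rules rather
than a script, but it helps to decide on the thresholds first.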
I hope you have some kind of platform-management in place such as
Ansible, Puppet or SaltStack.
How's your HVAC looking? That's a good %age of your OpEx, and whether
or not your predecessor documented technician visits, you probably want
to get someone trustworthy to check things over while the weather's
still cool.
Good luck and catch a nap when you can,
Aaron