On Saturday, March 2nd, 2024 at 11:04 AM, Aaron Burt <aa...@bavariati.org> wrote:
> Prelude: So I'm ~15 yrs out of date on dealing with datacenters, and
> consciously avoided big commercial apps like SolarWinds, so the only
> tooling I know is the kind that looks like a small woodland creature
> cobbled it together.

Well, that's better than nothing. The previous technician wasn't actually an IT person; he was some dude who ran electrical cables for a construction project at Intel and talked real confidently in the interview. During my first week one of the network switches went down, and in the mad dash to fix it I watched in horror as he plugged a gigabit switch into itself. The managed switch it was connected to then worked as intended and disabled the port, which led everyone to believe the ports weren't working. He has since been fired, but that just means I have to clean up the mess.

At the moment my focus is inventory management, since I keep finding RTX 2000/3000 cards in the "new" bin that aren't actually new. Focusing on small, modular systems that don't collide with other tools is, I think, the better approach here. I sense a lot of nervousness when it comes to who does what.

> On 2024-03-01 21:36, Ben Koenig wrote:
> [snip]
> > I'm basically looking for tools to help me set up the following
> > infrastructure:
> >
> > - server documentation. Type, hardware configuration, and parts
> > compatibility
>
> Big Google spreadsheet with the basics, per-model info and per-server
> exceptions on other tabs. Or if it doesn't change often, all of that
> stuffed onto a wiki/Confluence page. You can cross-check it against
> your server monitoring agent data periodically. I think of this as
> "prescriptive" - what the systems should be; monitoring (see below) is
> "descriptive" - what the systems are.

That's actually a really good way to phrase it. The "prescriptive" side is exactly what is missing right now. There are attempts to describe what is installed, and how, but without knowing what it should be, everyone keeps getting confused.
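The cross-check you describe could start as something very small. A sketch (the hostnames, field names, and report shape here are all made up for illustration): treat the spreadsheet export as the prescribed state, the monitoring agent data as the observed state, and diff them.

```python
def cross_check(prescribed: dict, observed: dict) -> dict:
    """Compare what each server *should* have against what monitoring *sees*.

    prescribed/observed map hostname -> {"gpus": int, ...}.
    Returns hosts that are missing, unexpected, or mismatched.
    """
    report = {"missing": [], "unexpected": [], "mismatched": {}}
    for host, spec in prescribed.items():
        if host not in observed:
            report["missing"].append(host)        # prescribed but never seen
            continue
        diffs = {k: (v, observed[host].get(k))    # field: (expected, actual)
                 for k, v in spec.items()
                 if observed[host].get(k) != v}
        if diffs:
            report["mismatched"][host] = diffs
    report["unexpected"] = [h for h in observed if h not in prescribed]
    return report

prescribed = {"gpu01": {"gpus": 4}, "gpu02": {"gpus": 2}}   # the spreadsheet
observed   = {"gpu01": {"gpus": 3}, "gpu03": {"gpus": 4}}   # the agents
report = cross_check(prescribed, observed)
# gpu02 is missing, gpu03 is unexpected, gpu01 shows 3 GPUs instead of 4
```

Run that from cron and you get a cheap nightly "the spreadsheet is lying" report without committing to any particular tool yet.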
> > - temp monitoring. Many of the servers are running CUDA applications
> > on Dual/Quad GPU systems. They get toasty.
>
> I'm shocked they don't already have a monitoring agent on those guys.
> There's SAAS infra monitoring products from Datadog, Scout, Splunk etc.,
> but be mindful of cost ($25/mo/server?) For self-hosted,
> Telegraf/Influxdb/Grafana is a good combo. That gets you alerting as
> well for those delicious 3AM wake-up calls.

They do... sort of. The software team is in another country and lets us know when there is a problem. Based on some of the random tidbits they've mentioned, I can tell they have something to manage the applications, but it's not a system I have access to; it appears to be tied specifically to their apps or Docker instances.

The PDUs all appear to have SNMP functionality and are mostly the same model, but none of them are connected. Some of them even have old configurations on them and show a flashing red light because they aren't plugged into a network.

There's no ticket system, just a chat message. But I did manage to train one guy to give me the PCIe ID of the card that fails. Once I document how the PCIe slots are ordered on all the various systems, I can jump right to the card that failed, no more guess-and-check nonsense.

> > - power consumption monitoring. Our PDUs are able to report usage via
> > a network interface, but nobody ever bothered to set it up. Would be
> > nice to have a dashboard that shows when one of the servers freaks out
> > and trips the breaker.
>
> The server monitoring agent can snag that info so you can graph,
> dashboard and alert on it alongside your servers, ambient temp/humidity
> sensors and whatever else you can find to measure.
>
> I hope you have some kind of platform-management in place such as
> Ansible, Puppet or SaltStack.

They do, and they don't.
I have seen Ansible components running on the systems when I SSH in, but there is also a lot of manual configuration that they have me do. For example, if a server fails and needs to be replaced, we have spare servers with the OS already installed that I just slide into the rack, but I have to reconfigure the IP address manually in order for the remote team to see it. IP, IPMI, gateway, DNS, and hostname are all configured by hand, by me. OS installation is also me... there is no custom image, I just use a vanilla ubuntu-server USB stick.

> How's your HVAC looking? That's a good %age of your OpEx, and whether
> or not your predecessor documented technician visits, you probably want
> to get someone trustworthy to check things over while the weather's
> still cool.

HVAC is not my problem! We lease cage space at a datacenter that handles all the building facilities. They have a dashboard to monitor power consumption, but it isn't connected to our individual servers.

Luckily "management" is open to improvements in this area. I keep flip-flopping between writing my own application and using one that already exists. I'm familiar with Django since I use it for my personal website, but at the moment I'm the only person running the servers, so trying to wrangle Python dependencies while also bench-testing hundreds of used GPUs isn't going to end well. They are looking to hire another technician, but I need to make sure we have something I can point to and say "this is how it is configured".

ATM I'm experimenting with InvenTree for inventory management. The parts we have are not cheap, and 10Gb fiber cards of different modules/form factors are just stuffed into plastic bins labeled "networking". Sure, we might have over 100 parts for a given card... but do all those cards work in all servers? Turns out no, so in some cases the number of cards actually available for a given server is 0.

InvenTree is built on Django.
But they have created a fresh new kind of dependency hell.

-Ben
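P.S. For the GPU temps, even before standing up Telegraf, a cron-able script could be a stopgap. A sketch, assuming `nvidia-smi` is on the PATH on those CUDA boxes; the 85 C alert threshold is my guess, not anything from our hardware docs:

```python
# Poll GPU temperatures via nvidia-smi's CSV query output.
import subprocess

ALERT_C = 85  # hypothetical threshold; tune to the cards' real throttle point

def parse_temps(csv_output: str) -> dict:
    """Parse 'index, temperature.gpu' CSV rows into {gpu_index: temp_c}."""
    temps = {}
    for line in csv_output.strip().splitlines():
        idx, temp = (field.strip() for field in line.split(","))
        temps[int(idx)] = int(temp)
    return temps

def read_gpu_temps() -> dict:
    """Ask nvidia-smi for per-GPU temperatures in machine-readable form."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return parse_temps(out)
```

Calling `read_gpu_temps()` on a quad-GPU box returns something like `{0: 71, 1: 88, ...}`; anything at or above `ALERT_C` is worth a chat message until real alerting exists.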
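P.S. On the inventory side, the "cards available is actually 0" check is easy to encode once compatibility is recorded per part. A sketch with made-up part names and server models (in InvenTree this data would live on part parameters rather than hand-written dicts):

```python
def usable_stock(server_model: str, stock: dict, compat: dict) -> dict:
    """Count on-hand parts that are actually compatible with a server model.

    stock:  part name -> quantity in the bin
    compat: part name -> set of server models it is known to work in
    """
    return {part: qty for part, qty in stock.items()
            if server_model in compat.get(part, set())}

stock = {"X520-DA2 (low profile)": 112, "X520-DA2 (full height)": 9}
compat = {"X520-DA2 (low profile)": {"R640", "R740"},
          "X520-DA2 (full height)": {"R740"}}

# Over 100 cards in the bin, but for a 1U chassis only low-profile counts:
r640 = usable_stock("R640", stock, compat)   # {"X520-DA2 (low profile)": 112}
```

For a server model nobody recorded compatibility for, the answer comes back empty, which is exactly the "we have 100 parts and 0 usable ones" situation worth surfacing.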