On Saturday, March 2nd, 2024 at 11:04 AM, Aaron Burt <aa...@bavariati.org> 
wrote:

> Prelude: So I'm ~15 yrs out of date on dealing with datacenters, and
> consciously avoided big commercial apps like SolarWinds, so the only
> tooling I know is the kind that looks like a small woodland creature
> cobbled it together.
> 

Well, that's better than nothing. The previous technician wasn't actually an IT 
person; he was some dude who ran electrical cables for a construction project 
at Intel and talked real confidently in the interview. During my first week one 
of the network switches went down, and in the mad dash to fix it I watched in 
horror as he plugged a gigabit switch into itself. The managed switch it was 
connected to then worked as intended and disabled the port, which led everyone 
to believe the ports weren't working. At the moment my focus is inventory 
management, since I keep finding RTX 2000/3000 cards in the "new" bin that 
aren't actually new.

He has since been fired. But that just means I have to clean up the mess. 
Focusing on small, modular systems that don't collide with other tools is, I 
think, the better approach here. I sense a lot of nervousness when it comes to 
who does what.


> On 2024-03-01 21:36, Ben Koenig wrote:
> [snip]
> 
> > I'm basically looking for tools to help me set up the following
> > infrastructure:
> > 
> > - server documentation. Type, hardware configuration, and parts
> > compatibility
> 
> 
> Big Google spreadsheet with the basics, per-model info and per-server
> exceptions on other tabs. Or if it doesn't change often, all of that
> stuffed onto a wiki/Confluence page. You can cross-check it against
> your server monitoring agent data periodically. I think of this as
> "prescriptive" - what the systems should be; monitoring (see below) is
> "descriptive" - what the systems are.
> 

That's actually a really good way to phrase it. The 'prescriptive' side is 
really what is missing right now. There are attempts to describe what is 
installed and how, but without knowing what it should be, everyone keeps 
getting confused.
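
My first pass at that cross-check will probably be something dumb and 
effective: export the spreadsheet to CSV, dump whatever the monitoring agent 
reports to JSON, and diff the two. A rough sketch, with all the file names and 
field names made up:

# cross_check.py - compare prescriptive inventory (spreadsheet CSV export)
# against descriptive data from a monitoring agent (JSON dump).
# File names and field names below are placeholders, not a real schema.
import csv
import json

def load_prescribed(path="inventory.csv"):
    # assumes hostname,model,gpu_count columns in the spreadsheet export
    with open(path, newline="") as f:
        return {row["hostname"]: row for row in csv.DictReader(f)}

def load_observed(path="agent_dump.json"):
    # assumed shape: {"hostname": {"model": ..., "gpu_count": ...}, ...}
    with open(path) as f:
        return json.load(f)

def diff(prescribed, observed):
    for host, spec in prescribed.items():
        seen = observed.get(host)
        if seen is None:
            print(f"{host}: in spreadsheet but not reporting")
            continue
        if int(spec["gpu_count"]) != int(seen["gpu_count"]):
            print(f"{host}: should have {spec['gpu_count']} GPUs, "
                  f"agent sees {seen['gpu_count']}")
    for host in observed.keys() - prescribed.keys():
        print(f"{host}: reporting but not in spreadsheet")

if __name__ == "__main__":
    diff(load_prescribed(), load_observed())

Run it from cron once a day and the spreadsheet stops rotting, at least in 
theory.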


> > - temp monitoring. Many of the servers are running CUDA applications
> > on Dual/Quad GPU systems. They get toasty.
> 
> 
> I'm shocked they don't already have a monitoring agent on those guys.
> There are SaaS infra monitoring products from Datadog, Scout, Splunk, etc.,
> but be mindful of cost ($25/mo/server?). For self-hosted,
> Telegraf/Influxdb/Grafana is a good combo. That gets you alerting as
> well for those delicious 3AM wake-up calls.


They do... sort of. The software team is in another country and lets us know 
when there is a problem. Based on some of the random tidbits they've mentioned, 
I can tell they have something to manage the applications. But it's not a 
system I have access to; it appears to be tied specifically to their apps or 
Docker instances. The PDUs all appear to have SNMP functionality and are mostly 
the same model, but none of them are connected. Some even have old 
configurations on them and show a flashing red light because they aren't 
plugged into a network.
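
Once they're actually cabled up, polling them shouldn't be hard. Something 
like this is what I have in mind; the addresses, community string, and 
power-draw OID below are all placeholders, since the real OID depends on the 
vendor's MIB:

# poll_pdus.py - read power draw from each PDU over SNMP.
# Requires net-snmp's snmpget on the PATH. Addresses, the community
# string, and the OID are placeholders; the real power-draw OID comes
# from the PDU vendor's MIB.
import subprocess

PDUS = ["10.0.0.11", "10.0.0.12"]       # placeholder addresses
COMMUNITY = "public"                    # placeholder community string
POWER_OID = "1.3.6.1.4.1.9999.1.1.1"    # placeholder OID

def read_power(host):
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", host, POWER_OID],
        capture_output=True, text=True, timeout=5,
    )
    if out.returncode != 0:
        return None
    return out.stdout.strip()

for pdu in PDUS:
    watts = read_power(pdu)
    print(f"{pdu}: {watts if watts is not None else 'no response'}")

Feed the same OIDs to Telegraf's SNMP input later and the script becomes 
disposable, which is the point.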


No ticket system, just a chat message. But I did manage to train one guy to 
give me the PCIe ID of the card that fails. Once I document how the PCIe slots 
are ordered on all the various systems, I can jump right to the failed card; 
no more guess-and-check nonsense.
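
For building that slot map, nvidia-smi can dump the PCIe bus ID alongside the 
temperature, so one script covers both the documentation and a poor man's temp 
check. Rough sketch; the mapping from bus ID to physical slot still has to be 
recorded by hand, once per chassis model:

# gpu_map.py - list each GPU's index, PCIe bus ID, name, and temperature.
# Uses nvidia-smi's CSV query output; run on each server to build the
# slot map.
import subprocess

QUERY = "index,pci.bus_id,name,temperature.gpu"

out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)

for line in out.stdout.strip().splitlines():
    index, bus_id, name, temp = [field.strip() for field in line.split(",")]
    print(f"GPU {index}  {bus_id}  {name}  {temp} C")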

> 
> > - power consumption monitoring. Our PDUs are able to report usage via
> > a network interface, but nobody ever bothered to set it up. Would be
> > nice to have a dashboard that shows when one of the servers freaks out
> > and trips the breaker.
> 
> 
> The server monitoring agent can snag that info so you can graph,
> dashboard and alert on it alongside your servers, ambient temp/humidity
> sensors and whatever else you can find to measure.
> 
> I hope you have some kind of platform-management in place such as
> Ansible, Puppet or SaltStack.
> 

They do, and they don't. I have seen Ansible components running on the systems 
when I SSH in, but there is also a lot of manual configuration that they have 
me do. For example, if a server fails and needs to be replaced, we have spare 
servers with the OS already installed that I just slide into the rack. But I 
have to reconfigure the IP address manually in order for the remote team to see 
it. IP, IPMI, gateway, DNS, and hostname are all manually configured by me. OS 
installation is also me... there is no custom image; I just use a vanilla 
ubuntu-server USB stick.
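
Even without access to their Ansible setup, I can at least script my half of 
it. Something like this is what I'm planning for the swap-in procedure; the 
hostname, interface name, addresses, and IPMI channel are all per-site guesses 
you'd fill in for each machine:

# provision.py - apply the per-server settings I currently do by hand:
# hostname, static IP via netplan, and the BMC address via ipmitool.
# Run as root. Every value below is a placeholder for this sketch.
import subprocess

HOSTNAME = "gpu-node-07"        # placeholder
IFACE = "eno1"                  # placeholder interface name
ADDR = "10.0.1.57/24"           # placeholder
GATEWAY = "10.0.1.1"            # placeholder
DNS = ["10.0.1.2", "10.0.1.3"]  # placeholder
IPMI_ADDR = "10.0.2.57"         # placeholder
IPMI_CHANNEL = "1"              # often 1, but varies by board

NETPLAN = f"""network:
  version: 2
  ethernets:
    {IFACE}:
      addresses: [{ADDR}]
      routes:
        - to: default
          via: {GATEWAY}
      nameservers:
        addresses: [{', '.join(DNS)}]
"""

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run("hostnamectl", "set-hostname", HOSTNAME)

with open("/etc/netplan/01-static.yaml", "w") as f:
    f.write(NETPLAN)
run("netplan", "apply")

# Point the BMC at a static address so the remote team can reach IPMI.
run("ipmitool", "lan", "set", IPMI_CHANNEL, "ipsrc", "static")
run("ipmitool", "lan", "set", IPMI_CHANNEL, "ipaddr", IPMI_ADDR)
run("ipmitool", "lan", "set", IPMI_CHANNEL, "netmask", "255.255.255.0")
run("ipmitool", "lan", "set", IPMI_CHANNEL, "defgw", "ipaddr", GATEWAY)

Even as a checklist-in-code it beats retyping five settings into two different 
interfaces every swap.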


> How's your HVAC looking? That's a good %age of your OpEx, and whether
> or not your predecessor documented technician visits, you probably want
> to get someone trustworthy to check things over while the weather's
> still cool.
> 
 
HVAC is not my problem! We lease cage space at a datacenter that handles all 
the building facilities. They have a dashboard to monitor power consumption, 
but that is not connected to our individual servers.


Luckily "management" is open to improvements in this area. I am kind of 
flip-flopping between creating my own application, or using one that already 
exists. i am familiar with django since I use it for my personal website, but 
at the moment I'm the only person running the servers so trying to wrangle 
python dependencies while also bench-testing hundreds of used GPUs isn't going 
to end well. They are looking to hire another technician, but I need to make 
sure we have something that I can point to and say "this is how it is 
configured".

ATM I'm experimenting with InvenTree for inventory management. The parts we 
have are not cheap, and 10Gb fiber cards with different modules/form factors 
are just stuffed into plastic bins labeled "networking". Sure, we might have 
over 100 of a given card... but do all those cards work in all servers? Turns 
out no... so in some cases the number of cards available for a given server is 
actually 0.
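
That's exactly the question I want the inventory system to answer directly. 
Conceptually it's just a compatibility table joined against stock counts; here 
is a toy version of what I'm after, with made-up part and server names:

# availability.py - toy model of "how many of this card can THIS server
# actually use?" All part names, server models, and counts are made up.
from collections import Counter

stock = Counter({"nic-x520-lp": 40, "nic-x710-fh": 65})  # cards on the shelf

# which cards each server model accepts (hypothetical)
compatible = {
    "supermicro-2u": {"nic-x520-lp", "nic-x710-fh"},
    "dell-1u":       {"nic-x520-lp"},
    "tyan-4u-gpu":   set(),   # oops: none of the stocked cards fit
}

for model, accepted in compatible.items():
    usable = sum(stock[part] for part in accepted)
    print(f"{model}: {usable} usable cards in stock")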

InvenTree is built on Django. But they have created a fresh new kind of 
dependency hell.
-Ben
