Are these 800 servers virtual or physical?

Are the physical servers home-built or commercial from a major brand (HP 
ProLiant, etc.)?

Are the servers all the same brand and model or are they a mishmash of pieces 
from different makers?

Are the servers yours or owned by customers?  That is, if they are virtual 
servers owned by remote customers do you have any responsibility to monitor 
them?

For "emergency notifications" the go-to for FOSS is "Big Sister":
https://bigsister.ch/  Set that up to ping the server interface, and if a 
server trips a breaker and goes offline, have Big Sister email a text-to-SMS 
gateway for your cell phone number.
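
To make the idea concrete, here is a rough Python sketch of the same 
check-then-notify loop.  It is NOT how Big Sister does it, just an 
illustration.  It uses a TCP connect instead of an ICMP ping (raw ping needs 
root), and the host names, the port, and the SMS gateway address are all 
made-up placeholders.

```python
import socket
from email.message import EmailMessage


def tcp_probe(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (port 22 is an assumption)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_alert(host, sms_gateway_addr, sender):
    """Build a short email aimed at a carrier's text-to-SMS gateway address."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = sms_gateway_addr
    msg["Subject"] = f"DOWN: {host}"
    msg.set_content(f"{host} stopped responding; check its breaker/PDU.")
    return msg


def check_hosts(hosts, sms_gateway_addr, sender, probe=tcp_probe):
    """Return one alert message per host that fails the probe."""
    return [build_alert(h, sms_gateway_addr, sender) for h in hosts if not probe(h)]
```

Each returned message could then be handed to a local mail server with 
smtplib.SMTP("localhost").send_message(msg), assuming one is listening there.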

For monitoring power consumption you have to configure the PDUs for that.  I've 
yet to see one of these that supports current monitoring but does not support 
SNMP, so once you get that going you can monitor power consumption with MRTG 
or, if you want to get fancy, https://www.cacti.net/   Cacti is based on 
RRDtool, which is the successor to MRTG:  https://oss.oetiker.ch/rrdtool/
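
If you want to sanity-check that a PDU answers SNMP before wiring it into 
MRTG or Cacti, a sketch along these lines might help.  It shells out to 
net-snmp's snmpget, so that has to be installed.  The OID is vendor-specific 
and you have to look it up in your PDU's MIB.  The "public" community string, 
the divide-by-10 scaling (some PDUs report tenths of an amp), and the 80% 
warning fraction are all assumptions to adjust for your hardware.

```python
import subprocess


def snmp_get(host, oid, community="public"):
    """Fetch one value via net-snmp's snmpget; -Oqv prints only the value."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return out.stdout.strip()


def to_amps(raw, divisor=10):
    """Convert a raw integer reading to amps (divisor is an assumed scale)."""
    return int(raw) / divisor


def near_trip(amps, breaker_amps, fraction=0.8):
    """True when the load reaches the chosen fraction of the breaker rating."""
    return amps >= breaker_amps * fraction
```

Usage would look like near_trip(to_amps(snmp_get(pdu_host, load_oid)), 20), 
where pdu_host and load_oid are whatever your PDU and its MIB call for.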

For monitoring piles of parts, you need a ticketing system.  The largest and 
oldest FOSS one with a large user community is Request Tracker (RT), which you 
can download here:

https://bestpractical.com/download-page

You will want to read the wiki for it:

https://rt-wiki.bestpractical.com/wiki/Main_Page

One thing I found very annoying with it (in earlier versions) is that it 
"hides" menu items that the user isn't authorized for.  Quite often you will 
run across advice on the forums saying "click X to do Y," yet X does not exist 
in your menu.  That leads to a deep dive to find out that X is only available 
to users in some admin group you haven't yet created, etc.  So basically you 
need to read all the documentation on it before you ever start installing it.

Note that if you are going to go the Django route, there's a ticketing system 
already out there written in Django  https://django-todo.org/

One last piece of advice for you and I know you are likely NOT going to take it 
now, but you will eventually,

This isn't a one-man show.  If you are the top dog admin, you need to be 
managing the tech under you and the vendors, NOT doing a deep dive into writing 
some Real Cool program.  With all due respect to Rich Shepard, you need to be 
writing ONLY the SOP manual he was talking about, and stay far far away from 
the scripting/coding like Django.  At best, push the techs under you to install 
and familiarize themselves with apps like Cacti and RT; do NOT do it yourself.  
That will give them "skin in the game," as it were.  You can't have them come 
running to you the minute something breaks in the management software (which it 
will).  Alternatively, if that's beyond their capabilities, farm it out to 
someone like Software Technology Group, Inc.: have them come in for a hit job, 
grab one of the techs, and make them sit through the install, setup, and 
configuration.

Set the policies and procedures and leave the how of doing it to the people 
under you.  You can give them suggestions like RT, but if they find something 
they like better, back off and let them run with it.

If you AREN'T the top dog admin and were just hired to "maintain the hardware," 
then no problem: outsource, outsource, outsource.  Go into your boss's office 
and tell them "if you aren't gonna give me your application developers' time or 
let me hire people, then I'm gonna spend money on vendors."

Your job is to be responsible.  The outsourcers can flake out at will; they are 
outsourcers specifically because they don't WANT to be responsible.

You have a setup that could go South very very quickly and unless you have 
support behind you, you will drown.  If you don't have peeps on site you can 
have vendors.  If your superiors don't understand this, then you are just the 
latest in a series of revolving admins and won't last.

Ted


-----Original Message-----
From: PLUG <plug-boun...@lists.pdxlinux.org> On Behalf Of Ben Koenig
Sent: Friday, March 1, 2024 9:37 PM
To: Portland Linux/Unix Group <plug@lists.pdxlinux.org>
Subject: [PLUG] Linux Software for Data Center Monitoring

Hey all,

I have a somewhat strange (or maybe not so strange) question regarding 
datacenter management at the hardware and software level. For some context: I 
have recently found myself in charge of on-site maintenance for a datacenter 
with 800+ servers. While the job itself is pretty simple as far as the RAID 
arrays and general hardware configuration is concerned there has been some 
drama regarding past technicians who weren't actually keeping track of 
anything. So I have piles of parts that may or may not be good, servers that 
are completely undocumented, and a grotesque mismatch of labeling schemes for 
the various ethernet/fiber cables and server types.

Does anyone here who works with SMB scale datacenter environments have any tips 
or industry standard strategies for wrangling this type of setup? Are there any 
good FOSS software tools to help organize and monitor a mess like this? We have 
a software team that keeps an eye on the applications, but they do not appear 
to be monitoring things like power consumption, temperature, or even tracking 
parts as they get re-used. Our server "map" is literally just a Google Sheets 
document that was formatted to look like server rows with IP addresses listed 
by physical location. And I'm pretty sure everyone hates it. So I'm basically 
looking for tools to help me set up the following infrastructure:

- server documentation. Type, hardware configuration, and parts compatibility
- temp monitoring. Many of the servers are running CUDA applications on 
Dual/Quad GPU systems. They get toasty.
- power consumption monitoring. Our PDUs are able to report usage via a network 
interface, but nobody ever bothered to set it up. Would be nice to have a 
dashboard that shows when one of the servers freaks out and trips the breaker.

Thoughts? Solutions? Apps? I'm just looking for ideas at the moment. Everything 
is running (or so I'm told) but we currently have a bus number of 1 which is 
obviously a recipe for disaster. I don't mind piecing together my own set of 
scripts and utilities but if something already exists that does the work for 
me, even better :) -Ben
