Are these 800 servers virtual or physical? Are the physical servers home-built or commercial from a major brand (HP ProLiant, etc.)?
Are the servers all the same brand and model, or are they a mishmash of pieces from different makers? Are the servers yours or owned by customers? That is, if they are virtual servers owned by remote customers, do you have any responsibility to monitor them?

For "emergency notifications" the go-to for FOSS is "Big Sister": https://bigsister.ch/ Set that up to ping the server interfaces, and if one trips a breaker and goes offline, have Big Sister email a text-to-SMS gateway for your cell phone number. (There's a rough sketch of that ping-and-page idea further down in this mail if you want to see how little is involved.)

For monitoring power consumption you have to configure the PDUs for that. I've yet to see one that supports current monitoring but does not support SNMP, so once you get that going you can graph power consumption with MRTG or, if you want to get fancy, Cacti: https://www.cacti.net/ Cacti is based on RRDtool, which is the successor to MRTG: https://oss.oetiker.ch/rrdtool/ (There's a quick SNMP polling sketch further down as well.)

For monitoring piles of parts, you need a ticketing system. The largest and oldest FOSS one with a big user community is Request Tracker (RT), which you can download here: https://bestpractical.com/download-page You will want to read the wiki for it: https://rt-wiki.bestpractical.com/wiki/Main_Page One thing I found very annoying with it (earlier versions) is that it "hides" menu items that the user isn't authorized for, so quite often you will run across advice on the forums saying "click X to do Y" yet X does not exist in your menu, causing a deep dive and drill-down to find out that X is only available to users in some admin group you haven't yet created, etc. So basically you need to read all the documentation on it before you ever start installing it.

Note that if you are going to go the Django route, there's a ticketing system already out there written in Django: https://django-todo.org/

One last piece of advice for you, and I know you are likely NOT going to take it now, but you will eventually: this isn't a one-man show. If you are the top dog admin, you need to be managing the techs under you and the vendors, NOT doing a deep dive into writing some Real Cool program. With all due respect to Rich Shepard, you need to be writing ONLY the SOP manual he was talking about, and stay far, far away from the scripting/coding like Django. At best, push the techs under you to install and familiarize themselves with apps like Cacti and RT; do NOT do it yourself. That will give them "skin in the game", as it were; you can't have them come running to you the minute something breaks in the management software (which it will). Alternatively, if that's beyond their capabilities, farm it out to someone like Software Technology Group, Inc. Have them come in for a hit job, grab one of your techs, and make them sit through the install, setup, and configuration. Set the policies and procedures and leave the how of doing it to the people under you; you can give them suggestions like RT, but if they find something they like better, back off and let them run with it.

If you AREN'T the top dog admin and were just hired to "maintain the hardware", then no problem: outsource, outsource, outsource. Go into your boss's office and tell them, "If you aren't gonna give me your application developers' time or let me hire people, then I'm gonna spend money on vendors." Your job is to be responsible; the outsourcers can flake out at will, they are outsourcers specifically because they don't WANT to be responsible. You have a setup that could go south very, very quickly, and unless you have support behind you, you will drown. If you don't have peeps on site, you can have vendors.
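Since you said you don't mind piecing together your own scripts: here's roughly the loop Big Sister would be running for you under the hood, just so the paging idea is concrete. It's a sketch, not a replacement for a real monitor, and the host list, email-to-SMS gateway address, and SMTP relay below are all made up; swap in your own.

#!/usr/bin/env python3
# Rough sketch of a ping-and-page loop -- roughly what Big Sister does,
# minus the history, the web console, and the escalation logic.
# The host list, SMS gateway address, and SMTP relay are placeholders.
import smtplib
import subprocess
from email.message import EmailMessage

HOSTS = ["10.0.1.11", "10.0.1.12"]          # server interfaces to watch
SMS_GATEWAY = "5035551212@txt.example.com"  # your carrier's email-to-SMS address
SMTP_RELAY = "mail.example.com"

def is_up(host: str) -> bool:
    """One ICMP echo, two-second timeout."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0

def page(host: str) -> None:
    """Send a short alert through the email-to-SMS gateway."""
    msg = EmailMessage()
    msg["Subject"] = f"DOWN: {host}"
    msg["From"] = "monitor@example.com"
    msg["To"] = SMS_GATEWAY
    msg.set_content(f"{host} is not answering pings.")
    with smtplib.SMTP(SMTP_RELAY) as s:
        s.send_message(msg)

if __name__ == "__main__":
    # Run this from cron every few minutes.
    for host in HOSTS:
        if not is_up(host):
            page(host)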
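Same deal with the PDUs: once SNMP is enabled on them, a power reading is one snmpget away, which is all Cacti or MRTG are doing on a schedule before they graph it. The address, community string, and OID below are placeholders; the OID for watts/amps is vendor-specific, so dig the real one out of your PDU's MIB.

#!/usr/bin/env python3
# Rough sketch of polling a PDU's power reading via the net-snmp CLI tools.
# The PDU address, community string, and OID are placeholders -- pull the
# actual power/current OID from your PDU vendor's MIB.
import subprocess

PDU_HOST = "10.0.2.1"
COMMUNITY = "public"
POWER_OID = ".1.3.6.1.4.1.99999.1.1"   # placeholder; vendor-specific

def read_power(host: str, community: str, oid: str) -> str:
    """Return the raw value of one OID using snmpget (net-snmp package)."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
        capture_output=True,
        text=True,
        check=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    print(f"{PDU_HOST} power reading: {read_power(PDU_HOST, COMMUNITY, POWER_OID)}")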
If your superiors don't understand the need for that kind of support, then you are just the latest in a series of revolving admins and won't last.

Ted

-----Original Message-----
From: PLUG <plug-boun...@lists.pdxlinux.org> On Behalf Of Ben Koenig
Sent: Friday, March 1, 2024 9:37 PM
To: Portland Linux/Unix Group <plug@lists.pdxlinux.org>
Subject: [PLUG] Linux Software for Data Center Monitoring

Hey all,

I have a somewhat strange (or maybe not so strange) question regarding datacenter management at the hardware and software level.

For some context: I have recently found myself in charge of on-site maintenance for a datacenter with 800+ servers. While the job itself is pretty simple as far as the RAID arrays and general hardware configuration are concerned, there has been some drama regarding past technicians who weren't actually keeping track of anything. So I have piles of parts that may or may not be good, servers that are completely undocumented, and a grotesque mismatch of labeling schemes for the various ethernet/fiber cables and server types.

Does anyone here who works with SMB-scale datacenter environments have any tips or industry-standard strategies for wrangling this type of setup? Are there any good FOSS software tools to help organize and monitor a mess like this? We have a software team that keeps an eye on the applications, but they do not appear to be monitoring things like power consumption, temperature, or even tracking parts as they get re-used. Our server "map" is literally just a Google Sheets document that was formatted to look like server rows with IP addresses listed by physical location. And I'm pretty sure everyone hates it.

So I'm basically looking for tools to help me set up the following infrastructure:

- server documentation. Type, hardware configuration, and parts compatibility
- temp monitoring. Many of the servers are running CUDA applications on Dual/Quad GPU systems. They get toasty.
- power consumption monitoring. Our PDUs are able to report usage via a network interface, but nobody ever bothered to set it up. Would be nice to have a dashboard that shows when one of the servers freaks out and trips the breaker.

Thoughts? Solutions? Apps? I'm just looking for ideas at the moment. Everything is running (or so I'm told), but we currently have a bus number of 1, which is obviously a recipe for disaster. I don't mind piecing together my own set of scripts and utilities, but if something already exists that does the work for me, even better :)

-Ben