Dang, shafted by Proton's lack of mailing list support - can't see MC's answer :(
I'll have to take a look at those. Nagios was the one we used at FreeGeek years ago, but I wasn't sure if that was the name of the service or the server hostname. I have a lot of leeway when it comes to fiddling with stuff like this as long as I don't do anything stupid. These days it seems like rolling my own solution hinges on how sane the build systems are for pre-existing solutions.

-Ben

On Saturday, March 2nd, 2024 at 3:25 PM, Eldo Varghese <e...@poningru.com> wrote:

> Based on what you have mentioned in the other threads, I'd say go for
> something like Zabbix (originally suggested by MC).
> This gets you:
> 1) monitoring with agent and SNMP (along with alerting etc.)
>    - This gets you the power, temp and network monitoring
> 2) Inventory, including for networks
>    - While Zabbix does have automated inventory, you will have to
>      populate the rack charts.
> 3) Dashboards
>
> I would absolutely dissuade folks from rolling their own; we've done
> something like this before just for integration into our other
> infrastructure.
>
> -Eldo
>
> On 3/1/24 21:36, Ben Koenig wrote:
> > Hey all,
> >
> > I have a somewhat strange (or maybe not so strange) question regarding
> > datacenter management at the hardware and software level. For some
> > context: I have recently found myself in charge of on-site maintenance
> > for a datacenter with 800+ servers. While the job itself is pretty
> > simple as far as the RAID arrays and general hardware configuration
> > are concerned, there has been some drama regarding past technicians
> > who weren't actually keeping track of anything. So I have piles of
> > parts that may or may not be good, servers that are completely
> > undocumented, and a grotesque mismatch of labeling schemes for the
> > various ethernet/fiber cables and server types.
> >
> > Does anyone here who works with SMB-scale datacenter environments have
> > any tips or industry-standard strategies for wrangling this type of setup?
> > Are there any good FOSS software tools to help organize and monitor a
> > mess like this? We have a software team that keeps an eye on the
> > applications, but they do not appear to be monitoring things like
> > power consumption, temperature, or even tracking parts as they get
> > re-used. Our server "map" is literally just a Google Sheets document
> > that was formatted to look like server rows with IP addresses listed
> > by physical location. And I'm pretty sure everyone hates it. So I'm
> > basically looking for tools to help me set up the following
> > infrastructure:
> >
> > - server documentation: type, hardware configuration, and parts
> >   compatibility
> > - temp monitoring: many of the servers are running CUDA applications
> >   on dual/quad GPU systems. They get toasty.
> > - power consumption monitoring: our PDUs are able to report usage via
> >   a network interface, but nobody ever bothered to set it up. It would
> >   be nice to have a dashboard that shows when one of the servers
> >   freaks out and trips the breaker.
> >
> > Thoughts? Solutions? Apps? I'm just looking for ideas at the moment.
> > Everything is running (or so I'm told), but we currently have a bus
> > number of 1, which is obviously a recipe for disaster. I don't mind
> > piecing together my own set of scripts and utilities, but if something
> > already exists that does the work for me, even better :)
> >
> > -Ben
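Since the thread mentions piecing together interim scripts before something like Zabbix is stood up: for the GPU temperature piece, a rough sketch that shells out to nvidia-smi (the `--query-gpu`/`--format` flags are real nvidia-smi options; the 85 C alert threshold is just a placeholder to tune per GPU model):

```python
# Minimal sketch: read GPU temperatures via nvidia-smi and flag hot cards.
# Assumption: nvidia-smi is on PATH (true on the CUDA boxes described above).
import subprocess

ALERT_C = 85  # placeholder threshold; pick per GPU model


def parse_gpu_temps(csv_text: str) -> list[int]:
    """nvidia-smi with --format=csv,noheader prints one temperature per line."""
    return [int(tok) for tok in csv_text.split() if tok]


def read_gpu_temps() -> list[int]:
    """Shell out to nvidia-smi; raises if the binary is missing."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_temps(out)


def hot_gpus(temps: list[int], limit: int = ALERT_C) -> list[int]:
    """Indices of GPUs at or above the limit."""
    return [i for i, t in enumerate(temps) if t >= limit]


# Demo on canned output, so no GPU is needed to try the parsing:
print(hot_gpus(parse_gpu_temps("71\n68\n90\n")))  # → [2]
```

Cron this per host (or feed the numbers into whatever dashboard wins) and you at least know which boxes are cooking before the breaker trips.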
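For the PDU side, most networked rack PDUs expose load over SNMP, which is also exactly what Zabbix would poll. A hedged sketch using the standard net-snmp `snmpget` CLI: the OID below is an assumption based on APC's PowerNet MIB (check your vendor's MIB for the real one), and "public" is a placeholder community string:

```python
# Minimal sketch: query a PDU's load over SNMP via the snmpget CLI.
# Assumptions: net-snmp installed; APC-style PDU; OID and community string
# below are placeholders you must replace with your vendor's values.
import subprocess

PDU_LOAD_OID = ".1.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1"  # assumed: APC phase load


def parse_snmp_gauge(line: str) -> int:
    """Pull the integer out of a typical snmpget line,
    e.g. '... = Gauge32: 52'."""
    return int(line.rsplit(":", 1)[1].strip())


def read_pdu_load_amps(host: str, community: str = "public") -> float:
    """Assumes the OID reports load in tenths of amps (APC convention)."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", community, host, PDU_LOAD_OID],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_snmp_gauge(out) / 10.0


# Demo on canned output, so no PDU is needed to try the parsing:
print(parse_snmp_gauge("SNMPv2-SMI::enterprises.318... = Gauge32: 52"))  # → 52
```

Once the community strings are set on the PDUs, the same OIDs plug straight into Zabbix SNMP items, so a script like this is only a stopgap.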