Dang, shafted by Proton's lack of mailing list support - can't see MC's answer :(

I'll have to take a look at those. Nagios was the one we used at FreeGeek years 
ago, but I wasn't sure if that was the name of the service or just the server 
hostname. I have a lot of leeway when it comes to fiddling with stuff like 
this, as long as I don't do anything stupid. These days, whether I roll my own 
solution hinges on how sane the build systems of the pre-existing solutions 
are.

-Ben


On Saturday, March 2nd, 2024 at 3:25 PM, Eldo Varghese <e...@poningru.com> 
wrote:

> Based on what you have mentioned in the other threads, I'd say go for
> something like Zabbix (originally suggested by MC).
> This gets you:
> 1) monitoring with agent and SNMP (along with alerting etc.; a minimal
> polling sketch follows this list)
> - This gets you the power, temp and network monitoring
> 2) Inventory including for networks
> - While Zabbix does have automated inventory, you will have to populate
> the rack charts.
> 3) Dashboards
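> 
> For example, a one-off poll of a PDU with pysnmp might look something
> like the sketch below. The hostname, community string, and OID are
> placeholders rather than anything from a real setup; your vendor's MIB
> gives the actual power/load OIDs.
> 
>     from pysnmp.hlapi import (
>         SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
>         ObjectType, ObjectIdentity, getCmd,
>     )
> 
>     # Placeholders: swap in your PDU's hostname, community string,
>     # and the load OID from its MIB.
>     PDU_HOST = 'pdu-rack01.example.net'
>     COMMUNITY = 'public'
>     LOAD_OID = '1.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1'  # e.g. APC rPDU load; verify against your MIB
> 
>     errorIndication, errorStatus, errorIndex, varBinds = next(
>         getCmd(SnmpEngine(),
>                CommunityData(COMMUNITY, mpModel=1),  # SNMP v2c
>                UdpTransportTarget((PDU_HOST, 161)),
>                ContextData(),
>                ObjectType(ObjectIdentity(LOAD_OID))))
> 
>     if errorIndication:
>         print('poll failed:', errorIndication)
>     elif errorStatus:
>         print('SNMP error:', errorStatus.prettyPrint())
>     else:
>         for oid, value in varBinds:
>             print(oid.prettyPrint(), '=', value.prettyPrint())
> 
> Zabbix handles all of this for you once the SNMP items are defined, so
> treat this purely as a smoke test that the PDUs actually answer.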
> 
> I would absolutely dissuade folks from rolling their own; we've only done
> something like that before to integrate with our other infrastructure.
> -Eldo
> 
> On 3/1/24 21:36, Ben Koenig wrote:
> 
> > Hey all,
> > 
> > I have a somewhat strange (or maybe not so strange) question regarding 
> > datacenter management at the hardware and software level. For some context: 
> > I have recently found myself in charge of on-site maintenance for a 
> > datacenter with 800+ servers. While the job itself is pretty simple as far 
> > as the RAID arrays and general hardware configuration are concerned, there 
> > has been some drama regarding past technicians who weren't actually keeping 
> > track of anything. So I have piles of parts that may or may not be good, 
> > servers that are completely undocumented, and a grotesque mishmash of 
> > labeling schemes for the various ethernet/fiber cables and server types.
> > 
> > Does anyone here who works with SMB-scale datacenter environments have any 
> > tips or industry-standard strategies for wrangling this type of setup? Are 
> > there any good FOSS software tools to help organize and monitor a mess like 
> > this? We have a software team that keeps an eye on the applications, but 
> > they do not appear to be monitoring things like power consumption or 
> > temperature, or even tracking parts as they get re-used. Our server "map" 
> > is literally just a Google Sheets document that was formatted to look like 
> > the server rows, with IP addresses listed by physical location. And I'm 
> > pretty sure everyone hates it. So I'm basically looking for tools to help 
> > me set up the following infrastructure:
> > 
> > - server documentation. Type, hardware configuration, and parts 
> > compatibility
> > - temp monitoring. Many of the servers are running CUDA applications on 
> > dual/quad-GPU systems. They get toasty. (A rough polling sketch follows 
> > this list.)
> > - power consumption monitoring. Our PDUs are able to report usage via a 
> > network interface, but nobody ever bothered to set it up. Would be nice to 
> > have a dashboard that shows when one of the servers freaks out and trips 
> > the breaker.
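> > 
> > For the GPU temps, even something as dumb as periodically polling 
> > nvidia-smi on each box would do as a stopgap. A rough sketch; the 
> > nvidia-smi flags are real, but the alert threshold is an arbitrary 
> > number I made up:
> > 
> >     import subprocess
> > 
> >     def gpu_temps():
> >         """Return {gpu_index: temperature_C} for the local GPUs."""
> >         out = subprocess.run(
> >             ['nvidia-smi', '--query-gpu=index,temperature.gpu',
> >              '--format=csv,noheader'],
> >             capture_output=True, text=True, check=True).stdout
> >         return {int(i): int(t) for i, t in
> >                 (line.split(', ') for line in out.strip().splitlines())}
> > 
> >     for idx, temp in gpu_temps().items():
> >         if temp >= 85:  # arbitrary threshold, tune per card
> >             print(f'GPU {idx} running hot: {temp} C')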
> > 
> > Thoughts? Solutions? Apps? I'm just looking for ideas at the moment. 
> > Everything is running (or so I'm told), but we currently have a bus factor 
> > of 1, which is obviously a recipe for disaster. I don't mind piecing 
> > together my own set of scripts and utilities, but if something already 
> > exists that does the work for me, even better :)
> > 
> > -Ben
