MC, the problem I'm seeing that Ben has is NOT solvable by most of the advice he's gotten.
He has a team of programmers overseas who have a very specific customized environment that they created, and a process for applying it to the vanilla Ubuntu installs he's putting on hardware. I can guess the very last thing they want is interference from a mere "card swapper" back in the US. If this were a virtualized solution, that would be one thing. Ben would be in charge of the hypervisor and they would be doing their crap in VMs, Docker images, or whatever VM solution they chose. They wouldn't give a crap about what he was doing in hardware; he wouldn't give a crap about what they were doing in the images. But they aren't doing that.

I don't know what their customers are doing - cryptocurrency mining, cracking encryption, or searching for ETs using the Chinese radio telescope that replaced the collapsed Arecibo one - but whatever it is, it needs the power available to bare metal; virtualized or containerized solutions ain't gonna do it. They are NOT going to want him crapping that up with agents from Ansible or Puppet, or telling them how to build "their" software images.

What he's got going is exactly what Intel AMT was designed to solve, and what HP's iLO was designed to solve. But his managers took the El-Cheapo way out of it, and instead of buying all the same thing - high-end servers that have all that hardware - they got whatever was on sale at Costco.

Not a single thing anyone has posted here regarding nodes, agents, etc. is going to do squat to read the temperature of a GPU. Or the temperature of a CPU on one of his motherboards. Or tell you whether a cooling fan has failed. For all you know, his cheap-assed GPU cards don't even HAVE cooling fans with the third tachometer wire, nor a header on the card to even pay attention to that.
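To make that concrete, here's a minimal sketch of what "reading a GPU temp and fan speed" actually looks like, assuming NVIDIA cards where the real nvidia-smi tool exists (the parse_gpu_stats helper is made up for illustration). On a random grab-bag card there may be no equivalent tool, and the fan field comes back as "[N/A]" when there's no readable tach - which is exactly the point:

```python
# Sketch: pull GPU temperature and fan speed, assuming NVIDIA cards with
# nvidia-smi installed. On other vendors' cards there may be no
# equivalent tool at all - the problem with mixed hardware.
import csv
import io
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,fan.speed",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(text):
    """Parse 'index, temp, fan%' CSV lines from nvidia-smi output.
    Fan speed comes back as '[N/A]' on cards with no readable tach."""
    stats = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        idx, temp, fan = (field.strip() for field in row)
        stats.append({
            "gpu": int(idx),
            "temp_c": int(temp),
            "fan_pct": None if fan.startswith("[") else int(fan),
        })
    return stats

def read_gpu_stats():
    """Run nvidia-smi; returns [] if the tool isn't present at all."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True,
                             check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []  # no NVIDIA tooling on this box
    return parse_gpu_stats(out)

# Sample of the CSV nvidia-smi emits; the [N/A] fan is a card with no
# tachometer to read.
sample = "0, 63, 40\n1, 71, [N/A]\n"
print(parse_gpu_stats(sample))
```

Note that even this only works on one vendor's cards with the vendor's tooling installed; that's one script per hardware flavor, times 800 machines.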
MAYBE you might get S.M.A.R.T. data - if he's using spinning mag media - but predictive failure is a bit different on SSDs. He gave you guys a hint when he said he got the overseas people to at least give him the PCI ID. That's the level he's trying to operate at: the hardware level. But all his hardware is DIFFERENT. That's why the past of this company has been littered with burned-out system admins who quit.

The company is an example of the tail wagging the dog. The app developers - the overseas programmers - are running the show. Corporate management probably figured "well, those app developers in India are the geese laying the golden eggs," and in true corporate idiocy, put them in charge. And created a nightmare, the same way that happens when you put the specialist bean counters in the accounting department in charge of a company.

Now I can just see you all saying "horrors" and slapping down all these pretty graphs from Nagios and other stuff showing alleged GPU temps to prove me wrong. Go ahead. But dig down under the surface and you will find all the pretty gingerbread is pulling its data from stuff like the Linux IPMI driver, which may give you that data on some hardware and may not give it on other hardware. Ben needs a crash course in the nitty-gritty groveling that has to be done to get this data.

If he's got all different motherboards, then that means once he's done loading the vanilla Ubuntu image, he's gonna have to customize it for that board. Then make the app developers overseas scared of undoing his customizations, under pain of death, when they do THEIR customizations to get their number-crunching online. He's got some painful fighting with management and the overseas app developers ahead of him, since those people don't understand any of this stuff.
If he positions this as a money-saver by demonstrating that, hey, I can tell you this server is gonna crash in the next week, so move your crap elsewhere so I can preemptively swap it out, instead of you wasting time picking up the pieces, figuring out how far your app got on the rainbow table and/or the polynomial, and getting back to there, he might be able to get management on his side. But the overseas people are still going to be irked that he's operating in "their" space, at least until they understand what he's up to.

And going forward, he's gonna have to pick a standardized hardware profile and force them to buy it, instead of them saying "gee willikers, that other motherboard is $10 cheaper, let's buy it as a replacement," ignoring that it is cheaper because it is missing critical monitoring bits, and is only gonna be "on sale" for a month anyway.

In reality, how they have been managing stuff now is super expensive. They just haven't had a system admin in the past who understood this; they have just had "card swappers" who were burned out until they quit. In short, it's an example of "you can go broke saving money." It's a mindset shift he's going to have to push them into.

They got to 800 servers by guess and by gosh, by flying by the seat of their pants - and they won. Whoever their customers are, they have money. That money grew the company, since guess-and-by-gosh gave them first-to-market in the market they are in. Now it's time to say: OK, we are gonna spend some of that money shifting to a sustainable model instead of a flash-in-the-pan model. Otherwise, in 5 years, when the crypto fad is over, or the NSA encryption algorithm they are working on cracking is cracked, or the aliens are found and have taken out a McDonald's franchise, the company is gonna have 800 POS servers headed to the dump, since they are all unique and can't be managed cheaply, and the company will be bankrupt.

Good luck with all of this, Ben! Sounds like a lot of fun!
Ted

-----Original Message-----
From: PLUG <plug-boun...@lists.pdxlinux.org> On Behalf Of MC_Sequoia
Sent: Saturday, March 2, 2024 6:49 PM
To: e...@poningru.com
Cc: Portland Linux/Unix Group <plug@lists.pdxlinux.org>
Subject: Re: [PLUG] Linux Software for Data Center Monitoring

"I will absolutely dissuade folks from rolling your own, we've done something like this before just for integration into our other infrastructure."

There is a valid counterargument here to be considered that depends on the type of environment, the skill set, and the proclivities of the person(s) responsible for the management & performance of a DC.

For example, if you're a good hacker, in a small-scale, not very dynamic, low-change environment, why use a complex, bloated, all-the-bells-&-whistles application when you can use a few simple, highly useful, well-designed, low-overhead tools to do only a few basic things?

As an example, I searched the Debian repos for "System Management" and found BundleWrap. Here's the pkg description:

"By allowing for easy and low-overhead config management, BundleWrap fills the gap between complex deployments using Chef or Puppet and old school system administration over SSH. While most other config management systems rely on a client-server architecture, BundleWrap works off a repository cloned to your local machine. It then automates the process of SSHing into your servers and making sure everything is configured the way it's supposed to be. You won't have to install anything on managed servers."

Here are some highlights from the BundleWrap website:

"Decentralized. There is no server. Just your laptop and your army of nodes."
"Push configuration directly using SSH. No agent required."
"Free as in bird. 100% Free Software. No Enterprise Edition."
"Pythonic and hackable. Write hooks, custom items or use it as a library."

https://bundlewrap.org/
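For flavor, the repository BundleWrap works off is just Python files you keep in version control. A minimal sketch of its nodes.py inventory file, with hypothetical hostnames and bundle names purely for illustration:

```python
# nodes.py - a minimal BundleWrap node inventory. The hostnames and the
# "sshd" bundle name here are made up for illustration. Running
# 'bw apply node1' from the repo would SSH in and reconcile state;
# nothing gets installed on the managed box.
nodes = {
    "node1": {
        "hostname": "node1.example.com",
        "bundles": ["sshd"],
    },
    "node2": {
        "hostname": "node2.example.com",
        "bundles": ["sshd"],
    },
}
```

The agentless, push-over-SSH design is what makes it attractive for the small, low-change environments described above: the whole "server" is a git clone on your laptop.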