MC, the problem I'm seeing that Ben has is NOT solvable by most of the advice 
he's gotten.

He has a team of programmers overseas with a very specific customized 
environment that they created, plus a process for applying it to the vanilla 
Ubuntu installs he's putting on hardware.

I can guess the very last thing they want is interference from a mere "card 
swapper" back in the US.

If this were a virtualized setup, that would be one thing.  Ben would be in 
charge of the hypervisor and they would be doing their crap in VMs, Docker 
images or whatever the hell virtualization solution they chose.  They wouldn't 
give a crap about what he was doing in hardware, and he wouldn't give a crap 
about what they were doing in the images.

But they aren't doing that.  I don't know what their customers are doing - 
cryptocurrency mining, cracking encryption or searching for ETs using the 
Chinese radio telescope that replaced the collapsed Arecibo one - but whatever 
it is, it needs the power available to bare metal; virtualized or containerized 
solutions ain't gonna do it.

They are NOT going to want him crapping that up with agents from Ansible or 
Puppet, or telling them how to build "their" software images.

What he's got going is exactly what Intel AMT was designed to solve; it's what 
HP's iLO was designed to solve.  But his managers took the el-cheapo way out 
of it, and instead of buying all the same thing - high-end servers that have 
all that hardware - they got whatever was on sale at Costco.

Not a single thing anyone has posted here regarding nodes, agents, etc. is 
going to do squat to read the temperature of a GPU.  Or the temperature of a 
CPU on one of his motherboards.  Or tell you whether a cooling fan has failed.  
For all you know his cheap-assed GPU cards don't even HAVE cooling fans with 
a third tachometer wire, nor a header on the card to even pay attention to 
that.  MAYBE you might get S.M.A.R.T. data - if he's using spinning mag media 
- but predictive failure is a bit different on SSDs.
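If anyone doubts how uneven this is, here's a minimal sketch (hypothetical, not 
anything from Ben's setup) of grubbing through Linux's hwmon sysfs interface. 
The base path is parameterized for testing, but the point stands either way: 
which temp*_input files exist depends entirely on the board and its drivers.

```python
import glob
import os

def read_hwmon(base="/sys/class/hwmon"):
    """Return {chip_name: {sensor_label: millidegrees_C}} for temp inputs found.

    Sketch only: on one motherboard this turns up CPU, VRM and fan chips,
    on the next one almost nothing - there is no guarantee of coverage.
    """
    readings = {}
    for chip in sorted(glob.glob(os.path.join(base, "hwmon*"))):
        try:
            with open(os.path.join(chip, "name")) as f:
                name = f.read().strip()
        except OSError:
            continue  # chip vanished or exposes no name: skip it
        sensors = {}
        for temp in sorted(glob.glob(os.path.join(chip, "temp*_input"))):
            label_path = temp.replace("_input", "_label")
            label = os.path.basename(temp)
            if os.path.exists(label_path):
                with open(label_path) as f:
                    label = f.read().strip()
            try:
                with open(temp) as f:
                    sensors[label] = int(f.read().strip())  # millidegrees C
            except (OSError, ValueError):
                pass  # sensor node exists but won't read; common on cheap boards
        readings[name] = sensors
    return readings
```

Run it on two of his mismatched boards and compare the output - that difference 
is exactly the per-board groveling problem.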

He gave you guys a hint when he said he got the overseas people to at least 
give him the PCI ID.  That's the level he's trying to operate at - the 
hardware level.
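Operating at that level starts with something like `lspci -nn`, where the 
bracketed vendor:device IDs are the one stable handle you get across mismatched 
boards. Here's a hedged sketch of pulling those IDs out - the regex is mine, 
not anything Ben actually runs:

```python
import re

# Matches `lspci -nn` lines such as:
#   01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [10de:1b06] (rev a1)
LSPCI_NN = re.compile(
    r"^(?P<slot>\S+) (?P<cls>.+?) \[[0-9a-f]{4}\]: "
    r"(?P<desc>.+?) \[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)

def parse_lspci_nn(text):
    """Extract (slot, class, vendor_id, device_id) tuples from `lspci -nn` output."""
    devices = []
    for line in text.splitlines():
        m = LSPCI_NN.match(line)
        if m:
            devices.append((m["slot"], m["cls"], m["vendor"], m["device"]))
    return devices
```

In practice you'd feed it the stdout of `lspci -nn` via subprocess; the 
vendor:device pairs then become your inventory key for "which driver and which 
monitoring path does this box need."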

But all his hardware is DIFFERENT.   That's why this company's past is 
littered with burned-out system admins who quit.

The company is an example of the tail wagging the dog.  The app developers - 
the overseas programmers - are running the show.  Corporate management probably 
figured "well, those app developers in India are the geese laying the golden 
eggs" and, in true corporate idiocy, put them in charge.  And created a 
nightmare, the same way it happens when you put the specialist bean counters 
in the accounting department in charge of a company.

Now I can just see you all saying "horrors" and slapping down all these pretty 
graphs from Nagios and other stuff showing alleged GPU temps to prove me 
wrong.  Go ahead.  But dig down under the surface and you will find all the 
pretty gingerbread is pulling its data from stuff like the Linux IPMI driver, 
which may give you that data on some hardware and may not give it on other 
hardware.
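On a real server board with a BMC, `ipmitool sensor` prints pipe-delimited rows 
you can actually act on; on the consumer boards Ben is stuck with, ipmitool 
usually has nothing to talk to and you get nothing. A sketch of parsing that 
output, assuming the usual `name | value | unit | status | ...` layout:

```python
def parse_ipmi_sensors(text):
    """Parse pipe-delimited `ipmitool sensor` rows into {name: (value, unit, status)}.

    Sketch only. Rows reading "na" are sensors the BMC lists in its SDR but
    that aren't actually populated on this particular board - another flavor
    of the "may give it on some hardware, not on other hardware" problem.
    """
    sensors = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue  # not a sensor row
        name, value, unit, status = fields[:4]
        if value == "na":
            continue  # listed but not populated on this board
        try:
            sensors[name] = (float(value), unit, status)
        except ValueError:
            sensors[name] = (value, unit, status)  # discrete sensors report states
    return sensors
```

The useful part isn't the parsing, it's noticing how many rows come back "na" 
from one board to the next.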

Ben needs a crash course in the nitty-gritty groveling that has to be done to 
get this data.  If he's got all different motherboards, then once he's done 
loading the vanilla Ubuntu image he's gonna have to customize it for that 
board.  Then make the app developers overseas scared of undoing his 
customizations, under pain of death, when they do THEIR customizations to get 
their number-crunching online.
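That per-board customization step could be keyed off what the firmware reports 
via DMI. A toy sketch - the board names and script names below are invented, 
and the real table would get filled in one board at a time as he does the 
groveling:

```python
# Hypothetical map from DMI board name to post-install customization script.
# Every entry here is made up for illustration.
PROFILES = {
    "Z390 AORUS PRO": "gigabyte-z390.sh",
    "PRIME B450M-A": "asus-b450m.sh",
}

def read_board_name(path="/sys/class/dmi/id/board_name"):
    """Return the motherboard model string the firmware exposes via DMI."""
    with open(path) as f:
        return f.read().strip()

def pick_profile(board_name, profiles=PROFILES, default="generic.sh"):
    """Map a board name to the customization script to run after the vanilla install."""
    return profiles.get(board_name, default)
```

A wrapper like this, run right after the vanilla install, is also the natural 
place to refuse to proceed on a board nobody has profiled yet - which is the 
"pain of death" enforcement in script form.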

He's got some painful fighting with management and the app developers overseas 
ahead of him, since those people don't understand any of this stuff.  If he 
positions this as a money-saver by demonstrating "hey, I can tell you this 
server is gonna crash in the next week, so move your crap elsewhere so I can 
preemptively swap it out, instead of you wasting time picking up the pieces, 
figuring out how far your app got on the rainbow table or the polynomial, and 
getting back to there," he might be able to get management on his side.  But 
the overseas people are still going to be irked that he's operating in "their" 
space, at least until they understand what he's up to.

And going forward he's gonna have to pick a standardized hardware profile and 
force them to buy it, instead of them saying "gee willikers, that other 
motherboard is $10 cheaper, let's buy it as a replacement" - ignoring that it 
is cheaper because it is missing critical monitoring bits and is only gonna be 
"on sale" for a month.  In reality, how they have been managing stuff up to 
now is super expensive.  They just haven't had a system admin in the past who 
understood this; they've just had "card swappers" who burned out until they 
quit.  In short, it's an example of "you can go broke saving money."

It's a mindset shift he's going to have to push them into.  They got to 800 
servers by guess and by gosh, by flying by the seat of their pants - and they 
won.  Whoever their customers are, they have money.  That money grew the 
company, since flying by the seat of their pants got them first to market in 
whatever market they're in.

Now it's time to say OK, we are gonna spend some of that money shifting to a 
sustainable model instead of a flash-in-the-pan model.  Otherwise, in 5 years, 
when the crypto fad is over, or the NSA encryption algorithm they are working 
on cracking is cracked, or the aliens are found and have taken out a 
McDonald's franchise, the company is gonna have 800 POS servers headed to the 
dump, since they are all unique and can't be managed cheaply, and the company 
will be bankrupt.

Good luck with all of this, Ben!  Sounds like a lot of fun!

Ted

-----Original Message-----
From: PLUG <plug-boun...@lists.pdxlinux.org> On Behalf Of MC_Sequoia
Sent: Saturday, March 2, 2024 6:49 PM
To: e...@poningru.com
Cc: Portland Linux/Unix Group <plug@lists.pdxlinux.org>
Subject: Re: [PLUG] Linux Software for Data Center Monitoring

"I will absolutely dissuade folks from rolling your own, we've done something 
like this before just for integration into our other infrastructure."

There is a valid counter argument here to be considered that depends on the 
type of environment, the skillset and the proclivities of the person(s) 
responsible for the management & performance of a DC.

For example, if you're a good hacker in a small-scale, not very dynamic, 
low-change environment, why use a complex, bloated, all-the-bells-and-whistles 
application when you can use a few simple, highly useful, well-designed, 
low-overhead tools to do only a few basic things?

As an example, I searched the Debian Repos for "System Management" and found 
Bundlewrap. 

Here's the pkg description: 
"By allowing for easy and low-overhead config management, BundleWrap fills the 
gap between complex deployments using Chef or Puppet and old school system 
administration over SSH.

While most other config management systems rely on a client-server 
architecture, BundleWrap works off a repository cloned to your local machine.

It then automates the process of SSHing into your servers and making sure 
everything is configured the way it's supposed to be. You won't have to install 
anything on managed servers."

Here's some highlights from the Bundlewrap website:

"Decentralized. There is no server. Just your laptop and your army of nodes."

"Push configuration directly using SSH. No agent required."

"Free as in bird. 100% Free Software. No Enterprise Edition."

"Pythonic and hackable.Write hooks, custom items or use it as a library."

https://bundlewrap.org/






