Re: [PLUG] Linux Software for Data Center Monitoring

2024-03-03 Thread Ben Koenig
Relax, everyone here has offered a good set of options to consider. My goal was 
to brainstorm possible solutions, since I haven't worked directly with this 
kind of software in a Linux environment. The first step is always to list all 
your options, even if some of them turn out not to be the best, and I've got 
plenty to think about here. I'm currently in the "there are no bad ideas here" 
stage. 


On Sunday, March 3rd, 2024 at 8:22 AM, Ted Mittelstaedt wrote:

> ...
> But all his hardware is DIFFERENT. That's why the past of this company has 
> been littered with burned out system admins who quit.
> ...

Not exactly. The previous admin did not quit, he was FIRED. He had no concept 
of Linux, IT, or even basic PC troubleshooting, but operated as if he 
understood why everything was broken and blamed the remote software team for 
pretty much everything. A lot of the jankiness right now comes from the fact 
that the onsite technician was a complete doofus who needed his hand held when 
replacing a bad GPU, because he wasn't able to verify that it was actually 
working via lspci/nvidia-smi. "Training" the remote team to give me the PCIe ID 
for a bad GPU was more about building confidence that I could actually handle 
that info. They were also having issues with basic inventory management. Just 
getting this guy to write down a tracking number or count server rails was a 
massive undertaking, so you can see why they might not want to splurge for 
nice parts.
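For anyone curious what that verification step looks like, here's a minimal sketch of the "remote team hands me a PCIe ID, onsite confirms the card is actually there" workflow. The lspci line and bus ID below are canned sample data, not from Ben's hardware; on a live box you'd capture real output with `lspci -d 10de:` and then check driver health with `nvidia-smi -i <id>`.

```shell
# Hypothetical workflow: remote team reports a bad GPU by PCIe bus ID,
# onsite tech confirms the replacement is enumerated on the bus. The
# lspci output here is a made-up sample; on real hardware you would run
# `lspci -d 10de:` and then `nvidia-smi -i <bus-id>` for driver health.
lspci_output='41:00.0 3D controller: NVIDIA Corporation GA102GL [A10] (rev a1)'
bad_gpu='41:00.0'   # PCIe ID reported by the remote team (invented)

if printf '%s\n' "$lspci_output" | grep -q "^$bad_gpu "; then
    echo "GPU $bad_gpu is enumerated on the PCIe bus"
else
    echo "GPU $bad_gpu not found - reseat the card or check the slot"
fi
```

Once the card shows up in lspci, `nvidia-smi` confirming the driver can talk to it is the second half of the check.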

I ran into a similar problem last year at a big corporation - lower tier 
support technicians were mad at management for making bad decisions, but 
management makes those decisions based on ticket data. So I looked at the 
ticket data and noticed that people weren't creating their SNOW tickets 
properly, which resulted in all of their work being massively under-reported. 
So of course management is going to assume you have free time... that's what 
your own ticket data says. Why are we blaming the managers if we didn't do our 
job correctly?

Similar problem here, just at a much smaller scale. A tech who didn't properly 
manage the physical location ended up making it difficult for decision makers 
to choose a path forward. The short-term solution might be to modularize the 
different tasks and follow a one-application-per-task mindset. If it gets to 
the point where everyone else wants to integrate these tools together, that 
could become the opportunity to suggest a more robust system.
-Ben


Re: [PLUG] Linux Software for Data Center Monitoring

2024-03-03 Thread Ted Mittelstaedt
MC, the problem I'm seeing that Ben has is NOT solvable by most of the advice 
he's gotten.

He has a team of programmers overseas who have a very specific customized 
environment that they created that they have a process of applying to the 
vanilla Ubuntu installs he's putting on hardware.

I can guess the very last thing they want is interference from a mere "card 
swapper" back in the US.

If this were a virtualized setup, that would be one thing.  Ben would be in 
charge of the hypervisor and they would be doing their crap in VMs, Docker 
images, or whatever the hell virtualization solution they chose.  They wouldn't 
give a crap about what he was doing in hardware, and he wouldn't give a crap 
about what they were doing in the images.

But they aren't doing that.  I don't know what their customers are doing - 
cryptocurrency mining, cracking encryption or searching for ETs using the 
Chinese radio telescope that replaced the collapsed Arecibo one - but whatever 
it is, it needs the power available on bare metal; virtualized solutions like 
VMs ain't gonna do it.

They are NOT going to want him crapping that up with agents from Ansible or 
Puppet, or telling them how to build "their" software images.

What he's got going on is exactly what Intel AMT was designed to solve; it's 
what HP's iLO was designed to solve.  But his managers took the El-Cheapo way 
out, and instead of buying all the same thing - high-end servers that have all 
that hardware built in - they got whatever was on sale at Costco.

Not a single thing anyone has posted here regarding nodes, agents, etc. is 
going to do squat to read the temperature of a GPU.  Or the temperature of a 
CPU on one of his motherboards.  Or tell you whether a cooling fan has failed.  
For all you know his cheap-assed GPU cards don't even HAVE cooling fans with 
the 3rd tachometer wire, nor a header on the card to even pay attention to 
that.  MAYBE you might get S.M.A.R.T. data - if he's using spinning mag media - 
but predictive failure is a bit different on SSDs.

He gave you guys a hint when he said he got the overseas people to at least 
give him the PCI ID.  That's where he's trying to operate - at the hardware 
level.

But all his hardware is DIFFERENT.   That's why the past of this company has 
been littered with burned out system admins who quit.

The company is an example of the tail wagging the dog.  The app developers - 
the overseas programmers - are running the show.  Corporate management probably 
figured "well those app developers in India are the geese laying the golden 
eggs" and in true corporate idiocy, put them in charge.  And created a 
nightmare, the same way that happens when you put specialist bean counters in 
the accounting department in charge of a company.

Now I can just see you all saying "horrors" and slapping down all these pretty 
graphs from Nagios and other stuff showing alleged GPU temps to prove me 
wrong.  Go ahead.  But dig down under the surface and you will find all the 
pretty gingerbread is pulling its data from stuff like the Linux IPMI driver, 
which may give you that data on some hardware and not on other 
hardware.
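To make that concrete: before trusting any dashboard, you can at least check whether the kernel exposes an IPMI device on a given box at all. The paths below are the standard device nodes ipmitool looks for; this sketch just reports which data source is plausible on the machine it runs on.

```shell
# Sanity check before trusting a monitoring dashboard: does this board
# actually expose an IPMI/BMC device to Linux? If not, any "sensor" graph
# is coming from somewhere else (lm-sensors, nvidia-smi) - or from nowhere.
msg="no IPMI device node - temps must come from lm-sensors, nvidia-smi, etc."
for dev in /dev/ipmi0 /dev/ipmi/0 /dev/ipmidev/0; do
    if [ -e "$dev" ]; then
        msg="found $dev - ipmitool sensor should work on this box"
        break
    fi
done
echo "$msg"
```

On a heterogeneous fleet like Ben's, running something like this per box tells you up front which machines can even be monitored at the BMC level.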

Ben needs a crash course in the nitty-gritty groveling that has to be done to 
get this data.  If he's got all different motherboards, then once he's done 
loading the vanilla Ubuntu image he's gonna have to customize it for that 
board.  Then make the app developers overseas scared of undoing his 
customizations, under pain of death, when they do THEIR customizations to get 
their number-crunching online.
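As a rough illustration of that per-board customization, a post-install hook could branch on the board model read from DMI. The board names and config filenames below are invented for the example; only the sysfs path is standard.

```shell
# Hypothetical post-install hook: pick a sensors config per motherboard
# model, using the board name the kernel exposes via DMI. The board names
# and config filenames are made up; the sysfs path is the standard one.
board=$(cat /sys/class/dmi/id/board_name 2>/dev/null || echo unknown)
case "$board" in
    X11SSL*) conf="supermicro-x11-sensors.conf" ;;
    B450*)   conf="consumer-b450-sensors.conf" ;;
    *)       conf="generic-sensors.conf" ;;
esac
echo "would deploy sensors config: $conf"
```

Keeping a hook like this in version control also gives the overseas team something concrete to not clobber when they layer on their own customizations.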

He's got some painful fighting ahead of him with management and the app 
developers overseas, since those people don't understand any of this stuff.  If 
he positions this as a money-saver - "hey, I can tell you this server is gonna 
crash in the next week, so move your crap elsewhere so I can preemptively swap 
it out, instead of you wasting time picking up the pieces, figuring out how far 
your app got on the rainbow table or the polynomial, and getting back to 
there" - he might be able to get management on his side.  But the overseas 
people are still going to be irked that he's operating in "their" space, at 
least until they understand what he's up to.

And going forward he's gonna have to pick a standardized hardware profile and 
force them to buy it, instead of them saying "gee willikers, that other 
motherboard is $10 cheaper, let's buy it as a replacement" - ignoring that it 
is cheaper because it is missing critical monitoring bits and is only gonna be 
"on sale" for a month.  In reality, how they have been managing stuff now is 
super expensive.  They just haven't had a system admin in the past who 
understood this; they have just had "card swappers" who burned out until they 
quit.  In short, it's an example of "you can go broke saving money."

It's a mindset shift he's going to have to push them into.  They got 800 
servers by guess and by gosh, by fly

Re: [PLUG] [PLUG-ANNOUNCE] Portland Linux/Unix Group General Meeting Announcement: Two half-talks

2024-03-03 Thread Russell Senior

On 3/2/24 07:56, Michael Galassi wrote:

The sentence "All in-person events are on hold until further notice."
still shows up in the first paragraph of pdxlinux.org, maybe those
words can be retired (for now).


Fixed.



See you Thursday (if it stops snowing).

-michael

On Fri, Mar 1, 2024 at 5:41 PM Russell Senior  wrote:

Portland Linux/Unix Group General Meeting Announcement

Who: Russell Senior
What: Part 1: A Network Relay via Cloud Instance ; Part 2: Retro Linux
Tape Recovery Show and Tell
Where: 5500 SW Dosch Rd, Portland
When: Thursday, March 7, 2024 at 7pm (Help with chairs a few minutes
early is always appreciated)
Why: The pursuit of technology freedom

https://pdxlinux.org

This is going to be a two-part talk, because each of the parts alone
isn't enough to fill an hour (let's hope).

The first part is going to be a description of how I relay network
connections from the Internet to my low-volume home-based email server
to evade potential ISP blockages.

The second part is going to be a show and tell about my resurrection of
an ancient Linux version in order to recover data from Quarter Inch
Cartridge tapes, plus ancillary topics. It will also include a short demo
of my MS-DOS 5.0 environment (resurrected from tape) from the month
before I installed Linux for the first time in December 1992.

About Russell:

I am a person for whom the Year of the Linux Desktop started in 1992 and
has continued annually, uninterrupted. I worked for a couple decades in
scientific data management and analysis. Since 2005, I have been
involved with the Personal Telco Project, a volunteer-based 501c3
non-profit trying to unscrew telecommunications policy in the Portland
metropolitan area.  I did a short stint in data management for an
oceanographic organization when it was housed at OHSU. I also volunteer
at Portland State Aerospace Society working on their OreSat program. My
name, misspelled in glorious circuit board silkscreen, has literally
been in orbit for most of the last 2 years. I have done a bunch of PLUG
talks over the years (scrolling through the log, I recognize these):

2023-03-02 Anatomy of a mailing list meltdown
2021-10-07 Russell's Excellent High Altitude Balloon Adventure
2020-01-02 Reading wireless temperature sensors with RTL-SDR and rtl_433
2019-02-07 PGP Key Storage with a Yubikey 4
2018-02-01 How to get a Municipal Broadband network in the City of
Portland
2017-05-04 Going Coastal, Russell's Excellent Adventure at the Center
for Coastal Margin Observation and Prediction
2013-06-06 Hacking on the Beagle Bone Black
2008-11-19 OpenWrt, it's not just for Linksys Routers anymore
2008-02-07 MetroFi: How Lame is It?
2005-11-16 Mississippi Grant Project and Personal Telco
2004-10-20 Detection of electromagnetic fields with Linux
2003-12-17 Russell’s excellent hardlink adventure (disk-to-disk
backup systems)

Rules and Requests:

Masks are encouraged but not required.

PLUG is open to everyone and does not tolerate abusive behavior on its
mailing lists or at its meetings.

Do not leave valuables in your car.


Calagator Page: https://calagator.org/events/1250480986

Google Maps Link:
https://www.google.com/maps/place/5500+SW+Dosch+Rd,+Portland,+OR+97239

Some might head to Hillsdale Brewery & Public House near the Library:
https://www.mcmenamins.com/hillsdale-brewery-public-house

Rideshares likely available

PLUG Page with information about all PLUG events: http://pdxlinux.org/

Russell Senior
PLUG Volunteer
___
PLUG: https://pdxlinux.org
PLUG-announce mailing list
plug-annou...@lists.pdxlinux.org
https://lists.pdxlinux.org/mailman/listinfo/plug-announce