Re: [PLUG] Linux Software for Data Center Monitoring
Relax, everyone here has offered a good set of options to consider. My goal was to brainstorm possible solutions, since I haven't worked directly with this kind of software in a Linux environment. The first step is always to list all your options, even if some of those options may not be the best, and I've got plenty to think about here. I'm currently in the "there are no bad ideas here" stage.

On Sunday, March 3rd, 2024 at 8:22 AM, Ted Mittelstaedt wrote:
> ...
> But all his hardware is DIFFERENT. That's why the past of this company has
> been littered with burned out system admins who quit.
> ...

Not exactly. The previous admin did not quit, he was FIRED. He had no concept of Linux, IT, or even basic PC troubleshooting, but operated as if he understood why everything was broken and blamed the remote software team for pretty much everything.

A lot of the jankiness right now comes from the fact that the onsite technician was a complete doofus who needed his hand held when replacing a bad GPU, because he wasn't able to verify that it was actually working via lspci/nvidia-smi. "Training" the remote team to give me the PCIe ID for a bad GPU was more about building confidence that I could actually handle that info.

They were also having issues with basic inventory management. Just getting this guy to write down a tracking number or count server rails was a massive undertaking, so you can see why they might not want to splurge for nice parts.

I ran into a similar problem last year at a big corporation: lower-tier support technicians were mad at management for making bad decisions, but management makes those decisions based on ticket data. So I looked at the ticket data and noticed that people weren't creating their SNOW tickets properly, which resulted in all of their work being massively under-reported. So of course management is going to assume you have free time... that's what your own ticket data says. Why are we blaming the managers if we didn't do our job correctly?
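For reference, the sanity check I walked him through looks roughly like this (a sketch only; the bus ID and the sample output line below are made-up examples, not from our environment):

```shell
# On the box itself you'd run something like:
#   lspci -s 3b:00.0                                              # is the card even on the bus?
#   nvidia-smi --query-gpu=pci.bus_id,temperature.gpu --format=csv,noheader
# and confirm that the PCIe ID the remote team reported actually shows up.
# Below, a sample nvidia-smi CSV line stands in for real hardware output.

reported="00000000:3B:00.0"          # PCIe ID the remote team gave us (example)
sample_line="00000000:3B:00.0, 47"   # pci.bus_id, temperature.gpu (sample)

bus="${sample_line%%,*}"             # everything before the first comma
temp="${sample_line##*, }"           # everything after the last ", "

if [ "$bus" = "$reported" ]; then
    echo "GPU $bus answers, temp ${temp}C"
else
    echo "GPU $reported missing from nvidia-smi output"
fi
```

If the card doesn't show up in lspci at all, it's a seating/slot/power problem before it's ever a driver problem.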
Similar problem here, just at a much smaller scale. A tech who didn't properly manage the physical location ended up making it difficult for decision makers to choose a path forward. The short-term solution might be to modularize the different tasks and follow a one-application-per-task mindset. If it gets to the point where everyone else wants to integrate these tools together, that could become the opportunity to suggest a more robust system.

-Ben
Re: [PLUG] Linux Software for Data Center Monitoring
MC, the problem I'm seeing is that what Ben has is NOT solvable by most of the advice he's gotten. He has a team of programmers overseas with a very specific customized environment that they created, and a process for applying it to the vanilla Ubuntu installs he's putting on hardware. I can guess the very last thing they want is interference from a mere "card swapper" over in the US.

If this were a virtualized solution, that would be one thing. Ben would be in charge of the hypervisor and they would be doing their crap in VMs, Docker images, or whatever the hell virtualization solution they chose. They wouldn't give a crap about what he was doing in hardware, he wouldn't give a crap about what they are doing in the images. But they aren't doing that.

I don't know what their customers are doing - cryptocurrency mining, cracking encryption, or searching for ETs using the Chinese radio telescope that replaced the collapsed Arecibo one - but whatever it is, it needs the power available to bare metal; virtualized and containerized solutions ain't gonna do it. They are NOT going to want him crapping that up with agents from Ansible or Puppet, or telling them how to build "their" software images.

What he's got going is what Intel AMT was designed to solve; it's what HP's iLO was designed to solve. But his managers took the el-cheapo way out of it, and instead of buying all the same thing - high-end servers that have all that hardware - they got whatever was on sale at Costco.

Not a single thing anyone has posted here regarding nodes, agents, etc. is going to do squat to read the temperature of a GPU. Or the temperature of a CPU on one of his motherboards. Or tell you whether a cooling fan has failed. For all you know, his cheap-assed GPU cards don't even HAVE cooling fans with the 3rd tachometer wire, nor a header on the card to even pay attention to that.
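To make that concrete, here's roughly the kind of board-specific groveling I mean (a sketch only; the "fan1" label and the sample line are made up - the next motherboard may name it differently or not expose it at all):

```shell
# What "reading a fan" actually means depends entirely on the board:
#   sensors                  # lm-sensors, if the chip even has a driver
#   ipmitool sdr type Fan    # if there's a BMC at all (cheap boards: often not)
# A sample `sensors` output line stands in for real hardware output.

sample="fan1:        0 RPM"
rpm=$(echo "$sample" | awk '{print $2}')   # second field is the RPM reading

if [ "$rpm" -eq 0 ]; then
    echo "fan1 reads 0 RPM: dead fan, missing tach wire, or no header at all"
fi
```

And that's the catch: 0 RPM might mean a failed fan, or it might mean the tach wire was never connected. You only know which after you've characterized that specific board.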
MAYBE you might get S.M.A.R.T. data - if he's using spinning mag media - but predictive failure is a bit different on SSDs.

He gave you guys a hint when he said he got the overseas people to at least give him the PCIe ID. That's the level he's trying to operate at - the hardware level. But all his hardware is DIFFERENT. That's why the past of this company has been littered with burned out system admins who quit.

The company is an example of the tail wagging the dog. The app developers - the overseas programmers - are running the show. Corporate management probably figured "well, those app developers in India are the geese laying the golden eggs" and, in true corporate idiocy, put them in charge. And created a nightmare, the same way it happens when you put specialist bean counters in the accounting department in charge of a company.

Now I can just see you all saying "horrors" and slapping down all these pretty graphs from Nagios and other stuff showing alleged GPU temps to prove me wrong. Go ahead. But dig down under the surface and you will find all the pretty gingerbread is pulling its data from stuff like the Linux IPMI driver, which may give you that data on some hardware and may not give it on other hardware.

Ben needs a crash course in the nitty-gritty groveling that has to be done to get this data. If he's got all different motherboards, then once he's done loading the vanilla Ubuntu image he's gonna have to customize it for that board. Then make the app developers overseas scared of undoing his customizations, under pain of death, when they do THEIR customizations to get their number-crunching online. He's got some painful fighting with management and the app developers overseas ahead of him, since those people don't understand any of this stuff.
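On the S.M.A.R.T. angle, the sort of thing you'd watch on spinning disks looks roughly like this (a sketch only; the sample attribute line is made up, and SSDs report different attributes entirely, so one check won't fit all drives):

```shell
# On a spinning disk you'd run something like:
#   smartctl -A /dev/sda
# and watch Reallocated_Sector_Ct climb over time. On SSDs the meaningful
# attributes are wear-leveling counters instead, and they vary by vendor.
# A sample smartctl attribute line stands in for real output.

sample="  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12"
realloc=$(echo "$sample" | awk '{print $NF}')   # raw value is the last field

if [ "$realloc" -gt 0 ]; then
    echo "Reallocated sectors: $realloc - start planning the swap"
fi
```

A drive quietly remapping sectors is exactly the "this server is gonna crash next week" signal you want before the app team loses a week of number-crunching.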
If he positions this as a money-saver by demonstrating that, well hey, I can tell you this server is gonna crash in the next week, so move your crap elsewhere so I can preemptively swap it out, instead of you wasting time picking up the pieces and figuring out how far your app got on the rainbow table and/or the polynomial and getting back to there, he might be able to get management on his side. But the overseas people are still going to be irked that he's operating in "their" space, at least until they understand what he's up to.

And going forward he's gonna have to pick a standardized hardware profile and force them to buy it, instead of them saying "gee willikers, that other motherboard is $10 cheaper, let's buy it as a replacement" - ignoring that it's cheaper because it's missing critical monitoring bits and is only gonna be "on sale" for a month.

In reality, how they have been managing stuff now is super expensive. They just haven't had a system admin in the past who understood this; they have just had "card swappers" who were burned out until they quit. In short, it's an example of "you can go broke saving money." It's a mindset shift he's going to have to push them into. They got 800 servers by guess and by gosh.
Re: [PLUG] [PLUG-ANNOUNCE] Portland Linux/Unix Group General Meeting Announcement: Two half-talks
On 3/2/24 07:56, Michael Galassi wrote:
> The sentence "All in-person events are on hold until further notice." still shows up in the first paragraph of pdxlinux.org, maybe those words can be retired (for now).

Fixed.

> See you Thursday (if it stops snowing).
>
> -michael

On Fri, Mar 1, 2024 at 5:41 PM Russell Senior wrote:

Portland Linux/Unix Group General Meeting Announcement

Who: Russell Senior
What: Part 1: A Network Relay via Cloud Instance; Part 2: Retro Linux Tape Recovery Show and Tell
Where: 5500 SW Dosch Rd, Portland
When: Thursday, March 7, 2024 at 7pm (Help with chairs a few minutes early is always appreciated)
Why: The pursuit of technology freedom

https://pdxlinux.org

This is going to be a two-part talk, because each of the parts alone isn't enough to fill an hour (let's hope). The first part is going to be a description of how I relay network connections from the Internet to my low-volume home-based email server to evade potential ISP blockages. The second part is going to be a show and tell about my resurrection of an ancient Linux version in order to recover data from Quarter Inch Cartridge tapes, and ancillary topics. It will also include a short demo of my MS-DOS 5.0 environment (resurrected from tape) from the month before I installed Linux for the first time in December 1992.

About Russell: I am a person for whom the Year of the Linux Desktop started in 1992 and has continued annually, uninterrupted. I worked for a couple of decades in scientific data management and analysis. Since 2005, I have been involved with the Personal Telco Project, a volunteer-based 501(c)(3) non-profit trying to unscrew telecommunications policy in the Portland metropolitan area. I did a short stint in data management for an oceanographic organization when it was housed at OHSU. I also volunteer at the Portland State Aerospace Society, working on their OreSat program. My name, misspelled in glorious circuit-board silkscreen, has literally been in orbit for most of the last 2 years.
I have done a bunch of PLUG talks over the years (scrolling through the log, I recognize these):

2023-03-02 Anatomy of a mailing list meltdown
2021-10-07 Russell's Excellent High Altitude Balloon Adventure
2020-01-02 Reading wireless temperature sensors with RTL-SDR and rtl_433
2019-02-07 PGP Key Storage with a Yubikey 4
2018-02-01 How to get a Municipal Broadband network in the City of Portland
2017-05-04 Going Coastal, Russell's Excellent Adventure at the Center for Coastal Margin Observation and Prediction
2013-06-06 Hacking on the BeagleBone Black
2008-11-19 OpenWrt, it's not just for Linksys Routers anymore
2008-02-07 MetroFi: How Lame is It?
2005-11-16 Mississippi Grant Project and Personal Telco
2004-10-20 Detection of electromagnetic fields with Linux
2003-12-17 Russell's excellent hardlink adventure (disk-to-disk backup systems)

Rules and Requests:
- Masks are encouraged but not required.
- PLUG is open to everyone and does not tolerate abusive behavior on its mailing lists or at its meetings.
- Do not leave valuables in your car.

Calagator Page: https://calagator.org/events/1250480986
Google Maps Link: https://www.google.com/maps/place/5500+SW+Dosch+Rd,+Portland,+OR+97239

Some might head to Hillsdale Brewery & Public House near the Library: https://www.mcmenamins.com/hillsdale-brewery-public-house (rideshares likely available).

PLUG Page with information about all PLUG events: http://pdxlinux.org/

Russell Senior
PLUG Volunteer

___
PLUG: https://pdxlinux.org
PLUG-announce mailing list
plug-annou...@lists.pdxlinux.org
https://lists.pdxlinux.org/mailman/listinfo/plug-announce