Arranging downtime is always a challenge.... The latest is our student information database server (a RAC cluster). One of the HBAs had failed, which wouldn't be a problem except that Hitachi says we're many versions behind on firmware updates for our 9990. (We had purchased it from Oracle a month before they announced they were dropping the line, but Oracle wouldn't let us out of the 3-year support contract, so we couldn't get updates or upgrades.) So we have an upgrade scheduled in a couple of weeks, which is only non-disruptive if both paths are working on every server....

Back when we had the 9980, there were times we aborted updates because we found a server missing a path, and back then there were no clustered DBs. Since then we try to check everything well before a scheduled update, since it's no fun getting up for a maintenance window and then aborting the update....
Anyways... this being student enrollment/orientation month, getting downtime is hard, especially downtime during the day, which is when the nearest Oracle FE shows up. After hours usually means a much further away FE comes down, so that only happens when they decide it's the only option... and I can't recall that happening since they took over.

Anyways... we did get the HBA replaced, but then the system wouldn't boot, which they eventually resolved by removing the DVD drive. Now they want to replace it... but how do you convince the DBAs to grant us downtime to replace something that is almost never used? Though it did get used once, when a new SA tripped the timebomb in a script written by a former SA. It was a script that generates passwd/shadow files from the information stored in the IDM database. Its dry-run option would open/create (and thus truncate) the passwd file, and then only say what it would do instead of doing it, though it correctly said instead of did for the open/create of the shadow file. This resulted in our cfengine putting empty passwd files on everything. And, since we do the runs exclusively from cron, there was no chance of self-repair. So there was a lot of booting systems from DVD and copying in a minimal passwd file, sufficient for us to log in as root and run cfagent to get the systems fixed. Fortunately it didn't wipe out the entire datacenter... just a large portion of it.

I came across a server yesterday that has an uptime of 2525 days. I had completely forgotten that there is still one FTP server... it's used for exchanging data with our state government; they did upgrade to FTPS (FTP-SSL) a few years back. There had been a call out because it wasn't responding... fortunately I could get one to fix it. Which will be another story, which is why I was reading this list today....

----- Original Message -----
> In all the process took 2-3 weeks, and 5 lots of downtime, from
> reporting the problem to finally having a replacement part in the
> server.
> Part of the delay was down to arranging downtime with the
> customer rather than HP. IIRC not long after that we were racking and
> configuring a hot-spare database server for the customer, who
> incidentally got through the next Christmas rush without even batting an
> eyelid. :)
>
> Paul

--
Who: Lawrence K. Chen, P.Eng. - W0LKC - Senior Unix Systems Administrator
For: Enterprise Server Technologies (EST) -- & SafeZone Ally
Snail: Computing and Telecommunications Services (CTS)
Kansas State University, 109 East Stadium, Manhattan, KS 66506-3102
Phone: (785) 532-4916 - Fax: (785) 532-3515 - Email: [email protected]
Web: http://www-personal.ksu.edu/~lkchen - Where: 11 Hale Library

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
