Arranging downtime is always a challenge....

The latest is our student information database server (a RAC cluster)...one of 
the HBAs had failed...which wouldn't be a problem except that Hitachi says 
we're many versions behind on firmware updates for our 9990 (because we had 
purchased it from Oracle a month before they announced they were dropping the 
line, but Oracle wouldn't let us out of the 3-year support contract...so we 
couldn't get updates or upgrades).  So, we have an upgrade scheduled in a 
couple of weeks.  Which is only non-disruptive if both paths are working on 
every server....  There had been times, back when we had the 9980, when we 
would abort updates after finding a server missing a path...and back then 
there were no clustered DBs.  Since then we try to check everything well 
before a scheduled update, since it's no fun getting up for a maintenance 
window and then aborting the update....

Anyways...this being student enrollment/orientation month...getting downtime 
is hard...especially downtime during the day, when the nearest Oracle FE shows 
up.  After hours usually means a much-further-away FE comes down...so that 
only happens when they decide it's the only option...and I can't recall the 
last time that happened since they took over.

Anyways...we did get the HBA replaced....but then the system wouldn't boot, 
which they eventually resolved by removing the DVD drive.

Now, they want to replace it....but how do you convince the DBAs to grant us 
downtime to replace something that is almost never used....though it did get 
used once....when a new SA tripped the timebomb in a script written by a former 
SA.

It was a script that generates passwd/shadow files from the information stored 
in the IDM database....  The dry-run option would still open/create (and 
therefore truncate) the passwd file, and then only say what it would do 
instead of doing it.  It did, though, say instead of do for the open/create of 
the shadow file.  This resulted in our cfengine putting empty passwd files on 
everything.  And, since we do the runs exclusively from cron...no chance for 
self repair....
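
For the curious, the failure mode was roughly the pattern below (a minimal 
sketch in Python, not the actual script...the file names, the IDM lookup, and 
the --dry-run handling are all stand-ins): the passwd file gets opened for 
writing, and therefore truncated, before the dry-run check ever happens, while 
the shadow file is handled correctly.

    #!/usr/bin/env python3
    # Sketch of the dry-run timebomb -- an illustration of the failure
    # mode, not the original script.  Names and flags are assumptions.
    import sys

    def fetch_idm_users():
        # Stand-in for the real IDM database query.
        return [("alice", "x", 1001, 1001, "Alice", "/home/alice", "/bin/bash")]

    def main(dry_run):
        users = fetch_idm_users()

        # BUG: opening with "w" truncates the passwd file immediately,
        # even on a dry run -- the check comes too late.
        with open("passwd.new", "w") as pw:
            for u in users:
                line = ":".join(str(f) for f in u)
                if dry_run:
                    print("would write:", line)  # only *says* what it would do...
                else:
                    pw.write(line + "\n")        # ...but the file is already empty

        # The shadow file was the part that behaved: on a dry run,
        # nothing is ever opened or truncated.
        if dry_run:
            print("would rewrite shadow file")
        else:
            with open("shadow.new", "w") as sh:
                for name, *_ in users:
                    sh.write(name + ":!:::::::\n")

    if __name__ == "__main__":
        main(dry_run="--dry-run" in sys.argv[1:])

Run with --dry-run, that leaves an empty passwd file behind...which cfengine 
then happily pushes everywhere.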

So, there was a lot of booting systems from DVD and copying a minimal passwd 
file, sufficient for us to log in as root and run cfagent to get the systems 
fixed.
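
For anyone who hasn't had the pleasure: since the shadow files were untouched, 
root's real password still worked, so the minimal passwd file really only 
needs the root entry...something like this, with the shell and home directory 
adjusted for the OS in question:

    root:x:0:0:root:/root:/bin/bash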

Fortunately it didn't wipe out the entire datacenter...just a large portion of 
it.  I came across a server yesterday that has an uptime of 2525 days.  I had 
completely forgotten that there is still one FTP server...it's used to 
exchange data with our state government...they did upgrade to FTPS (FTP-SSL) a 
few years back....  There had been a call-out because it wasn't 
responding...fortunately I could get in to fix it.  Which will be another 
story...which is why I was reading this list today....

----- Original Message -----

> In all the process took 2-3 weeks, and 5 lots of downtime, from
> reporting the problem to finally having a replacement part in the
> server.  Part of the delay was down to arranging downtime with the
> customer rather than HP.  IIRC not long after that we were racking and
> configuring a hot-spare database server for the customer, who
> incidentally got through the next Christmas rush without even batting
> an eyelid. :)
> 
> Paul

-- 
Who: Lawrence K. Chen, P.Eng. - W0LKC - Senior Unix Systems Administrator
For: Enterprise Server Technologies (EST) -- & SafeZone Ally
Snail: Computing and Telecommunications Services (CTS)
Kansas State University, 109 East Stadium, Manhattan, KS 66506-3102
Phone: (785) 532-4916 - Fax: (785) 532-3515 - Email: [email protected]
Web: http://www-personal.ksu.edu/~lkchen - Where: 11 Hale Library