Re: ups batteries draining, can't switch to generators

2009-05-26 Thread Hal Merritt
I would respectfully disagree. I posed this question to the group a while back 
and the consensus was to let it crash. I believe the thought was that there was 
just not enough time or information to make an informed decision. Plus, anytime 
someone is frantically banging on a keyboard, screw-ups are a sure bet.  

I had to make this call a while back and followed the advice. All in all, 
things went very smoothly and we were back in full operation with very few 
issues.  DB2 and JES knew what to do and did it. Batch jobs that were running 
were simply treated just like any other failure. Our best-loved proprietary 
online system hit the ground running. 

I did drain initiators to prevent any new batch jobs from kicking off. 
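
(For the record, on JES2 that part is a one-liner from the console. These 
are standard JES2 operator commands, though your installation's initiator 
setup may vary:

   $P I     drain all initiators: running jobs finish, no new batch
            work is selected
   $D I     display initiator status to confirm they are drained

Everything else that night was harder than that.)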

I learned a few things, though. One, if you have an OS/2 HMC, shut it down 
ASAP. When power comes back, it runs a disk check that takes forever. I had 
to IPL using the Support Elements. 

The Shark has backup batteries so that it will come in for a graceful landing. 
That worked. But the unexpected twist was that the Shark would not go ready 
until the batteries were recharged. We now have a DS8100, but I would expect 
it to be no different. 

So, my suggestion for a loss of power scenario is to immediately evacuate 
personnel to a safe place. Nothing is worth getting someone hurt. BTDT.   



-----Original Message-----
From: IBM Mainframe Discussion List [mailto:ibm-m...@bama.ua.edu] On Behalf Of 
Joel C Ewing
Sent: Saturday, May 23, 2009 12:14 PM
To: IBM-MAIN@bama.ua.edu
Subject: Re: ups batteries draining, can't switch to generators

While YMMV, our experience has been that any utility power failure 
lasting more than 5-15 seconds is a solid failure and the outage will 
invariably be an hour or more while the utility company locates and 
fixes the problem.  This means that unless you have an extraordinary 
UPS, or functional generators able to recharge the UPS, you are going 
down; the only issue is when and how.  Given the choice, a controlled 
shutdown from which restart is almost guaranteed is infinitely better 
than gambling on hours of added downtime and potential data loss from 
an abrupt termination for the questionable benefit of staying up a few 
minutes longer.

 ..snip 
   Joel C Ewing

 



Re: ups batteries draining, can't switch to generators

2009-05-26 Thread Rick Fochtman

--snip (Hal Merritt's reply, quoted in full above) unsnip--
I tend to agree with your conclusions, especially the last one. Consider 
using your automation to accomplish an orderly shutdown, or, 
alternatively, use something like the COMMAND program from the CBTTAPE 
site for the same purpose. While it's not always perfect, it will help 
eliminate the finger checks that so often happen in times of great 
stress. Getting a clean termination of batch jobs will always be 
problematic, but automation of some sort can be used to at least effect 
an orderly shutdown of online systems, DBMSs, etc. We use COMMAND, and 
other than batch jobs, we are completely shut down and ready for power 
down, or IPL, in about 2 minutes. (3 CICS regions, 1 DBMS, plus TSO, 
VTAM and various monitoring tools.)
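
For illustration, the input to this kind of tool is essentially just an 
ordered list of operator commands (the exact COMMAND input syntax may 
differ, and the region and subsystem names below are invented):

   F CICSA,CEMT PERFORM SHUTDOWN    quiesce each CICS region in turn
   F CICSB,CEMT PERFORM SHUTDOWN
   F CICSC,CEMT PERFORM SHUTDOWN
   -STOP DB2,MODE(QUIESCE)          if the DBMS is DB2: back out
                                    in-flight work and stop cleanly
   P TSO                            stop TSO
   Z NET,QUICK                      halt VTAM
   $P I                             drain the JES2 initiators

The value is less in the individual commands than in having them issued 
in a known order, untouched by shaking hands.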


(We learned just how valuable this was during the Chicago Flood of 1992, 
when Edison gave us 10 minutes' warning before cutting our power.)


--
Rick
--
Remember that if you’re not the lead dog, the view never changes.



Re: ups batteries draining, can't switch to generators

2009-05-23 Thread Joel C Ewing
While YMMV, our experience has been that any utility power failure 
lasting more than 5-15 seconds is a solid failure and the outage will 
invariably be an hour or more while the utility company locates and 
fixes the problem.  This means that unless you have an extraordinary 
UPS, or functional generators able to recharge the UPS, you are going 
down; the only issue is when and how.  Given the choice, a controlled 
shutdown from which restart is almost guaranteed is infinitely better 
than gambling on hours of added downtime and potential data loss from 
an abrupt termination for the questionable benefit of staying up a few 
minutes longer.


One of the issues sounds like a management problem.  Placing the power to 
make a shutdown decision solely in the hands of a duty manager who is 
not 100% available obviously doesn't work for decisions requiring a 
10-minute-or-better response time.  The other issue is that you must have 
automation support in place to minimize the z/OS shutdown time, and 
documented emergency shutdown procedures that are required reading for 
whoever may have to effect the shutdown.


We have documented procedures for emergency system and hardware shutdown 
and z/OS automated procedures (using NetView automation and the CBT 
freebie programs NETINIT/NETSTOP) to take down online and batch systems and DB2 
as quickly as possible.  These are the same procedures used for normal 
IPL shutdown, so they are tested regularly.  Normally Operations would 
consult with whoever is on call in Technical Services (and someone is 
always available) and we would advise whether to initiate a system 
shutdown or do it ourselves if on site; but if communication is 
impossible within the allowed time frame, that decision must be made by 
the ranking Operator on site.


Our procedures also document a quick-and-dirty shutdown method if there 
is reason to believe the remaining UPS time is at best only one or two 
minutes instead of the typical 15+ minutes: namely, QUIESCE z/OS, 
SYSTEM RESET the production LPAR, and power down the processor and 
other hardware ASAP.  There is greater risk of logical damage (DB2 
threads in a questionable state, and possibly a need to recover some 
specific tables from archive logs), but doing a controlled hardware 
shutdown should at least eliminate any hardware issues on restart.
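
Spelled out as console steps, that quick-and-dirty sequence is roughly 
the following (a sketch assuming a single production LPAR; the SYSTEM 
RESET and power-off are HMC/Support Element functions, not z/OS 
commands):

   QUIESCE                  z/OS: quiesce I/O and stop dispatching work
   (HMC) SYSTEM RESET       reset the production LPAR
   (HMC) power off the processor, then DASD and other hardware

The aim is simply to get all I/O stopped in an orderly state before the 
power disappears on its own.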

  Joel C Ewing

Kelly Bert Manning wrote:

Please don't laugh.

I work with applications on a supported, non-sysplex, non-XRF z/OS 
system where there have been 3 cases of UPS batteries draining flat, 
followed by uncontrolled server crashes, in the past 17 years.


They all happened in October and November, gale season. (Cue background
music with the "gales of November" line by Gordon Lightfoot.)

After the first one, the data center operator said that they would consider
giving operators authority to shut down OS/390 if they were unable to
make immediate contact with the Duty Manager after discovering that
UPS batteries were draining during a power failure and that generator
power was not available or had failed after starting.

Four weeks later a carbon-copy crash occurred, inspiring a promise that
operators would start draining CICS and IMS message queues and stopping
and rolling back BMPs and DB2 online jobs while there was still power
in the batteries.

Roll forward to this decade: power off during gale season, generators
start, but one fails and goes offline, followed by other mayhem in the
power hardware. Back on batteries for 22 minutes, until they drained and
the z server crashed. The current operator says, "What promise to shut
everything down cleanly before the batteries drain?"

Is 22 minutes an unreasonable amount of time for purging IMS message
queues, bringing down CICS regions, draining initiators, abending and
rolling back online IMS and DB2 jobs to the last checkpoint, swapping
logs, writing and dismounting log backups, and turning off power before 
sudden power loss starts to wreak havoc with disk and other hardware?
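
For reference, none of those steps is individually exotic at the
console. In rough terms (region names and command prefixes below are
placeholders for whatever an installation actually uses):

   /CHE PURGE                       IMS: shut down, processing queued
                                    messages first
   F CICSA,CEMT PERFORM SHUTDOWN    each CICS region
   $P I                             JES2: drain the initiators
   -STOP DB2,MODE(QUIESCE)          DB2: back out in-flight work, stop
   I SMF                            switch SMF data sets
   Z EOD                            close the log and SMF cleanly
   $P JES2                          and finally JES2, then power off

The clock mostly goes to waiting on in-flight units of work, not to
typing.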


Oh, did I mention? The 2-CPU processor was only about 30% busy at the
time, the weekly Sunday low-CPU-use period.

We had a different sort of power outage after the first of the 2 crashes
last decade. Somebody working for one of the potential bidders used
a metal tape measure in an attempt to measure clearance around the
power cable entrance to the building. The resulting demonstration of
how much power moves through the space around a high voltage cable
destroyed several 3380 clone drives, in addition to crashing all
the OS/390 processors. I earned my DBA pay that day.

Bottom line: what should happen when UPS batteries start to drain and
there is no prospect of reliable, high-quality utility power being
restored quickly? Leave it up and roll the dice on losing work in
progress and log data (head crashes and cache controller microcode
bugs), or shut it down cleanly?



Re: ups batteries draining, can't switch to generators

2009-05-23 Thread Doug Fuerst
Why not simply fix the problems in the power systems, and test them 
regularly? Had this happened at any of the last 3 places I have worked, 
I would have been escorted off the premises, with my stuff thrown out 
after me.


Doug

snip (Kelly Bert Manning's original post, quoted in full above) unsnip








Re: ups batteries draining, can't switch to generators

2009-05-23 Thread Scott T. Harder
Hi All,

I have a couple of points to make on this topic:

1)  For all the well-documented (and well-taken) information about the
importance of shutting a system down in an orderly, clean fashion, I
find it hard to remember when, in my experience, the system ever had
problems coming back up after a hard crash (I worked in OPS for 10+
yrs.).  Maybe back in the 308x days, but 3090 and later?  The hardware
was pretty resilient, as I remember.  I'm not saying that I recommend
anything other than a clean shutdown, but...

2)  Kelly's post harkens back to an old pet peeve of mine:
Operations *used* to have good, knowledgeable people who could make
decisions without calling 5 people to tell them what to do!  I saw,
firsthand, the dumbing down of OPS and it disturbed me greatly.  I had
mgmt come into Operations where I worked that *never* wanted to be at
fault... that was truly their #1 priority.  They achieved this by
never making a darn decision on their own... never sticking their neck
out no matter what the situation.  I remember one time when I
restarted the master catalog to resolve a problem, as called for by
the manual (ok... I think I could have gotten away with a lesser
evil), but my point is that my mgmt thought I was nuts (and just
lucky).  Maybe so, but as long as we put in place zombies who won't
take action based on knowledge and experience (and who aren't, most
importantly, empowered to do so), more money must be spent on
hardware, systems and automation to take their place.

Just my thoughts...

All the best,
Scott T. Harder

 Kelly Bert Manning wrote:
 ...snip (original post, quoted in full above)...



-- 
All the best,
Scott T. Harder



Re: ups batteries draining, can't switch to generators

2009-05-23 Thread Scott T. Harder
Sorry, I should have said I restarted the CATALOG address space.

On 5/23/09, Scott T. Harder scottyt.har...@gmail.com wrote:
 ...snip (my previous post, quoted in full above)...



-- 
All the best,
Scott T. Harder



Re: ups batteries draining, can't switch to generators

2009-05-23 Thread Ted MacNEIL
I saw, firsthand, the dumbing down of OPS and it disturbed me greatly.

When I started in this business, working as an operator was almost a 
requirement before becoming a SYSPROG.


I had mgmt come into Operations where I worked that *never* wanted to be at 
fault... that was truly their #1 priority.

BTDT. GTTS.

But, my management (at the time) still gave them the call on whether, and 
which, changes would be implemented.
And which changes to back out.
And, when I complained, they just said I didn't understand, having never been 
in the trenches.
I said, I have the scars to prove it.
Remember when you could do a $PQ and blow everything away (before $PQ,ALL was 
introduced and a bare $PQ became an error)?
I did that, once.


They achieved this through never making a darn decision on their own... never 
sticking their neck out no matter what the situation.

I've worked for many financial and government organisations.
That is the prevalent attitude in many departments, not just Computer Ops.
The problem is that eventually a decision will be forced by the erosion of 
the situation; you can only postpone for so long.

By the way, I think you have the wrong third letter in 'darn'.
-
Too busy driving to stop for gas!



Re: ups batteries draining, can't switch to generators

2009-05-23 Thread Scott T. Harder
On 5/23/09, Ted MacNEIL eamacn...@yahoo.ca wrote:
 When I started in this business, working as an operator was almost a
 requirement before becoming a SYSPROG.

Absolutely.


 BTDT. GTTS.

;-)


 But, my management (at the time) still gave them the call of if/which
 changes would be implemented.
 And, which changes to back out.

Yup.  Warranted, I think, though.

 And, when I complained, they just said I didn't understand, having never
 been in the trenches.

Wrong.

 I said, I have the scars to prove it.
 Remember when you could do a $PQ and blow everything away (before $PQ,ALL
 was introduced and $PQ was an error).
 I did that, once.

What about just a $P?  One day, everything came to a screeching halt
and nobody could figure it out.  Turned out, a Print Room Op had
inadvertently entered $P and the whole system drained.  ;-)
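
For anyone who never got bitten, the difference is a single operand (a
sketch based on the behavior Ted describes above; current JES2 levels
insist on the explicit forms):

   $P          drain the entire JES2 system: no new work is selected
   $PQ,ALL     purge the queues (what a bare $PQ once did)
   $P I        drain just the initiators, usually what was actually meant

One missing character, one very quiet shop.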



They achieved this through never making a darn decision on their own...
 never sticking their neck out no matter what the situation.

 I've worked for many financial and government organisations.
 That is the prevalent attitude in many departments, not just Computer Ops.
 The problem becomes, eventually a decision will be made, due to the erosion
 of the situation, and you can only postpone for so long.

Too much time wasted.  People on the front lines need to be empowered
to make decisions, and they need to be well-paid, knowledgeable
professionals.


 By the way, I think you have the wrong third letter in 'darn'.

You bet!  ;-)




-- 
All the best,
Scott T. Harder

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html