Re: ups batteries draining, can't switch to generators
I would respectfully disagree. I posed this question to the group a while back and the consensus was to let it crash. I believe the thinking was that there was simply not enough time or information to make an informed decision. Plus, any time someone is frantically banging on a keyboard, screw-ups are a sure bet.

I had to make this call a while back and followed that advice. All in all, things went very smoothly and we were back in full operation with very few issues. DB2 and JES knew what to do and did it. Batch jobs that were running were simply treated like any other failure. Our best-loved proprietary online system hit the ground running. I did drain the initiators to prevent any new batch jobs from kicking off.

I learned a few things, though. One: if you have an OS/2 HMC, shut it down ASAP. When power comes back, it runs a disk check that takes forever; I had to IPL using the support elements. Two: the Shark has backup batteries so that it can come in for a graceful landing. That worked, but the unexpected twist was that the Shark would not go ready until the batteries were recharged. We now have a DS8100, but I would expect it to be no different.

So, my suggestion for a loss-of-power scenario is to immediately evacuate personnel to a safe place. Nothing is worth getting someone hurt. BTDT.

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:ibm-m...@bama.ua.edu] On Behalf Of Joel C Ewing
Sent: Saturday, May 23, 2009 12:14 PM
To: IBM-MAIN@bama.ua.edu
Subject: Re: ups batteries draining, can't switch to generators

While YMMV, our experience has been that any utility power failure lasting more than 5-15 seconds is a solid failure and the outage will invariably be an hour or more while the utility company locates and fixes the problem. ..snip

Joel C Ewing
Re: ups batteries draining, can't switch to generators
--snip--
So, my suggestion for a loss-of-power scenario is to immediately evacuate personnel to a safe place. Nothing is worth getting someone hurt. BTDT.
--unsnip--

I tend to agree with your conclusions, especially the last one.

Consider using your automation to accomplish an orderly shutdown, or alternatively, use something like the COMMAND program from the CBTTAPE site for the same purpose. While it's not always perfect, it will help eliminate the finger checks that so often happen in times of great stress. Getting a clean termination of batch jobs will always be problematic, but automation of some sort can be used to at least effect an orderly shutdown of online systems, DBMSs, etc.

We use COMMAND, and other than batch jobs, we are completely shut down and ready for power down, or IPL, in about 2 minutes. (3 CICS regions, 1 DBMS, plus TSO, VTAM and various monitoring tools.) (We learned just how valuable this was during the Chicago Flood of 1992, when Edison gave us 10 minutes' warning before cutting our power.)

--
Rick

Remember that if you're not the lead dog, the view never changes.
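For anyone who has not scripted this, here is a minimal sketch in Python of the idea behind such a canned sequence. It is only an illustration, not the COMMAND program itself: the issue_command() stub, the region and job names, and the pause values are all assumptions to be replaced with whatever console interface and timings your shop actually uses; the operator commands themselves are the ordinary MVS/JES2/VTAM ones, which you would adjust to your own subsystems.

import time

# Ordered shutdown sequence: (operator command, seconds to allow before the
# next step). Names like CICSPRD1 and the pauses are placeholders; tune both
# to your own regions, subsystems, and measured shutdown times.
SHUTDOWN_SEQUENCE = [
    ("F CICSPRD1,CEMT PERFORM SHUTDOWN", 30),  # online regions first
    ("F CICSPRD2,CEMT PERFORM SHUTDOWN", 30),
    ("F CICSPRD3,CEMT PERFORM SHUTDOWN", 30),
    ("-STOP DB2 MODE(QUIESCE)", 30),           # then the DBMS
    ("$P I", 5),                               # drain the JES2 initiators
    ("P TSO", 5),                              # stop TSO
    ("Z NET,QUICK", 10),                       # bring down VTAM
    ("$P JES2", 15),                           # JES2 last
    ("Z EOD", 0),                              # close out the syslog
]

def issue_command(cmd: str) -> None:
    """Stand-in for however your shop pushes commands to the console
    (an automation product's API, an extended MCS console, etc.)."""
    print(f"issuing: {cmd}")

def emergency_shutdown() -> None:
    """Walk the canned sequence instead of typing under pressure."""
    for cmd, pause in SHUTDOWN_SEQUENCE:
        issue_command(cmd)
        time.sleep(pause)

if __name__ == "__main__":
    emergency_shutdown()

The point is the one Rick makes: the sequence is decided, reviewed, and tested ahead of time, so at 3 a.m. on battery power nobody has to remember it or type it.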
Re: ups batteries draining, can't switch to generators
While YMMV, our experience has been that any utility power failure lasting more than 5-15 seconds is a solid failure, and the outage will invariably be an hour or more while the utility company locates and fixes the problem. This means that unless you have an extraordinary UPS, or functional generators able to recharge the UPS, you are going down -- the only issue is when or how. Given the choice, a controlled shutdown from which restart is almost guaranteed is infinitely better than gambling on adding hours of downtime, and potential data loss, from an abrupt termination, all for the questionable benefit of staying up a few minutes longer.

One of the issues sounds like a management problem. Placing the power to make a shutdown decision solely in the hands of a duty manager who is not 100% available obviously doesn't work for decisions requiring a 10-minute-or-better response time. The other issue is that you must have automation support in place to minimize the z/OS shutdown time, and documented emergency shutdown procedures that are required reading for whoever may have to effect the shutdown.

We have documented procedures for emergency system and hardware shutdown, and automated z/OS procedures (using NetView automation and the CBT freebie program NETINIT/NETSTOP) to take down online and batch systems and DB2 as quickly as possible. These are the same procedures used for a normal pre-IPL shutdown, so they are tested regularly. Normally Operations would consult with whoever is on call in Technical Services (someone is always available), and we would advise whether to initiate a system shutdown, or do it ourselves if on site; but if communication is impossible within the allowed time frame, that decision must be made by the ranking Operator on site.

Our procedures also document a quick-and-dirty shutdown method for when there is reason to believe the remaining UPS time is at best only one or two minutes instead of the typical 15+ minutes: namely, QUIESCE z/OS, SYSTEM RESET the production LPAR, and power down the processor and other hardware ASAP. There is greater risk of logical damage -- DB2 threads in a questionable state and possibly a need to recover some specific tables from archive logs -- but doing a controlled hardware shutdown should at least eliminate any hardware issues on restart.

Joel C Ewing

Kelly Bert Manning wrote:

Please don't laugh. I work with applications on a supported, non-sysplex, non-XRF z/OS system where there have been 3 cases of UPS batteries draining flat, followed by uncontrolled server crashes, in the past 17 years. They all happened in October and November, gale season. (Cue background music with the Gales of November line by Gordon Lightfoot.)

After the first one, the data center operator said that they would consider giving operators authority to shut down OS/390 if they were unable to make immediate contact with the Duty Manager after discovering that UPS batteries were draining during a power failure and that generator power was not available, or had failed after starting. Four weeks later a carbon-copy crash occurred, inspiring a promise that operators would start draining CICS and IMS message queues and stopping and rolling back BMPs and DB2 online jobs while there was still power in the batteries.

Roll forward to this decade: power off during gale season, generators start, but one fails and goes offline, followed by other mayhem in the power hardware. Back on batteries for 22 minutes, until they drain and the z server crashes.
The current operator says, "What promise to shut everything down cleanly before the batteries drain?"

Is 22 minutes an unreasonable amount of time for purging IMS message queues, bringing down CICS regions, draining initiators, abending and rolling back online IMS and DB2 jobs to the last checkpoint, swapping logs, writing and dismounting log backups, and turning off power, before sudden power loss starts to wreak havoc on disk and other hardware? Oh, did I mention: the two-CPU machine was only about 30% busy at the time, the Sunday weekly low-CPU-use period.

We had a different sort of power outage after the first of the 2 crashes last decade. Somebody working for one of the potential bidders used a metal tape measure in an attempt to measure clearance around the power cable entrance to the building. The resulting demonstration of how much power moves through the space around a high-voltage cable destroyed several 3380 clone drives, in addition to crashing all the OS/390 processors. I earned my DBA pay that day.

Bottom line: what should happen when UPS batteries start to drain and there is no prospect of reliable, high-quality utility power being restored quickly? Leave it up and roll the dice on losing work in progress and log data (head crashes and cache controller microcode bugs), or shut it down cleanly?
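Joel's two paths, the regular automated shutdown versus QUIESCE, SYSTEM RESET, and power-off, amount to a threshold decision on estimated battery time remaining. A minimal Python sketch of that decision follows, purely illustrative: the threshold values and the idea of feeding it a runtime estimate from the UPS (over SNMP, say) are assumptions to be calibrated against your own measured shutdown times.

# Thresholds are assumptions; measure your own automated sequence and
# quick-and-dirty path and set them accordingly.
ORDERLY_SHUTDOWN_MINUTES = 5.0   # time the full automated sequence needs
QUICK_SHUTDOWN_MINUTES = 2.0     # QUIESCE + SYSTEM RESET + power down

def choose_shutdown_path(ups_minutes_remaining: float) -> str:
    if ups_minutes_remaining >= ORDERLY_SHUTDOWN_MINUTES:
        # Enough headroom: run the same automated sequence used for a
        # normal pre-IPL shutdown, so the procedure is already well tested.
        return "orderly"
    if ups_minutes_remaining >= QUICK_SHUTDOWN_MINUTES:
        # Logical damage is possible (in-doubt DB2 threads, table
        # recoveries from archive logs), but the hardware comes down clean.
        return "quick-and-dirty"
    # No time left at all: protect the hardware and get people out.
    return "power-off-now"

if __name__ == "__main__":
    for minutes in (22.0, 3.5, 0.5):
        print(minutes, "->", choose_shutdown_path(minutes))

By that yardstick, Kelly's 22 minutes of battery should have been ample for a clean shutdown of a 30%-busy system, provided the sequence was automated and the operator was empowered to start it.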
Re: ups batteries draining, can't switch to generators
Why not simply fix the problems in the power systems, and test them regularly? If this had been the case at the last 3 places I have worked, I would have been escorted off the premises, and my stuff thrown out after me.

Doug

Kelly Bert Manning wrote:

Please don't laugh. I work with applications on a supported, non-sysplex, non-XRF z/OS system where there have been 3 cases of UPS batteries draining flat, followed by uncontrolled server crashes, in the past 17 years. ..snip..
Re: ups batteries draining, can't switch to generators
Hi All,

I have a couple of points to make on this topic:

1) For all the well-documented (and well-taken) information about the importance of shutting a system down orderly and cleanly, I find it hard to remember when - in my experience - the system ever had problems coming back up after a hard crash (I worked in OPS for 10+ yrs.). Maybe back in the 308x days, but on a 3090 or later?? The hardware was pretty resilient, as I remember. I'm not saying that I recommend anything other than a clean shutdown, but...

2) Kelly's post harks back to an old pet peeve of mine, and that is: Operations *used* to have good, knowledgeable people who could make decisions without calling 5 people to tell them what to do! I saw, firsthand, the dumbing down of OPS and it disturbed me greatly. I had mgmt come into Operations where I worked that *never* wanted to be at fault... that was truly their #1 priority. They achieved this by never making a darn decision on their own... never sticking their neck out no matter what the situation. I remember one time when I restarted the master catalog to resolve a problem, as called for by the manual (ok... I think I could have gotten away with a lesser evil), but my point is that my mgmt thought I was nuts (and just lucky). Maybe so, but as long as we put zombies in place of people who will take action based on knowledge and experience (and who are - most importantly - empowered to do so), more money must be spent on hardware, systems and automation to take their place.

Just my thoughts...

All the best,
Scott T. Harder

Kelly Bert Manning wrote:

Please don't laugh. I work with applications on a supported, non-sysplex, non-XRF z/OS system where there have been 3 cases of UPS batteries draining flat, followed by uncontrolled server crashes, in the past 17 years. ..snip..
Re: ups batteries draining, can't switch to generators
Sorry - I should have said that I restarted the CATALOG address space.

On 5/23/09, Scott T. Harder scottyt.har...@gmail.com wrote:

Hi All, I have a couple of points to make on this topic... ..snip.. I remember one time when I restarted the master catalog to resolve a problem... ..snip..
--
All the best,
Scott T. Harder
Re: ups batteries draining, can't switch to generators
> I saw, firsthand, the dumbing down of OPS and it disturbed me greatly.

When I started in this business, working as an operator was almost a requirement before becoming a SYSPROG.

> I had mgmt come into Operations where I worked that *never* wanted to be at fault... that was truly their #1 priority.

BTDT. GTTS. But, my management (at the time) still gave them the call of if and which changes would be implemented, and which changes to back out. And, when I complained, they just said I didn't understand, having never been in the trenches. I said, I have the scars to prove it. Remember when you could do a $PQ and blow everything away (before $PQ,ALL was introduced and a bare $PQ became an error)? I did that, once.

> They achieved this through never making a darn decision on their own... never sticking their neck out no matter what the situation.

I've worked for many financial and government organisations. That is the prevalent attitude in many departments, not just Computer Ops. The problem becomes that eventually a decision will be made, due to the erosion of the situation, and you can only postpone for so long.

By the way, I think you have the wrong third letter in 'darn'.

-
Too busy driving to stop for gas!
Re: ups batteries draining, can't switch to generators
On 5/23/09, Ted MacNEIL eamacn...@yahoo.ca wrote:

> When I started in this business, working as an operator was almost a requirement before becoming a SYSPROG.

Absolutely.

> BTDT. GTTS.

;-)

> But, my management (at the time) still gave them the call of if and which changes would be implemented, and which changes to back out.

Yup. Warranted, I think, though.

> And, when I complained, they just said I didn't understand, having never been in the trenches.

Wrong.

> I said, I have the scars to prove it. Remember when you could do a $PQ and blow everything away (before $PQ,ALL was introduced and a bare $PQ became an error)? I did that, once.

What about just a $P? One day, everything came to a screeching halt and nobody could figure it out. Turned out, a Print Room Op had inadvertently entered $P and the whole system drained. ;-)

>> They achieved this through never making a darn decision on their own... never sticking their neck out no matter what the situation.
>
> I've worked for many financial and government organisations. That is the prevalent attitude in many departments, not just Computer Ops. The problem becomes that eventually a decision will be made, due to the erosion of the situation, and you can only postpone for so long.

Too much time wasted. People on the front lines need to be empowered to make decisions, and they need to be well-paid, knowledgeable professionals.

> By the way, I think you have the wrong third letter in 'darn'.

You bet! ;-)

--
All the best,
Scott T. Harder