Mike has a lot of good ideas here, but I'll comment on a couple of minor points.
1. We have multiple VM systems in our shop, so changing the node name during a disaster would cause more problems than it would solve. We have too much code that does things like "If node = 'VMSYS1' Then Do ..." to be able to tolerate new node names for DR. We update the logo file to change "RUNNING" in the lower right corner of the screen to "DR SYS". The node name stays the same.

2. The other thing I would never change during DR is the time zone. Your users have scheduled jobs based on the normal time zone of the system. If you change the time zone, those jobs will run early or late. If the jobs run late, users will be late getting their reports, data feeds, or whatever. If the jobs run early, they may run before feeds from other systems have arrived. Neither is good.

Dennis

"See, I'm a man of simple tastes. I like dynamite, and gunpowder... And gasoline! Do you know what all of these things have in common? They're cheap!" -- Heath Ledger as The Joker, in The Dark Knight

-----Original Message-----
From: The IBM z/VM Operating System [mailto:ib...@listserv.uark.edu] On Behalf Of Mike Walter
Sent: Tuesday, February 02, 2010 15:08
To: IBMVM@LISTSERV.UARK.EDU
Subject: Re: [IBMVM] First time DR excercise

Is your management 100% certain that YOU will survive the unplanned disaster? Do you agree with them? What happens when there is a natural disaster where you do survive, but your family needs you immediately? Do you choose to go to the D.R. site for an unspecified period of time, hoping that someone else will take care of your family's needs (some could involve hospitalization, right?).

In that vein, see below for an alternative to which I am naturally partial, since it allows easy, well-tested, automated changes when running on a different CPU. That's especially handy when an experienced systems programmer is not available. I have always tried to write my VM D.R. plan so that a typical VM operator, or even a non-VM manager (!)
could bring up the VM system at least to the point that program product cpuid keys need to be updated for the recovery CPU.

First, if you plan to restore your production SPOOL data at your D.R. site, placement (slot numbers) of the SPOOL volumes in the SYSTEM CONFIG's "OWN"ed list does not matter *if* you are formatting them and restoring from an SPXTAPE backup (or a product that calls SPXTAPE, such as VM:Spool). However, I share with you that...

:religion on
1) it is always a "Best Practice" to place your SPOOL volumes in the SYSTEM CONFIG either starting in SLOT 1,
2) with plenty of RESERVED slots below for adding more (RESERVED slots have very little overhead, IBM won't even charge you more for them!) e.g.
   PRODVM: RECOVERY: CP_Owned Slot 1 VMSP01 OWN
   PRODVM: RECOVERY: CP_Owned Slot 2 VMSP02 OWN
   PRODVM: RECOVERY: CP_Owned Slot 3 VMSP03 OWN
   CP_Owned Slot 4 RESERVED
   ... and more RESERVED slots for future SPOOL growth as needed...
3) or in the very last slots (i.e. ending with SLOT 255 backing up from there with plenty of RESERVED slots above),
   ... and more RESERVED slots above for future SPOOL growth as needed...
   CP_Owned Slot 252 RESERVED
   PRODVM: RECOVERY: CP_Owned Slot 253 VMSP03 OWN
   PRODVM: RECOVERY: CP_Owned Slot 254 VMSP02 OWN
   PRODVM: RECOVERY: CP_Owned Slot 255 VMSP01 OWN
4) Regardless of their top or bottom location, include "flashing red-neon" comments (that's just a point of emphasis, don't search IBM doc for "flashing red-neon" comments), warning you and your sysprog heirs to NEVER CHANGE SPOOL VOLUME SLOT NUMBERS WITHOUT A CURRENT SPXTAPE BACKUP (and a thorough plan already tested on a 2nd level system, and ... an off-site copy of your resume)!
:religion off

Now, about that SYSTEM CONFIG file.
-----------------------------------
Let's presume that your normal production system runs on an ancient, creaky old z800, with serial number 12345.
Suppose also that your employer has a number of shiny newer boxes with known serial numbers, on any of which you might be able to RECOVER your production z/VM system should your creaky, rusted, decrepit old z800 crash and take a while to repair, as parts are shipped from some far-off location where old machines are stored (maybe in the desert near all those old aircraft?). If none of those "RECOVERY" systems are available (i.e. it really was a *true* DISASTER), then you'll come up on a disaster recovery provider's "DISASTER" machine.

Consider including in the SYSTEM CONFIG something along the lines of:
---<snip>---
/* ----------------------------------------------------------- */
/* Standard operating environment. See also "NODAL CONFIG Y"   */
/* which contains some hard-coded device addresses that change */
/* when running on different systems based upon the defined    */
/* "System_Identifier" from this file.                         */
/* Warning!: The last (of duplicate) System_ID record with     */
/* matching model and cpuid value overrides previous System_IDs*/
/* ----------------------------------------------------------- */
System_ID 2066 %%2345 PRODVM
System_ID 2094 %%nnn1 RECOVERY
System_ID 2094 %%nnn2 RECOVERY
System_ID 2097 %%nnn3 RECOVERY
System_ID 2097 %%nnn4 RECOVERY
/* Use DISASTER only when we're not in our data center so      */
/* that extensive D.R. automation can take place on various    */
/* service machines when running PROFILE EXEC/PROFILE GCS's.   */
System_Identifier_Default DISASTER
/**********************************************************************/
/* Timezone Definitions                                               */
/* (Next assumes Central time for prod, Eastern time for DISASTER)    */
/**********************************************************************/
PRODVM: TESTVM: Timezone_Definition CST West 06.00.00
PRODVM: TESTVM: Timezone_Definition CDT West 05.00.00
RECOVERY: Timezone_Definition CST West 06.00.00
RECOVERY: Timezone_Definition CDT West 05.00.00
DISASTER: Timezone_Definition EST West 05.00.00
DISASTER: Timezone_Definition EDT West 04.00.00
---<snip>---

What's that you say? "What are those lines beginning with 'PRODVM:', 'RECOVERY:', 'DISASTER:', etc.?" Well, those are the SYSTEM CONFIG's "RECORD QUALIFIERS" - oft overlooked/VERY POWERFUL! Those names "PRODVM", "RECOVERY", "DISASTER" (or anything you choose to use) were defined above when the CPUID matched one listed in the "System_ID" records. I place the "System_ID" records near the top of our SYSTEM CONFIG, so that they are available for use everywhere below that point. Any record that begins with one of the "RECORD QUALIFIER" names will be processed only on a CPU whose model and serial number matched the System_ID record for that name.

So, to automate start-up (maybe you can sleep through a test!?), in "PROFILE EXEC" on OPERATOR, include some code along the lines of:
---<snip>---
parse value diag(08,'QUERY USERID') with ,
      self . ConfigSysID . '15'x .
?prodvm=(ConfigSysID)='PRODVM'
?testvm=(ConfigSysID)='TESTVM'
... more local code...
If \?prodvm & \?testvm then 'EXEC OPDISAST'
... more local code...
---<snip>---

The "OPDISAST EXEC" (code yours however you see fit, it's just a "Simple Matter of Programming") should prompt the Operator along the lines of:
---<snip>---
/* Prolog; See Epilog for additional information ********************
 * Exec Name     - OPDISAST EXEC for OPERATOR 191 disk.             *
 * Unit Support  - IS                                               *
 * Status        - Version 1, Release 2.1                           *
 ********************************************************************/
address 'COMMAND'
parse source xos xct xfn xft xfm xcmd xenvir .
parse upper arg parms 0 operands '(' options ')' parmrest
'CP SPOOL CONSOLE * START'
'CP SET RUN ON'
?test=wordpos('TEST',parms)>0
hi='1DE8'x                        /* 3270 hilight attrs      */
lo='1D60'x                        /* 3270 std attributes     */
...more code as needed...
'STATE CLRSCRN MODULE *'          /* Preferred               */
If rc=0 then clear='CLRSCRN MORE'
else clear='VMFCLEAR'             /* Fall-back               */
doc=hi'To HOLD the screen press "ENTER",'lo||,
    'or to proceed press "CLEAR"'
cleardoc=left(doc,78-length(clear))||'('clear')'
Call Clear         /* Just so it shows in a console listing */
say hi
say '>>>>>>>>> WARNING <<<<<<<>>>>>>> WARNING <<<<<<<>>>>>>> WARNING' '<<<<<<<<<'
say ''
say center('The response to a "QUERY CPUID" does not match that',78)
say center('normally returned at "home".',78)
say lo
parse value diag(08,'CP QUERY VIRTUAL CONSOLE') with ,
      . . . consrdev .
Do forever
  say hi
  say center('Is this system running on a Disaster Recovery',
             'processor?',79)
  say lo
  say center('Reply:'hi'1, 2, or 3'lo,79)
  say
  say 'Replying:'hi||1||lo,
      'will take some automated actions to bring'
  say ' the PRODVM system up as if it were running'
  say ' on its normal machine located in the PROD'
  say ' DATA CENTER.'
  say
  say 'Replying:'hi||2||lo,
      'will initiate standard production' ,
      'procedures, likely'
  say ' causing vendor programs to fail, and multitudes'
  say ' of other'hi'"BAD"'lo'things to happen!'
  say ' If the PRODVM processor has changed, contact' ,
      'the z/VM sysprogs to update "SYSTEM CONFIG".'
  say
  say ' DO NOT simply reply "2" unless the z/VM' ,
      'sysprogs have instructed you to do so!'
  say
  say 'Replying:'hi||3||lo,
      'will initiate automated D.R. procedures'
  say ' which will write two files to the A-disk'
  say ' preventing AUTOLOG2 from autologging IDs;'
  say ' a tape will be requested from which'
  say ' SDFs (System Data Files) will be restored'
  say ' to spool.'
  say ' This is used only at a D.R. vendor site'
  say ' AWAY from home.'
  say
  say 'Replying:'hi||'QUIT'||lo
  say ' will issue a "CP SYSTEM RESET" on this userid'
  say ' giving the VM Systems Programming staff a chance'
  say ' to change files as needed. This is recommended'
  say ' ONLY for the VM Systems Programming staff!'
  parse upper pull response 2 .   /* Keep only the first character */
  If abbrev('QUIT',response,1) then Do
    say center(hi'Response accepted:' response,79)
    say center(' Issuing CP SYSTEM RESET to permit' ,
               'file updates.',79)
    say lo
    'CP SLEEP 2 SEC'
    'CP SYSTEM RESET'
    Exit 'How did we *ever* get **here**!!??'
  End
  If response=1 then Do
    say center(hi'Response accepted:' response,79)
    say time() 'Making changes for Business Resumption at' ,
        'the PRODUCTION DATA CENTER.'
    'CP SLEEP 2 SEC'
    Call RecoveryAtHome           /* For YOU to write, a S.M.O.P. */
    say
    say time() 'VM:Operator should start in a few moments...'
    'CP SLEEP 5 SEC'
    'CP SPOOL CONSOLE CLOSE'
    Call Exit 0
  End
  If response=2 then Do
    say center(hi'Response accepted:' response,79)
    say center(' Bypassing Disaster Recovery conversions.'lo,79)
    say center(' You *better* know what you are doing!'lo,79)
    'CP SLEEP 2 SEC'
    Call Exit 0
  End
  If response=3 then Leave
  say 'Read the message and respond 1, 2, 3, or QUIT.'
End
---<snip>---

Obviously, there's more that can/should be done. The system is your oyster.

The key is that the matching "System_ID" (which was selected at IPL from the "SYSTEM CONFIG" file based on the CPU serial number) is displayed in the bottom right-hand corner of a 3270 terminal display, and is also returned from a 'CP QUERY USERID'. E.g.
---<snip>---
parse value diag(08,'QUERY USERID') with ,
      self . ConfigSysID . '15'x .
---<snip>---

On the other hand, the CMS command "IDENTIFY" returns the value from the "SYSTEM NETID S" file. e.g.
---<snip>---
identify
OPERATOR AT PRODVM VIA RSCS 02/02/10 16:10:11 CST TUESDAY
Ready;
---<snip>---

During our D.R. tests, our "PRODVM" (the name has been changed for these examples) comes up on a different box in another of our data centers. Our "NODAL CONFIG Y" mentioned above gives us a single place to enter real device addresses based on serial number. No CP Directory changes required. No hunting through exits of myriad products. It contains (in part):
---<snip>---
* This file is read by various REXX and GCS programs for use in
* normal operations and in disaster recovery both in Lincolnshire and
* at a disaster recovery vendor.
*SysCfgID Svm_Name Dtyp Rdev Comment
* CPC4 LPAR2 as of 20090214
PRODVM    VTAM     CTCA 0D61 'CTC 0D61 to SYSE, Read1'
PRODVM    VTAM     CTCA 1D61 'CTC 1D61 to SYSE, Read2'
PRODVM    VTAM     CTCA 0D62 'CTC 0D62 to SYSE, Write1'
PRODVM    VTAM     CTCA 1D62 'CTC 1D62 to SYSE, Write2'
PRODVM    TCPIP    OSA  0140 0140
PRODVM    TCPIP    OSA  0141 0141
PRODVM    TCPIP    OSA  0142 0142
* CPC4 LPAR2 as of Feb 2008?
PRODVM    VTAM     CTCA 0D71 'CTC 0D71 to SYSF, Read1'
PRODVM    VTAM     CTCA 1D71 'CTC 1D71 to SYSF, Read2'
PRODVM    VTAM     CTCA 0D72 'CTC 0D72 to SYSF, Write1'
PRODVM    VTAM     CTCA 1D72 'CTC 1D72 to SYSF, Write2'
PRODVM    TCPIP    OSA  0140 0140
PRODVM    TCPIP    OSA  0141 0141
PRODVM    TCPIP    OSA  0142 0142
* CPC4 LPAR3
TESTVM    VTAM     CTCA 0C50 'CTC 0C50 to SYSE, Read1'
TESTVM    VTAM     CTCA 0C51 'CTC 0C51 to SYSE Write1'
* CPC5 LPAR3
RECOVERY  VTAM     CTCA 0D61 'CTC 0D61 to SYSE for SNA terminals, Read1'
RECOVERY  VTAM     CTCA 1D61 'CTC 1D61 to SYSE for SNA terminals, Read2'
RECOVERY  VTAM     CTCA 0D62 'CTC 0D62 to SYSE for SNA terminals, Write1'
RECOVERY  VTAM     CTCA 1D62 'CTC 1D62 to SYSE for SNA terminals, Write2'
RECOVERY  TCPIP    OSA  6051 0140
RECOVERY  TCPIP    OSA  6052 0141
RECOVERY  TCPIP    OSA  6053 0142
---<snip>---

When VTAM and TCPIP come up, local code in them reads the "NODAL CONFIG Y" file, making appropriate adjustments.
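
The local code that reads "NODAL CONFIG Y" is not shown in this thread, so the following is only a rough model of the lookup, written in Python rather than REXX so the logic can be tried anywhere. It assumes the five-column layout from the file's header line (SysCfgID, Svm_Name, Dtyp, Rdev, then free text), treats any line starting with "*" as a comment, and uses a cut-down sample of the records above. The names `parse_nodal_config` and `devices_for` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class NodalRecord(NamedTuple):
    sys_cfg_id: str  # System_ID name chosen at IPL, e.g. PRODVM or RECOVERY
    svm_name: str    # service machine the record applies to
    dtyp: str        # device type column
    rdev: str        # real device address column
    comment: str     # everything after Rdev, kept as free text


# Cut-down sample in the layout shown above (assumed, not the full file).
SAMPLE = """\
*SysCfgID Svm_Name Dtyp Rdev Comment
PRODVM    VTAM     CTCA 0D61 'CTC 0D61 to SYSE, Read1'
PRODVM    TCPIP    OSA  0140 0140
* CPC5 LPAR3
RECOVERY  VTAM     CTCA 0D61 'CTC 0D61 to SYSE for SNA terminals, Read1'
RECOVERY  TCPIP    OSA  6051 0140
RECOVERY  TCPIP    OSA  6052 0141
"""


def parse_nodal_config(text: str) -> List[NodalRecord]:
    """Turn the file text into records, skipping blank and comment lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):
            continue
        fields = line.split(None, 4)  # at most five whitespace-separated fields
        if len(fields) < 4:
            continue  # malformed record: ignore it in this sketch
        comment = fields[4].strip("'") if len(fields) == 5 else ""
        records.append(NodalRecord(fields[0], fields[1], fields[2], fields[3], comment))
    return records


def devices_for(records: List[NodalRecord], sys_cfg_id: str,
                svm_name: str) -> List[Tuple[str, str, str]]:
    """Return (dtyp, rdev, comment) for one system name and service machine."""
    return [(r.dtyp, r.rdev, r.comment) for r in records
            if r.sys_cfg_id == sys_cfg_id and r.svm_name == svm_name]


if __name__ == "__main__":
    recs = parse_nodal_config(SAMPLE)
    for dtyp, rdev, comment in devices_for(recs, "RECOVERY", "TCPIP"):
        print(dtyp, rdev, comment)
```

The point of the design is that the system name returned by 'QUERY USERID' is the only key: a service machine asks for its own records under that name and never needs to know which physical box it is on.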

Some service machines have been updated to check the "ConfigSysID", making appropriate changes to their start-up or operating processes. For example, we don't want VMSCHED kicking off during D.R., trying to do things that the users and we would regret, so its 'PROFILE EXEC' issues:
---<snip>---
/* ------------------------------------------------------------- */
/* Especially at RECOVERY until we are ready for production      */
/* servers and work - stop INITIATE from happening.              */
/* ------------------------------------------------------------- */
parse value diag(08,'QUERY USERID') with . . ConfigSysId . '15'x .
?prodvm=(ConfigSysId='PRODVM')
If \?prodvm then
   'PIPE (NAME DoNotInitiate)' ,
      '| STRLITERAL /INITIATE OFF ALL/' ,
      '| PAD 80' ,
      '| > VMSCHED INITIATE A F 80'
---<snip>---

We've been asked during testing to not _write_ to the Virtual Tape Servers so that time required to restore normal production is reduced. Changes made to VM:Tape exits permit only READ mounts during D.R. tests (identified by the "RECOVERY" System_ID).

Our z/VM system is generally up and running within 30 minutes of being given the recovery machine upon which to IPL the 'PRODVM' system. Consider also: we are using remotely dual-copied *and* remotely mirrored DASD - which does not need to be restored. Our users have been very good about testing, usually completing their test within an hour. From IPL, through program product CPUID record updates and user testing acceptance, to SHUTDOWN is often about 2 hours. The z/OS folks here, with a MUCH MORE complicated sysplex environment including many, many CICS regions, DB2's and lots more, are generally here for 14+ hours (through cutover back to prod).

I've crafted most of these auto-recovery solutions over my 25 years at Hewitt Associates. "Continuous Quality Improvements" is more than just a slogan - it lets me sleep through much of our D.R. test time. What used to be D.R.
tests offsite for 48+ hours of non-stop sysprog work without any sleep has become a veritable cake walk (sometimes they do bring in cake, along with better foodstuffs, for the tired, huddled masses!). Being able to test on a different machine in our other data center, using remote-copy DASD (no restores required), is a huge part of that. But automating the recovery to be able to come up with minimal sysprog intervention was also very significant.

Critically: the "ConfigSysID" automation using the "SYSTEM CONFIG" System_IDs provides the company with the means to bring the system up even if the disaster includes the VM sysprogs. Good news for them, perhaps not so good for us. ;-)

Mike Walter
Hewitt Associates
The opinions expressed herein are mine alone, not my employer's.