I had a boss on the government side with a habit of pulling random tapes during 
a DR test and saying something like "This tape is unreadable. Proceed with the 
test." He would also point at random people and say "You're dead"; the 
remaining crew had to proceed without that person's involvement. Nasty, but if 
we couldn't handle it during a drill then we weren't prepared for a real 
disaster, so I absolutely approved. The recovery plan should cover all 
contingencies.

IMHO at least one set of backups should be far enough away that a major 
disaster, e.g., an earthquake, doesn't take out both the primary site and the 
backup vault. The same separation principle applies to hot backup and hot 
vaulting sites.
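
A toy illustration of that separation check (a minimal sketch; the coordinates 
and the 200 km "same disaster" radius below are made-up assumptions, not 
anything from this incident):

    from math import radians, sin, cos, asin, sqrt

    def km_between(lat1, lon1, lat2, lon2):
        # Great-circle distance in km via the haversine formula.
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * asin(sqrt(a))

    # Hypothetical sites; 200 km is an arbitrary "outside the same quake" radius.
    PRIMARY = (38.35, -81.63)  # roughly Charleston, W.Va.
    VAULT = (40.44, -79.99)    # roughly Pittsburgh, Pa.

    assert km_between(*PRIMARY, *VAULT) > 200, "vault is inside the disaster radius"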


--
Shmuel (Seymour J.) Metz
http://mason.gmu.edu/~smetz3

________________________________________
From: IBM Mainframe Discussion List [[email protected]] on behalf of 
Timothy Sipples [[email protected]]
Sent: Wednesday, July 27, 2022 1:12 AM
To: [email protected]
Subject: Re: Mainframe outage affecting W.Va. state agencies could take 48, 72 
hours to resolve

I have absolutely no information about this incident other than what the media 
are reporting. I wish everyone involved the best success.

My *personal* curiosity revolves around the Disaster Recovery plan and 
resources. As I'm sure we all know, the standard/typical operational practice 
is to have an alternate site, separated by some distance and equipped with 
standby resources. Disk subsystems replicate between sites (primary to 
alternate), either synchronously or asynchronously. Or at least there'd be a 
remote tape library, preferably virtual to some degree (for performance 
reasons), ideally with multiple incremental backups per day. If the primary 
site is lost, for whatever reason(s), the IT operations team restores at least 
critical services from the alternate site. It might be a long RTO (24 hours, 
for example) if it's a basic/entry DR arrangement, but it'd be something.
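
To put rough numbers on the RPO side of those arrangements, here's a 
back-of-the-envelope sketch; the intervals are illustrative assumptions, not 
figures from any real installation:

    # Worst-case RPO: data written just after one backup begins can be lost
    # until the next copy completes and reaches the alternate site.
    def worst_case_rpo_hours(backup_interval_hours, offsite_transfer_hours=0.0):
        return backup_interval_hours + offsite_transfer_hours

    print(worst_case_rpo_hours(0.0))        # synchronous replication: ~0 hours
    print(worst_case_rpo_hours(5 / 3600))   # async replication, ~5 s lag: seconds
    print(worst_case_rpo_hours(6.0))        # virtual tape, incrementals every 6 h
    print(worst_case_rpo_hours(24.0, 12.0)) # daily tapes + 12 h courier: up to 36 h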

Over many years I've only ever worked with two clients that had no real DR plan 
and essentially no DR resources when I first met them. As it happens they were 
both government agencies, but they were also both located in relatively poor 
developing countries. One client took frequent tape backups and shuttled 
physical tapes off-site so at least they'd be able to recover to some point, 
eventually. (RTO="a week or two," RPO=12+ hours probably.) I wasn't happy they 
had to operate that way, but their constraints were genuine. I worked with the 
other government agency to eliminate their exposure within a tight budget, and 
they now have an alternate site with a reasonable DR capability.

I also remember working with another customer in a developing country, a bank. 
They were upgrading their systems, and their original plan involved losing DR 
protection for a couple of days (about 48 hours), as I recall. That plan 
troubled me, so I worked with them to create a better, safer plan that 
preserved DR coverage throughout the upgrade project. They chose the revised 
plan. They completed their upgrade project on time, within budget, and without 
incident.

So what happened to the alternate site (and DR switchover to it)?

— — — — —
Timothy Sipples
Senior Architect
Digital Assets, Industry Solutions, and Cybersecurity
IBM zSystems/LinuxONE, Asia-Pacific
[email protected]


----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
