I had a boss on the government side with a habit of pulling a random tape during a DR test and saying something like "This is unreadable. Proceed with the test." He also pointed at random people and said "You're dead"; the remaining crew had to proceed without them. Nasty, but if we couldn't handle it during a drill, then we weren't prepared for a real disaster; I absolutely approved. The recovery plan should cover all contingencies.
IMHO at least one set of backups should be far enough away that a major disaster, e.g., an earthquake, doesn't take out both the site and the backup vault. The same separation principle applies to hot backup and hot vaulting sites.

--
Shmuel (Seymour J.) Metz
http://mason.gmu.edu/~smetz3

________________________________________
From: IBM Mainframe Discussion List [[email protected]] on behalf of Timothy Sipples [[email protected]]
Sent: Wednesday, July 27, 2022 1:12 AM
To: [email protected]
Subject: Re: Mainframe outage affecting W.Va. state agencies could take 48, 72 hours to resolve

I have absolutely no information about this incident other than what the media are reporting. I wish everyone involved the best success.

My *personal* curiosity revolves around the Disaster Recovery plan and resources. As I'm sure we all know, the standard/typical operational practice is to have an alternate site, separated at some distance and equipped with standby resources. Disk subsystems replicate between sites (primary to alternate) either synchronously or asynchronously. Or at least there'd be a remote tape library, preferably virtual to some degree (for performance reasons), preferably with multiple incremental backups per day. If the primary site is lost, for whatever reason(s), the IT operations team restores at least the critical services from the alternate site. It might be a long RTO (24 hours, for example) if it's a basic/entry DR arrangement, but it'd be something.

Over many years I've only ever worked with two clients that had no real DR plan and essentially no DR resources when I first met them. As it happens they were both government agencies, and both were located in fairly poor developing countries. One client took frequent tape backups and shuttled the physical tapes off-site, so at least they'd be able to recover to some point, eventually. (RTO="a week or two," RPO=12+ hours, probably.)
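The parenthetical RPO estimate above is simple arithmetic: with backups taken roughly every 12 hours, the worst-case data loss window is one full backup interval, plus however long the tapes take to reach the off-site vault. A minimal sketch of that reasoning (function and parameter names are mine, purely for illustration):

```python
# Illustrative sketch only: worst-case recovery point objective (RPO)
# for a periodic tape-backup scheme. Names are hypothetical.

def worst_case_rpo_hours(backup_interval_hours: float,
                         offsite_lag_hours: float = 0.0) -> float:
    """Data written just after one backup is only protected once the
    *next* backup completes and reaches the vault, so the worst-case
    loss window is a full interval plus any off-site transport lag."""
    return backup_interval_hours + offsite_lag_hours

# Two backups per day, tapes already in the vault when disaster strikes:
print(worst_case_rpo_hours(12.0))        # 12.0 -- the "RPO=12+ hours" case
# If the vault shuttle itself adds another half day of exposure:
print(worst_case_rpo_hours(12.0, 12.0))  # 24.0
```

The "+" in "12+ hours" is doing real work: the interval sets the floor, and transport lag, failed tapes, or a backup in flight at the moment of the disaster all push the realistic number higher.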
I wasn't happy they had to operate that way, but their constraints were genuine. I worked with the other government agency to eliminate their exposure within a tight budget, and they now have an alternate site with a reasonable DR capability.

I also remember working with another customer in a developing country, a bank. They were upgrading their systems, and their original plan involved losing DR protection for a couple of days (about 48 hours), as I recall. That plan troubled me, so I worked with them to create a better, safer plan that preserved DR coverage throughout the upgrade project. They chose the revised plan and completed their upgrade on time, within budget, and without incident.

So what happened to the alternate site (and the DR switchover to it)?

— — — — —
Timothy Sipples
Senior Architect
Digital Assets, Industry Solutions, and Cybersecurity
IBM zSystems/LinuxONE, Asia-Pacific
[email protected]

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
----------------------------------------------------------------------
