Cost of an outage (was: The mainframe is alive)

Phil Smith III Sun, 02 Mar 2025 07:43:35 -0800

Re outages:
> Microsoft had a major outage today that banks, Walmart, insurance companies, 
> airlines, and other companies can't


Well...*we* would say "can't". *They* would say "can't". But the reality is 
that if those happen--and they do (cf. Delta's outage last July, for one)--the 
world and the business don't stop. It makes us SMH and leaves their management 
screaming at people, but they don't go out of business. Of that list of 
industries, the airlines are the most critical in terms of real-time 
lost revenue: if I can't complete my purchase at walmart.com, I might go to 
target.com, but also might just say "I'll try again later". Same with banks, 
insurers, and most other companies. An airline trip, by contrast, has a firm 
expiration date. 

But Delta is still in business, so what does "can't go down" even mean any 
more? Definitely not what it used to.

I'm convinced that some or all of this is because the industry has shifted to 
this "move fast and break things" mantra that even bleeds into areas where 
people say "No, we don't/can't/won't do that". E.g., the Delta outage was 
apparently caused by a CrowdStrike problem. Back in the day, would Delta have 
allowed a third-party tool to be used in such a critical way? I'd say "Probably 
not", or if they did, they would have insisted on being able to test any 
updates well enough to be sure that such an outage was impossible. Nowadays 
that just isn't practical, so it isn't done, and we see the result. I haz a sad.

None of this is the mainframe's fault, of course, which is why I moved this to 
a different topic.

Back in 1989, SABRE had a 12-hour outage that made the front pages. That was 
rare enough that I remember it almost four decades later. At the time, the 
quote was that it cost SABRE $20,000 per minute, a huge deal, ~$15M total.

I had to Google the Delta outage, and not just because I'm old--it's just not 
THAT remarkable any more. 

Delta is suing ClownStrike for $550M for the July follies, which is about 1/120 
of their annual revenue. This might actually be about right, since the outage 
lasted five days for them and a third of that figure appears to be actual 
costs, not just lost revenue. Hmm, doing the math--CPI is about 2.6x 1989-2024, 
and the outage was 10x as long as SABRE's: $15M*2.6*10=$390M; Delta claims the 
lost revenue portion is $380M. Amazingly close!
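
For anyone who wants to redo that back-of-the-envelope math, here is a rough 
sketch in Python. Every number in it is a figure quoted above, not something 
independently checked, and the constant and function names are made up for 
illustration. It assumes the 2.6x CPI factor and the "five days vs. 12 hours" 
duration ratio are good enough for this purpose.

```python
# Back-of-the-envelope check of the SABRE-to-Delta scaling.
# All inputs are the figures quoted in the post; names are illustrative only.

SABRE_COST_PER_MINUTE_USD = 20_000     # quoted 1989 figure
SABRE_OUTAGE_HOURS = 12                # quoted 1989 outage length
DELTA_OUTAGE_DAYS = 5                  # quoted 2024 outage length
CPI_FACTOR_1989_TO_2024 = 2.6          # approximate, as stated in the post
DELTA_CLAIMED_LOST_REVENUE_MUSD = 380  # lost-revenue portion of the claim


def sabre_cost_musd():
    """Unrounded 1989 SABRE cost in millions of dollars."""
    minutes = SABRE_OUTAGE_HOURS * 60
    return SABRE_COST_PER_MINUTE_USD * minutes / 1_000_000


def duration_ratio():
    """How many times longer the Delta outage ran than SABRE's."""
    return DELTA_OUTAGE_DAYS * 24 / SABRE_OUTAGE_HOURS


def scaled_estimate_musd(base_musd):
    """Scale a 1989 cost by inflation and by relative outage length."""
    return base_musd * CPI_FACTOR_1989_TO_2024 * duration_ratio()


if __name__ == "__main__":
    exact = sabre_cost_musd()
    print(f"SABRE 1989 cost: ${exact:.1f}M (rounded to ~$15M in the post)")
    print(f"Duration ratio: {duration_ratio():.0f}x")
    print(f"Scaled from $15M: ${scaled_estimate_musd(15):.0f}M")
    print(f"Scaled from ${exact:.1f}M: ${scaled_estimate_musd(exact):.1f}M")
    print(f"Delta's claimed lost revenue: ${DELTA_CLAIMED_LOST_REVENUE_MUSD}M")
```

Using the unrounded $14.4M instead of ~$15M gives a somewhat lower estimate, 
which lands on the other side of Delta's $380M and is still in the same 
ballpark.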

BTW, for those who might be wondering, I was told by someone at SABRE that the 
1989 outage was caused by a rogue TPF job that clipped thousands of volumes 
(see http://catless.ncl.ac.uk/Risks/8.74.html for something that hints at 
this), which predictably made MVS very unhappy. People were basically running 
around with their hair on fire. The VM guy, Mike Roegner, quietly went off and 
wrote a Rexx program to drive re-clipping the packs and the rest of the outage 
was just the time it took to run that program against all those volumes. I of 
course cannot verify this and Mike is long retired; perhaps someone else here 
remembers?

So...how much does "can't go down" even mean any more? Did anyone at Delta lose 
their job over this? Or was the blame just pushed to ClownStrike--convenient, 
if so. One wonders.

Ok, this turned into a bit of a ramble, but it's a topic that I often think 
about!

...phsiii

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
