Back in the day, we worked on RAS (reliability, availability, serviceability). So we put in error-detection hardware (sometimes that was "firmware", or macrocode), and IBM and all our competitors were doing the same. The idea was to have redundant power supplies so that a CE could do maintenance without taking down the system and, where possible, redundant channel paths to a device controller so that you could pull a channel cable and replace it.

Today, with IBM, you can add or remove CPUs while the machine is running. But, at least with the z15s, you could not add RAM without taking the system down, as in powering it down.

So that would be a RAS hit, or cause you to miss your 99.999% availability target.
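
To put a number on that target, here is a back-of-the-envelope sketch in Python. It assumes availability is measured purely as uptime over a calendar year of 365.25 days; the function name is mine, for illustration only, and your SLA may define the measurement period differently.

```python
# Rough downtime budget for an availability target.
# Assumption: the target is measured as uptime over a 365.25-day year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_budget_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Return the minutes of outage allowed in the period at this target."""
    return period_minutes * (1 - availability_pct / 100.0)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% allows about {budget:.2f} minutes of outage per year")
```

At five nines the budget works out to a little over five minutes a year, so any planned power-down longer than that misses the target on its own.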

For people who do hardware and, to some degree, software (O/S stuff), you do all you can to recover from any problem. I like VM and its ability to see that it is injured and IPL itself. But, to keep those SLAs, there is SSI: an LPAR can move its workload to another LPAR (the pairs are determined in advance) and keep that work running. We did this at a large health insurer so that we could do VM upgrades with no outages.

So how you measure that uptime depends on the equipment and its ability to do hot swap and the like, so that you do not take an outage.

What happens if a Wintel server running MQ buys the farm? The in-flight transactions going through that server may time out and have to be re-driven. Is this considered an outage? Not if you have a second one handling the load and it takes over. But that one user, or 10(?) users, may see an error message. Does that count as an outage if the user only loses a few seconds in getting an answer? Or a pharmacy getting info? Or an OR getting info on drug interactions?
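
One way to frame the question is to compute the number two ways. The sketch below, in Python, is only an illustration: the function names and all of the figures (a 30-day month, one million requests, 10 failed ones) are made up, not measurements from any real system.

```python
# Two ways to score the same failover event.
# All numbers below are hypothetical, chosen only to illustrate the point.

def time_based_availability(period_seconds, outage_seconds):
    """Percent of the period during which the service as a whole was up."""
    return 100.0 * (period_seconds - outage_seconds) / period_seconds


def request_based_availability(total_requests, failed_requests):
    """Percent of requests answered without the user seeing an error."""
    return 100.0 * (total_requests - failed_requests) / total_requests


if __name__ == "__main__":
    month_seconds = 30 * 24 * 60 * 60
    # The second server took over, so the service itself was never down.
    print(f"time-based:    {time_based_availability(month_seconds, 0):.4f}%")
    # But 10 in-flight transactions out of 1,000,000 timed out.
    print(f"request-based: {request_based_availability(1_000_000, 10):.4f}%")
```

By the clock the service never went down, while by the request count those 10 users got an error. Which of the two figures goes against the SLA is exactly the question.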

Need some perspective.

Steve Thompson

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
