john.mck...@healthmarkets.com (McKown, John) writes:
> X64 hardware, as much as it has improved, is still not as reliable or
> have the I/O capacity of the z hardware. E.g.: We had a TCM fail
> once. A spare picked up the work, automatically restarting the
> instruction stream, with no outage of any sort and no software
> involvement. X86, from what I'm told, would at least require the OS to
> do the equivalent of a checkpoint restart. Also had an OSA fail. The
> other OSA did an ARP takeover and no IP sessions were lost. TCPIP was
> informed, but all it did was put out a message and not start any new
> sessions on the failing OSA. Our "open" people called me a liar when I
> told them that.

big cloud operators run hundreds of thousands of blades in
megadatacenters with lots of failure/recovery infrastructure to handle
individual blade failures (usually along with redundant power, telco,
and provisioning).

Gray had been studying the mix of failure modes (both at IBM and later
at Tandem) and by '84 published a report that hardware failures had
become a minority of failure modes (hardware reliability had increased
so much that other kinds of failures were starting to dominate). scan
of '84 presentation
http://www.garlic.com/~lynn/grayft84.pdf

several of the big cloud operators have published detailed studies of
different component availability as part of building their own blades
... aiming for optimal service life per dollar.

Cluster & take-over were increasingly being used to mask all kinds of
outages ... even able to handle geographic operations and handling
disasters taking out whole datacenter.

when we were doing ha/cmp ... we did a lot of failure mode study ... and
part of our marketing was against hardware fault-tolerant systems. We
showed that availability of ha/cmp clusters was higher than the
fault-tolerant systems. One competitive situation involved a 1-800
number server (i.e. maps a 1-800 number to a "real" number) that
required five-nines availability. The hardware fault-tolerant system
still required a scheduled system outage to do software upgrades
... which would blow several decades of downtime budget. With cluster
operation, we showed at least as good hardware availability (with
redundant systems) along with the capability of doing rolling software
upgrades with no system outage.
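as a back-of-the-envelope illustration (my numbers, not from the actual
bid): five-nines works out to roughly five minutes of allowed downtime
per year, so even a short scheduled outage for a software upgrade eats
many years of budget:

    # rough five-nines arithmetic (illustrative numbers only)
    minutes_per_year = 365.25 * 24 * 60          # ~525,960
    budget = minutes_per_year * 1e-5             # ~5.26 min/year allowed down

    upgrade_outage = 2 * 60                      # assume a 2-hour scheduled outage
    years_consumed = upgrade_outage / budget
    print(round(budget, 2), "min/yr downtime budget")
    print(round(years_consumed, 1), "years of budget consumed")  # ~22.8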

the hardware fault-tolerant vendor eventually came back with the
suggestion that they could come out with redundant, cluster system
operation ... to handle the software upgrade issue. However, given the
reliability of the underlying hardware operating in redundant, cluster
system mode ... there was no longer any justification for hardware
fault tolerance.

part of ha/cmp was ip-address take-over ... which, according to all the
standards, should work because mac/ip-address mappings time out of ARP
caches. at the time, most vendors were using BSD 4.3 tahoe or reno
software for their tcp/ip stacks. In 1989, we found a bug in the BSD4.3
tahoe/reno IP/ARP lookup software. The ARP cache management was
correctly timing out the ARP cache entries (so after an ip-address
take-over, it would discover the new MAC mapping). However, the BSD4.3
IP code had a performance optimization where it saved the last ip/mac
lookup result ... which would only get reset if the client communicated
with a different ip-address (otherwise that saved ip/mac mapping would
persist forever). Since the "bug" existed in nearly every vendor's
implementation (all using the same BSD4.3 tahoe/reno software), we had
to come up with a work-around for the saved ip/mac bug. Any time there
was an ip-address take-over, we would quickly saturate the local LAN
with dummy traffic from a different ip-address ... forcing all the
machines on that LAN to perform a real ARP cache lookup (resetting the
"saved" value). Then the next activity from the taken-over ip-address
would force the clients to do a real ARP cache lookup.
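
a rough sketch of the effect (a toy model of the saved-lookup
optimization, not the actual BSD code):

    # toy model of the BSD4.3 tahoe/reno saved-lookup behavior (illustrative only)
    import time

    ARP_TIMEOUT = 20 * 60            # arp cache entries do time out correctly

    arp_cache = {}                   # ip -> (mac, timestamp)
    last_lookup = None               # the "saved" ip/mac result -- never times out

    def resolve(ip):
        """return mac for ip, mimicking the saved-last-lookup optimization."""
        global last_lookup
        if last_lookup and last_lookup[0] == ip:
            return last_lookup[1]    # stale after an ip-address take-over
        entry = arp_cache.get(ip)
        if entry is None or time.time() - entry[1] > ARP_TIMEOUT:
            mac = arp_request(ip)    # real ARP broadcast (stub below)
            arp_cache[ip] = (mac, time.time())
            entry = arp_cache[ip]
        last_lookup = (ip, entry[0]) # only replaced when a *different* ip is used
        return entry[0]

    def arp_request(ip):
        # stand-in for an actual ARP broadcast on the LAN
        return "mac-for-" + ip

in this model, the dummy traffic from a different ip-address makes every
client call resolve() for that other address, replacing the saved value;
the next packet to the taken-over address then goes through the real
(correctly timed-out) ARP cache and picks up the new MAC.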

misc. past posts mentioning ha/cmp
http://www.garlic.com/~lynn/subtopic.html#hacmp

-- 
virtualization experience starting Jan1968, online at home since Mar1970

