john.mck...@healthmarkets.com (McKown, John) writes:
> X64 hardware, as much as it has improved, is still not as reliable or
> have the I/O capacity of the z hardware. E.g.: We had a TCM fail
> once. A spare picked up the work, automatically restarting the
> instruction stream, with no outage of any sort and no software
> involvement. X86, from what I'm told, would at least require the OS to
> do the equivalent of a checkpoint restart. Also had an OSA fail. The
> other OSA did an ARP takeover and no IP sessions were lost. TCPIP was
> informed, but all it did was put out a message and not start any new
> sessions on the failing OSA. Our "open" people called me a liar when I
> told them that.
big cloud operators run hundreds of thousands of blades in megadatacenters with lots of failure/recovery infrastructure to handle individual blade failures (along with lots of redundant power, telco, and provisioning). Gray had been studying the mix of failure issues (both at IBM and later at Tandem) and by '84 published a report that hardware failures had become a minority of failure modes (hardware reliability had increased so much that other kinds of failures were starting to dominate). scan of '84 presentation
http://www.garlic.com/~lynn/grayft84.pdf

several of the big cloud operators have published detailed studies of different component availability as part of building their own blades ... giving optimal service life per dollar. Clusters & take-over were increasingly being used to mask all kinds of outages ... even able to handle geographically distributed operation and disasters taking out a whole datacenter.

when we were doing ha/cmp ... we did a lot of failure mode study ... and part of our marketing was against hardware fault-tolerant systems. We showed that availability of ha/cmp clusters was higher than that of the fault-tolerant systems. In a competitive situation involving a 1-800 number server (i.e. maps a 1-800 number to a "real" number) ... it required five-nines availability (99.999%, i.e. roughly five minutes of downtime a year). The hardware fault-tolerant system still required a scheduled system outage to do software upgrades ... which would blow several decades of downtime budget. With cluster operation, we showed at least as good hardware availability (with redundant systems) along with the capability of doing rolling software upgrades with no system outage.

the hardware fault-tolerant vendor eventually came back with the suggestion that they could come out with redundant, cluster system operation ... to handle the software upgrade issue. However, given the reliability of the underlying hardware operating in redundant, cluster system mode ... there was no longer any justification for hardware fault tolerance.

part of ha/cmp was ip-address take-over ...
which according to all the standards should time-out mac/ip-address entries in arp caches. at the time, most vendors were using the BSD 4.3 tahoe or reno software for their tcp/ip stacks. In 1989, we found a bug in the BSD4.3 tahoe/reno IP/ARP lookup software. The ARP cache management was correctly timing out the ARP cache entries (so if there was an ip-address take-over, it would discover the new MAC mapping). However, the BSD4.3 IP code had a performance optimization where it saved the last ip/mac lookup result ... which would only get reset if the client communicated with a different ip-address (otherwise that saved ip/mac mapping would persist forever).

Since the "bug" existed in nearly every vendor's implementation (all using the same BSD4.3 tahoe/reno software), we had to come up with a work-around for the saved ip/mac bug. Any time there was an ip-address take-over, we would quickly saturate the local LAN with dummy traffic from a different ip-address ... forcing all the machines on that LAN to perform a real ARP cache lookup (resetting the "saved" value). Then the next activity for the taken-over ip-address would force the clients to do a real ARP cache lookup.

misc. past posts mentioning ha/cmp
http://www.garlic.com/~lynn/subtopic.html#hacmp

--
virtualization experience starting Jan1968, online at home since Mar1970

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
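The five-nines downtime-budget argument above can be checked with back-of-the-envelope arithmetic; a minimal sketch (the 60-minute outage duration is a hypothetical illustration, not a figure from the original discussion):

```python
# "Five nines" = 99.999% availability. The allowed downtime per year is
# so small that a single scheduled outage for a software upgrade can
# consume many years' worth of downtime budget.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

budget = downtime_budget_minutes(0.99999)   # five nines
print(f"five-nines budget: {budget:.2f} minutes/year")   # ~5.26 min/year

# a hypothetical one-hour scheduled outage for a software upgrade:
outage_minutes = 60
years_of_budget = outage_minutes / budget
print(f"a {outage_minutes}-minute outage burns {years_of_budget:.1f} years of budget")
```

Even a single one-hour outage eats more than a decade of five-nines budget, which is why rolling upgrades with no system outage were decisive.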
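The saved ip/mac lookup bug and the dummy-traffic work-around described above can be sketched as a small simulation (this is an illustrative model, not the actual BSD 4.3 tahoe/reno code; the Client class, addresses, and MAC names are all hypothetical):

```python
# Simulation of the BSD 4.3 "saved last lookup" optimization: the ARP
# cache itself times out correctly, but a one-entry saved result bypasses
# it, so a client talking only to the taken-over ip-address keeps using
# the stale MAC. Traffic from a *different* ip-address resets the saved
# entry, forcing a real ARP cache lookup -- the ha/cmp work-around.

class Client:
    def __init__(self):
        self.arp_cache = {}   # ip -> mac; entries time out / get rewritten
        self.saved = None     # (ip, mac) last-lookup performance optimization

    def lookup(self, ip):
        # the optimization: reuse the saved result when the destination
        # ip is unchanged, bypassing the ARP cache entirely
        if self.saved and self.saved[0] == ip:
            return self.saved[1]
        mac = self.arp_cache[ip]   # real ARP cache lookup
        self.saved = (ip, mac)
        return mac

client = Client()
client.arp_cache["10.0.0.1"] = "MAC-A"
assert client.lookup("10.0.0.1") == "MAC-A"

# ip-address take-over: the ARP cache entry is updated, but the saved
# result still points at the failed adapter's MAC
client.arp_cache["10.0.0.1"] = "MAC-B"
assert client.lookup("10.0.0.1") == "MAC-A"   # stale answer -- the bug

# work-around: dummy traffic from a different ip-address resets the
# saved value, so the next lookup hits the real ARP cache
client.arp_cache["10.0.0.99"] = "MAC-DUMMY"
client.lookup("10.0.0.99")
assert client.lookup("10.0.0.1") == "MAC-B"   # now correct
```

Saturating the LAN with dummy traffic forced every client's saved entry to be reset at once, without needing any fix to the vendors' stacks.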