john.archie.mck...@gmail.com (John McKown) writes:
> Yes, we have had a TCM fail. I was almost called a liar when I told the
> Windows people that the z simply switched the work transparently (at the
> hardware level) to another CP. They were shocked and amazed that we could
> "hot swap" a new TCM into the box without any outage. The same thing when
> an OSA failed. The other OSA simply did an "ARP rollover" and there were
> not any outages. Then, again, IBM replaced the OSA "hot" and we simply
> started using it. All automatically. But the Windows people still chant
> "Windows is BETTER than the mainframe."

I was keynote speaker at NASA dependable computing workshop (along with
Jim Gray, who I worked with at IBM SJR, but he had gone on to Tandem,
Dec, and then Microsoft) ... reference gone 404 but lives on at wayback
machine
http://web.archive.org/web/20011004023230/http://www.hdcc.cs.cmu.edu/may01/index.html

and told mainframe story

I had done this software support for channel extender ... allowing local
controllers & devices to operate at the end of some telco link. For
various reasons, I had chosen to simulate "channel check" when various
telco errors occurred ... in order to kick off various operating system
recovery/retry routines.

along came the 3090 ... which was designed to have something like 3-5
channel check errors per annum (not per annum per machine ... but per
annum across all machines).

After the 3090 had been out a year ... R-something? was reporting that
there had been an aggregate of something like 15-20 channel check errors
in the first year across all machines ... which launched a detailed
audit of what had gone wrong.

They finally found me ... and after a little additional
investigation, I decided that for all intents and purposes, simulating
an IFCC (interface control check) instead of a CC (channel check) would
do just as well from the standpoint of the error retry/recovery
procedures activated.

... snip ...

The majority of the audience didn't even understand that errors & faults
were being recorded, tracked, collected, trended, etc.

I had done the support in 1980 for STL, which was bursting at the seams
and was moving 300 people from the IMS group to an offsite bldg, with
dataprocessing back at STL. They had tried remote 3270, but found the
human factors totally unacceptable. Channel-extender support allowed
local channel-attached controllers at the offsite bldg ... and the human
factors were the same offsite as local in STL. Actually, the STL
mainframes supporting the offsite bldg ran faster ... turns out 3270
controllers had lots of excessive channel busy ... the channel-extender
significantly reduced that 3270 controller channel busy ... moving it
all to the interface at the offsite bldg.

The hardware vendor had tried to get IBM to release my software, but
there was a group in POK that was playing with some serial stuff and got
it vetoed (they were afraid that if it was in the market, it would make
it harder to get their stuff released). The vendor then had to (exactly)
duplicate my support from scratch (including reflecting CC on errors).
I then got them to change their implementation from CC to IFCC.

Trivia: in 1988, I was asked to help LLNL standardize some stuff they
were playing with ... which quickly becomes the fibre channel standard
(including some stuff I had done in 1980).

The POK people finally get their stuff released in 1990 with ES/9000 as
ESCON, when it was already obsolete.

Later POK people become involved in fibre channel standard and define a
heavy-weight protocol that radically reduces the native throughput,
which eventually ships as FICON.

Our last product at IBM was HA/CMP, and after leaving IBM we were
brought into the financial institution that had implemented the original
magstripe merchant/gift cards ... on a SUN 2-way "HA" platform. Turns
out SUN had implemented/copied my HA/CMP design ... even copying my
marketing pitches. The system had a failure and "fell over" and continued
working with no outage. SUN replaced the failed component, but the CE
forgot to update the configuration with the identifier for the new
component ... so it wasn't actually being used. Three months later, when
they had a 2nd failure, they found that parts of the DBMS records weren't
actually being written/replicated (rather than "no single point of
failure", there were three problems: the original failure, the failure
to update the configuration info, and the 2nd failure).

earlier HA/CMP reference/post in this thread
http://www.garlic.com/~lynn/2019c.html#11 

-- 
virtualization experience starting Jan1968, online at home since Mar1970

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
