Re: VARY too many devices offline

Kenny Fogarty Tue, 23 Oct 2007 00:54:43 -0700

>
> How would you handle a person that scrutinizes blood for a living and
> mistakes a diagnosis ?
> In some case an operator is just as "guilty" as the blood analyzer.
> If you say thats not the same, I would agree but not in all


I wouldn't even begin to make the analogy. But mistakes happen.
That's a fact of life, and putting an unbearable amount of strain on
someone, as in "make a mistake and you're fired", will not, under any
circumstances, help that person to not make mistakes. In fact, I'd go
as far as to say it would only make things worse.

> If an operator put in a wrong date at IPL and (because
> of that) RACF refuses to come up and there is no backout or even
> worse datasets get scratched because of the operator error which
> leads to a fine from say the SEC (or take your pick of agency).

See all of those issues? All perfectly valid. But, if I were having to
unravel the mess that came about from the wrong input at the console,
the operator would not be the person who should be blamed. There
should be contingency in place so that if RACF refuses to come up, we
get alerted very early on as to why, and have steps in place to remedy
the situation. Perhaps by re-IPL'ing. After all, that's what you're
going to do in 99% of cases if a wrong parameter is passed at IPL
time.
If datasets get scratched, where's the backup? What's the contingency
in place to restore the data? If there isn't one, that's not the fault
of the guy who entered 'U' on the console instead of 'N'.

>  There are degrees of error of course some are who cares to a possible
> company going bankrupt there are in the last case MANY people being
> out of work (possibly 1000's or more) would you not fire the person?

If the company went bankrupt, it wouldn't be because someone varied
off the wrong device.

> I think you are comparing apples and oranges. An operator can by mistake put
> the company out of business, a programmer can cause lost revenue and yes
> possibly a fine.

I'd love to see how the wrong reply on the console was traced back to
the one thing that put the company out of business. Seriously, if
anyone has any stories along those lines, I'd love to hear them. As
would any maker of automation software, because it would be the most
amazing sales pitch ever.

>  BUT that should have been found in
> QA before the program goes live. In other words their work is checked
> by others.

QA can pick up a lot of things, but, for example, can QA pick up an
application program that performs ten million inserts into a DB2
table without a commit, then, for whatever reason, abends, leaving DB2
to roll back all its work and rendering the objects unavailable for x
hours? I've seen it done. It didn't make the company go bankrupt,
though.
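
To make that example concrete, here is a minimal sketch of the usual
mitigation: commit at intervals, so that an abend only rolls back the
work done since the last commit. This is not DB2 code. It uses
Python's built-in sqlite3 module as a stand-in, and the table name
`demo`, the function `insert_in_batches` and the interval of 1000 rows
are all made up for illustration.

```python
import sqlite3


def insert_in_batches(conn, rows, commit_every=1000):
    """Insert rows, committing after every `commit_every` rows.

    Returns the number of commits issued. `commit_every` is an
    illustrative value, not a recommendation for any real workload.
    """
    cur = conn.cursor()
    commits = 0
    pending = 0
    for row in rows:
        cur.execute("INSERT INTO demo (id, payload) VALUES (?, ?)", row)
        pending += 1
        if pending >= commit_every:
            conn.commit()  # work up to this point can no longer be rolled back
            commits += 1
            pending = 0
    if pending:
        conn.commit()  # commit the final partial batch
        commits += 1
    return commits


# Stand-in database: an in-memory SQLite table, not a DB2 object.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, payload TEXT)")
conn.commit()

n_commits = insert_in_batches(conn, ((i, "x") for i in range(2500)))
row_count = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
```

With a commit interval, a failure partway through costs at most one
batch of rollback rather than the whole run, provided the program is
written so that it can restart from the last committed row.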

>  An operator does not have this luxury. Yes programmers can
> make mistakes but (in most cases) its not a shut the front doors and
> turn off the power whoever is the last one to leave. An operator can
> do so with a small "oops". That is why an operator, IMO must go
> through several years of training so they CAN'T make stupid mistakes.

I agree that console commands are free from any sort of QA, however,
there are ways and means to ensure that mistakes are minimised.
Automation products can help here, or, if they're not available, an
application program can write out WTO or WTOR messages with meaningful
text, which can also help an operator make a decision.

Training does not, and never will, ensure that mistakes are never made.
Training educates, and helps people understand better, but it never,
ever eradicates mistakes from any process.

> Its possible that a programmer could write a program that
> misdiagnoses a test (health) result and yes that could lead to the
> persons death, but presumably there are other fingers in the stew to
> catch the errors.

I agree with that, and, broadly, that's the point I was trying to
make. There should be enough tech support/ops support/sys progs around
to see what went wrong, and implement some sort of contingency to
rectify the mistake with the minimum of outage/cost to the company, be
that restoring data, re-IPLing a system, or whatever.

> In the case of an operator there is no way to catch all errors that could 
> cause a major issue.

There are ways to catch all operator entries from the console via
various automation products which can interrogate what has been
entered, and take appropriate measures.
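
As a rough illustration of the kind of rule I mean, the sketch below
inspects a command before it is executed and holds any VARY that would
take more than a set number of devices offline. It is plain Python,
not the rule language of any real automation product. The names
`review_command` and `count_devices`, the limit of 16 devices and the
simplified VARY syntax it understands (hex device numbers, ranges and
parenthesised lists) are all assumptions made for the example.

```python
import re

MAX_DEVICES_OFFLINE = 16  # hypothetical site limit, chosen for the example

# Simplified pattern: V or VARY, a device spec, then ,OFFLINE.
VARY_OFFLINE = re.compile(
    r"^\s*(?:V|VARY)\s+(\(.*?\)|[^,\s]+)\s*,\s*OFFLINE\b",
    re.IGNORECASE,
)


def count_devices(devspec):
    """Count the devices named by a simplified device spec.

    Accepts single hex device numbers, low-high ranges and
    comma-separated lists in parentheses. Raises ValueError on
    anything else.
    """
    total = 0
    for part in devspec.strip("()").split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(p.strip().lstrip("/"), 16) for p in part.split("-", 1))
            if high < low:
                raise ValueError("range is reversed: " + part)
            total += high - low + 1
        else:
            int(part.lstrip("/"), 16)  # validates that it is a hex number
            total += 1
    return total


def review_command(command, limit=MAX_DEVICES_OFFLINE):
    """Return "hold" if the command varies too many devices offline, else "pass"."""
    match = VARY_OFFLINE.match(command)
    if not match:
        return "pass"  # not a VARY ...,OFFLINE in the form this sketch knows
    try:
        devices = count_devices(match.group(1))
    except ValueError:
        return "pass"  # not a plain device spec; outside the scope of this sketch
    return "hold" if devices > limit else "pass"
```

A held command would then go to whatever comes next at your site: a
confirmation prompt, a second pair of eyes, or an outright rejection.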

> Catching a Vary is a small part of any possible error. Catching a bad date at
> lets say early on in the IPL process is impossible by any of the suggestions
> mentioned as the exits (programs) are not available then.

I agree, but, if the wrong date, or IPL parm, or whatever is entered,
then the chances are you're going to have to re-IPL to rectify the
situation. As you said above, if RACF doesn't start, you can go back
to see why, and take steps to fix the issue.

There must always be contingency plans in place to catch human errors,
but, to go back to the original point, sacking someone for entering
the wrong reply is not, and never has been, the answer. It reads (to me
at least) pretty much as "They (operators) are easily replaced, sack
him".

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html