Jeff McCune wrote:

Put yourself in the position of a beginner. You're overwhelmed,
your job is on the line, and everything's on fire. And you can
employ tools that, with *some* probability, put out the fires,
and with *another* finite probability, make them worse. What are
you going to do? Many people just avoid anything that can make
things worse.

True, but what if we made the tools easier to understand for beginners?
If we focused on understandable processes for package installation and
simple file distribution, it might reduce the risk while increasing
accessibility.

I think it's more important that the tool have some "impact knowledge".
The best approach would be to "protect" the beginner from "the worst
failure modes" while empowering him or her to do "safer" things
automagically.

   "Are you sure you want to change the network configuration of the
   target machine (this may render the target inaccessible...)?"
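A minimal sketch of what such "impact knowledge" might look like, assuming
a hypothetical table that maps action names to impact levels. The action
names, the IMPACT table and confirmation_prompt() are illustrative only and
do not come from any existing tool:

```python
# Hypothetical impact table: action name -> (level, consequence).
IMPACT = {
    "install_package": ("low", None),
    "distribute_file": ("low", None),
    "change_network_config": ("high", "this may render the target inaccessible"),
}


def confirmation_prompt(action):
    """Return a warning question for high-impact actions, else None."""
    # Unknown actions are treated as high impact, so the tool fails safe.
    level, consequence = IMPACT.get(action, ("high", "its impact is unknown"))
    if level != "high":
        return None
    return f"Are you sure you want to run '{action}' ({consequence}...)?"
```

The point of the design is that the beginner never has to know which
actions are dangerous; the tool carries that knowledge and only
interrupts for the risky ones.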

When *I* was learning to utilize automation, I made a change in network
configuration on all nodes that misconfigured my whole network so badly
that I had to do a walkaround to each machine in single-user mode to
recover! Fortunately, it was night in the summer and no one else was
a witness:) It was also a very subtle error. On SunOS machines
(pre-Solaris), /etc/hosts was a link to another file (I think it was
/etc/inet/hosts). Some applications referred to /etc/hosts, and others
to /etc/inet/hosts. My mistake was to break this link and replace it
with a regular file. Then I tried to reboot the network with a new IP
configuration in *one* of the files, and
a) the hosts came up with the old IP configuration (read from
   /etc/hosts)
b) but they didn't know each others' IP addresses, because the resolver
   read these from /etc/inet/hosts!
The result was that all machines crashed when they came up multi-user
and couldn't find their servers. OOPS!
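For what it's worth, here is a rough sketch of how a tool could avoid
breaking a link like that one: resolve the link first and rewrite the file
it points to, so both names keep showing the same contents. The function
name write_preserving_link() is made up for illustration, and the sketch
assumes a POSIX filesystem:

```python
import os


def write_preserving_link(path, text):
    """Write text to path; if path is a symlink, update the file it points to."""
    # os.path.realpath resolves symlinks, so the link itself is left intact.
    target = os.path.realpath(path)
    tmp = target + ".new"
    with open(tmp, "w") as f:
        f.write(text)
    # os.replace swaps the new file into place (atomic on one filesystem).
    os.replace(tmp, target)
    return target
```

Called on /etc/hosts in the situation above, this would have rewritten the
file behind the link rather than replacing the link with a second,
divergent copy.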

[One of my very first tools utilized the "dammit!" convention. It asked
whether I was sure of everything. I could answer "yes", "no", "yes
dammit!" or "no dammit!". If I appended the "dammit!" to the end,
it would not ask that question again. "User interface by
intimidation:)"]
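The convention is easy to reconstruct. This is not the original tool, just
a small sketch of the idea as described; ask(), the remembered dictionary
and the answer_fn parameter are names invented here:

```python
def ask(question, remembered, answer_fn=input):
    """Ask a yes/no question; a trailing 'dammit!' makes the answer stick."""
    if question in remembered:
        # Answered with "dammit!" earlier, so do not ask again.
        return remembered[question]
    while True:
        reply = answer_fn(f"{question} ").strip().lower()
        sticky = reply.endswith("dammit!")
        word = reply[:-len("dammit!")].strip() if sticky else reply
        if word in ("yes", "no"):
            result = (word == "yes")
            if sticky:
                remembered[question] = result
            return result
        # Any other reply: loop and ask the question again.
```

Passing the same remembered dictionary to every call is what lets a
"yes dammit!" silence that question for the rest of the session.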

I've also had to rip automation out of a site for safety reasons.
When I released management of my site to new staff, I was using
an extremely complex cfengine script. This script was a minefield
for new admins; they were constantly stepping on mines and getting
blown up. I finally decided that it was safer just to turn the whole
darn thing off and let the new admins master automation their own way.
Not only did they evolve a new way, it was substantially safer and
easier for them to collaborate in using the new way, as a group.

There are some safe things we could start with:
a) "make this machine similar to that one."
b) "configure this machine's network this way."
c) "make this machine a mail server."
These are very high-level things. The "safety" comes from the fact
that there is no opportunity for user error at the bit level. We should
*not* encourage people to kick off large-scale deployments until they
know what they're doing... one machine at a time, automated, is fine
as a first step...
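One way to read "one machine at a time, automated" is the loop below. It is
only a sketch under my own assumptions: apply and verify are hypothetical
callables supplied by the admin, and rollout_one_at_a_time() is not taken
from any existing tool:

```python
def rollout_one_at_a_time(hosts, apply, verify):
    """Apply a change host by host, stopping at the first failed check."""
    done = []
    for host in hosts:
        apply(host)
        if not verify(host):
            # Stop here so a bad change damages at most one machine.
            return done, host
        done.append(host)
    return done, None
```

The return value is the list of hosts that passed, plus the host that
failed (or None), so the beginner can see exactly where things stopped.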

--
Dr. Alva L. Couch
Associate Professor of Computer Science
Associate Professor of Electrical and Computer Engineering
Tufts University, 161 College Avenue, Medford, MA 02155
Phone: +1 (617) 627-3674
Web: http://www.cs.tufts.edu/~couch
_______________________________________________
lssconf-discuss mailing list
lssconf-discuss@inf.ed.ac.uk
http://lists.inf.ed.ac.uk/mailman/listinfo/lssconf-discuss
