>>>>> "Ed" == Edmund Smith <[EMAIL PROTECTED]> writes:

  Ed> The problem with doing better, ie protecting systems from going
  Ed> into "bad states" is defining what those states are. In order to
  Ed> do more powerful specification, where you say what you want at
  Ed> some high level (or verify what you're getting in some general
  Ed> way) you typically require at least first order logic, which is
  Ed> error-prone and unfamiliar for non-specialists. I wrote a proof
  Ed> of concept of a tool in this area (it could verify properties of
  Ed> a group of node configurations statically - things like "for
  Ed> every client C that's asking for a service Q from server S,
  Ed> server S exists, and provides service Q"). Typically the
  Ed> predicates are pretty short, very hard to write, though
  Ed> re-usable (your predicates typically stay the same, and are
  Ed> usually just "general good things"). They're also crazily
  Ed> expensive to verify automatically without a clever solver (which
  Ed> mine wasn't). The brilliance of Sanjai/Telcordia's stuff is
  Ed> making this practical at scale, but so far (I think) their work
  Ed> is still confined to network property configuration only.

There is another aspect that makes Sanjai's stuff so useful. They have
a model of network protocol behavior. This allows them to do a much
more sophisticated analysis. We don't have any similar model for the
behavior of Unix programs in general; the best we have are models of
the sort you've described above (for service dependence). This will
really limit the utility of the safety checks we can perform with
rules like this.
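That said, even the service-dependence predicate Ed describes ("for
every client C that's asking for a service Q from server S, server S
exists, and provides service Q") is directly checkable once node
configurations are in machine-readable form. A minimal sketch in
Python; the inventory data shape here is made up, purely for
illustration:

```python
def check_service_dependencies(nodes):
    """Statically verify: for every client that requires a service Q
    from a server S, node S exists and provides Q.

    `nodes` maps hostname -> {"provides": set of service names,
    "requires": list of (server, service) pairs}.  This data shape
    is hypothetical, for illustration only.
    """
    violations = []
    for client, spec in nodes.items():
        for server, service in spec.get("requires", []):
            if server not in nodes:
                violations.append((client, server, service,
                                   "no such server"))
            elif service not in nodes[server].get("provides", set()):
                violations.append((client, server, service,
                                   "service not provided"))
    return violations
```

Predicates of this shape are cheap, one pass over the requirements; it
is the richer predicates (quantifying over protocol behavior, say)
that need a clever solver.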

I also think these sorts of discussions, while quite interesting, put
the cart before the horse a little bit. Luke was talking about
fatfingering configs causing a massive failure. There are two ways
this can happen: a failure of core infrastructure in a small number
of places, or a per-node failure caused by a configuration change or
software upgrade. In both cases, a configuration file syntax error
can cause services, persistent or not, to fail. Since this is a
common occurrence, any safety mechanism that we want to be practical
needs to handle it.
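For the plain syntax-error case, even a crude pre-deployment gate
helps: run the service's own checker (dhcpd has -t, for instance) and
refuse to push on failure. A hedged sketch in Python; the checker
command and the `install` hook are whatever your environment provides:

```python
import subprocess

def config_is_valid(checker_cmd):
    """Run a syntax-check command (e.g. ["dhcpd", "-t", "-cf", path])
    and report whether it exited cleanly."""
    try:
        return subprocess.run(checker_cmd,
                              capture_output=True).returncode == 0
    except FileNotFoundError:
        # Checker not installed: treat as failure, don't deploy blind.
        return False

def deploy(path, checker_cmd, install):
    """Refuse to install a config that fails its own syntax check.
    `install` is whatever actually pushes the file (hypothetical)."""
    if not config_is_valid(checker_cmd):
        raise RuntimeError("syntax check failed; not deploying %r" % path)
    install(path)
```

This catches only what the parser catches, of course; semantic
fatfingers sail straight through it.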

In order to detect this sort of condition prior to deployment, you
need a detailed model of program behavior, and other system
configuration. Consider the ISC dhcpd configuration as an example:
not only do you need a model that understands which configuration
options are valid for a given version, you also need to know how
these options change across versions of the software, and how the
configuration interacts with underlying system configuration, like,
say, the interface IPs available on a system.
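A toy version of such a model makes the cost concrete: per-version
option tables, plus a cross-check against live system state. The
option names and version split below are illustrative, not the real
dhcpd grammar:

```python
import ipaddress

# Hypothetical per-version option tables, for illustration only.
VALID_OPTIONS = {
    "3.0": {"subnet", "range", "option routers"},
    "3.1": {"subnet", "range", "option routers", "ddns-update-style"},
}

def check_config(version, options, subnets, interface_ips):
    """Flag options unknown to this version, and declared subnets
    that no local interface address falls in."""
    errors = []
    known = VALID_OPTIONS.get(version, set())
    for opt in options:
        if opt not in known:
            errors.append("option %r not valid in version %s"
                          % (opt, version))
    for subnet in subnets:
        net = ipaddress.ip_network(subnet)
        if not any(ipaddress.ip_address(ip) in net
                   for ip in interface_ips):
            errors.append("no interface address in subnet %s" % subnet)
    return errors
```

Multiply these tables by every option, every version, and every
interaction with the rest of the system, and the maintenance cost
becomes clear.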

Another thing that you need to start modelling is bug
interactions. Sanjai, does your software deal with bugs in BGP or OSPF
implementations and the like?

Don't get me wrong, once we had a model of this sort, we could analyze
all sorts of useful properties of the dhcpd service, starting with the
basics like functionality and moving up to much more sophisticated
things like which clients could boot properly from it. At the same
time, such models sound a little too expensive to develop and
maintain. The reason this approach works for network management is
that a relatively small number of models are needed, whereas on a
Unix system there may be hundreds of models that would need to
interact with one another (to handle everything from minicom to the
kernel to DNS, DHCP, sendmail, postfix, etc.).

For this reason, it is not clear to me that this is a practical
near-term solution on Unix systems. As an alternative, we have been
working with behavioral tests, and they provide a much more
immediately useful set of functions. Behavioral tests aren't as
powerful as the analytical ones that Sanjai does, but they sure are a
lot more incremental. 
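A behavioral test in this spirit can be very small: skip the model
entirely, poke the service, and look at what it does. A minimal
sketch for any TCP service; host and port are whatever you're testing:

```python
import socket

def service_responds(host, port, timeout=2.0):
    """Behavioral check: can we complete a TCP connection to the
    service at all?  No model of the service's internals needed."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Each such test is cheap to write and deploy on its own, which is
exactly the incremental property the analytical approach lacks.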

One conclusion I have come to over the last few years is that system
administrators (in general) don't have time in large enough chunks to
readily pursue non-incremental solutions for problems. There are, of
course, exceptions to this, but in general, this seems to hold. I
think that this has a pretty big impact on how people adopt tools
and techniques. I would worry that any sort of FOL-type system would
be hard enough to deploy and use that it would never make it very
far out of the lab (unless it was completely turn-key, which I
gather Sanjai's tools are).
 -nld
_______________________________________________
lssconf-discuss mailing list
lssconf-discuss@inf.ed.ac.uk
http://lists.inf.ed.ac.uk/mailman/listinfo/lssconf-discuss
