Yes, this kind of thing is essential if you want to
design a system that stays up.

I have another fun thing along these lines.  When
you start a request that comes into the server,
you assign it an integer.  As the handling of
the request reaches transaction boundaries
(transactions on an underlying database
management system), you decrease the
counter by one and keep passing it along
the chain of handling.  (The chain might
include sending messages from one
component to another, and getting replies,
and so on.  For now I'm assuming that
request handling only has one thread.)

Whenever the count reaches zero, the
component that is currently handling
the request is killed.  The idea is
to make sure that you test "all the
places" where a crash might happen;
that is, "all" with respect to the database.
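
Here is a minimal sketch of the countdown idea,
not our actual tool.  The component names, the
ComponentKilled exception (standing in for really
killing a process), and the choice to tick right
after each commit are all assumptions made for
illustration.

```python
class ComponentKilled(Exception):
    """Stands in for killing the component that holds the request."""


class TickingRequest:
    """A request that carries the countdown integer along the chain."""

    def __init__(self, payload, ticks):
        self.payload = payload
        self.ticks = ticks

    def transaction_boundary(self, component):
        # Decrement at every transaction boundary; "kill" on zero.
        self.ticks -= 1
        if self.ticks == 0:
            raise ComponentKilled(component)


def handle(request, log):
    # Hypothetical single-threaded chain of three components.
    for component in ("frontend", "pricing", "booking"):
        log.append(component)  # stands in for a committed transaction
        request.transaction_boundary(component)


def run_all_crash_points(max_ticks):
    """Replay the same request once per starting count.

    Returns a dict mapping each starting count to the component that
    was killed, or None if the request ran to completion.
    """
    results = {}
    for ticks in range(1, max_ticks + 1):
        log = []
        try:
            handle(TickingRequest("req", ticks), log)
            results[ticks] = None
        except ComponentKilled as killed:
            results[ticks] = str(killed)
    return results
```

Calling run_all_crash_points with a count larger
than the number of boundaries replays the request
once per possible crash point.  A real harness
would restart the killed component after each run
and check that the database is still consistent.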

This tool is testing the system under a
certain set of assumptions.  It's assuming
that the DBMS does what it's told to do.
It's testing "stop" failures.  It works best
when the only side-effects are to the
database system.  If there are other
side effects, e.g. the components
get things into their cache that stay
there and are used in subsequent
requests, then you are less sure that
you are testing out all possible paths.

I like to call these things "failure injection"
(I didn't invent that term).  We have other
failure injection stuff all over our system.
The "transaction ticking time bomb" one
is just my favorite.

Using a random tool might not find all of
these states.  Random tools are great,
but there are other useful test tools, too.
In general, trying to "test all possible
circumstances" is very, very hard; you
never know what particular input
values might be the ones that cause
a problem, and then you have to
worry about variable A having
certain values while variable B
has certain other values, leading
to a combinatorial explosion of
circumstances to test.
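
To put a number on that explosion, here is a small
sketch.  The variable names and the size of each
value range are made up for illustration.

```python
from itertools import product
from math import prod


def count_cases(domains):
    """Number of combinations if every variable takes every value."""
    return prod(len(values) for values in domains.values())


def all_cases(domains):
    """Yield each combination as a dict of variable name to value."""
    names = list(domains)
    for combo in product(*(domains[name] for name in names)):
        yield dict(zip(names, combo))


# Hypothetical inputs: three variables, ten candidate values each.
domains = {"A": range(10), "B": range(10), "C": range(10)}
print(count_cases(domains))
```

Three variables with ten values each already give
1000 cases, and every additional variable of that
size multiplies the total by ten.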

In my opinion, one of the best ways
to deal with this problem is by
having experienced Q/A people
who have a knack for guessing
what cases ought to be tested.

There's a lot I can say about this,
but the main thing I'll say is
that this is one of the reasons I have
doubts about the wisdom of the
methodology used at Facebook,
in which there is no Q/A department,
and programmers are expected to do
their own Q/A.  That's just one of
the reasons.

It helps that at Facebook, you can roll
out a new feature to a very small subset
of users and un-install it quickly if
it's causing a problem.  If it doesn't
work, that usually isn't very important
when only a very small number of early
adopters see it.  Things generally don't
work that way with an airline reservation
system.
The methodologies that are suited
for one situation are not necessarily
those suitable for another.

Dan

Peter Seibel wrote:
Presumably this kind of thing is the reason for the Chaos Monkey:


On Mon, Dec 20, 2010 at 6:37 PM, Scott L. Burson wrote:
On Fri, Dec 17, 2010 at 11:16 AM, Ryan Davis wrote:
We do something like this.  For lisp websites my company makes, we have
a password-protected admin section with some light UI to help us manage
the site (turn logging levels up/down, clear caches, etc), and one of
those tools is an "evaluate this code in the running lisp" textarea, with
a dropdown to select what package it runs in.  This is very rarely used
to patch the site in emergency situations or for trivial changes where
we don't want to bring down the site.  This has bitten us a few times,
where we fixed a small bug directly in the running lisp and then forgot
to publish the new code and had mystery regressions when the lisp
process was restarted.
A really funny cautionary tale about this sort of thing:

-- Scott

