It is widely acknowledged that the internet is a hostile environment. There's a plethora of news about malware and other problems. And yet we mostly seem to adopt a "head in the sand" approach to dealing with these issues. Or, at least, the software developers I have worked with seem largely unconcerned about such things, perhaps because other people's protective work has shielded them [so far] from the failure modes?
Still, an ounce of prevention is worth a pound of cure. So, here are some thoughts on how to engineer for resilience:

(1) Double entry bookkeeping. https://en.wikipedia.org/wiki/Double-entry_bookkeeping_system Any critical information should be stored in multiple ways, designed so that corruption can be detected and isolated. The trick here is that you want to isolate and pursue the problems which do not make sense. (If you are hiring a designer, implementer, or supporter of this kind of thing, people who are fans of Agatha Christie novels might be good fits, for example.)

(2) People skills. We [as programmers] are accustomed to solving technical problems, but the problems worth solving are people problems. And on the internet we have the joy and privilege of facing international conflicts, political conflicts, economic failures, war zone issues, and a multitude of other forms of insanity. All at arm's length, but all of these things are out there, lurking. As a result, there are pressures to oversimplify (who wants to deal with all that?), and while some of that simplification is necessary, simplifying away from relevant priorities can eat your lunch money. Plus, we all make mistakes, and the mechanisms we build for handling our own mistakes can often help ameliorate external failure modes as well. So there's a real need to actively cope with failure modes while building meaningfully useful things for other people who are also coping. People skills seem crucial here.

(3) Gathering details on failures. Any widely deployed software has to gather information on crashes (which, in turn, requires people with some ability to digest those crash reports). Or, if you can't make sense of someone else's system, build your own, one that gathers information relevant to your design process. (A rough sketch of what that might look like appears below, after the concrete examples.)

But that's all I can think of at the moment. The most important part of this, I think, is that you need people who are level-headed about the potential failures. Pretending they don't happen, and/or pretending things are worse than they are, tends to get in the way of reasonable solutions. But you also need a "working approach" which complements your other priorities.

As concrete examples:

(1) Checksums (including cryptographic hashes) can help catch some problems, though it's worth thinking about what this does and does not catch. (A rough sketch appears below.)

(2) Apprenticeship as a design philosophy. If you are working on a piece of software intended to benefit a professional user, spending some time working directly for someone who is coping with the problems you are trying to address can bring the important issues into focus.

I don't have any recent examples of (3).

This is motivated by various ongoing failures I've been observing on some of the machines I work with. The failures themselves do not make sense, and no one else seems to report having similar problems. I do not know what to do about such things, except to encourage people to try to build for resilience against failures.
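Since the checksum idea is easy to show concretely, here is a minimal sketch (Python, just because it is compact; the file layout, field names, and tolerance are made up for illustration). The idea is to store critical data together with a hash of its canonical form, plus a redundant "double entry" total, so that a later read can notice when the two stories no longer agree:

    import hashlib, json

    def write_record(path, entries):
        # Store the detail entries, an independently computed total (the
        # "double entry"), and a SHA-256 of the canonical text.  Corruption
        # of any one of the three shows up later as a disagreement.
        body = json.dumps({"entries": entries, "total": sum(entries)},
                          sort_keys=True)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        with open(path, "w") as f:
            json.dump({"body": body, "sha256": digest}, f)

    def read_record(path):
        with open(path) as f:
            stored = json.load(f)
        body = stored["body"]
        if hashlib.sha256(body.encode("utf-8")).hexdigest() != stored["sha256"]:
            raise ValueError(path + ": checksum mismatch, record looks corrupted")
        data = json.loads(body)
        if abs(sum(data["entries"]) - data["total"]) > 1e-9:
            raise ValueError(path + ": entries no longer agree with their total")
        return data["entries"]

Note that this only notices corruption which happens after the record was written; it says nothing about whether the record was right in the first place, which is more the kind of thing the bookkeeping discipline itself is meant to catch.

And for gathering details on failures, an equally rough sketch of a home-grown hook (again Python, and the log file name is invented). The point is only that an unhandled failure leaves behind a structured record someone can digest later, rather than vanishing:

    import json, platform, sys, time, traceback

    def log_failure(exc_type, exc_value, exc_tb):
        # Append a structured record of the failure, then let the normal
        # handler print the traceback as usual.
        record = {
            "when": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "host": platform.node(),
            "python": platform.python_version(),
            "error": repr(exc_value),
            "traceback": traceback.format_exception(exc_type, exc_value, exc_tb),
        }
        with open("failures.log", "a") as f:
            f.write(json.dumps(record) + "\n")
        sys.__excepthook__(exc_type, exc_value, exc_tb)

    sys.excepthook = log_failure

None of this is specific to Python, of course; the point is the shape of the habit, not the tooling.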
That's all, for now.

Thanks,

-- Raul

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm