On 9/28/2014 6:39 PM, Sean Kelly wrote:
> Well... suppose you design a system with redundancy such that an error in a
> specific process isn't enough to bring down the system. Say it's a quorum
> method or whatever. In the instance that a process goes crazy, I would argue
> that the system is in an undefined state but a state that it's designed
> specifically to handle, even if that state can't be explicitly defined at design
> time. Now if enough things go wrong at once the whole system will still fail,
> but it's about building systems with the expectation that errors will occur.
> They may even be logic errors--I think it's kind of irrelevant at that point.
> Even a network of communicating processes, one getting in a bad state can
> theoretically poison the entire system and you're often not in a position to
> simply shut down the whole thing and wait for a repairman. And simply rebooting
> the system if it's a bad sensor that's causing the problem just means a pause
> before another failure cascade. I think any modern program designed to run
> continuously (increasingly the typical case) must be designed with some degree
> of resiliency or self-healing in mind. And that means planning for and limiting
> the scope of undefined behavior.
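To make the quorum idea concrete, here is a minimal sketch in Python (picked only for brevity; nothing in this thread prescribes a language). The name quorum_vote and the sample readings are hypothetical. It assumes each redundant process reports one comparable value, and that a value counts only if at least `quorum` processes agree on it.

```python
from collections import Counter

def quorum_vote(readings, quorum):
    """Return the value reported by at least `quorum` replicas, else None."""
    if not readings:
        return None
    # most_common(1) gives the single most frequent value and its count.
    value, count = Counter(readings).most_common(1)[0]
    return value if count >= quorum else None

# Three hypothetical replicas, one of which has gone crazy.
print(quorum_vote([42, 42, 7], quorum=2))  # 42
# No two replicas agree, so there is no answer worth trusting.
print(quorum_vote([1, 2, 3], quorum=2))    # None
```

Note what the sketch does with the odd one out: it never tries to interpret or repair the bad reading, it just outvotes it. When no quorum exists it returns None, so the caller has to treat that as a failure rather than guess.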
I've said that processes are different, because the scope of the effects is
limited by the hardware.
If a system with threads that share memory cannot be restarted, its design has
serious problems, because a crash and the necessary restart are going to happen
sooner or later, probably sooner.
I don't believe that the way to get 6 sigma reliability is by ignoring errors
and hoping. Airplane software is most certainly not done that way.
I recall Toyota got into trouble with their computer-controlled cars because of
how they handled inevitable bugs and errors. One process controlled everything.
When something unexpected went wrong, it kept right on operating, any unknown
and unintended consequences be damned.
The way to get reliable systems is to design to accommodate errors, not to
pretend they didn't happen or hope that nothing else got affected. In critical
software systems, that means shutting down and restarting the offending system,
or engaging the backup.
There's no other way that works.
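As a rough illustration of "shut down and restart the offending system, or engage the backup", here is a minimal Python sketch. The names run_supervised, worker_cmd, backup_cmd and max_restarts are hypothetical, and the "subsystems" are just throwaway child interpreters. It assumes the worker runs as a separate process and signals failure with a nonzero exit code. A real critical system would involve far more than this.

```python
import subprocess
import sys

def run_supervised(worker_cmd, backup_cmd, max_restarts=3):
    """Run worker_cmd as a separate process; restart it on failure,
    and engage backup_cmd once max_restarts attempts have failed.

    Returns (which, attempts): which is "primary", "backup" or "failed".
    """
    for attempt in range(1, max_restarts + 1):
        # Each attempt is a fresh process, so nothing left over from a
        # crashed run can leak into the next one.
        if subprocess.run(worker_cmd).returncode == 0:
            return ("primary", attempt)
    # The primary keeps failing: stop trusting it and switch to the backup.
    if subprocess.run(backup_cmd).returncode == 0:
        return ("backup", max_restarts)
    return ("failed", max_restarts)

# Hypothetical stand-ins for real subsystems.
FAULTY = [sys.executable, "-c", "raise SystemExit(1)"]  # always exits with an error
HEALTHY = [sys.executable, "-c", "pass"]                # exits cleanly
BACKUP = [sys.executable, "-c", "pass"]

print(run_supervised(HEALTHY, BACKUP))  # ('primary', 1)
print(run_supervised(FAULTY, BACKUP))   # ('backup', 3)
```

The supervisor never inspects or patches up the failed worker's state. It throws the whole process away and starts over, which is only safe because the worker's memory is not shared with anything else.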