Re: Program logic bugs vs input/environmental errors

Walter Bright via Digitalmars-d Sun, 28 Sep 2014 20:02:42 -0700

On 9/28/2014 6:39 PM, Sean Kelly wrote:

Well... suppose you design a system with redundancy such that an error in a
specific process isn't enough to bring down the system.  Say it's a quorum
method or whatever.  In the instance that a process goes crazy, I would argue
that the system is in an undefined state but a state that it's designed
specifically to handle, even if that state can't be explicitly defined at design
time.  Now if enough things go wrong at once the whole system will still fail,
but it's about building systems with the expectation that errors will occur.
They may even be logic errors--I think it's kind of irrelevant at that point.


Even in a network of communicating processes, one getting into a bad state can
theoretically poison the entire system and you're often not in a position to
simply shut down the whole thing and wait for a repairman.  And simply rebooting
the system if it's a bad sensor that's causing the problem just means a pause
before another failure cascade.  I think any modern program designed to run
continuously (increasingly the typical case) must be designed with some degree
of resiliency or self-healing in mind.  And that means planning for and limiting
the scope of undefined behavior.

I've said that processes are different, because the scope of the effects is limited by the hardware.

If a system with threads that share memory cannot be restarted, there are serious problems with the design of it, because a crash and the necessary restart are going to happen sooner or later, probably sooner.

I don't believe that the way to get 6 sigma reliability is by ignoring errors and hoping. Airplane software is most certainly not done that way.

I recall Toyota got into trouble with their computer controlled cars because of their idea of handling inevitable bugs and errors. It was one process that controlled everything. When something unexpected went wrong, it kept right on operating, any unknown and unintended consequences be damned.

The way to get reliable systems is to design to accommodate errors, not pretend they didn't happen, or hope that nothing else got affected, etc. In critical software systems, that means shut down and restart the offending system, or engage the backup.


There's no other way that works.
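
As a rough illustration of "shut down and restart the offending system, or engage the backup", here is a minimal sketch in Python. It is not taken from any real system: the function name `supervise`, the `max_attempts` parameter, and the use of the process exit code as the only failure signal are all assumptions made for the example. The worker runs as a separate OS process, so a crash there cannot corrupt the supervisor's memory.

```python
import subprocess
import sys


def supervise(primary_cmd, backup_cmd, max_attempts=3):
    """Run primary_cmd in its own process; retry on failure, then fall back.

    Hypothetical helper for illustration only. Returns (role, attempts),
    where role names the command that finally exited cleanly.
    """
    for attempt in range(1, max_attempts + 1):
        # Separate process: the hardware limits how far a fault can spread.
        result = subprocess.run(primary_cmd)
        if result.returncode == 0:
            return ("primary", attempt)
        # Do not keep going in an unknown state: the failed process is gone,
        # and the next iteration starts a fresh one.
        print(
            f"primary exited with {result.returncode} "
            f"(attempt {attempt}/{max_attempts})",
            file=sys.stderr,
        )
    # Fresh starts did not help, so engage the backup.
    result = subprocess.run(backup_cmd)
    if result.returncode != 0:
        raise RuntimeError("backup also failed; shutting down")
    return ("backup", max_attempts)


if __name__ == "__main__":
    # Stand-in workers: one that always fails, one that always succeeds.
    crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    healthy = [sys.executable, "-c", "pass"]
    print(supervise(crashing, healthy))
```

The supervisor never tries to repair the failed worker in place. It only observes that the worker died, then starts a clean copy or switches to the backup.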
