Hi all,

Fault tolerance is a broad subject that can be approached in many ways (but
perhaps cannot be fully achieved without some performance degradation?).
With the edge devices that we are focusing on in Edgent, I think faults are
very much an expected phenomenon. There could be node failures, network
failures, software bugs or exceptions in the code, limited resources, etc.
Some form of fault tolerance could therefore be beneficial for some use
cases.

When it comes to fault tolerance, there seem to be two main concepts:
replication and checkpointing. Both have their pros and cons. Fault
tolerance also brings the requirement of recovery, and there are many ways
of recovering from a failure as well.

But before going into that, I thought I would ask the list: what are your
thoughts on this? Do you think fault tolerance is required from a real
user/industry perspective? Do you have experiences where some form of fault
tolerance should have been implemented? Let's have a discussion on this if
possible.

(Special thanks should go to Julian for prompting this discussion on
another thread.)

/Gayashan
