Hi all,

Fault tolerance is a wide subject that can be approached in many ways (though perhaps it cannot be fully achieved without some performance degradation?). With the edge devices that we are focusing on in Edgent, I think faults are very much an expected phenomenon: node failures, network failures, software bugs or exceptions in the code, limited resources, etc. Some form of fault tolerance could therefore be beneficial for some use cases.
When it comes to fault tolerance, there seem to be two main concepts: replication and checkpointing. Both have their pros and cons. Fault tolerance also brings the requirement of recovery, and there are many ways of recovering from a failure as well.

But before going into that, I wanted to ask the list for your thoughts. Do you think fault tolerance is required from a real user/industry perspective? Do you have experiences where some form of fault tolerance should have been implemented? Let's have a discussion on this if possible.

(Special thanks to Julian for prompting this discussion on another thread.)

/Gayashan