Hi Gayashan,
First, thanks for bringing your ideas to the list.

I know that fault tolerance is very important in many stream processing applications, and indeed we also have some failure handling in our application. I agree with you that it doesn't come without a performance penalty (which is usually fine at the edge, as in our cases our machines are "big enough").

But I'm unsure what kind of failure handling is needed. At the edge (in contrast to the cloud) you have only a single instance running. So whenever the gateway (or device) dies, you have no fallback to switch over to. And usually that's fine, because you have many single points of failure in edge applications (network, power supply, the device itself, …).

So what I think is really important is to be safe against bugs or code changes / updates. Therefore, for example, we never process a stream directly but route it over a "buffer" (think of Kafka, but small) to enable "backpressure" and especially restartability of the processing engine or (partial) re-processing.

What I would like to have is something like the operator or partition state from Apache Flink [1], which allows your internal state to be "kept" (and reproduced from checkpoints) whenever problems occur.

In our industrial applications we usually have at-least-once situations, or in many cases even idempotent operations, so we can live fine with at-least-once guarantees, which makes things much more comfortable.

Best
Julian

________________________________
From: Gayashan Amarasinghe <gayashan.amarasin...@gmail.com>
Sent: Wednesday, November 7, 2018 1:24:41 AM
To: dev@edgent.apache.org
Subject: How do we approach Fault Tolerance in Apache Edgent

Hi all,

Fault tolerance is a wide subject that can be approached in many ways (but maybe not fully achieved without performance degradation?). With the edge devices that we are focusing on in Edgent, I think faults are very much an expected phenomenon.
There could be node failures, network failures, software bugs or exceptions in the code, limited resources, etc. Therefore some form of fault tolerance could be beneficial for some use cases.

When it comes to fault tolerance, there seem to be two main concepts: replication and checkpointing. They both have their pros and cons. With fault tolerance also comes the requirement of recovery, and there are many ways of recovering from a failure as well.

But before going into that, I thought I would ask the list: what are your thoughts on this? Do you think fault tolerance is required from a real user/industry perspective? Do you have experiences where some form of fault tolerance should have been implemented?

Let's have a discussion on this if possible. (Special thanks should go to Julian for prompting this discussion on another thread.)

/Gayashan
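
To make the checkpointing idea from this thread a bit more concrete, here is a minimal sketch of a processor that reads from a replayable buffer, periodically checkpoints its offset and operator state, and resumes from the last checkpoint after a crash. It is only an illustration under stated assumptions: it uses neither the Edgent nor the Flink API, a plain Python list stands in for the Kafka-like buffer, and all names (CheckpointedProcessor, checkpoint_every, fail_at) are hypothetical. The state update is idempotent (last value per sensor), so replaying a record after a restart is harmless, which is what makes at-least-once delivery sufficient here.

```python
import json
import os
import tempfile


class CheckpointedProcessor:
    """Hypothetical processor: reads a replayable buffer and checkpoints its progress."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.offset = 0   # index of the next record to process
        self.state = {}   # operator state: last seen value per sensor
        self._restore()

    def _restore(self):
        # Resume from the last checkpoint if one exists.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                snapshot = json.load(f)
            self.offset = snapshot["offset"]
            self.state = snapshot["state"]

    def _checkpoint(self):
        # Write to a temp file, then rename, so a crash cannot leave a torn checkpoint.
        tmp = self.checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"offset": self.offset, "state": self.state}, f)
        os.replace(tmp, self.checkpoint_path)

    def run(self, buffer, checkpoint_every=2, fail_at=None):
        while self.offset < len(buffer):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated crash")
            sensor, value = buffer[self.offset]
            self.state[sensor] = value  # idempotent: a replay yields the same state
            self.offset += 1
            if self.offset % checkpoint_every == 0:
                self._checkpoint()
        self._checkpoint()


# The buffer keeps records until they are processed, so it can be replayed.
buffer = [("temp", 20), ("temp", 21), ("rpm", 900), ("temp", 22), ("rpm", 950)]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

first = CheckpointedProcessor(path)
try:
    first.run(buffer, fail_at=3)  # crashes after record 2, before the next checkpoint
except RuntimeError:
    pass

second = CheckpointedProcessor(path)  # "restart" of the processing engine
resumed_from = second.offset          # last checkpointed offset
second.run(buffer)                    # record 2 is processed a second time
```

In this run the restarted processor resumes at the last checkpoint rather than at the crash position, so one record is seen twice. With a non-idempotent operation (say, a running sum) that replay would corrupt the result unless the state and the offset are always restored together from the same snapshot, as they are here.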