Re: How do we approach Fault Tolerance in Apache Edgent

Julian Feinauer Thu, 08 Nov 2018 13:42:56 -0800

Hi Gayashan,



First, thanks for bringing your ideas to the list.

I know that fault tolerance is very important in many stream processing 
applications, and we do have some failure handling in our own application.

I also agree with you that it does not come without a performance penalty 
(which is usually acceptable at the edge; in our cases the machines are "big 
enough").



But I'm unsure what kind of failure handling is needed. At the edge (in 
contrast to the cloud) you have only a single instance running, so whenever 
the gateway (or device) dies, there is no fallback to switch over to. Usually 
that's fine, because edge applications have many single points of failure 
anyway (network, power supply, the device itself, …).



So what I think is really important is to be safe against bugs or code changes 
/ updates. For that reason we never process a stream directly, for example, 
but route it through a "buffer" (think of Kafka, but small) to enable 
"backpressure" and, especially, restartability of the processing engine or 
(partial) re-processing.
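
To make the buffer idea concrete, here is a minimal sketch in Python. It is 
only an illustration under assumptions: `BoundedLog`, `offer`, `read_from` and 
`commit` are hypothetical names, not an Edgent or Kafka API, and the buffer 
lives in memory, whereas a real one would have to be persisted to survive a 
restart of the device.

```python
from collections import deque


class BoundedLog:
    """In-memory stand-in for a small durable buffer (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = deque()   # records not yet committed, oldest first
        self.base_offset = 0     # offset of records[0]

    def offer(self, record):
        """Append a record; return False when full so the producer slows down."""
        if len(self.records) >= self.capacity:
            return False         # backpressure signal
        self.records.append(record)
        return True

    def read_from(self, offset):
        """Return (offset, record) pairs starting at the given offset."""
        start = max(offset, self.base_offset) - self.base_offset
        return [(self.base_offset + i, r)
                for i, r in enumerate(self.records) if i >= start]

    def commit(self, offset):
        """Mark everything up to and including offset as processed."""
        while self.records and self.base_offset <= offset:
            self.records.popleft()
            self.base_offset += 1


log = BoundedLog(capacity=3)
accepted = [log.offer(x) for x in ["a", "b", "c", "d"]]  # "d" is rejected

first_pass = log.read_from(0)   # the engine reads everything
log.commit(0)                   # but only finishes record 0 before dying

# After a (simulated) restart the engine resumes behind the last commit.
replayed = log.read_from(1)
```

The producer sees `False` from `offer` once the buffer is full, which is the 
backpressure part. The restartability part comes from separating reading from 
committing: anything read but not committed is handed out again after a 
restart.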



What I would like to have is something like the operator or partition state 
from Apache Flink [1], which allows your internal state to be "kept" (and 
restored from checkpoints) whenever problems occur.

In our industry applications we usually also have at-least-once situations, or 
in many cases even idempotent operations, where we can live fine with 
at-least-once guarantees, which makes things way more comfortable.
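
As a rough sketch of what I mean by checkpointed state combined with 
at-least-once processing, consider the following Python example. All names 
(`process`, `checkpoint_every`, `crash_after`, the JSON checkpoint file) are 
assumptions made up for illustration; this is neither Flink's nor Edgent's 
API. The sink is a dict keyed by event id, so writing the same event twice 
is harmless.

```python
import json
import os
import tempfile


def process(events, checkpoint_path, sink, crash_after=None, checkpoint_every=2):
    """Sum event values, checkpointing (offset, total) every few events."""
    state = {"offset": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)          # restore from the last checkpoint

    for i in range(state["offset"], len(events)):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("simulated crash")
        event_id, value = events[i]
        state["total"] += value
        sink[event_id] = state["total"]   # idempotent: keyed overwrite
        state["offset"] = i + 1
        if state["offset"] % checkpoint_every == 0:
            tmp = checkpoint_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, checkpoint_path)  # atomic swap of the checkpoint
    return state["total"]


events = [("e1", 1), ("e2", 2), ("e3", 3), ("e4", 4), ("e5", 5)]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
sink = {}

try:
    process(events, path, sink, crash_after=3)   # dies before handling e4
except RuntimeError:
    pass

total = process(events, path, sink)              # restart from the checkpoint
```

In this run the last checkpoint before the crash covers e1 and e2, so e3 is 
processed a second time after the restart. Because the state is rolled back 
to the checkpoint and the sink write is a keyed overwrite, the replay 
produces the same result as the first attempt.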



Best

Julian



________________________________
From: Gayashan Amarasinghe <gayashan.amarasin...@gmail.com>
Sent: Wednesday, November 7, 2018 1:24:41 AM
To: dev@edgent.apache.org
Subject: How do we approach Fault Tolerance in Apache Edgent

Hi all,

Fault tolerance is a wide subject that can be approached in many ways (but
maybe not fully achieved without a performance degradation?). With the
edge devices that we are focusing on in Edgent, I think faults are very much
an expected phenomenon. There could be node failures, network failures,
software bugs or exceptions in the code, limited resources, etc. Therefore
some form of fault tolerance could be beneficial for some use cases.

When it comes to fault tolerance, there seem to be two main concepts:
replication and checkpointing. They both have their pros and cons. And with
fault tolerance comes the requirement of recovery, and there are many
ways of recovering from a failure as well.

But before going into that, I thought I would ask the list: what are
your thoughts on this? Do you think fault tolerance is required from a real
user/industry perspective? Do you have experiences where some form of fault
tolerance should have been implemented? Let's have a discussion on this if
possible.

(Special thanks should go to Julian for prompting this discussion on
another thread.)

/Gayashan
