Hi Xiaogang, it's an interesting discussion.

I have heard some of similar feature requirements before. Some users need a
lighter failover strategy since the correctness is not so critical for
their scenario as you mentioned. Even more some jobs may not enable
checkpointing at all, a global/region failover strategy actually doesn't
make sense for these jobs. The individual failover strategy doesn't work
well for these scenario since it only supports a topology without edges
currently.
Actually we have implemented a Best-effort failover strategy in our private
branch. There is a little difference with your proposal that it doesn't
support at-most-once mechanism. It has a weaker consistency model but with
a faster recovery ability. I think it would satisfy your scenario.


SHI Xiaogang <shixiaoga...@gmail.com> 于2019年6月11日周二 下午4:33写道:

> Flink offers a fault-tolerance mechanism to guarantee at-least-once and
> exactly-once message delivery in case of failures. The mechanism works well
> in practice and makes Flink stand out among stream processing systems.
>
> But the guarantee on at-least-once and exactly-once delivery does not come
> without price. It typically requires to restart multiple tasks and fall
> back to the place where the last checkpoint is taken. (Fined-grained
> recovery can help alleviate the cost, but it still needs certain efforts to
> recover jobs.)
>
> In some senarios, users perfer quick recovery and will trade correctness
> off. For example, in some online recommendation systems, timeliness is far
> more important than consistency. In such cases, we can restart only those
> failed tasks individually, and do not need to perform any rollback. Though
> some messages delivered to failed tasks may be lost, other tasks can
> continuously provide service to users.
>
> Many of our users are demanding for at-most-once delivery in Flink. What do
> you think of the proposal? Any feedback is appreciated.
>
> Regards,
> Xiaogang Shi
>

Reply via email to