Hi Xiaogang, it's an interesting discussion. I have heard some of similar feature requirements before. Some users need a lighter failover strategy since the correctness is not so critical for their scenario as you mentioned. Even more some jobs may not enable checkpointing at all, a global/region failover strategy actually doesn't make sense for these jobs. The individual failover strategy doesn't work well for these scenario since it only supports a topology without edges currently. Actually we have implemented a Best-effort failover strategy in our private branch. There is a little difference with your proposal that it doesn't support at-most-once mechanism. It has a weaker consistency model but with a faster recovery ability. I think it would satisfy your scenario.
SHI Xiaogang <shixiaoga...@gmail.com> 于2019年6月11日周二 下午4:33写道: > Flink offers a fault-tolerance mechanism to guarantee at-least-once and > exactly-once message delivery in case of failures. The mechanism works well > in practice and makes Flink stand out among stream processing systems. > > But the guarantee on at-least-once and exactly-once delivery does not come > without price. It typically requires to restart multiple tasks and fall > back to the place where the last checkpoint is taken. (Fined-grained > recovery can help alleviate the cost, but it still needs certain efforts to > recover jobs.) > > In some senarios, users perfer quick recovery and will trade correctness > off. For example, in some online recommendation systems, timeliness is far > more important than consistency. In such cases, we can restart only those > failed tasks individually, and do not need to perform any rollback. Though > some messages delivered to failed tasks may be lost, other tasks can > continuously provide service to users. > > Many of our users are demanding for at-most-once delivery in Flink. What do > you think of the proposal? Any feedback is appreciated. > > Regards, > Xiaogang Shi >