Hi Thomas!

For Batch, this should be working in release 1.9.

For streaming, it is a bit more tricky, mainly because you have to deal with
downstream correctness: either a recovery still needs to reset downstream
tasks (on average, half of the tasks), or the producer needs to hold back
data from downstream until a persistent point for recovery has been reached.

I have looked a bit into the second variant here. It needs a bit more
thought (we are currently also busy with the 1.9 release), but in the course
of the next release cycle we might be able to share an initial design.
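To make the second variant concrete, here is a hypothetical sketch in plain
Python (not actual Flink code or APIs, just the idea): the producer buffers
its output and only releases it downstream once a persistent recovery point
(a completed checkpoint) covers it, so a failure never requires resetting
downstream tasks.

```python
class HoldBackChannel:
    """Buffers emitted records until the covering checkpoint is confirmed.

    Hypothetical illustration of "hold back data until a persistent
    recovery point" -- names and structure are invented for this sketch.
    """

    def __init__(self):
        self.pending = []   # records produced since the last confirmed checkpoint
        self.released = []  # records that are safely visible downstream

    def emit(self, record):
        # Output is not immediately visible downstream.
        self.pending.append(record)

    def checkpoint_complete(self):
        # A persistent recovery point now covers the pending output,
        # so it can never be rolled back; publish it downstream.
        self.released.extend(self.pending)
        self.pending.clear()

    def fail_and_recover(self):
        # On failure, unconfirmed output is simply dropped and will be
        # regenerated after restart -- downstream saw none of it, so no
        # downstream task needs to be reset.
        self.pending.clear()
```

The trade-off this sketch makes visible: downstream correctness comes at the
cost of extra latency, since records are delayed by up to one checkpoint
interval before becoming visible.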

Best,
Stephan



On Fri, Jul 26, 2019 at 3:46 AM Guowei Ma <guowei....@gmail.com> wrote:

> Hi,
> 1. Currently, much of the work in FLINK-4256 is about failover improvements
> in the bounded dataset scenario.
> 2. For the streaming scenario, a new shuffle plugin plus a proper failover
> strategy could avoid the "stop-the-world" recovery.
> 3. We have already done a lot of work on the new shuffle within the old
> Flink shuffle architecture, because many of our customers share this
> concern. We plan to move that work to the new Flink pluggable shuffle
> architecture.
>
> Best,
> Guowei
>
>
> Thomas Weise <t...@apache.org> 于2019年7月26日周五 上午8:54写道:
>
> > Hi,
> >
> > We are using Flink for streaming and find the "stop-the-world" recovery
> > behavior of Flink prohibitive for use cases that prioritize availability.
> > Partial recovery as outlined in FLIP-1 would probably alleviate these
> > concerns.
> >
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> >
> > Looking at the subtasks in
> > https://issues.apache.org/jira/browse/FLINK-4256, it
> > appears that much of the work is already done, but there has not been much
> > recent progress. What is missing (for streaming)? How close is version 2
> > (recovery from limited intermediate results)?
> >
> > Thanks!
> > Thomas
> >
>