Guowei and Stephan, thanks for the reply!

The biggest gain that FLIP-1 will deliver for streaming is that parallel
processing can continue accept for those parallel paths affected by the
failure, even when all tasks in an affected path need to be reset. Assuming
task manager process failure as most common scenario and the default
scheduling, that would leave (parallelism - number of task slots)
processing paths online (all tasks in the failed TM are reset). For
applications that are very latency sensitive and can produce independent
results, this is a big deal.

Also, since by default all tasks end up in a single slot, a TM failure will
reset everything when a shuffle is involved anyways, even when it is
upstream. But where needed a shuffle can be externalized using an
intermediate Kafka topic, for example. So perhaps I should also ask how far
away we are from completing the first variant (without intermediate
results) for streaming!

Thanks,
Thomas



On Fri, Jul 26, 2019 at 12:36 AM Stephan Ewen <se...@apache.org> wrote:

> Hi Thomas!
>
> For Batch, this should be working in release 1.9.
>
> For streaming, it is a bit more tricky, mainly because of the fact that you
> have to deal with downstream correctness.
> Either a recovery still needs to reset downstream tasks (which means on
> average half of the tasks) or needs to wait before publishing the data
> downstream until a persistent point for recovery has been reached.
>
> I have looked a bit into the second variant here. This needs a bit more
> thought (currently also busy with 1.9 release) but in the course of the
> next release cycle we might be able to share some initial design.
>
> Best,
> Stephan
>
>
>
> On Fri, Jul 26, 2019 at 3:46 AM Guowei Ma <guowei....@gmail.com> wrote:
>
> > Hi,
> > 1. Currently, much work in FLINK-4256 is about failover improvements in
> the
> > bouded dataset scenario.
> > 2. For the streaming scenario,  a new shuffle plugin + proper failover
> > strategy could avoid the "stop-the-word" recovery.
> > 3. We have already done many works about the new shuffle in the old Flink
> > shuffle architectures because many of our customers have the concern. We
> > have a plan to move the work to the new Flink pluggable shuffle
> > architecture.
> >
> > Best,
> > Guowei
> >
> >
> > Thomas Weise <t...@apache.org> 于2019年7月26日周五 上午8:54写道:
> >
> > > Hi,
> > >
> > > We are using Flink for streaming and find the "stop-the-world" recovery
> > > behavior of Flink prohibitive for use cases that prioritize
> availability.
> > > Partial recovery as outlined in FLIP-1 would probably alleviate these
> > > concerns.
> > >
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> > >
> > > Looking at the subtasks in
> > > https://issues.apache.org/jira/browse/FLINK-4256 it
> > > appears that much of the work was already done but not much recent
> > > progress? What is missing (for streaming)? How close is version 2
> > (recovery
> > > from limited intermediate results)?
> > >
> > > Thanks!
> > > Thomas
> > >
> >
>

Reply via email to