Re: [DISCUSS] FLIP-76: Unaligned checkpoints

Piotr Nowojski Mon, 07 Oct 2019 01:07:56 -0700

Hi Arvid,

Thanks for coming up with this FLIP. I think it addresses the issues raised in 
the previous mailing list discussion [2].


For the record: +1 from my side to implement this.

Piotrek

> On 30 Sep 2019, at 14:31, Arvid Heise <[email protected]> wrote:
> 
> Hi Devs,
> 
> I would like to start the formal discussion about FLIP-76 [1], which
> improves the checkpoint latency in systems under backpressure, where a
> checkpoint can take hours to complete in the worst case. I recommend the
> thread "checkpointing under backpressure" [2] to get a good idea why users
> are not satisfied with the current behavior. The key points:
> 
>   - Since the checkpoint barrier flows much more slowly through the
>   back-pressured channels, the other channels and their upstream operators
>   are effectively blocked during checkpointing.
>   - The checkpoint barrier takes a long time to reach the sinks causing
>   long checkpointing times. A longer checkpointing time in turn means that
>   the checkpoint will be fairly outdated once done. Since a heavily utilized
>   pipeline is inherently more fragile, we may run into a vicious cycle of
>   late checkpoints, crash, recovery to a rather outdated checkpoint, more
>   backpressure, and even later checkpoints, which would result in little to
>   no progress in the application.
> 
> The FLIP proposes "unaligned checkpoints", which improve on the current
> behavior such that
> 
>   - Upstream processes can continue to produce data, even if some operator
>   still waits on a checkpoint barrier on a specific input channel.
>   - Checkpointing times are heavily reduced across the execution graph,
>   even for operators with a single input channel.
>   - End-users will see more progress even in unstable environments as more
>   up-to-date checkpoints will avoid too many recomputations.
>   - Rescaling becomes faster.
> 
> The key idea is to allow checkpoint barriers to be forwarded to downstream
> tasks before the synchronous part of the checkpointing has been conducted
> (see Fig. 1). To that end, we need to store in-flight data as part of the
> checkpoint, as described in greater detail in this FLIP.
> 
> Although the basic idea was already sketched in [2], we would like to get
> broader feedback in this dedicated mail thread.
> 
> Best,
> 
> Arvid
> 
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
> [2]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Checkpointing-under-backpressure-td31616.html
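
The key idea quoted above (forward the barrier early and store the in-flight
data in the checkpoint) can be illustrated with a toy model. This is only a
sketch under simplifying assumptions, not Flink's implementation: the operator
is a running sum, each input channel is a plain list, and every name here
(BARRIER, aligned_checkpoint, unaligned_checkpoint, restore) is hypothetical.

```python
BARRIER = object()  # hypothetical stand-in for a checkpoint barrier


def pre_barrier(channel):
    """Records queued ahead of the barrier on one input channel."""
    return channel[:channel.index(BARRIER)]


def aligned_checkpoint(state, channels):
    """Aligned: consume everything ahead of every barrier, then snapshot."""
    processed = 0
    for channel in channels:
        for record in pre_barrier(channel):
            state += record  # toy operator: running sum
            processed += 1
    return {"state": state, "in_flight": [],
            "processed_before_forwarding": processed}


def unaligned_checkpoint(state, channels):
    """Unaligned: snapshot immediately and persist the overtaken records."""
    in_flight = [pre_barrier(channel) for channel in channels]
    return {"state": state, "in_flight": in_flight,
            "processed_before_forwarding": 0}


def restore(checkpoint):
    """Recovery: reload the state, then replay persisted in-flight records."""
    state = checkpoint["state"]
    for channel_records in checkpoint["in_flight"]:
        for record in channel_records:
            state += record
    return state


channels = [[1, 2, 3, BARRIER, 4], [10, BARRIER, 20]]
aligned = aligned_checkpoint(0, channels)
unaligned = unaligned_checkpoint(0, channels)
```

In this model both variants recover to the same operator state, but the
unaligned one can forward the barrier without first working through the
queued records. The price is a larger checkpoint, since the in-flight
records are stored alongside the state.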
