Hi Arvid,

Thanks a lot for bringing up this discussion! On our side, checkpoints that
cannot finish are a common problem for online jobs, so +1 from my side to
implement this.

One small issue with the FLIP: the Discussion Thread URL attached to it does
not seem to be right.


Best,
Yun


------------------------------------------------------------------
From:Arvid Heise <ar...@ververica.com>
Send Time:2019 Sep. 30 (Mon.) 20:31
To:dev <dev@flink.apache.org>
Subject:[DISCUSS] FLIP-76: Unaligned checkpoints

Hi Devs,

I would like to start the formal discussion about FLIP-76 [1], which
improves the checkpoint latency in systems under backpressure, where a
checkpoint can take hours to complete in the worst case. I recommend the
thread "checkpointing under backpressure" [2] to get a good idea of why users
are not satisfied with the current behavior. The key points:

   - Since the checkpoint barrier flows much more slowly through the
   back-pressured channels, the other channels and their upstream operators
   are effectively blocked during checkpointing.
   - The checkpoint barrier takes a long time to reach the sinks, causing
   long checkpointing times. A longer checkpointing time in turn means that
   the checkpoint will be fairly outdated once done. Since a heavily utilized
   pipeline is inherently more fragile, we may run into a vicious cycle of
   late checkpoints, crash, recovery to a rather outdated checkpoint, more
   back pressure, and even later checkpoints, which would result in little to
   no progress in the application.

The FLIP proposes "unaligned checkpoints", which improve on the current
behavior such that

   - Upstream processes can continue to produce data, even if some operator
   still waits on a checkpoint barrier on a specific input channel.
   - Checkpointing times are heavily reduced across the execution graph,
   even for operators with a single input channel.
   - End-users will see more progress even in unstable environments as more
   up-to-date checkpoints will avoid too many recomputations.
   - Rescaling can be performed faster.

The key idea is to allow checkpoint barriers to be forwarded to downstream
tasks before the synchronous part of the checkpointing has been conducted
(see Fig. 1). To that end, we need to store in-flight data as part of the
checkpoint as described in greater detail in the FLIP.
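
As a rough illustration of the key idea, the following toy model contrasts the
two behaviors for one operator with several input channels. It is only a sketch
and not Flink code: the function names, the BARRIER marker, and the
representation of a channel as a list of buffered items (head at index 0, one
barrier per channel) are all assumptions made for this example.

```python
BARRIER = "BARRIER"  # hypothetical marker standing in for a checkpoint barrier


def aligned_checkpoint(channels):
    """Aligned: the barrier is forwarded only after the barriers of all
    channels have reached the head, so every record queued in front of a
    barrier has to be processed first."""
    processed_first = sum(ch.index(BARRIER) for ch in channels)
    # No channel data needs to be stored in the checkpoint.
    return {"processed_before_forward": processed_first, "in_flight": []}


def unaligned_checkpoint(channels):
    """Unaligned: the barrier overtakes the queued records and is forwarded
    right away; the overtaken records are stored as in-flight data."""
    in_flight = [ch[:ch.index(BARRIER)] for ch in channels]
    return {"processed_before_forward": 0, "in_flight": in_flight}


if __name__ == "__main__":
    # Channel 0 already has its barrier at the head; channel 1 is
    # back-pressured and still has three records in front of its barrier.
    channels = [[BARRIER, "a3"], ["b1", "b2", "b3", BARRIER]]
    print(aligned_checkpoint(channels))
    print(unaligned_checkpoint(channels))
```

In this model the aligned variant has to work through the three records of the
slow channel before the barrier can move on, whereas the unaligned variant
forwards the barrier immediately and pays for it with a larger checkpoint that
contains those three records.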

Although the basic idea was already sketched in [2], we would like to get
broader feedback in this dedicated mail thread.

Best,

Arvid

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Checkpointing-under-backpressure-td31616.html
