Re: [DISCUSS] FLIP-76: Unaligned checkpoints

Yun Gao Thu, 10 Oct 2019 03:39:59 -0700

    Hi Arvid,

    Thanks a lot for bringing up the discussion! On our side, checkpoints 
that cannot finish are a common problem for online jobs, so +1 from my side 
to implement this.
    One small issue with the FLIP: the attached Discussion Thread URL does 
not seem to be right.

Best,
Yun

------------------------------------------------------------------
From:Arvid Heise <[email protected]>
Send Time:2019 Sep. 30 (Mon.) 20:31
To:dev <[email protected]>
Subject:[DISCUSS] FLIP-76: Unaligned checkpoints

Hi Devs,

I would like to start the formal discussion about FLIP-76 [1], which
improves the checkpoint latency in systems under backpressure, where a
checkpoint can take hours to complete in the worst case. I recommend the
thread "checkpointing under backpressure" [2] to get a good idea of why users
are not satisfied with the current behavior. The key points:

- Since the checkpoint barrier flows much more slowly through the
back-pressured channels, the other channels and their upstream operators
are effectively blocked during checkpointing.
- The checkpoint barrier takes a long time to reach the sinks causing
long checkpointing times. A longer checkpointing time in turn means that
the checkpoint will be fairly outdated once done. Since a heavily utilized
pipeline is inherently more fragile, we may run into a vicious cycle of
late checkpoints, crash, recovery to a rather outdated checkpoint, more
back pressure, and even later checkpoints, which would result in little to
no progress in the application.

The FLIP proposes "unaligned checkpoints", which improve on the current state
such that

- Upstream processes can continue to produce data, even if some operator
still waits on a checkpoint barrier on a specific input channel.
- Checkpointing times are heavily reduced across the execution graph,
even for operators with a single input channel.
- End-users will see more progress even in unstable environments as more
up-to-date checkpoints will avoid too many recomputations.
- Faster rescaling is facilitated.

The key idea is to allow checkpoint barriers to be forwarded to downstream
tasks before the synchronous part of the checkpointing has been conducted
(see Fig. 1). To that end, we need to store in-flight data as part of the
checkpoint, as described in greater detail in the FLIP.
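
As a rough illustration of the key idea, the following toy model contrasts
the two modes for a single operator with several input channels. It is only a
sketch under simplifying assumptions: the names (BARRIER,
records_ahead_of_barrier, aligned, unaligned) are hypothetical and not part of
Flink's API, and output buffers, operator state, and timing are ignored.

```python
BARRIER = "BARRIER"  # hypothetical marker standing in for a checkpoint barrier


def records_ahead_of_barrier(channel):
    """Return the records queued ahead of the barrier in one input channel."""
    return channel[:channel.index(BARRIER)]


def aligned(channels):
    """Aligned mode: forward the barrier only after all channels delivered it."""
    # Every record queued ahead of a barrier has to be processed first, so a
    # slow (back-pressured) channel delays the barrier for the whole operator.
    processed = [r for ch in channels for r in records_ahead_of_barrier(ch)]
    return {"processed_before_forwarding": processed, "persisted_in_flight": []}


def unaligned(channels):
    """Unaligned mode: forward the barrier at once, persist overtaken records."""
    # The barrier overtakes the queued records; they are stored as in-flight
    # data in the checkpoint instead of being processed before forwarding.
    in_flight = [r for ch in channels for r in records_ahead_of_barrier(ch)]
    return {"processed_before_forwarding": [], "persisted_in_flight": in_flight}


# Assumed example input: channel 0 is back-pressured, channel 1 is not.
channels = [["a1", "a2", "a3", BARRIER, "a4"], [BARRIER, "b1"]]
print(aligned(channels))
print(unaligned(channels))
```

In this model the aligned variant has to work through a1, a2 and a3 before
the barrier can move on, while the unaligned variant forwards it immediately
and pays for that with a larger checkpoint containing those three records.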

Although the basic idea was already sketched in [2], we would like to get
broader feedback in this dedicated mail thread.

Best,

Arvid

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Checkpointing-under-backpressure-td31616.html
