[jira] [Commented] (FLINK-4120) Lightweight fault tolerance through recomputing lost state

Flink Jira Bot (Jira) Wed, 14 Apr 2021 16:03:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321724#comment-17321724
 ]


Flink Jira Bot commented on FLINK-4120:
---------------------------------------

This issue and all of its Sub-Tasks have not been updated for 180 days. So, it 
has been labeled "stale-minor". If you are still affected by this bug or are 
still interested in this issue, please give an update and remove the label. In 
7 days the issue will be closed automatically.

> Lightweight fault tolerance through recomputing lost state
> ----------------------------------------------------------
>
>                 Key: FLINK-4120
>                 URL: https://issues.apache.org/jira/browse/FLINK-4120
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / State Backends
>            Reporter: Dénes Vadász
>            Priority: Minor
>              Labels: stale-minor
>
> The current fault tolerance mechanism requires that stateful operators write 
> their internal state to stable storage during a checkpoint. 
> This proposal aims at optimizing out this operation in the cases where the 
> operator state can be recomputed from a finite (practically: small) set of 
> source records, and those records are already on checkpoint-aware persistent 
> storage (e.g. in Kafka). 
> The rationale behind the proposal is that the cost of reprocessing is paid 
> only on recovery from (presumably rare) failures, whereas the cost of 
> persisting the state is paid on every checkpoint. Eliminating the need for 
> persistent storage will also simplify system setup and operation.
> In the cases where this optimisation is applicable, the state of the 
> operators can be restored by restarting the pipeline from a checkpoint taken 
> before the pipeline ingested any of the records required for the state 
> re-computation of the operators (call this the "null-state checkpoint"), as 
> opposed to a restart from the "latest checkpoint". 
> The "latest checkpoint" is still relevant for the recovery: the barriers 
> belonging to that checkpoint must be inserted into the source streams in the 
> position they were originally inserted. Sinks must discard all records until 
> this barrier reaches them.
> Note the inherent relationship between the "latest" and the "null-state" 
> checkpoints: the pipeline must be restarted from the latter to restore the 
> state at the former.
> For the stateful operators for which this optimization is applicable we can 
> define the notion of "current null-state watermark" as the watermark such 
> that the operator can correctly (re)compute its current state merely from 
> records after this watermark. 
>  
> For the checkpoint-coordinator to be able to compute the null-state 
> checkpoint, each stateful operator should report its "current null-state 
> watermark" as part of acknowledging the ongoing checkpoint. The null-state 
> checkpoint of the ongoing checkpoint is the most recent checkpoint preceding 
> all the received null-state watermarks (assuming the pipeline preserves the 
> relative order of barriers and watermarks).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-4120) Lightweight fault tolerance through recomputing lost state

Reply via email to