[ https://issues.apache.org/jira/browse/FLINK-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321724#comment-17321724 ]
Flink Jira Bot commented on FLINK-4120: --------------------------------------- This issue and all of its Sub-Tasks have not been updated for 180 days. So, it has been labeled "stale-minor". If you are still affected by this bug or are still interested in this issue, please give an update and remove the label. In 7 days the issue will be closed automatically. > Lightweight fault tolerance through recomputing lost state > ---------------------------------------------------------- > > Key: FLINK-4120 > URL: https://issues.apache.org/jira/browse/FLINK-4120 > Project: Flink > Issue Type: New Feature > Components: Runtime / State Backends > Reporter: Dénes Vadász > Priority: Minor > Labels: stale-minor > > The current fault tolerance mechanism requires that stateful operators write > their internal state to stable storage during a checkpoint. > This proposal aims at optimizing out this operation in the cases where the > operator state can be recomputed from a finite (practically: small) set of > source records, and those records are already on checkpoint-aware persistent > storage (e.g. in Kafka). > The rationale behind the proposal is that the cost of reprocessing is paid > only on recovery from (presumably rare) failures, whereas the cost of > persisting the state is paid on every checkpoint. Eliminating the need for > persistent storage will also simplify system setup and operation. > In the cases where this optimisation is applicable, the state of the > operators can be restored by restarting the pipeline from a checkpoint taken > before the pipeline ingested any of the records required for the state > re-computation of the operators (call this the "null-state checkpoint"), as > opposed to a restart from the "latest checkpoint". > The "latest checkpoint" is still relevant for the recovery: the barriers > belonging to that checkpoint must be inserted into the source streams in the > position they were originally inserted. Sinks must discard all records until > this barrier reaches them. > Note the inherent relationship between the "latest" and the "null-state" > checkpoints: the pipeline must be restarted from the latter to restore the > state at the former. > For the stateful operators for which this optimization is applicable we can > define the notion of "current null-state watermark" as the watermark such > that the operator can correctly (re)compute its current state merely from > records after this watermark. > > For the checkpoint-coordinator to be able to compute the null-state > checkpoint, each stateful operator should report its "current null-state > watermark" as part of acknowledging the ongoing checkpoint. The null-state > checkpoint of the ongoing checkpoint is the most recent checkpoint preceding > all the received null-state watermarks (assuming the pipeline preserves the > relative order of barriers and watermarks). -- This message was sent by Atlassian Jira (v8.3.4#803005)