Hi!

This is an awesome proposal, I am looking forward to seeing it in action :)

Some things I have been wondering:

Which component decides whether the next checkpoint should be a delta or
not? I guess the more deltas we take the longer the recovery time will be
if there is many overwrites in the database, on the other hand if we rarely
overwrite values it might make sense to keep a lot of deltas. Maybe the
statebackend should be able to decide this, it might not make sense to
preconfigure this to a fix value, but I am not sure. We could for instance
use bloom filters to make the decision.

If we created deltas with a lot of duplicate keys then the recovery will
suffer potentially outweighing the benefits of the incremental checkpoint
itself (it might violate some strict SLA on recovery), would it make sense
in some cases to do a merge of the deltas in background batch jobs? This
would probably make bookkeeping much harder, so just an idea I wanted to
throw in there.

Otherwise it seems that a lot of thought went into this, and looks very
good!

Have a nice weekend!
Gyula

SHI Xiaogang <shixiaoga...@gmail.com> ezt írta (időpont: 2017. febr. 14.,
K, 4:18):

> Hi all,
>
>
> Incremental checkpointing can help a lot in improving the efficiency of
> fault tolerance and recovery in Flink. I wrote an initial design of
> incremental checkpointing in Flink, and am looking forwards for your
> comments.
>
>
>
> https://docs.google.com/document/d/1VvvPp09gGdVb9D2wHx6NX99yQK0jSUMWHQPrQ_mn520/edit?usp=sharing
>
>
> Some more issues, I think, are needed to be discussed in the introduction
> of incremental checkpointing.
>
>
> One is the implementation of savepoints. Savepoints are supposed to be full
> and independent of backend implementation. Currently, the implementation of
> Savepoints and Checkpoints are identical in backends. With the introduction
> of incremental checkpointing, I think backends should take different
> snapshots for them.
>
>
> Regards,
>
> Xiaogang
>

Reply via email to