Hi Robert,

Your understanding is right!

To add some more information: the JobManager is not only responsible for cleaning up old checkpoints, it also needs to write the checkpoint metadata file to the checkpoint storage after all TaskManagers have taken their snapshots.
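In case it helps make this concrete, below is a minimal sketch (not taken from the docs; the bucket and path are placeholders) of pointing a job at a shared checkpoint directory from code. It is the programmatic equivalent of setting state.checkpoints.dir in the configuration, and the comments note what the JobManager and TaskManagers each write into that directory:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SharedCheckpointStorageSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000);

        // Point checkpoint storage at durable shared storage that every
        // JobManager and TaskManager can reach (the bucket name here is a
        // placeholder, and the s3:// scheme needs the S3 filesystem plugin
        // on all nodes). TaskManagers write their state snapshot files under
        // this path, and the JobManager writes the _metadata file for each
        // completed checkpoint (e.g. <dir>/<job-id>/chk-<n>/_metadata) and
        // later deletes old checkpoints from the same location.
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink-checkpoints");

        // Trivial pipeline just so the sketch is runnable.
        env.fromSequence(1, 100).print();

        env.execute("shared-checkpoint-storage-sketch");
    }
}
```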


-------------------------------
Best
Feifan Wang



At 2024-03-28 06:30:54, "Robert Young" <robertyoun...@gmail.com> wrote:

Hi all, I have some questions about checkpoint and savepoint storage.

From what I understand, a distributed, production-quality job with a lot of 
state should use durable shared storage for checkpoints and savepoints. All job 
managers and task managers should access the same volume, so typically you'd 
use Hadoop, S3, Azure etc.

In the docs [1] it states for state.checkpoints.dir: "The storage path must be 
accessible from all participating processes/nodes(i.e. all TaskManagers and 
JobManagers)."

I want to understand why that is exactly. Here's my understanding:

1. The JobManager is responsible for cleaning old checkpoints, so it needs 
access to all the files written out by all the TaskManagers in order to remove 
them.
2. For recovery/rescaling, if all nodes share the same volume then TaskManagers 
can easily read and redistribute the checkpoint data.

Is that correct? Are there more aspects to why the directory must be shared 
across the processes?

Thank you,
Rob Young

1. 
https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/datastream/fault-tolerance/checkpointing/#related-config-options
