Thank you both for the information!

Rob

On Thu, Mar 28, 2024 at 7:08 PM Asimansu Bera <asimansu.b...@gmail.com> wrote:

> To add more detail, so it is clear why access to the persistent object
> store from all JVM processes is required for consistent recovery of a
> Flink job graph:
>
> *JobManager:*
>
> Flink's JobManager writes critical metadata during checkpoints for fault
> tolerance:
>
>    - Job Configuration: Preserves job settings (parallelism, state
>    backend) for consistent restarts.
>    - Progress Information: Stores offsets (source/sink positions) to
>    resume processing from the correct point after failures.
>    - Checkpoint Counters: Provides metrics (ID, timestamp, duration) for
>    monitoring checkpointing behavior.
>
> *TaskManagers:*
>
> While the JobManager handles checkpoint metadata, TaskManagers are the
> workhorses during Flink checkpoints. Here's what they do:
>
>    - State Snapshots: Upon receiving checkpoint instructions,
>    TaskManagers capture snapshots of their current state. This state
>    includes in-memory data and operator variables crucial for resuming
>    processing.
>    - State Serialization: The captured state is transformed into a
>    format suitable for storage, often byte arrays. This serialized data
>    represents the actual application state.
>
> Good network bandwidth is crucial so that TaskManagers can write large
> operator state to the HDFS/S3 object store quickly. Some customers use
> NFS as the persistent store, which is not recommended: NFS is slow and
> slows down checkpointing.
>
> -A
>
>
> On Wed, Mar 27, 2024 at 7:52 PM Feifan Wang <zoltar9...@163.com> wrote:
>
>> Hi Robert,
>>
>> Your understanding is right!
>> To add some more information: the JobManager is not only responsible
>> for cleaning up old checkpoints, it also needs to write the metadata
>> file to checkpoint storage after all TaskManagers have taken their
>> snapshots.
>>
>> -------------------------------
>> Best,
>> Feifan Wang
>>
>> At 2024-03-28 06:30:54, "Robert Young" <robertyoun...@gmail.com> wrote:
>>
>> Hi all, I have some questions about checkpoint and savepoint storage.
>>
>> From what I understand, a distributed, production-quality job with a
>> lot of state should use durable shared storage for checkpoints and
>> savepoints. All JobManagers and TaskManagers should access the same
>> volume, so typically you'd use Hadoop, S3, Azure, etc.
>>
>> The docs [1] state for state.checkpoints.dir: "The storage path must
>> be accessible from all participating processes/nodes(i.e. all
>> TaskManagers and JobManagers)."
>>
>> I want to understand why that is exactly. Here's my understanding:
>>
>> 1. The JobManager is responsible for cleaning up old checkpoints, so
>> it needs access to all the files written out by all the TaskManagers
>> so it can remove them.
>> 2. For recovery/rescaling, if all nodes share the same volume then
>> TaskManagers can read/redistribute the checkpoint data easily, since
>> the volume is shared.
>>
>> Is that correct? Are there more aspects to why the directory must be
>> shared across the processes?
>>
>> Thank you,
>> Rob Young
>>
>> 1.
>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/datastream/fault-tolerance/checkpointing/#related-config-options
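
To make the shared-path requirement concrete, here is a sketch of the
layout Flink creates under state.checkpoints.dir (the bucket name and
job ID are placeholders). The JobManager writes the _metadata file and
deletes expired chk-* directories, while the TaskManagers write the
state files into the same tree, which is why every process needs access
to the same path:

    s3://my-bucket/flink/checkpoints/
        <job-id>/
            chk-42/
                _metadata      <- written by the JobManager after all
                                  TaskManagers acknowledge their snapshots
                <state files>  <- serialized operator state written by
                                  the TaskManagers
            shared/            <- state shared across checkpoints
                                  (incremental checkpoints)
            taskowned/         <- state owned by the TaskManagers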
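
And a minimal sketch (Java DataStream API; the S3 path is a placeholder)
of pointing a job at shared, durable checkpoint storage, equivalent to
setting state.checkpoints.dir in the Flink configuration:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SharedCheckpointStorage {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Trigger a checkpoint every 60 seconds.
            env.enableCheckpointing(60_000);

            // Durable storage reachable from all JobManagers and
            // TaskManagers. As noted above, prefer an object store or
            // HDFS over NFS: slow writes stall checkpointing.
            env.getCheckpointConfig()
                    .setCheckpointStorage("s3://my-bucket/flink/checkpoints");

            // ... define sources, operators, and sinks here, then:
            // env.execute("checkpointed-job");
        }
    }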