Hi Sameer If you only have one disk for one TM, 10 TMs could deploy at most 10 disks while 100TM could deploy at most 100 disks. The sync checkpoint phase of RocksDB need to write disk and if you could distribute the write pressure over more disks, you could get better performance which is what you observed.
The synchronous checkpoint phase actually means the task can only execute checkpoint and cannot process elements at that time. On the other hand, the asynchronous phase means the task upload checkpoint data asyncly and could still process elements at that time. Moreover, flushing RocksDB in sync phase is executed in task main thread and one TM could have many task main threads. Since the synchronous checkpoint phase is only triggered after barrier alignment finished, we cannot ensure all RocksDB instances would execute flushing at the same time. Best Yun Tang ________________________________ From: Sameer W <sam...@axiomine.com> Sent: Thursday, June 18, 2020 3:34 To: user <user@flink.apache.org> Subject: Implications on incremental checkpoint duration based on RocksDBStateBackend, Number of Slots per TM Hi, The number of RocksDB databases the Flink creates is equal to the number of operator states multiplied by the number of slots. Assuming a parallelism of 100 for a job which is executed on 100 TM's with 1 slot per TM as opposed to 10 TM's with 10 slots per TM I have noticed that the former configuration is more efficient for incremental checkpointing. In both cases the number of RocksDB databases is the same, except in the latter case 10 times as many are created in one TM vs the former case. Reading the link<https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html> below, it says - "link uses this to figure out the state changes. To do this, Flink triggers a flush in RocksDB, forcing all memtables into sstables on disk, and hard-linked in a local temporary directory. This process is synchronous to the processing pipeline, and Flink performs all further steps asynchronously and does not block processing." What does "Synchronous to the processing pipeline" mean? Does it mean that flushing to DB happens synchronously (serially) for all RocksDB databases in one TM? Is the flushing single threaded per TM Thanks, Sameer <https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html>