Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

Jinzhong Li Wed, 20 Mar 2024 05:42:17 -0700

Hi Yue,

Thanks for your feedback!

> 1. If we choose Option-3 for ForSt, how would we handle the Manifest
> file? Should we take a snapshot of the Manifest during the
> synchronization phase?

IIUC, the GetLiveFiles() API used in Option-3 also returns the file info
of the Manifest file, including the manifest file size. This means the
API can snapshot the Manifest file info (filename + file size) during the
synchronization phase.
You can refer to the RocksDB source code [1] to verify this.
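
To illustrate why recording (filename + file size) in the sync phase is
enough, here is a minimal sketch. It assumes the Manifest is append-only,
so the prefix up to the recorded size remains a valid snapshot even if the
DB keeps writing afterwards. The helper names and file contents are
hypothetical and only simulate the idea; this is not ForSt or RocksDB code.

```python
import os
import tempfile


def snapshot_manifest_info(path):
    """Sync phase: record (filename, size) only; no data is copied yet."""
    return path, os.path.getsize(path)


def copy_manifest_prefix(src, dst, size):
    """Async phase: copy only the first `size` bytes recorded earlier."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read(size))


def demo():
    with tempfile.TemporaryDirectory() as d:
        # Hypothetical manifest name and contents, for illustration only.
        manifest = os.path.join(d, "MANIFEST-000001")
        with open(manifest, "wb") as f:
            f.write(b"edit-1;edit-2;")

        # Synchronization phase: capture the file info.
        name, size = snapshot_manifest_info(manifest)

        # The DB keeps appending version edits after the sync phase.
        with open(manifest, "ab") as f:
            f.write(b"edit-3;")

        # Asynchronous phase: upload/copy only the recorded prefix.
        dst = os.path.join(d, "checkpoint-MANIFEST")
        copy_manifest_prefix(name, dst, size)
        with open(dst, "rb") as f:
            return f.read()
```

In this sketch the checkpointed copy contains only the edits that existed
at sync time, so it stays consistent with the MetaInfo taken at the same
point.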

> However, many distributed storage systems do not support Fast
> Duplicate (such as HDFS). But ForSt has the ability to directly read
> and write remote files. Instead of copying or fast-duplicating these
> files, can we directly reuse and reference the remote files? I think
> this can reduce file download time and may be more useful for most
> users on HDFS (which does not support Fast Duplicate).

First, as far as I know, most remote file systems support FastDuplicate,
e.g. S3 copyObject, Azure Blob Storage copyBlob, and OSS copyObject. HDFS
indeed does not support it.
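
As a rough sketch of how restore could pick a strategy per storage system:
the capability table below only encodes the examples named above, the URI
schemes are illustrative labels, and the plain byte copy fallback is an
assumption of this sketch, not something FLIP-428 specifies here.

```python
from urllib.parse import urlparse

# Hypothetical capability table based on the examples in this thread.
# The scheme names are illustrative, not an exhaustive or official list.
FAST_DUPLICATE_SCHEMES = {"s3", "azure-blob", "oss"}


def restore_strategy(checkpoint_uri):
    """Choose how to bring a checkpoint file into the working dir."""
    scheme = urlparse(checkpoint_uri).scheme
    if scheme in FAST_DUPLICATE_SCHEMES:
        # Server-side copy: no data flows through the TaskManager.
        return "fast-duplicate"
    # Assumed fallback for systems such as HDFS: read and rewrite bytes.
    return "byte-copy"
```

For example, `restore_strategy("hdfs://namenode/chk-1/000012.sst")` falls
back to the byte copy path, which is the case you are concerned about.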

Actually, we have considered a design that reuses remote files, and that
is what we want to implement in the near future: both checkpoints and
restores would reuse existing files residing on the remote state storage.
However, this design conflicts with the current file management in
Flink. At present, remote state files are managed by ForStDB (on the
TaskManager side), while checkpoint files are managed by the JobManager,
which is a major obstacle to file reuse. For example, problems could arise
if a TM reuses a checkpoint file that the JM subsequently deletes.
Therefore, as mentioned in FLIP-423[2], our roadmap is to first integrate
the checkpoint/restore mechanisms with the existing framework in
milestone-1. Then, in milestone-2, we plan to introduce the TM State
Ownership and Faster Checkpointing mechanisms, which will allow both
checkpointing and restoring to directly reuse remote files and thus
become faster.
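
The ownership hazard can be shown with a toy simulation. Everything here
(the in-memory RemoteStorage class, the function names, the file paths) is
hypothetical and only models the situation described above; it is not
Flink or ForSt code.

```python
class RemoteStorage:
    """Toy in-memory stand-in for a remote file system."""

    def __init__(self):
        self.files = {}

    def put(self, path, data):
        self.files[path] = data

    def delete(self, path):
        self.files.pop(path, None)

    def duplicate(self, src, dst):
        # Stand-in for a server-side copy such as copyObject.
        self.files[dst] = self.files[src]


def restore_by_reuse(checkpoint_files):
    """TM working state points directly at JM-owned checkpoint files."""
    return list(checkpoint_files)


def restore_by_duplicate(storage, checkpoint_files, working_dir):
    """TM gets its own copies, independent of the checkpoint lifecycle."""
    working = []
    for path in checkpoint_files:
        dst = working_dir + "/" + path.rsplit("/", 1)[-1]
        storage.duplicate(path, dst)
        working.append(dst)
    return working


def jm_discard_checkpoint(storage, checkpoint_files):
    """JM deletes a checkpoint without knowing who else references it."""
    for path in checkpoint_files:
        storage.delete(path)


def demo():
    storage = RemoteStorage()
    chk = ["chk-1/000012.sst", "chk-1/MANIFEST-000005"]
    for path in chk:
        storage.put(path, b"data")

    reused = restore_by_reuse(chk)
    duplicated = restore_by_duplicate(storage, chk, "working-dir")
    jm_discard_checkpoint(storage, chk)

    reuse_ok = all(p in storage.files for p in reused)
    duplicate_ok = all(p in storage.files for p in duplicated)
    return reuse_ok, duplicate_ok
```

In this model, `demo()` reports that the reused files are gone after the
JM discards the checkpoint, while the duplicated files survive. That is
why milestone-1 duplicates files and direct reuse waits for TM State
Ownership.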

[1]
https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/db/db_filesnapshot.cc#L77
[2]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-RoadMap+LaunchingPlan

Best,
Jinzhong

On Wed, Mar 20, 2024 at 4:01 PM yue ma <[email protected]> wrote:

> Hi Jinzhong
>
> Thank you for initiating this FLIP.
>
> I have just some minor questions:
>
> 1. If we choose Option-3 for ForSt, how would we handle the Manifest
> file? Should we take a snapshot of the Manifest during the
> synchronization phase? Otherwise, might the Manifest and MetaInfo
> information be inconsistent during recovery?
> 2. For the Restore Operation, we need to Fast Duplicate checkpoint files
> to the working dir. However, many distributed storage systems do not
> support Fast Duplicate (such as HDFS). But ForSt has the ability to
> directly read and write remote files. Instead of copying or
> fast-duplicating these files, can we directly reuse and reference the
> remote files? I think this can reduce file download time and may be
> more useful for most users on HDFS (which does not support Fast
> Duplicate).
>
> --
> Best,
> Yue
>
