Hi jinzhong,
Thanks for your reply. I still have some doubts about the first question. Is
there such a case:
a snapshot is taken during the synchronization phase and records the
CURRENT file and manifest 8, but before the asynchronous phase starts, the
manifest reaches its size threshold, so CURRENT is switched to point at the
new manifest 9, and then the now-incorrect CURRENT file is uploaded?
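To make the case concrete, here is a minimal Python sketch of the race I have in mind (the file names follow RocksDB conventions, where CURRENT names the active MANIFEST; the phase boundaries and data structures are simplified assumptions, not actual ForSt code):

```python
# Synchronization phase: the snapshot records CURRENT and the live
# manifest's name + size (as GetLiveFiles() would report them).
current_file = "MANIFEST-000008"
snapshot = {"current": current_file, "manifest": ("MANIFEST-000008", 4096)}

# Before the asynchronous phase: the manifest reaches its size
# threshold, so the DB rolls to a new manifest and rewrites CURRENT.
current_file = "MANIFEST-000009"

# Asynchronous phase: if the *live* CURRENT file is uploaded now, it
# points at a manifest the snapshot never captured.
uploaded_current = current_file
mismatch = uploaded_current != snapshot["manifest"][0]
print(mismatch)  # True: uploaded CURRENT refers to MANIFEST-000009
```

In this sketch the uploaded CURRENT and the snapshotted manifest info disagree, which is exactly the inconsistency I am asking about.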

On Wed, Mar 20, 2024 at 8:13 PM, Jinzhong Li <lijinzhong2...@gmail.com> wrote:

> Hi Yue,
>
> Thanks for your feedback!
>
> > 1. If we choose Option-3 for ForSt, how would we handle the Manifest File?
> > Should we take a snapshot of the Manifest during the synchronization
> > phase?
>
> IIUC, the GetLiveFiles() API in Option-3 can also capture the fileInfo of
> Manifest files, and this API also returns the manifest file size, which
> means it can take a snapshot of the Manifest FileInfo (filename +
> fileSize) during the synchronization phase.
> You can refer to the RocksDB source code[1] to verify this.
>
>
> > However, many distributed storage systems do not support
> > Fast Duplicate (such as HDFS). But ForSt has the ability to
> > directly read and write remote files. Could we, instead of copying or
> > fast-duplicating these files, directly reuse and reference these remote
> > files? I think this would reduce file download time and may be more useful
> > for most users who use HDFS (which does not support Fast Duplicate)?
>
> Firstly, as far as I know, most remote file systems support
> FastDuplicate, e.g. S3 copyObject, Azure Blob Storage copyBlob, and OSS
> copyObject, while HDFS indeed does not support FastDuplicate.
>
> Actually, we have considered a design that reuses remote files, and that
> is what we want to implement in the coming future, where both checkpoints
> and restores can reuse existing files residing on the remote state storage.
> However, this design conflicts with the current file management system in
> Flink. At present, remote state files are managed by ForStDB (on the
> TaskManager side), while checkpoint files are managed by the JobManager,
> which is a major hindrance to file reuse. For example, issues could arise
> if a TM reuses a checkpoint file that is subsequently deleted by the JM.
> Therefore, as mentioned in FLIP-423[2], our roadmap is to first integrate
> the checkpoint/restore mechanisms with the existing framework at
> milestone-1. Then, at milestone-2, we plan to introduce TM State Ownership
> and Faster Checkpointing mechanisms, which will allow both checkpointing
> and restoring to directly reuse remote files, thus achieving faster
> checkpointing and restoring.
>
> [1]
>
> https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/db/db_filesnapshot.cc#L77
> [2]
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-RoadMap+LaunchingPlan
>
> Best,
> Jinzhong
>
> On Wed, Mar 20, 2024 at 4:01 PM yue ma <mayuefi...@gmail.com> wrote:
>
> > Hi Jinzhong
> >
> > Thank you for initiating this FLIP.
> >
> > I have just a couple of minor questions:
> >
> > 1. If we choose Option-3 for ForSt, how would we handle the Manifest File?
> > Should we take a snapshot of the Manifest during the synchronization
> > phase? Otherwise, might the Manifest and MetaInfo information be
> > inconsistent during recovery?
> > 2. For the Restore Operation, we need to Fast Duplicate Checkpoint Files
> > to the Working Dir. However, many distributed storage systems do not
> > support Fast Duplicate (such as HDFS). But ForSt has the ability to
> > directly read and write remote files. Could we, instead of copying or
> > fast-duplicating these files, directly reuse and reference these remote
> > files? I think this would reduce file download time and may be more useful
> > for most users who use HDFS (which does not support Fast Duplicate)?
> >
> > --
> > Best,
> > Yue
> >
>


-- 
Best,
Yue
