Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-31 Thread Jinzhong Li
, 2024 12:45 To: dev@flink.apache.org Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State Hi Feifan, Sorry for the misunderstanding. As Hangxiang explained, the basic cleanup mechanism for remote working directory is

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-29 Thread Yun Tang
: Jinzhong Li Sent: Thursday, March 28, 2024 12:45 To: dev@flink.apache.org Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State Hi Feifan, Sorry for the misunderstanding. As Hangxiang explained, the basic cleanup mechanism for remote working directory

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-27 Thread Jinzhong Li
On Wed, Mar 27, 2024 at 11:49 PM Yun Tang wrote: > Hi Jinzhong, > The overall design looks good. > I have two mino

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-27 Thread Hangxiang Yu
ve two minor questions: 1. Why must we have another 'subTask-checkpoint-sub-dir' under the shared directory? If we don't consider making TM ownership in this FLIP, this design seems unnecessary. 2. This FLIP forgets

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-27 Thread Feifan Wang
aking TM ownership in this FLIP, this design seems unnecessary. 2. This FLIP forgets to mention the cleanup of the remote working directory in case the taskmanager crashes. Even though this is an open problem, we can still leave some space fo

Re:Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-27 Thread Feifan Wang
re optimization. Best, Yun Tang From: Jinzhong Li Sent: Monday, March 25, 2024 10:41 To: dev@flink.apache.org Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-27 Thread Jinzhong Li
Yun Tang From: Jinzhong Li Sent: Monday, March 25, 2024 10:41 To: dev@flink.apache.org Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State Hi Yue, Thanks for your co

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-27 Thread Yun Tang
: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State Hi Yue, Thanks for your comments. The CURRENT is a special file that points to the latest manifest log file. As Zakelly explained above, we could record the latest manifest filename during sync phase, and write

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-24 Thread Jinzhong Li
Hi Yue, Thanks for your comments. The CURRENT is a special file that points to the latest manifest log file. As Zakelly explained above, we could record the latest manifest filename during sync phase, and write the filename into CURRENT snapshot file during async phase. Best, Jinzhong On Fri,
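
A minimal sketch of the idea described in this reply, not the FLIP's actual implementation: the manifest name that CURRENT points to is read into memory during the sync phase, and the CURRENT file of the snapshot is written from that recorded name during the async phase. The function names, the directory layout (`working`, `chk-1`), and the manifest file names are hypothetical and chosen only for illustration. The sketch also assumes the older manifest file is still present in the working directory when the async phase runs, and it does not deal with the manifest continuing to grow after the sync phase.

```python
import shutil
import tempfile
from pathlib import Path


def sync_phase(working_dir: Path) -> str:
    """Record, in memory only, the manifest name that CURRENT points to."""
    return (working_dir / "CURRENT").read_text().strip()


def async_phase(working_dir: Path, checkpoint_dir: Path, manifest_name: str) -> None:
    """Copy the recorded manifest and write a snapshot CURRENT that names it."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(working_dir / manifest_name, checkpoint_dir / manifest_name)
    # The snapshot CURRENT is built from the recorded name, not from the
    # working directory's CURRENT, which may have changed in the meantime.
    (checkpoint_dir / "CURRENT").write_text(manifest_name + "\n")


def demo() -> str:
    """Simulate a manifest rollover between the sync and async phases."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "working"
        chk = Path(tmp) / "chk-1"
        work.mkdir()
        (work / "MANIFEST-000008").write_text("edits up to the checkpoint\n")
        (work / "CURRENT").write_text("MANIFEST-000008\n")

        recorded = sync_phase(work)

        # Before the async phase starts, the store switches to a new manifest.
        (work / "MANIFEST-000009").write_text("newer edits\n")
        (work / "CURRENT").write_text("MANIFEST-000009\n")

        async_phase(work, chk, recorded)
        return (chk / "CURRENT").read_text().strip()


if __name__ == "__main__":
    print(demo())
```

In this simulation the snapshot CURRENT keeps pointing at the manifest recorded during the sync phase, even though the working directory has since moved on to a newer one.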

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-22 Thread Zakelly Lan
Hi Yue, Thanks for bringing this up! The CURRENT file is the special one, which should be snapshotted during the sync phase (temporarily loaded into memory). Thus we can solve this. Best, Zakelly On Fri, Mar 22, 2024 at 4:55 PM yue ma wrote: > Hi jinzhong, > Thanks for you reply. I still have some

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-22 Thread yue ma
Hi Jinzhong, Thanks for your reply. I still have some doubts about the first question. Is there such a case: when you made a snapshot during the synchronization phase, you recorded the CURRENT and manifest 8, but before the asynchronous phase, the manifest reached the size threshold and then the CURRENT

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-21 Thread Jinzhong Li
Hi Jeyhun, Thanks for your thoughtful feedback! > Why don't we consider an option where checkpoint directory just contains metadata. So, we do not need to copy the data all the time from working directory to the checkpointing directory. Basically, when checkpointing, 1) we mark files in

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-21 Thread Jeyhun Karimov
Hi Jinzhong, Thanks for the FLIP. +1 for it. I have a few questions: - Why don't we consider an option where checkpoint directory just contains metadata. So, we do not need to copy the data all the time from working directory to the checkpointing directory. Basically, when checkpointing, 1) we

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-20 Thread Jinzhong Li
Hi Yue, Thanks for your feedback! > 1. If we choose Option-3 for ForSt, how would we handle the Manifest File? Should we take a snapshot of the Manifest during the synchronization phase? IIUC, the GetLiveFiles() API in Option-3 can also catch the fileInfo of Manifest files, and this API also

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-20 Thread yue ma
Hi Jinzhong, Thank you for initiating this FLIP. I have just some minor questions: 1. If we choose Option-3 for ForSt, how would we handle the Manifest File? Should we take a snapshot of the Manifest during the synchronization phase? Otherwise, may the Manifest and MetaInfo information be

Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-19 Thread Jinzhong Li
Hi everyone, This discussion has been open for a while and there have been no new comments for several days. As a sub-FLIP of FLIP-423, which is nearing a consensus, I would like to start a vote after 72 hours. Please let me know if you have any concerns, thanks! Best, Jinzhong On Thu, Mar 7, 2024

[DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

2024-03-07 Thread Jinzhong Li
Hi devs, I'd like to start a discussion on a sub-FLIP of FLIP-423: Disaggregated State Storage and Management[1], which is a joint work of Yuan Mei, Zakelly Lan, Jinzhong Li, Hangxiang Yu, Yanfei Lei and Feng Wang: - FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State