Thanks for the pointer, Rui! I have reviewed FLIP-383, and based on my understanding, this feature should be enabled by default for batch jobs in the future. Therefore, +1 for checking the parameters and issuing log warnings when the user explicitly configures execution.batch.job-recovery.enabled to true.
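A minimal sketch of such a check, assuming the configuration is available as a plain key/value map. The class and method names are hypothetical and not part of Flink, the option keys are the ones proposed in this thread, and the assumed default of `true` for the high-availability option is an illustration only. It warns only when the batch option is explicitly set to true:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not Flink code) that cross-checks the two job-recovery options. */
final class JobRecoveryConfigCheck {
    // Option keys as proposed in this thread.
    static final String HA_JOB_RECOVERY = "high-availability.job-recovery.enabled";
    static final String BATCH_JOB_RECOVERY = "execution.batch.job-recovery.enabled";

    private JobRecoveryConfigCheck() {}

    /** Returns the warnings to log; the list is empty when the combination is consistent. */
    static List<String> validate(Map<String, String> conf) {
        List<String> warnings = new ArrayList<>();
        // Assumption: HA job recovery counts as enabled unless explicitly disabled.
        boolean haRecovery = Boolean.parseBoolean(conf.getOrDefault(HA_JOB_RECOVERY, "true"));
        // A missing key yields null, which parseBoolean treats as false,
        // so only an explicit "true" triggers the warning.
        boolean batchExplicitlyOn = Boolean.parseBoolean(conf.get(BATCH_JOB_RECOVERY));
        if (!haRecovery && batchExplicitlyOn) {
            warnings.add(BATCH_JOB_RECOVERY + " is set to true, but " + HA_JOB_RECOVERY
                    + " is false, so batch jobs will not be recovered.");
        }
        return warnings;
    }
}
```

To try the sketch, `JobRecoveryConfigCheck.validate(conf).forEach(System.err::println)` prints the warnings; a real implementation would presumably send them to the logger at startup instead.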
+1 for high-availability.job-recovery.enabled, which fits the YAML hierarchy better.

Best,
Yangze Guo

On Mon, Jan 8, 2024 at 3:43 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> Thanks to Yangze for driving this proposal!
>
> Overall it looks good to me! This proposal is useful for
> performance when the job doesn't need failover.
>
> I have some minor questions:
>
> 1. How does it work with FLIP-383[1]?
>
> This FLIP introduces high-availability.enable-job-recovery,
> and FLIP-383 introduces execution.batch.job-recovery.enabled.
>
> IIUC, when high-availability.enable-job-recovery is false, the job
> cannot recover even if execution.batch.job-recovery.enabled = true,
> right?
>
> If so, could we check these parameters and log a warning? Or
> disable execution.batch.job-recovery.enabled directly when
> high-availability.enable-job-recovery = false.
>
> 2. Could we rename it to high-availability.job-recovery.enabled to unify
> the naming?
>
> WDYT?
>
> [1] https://cwiki.apache.org/confluence/x/QwqZE
>
> Best,
> Rui
>
> On Mon, Jan 8, 2024 at 2:04 PM Yangze Guo <karma...@gmail.com> wrote:
>
> > Thanks for your comment, Yong.
> >
> > Here are my thoughts on the splitting of HighAvailabilityServices:
> > Firstly, I would treat this separation as a result of technical debt
> > and a side effect of the FLIP. In order to achieve a cleaner interface
> > hierarchy for High Availability before Flink 2.0, the design decision
> > should not be limited to OLAP scenarios.
> > I agree that the current HAServices can be divided based on either the
> > actual target (cluster & job) or the type of functionality (leader
> > election & persistence). From a conceptual perspective, I do not see
> > one approach being better than the other. However, I have chosen the
> > current separation for a clear separation of concerns. After FLIP-285,
> > each process has a dedicated LeaderElectionService responsible for
> > leader election of all the components within it.
> > This LeaderElectionService has its own lifecycle management. If we were to
> > split the HAServices into 'ClusterHighAvailabilityService' and
> > 'JobHighAvailabilityService', we would need to couple the lifecycle
> > management of these two interfaces, as they both rely on the
> > LeaderElectionService and other relevant classes. This coupling and
> > implicit design assumption would increase the complexity and testing
> > difficulty of the system. WDYT?
> >
> > Best,
> > Yangze Guo
> >
> > On Mon, Jan 8, 2024 at 12:08 PM Yong Fang <zjur...@gmail.com> wrote:
> > >
> > > Thanks Yangze for starting this discussion. I have one comment: why do we
> > > need to abstract two services as `LeaderServices` and
> > > `PersistenceServices`?
> > >
> > > From the content, the purpose of this FLIP is to make job failover more
> > > lightweight, so it would be more appropriate to abstract two services as
> > > `ClusterHighAvailabilityService` and `JobHighAvailabilityService` instead
> > > of `LeaderServices` and `PersistenceServices` based on leader and store. In
> > > this way, we can create a `JobHighAvailabilityService` that has a leader
> > > service and store for the job that meets the requirements based on the
> > > configuration in the zk/k8s high availability service.
> > >
> > > WDYT?
> > >
> > > Best,
> > > Fang Yong
> > >
> > > On Fri, Dec 29, 2023 at 8:10 PM xiangyu feng <xiangyu...@gmail.com> wrote:
> > >
> > > > Thanks Yangze for restarting this discussion.
> > > >
> > > > +1 for the overall idea. By splitting the HighAvailabilityServices into
> > > > LeaderServices and PersistenceServices, we may support configuring
> > > > different storage behind them in the future.
> > > >
> > > > We did run into real problems in production where too much job metadata
> > > > was being stored on ZK, causing system instability.
> > > >
> > > > On Fri, Dec 29, 2023 at 10:21, Yangze Guo <karma...@gmail.com> wrote:
> > > >
> > > > > Thanks for the response, Zhanghao.
> > > > >
> > > > > PersistenceServices sounds good to me.
> > > > >
> > > > > Best,
> > > > > Yangze Guo
> > > > >
> > > > > On Wed, Dec 27, 2023 at 11:30 AM Zhanghao Chen
> > > > > <zhanghao.c...@outlook.com> wrote:
> > > > > >
> > > > > > Thanks for driving this effort, Yangze! The proposal overall LGTM.
> > > > > > Apart from the throughput enhancement in the OLAP scenario, the
> > > > > > separation of leader election/discovery services and the metadata
> > > > > > persistence services will also make the HA impl clearer and easier
> > > > > > to maintain. Just a minor comment on naming: would it be better to
> > > > > > rename PersistentServices to PersistenceServices, as usually we put
> > > > > > a noun before Services?
> > > > > >
> > > > > > Best,
> > > > > > Zhanghao Chen
> > > > > > ________________________________
> > > > > > From: Yangze Guo <karma...@gmail.com>
> > > > > > Sent: Tuesday, December 19, 2023 17:33
> > > > > > To: dev <dev@flink.apache.org>
> > > > > > Subject: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios
> > > > > >
> > > > > > Hi, there,
> > > > > >
> > > > > > We would like to start a discussion thread on "FLIP-403: High
> > > > > > Availability Services for OLAP Scenarios"[1].
> > > > > >
> > > > > > Currently, Flink's high availability service consists of two
> > > > > > mechanisms: leader election/retrieval services for JobManager and
> > > > > > persistent services for job metadata. However, these mechanisms are
> > > > > > set up in an "all or nothing" manner. In OLAP scenarios, we typically
> > > > > > only require leader election/retrieval services for JobManager
> > > > > > components since jobs usually do not have a restart strategy.
> > > > > > Additionally, the persistence of job states can negatively impact the
> > > > > > cluster's throughput, especially for short query jobs.
> > > > > >
> > > > > > To address these issues, this FLIP proposes splitting the
> > > > > > HighAvailabilityServices into LeaderServices and PersistentServices,
> > > > > > and enabling users to independently configure the high availability
> > > > > > strategies specifically related to jobs.
> > > > > >
> > > > > > Please find more details in the FLIP wiki document [1]. Looking
> > > > > > forward to your feedback.
> > > > > >
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-403+High+Availability+Services+for+OLAP+Scenarios
> > > > > >
> > > > > > Best,
> > > > > > Yangze Guo