Re: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios

Yangze Guo Mon, 08 Jan 2024 01:11:48 -0800

Thanks for the pointer, Rui!

I have reviewed FLIP-383, and based on my understanding, this feature
should be enabled by default for batch jobs in the future. Therefore,
+1 for checking the parameters and issuing log warnings when the user
explicitly configures execution.batch.job-recovery.enabled to true.


+1 for high-availability.job-recovery.enabled, which fits the YAML
hierarchy better.
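
To make the check concrete, here is a rough sketch of what I have in mind. It is plain Java over a string map rather than Flink's Configuration class; the class and method names are hypothetical, the option keys are the ones proposed in this thread and may still change, and the default of "true" for the HA option is an assumption on my side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JobRecoveryConfigCheck {
    // Key names as proposed in this thread; the final names may differ.
    public static final String HA_JOB_RECOVERY = "high-availability.job-recovery.enabled";
    public static final String BATCH_JOB_RECOVERY = "execution.batch.job-recovery.enabled";

    /** Hypothetical helper (not a Flink API): returns warnings for contradictory settings. */
    public static List<String> validate(Map<String, String> conf) {
        List<String> warnings = new ArrayList<>();
        // Assumption: HA job recovery is treated as enabled when the key is unset.
        boolean haRecovery =
                Boolean.parseBoolean(conf.getOrDefault(HA_JOB_RECOVERY, "true"));
        // Warn only when the user explicitly set the batch option to true,
        // so a future default of true would not trigger the warning.
        boolean batchExplicitlyOn = "true".equalsIgnoreCase(conf.get(BATCH_JOB_RECOVERY));
        if (!haRecovery && batchExplicitlyOn) {
            warnings.add(BATCH_JOB_RECOVERY + " is set to true, but " + HA_JOB_RECOVERY
                    + " is false; the job cannot be recovered.");
        }
        return warnings;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(HA_JOB_RECOVERY, "false");
        conf.put(BATCH_JOB_RECOVERY, "true");
        // Print each warning; a real implementation would use the logger instead.
        validate(conf).forEach(System.out::println);
    }
}
```

Checking for an explicit "true" rather than the resolved value is deliberate: once FLIP-383 is enabled by default, users who never touched the batch option should not see the warning.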


Best,
Yangze Guo

On Mon, Jan 8, 2024 at 3:43 PM Rui Fan <[email protected]> wrote:
>
> Thanks to Yangze for driving this proposal!
>
> Overall looks good to me! This proposal should improve
> performance when the job doesn't need failover.
>
> I have some minor questions:
>
> 1. How does it work with FLIP-383[1]?
>
> This FLIP introduces high-availability.enable-job-recovery,
> and FLIP-383 introduces execution.batch.job-recovery.enabled.
>
> IIUC, when high-availability.enable-job-recovery is false, the job
> cannot recover even if execution.batch.job-recovery.enabled = true,
> right?
>
> If so, could we check these parameters and log a warning? Or
> disable execution.batch.job-recovery.enabled directly when
> high-availability.enable-job-recovery = false.
>
> 2. Could we rename it to high-availability.job-recovery.enabled to unify
> the naming?
>
> WDYT?
>
> [1] https://cwiki.apache.org/confluence/x/QwqZE
>
> Best,
> Rui
>
> On Mon, Jan 8, 2024 at 2:04 PM Yangze Guo <[email protected]> wrote:
>
> > Thanks for your comment, Yong.
> >
> > Here are my thoughts on the splitting of HighAvailabilityServices:
> > Firstly, I would treat this separation as a result of technical debt
> > and a side effect of the FLIP. In order to achieve a cleaner interface
> > hierarchy for High Availability before Flink 2.0, the design decision
> > should not be limited to OLAP scenarios.
> > I agree that the current HAServices can be divided based on either the
> > actual target (cluster & job) or the type of functionality (leader
> > election & persistence). From a conceptual perspective, I do not see
> > one approach being better than the other. However, I have chosen the
> > current separation for a clear separation of concerns. After FLIP-285,
> > each process has a dedicated LeaderElectionService responsible for
> > leader election of all the components within it. This
> > LeaderElectionService has its own lifecycle management. If we were to
> > split the HAServices into 'ClusterHighAvailabilityService' and
> > 'JobHighAvailabilityService', we would need to couple the lifecycle
> > management of these two interfaces, as they both rely on the
> > LeaderElectionService and other relevant classes. This coupling and
> > implicit design assumption will increase the complexity and testing
> > difficulty of the system. WDYT?
> >
> > Best,
> > Yangze Guo
> >
> > On Mon, Jan 8, 2024 at 12:08 PM Yong Fang <[email protected]> wrote:
> > >
> > > Thanks Yangze for starting this discussion. I have one comment: why do we
> > > need to abstract two services as `LeaderServices` and
> > > `PersistenceServices`?
> > >
> > > From the content, the purpose of this FLIP is to make job failover more
> > > lightweight, so it would be more appropriate to abstract two services as
> > > `ClusterHighAvailabilityService` and `JobHighAvailabilityService` instead
> > > of `LeaderServices` and `PersistenceServices` based on leader and store.
> > > In this way, we can create a `JobHighAvailabilityService` that has a leader
> > > service and store for the job that meets the requirements based on the
> > > configuration in the zk/k8s high availability service.
> > >
> > > WDYT?
> > >
> > > Best,
> > > Fang Yong
> > >
> > > On Fri, Dec 29, 2023 at 8:10 PM xiangyu feng <[email protected]> wrote:
> > >
> > > > Thanks Yangze for restarting this discussion.
> > > >
> > > > +1 for the overall idea. By splitting the HighAvailabilityServices into
> > > > LeaderServices and PersistenceServices, we may support configuring
> > > > different storage behind them in the future.
> > > >
> > > > We did run into real problems in production where too much job
> > > > metadata was being stored on ZK, causing system instability.
> > > >
> > > >
> > > > Yangze Guo <[email protected]> wrote on Fri, Dec 29, 2023 at 10:21:
> > > >
> > > > > Thanks for the response, Zhanghao.
> > > > >
> > > > > PersistenceServices sounds good to me.
> > > > >
> > > > > Best,
> > > > > Yangze Guo
> > > > >
> > > > > On Wed, Dec 27, 2023 at 11:30 AM Zhanghao Chen
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Thanks for driving this effort, Yangze! The proposal overall LGTM.
> > > > > > > Apart from the throughput enhancement in the OLAP scenario, the
> > > > > > > separation of leader election/discovery services and the metadata
> > > > > > > persistence services will also make the HA impl clearer and easier
> > > > > > > to maintain. Just a minor comment on naming: would it be better to
> > > > > > > rename PersistentServices to PersistenceServices, as usually we put
> > > > > > > a noun before Services?
> > > > > >
> > > > > > Best,
> > > > > > Zhanghao Chen
> > > > > > ________________________________
> > > > > > From: Yangze Guo <[email protected]>
> > > > > > Sent: Tuesday, December 19, 2023 17:33
> > > > > > To: dev <[email protected]>
> > > > > > > Subject: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios
> > > > > >
> > > > > > Hi, there,
> > > > > >
> > > > > > We would like to start a discussion thread on "FLIP-403: High
> > > > > > Availability Services for OLAP Scenarios"[1].
> > > > > >
> > > > > > Currently, Flink's high availability service consists of two
> > > > > > mechanisms: leader election/retrieval services for JobManager and
> > > > > > persistent services for job metadata. However, these mechanisms are
> > > > > > > set up in an "all or nothing" manner. In OLAP scenarios, we typically
> > > > > > > only require leader election/retrieval services for JobManager
> > > > > > > components since jobs usually do not have a restart strategy.
> > > > > > > Additionally, the persistence of job states can negatively impact the
> > > > > > > cluster's throughput, especially for short query jobs.
> > > > > >
> > > > > > > To address these issues, this FLIP proposes splitting the
> > > > > > > HighAvailabilityServices into LeaderServices and PersistentServices,
> > > > > > > and enabling users to independently configure the high availability
> > > > > > > strategies specifically related to jobs.
> > > > > >
> > > > > > Please find more details in the FLIP wiki document [1]. Looking
> > > > > > forward to your feedback.
> > > > > >
> > > > > > [1]
> > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-403+High+Availability+Services+for+OLAP+Scenarios
> > > > > >
> > > > > > Best,
> > > > > > Yangze Guo
> > > > >
> > > >
> >
