Re: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios

Yangze Guo Sun, 07 Jan 2024 22:04:11 -0800

Thanks for your comment, Yong.

Here are my thoughts on the splitting of HighAvailabilityServices:
Firstly, I would treat this separation as a result of technical debt
and a side effect of the FLIP. In order to achieve a cleaner interface
hierarchy for High Availability before Flink 2.0, the design decision
should not be limited to OLAP scenarios.
I agree that the current HAServices can be divided based on either the
actual target (cluster & job) or the type of functionality (leader
election & persistence). From a conceptual perspective, I do not see
one approach being better than the other. However, I have chosen the
current separation for a clear separation of concerns. After FLIP-285,
each process has a dedicated LeaderElectionService responsible for
leader election of all the components within it. This
LeaderElectionService has its own lifecycle management. If we were to
split the HAServices into 'ClusterHighAvailabilityService' and
'JobHighAvailabilityService', we would need to couple the lifecycle
management of these two interfaces, as they both rely on the
LeaderElectionService and other relevant classes. This coupling and
implicit design assumption will increase the complexity and testing
difficulty of the system. WDYT?
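
To make the lifecycle concern concrete, here is a rough sketch in Java. All
class and method names below are made up for illustration; they are not the
interfaces proposed in the FLIP nor actual Flink classes. It only assumes, as
described above, that one leader election service per process is shared by
all components.

```java
// Hypothetical stand-in for the per-process leader election service.
final class SharedLeaderElection implements AutoCloseable {
    private boolean closed;
    boolean isClosed() { return closed; }
    @Override public void close() { closed = true; }
}

// Split by functionality: each half owns exactly one resource, so each
// close() is self-contained.
final class LeaderServicesSketch implements AutoCloseable {
    private final SharedLeaderElection election = new SharedLeaderElection();
    boolean canElect() { return !election.isClosed(); }
    @Override public void close() { election.close(); }
}

final class PersistenceServicesSketch implements AutoCloseable {
    private boolean closed;
    boolean isClosed() { return closed; }
    @Override public void close() { closed = true; }
}

// Split by target: both halves depend on the same election service, so
// neither can close it without knowing about the other.
final class ClusterHaSketch implements AutoCloseable {
    private final SharedLeaderElection election;
    ClusterHaSketch(SharedLeaderElection election) { this.election = election; }
    boolean canElect() { return !election.isClosed(); }
    // Naive ownership: closing the cluster half also closes the shared service.
    @Override public void close() { election.close(); }
}

final class JobHaSketch implements AutoCloseable {
    private final SharedLeaderElection election;
    JobHaSketch(SharedLeaderElection election) { this.election = election; }
    boolean canElect() { return !election.isClosed(); }
    // Must not close the shared service; relies on the implicit assumption
    // that someone else does it at the right time.
    @Override public void close() { }
}

final class HaSplitSketch {
    public static void main(String[] args) {
        LeaderServicesSketch leader = new LeaderServicesSketch();
        PersistenceServicesSketch persistence = new PersistenceServicesSketch();
        leader.close();
        System.out.println("persistence closed after leader.close(): "
                + persistence.isClosed());

        SharedLeaderElection shared = new SharedLeaderElection();
        ClusterHaSketch cluster = new ClusterHaSketch(shared);
        JobHaSketch job = new JobHaSketch(shared);
        cluster.close();
        System.out.println("job can still elect after cluster.close(): "
                + job.canElect());
    }
}
```

In the target-based split, the close order (or some reference counting)
becomes part of the contract between the two interfaces, which is the
coupling I would like to avoid.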


Best,
Yangze Guo

On Mon, Jan 8, 2024 at 12:08 PM Yong Fang <[email protected]> wrote:
>
> Thanks Yangze for starting this discussion. I have one comment: why do we
> need to abstract two services as `LeaderServices` and
> `PersistenceServices`?
>
> From the content, the purpose of this FLIP is to make job failover more
> lightweight, so it would be more appropriate to abstract two services as
> `ClusterHighAvailabilityService` and `JobHighAvailabilityService` instead
> of `LeaderServices` and `PersistenceServices` based on leader and store. In
> this way, we can create a `JobHighAvailabilityService` that has a leader
> service and store for the job that meets the requirements based on the
> configuration in the zk/k8s high availability service.
>
> WDYT?
>
> Best,
> Fang Yong
>
> On Fri, Dec 29, 2023 at 8:10 PM xiangyu feng <[email protected]> wrote:
>
> > Thanks Yangze for restarting this discussion.
> >
> > +1 for the overall idea. By splitting the HighAvailabilityServices into
> > LeaderServices and PersistenceServices, we may support configuring
> > different storage behind them in the future.
> >
> > We did run into real problems in production where too much job metadata was
> > being stored on ZK, causing system instability.
> >
> >
> > Yangze Guo <[email protected]> wrote on Fri, Dec 29, 2023 at 10:21:
> >
> > > Thanks for the response, Zhanghao.
> > >
> > > PersistenceServices sounds good to me.
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Wed, Dec 27, 2023 at 11:30 AM Zhanghao Chen
> > > <[email protected]> wrote:
> > > >
> > > > Thanks for driving this effort, Yangze! The proposal overall LGTM. Apart
> > > > from the throughput enhancement in the OLAP scenario, the separation of
> > > > leader election/discovery services and the metadata persistence services
> > > > will also make the HA impl clearer and easier to maintain. Just a minor
> > > > comment on naming: would it be better to rename PersistentServices to
> > > > PersistenceServices, as usually we put a noun before Services?
> > > >
> > > > Best,
> > > > Zhanghao Chen
> > > > ________________________________
> > > > From: Yangze Guo <[email protected]>
> > > > Sent: Tuesday, December 19, 2023 17:33
> > > > To: dev <[email protected]>
> > > > Subject: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios
> > > >
> > > > Hi, there,
> > > >
> > > > We would like to start a discussion thread on "FLIP-403: High
> > > > Availability Services for OLAP Scenarios"[1].
> > > >
> > > > Currently, Flink's high availability service consists of two
> > > > mechanisms: leader election/retrieval services for JobManager and
> > > > persistent services for job metadata. However, these mechanisms are
> > > > set up in an "all or nothing" manner. In OLAP scenarios, we typically
> > > > only require leader election/retrieval services for JobManager
> > > > components since jobs usually do not have a restart strategy.
> > > > Additionally, the persistence of job states can negatively impact the
> > > > cluster's throughput, especially for short query jobs.
> > > >
> > > > To address these issues, this FLIP proposes splitting the
> > > > HighAvailabilityServices into LeaderServices and PersistentServices,
> > > > and enabling users to independently configure the high availability
> > > > strategies specifically related to jobs.
> > > >
> > > > Please find more details in the FLIP wiki document [1]. Looking
> > > > forward to your feedback.
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-403+High+Availability+Services+for+OLAP+Scenarios
> > > >
> > > > Best,
> > > > Yangze Guo
> > >
> >
