Hi there,

We would like to start a discussion thread on "FLIP-403: High Availability Services for OLAP Scenarios" [1].
Currently, Flink's high availability (HA) service consists of two mechanisms: leader election/retrieval services for JobManager components and persistence services for job metadata. However, these mechanisms are set up in an "all or nothing" manner. In OLAP scenarios, we typically need only the leader election/retrieval services for JobManager components, since jobs usually do not have a restart strategy. Additionally, persisting job states can reduce the cluster's throughput, especially for short query jobs.

To address these issues, this FLIP proposes splitting HighAvailabilityServices into LeaderServices and PersistentServices, and enabling users to independently configure the high availability strategies specifically related to jobs.

Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-403+High+Availability+Services+for+OLAP+Scenarios

Best,
Yangze Guo