Hi all,

Since I’m currently working on an implementation of HighAvailabilityServicesFactory, I thought it would be good to report here about my experience so far.
Our use case is cloud based: we package Flink and our supplementary code into a Docker image, then run those images through Kubernetes+Helm orchestration. We don’t use Hadoop or HDFS but rather Google Cloud Storage, and we don’t run ZooKeeper. Our Flink setup consists of one JobManager and multiple TaskManagers on demand.

Due to the nature of cloud computing there’s a possibility our JobManager instance might go down, only to be automatically recreated through Kubernetes. Since we don’t run ZooKeeper, we needed a way to run a variant of a High Availability cluster where we would keep the JobManager information on our attached persistent k8s volume instead of in ZooKeeper. We found this post on StackOverflow (https://stackoverflow.com/questions/52104759/apache-flink-on-kubernetes-resume-job-if-jobmanager-crashes/52112538) and decided to give it a try.

So far we have a setup that seems to be working on our local deployment; we haven’t yet tried it in the actual cloud.

As far as the implementation goes, here’s what we did: We used MapDB (http://mapdb.org/) as our storage format, to persist lists of objects onto disk. We partially relied on StandaloneHaServices for our HaServices implementation; otherwise we looked at ZooKeeperHaServices and related classes for inspiration and guidance.

Here’s a list of the new classes:

- FileSystemCheckpointIDCounter implements CheckpointIDCounter
- FileSystemCheckpointRecoveryFactory implements CheckpointRecoveryFactory
- FileSystemCompletedCheckpointStore implements CompletedCheckpointStore
- FileSystemHaServices extends StandaloneHaServices
- FileSystemHaServicesFactory implements HighAvailabilityServicesFactory
- FileSystemSubmittedJobGraphStore implements SubmittedJobGraphStore

Testing so far has shown that bringing down a JobManager and bringing it back up does indeed restore all the running jobs. Job creation/destruction also works. I’ve appended a few rough sketches of these pieces below the quoted mail, in case they are useful to anyone.

Hope this helps!

Thanks,
Aleksandar Mastilovic

> On Aug 21, 2019, at 12:32 AM, Zili Chen <wander4...@gmail.com> wrote:
>
> Hi guys,
>
> We want to have an accurate idea of how users actually use
> high-availability services in Flink, especially how you customize
> high-availability services by HighAvailabilityServicesFactory.
>
> Basically there are standalone impl., zookeeper impl., embedded impl.
> used in MiniCluster, YARN impl. not yet implemented, and a gate to
> customized implementations.
>
> Generally I think standalone impl. and zookeeper impl. are the most
> widely used implementations. The embedded impl. is used without
> awareness when users run a MiniCluster.
>
> Besides that, it is helpful to know how you guys customize
> high-availability services using HighAvailabilityServicesFactory
> interface for the ongoing FLINK-10333[1] which would evolve
> high-availability services in Flink. As well as whether there is any
> user take interest in the not yet implemented YARN impl.
>
> Any user case should be helpful. I really appreciate your time and your
> insight.
>
> Best,
> tison.
>
> [1] https://issues.apache.org/jira/browse/FLINK-10333
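
As promised above, here are a few simplified sketches. First, the MapDB persistence pattern: we keep serialized objects in a MapDB map on the persistent volume and commit after every mutation. The class name, paths, and helper methods here are illustrative only, not our exact code (assuming MapDB 3.x and Flink’s InstantiationUtil):

import java.io.IOException;
import java.util.concurrent.ConcurrentMap;

import org.apache.flink.util.InstantiationUtil;

import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

/** Illustrative MapDB-backed store for serialized objects (hypothetical class). */
public class MapDbObjectStore implements AutoCloseable {

    private final DB db;
    private final ConcurrentMap<String, byte[]> map;

    public MapDbObjectStore(String filePath, String mapName) {
        // transactionEnable() turns on MapDB's write-ahead log, so a crashed
        // JobManager leaves the file in a consistent state.
        this.db = DBMaker.fileDB(filePath).transactionEnable().make();
        this.map = db.hashMap(mapName, Serializer.STRING, Serializer.BYTE_ARRAY)
                .createOrOpen();
    }

    public void put(String key, Object value) throws IOException {
        map.put(key, InstantiationUtil.serializeObject(value));
        db.commit(); // make the change durable before acknowledging it
    }

    public <T> T get(String key, ClassLoader classLoader) throws IOException, ClassNotFoundException {
        byte[] bytes = map.get(key);
        return bytes == null ? null : InstantiationUtil.deserializeObject(bytes, classLoader);
    }

    public void remove(String key) {
        map.remove(key);
        db.commit();
    }

    @Override
    public void close() {
        db.close();
    }
}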
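
The factory is what ties everything into Flink: you set the "high-availability" key in flink-conf.yaml to the fully qualified class name of the factory, and Flink instantiates it reflectively. A minimal sketch, assuming the Flink 1.9 HighAvailabilityServicesFactory signature; the config key and the FileSystemHaServices constructor arguments are ours, not built-in:

import java.util.concurrent.Executor;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.highavailability.HighAvailabilityServices;
import org.apache.flink.runtime.highavailability.HighAvailabilityServicesFactory;

/**
 * Enabled in flink-conf.yaml with:
 *   high-availability: com.example.flink.ha.FileSystemHaServicesFactory
 * (package name hypothetical)
 */
public class FileSystemHaServicesFactory implements HighAvailabilityServicesFactory {

    @Override
    public HighAvailabilityServices createHAServices(Configuration configuration, Executor executor) throws Exception {
        // "high-availability.filesystem.path" is our own key pointing at the
        // mounted persistent volume; it is not a built-in Flink option.
        String haPath = configuration.getString("high-availability.filesystem.path", "/flink/ha");
        return new FileSystemHaServices(haPath, configuration);
    }
}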
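
Finally, to give a flavor of the individual services, roughly what the checkpoint ID counter looks like. This is a sketch against the Flink 1.9 CheckpointIDCounter interface (start, shutdown, getAndIncrement, setCount) using MapDB’s Atomic.Long; error handling and our real file layout are omitted:

import org.apache.flink.runtime.checkpoint.CheckpointIDCounter;
import org.apache.flink.runtime.jobgraph.JobStatus; // lives here as of Flink 1.9

import org.mapdb.Atomic;
import org.mapdb.DB;
import org.mapdb.DBMaker;

/** Sketch of a checkpoint ID counter persisted on the k8s volume via MapDB. */
public class FileSystemCheckpointIDCounter implements CheckpointIDCounter {

    private final String dbFilePath;
    private DB db;
    private Atomic.Long counter;

    public FileSystemCheckpointIDCounter(String dbFilePath) {
        this.dbFilePath = dbFilePath;
    }

    @Override
    public void start() throws Exception {
        db = DBMaker.fileDB(dbFilePath).transactionEnable().make();
        counter = db.atomicLong("checkpoint-id").createOrOpen();
        if (counter.get() == 0L) {
            counter.set(1L); // Flink checkpoint IDs start at 1
        }
    }

    @Override
    public void shutdown(JobStatus jobStatus) throws Exception {
        if (jobStatus.isGloballyTerminalState()) {
            // The job is finished for good, so the counter is no longer needed.
            counter.set(1L);
        }
        db.commit();
        db.close();
    }

    @Override
    public long getAndIncrement() throws Exception {
        long id = counter.getAndIncrement();
        db.commit(); // persist before the ID is handed out
        return id;
    }

    @Override
    public void setCount(long newId) throws Exception {
        counter.set(newId);
        db.commit();
    }
}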