Hi all,

Since I’m currently working on an implementation of 
HighAvailabilityServicesFactory, I thought it would be good to share my 
experience so far.

Our use case is cloud-based: we package Flink and our supplementary code 
into a Docker image, then run that image through Kubernetes+Helm 
orchestration.

We use neither Hadoop nor HDFS, relying on Google Cloud Storage instead, and 
we don’t run ZooKeeper. Our Flink setup consists of one JobManager and 
multiple TaskManagers on demand.

Due to the nature of cloud computing there’s a possibility our JobManager 
instance might go down, only to be automatically recreated through Kubernetes. 
Since we don’t run ZooKeeper, we needed a way to run a variant of a High 
Availability cluster where we keep the JobManager information on our attached 
persistent Kubernetes volume instead of in ZooKeeper. We found this post on 
StackOverflow and decided to give it a try:
https://stackoverflow.com/questions/52104759/apache-flink-on-kubernetes-resume-job-if-jobmanager-crashes/52112538

So far we have a setup that seems to be working on our local deployment; we 
haven’t yet tried it in the actual cloud.

As far as implementation goes, here’s what we did:

We used MapDB (http://mapdb.org/) as our storage engine, persisting lists of 
objects to disk. We partially relied on StandaloneHaServices for our 
HaServices implementation; otherwise we looked at ZooKeeperHaServices and 
related classes for inspiration and guidance.
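
To give an idea of the pattern (not our exact code), here’s a minimal sketch 
of how MapDB can back such a store. It assumes MapDB 3.x; the class name, 
file path, and map name are made up for illustration:

import java.util.concurrent.ConcurrentMap;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

public class HaStateStore {
    private final DB db;
    private final ConcurrentMap<String, byte[]> jobGraphs;

    public HaStateStore(String path) {
        // Open (or create) a file-backed DB on the mounted persistent volume.
        db = DBMaker.fileDB(path)
                .transactionEnable()          // crash-safe writes
                .fileMmapEnableIfSupported()
                .make();
        // Durable map of job ID -> serialized job graph bytes.
        jobGraphs = db
                .hashMap("jobGraphs", Serializer.STRING, Serializer.BYTE_ARRAY)
                .createOrOpen();
    }

    public void put(String jobId, byte[] serializedJobGraph) {
        jobGraphs.put(jobId, serializedJobGraph);
        db.commit(); // flush the transaction to disk
    }

    public byte[] get(String jobId) {
        return jobGraphs.get(jobId);
    }
}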

Here’s a list of new classes:

FileSystemCheckpointIDCounter implements CheckpointIDCounter
FileSystemCheckpointRecoveryFactory implements CheckpointRecoveryFactory
FileSystemCompletedCheckpointStore implements CompletedCheckpointStore
FileSystemHaServices extends StandaloneHaServices
FileSystemHaServicesFactory implements HighAvailabilityServicesFactory
FileSystemSubmittedJobGraphStore implements SubmittedJobGraphStore
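
For context, the factory is the entry point Flink instantiates. Here’s a 
minimal sketch of what ours looks like; everything except the 
HighAvailabilityServicesFactory interface itself (our option name, the 
default path, the FileSystemHaServices constructor) is simplified or made up 
for illustration:

import java.util.concurrent.Executor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.highavailability.HighAvailabilityServices;
import org.apache.flink.runtime.highavailability.HighAvailabilityServicesFactory;

public class FileSystemHaServicesFactory implements HighAvailabilityServicesFactory {

    @Override
    public HighAvailabilityServices createHAServices(
            Configuration configuration, Executor executor) throws Exception {
        // Points at the mounted persistent volume; the option name is our own.
        String storagePath = configuration.getString(
                "high-availability.filesystem.path", "/flink-ha");
        return new FileSystemHaServices(storagePath, configuration);
    }
}

If I read the loading code correctly, the factory is then picked up by 
setting the high-availability option in flink-conf.yaml to the factory’s 
fully qualified class name.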

Testing so far has shown that bringing a JobManager down and back up does 
indeed restore all the running jobs. Job creation/destruction also works.
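
To give a flavor of the restore path, here’s roughly what our 
FileSystemSubmittedJobGraphStore’s recoverJobGraph(JobID) boils down to, 
extracted into a standalone helper for illustration (JobGraphRecovery is a 
made-up name; jobGraphs is the MapDB map from the earlier sketch):

import java.util.concurrent.ConcurrentMap;
import org.apache.flink.api.common.JobID;
import org.apache.flink.runtime.jobmanager.SubmittedJobGraph;
import org.apache.flink.util.InstantiationUtil;

public final class JobGraphRecovery {

    /** Deserializes a job graph previously written on submission, or null if absent. */
    static SubmittedJobGraph recoverJobGraph(
            ConcurrentMap<String, byte[]> jobGraphs, JobID jobId) throws Exception {
        byte[] bytes = jobGraphs.get(jobId.toString());
        if (bytes == null) {
            return null; // nothing stored for this job
        }
        return InstantiationUtil.deserializeObject(
                bytes, JobGraphRecovery.class.getClassLoader());
    }
}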

Hope this helps!

Thanks,
Aleksandar Mastilovic

> On Aug 21, 2019, at 12:32 AM, Zili Chen <wander4...@gmail.com> wrote:
> 
> Hi guys,
> 
> We want to have an accurate idea of how users actually use 
> high-availability services in Flink, especially how you customize
> high-availability services via HighAvailabilityServicesFactory.
> 
> Basically there are a standalone impl., a ZooKeeper impl., an embedded
> impl. used in MiniCluster, a YARN impl. not yet implemented, and a gate
> for customized implementations.
> 
> Generally I think the standalone impl. and ZooKeeper impl. are the most
> widely used implementations. The embedded impl. is used, often without
> awareness, when users run a MiniCluster.
> 
> Besides that, it would be helpful to know how you customize 
> high-availability services using the HighAvailabilityServicesFactory 
> interface, for the ongoing FLINK-10333[1] which will evolve 
> high-availability services in Flink, as well as whether any user is
> interested in the not-yet-implemented YARN impl.
> 
> Any use case would be helpful. I really appreciate your time and your
> insight.
> 
> Best,
> tison.
> 
> [1] https://issues.apache.org/jira/browse/FLINK-10333
