Till Rohrmann created FLINK-2289:
------------------------------------
Summary: Make JobManager highly available
Key: FLINK-2289
URL: https://issues.apache.org/jira/browse/FLINK-2289
Project: Flink
Issue Type: Improvement
Reporter: Till Rohrmann
Assignee: Till Rohrmann
Currently, the {{JobManager}} is the single point of failure in the Flink
system. If it fails, then your job cannot be recovered and the Flink cluster is
no longer able to receive new jobs.
Therefore, it is crucial to make the {{JobManager}} fault tolerant so that the
Flink cluster can recover from failed {{JobManager}}. As a first step towards
this goal, I propose to make the {{JobManager}} highly available by starting
multiple instances and using Apache ZooKeeper to elect a leader. The leader is
responsible for the execution of the Flink job.
In case that the {{JobManager}} dies, one of the other running {{JobManager}}
will be elected as the leader and take over the role of the leader. The
{{Client}} and the {{TaskManager}} will automatically detect the new
{{JobManager}} by querying the ZooKeeper cluster.
Note that this does not achieve full fault tolerance for the {{JobManager}} but
it allows the cluster to recover from failed {{JobManager}}. The design of
high-availability for the {{JobManager}} is tracked in the wiki here [1].
Resources:
[1]
[https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)