[ 
https://issues.apache.org/jira/browse/FLINK-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103477#comment-15103477
 ] 

Ufuk Celebi commented on FLINK-2287:
------------------------------------

The active JobManager is discovered via ZooKeeper. There is no need to provide
the JobManager endpoints directly at the moment. The ZooKeeper endpoint is
configured in the flink-conf.yaml file. I agree that it would be nice to be
able to override this directly from the CLI (also when starting the cluster). +1

The client will always end up connected to the leading JobManager. If it first
connects to a standby JobManager, it will eventually be notified about the
leader change and reconnect to the new leading JobManager. Hope this helps.
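
To make the reconnect behavior concrete, here is a rough sketch in Python. It
is not Flink code: LeaderRegistry is a hypothetical in-memory stand-in for the
leader information kept in ZooKeeper, and the class names, method names, and
addresses are all made up for illustration. It only shows the idea that a
client which starts on a standby follows leader notifications until it is
connected to the current leader.

```python
class LeaderRegistry:
    """Hypothetical stand-in for the leader information stored in ZooKeeper."""

    def __init__(self, leader):
        self.leader = leader
        self._listeners = []

    def subscribe(self, callback):
        # Register a listener and tell it about the current leader right away.
        self._listeners.append(callback)
        callback(self.leader)

    def elect(self, new_leader):
        # A new leader was elected; notify every registered listener.
        self.leader = new_leader
        for callback in self._listeners:
            callback(new_leader)


class Client:
    """Illustrative client that follows leader change notifications."""

    def __init__(self, registry, initial_address):
        self.connected_to = initial_address
        self.history = [initial_address]
        registry.subscribe(self._on_leader_change)

    def _on_leader_change(self, leader):
        # Reconnect only if we are not already talking to the leader.
        if leader != self.connected_to:
            self.connected_to = leader
            self.history.append(leader)


registry = LeaderRegistry(leader="jm-1:6123")

# The client happens to start on a standby (jm-2) and is redirected to jm-1.
client = Client(registry, initial_address="jm-2:6123")

# Later jm-1 fails and jm-3 takes over; the client follows the new leader.
registry.elect("jm-3:6123")

print(client.connected_to)
print(client.history)
```

In this sketch the client never needs a JobManager address to be correct up
front; it only needs to reach the registry, which matches the point above that
only the ZooKeeper endpoint has to be configured.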

> Implement JobManager high availability
> --------------------------------------
>
>                 Key: FLINK-2287
>                 URL: https://issues.apache.org/jira/browse/FLINK-2287
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager, TaskManager
>            Reporter: Ufuk Celebi
>             Fix For: 0.10.0
>
>
> The problem: The JobManager (JM) is a single point of failure. When it
> crashes, TaskManagers (TM) fail all running jobs and try to reconnect to the
> same JM. A failed JM loses all state and cannot resume the running jobs,
> even if it recovers and the TMs reconnect.
> Solution: implement JM fault tolerance/high availability by having multiple 
> JM instances running with one as leader and the other(s) in standby. The 
> exact coordination and state update protocol between JM, TM, and clients is 
> covered in sub-tasks/issues.
> Related Wiki: 
> https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
