[ https://issues.apache.org/jira/browse/FLINK-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616639#comment-14616639 ]

ASF GitHub Bot commented on FLINK-2288:
---------------------------------------

Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/886#discussion_r34035723
  
    --- Diff: docs/setup/jobmanager_high_availability.md ---
    @@ -0,0 +1,121 @@
    +---
    +title: "JobManager High Availability (HA)"
    +---
    +<!--
    +Licensed to the Apache Software Foundation (ASF) under one
    +or more contributor license agreements.  See the NOTICE file
    +distributed with this work for additional information
    +regarding copyright ownership.  The ASF licenses this file
    +to you under the Apache License, Version 2.0 (the
    +"License"); you may not use this file except in compliance
    +with the License.  You may obtain a copy of the License at
    +
    +  http://www.apache.org/licenses/LICENSE-2.0
    +
    +Unless required by applicable law or agreed to in writing,
    +software distributed under the License is distributed on an
    +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +KIND, either express or implied.  See the License for the
    +specific language governing permissions and limitations
    +under the License.
    +-->
    +
    +The JobManager is the coordinator of each Flink deployment. It is 
responsible for both *scheduling* and *resource management*.
    +
    +By default, there is a single JobManager instance per Flink cluster. This 
creates a *single point of failure* (SPOF): if the JobManager crashes, no new 
programs can be submitted and running programs fail.
    +
    +With JobManager High Availability, you can run multiple JobManager 
instances per Flink cluster and thereby circumvent the *SPOF*.
    +
    +The general idea of JobManager high availability is that there is a 
**single leading JobManager** at any time and **multiple standby JobManagers** 
to take over leadership in case the leader fails. This guarantees that there is 
**no single point of failure** and programs can make progress as soon as a 
standby JobManager has taken leadership. There is no explicit distinction 
between standby and master JobManager instances. Each JobManager can take the 
role of master or standby.
    +
    +As an example, consider the following setup with three JobManager 
instances:
    +
    +<img src="fig/jobmanager_ha_overview.png" class="center" />
    +
    +## Configuration
    +
    +To enable JobManager High Availability you have to configure a **ZooKeeper 
quorum** and set up a **masters file** with all JobManagers hosts.
    +
    +Flink leverages **[ZooKeeper](http://zookeeper.apache.org)** for  
*distributed coordination* between all running JobManager instances. ZooKeeper 
is a separate service from Flink, which provides highly reliable distributed 
coordination via leader election and light-weight consistent state storage. 
Check out [ZooKeeper's Getting Started 
Guide](http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html) for more 
information about ZooKeeper.
    +
    +Configuring a ZooKeeper quorum in `conf/flink-conf.yaml` *enables* high 
availability mode and all Flink components try to connect to a JobManager via 
coordination through ZooKeeper.
    +
    +- **ZooKeeper quorum** (required): A *ZooKeeper quorum* is a replicated 
group of ZooKeeper servers, which provide the distributed coordination service.
    +  
    +  <pre>ha.zookeeper.quorum: address1:2181[,...],addressX:2181</pre>
    +
    +  Each *addressX:port* refers to a ZooKeeper server, which is reachable by 
Flink at the given address and port.
    +
    +- The following configuration keys are optional:
    +
    +  - `ha.zookeeper.dir: /flink [default]`: ZooKeeper directory to use for 
coordination
    +  - TODO Add client configuration keys
    +
    +## Starting an HA-cluster
    +
    +In order to start an HA-cluster configure the *masters* file in 
`conf/masters`:
    +
    +- **masters file**: The *masters file* contains all hosts, on which 
JobManagers are started.
    +
    +  <pre>
    +jobManagerAddress1
    +[...]
    +jobManagerAddressX
    +  </pre>
    +
    +After configuring the masters and the ZooKeeper quorum, you can use the 
provided cluster startup scripts as usual. They will start a HA-cluster. **Keep 
in mind that the ZooKeeper quorum has to be running when you call the scripts**.
    +
    +## Running ZooKeeper
    +
    +If you don't have a running ZooKeeper installation, you can use the helper 
scripts, which ship with Flink.
    +
    +There is a ZooKeeper configuration template in `conf/zoo.cfg`. You can 
configure the hosts to run ZooKeeper on with the `server.X` entries, where X is 
a unique ID of each server:
    +
    +<pre>
    +server.X=addressX:peerPort:leaderPort
    +[...]
    +server.Y=addressY:peerPort:leaderPort
    +</pre>
    +
    +The script `bin/start-zookeeper-quorum.sh` will start a ZooKeeper server 
on each of the configured hosts. The started processes start ZooKeeper servers 
via a Flink wrapper, which reads the configuration from `conf/zoo.cfg` and 
makes sure to set some rqeuired configuration values for convenience. In 
production setups, it is recommended to manage your own ZooKeeper installation.
    --- End diff --
    
    Typo: `rqeuired` -> `required`


> Setup ZooKeeper for distributed coordination
> --------------------------------------------
>
>                 Key: FLINK-2288
>                 URL: https://issues.apache.org/jira/browse/FLINK-2288
>             Project: Flink
>          Issue Type: Sub-task
>          Components: JobManager, TaskManager
>            Reporter: Ufuk Celebi
>            Assignee: Ufuk Celebi
>             Fix For: 0.10
>
>
> Having standby JM instances for job manager high availability requires 
> distributed coordination between JM, TM, and clients. For this, we will use 
> ZooKeeper (ZK).
> Pros:
> - Proven solution (other projects use it for this as well)
> - Apache TLP with large community, docs, and library with required "recipes" 
> like leader election (see below)
> Related Wiki: 
> https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
