I've used a multi-datacenter Consul cluster to coordinate service
discovery. When a service starts up in the primary DC, it registers
itself in Consul with a key that has a TTL that must be periodically
renewed. If the service shuts down or terminates abruptly, the key expires
and is removed from Consul. A standby service in another DC can be started
automatically after detecting the absence of the key in Consul in the
primary DC. That detection could then trigger submitting a job to the standby Flink
cluster from the most recent savepoint that was copied by the offline
process you mentioned. It should be pretty easy to automate all of this. I
would not recommend setting up a multi-datacenter Zookeeper cluster; in my
experience, Consul is much easier to work with.
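
As a rough illustration of the watchdog side, here is a minimal Python sketch
of the idea, not the code I actually run. It assumes the primary service keeps
a key alive in Consul's KV store (for example by holding a session with a TTL
and delete behavior), and that the standby DC can query the primary DC through
the KV HTTP endpoint with the `dc` parameter. The names `consul_key_exists`,
`FailoverWatchdog`, `misses_required`, and the `start_standby` callback are
made up for this example; in practice the callback would be whatever submits
the job to the standby Flink cluster from the copied savepoint.

```python
import urllib.error
import urllib.request
from typing import Callable


def consul_key_exists(base_url: str, key: str, dc: str, timeout: float = 2.0) -> bool:
    """Return True if `key` exists in the KV store of datacenter `dc`.

    Consul's KV endpoint answers 404 when the key is absent. Any other
    error is re-raised; how to treat those is a policy choice left out here.
    """
    url = f"{base_url}/v1/kv/{key}?dc={dc}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


class FailoverWatchdog:
    """Hypothetical helper: start the standby after N consecutive missing-key checks."""

    def __init__(
        self,
        key_exists: Callable[[], bool],
        start_standby: Callable[[], None],
        misses_required: int = 3,
    ) -> None:
        self.key_exists = key_exists
        self.start_standby = start_standby
        self.misses_required = misses_required
        self.misses = 0
        self.failed_over = False

    def poll(self) -> bool:
        """Run one check; return True if this call triggered the failover."""
        if self.failed_over:
            return False
        if self.key_exists():
            # Primary is alive, so forget any earlier misses.
            self.misses = 0
            return False
        self.misses += 1
        if self.misses >= self.misses_required:
            self.start_standby()
            self.failed_over = True
            return True
        return False
```

A caller would wire it up with something like
`FailoverWatchdog(lambda: consul_key_exists(base_url, key, "dc1"), submit_job)`
and call `poll()` on a timer. Requiring several consecutive misses is only one
way to avoid failing over on a single missed renewal.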

Best,

--
Scott Kidder


On Mon, Jul 9, 2018 at 4:48 AM Sofer, Tovi <tovi.so...@citi.com> wrote:

> Hi all,
>
>
>
> We are now examining how to achieve high availability for Flink, and also to
> support automatic recovery in a disaster scenario, when a whole DC goes down.
>
> We have DC1, where we usually want work to be done, and DC2, which is more
> remote and where we want work to go only when DC1 is down.
>
>
>
> We examined a few options and would be glad to hear feedback or a suggestion
> for another way to achieve this.
>
> ·         Two separate Zookeeper and Flink clusters on the two
> data centers.
>
> Only the clusters on DC1 are running, and state is copied to DC2 in an offline
> process.
>
> To achieve automatic recovery we need to use some kind of watchdog which
> will check DC1 availability, and if it is down will start DC2 (and the same
> later if DC2 is down).
>
> Is there a recommended tool for this?
>
> ·         Zookeeper “stretch cluster” across data centers, with 2 nodes
> on DC1, 2 nodes on DC2 and one observer node.
>
> Also a Flink cluster with jobmanager1 on DC1 and jobmanager2 on DC2.
>
> This way, when DC1 is down, Zookeeper will notice this automatically and
> will transfer work to jobmanager2 on DC2.
>
> However, we would like the Zookeeper leader and the Flink jobmanager leader
> (primary one) to be in DC1, unless it is down.
>
> Is there a way to achieve this?
>
>
>
> Thanks and regards,
>
>
> *Tovi Sofer*
>
> Software Engineer
> +972 (3) 7405756
>
>
>
>
