To add one thing to the Mesos question:
My assumption that constraints on the JobManager can work is based on this
sentence from the link below:
“When running Flink with Marathon, the whole Flink cluster including the job 
manager will be run as Mesos tasks in the Mesos cluster.”
https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/deployment/mesos.html

[Not sure this is accurate, since it seems to contradict the image in the link below:
https://mesosphere.com/blog/apache-flink-on-dcos-and-apache-mesos ]

From: Sofer, Tovi [ICG-IT]
Sent: Tuesday, 10 July 2018 20:04
To: 'Till Rohrmann' <trohrm...@apache.org>; user <user@flink.apache.org>
Cc: Gardi, Hila [ICG-IT] <hg11...@imceu.eu.ssmb.com>
Subject: RE: high availability with automated disaster recovery using zookeeper

Hi Till, group,

Thank you for your response.
After reading further online about Mesos: can't Mesos fulfil the requirement of 
running the job manager in the primary data center,
by using: “constraints”: [[“datacenter”, “CLUSTER”, “main”]] ?
(See 
http://www.stratio.com/blog/mesos-multi-data-center-architecture-for-disaster-recovery/
 )
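
For context, such a constraint would live in the Marathon application definition that launches the Flink job manager. A minimal sketch, assuming the Mesos agents in DC1 are tagged with a "datacenter" attribute of "main" (the app id, command, and resource values are illustrative placeholders, not taken from any real deployment):

```json
{
  "id": "/flink/jobmanager",
  "cmd": "$FLINK_HOME/bin/mesos-appmaster.sh",
  "cpus": 1.0,
  "mem": 1024,
  "instances": 1,
  "constraints": [["datacenter", "CLUSTER", "main"]]
}
```

With CLUSTER, Marathon will only place the task on agents whose "datacenter" attribute equals "main"; if no such agent is reachable, the task stays unscheduled rather than failing over to DC2, so this alone does not give automatic disaster recovery.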

Is this supported by a Flink cluster on Mesos?

Thanks again
Tovi

From: Till Rohrmann <trohrm...@apache.org<mailto:trohrm...@apache.org>>
Sent: Tuesday, 10 July 2018 10:11
To: Sofer, Tovi [ICG-IT] 
<ts72...@imceu.eu.ssmb.com<mailto:ts72...@imceu.eu.ssmb.com>>
Cc: user <user@flink.apache.org<mailto:user@flink.apache.org>>
Subject: Re: high availability with automated disaster recovery using zookeeper

Hi Tovi,

that is an interesting use case you are describing here. I think, however, it 
depends mainly on the capabilities of ZooKeeper to produce the intended 
behavior. Flink itself relies on ZooKeeper for leader election in HA mode but 
does not expose any means to influence the leader election process. To be more 
precise, ZK is used as a black box which simply tells a JobManager that it is now 
the leader, independent of any data center preferences. I'm not sure whether it 
is possible to tell ZooKeeper about these preferences. If not, then an 
alternative could be to implement one's own high availability services which 
take such preferences into account.

Cheers,
Till

On Mon, Jul 9, 2018 at 1:48 PM Sofer, Tovi 
<tovi.so...@citi.com<mailto:tovi.so...@citi.com>> wrote:
Hi all,

We are now examining how to achieve high availability for Flink, and also to 
support automatic recovery in a disaster scenario, when an entire data center goes down.
We have DC1, where we usually want the work to be done, and DC2, which is more 
remote and where we want work to go only when DC1 is down.

We examined a few options and would be glad to hear feedback or a suggestion for 
another way to achieve this.

•         Two separate ZooKeeper and Flink clusters, one on each of the two data 
centers.
Only the cluster on DC1 is running, and state is copied to DC2 in an offline 
process.

To achieve automatic recovery we need to use some kind of watchdog which will 
check DC1 availability, and if it is down will start DC2 (and the same later if 
DC2 is down).

Is there a recommended tool for this?
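
To make the intent concrete, here is a minimal sketch of such a watchdog in Python. The health check assumes the JobManager REST endpoint (e.g. http://jobmanager:8081/overview) answers when the cluster is healthy; hostnames and the actual start/stop actions for DC2 are site-specific and left out:

```python
import urllib.request


def is_jobmanager_up(url, timeout=5):
    """Return True if the JobManager REST endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, DNS failure, timeout: treat the DC as down.
        return False


def choose_active_dc(dc1_up, dc2_up):
    """Prefer DC1 whenever it is available; fall back to DC2 otherwise."""
    if dc1_up:
        return "DC1"
    if dc2_up:
        return "DC2"
    return None  # both data centers unreachable; alert an operator
```

A real watchdog would poll in a loop, require several consecutive failures before failing over (to avoid flapping on a transient network blip), and itself needs to run somewhere that survives the loss of DC1.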

•         A ZooKeeper “stretch cluster” across the data centers, with 2 nodes on DC1, 
2 nodes on DC2 and one observer node.

Also a Flink cluster with jobmanager1 on DC1 and jobmanager2 on DC2.

This way, when DC1 is down, ZooKeeper will notice this automatically and will 
transfer the work to jobmanager2 on DC2.

However, we would like the ZooKeeper leader and the Flink JobManager leader (the 
primary one) to be from DC1, unless it is down.

Is there a way to achieve this?
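
For reference, the observer role in the stretch-cluster option is declared in zoo.cfg; a minimal sketch with hypothetical hostnames (each server also needs a matching myid file, and the observer itself additionally sets peerType=observer):

```
# Voting members, two per data center (hostnames are placeholders)
server.1=zk1-dc1.example.com:2888:3888
server.2=zk2-dc1.example.com:2888:3888
server.3=zk1-dc2.example.com:2888:3888
server.4=zk2-dc2.example.com:2888:3888
# Observer: receives updates but does not vote in leader elections
server.5=zk-observer.example.com:2888:3888:observer
```

Note that an observer only reduces voting load; it does not let you pin the ZooKeeper leader (or, through it, the Flink JobManager leader) to a particular data center.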

Thanks and regards,
Tovi Sofer
Software Engineer
+972 (3) 7405756
