Hi,

Before the actual question, and just to make sure my assumptions are
correct, my understanding is that Flink's current behaviour is:

- Job Manager failure under a highly available setup: a standby Job
Manager takes over, with no impact on the job
- Task Manager failure, with enough free slots in the cluster: as long
as a restart strategy is defined (see the sketch after this list), the
failed sub-tasks will be resumed on another machine
- Task Manager failure, without free slots in the cluster: sub-tasks on the
faulty node are marked as failed, while the others transition to
cancelled, and the job is marked as failing. When new task slots appear in
the cluster, the job recovers, again provided a restart strategy is defined
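
For context, I'm assuming a fairly standard configuration, roughly like
the flink-conf.yaml sketch below (the ZooKeeper quorum, storage directory,
and restart values are placeholders, not our actual setup):

    # Job Manager HA: standby Job Managers coordinate through ZooKeeper
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
    high-availability.storageDir: hdfs:///flink/ha/

    # Restart strategy: retry up to 10 times, waiting 10 s between
    # attempts; without a restart strategy the job fails permanently on
    # the first Task Manager loss
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 10
    restart-strategy.fixed-delay.delay: 10 s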

Assuming this is true, there is no true high availability, in the sense
that there is always some downtime when a node fails, even for unrelated
partitions or parallel sub-tasks. Is this correct? If so, what is the
recommended strategy for achieving high availability in the event of a
Task Manager failure?

Currently the only option I can see is to duplicate the entire job in an
active-active fashion. However, if we also want to be resilient to data
center failures, that would mean running something like 4 instances of the
same job, which starts to look like a waste of resources. It can be argued
that the standby jobs don't need to run on a cluster as powerful as the
original one, and that the secondary data center only needs a single
cluster, but still. Are there any other alternatives?

Ignoring the current state of affairs, are there any plans to tackle this
issue? I recall seeing something about "hot-standby replication" in
Aljoscha Krettek's "(Past), Present, and Future of Apache Flink" talk
during last year's QCon.ai. I've tried scavenging for information about
this in JIRA and the design docs
<https://cwiki.apache.org/confluence/display/FLINK/Design+Documents>, but
to no avail.

Besides this, and I'm not sure this is the right place to ask: is there a
commercial solution to this problem?

Thanks in advance!
