Hi Kevin,

Thanks for reaching out. This question is not new and has come up many
times before. I share Gyula's view: there are ways to provide zero
downtime for Flink jobs, but I doubt it is really necessary in most cases.

If there are special use cases that really need zero downtime in the Flink
app itself, we should drill down into the zero-downtime requirement. For
me, there are two different categories of zero downtime. The first is zero
downtime for app upgrades, where traditional blue/green or canary
deployment, as used for online services, might be put on the table. The
other is an HA service that guarantees zero downtime for Flink apps, i.e.
there is zero delay (no impact at all) on any Flink job failover. Keep in
mind that the big difference between Flink apps and online services is that
Flink apps are stateful data computations. We have to take care of state
and data consistency, and will end up with a much more complicated and
expensive solution. Since people care about ROI, I would look at the big
picture and try to provide zero downtime on the "online" side.

Some users are not aware that Flink is an engine for "offline" apps, even
though Flink focuses on real-time stream processing. Users who ask for zero
downtime and blue/green deployment have a strong "online" apps mindset. To
me it sounds like asking an apple to taste like an orange. Commonly, Flink
apps, again as offline apps, are only one part of the whole service
landscape. We should check our requirements more closely. In most cases,
zero downtime can be achieved by the online services that consume the
result data produced by Flink apps. No zero downtime is required on the
Flink side if the delay caused by a Flink job's failover is smaller than
the business SLA. Flink 1.17 and 1.18 are working on this and will reduce
the delay significantly, down to seconds [1]. This should be good enough
for most use cases (teamed up with online services) to provide the
requested zero downtime.
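
To make the SLA argument concrete, here is a minimal sketch of what I mean.
It is only an illustration under my own assumptions: ResultCache, serve and
failover_within_sla are hypothetical names, not part of Flink or the Flink
Operator. The online service keeps the last result written by the Flink job
and continues to serve it during a failover; the only question is whether
the resulting staleness stays within the business SLA.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ResultCache:
    """Last result written by the Flink job, held by the online service (hypothetical)."""
    value: str
    updated_at: float  # epoch seconds of the last update received from Flink


def serve(cache: ResultCache, now: float, sla_seconds: float) -> Tuple[str, bool]:
    """Return the cached value and whether its staleness is still within the SLA."""
    staleness = now - cache.updated_at
    return cache.value, staleness <= sla_seconds


def failover_within_sla(failover_seconds: float,
                        normal_lag_seconds: float,
                        sla_seconds: float) -> bool:
    """True if the worst-case staleness during a Flink failover still meets the SLA."""
    return normal_lag_seconds + failover_seconds <= sla_seconds
```

With a failover delay of a few seconds and an SLA of, say, 30 seconds, the
online service never goes down from the user's point of view; it just serves
slightly older data until the Flink job has recovered.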

Just my two cents. Any different views that help me learn what I don't
know are welcome.

Best regards,
Jing

[1] https://www.ververica.com/blog/generic-log-based-incremental-checkpoint

On Wed, May 24, 2023 at 7:39 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hey Kevin!
>
> I am not aware of anyone currently working on this for the Flink Operator.
>
> Here are my current thoughts on the topic:
>
>    1. It's not impossible to build this into the operator but it would
>    require some considerable changes to the logic, both in terms of
>    resource mapping and observer logic, however...
>    2. It's a very niche use-case and in most cases this is not required
>    3. Even if we implement it there are a lot of caveats for making this
>    generally useful outside of some very specialized use-cases
>    4. In most cases this is actually not a good way to perform upgrades and
>    depending on the application it may lead to incorrect results etc.
>    5. This is possible to build on top of the current operator logic
>    externally
>
> So at the moment I am slightly against the idea in general, but of course I
> can be convinced otherwise if there is a general requirement / interest in
> the community. In any case we should have confidence that this will
> actually provide production value to many use-cases and it would require a
> FLIP for sure.
>
> Cheers,
> Gyula
>
>
>
> On Wed, May 24, 2023 at 5:24 PM Kevin Lam <kevin....@shopify.com.invalid>
> wrote:
>
> > Hi,
> >
> > Is there any interest or ongoing work around supporting zero-downtime
> > deployments with Flink using the Flink Operator?
> >
> > I saw that https://issues.apache.org/jira/browse/FLINK-24257 existed, but
> > it looks a little stale. I'm interested in learning more about the current
> > state of things.
> >
> > There is also some pre-existing work done by Lyft:
> > https://www.youtube.com/watch?v=Hyt3YrtKQAM
> >
>
