Re: FLIP-504: Blue/Green Deployments for Flink on Kubernetes - Phase 2 (WIP/Draft PR)

Ryan van Huuksloot via dev Mon, 16 Mar 2026 07:03:04 -0700

Hi Sergio,

re: 1.1
My thought is that a BlueGreen Mixin isn't Kubernetes-specific and could be
reused by other deployment control planes. However, I do agree that
attaching it to the sink has other implications, so I am happy to pivot if
we can find an alternative solution.


re: 2
I'm happy to leave it out of the Phase 2 implementation, but I think it
should be possible. For example, we use Phase 1 with cross-cluster
migrations today. Phase 2 within a single cluster isn't particularly useful
for us.

re: GateInjectorExecutor
This sounds like a neat idea. I need to read more about how it would work
but from a high level, injecting an operator before your sinks sounds like
a good idea. Better isolation, possible with SQL, no mixins, etc.

I will mention that part of the reason I want it before the sinks is
because nine out of ten people building pipelines struggle to understand
where their state is and how Phase 2 would affect the correctness of their
state depending on where they put the gate. I understand that if you have a
remote lookup and want to save bandwidth, you could optimize your pipeline
by moving the gate before the remote call; however, that seems like an
optimization that can be made later.
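
To make the placement discussion concrete, here is a minimal sketch of what
"inject a gate right before every sink" means as a graph rewrite. It is a toy
model only: the classes Node, Kind and injectGates are hypothetical names made
up for illustration and are not Flink's StreamGraph or PipelineExecutor API,
and the gate node carries no Phase 2 logic.

```java
import java.util.ArrayList;
import java.util.List;

// Toy pipeline graph used only for illustration; NOT Flink's StreamGraph API.
final class GateInjectionSketch {

    enum Kind { SOURCE, OPERATOR, GATE, SINK }

    static final class Node {
        final String name;
        final Kind kind;
        final List<Node> inputs = new ArrayList<>();

        Node(String name, Kind kind, Node... upstream) {
            this.name = name;
            this.kind = kind;
            for (Node n : upstream) {
                inputs.add(n);
            }
        }
    }

    // Rewires every sink to read from a new gate node that takes over the
    // sink's original inputs. Sinks that already read from a single gate are
    // left alone, so applying the rewrite twice does not stack gates.
    static List<Node> injectGates(List<Node> nodes) {
        List<Node> result = new ArrayList<>();
        for (Node node : nodes) {
            boolean alreadyGated =
                    node.inputs.size() == 1 && node.inputs.get(0).kind == Kind.GATE;
            if (node.kind == Kind.SINK && !alreadyGated) {
                Node gate = new Node("gate-before-" + node.name, Kind.GATE);
                gate.inputs.addAll(node.inputs);
                node.inputs.clear();
                node.inputs.add(gate);
                result.add(gate);
            }
            result.add(node);
        }
        return result;
    }

    public static void main(String[] args) {
        Node source = new Node("source", Kind.SOURCE);
        Node enrich = new Node("enrich", Kind.OPERATOR, source);
        Node sink = new Node("my_sink", Kind.SINK, enrich);
        for (Node n : injectGates(List.of(source, enrich, sink))) {
            System.out.println(n.kind + " " + n.name);
        }
    }
}
```

Because the rewrite keys only on the node kind, the user's pipeline code stays
untouched and all upstream operators, including any stateful ones, run exactly
as before; only the hop into the sink changes.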

Thanks for driving this! Let me know how we can help.

Ryan van Huuksloot
Staff Engineer, Infrastructure | Streaming Platform


On Mon, Mar 16, 2026 at 2:21 AM Sergio Chong Loo <[email protected]>
wrote:

> Hi Ryan
>
>
> Thanks a lot for these details. Some of these observations did come up
> during our initial discussions, which is why our initial goal was to
> introduce this as simply as possible and gradually enhance it to cover gaps.
>
>
> Allow me to address your concerns:
>
>    1. I’m happy you stressed the point of “disruption to existing
>    pipelines”. However, there are a few concerns with building this
>    functionality into the sinks (or sources) right off the bat (read further
>    below for my alternative):
>       1. Kubernetes centric: as of now the Blue/Green Deployments support
>       is a Kubernetes specific solution, adding a mixin directly available to
>       sinks would “leak” this support outside of K8s
>       2. A sink being aware of these deployment phases violates single
>       responsibility, but more importantly…
>       3. Flink currently has many connectors, with the majority being
>       maintained outside of the Flink code base, by separate teams, in
>       separate repos, with separate release cycles. Adding support to every
>       potential Flink connector project out there would be cumbersome and
>       would complicate things significantly. Blue/Green Phase 2 would then
>       only work with "gate-aware" sinks.
>    2. I’d leave the conversation about migrating jobs between K8s
>    clusters outside of this scope, even Phase 1 is meant to only work in a
>    single cluster…
>    3. Watermarking, excellent point, it’s indeed a requirement so I’ll
>    make sure this is validated where applicable (by the concrete
>    implementation)
>
>
> Given point 1.1 above, I’m currently working on an approach that uses a
> “GateInjectorPipelineExecutor”, so to speak: a custom PipelineExecutor
> that would be shipped with the K8s Operator and selected through Flink
> configuration (via “execution.target:”). This custom piece would
> instantiate and inject the Gate at a fixed point in the StreamGraph right
> before job submission. I still have to validate that a few things are
> handled correctly (like Type Information, etc.), but the theory looks
> promising.
>
>
> For the most part this works well with Flink SQL (same configuration),
> here’s my estimation:
>
>
> tEnv.executeSql("INSERT INTO my_sink ...")
>     └─> SQL planner → ExecNodeGraph → Transformation[]
>           └─> StreamGraph
>                 └─> GateInjectorExecutor injects GateProcessFunction
>                       └─> StreamGraph' (mutated) → JobGraph
>                             └─> Submit Job
>
>
> I’m aiming to share some updates along these lines in the next few weeks,
> but hopefully this falls in line with your objectives/thoughts overall.
>
>
> Sergio
>
>
>
> On Mar 6, 2026, at 3:36 PM, Ryan van Huuksloot via dev <
> [email protected]> wrote:
>
> Hi Sergio,
> Thanks for starting this conversation.
>
> A few thoughts regarding BlueGreen Phase 2:
> 1. The Gate Operator is interesting but I don't like that we would have to
> modify users' pipelines for them to use Phase 2. This gate function seems
> like it could be a Mixin that connectors would implement. If you want to
> use Phase 2, your sinks must implement this Mixin. I understand that a
> unique GateFunction has pros, but it works less well with FlinkSQL - and
> the trade-off doesn't seem worthwhile.
> 2. Regarding the ConfigMap. We should consider a solution that supports
> migrating Flink jobs between Kubernetes clusters. Otherwise Phase 2 is only
> useful for in cluster operations.
> 3. Watermarking is a requirement. Will the Flink Kubernetes Operator
> validate that the pipeline is using watermarks?
>
> What happens when idleness is configured? Watermarks will get ignored from
> these “slow” subtasks and advance, could records from the ignored subtasks
> eventually be lost?
> Yes they would be lost, but that would happen irrespective of Phase 2.
>
> I'll have more thoughts after we discuss the Gate Operator, as that is
> crucial to the FLIP right now.
>
> Ryan van Huuksloot
> Staff Engineer, Infrastructure | Streaming Platform
>
>
> On Mon, Mar 2, 2026 at 6:52 PM Sergio Chong Loo <[email protected]>
> wrote:
>
> Bumping this (Advanced Blue/Green deployments - FLIP-504) thread after
> making some code adjustments.
>
> FYI @drossos <https://github.com/drossos> @ryanvanhuuksloot <
> https://github.com/ryanvanhuuksloot> I’d like to get your feedback since
> I know you’re interested in this feature.
>
> Thanks,
> - Sergio
>
>
> On Dec 5, 2025, at 2:31 PM, Sergio Chong Loo <[email protected]> wrote:
>
>
> Hi folks,
>
> FLIP-503 (already merged) introduced the Basic Blue/Green Deployment
> functionality to the Flink K8s Operator. It was very straightforward,
> simply transitioning to the second deployment once it's considered stable.
>
> FLIP-504 is an Advanced version added on top of 503 and brings about the
> notion of "record-level" coordination between the 2 deployments to have no
> data duplication and exactly once semantics while preserving a smooth
> transition.
>
>
> The main goals are:
>    • For the community to take a quick look at the current
>      functionality (previously mentioned at the Flink Forward 2025 Conference)
>    • To get feedback and improvement suggestions
>
> FLIP-504 details:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337677650
>
>
> Draft PR: https://github.com/apache/flink-kubernetes-operator/pull/1043
>
> Thank you!
> - Sergio
>

Re: FLIP-504: Blue/Green Deployments for Flink on Kubernetes - Phase 2 (WIP/Draft PR)