Draft PR)

Sergio Chong Loo Sun, 15 Mar 2026 23:22:04 -0700

Hi Ryan 

Thanks a lot for these details. For sure some of these observations popped up 
during our initial discussions, and that’s why our initial goal was to 
introduce this as simple as possible and gradually enhance it to cover gaps.

Allow me to address your concerns:
I’m happy you stressed the point of “disruption to existing pipelines”. 
However, there’s a few points about attempting to build this functionality into 
the sinks (or sources) right off the bat (read further below for my 
alternative):
Kubernetes centric: as of now the Blue/Green Deployments support is a 
Kubernetes specific solution, adding a mixin directly available to sinks would 
“leak” this support outside of K8s
A sink being aware of these deployment phases violates single responsibility, 
but more importantly…
Flink currently has many connectors, with the majority being maintained outside 
of the Flink code base, by separate teams, separate repos, separate release 
cycles. This would complicate things significantly as to try and add support 
for this for every potential flink connector project out there would be a 
cumbersome. Blue/Green Phase 2 then only would works with "gate-aware" sinks. 
I’d leave the conversation about migrating jobs between K8s clusters outside of 
this scope, even Phase 1 is meant to only work in a single cluster…
Watermarking, excellent point, it’s indeed a requirement so I’ll make sure this 
is validated where applicable (by the concrete implementation)

Having said what I said about point 1.1 above, I’m currently working on an 
approach which uses a “GateInjectorPipelineExecutor” so to speak; in other 
words a custom PipelineExecutor that would be shipped with the K8s Operator, 
invoked by Flink Configuration (via “execution.target:”). This custom piece 
would instantiate and inject the Gate at a fixed point in the StreamGraph right 
before job submission. I still have to validate and ensure a few things are 
correctly taken care of (like Type Information, etc.) but the theory looks 
promising.

For the most part this works well with Flink SQL (same configuration), here’s 
my estimation:

tEnv.executeSql("INSERT INTO my_sink ...")
    └─> SQL planner → ExecNodeGraph → Transformation[]
          └─> StreamGraph
                └─> GateInjectorExecutor injects GateProcessFunction
                      └─> StreamGraph' (mutated) → JobGraph
                            └─> Submit Job

I’m aiming to share some updates along these lines in the next few weeks but 
hopefully this falls inline with your objectives/thoughts overall.

Sergio

> On Mar 6, 2026, at 3:36 PM, Ryan van Huuksloot via dev <[email protected]> 
> wrote:
> 
> Hi Sergio,
> Thanks for starting this conversation.
> 
> A few thoughts regarding BlueGreen Phase 2:
> 1. The Gate Operator is interesting but I don't like that we would have to
> modify users' pipelines for them to use Phase 2. This gate function seems
> like it could be a Mixin that connectors would implement. If you want to
> use Phase 2, your sinks must implement this Mixin. I understand that a
> unique GateFunction has pros, but it works less well with FlinkSQL - and
> the trade-off doesn't seem worthwhile.
> 2. Regarding the ConfigMap. We should consider a solution that supports
> migrating Flink jobs between Kubernetes clusters. Otherwise Phase 2 is only
> useful for in cluster operations.
> 3. Watermarking is a requirement. Will the Flink Kubernetes Operator
> validate that the pipeline is using watermarks?
> 
>> What happens when idleness is configured? Watermarks will get ignored from
> these “slow” subtasks and advance, could records from the ignored subtasks
> eventually be lost?
> Yes they would be lost, but that would happen irrespective of Phase 2.
> 
> I'll have more thoughts after we discuss the Gate Operator, as that is
> crucial to the FLIP right now.
> 
> Ryan van Huuksloot
> Staff Engineer, Infrastructure | Streaming Platform
> [image: Shopify]
> <https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email>
> 
> 
> On Mon, Mar 2, 2026 at 6:52 PM Sergio Chong Loo <[email protected]> wrote:
> 
>> Bumping this (Advanced Blue/Green deployments - FLIP-504) thread after
>> making some code adjustments.
>> 
>> FYI @drossos <https://github.com/drossos> @ryanvanhuuksloot <
>> https://github.com/ryanvanhuuksloot> I’d like to get your feedback since
>> I know you’re interested in this feature.
>> 
>> Thanks,
>> - Sergio
>> 
>> 
>>> On Dec 5, 2025, at 2:31 PM, Sergio Chong Loo <[email protected]>
>> wrote:
>>> 
>>> Hi folks,
>>> 
>>> FLIP-503 (already merged) introduced the Basic Blue/Green Deployment
>> functionality to the Flink K8s Operator. It was very straightforward,
>> simply transitioning to the second deployment once it's considered stable.
>>> 
>>> FLIP-504 is an Advanced version added on top of 503 and brings about the
>> notion of "record-level" coordination between the 2 deployments to have no
>> data duplication and exactly once semantics while preserving a smooth
>> transition.
>>> 
>>> The main goals are:
>>>    • For the community to take a quick look at the current
>> functionality (previously mentioned at the Flink Forward 2025 Conference)
>>>    • To get feedback and improvement suggestions
>>> 
>>> Flip 504 details:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337677650
>>> 
>>> Draft PR: https://github.com/apache/flink-kubernetes-operator/pull/1043
>>> 
>>> Thank you!
>>> - Sergio
>>> 
>>> 
>> 
>>

Re: FLIP-504: Blue/Green Deployments for Flink on Kubernetes - Phase 2 (WIP/Draft PR)

Reply via email to