Hi all,

Thanks for the great discussion so far. I'd like to share some thoughts on
the RpcOperator.

**What RpcOperator is — and is not**

I think some of the confusion stems from viewing RpcOperator as a variant
of AsyncOperator. They are actually complementary and operate on different
planes:

- **RpcOperator** lives *outside* the job's DAG. It does not participate in
the DAG's data edges or shuffle — instead, it exposes an RPC endpoint that
serves inference requests. Think of it as a service-side primitive.
- **AsyncOperator** lives *inside* the DAG and acts as the *client* that
makes asynchronous calls to the RpcOperator. It is the consumer, not the
competitor.

In other words, they form a client-server pair: AsyncOperator calls,
RpcOperator serves. There is no conflict between them.
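
To make the client-server split concrete, here is a minimal sketch in plain Python asyncio — explicitly *not* the proposed Flink API; `rpc_operator` and `async_operator` are invented names. It only models the division of labor: the service-side endpoint serves inference requests, while the in-DAG client issues calls concurrently instead of blocking per record, much like Flink's async I/O pattern does today.

```python
import asyncio

async def rpc_operator(request: str) -> str:
    # Stands in for a GPU-backed inference endpoint living outside the DAG.
    await asyncio.sleep(0.01)  # simulate inference latency
    return f"inference({request})"

async def async_operator(records):
    # Stands in for the in-DAG client: fire off all calls concurrently
    # rather than one blocking call per record; results come back in order.
    return await asyncio.gather(*(rpc_operator(r) for r in records))

results = asyncio.run(async_operator(["a", "b", "c"]))
print(results)  # ['inference(a)', 'inference(b)', 'inference(c)']
```

In real Flink terms, the client half corresponds to the async I/O pattern we already have; the server half is the new primitive this FLIP proposes.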

**Benefits of RpcOperator**

A key design choice is that each RpcOperator instance forms its own Flink
task within an independent Pipelined Region. This is what unlocks the core
benefits:

- **Fault isolation**: A GPU operator failure only triggers recovery within
its own Region, without rolling back the entire job. For workloads where a
single inference can take minutes, avoiding global recomputation is
critical.
- **Independent scaling**: GPU operators can scale in/out independently —
new instances are registered to the load balancer only after model loading
completes; scaling in drains in-flight requests before releasing resources.
No global checkpoint or job restart is needed.
- **Flexible load balancing**: Because the GPU operators are decoupled from
the DAG topology, the runtime can distribute requests across instances
based on actual load rather than static partition assignment.
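
The scale-in/out protocol above can be sketched as simple load-balancer bookkeeping. This is a hypothetical illustration only — the class and method names are invented, not part of the FLIP's design:

```python
class LoadBalancer:
    """Invented sketch of the registration / drain protocol described above."""

    def __init__(self):
        self.ready = set()   # instances eligible for new requests
        self.inflight = {}   # instance -> outstanding request count

    def register(self, instance):
        # Called only AFTER the instance finishes loading its model.
        self.ready.add(instance)
        self.inflight.setdefault(instance, 0)

    def dispatch(self, request):
        # Least-loaded routing instead of static partition assignment.
        instance = min(self.ready, key=lambda i: self.inflight[i])
        self.inflight[instance] += 1
        return instance

    def complete(self, instance):
        self.inflight[instance] -= 1

    def drain(self, instance):
        # Scale-in: stop routing new requests; the instance may release its
        # resources only once in-flight requests have finished.
        self.ready.discard(instance)
        return self.inflight[instance] == 0
```

The point of the sketch is the ordering: registration happens after model load, and resource release waits on `drain` returning true — no global checkpoint or job restart in either direction.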

**Why inside the Flink runtime, not externalized to K8s Operator?**

Yaroslav suggested managing GPU services as a separate sub-project,
potentially via the Kubernetes Operator. I think this is a fair point,
but there are strong reasons to keep RpcOperator inside the runtime:

1. **Reuse of existing machinery**: By forming standard Flink tasks within
Pipelined Regions, RpcOperator directly reuses Flink's resource scheduling,
checkpoint coordination, and failover recovery — all battle-tested
capabilities. Externalizing this would require rebuilding these mechanisms
outside of Flink, which adds complexity rather than reducing it.
2. **Lifecycle binding**: RpcOperator services are tied to the job
lifecycle. They start with the job, scale with the workload, and shut down
when the job ends. Users don't need to separately provision, monitor, or
tear down inference services. Compare this with the current pattern of
deploying standalone inference services alongside Flink — users must handle
capacity planning, elastic scaling, failure recovery, and lifecycle
synchronization across two independent systems.
3. **Unified failure domain**: When the inference service is external, a
GPU failure and a Flink job failure are two separate incidents requiring
separate investigation. With RpcOperator, the runtime coordinates recovery
automatically.

Thanks,
Zhu

Martijn Visser <[email protected]> wrote on Thursday, April 30, 2026 at 17:48:

> Hi all,
>
> First off, thanks for putting this together. I'd like to raise a couple of
> concerns on top of what Yaroslav and Robert already raised.
>
> 1. The "AI-Native" framing in my opinion bundles things that don't belong
> together. Areas like pipeline region checkpoints, UAC enhancements and
> non-disruptive scaling are general engine improvements that should be
> evaluated on their own, and not bundled.
> 2. Flink's feature matrix is already sparse, and I think that expanding
> with this feature set is going to make it even sparser and actually worsen
> the developer experience, because features won't be implemented everywhere.
> 3. I think we must have a high bar for "must be inside Flink". RpcOperator
> looks like service orchestration, which Kubernetes operators and others
> already handle. Also looking back at Flink ML, we have seen that we started
> with something and then later it got abandoned. I'd like to understand why
> this proposal would be different here.
> 4. The motivation makes an assertion that workloads are rapidly evolving,
> but there's no evidence for this in the FLIP. I think such a large FLIP
> requires more evidence before committing to this scope as a community.
>
> Best regards,
>
> Martijn
>
> On Thu, Apr 30, 2026 at 9:58 AM Guowei Ma <[email protected]> wrote:
>
> > Hi David,
> >
> > Thanks for the careful review — these questions help sharpen the
> umbrella.
> > Let me respond point by point.
> >
> > 1. Layer 3 dependencies on Layer 1/2
> >
> > I broadly agree. The pure CPU-side checkpoint enhancements in Layer 3
> > (Pipeline Region independent checkpoint, Unaligned Checkpoint
> improvements)
> > can indeed be advanced independently. The parts of Layer 3 related to GPU
> > elastic scaling do depend on RpcOperator from Layer 1, but this is a
> local
> > dependency and doesn't affect the parallelism of Layer 3 as a whole. I'll
> > make the dependency relationships more explicit in the next revision of
> the
> > umbrella.
> >
> > 2. "Could Flink become like R or SPSS?" — and how this compares to
> existing
> > solutions
> >
> > This is a question worth answering directly, because it goes to Flink's
> > positioning in AI data processing.
> >
> > Direct answer: no. R and SPSS are single-machine interactive statistical
> > analysis tools, targeted at statisticians and researchers. Flink is
> > distributed data infrastructure, targeted at engineers building
> production
> > data pipelines. These aren't in the same lane.
> >
> > The more relevant comparisons are Daft and Ray Data — these are the most
> > active distributed systems in AI data processing today. Flink's
> > differentiation shows up on two levels.
> >
> > On the ecosystem side: Flink's core strength is its streaming +
> checkpoint
> > machinery, which matters especially in inference scenarios. A single
> > inference can take seconds to minutes, so the cost of failover is far
> > higher than in traditional batch data processing, and fine-grained fault
> > tolerance directly determines production viability. We've seen users put
> > significant additional engineering effort into fault tolerance for
> > inference workloads on other systems, and even so the result is less
> > systematic and less complete than what Flink already provides — this is
> the
> > accumulation of more than a decade of streaming engine work, and not
> > something easily caught up with in the short term.
> >
> > On ecosystem position: Flink is already data infrastructure inside many
> > enterprises, widely used for ETL, CDC, and real-time analytics.
> Integrating
> > AI inference directly into existing Flink pipelines is far cheaper than
> > spinning up a separate stack on Ray or Daft — especially when multimodal
> > data is already flowing through Flink. The AI-Native evolution makes this
> > integration a natural extension, rather than asking users to switch to a
> > different technology stack.
> >
> > 3. RpcOperator: where it's deployed, how remote works
> >
> > Let me first clarify one point (this also addresses the same confusion in
> > Robert's email): RpcOperator is not a specialization of async io, it's a
> > deployment primitive — it splits GPU compute out of the data-plane
> topology
> > so it can be independently scheduled, scaled, and recovered as a service.
> > CPU operators can still use async semantics when calling it.
> >
> > In terms of deployment shape, the mechanics for bringing up an
> RpcOperator,
> > managing its lifecycle, and scheduling its resources are similar to how
> > Flink launches a TaskManager today — reusing existing Flink runtime
> > capabilities rather than introducing a new infrastructure layer. The
> > concrete protocols and implementation details will be developed in the
> > sub-FLIP.
> >
> > Typical use cases are GPU inference services such as LLM inference and
> > multimodal model inference — compute units that have their own resource
> > profile and scaling curve, distinct from those of the main data flow, and
> > therefore well-suited to being deployed as independent services.
> >
> > 4. Multimodal types — deser is the core
> >
> > You said "the interesting part will be the deser" — that's exactly right,
> > and deser is indeed the core of this sub-FLIP's design. The Object
> > Reference mechanism is a partial answer here — large objects are
> serialized
> > only once and passed through the pipeline by reference, which
> significantly
> > reduces the SerDes burden in multimodal scenarios. The detailed SerDes
> > design will be developed in the sub-FLIP.
> >
> > 5. Built-in operators: library-by-library integration
> >
> > This is a very pragmatic suggestion, and it's exactly the direction we're
> > going in. Our goal is not to reinvent the wheel, but to use the new
> > mechanisms to bring existing AI libraries smoothly into Flink, rather
> than
> > forcing users to write call-by-call wrappers. The layering is:
> >
> >    - Built-in operators cover the most common operations that can benefit
> >    from framework-level optimization (model sharing, batching, GPU
> resource
> >    pooling, etc.); these will be built on top of existing libraries
> rather
> >    than reimplemented from scratch.
> >    - The UDF system lets users import any Python library directly,
> without
> >    needing a dedicated operator per call.
> >
> > The two are complementary: common operations that benefit from framework
> > optimization go through built-ins; long-tail and custom needs go through
> > UDFs. The concrete set of built-in operators, the library integration
> > approach, and the specific framework-level optimization points will be
> > addressed in "FLIP-XXX: Built-in Multimodal Operators and AI Functions".
> >
> > 6. Columnar: whether it's exposed in SQL
> >
> > Your reading is correct — in the first phase we don't plan to expose the
> > columnar format at the SQL layer. Columnar execution is an internal
> engine
> > optimization, and SQL remains row-based at the logical level.
> >
> > Whether UDFs should directly produce/consume columnar data (Arrow /
> NumPy)
> > is something we're keeping open and can discuss further down the line —
> it
> > depends on how strong the vectorization need is in concrete scenarios,
> and
> > on the impact on user programming model complexity.
> >
> > Thanks again — your feedback meaningfully improves the clarity of the
> > umbrella in several places.
> >
> >
> > Best, Guowei
> >
> >
> > On Wed, Apr 29, 2026 at 12:02 AM David Radley <[email protected]>
> > wrote:
> >
> > > Hi Guowei,
> > > This is an interesting proposal. I second Robert's questions. Some
> > > thoughts.
> > >
> > > Layer 3 does not depend on layers 1 and 2 I think. At the high level I
> > > wonder, is the idea that Flink could become like an R ML pipeline or
> > SPSS?
> > > It would be good to compare existing technology solutions and what
> > benefits
> > > Flink will bring to these scenarios.
> > > FLIP-XXX: Supporting RpcOperator — Independently Deployed and Scaled
> RPC
> > > Service Operators - see Robert's comment. I assume this is a
> > specialization
> > > of the async io operator for RPC. When you say deploying RPC services
> > that
> > > are fully managed by the Flink runtime, where would these be deployed?
> If
> > > it is remote how would this work? It would be interesting to see some
> use
> > > cases where Flink would be deploying RPC services that it has created.
> > > FLIP-XXX: Multimodal Data Type System and Object Reference Mechanism
> > >
> > > I like the idea of adding these types - the interesting part will be
> the
> > > deser.
> > >
> > > FLIP-XXX: A More Pythonic DataFrame API for Python Users - this makes
> > sense
> > > FLIP-XXX: Connector API for Multimodal Data Source/Sink - I assume this
> > > will be renamed new multimodal formats. Are there existing registries
> > that
> > > these could be looked in - similar to schema registry - so we can bring
> > in
> > > artifacts via metadata?
> > > FLIP-XXX: Built-in Multimodal Operators and AI Functions - I wonder if
> we
> > > could bring in existing implementation libraries and the new work would
> > > allow us to call them from Flink. i.e. not having to do them one call
> by
> > > call but library by library.
> > > FLIP-XXX: Columnar Data Transport and Processing Optimization - this
> > seems
> > > a big change, events as columns rather than events as rows or CDC
> > > sequences. I assume this would not be exposed in SQL?
> > >
> > > Kind regards, David.
> > >
> > > From: Robert Metzger <[email protected]>
> > > Date: Tuesday, 28 April 2026 at 07:38
> > > To: [email protected] <[email protected]>
> > > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-577: AI-Native Flink — An
> Umbrella
> > > Proposal for Multimodal Data Processing
> > >
> > > Hey Guowei,
> > >
> > > Thanks for the proposal. I just took a brief look, here are some high
> > level
> > > questions:
> > >
> > > Regarding the RPC Operator: What is the difference from the async I/O
> > > operator we already have?
> > >
> > > "Connector API for Multimodal Data Source/Sink": Why do we need to
> touch
> > > the connector API for supporting multimodal data? Isn't this more of a
> > > formats concern?
> > >
> > > "Non-Disruptive Scaling for CPU Operators": How do you want to
> guarantee
> > > exactly-once on that kind of scaling? E.g. you need to somehow make a
> > > handover between the old and the new pipeline.
> > >
> > > Overall, I find the proposal has some things which seem related to
> making
> > > Flink more AI native, but other changes seem orthogonal to that. For
> > > example the checkpoint or scaling changes are actually unrelated to AI,
> > and
> > > just engine improvements.
> > >
> > >
> > > On Tue, Apr 28, 2026 at 5:48 AM Guowei Ma <[email protected]>
> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I'd like to start a discussion on an umbrella FLIP[1] that lays out a
> > > > direction for evolving Flink into a data engine that natively
> supports
> > AI
> > > > workloads.
> > > >
> > > > The short version: user workloads are shifting from BI analytics to
> > > > multimodal data processing centered on model inference, and this
> > triggers
> > > > cascading changes across the stack — multimodal data flowing through
> > > > pipelines, heterogeneous CPU/GPU resources, vectorized execution, and
> > > > inference tasks that run for seconds to minutes on Spot instances.
> The
> > > > proposal sketches an evolution along five directions (development
> > > paradigm,
> > > > data model, heterogeneous resources, execution engine, fault
> > tolerance),
> > > > decomposed into 11 sub-FLIPs organized into three layers: core
> runtime
> > > > primitives, AI workload expression and execution, and
> production-grade
> > > > operational guarantees. Most sub-FLIPs have no hard dependencies on
> > each
> > > > other and can be advanced in parallel.
> > > >
> > > > A note on scope, since it's an umbrella:
> > > >
> > > > - In scope here: whether the evolution directions are reasonable,
> > whether
> > > > each sub-FLIP's motivation and proposed approach are well-founded,
> and
> > > > whether the boundaries and dependencies between sub-FLIPs are clear.
> > > > - Out of scope here: detailed designs, API specifics, and
> > implementation
> > > > plans of individual sub-FLIPs — those will go through their own
> FLIPs.
> > > > - Consensus criteria: agreement on the overall direction is
> sufficient
> > > for
> > > > the umbrella to pass; passing it does not lock in any sub-FLIP's
> > design —
> > > > sub-FLIPs may still be adjusted, deferred, or withdrawn as they
> > progress.
> > > >
> > > > All proposed changes are incremental — no existing API or behavior is
> > > > removed or altered. Compatibility details are covered at the end of
> the
> > > > document.
> > > >
> > > > Looking forward to your feedback on the overall direction and the
> > > layering.
> > > >
> > > > [1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275
> > > >
> > > > Thanks,
> > > > Guowei
> > > >
> > >
> > > Unless otherwise stated above:
> > >
> > > IBM United Kingdom Limited
> > > Registered in England and Wales with number 741598
> > > Registered office: Building C, IBM Hursley Office, Hursley Park Road,
> > > Winchester, Hampshire SO21 2JN
> > >
> >
>
