Hi David,

Thanks for the careful review — these questions help sharpen the umbrella.
Let me respond point by point.

1. Layer 3 dependencies on Layer 1/2

I broadly agree. The pure CPU-side checkpoint enhancements in Layer 3
(Pipeline Region independent checkpoint, Unaligned Checkpoint improvements)
can indeed be advanced independently. The parts of Layer 3 related to GPU
elastic scaling do depend on RpcOperator from Layer 1, but this is a local
dependency and doesn't affect the parallelism of Layer 3 as a whole. I'll
make the dependency relationships more explicit in the next revision of the
umbrella.

2. "Could Flink become like R or SPSS?" — and how this compares to existing
solutions

This is a question worth answering directly, because it goes to Flink's
positioning in AI data processing.

Direct answer: no. R and SPSS are single-machine interactive statistical
analysis tools, targeted at statisticians and researchers. Flink is
distributed data infrastructure, targeted at engineers building production
data pipelines. These aren't in the same lane.

The more relevant comparisons are Daft and Ray Data — these are the most
active distributed systems in AI data processing today. Flink's
differentiation shows up on two levels.

On the engine side: Flink's core strength is its streaming + checkpoint
machinery, which matters especially in inference scenarios. A single
inference can take seconds to minutes, so the cost of failover is far
higher than in traditional batch data processing, and fine-grained fault
tolerance directly determines production viability. We've seen users put
significant additional engineering effort into fault tolerance for
inference workloads on other systems, and even so the result is less
systematic and less complete than what Flink already provides — this is the
accumulation of more than a decade of streaming engine work, and not
something easily caught up with in the short term.

On ecosystem position: Flink is already data infrastructure inside many
enterprises, widely used for ETL, CDC, and real-time analytics. Integrating
AI inference directly into existing Flink pipelines is far cheaper than
spinning up a separate stack on Ray or Daft — especially when multimodal
data is already flowing through Flink. The AI-Native evolution makes this
integration a natural extension, rather than asking users to switch to a
different technology stack.

3. RpcOperator: where it's deployed, how remote works

Let me first clarify one point (this also addresses the same confusion in
Robert's email): RpcOperator is not a specialization of Flink's Async I/O
operator; it is a deployment primitive. It splits GPU compute out of the
data-plane topology so that it can be independently scheduled, scaled, and
recovered as a service. CPU operators can still use async semantics when
calling it.

In terms of deployment shape, the mechanics for bringing up an RpcOperator,
managing its lifecycle, and scheduling its resources are similar to how
Flink launches a TaskManager today — reusing existing Flink runtime
capabilities rather than introducing a new infrastructure layer. The
concrete protocols and implementation details will be developed in the
sub-FLIP.

Typical use cases are GPU inference services such as LLM inference and
multimodal model inference — compute units that have their own resource
profile and scaling curve, distinct from those of the main data flow, and
therefore well-suited to being deployed as independent services.
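
To make the shape concrete, here is a toy sketch in plain Python. Every
name in it is invented for this email; none of it is the proposed API,
which will be defined in the sub-FLIP:

```python
# Illustrative sketch only: every class and method name here is invented.
# The concrete RpcOperator API and protocol land in the sub-FLIP.
import asyncio

class RpcInferenceService:
    """Stands in for a GPU service deployed outside the data-plane
    topology, with its own replica count and scaling curve."""
    def __init__(self, replicas: int):
        self.replicas = replicas

    def rescale(self, replicas: int) -> None:
        # Rescaling the service does not touch the data-plane pipeline.
        self.replicas = replicas

    async def infer(self, record: str) -> str:
        await asyncio.sleep(0)  # placeholder for a seconds-long GPU call
        return f"label({record})"

class CpuMapOperator:
    """A CPU-side operator that calls the remote service with async
    semantics, so many records are in flight concurrently."""
    def __init__(self, service: RpcInferenceService):
        self.service = service

    async def process(self, records):
        # Issue all RPCs concurrently instead of blocking per record.
        return await asyncio.gather(*(self.service.infer(r) for r in records))

service = RpcInferenceService(replicas=2)
operator = CpuMapOperator(service)
results = asyncio.run(operator.process(["frame-1", "frame-2"]))
service.rescale(8)  # GPU side scales out; the CPU topology is untouched
```

The point of the sketch is the last line: the service rescales on its own
curve while the CPU-side pipeline keeps its topology.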

4. Multimodal types — deser is the core

You said "the interesting part will be the deser" — that's exactly right,
and deser is indeed the core of this sub-FLIP's design. The Object
Reference mechanism is a partial answer here — large objects are serialized
only once and passed through the pipeline by reference, which significantly
reduces the SerDes burden in multimodal scenarios. The detailed SerDes
design will be developed in the sub-FLIP.
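
As a rough intuition (toy model only; the names below are invented and not
the sub-FLIP's API), the mechanism behaves like this:

```python
# Toy model of the Object Reference idea: serialize a large payload once,
# pass only a small reference through the pipeline, materialize at the
# consumer. Class and method names are illustrative, not the sub-FLIP's API.
import pickle

class ObjectStore:
    def __init__(self):
        self._blobs = {}
        self.serialize_count = 0  # how many times we paid the SerDes cost

    def put(self, obj) -> str:
        """Serialize a large object exactly once; hand back a small ref."""
        self.serialize_count += 1
        ref = f"ref-{len(self._blobs)}"
        self._blobs[ref] = pickle.dumps(obj)
        return ref

    def get(self, ref: str):
        """Materialize the object only where it is actually consumed."""
        return pickle.loads(self._blobs[ref])

store = ObjectStore()
image = b"\x89PNG" + b"\x00" * 1_000_000  # a "large" multimodal payload
ref = store.put(image)

# Intermediate operators forward the tiny reference string, not the blob,
# so no hop re-serializes the payload.
for _stage in range(5):
    passed_along = ref

restored = store.get(passed_along)
```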

5. Built-in operators: library-by-library integration

This is a very pragmatic suggestion, and it's exactly the direction we're
going in. Our goal is not to reinvent the wheel, but to use the new
mechanisms to bring existing AI libraries smoothly into Flink, rather than
forcing users to write call-by-call wrappers. The layering is:

   - Built-in operators cover the most common operations that can benefit
   from framework-level optimization (model sharing, batching, GPU resource
   pooling, etc.); these will be built on top of existing libraries rather
   than reimplemented from scratch.
   - The UDF system lets users import any Python library directly, without
   needing a dedicated operator per call.

The two are complementary: common operations that benefit from framework
optimization go through built-ins; long-tail and custom needs go through
UDFs. The concrete set of built-in operators, the library integration
approach, and the specific framework-level optimization points will be
addressed in "FLIP-XXX: Built-in Multimodal Operators and AI Functions".
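
A toy sketch of why batching is worth doing at the framework level rather
than in user code (invented names, purely illustrative):

```python
# Contrast between the two layers: a per-record UDF pays one model call
# per element, while a framework-level built-in can batch transparently.
# All names are invented for this email.
calls = {"n": 0}

def model(batch):
    calls["n"] += 1  # count invocations (the expensive GPU round trip)
    return [f"pred({x})" for x in batch]

def udf_style(records):
    # UDF path: user code sees one record at a time.
    return [model([r])[0] for r in records]

def builtin_style(records, batch_size=4):
    # Built-in path: the framework groups records before the model call.
    out = []
    for i in range(0, len(records), batch_size):
        out.extend(model(records[i:i + batch_size]))
    return out

records = list(range(8))
udf_out = udf_style(records)
udf_calls = calls["n"]
calls["n"] = 0
builtin_out = builtin_style(records)
builtin_calls = calls["n"]
```

Same outputs, a quarter of the model calls; model sharing and GPU pooling
follow the same pattern of optimizations only the framework can see.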

6. Columnar: whether it's exposed in SQL

Your reading is correct — in the first phase we don't plan to expose the
columnar format at the SQL layer. Columnar execution is an internal engine
optimization, and SQL remains row-based at the logical level.

Whether UDFs should directly produce/consume columnar data (Arrow / NumPy)
is something we're keeping open and can discuss further down the line — it
depends on how strong the vectorization need is in concrete scenarios, and
on the impact on user programming model complexity.
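
A toy illustration of what "engine-internal columnar" means, with plain
Python lists standing in for Arrow batches (purely illustrative):

```python
# The logical model stays row-based; the engine may transpose to columns
# for a vectorized kernel and transpose back at the operator boundary.
rows = [{"id": 1, "score": 0.2},
        {"id": 2, "score": 0.8},
        {"id": 3, "score": 0.5}]

# Engine-internal columnar layout (what an Arrow RecordBatch would hold).
cols = {k: [r[k] for r in rows] for k in rows[0]}

# A vectorized kernel touches a whole column at once.
cols["score"] = [s * 2 for s in cols["score"]]

# Back to rows at the boundary: SQL semantics remain row-based.
out = [dict(zip(cols, vals)) for vals in zip(*cols.values())]
```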

Thanks again — your feedback meaningfully improves the clarity of the
umbrella in several places.


Best,
Guowei


On Wed, Apr 29, 2026 at 12:02 AM David Radley <[email protected]>
wrote:

> Hi Guowei,
> This is an interesting proposal. I second Roberts questions. Some thoughts.
>
> Layer 3 does not depend on layers 1 and 2 I think. At the high level I
> wonder, is the idea that Flink could become like an R ML pipeline or SPSS?
> It would be good to compare existing technology solutions and what benefits
> Flink will bring to these scenarios.
> FLIP-XXX: Supporting RpcOperator — Independently Deployed and Scaled RPC
> Service Operators - see Robert's comment. I assume this is a specialization
> of the async io operator for RPC. When you say deploying RPC services that
> are fully managed by the Flink runtime, where would these be deployed? If
> it is remote how would this work? It would be interesting to see some use
> cases where Flink would be deploying RPC services that it has created.
> FLIP-XXX: Multimodal Data Type System and Object Reference Mechanism
>
> I like the idea of adding these types - the interesting part will be the
> deser.
>
> FLIP-XXX: A More Pythonic DataFrame API for Python Users - this makes sense
> FLIP-XXX: Connector API for Multimodal Data Source/Sink - I assume this
> will be renamed new multimodal formats. Are there existing registries that
> these could be looked in - similar to schema registry - so we can bring in
> artifacts via metadata?
> FLIP-XXX: Built-in Multimodal Operators and AI Functions - I wonder if we
> could bring in existing implementation libraries and the new work would
> allow us to call them from Flink. i.e. not having to do them one call by
> call but library by library.
> FLIP-XXX: Columnar Data Transport and Processing Optimization - this seems
> a big change, events as columns rather than events as rows or CDC
> sequences. I assume this would not be exposed in SQL?
>
>  kind regards, David.
>
> From: Robert Metzger <[email protected]>
> Date: Tuesday, 28 April 2026 at 07:38
> To: [email protected] <[email protected]>
> Subject: [EXTERNAL] Re: [DISCUSS] FLIP-577: AI-Native Flink — An Umbrella
> Proposal for Multimodal Data Processing
>
> Hey Guowei,
>
> Thanks for the proposal. I just took a brief look, here are some high level
> questions:
>
> Regarding the RPC Operator: What is the difference to the async io operator
> we have already?
>
> "Connector API for Multimodal Data Source/Sink": Why do we need to touch
> the connector API for supporting multimodal data? Isn't this more of a
> formats concern?
>
> "Non-Disruptive Scaling for CPU Operators": How do you want to guarantee
> exactly-once on that kind of scaling? E.g. you need to somehow make a
> handover between the old and new new pipeline
>
> Overall, I find the proposal has some things which seem related to making
> Flink more AI native, but other changes seem orthogonal to that. For
> example the checkpoint or scaling changes are actually unrelated to AI, and
> just engine improvements.
>
>
> On Tue, Apr 28, 2026 at 5:48 AM Guowei Ma <[email protected]> wrote:
>
> > Hi everyone,
> >
> > I'd like to start a discussion on an umbrella FLIP[1] that lays out a
> > direction for evolving Flink into a data engine that natively supports AI
> > workloads.
> >
> > The short version: user workloads are shifting from BI analytics to
> > multimodal data processing centered on model inference, and this triggers
> > cascading changes across the stack — multimodal data flowing through
> > pipelines, heterogeneous CPU/GPU resources, vectorized execution, and
> > inference tasks that run for seconds to minutes on Spot instances. The
> > proposal sketches an evolution along five directions (development
> paradigm,
> > data model, heterogeneous resources, execution engine, fault tolerance),
> > decomposed into 11 sub-FLIPs organized into three layers: core runtime
> > primitives, AI workload expression and execution, and production-grade
> > operational guarantees. Most sub-FLIPs have no hard dependencies on each
> > other and can be advanced in parallel.
> >
> > A note on scope, since it's an umbrella:
> >
> > - In scope here: whether the evolution directions are reasonable, whether
> > each sub-FLIP's motivation and proposed approach are well-founded, and
> > whether the boundaries and dependencies between sub-FLIPs are clear.
> > - Out of scope here: detailed designs, API specifics, and implementation
> > plans of individual sub-FLIPs — those will go through their own FLIPs.
> > - Consensus criteria: agreement on the overall direction is sufficient
> for
> > the umbrella to pass; passing it does not lock in any sub-FLIP's design —
> > sub-FLIPs may still be adjusted, deferred, or withdrawn as they progress.
> >
> > All proposed changes are incremental — no existing API or behavior is
> > removed or altered. Compatibility details are covered at the end of the
> > document.
> >
> > Looking forward to your feedback on the overall direction and the
> layering.
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275
> >
> > Thanks,
> > Guowei
> >
>
> Unless otherwise stated above:
>
> IBM United Kingdom Limited
> Registered in England and Wales with number 741598
> Registered office: Building C, IBM Hursley Office, Hursley Park Road,
> Winchester, Hampshire SO21 2JN
>
