Hi all,

First off, thanks for putting this together. I'd like to raise a few concerns on top of what Yaroslav and Robert already raised.
1. In my opinion, the "AI-Native" framing bundles things that don't belong together. Areas like pipeline region checkpoints, unaligned checkpoint (UAC) enhancements, and non-disruptive scaling are general engine improvements that should be evaluated on their own merits, not bundled.

2. Flink's feature matrix is already sparse, and I think expanding it with this feature set will make it sparser still and worsen the developer experience, because features won't be implemented everywhere.

3. I think we must have a high bar for "must be inside Flink". RpcOperator looks like service orchestration, which Kubernetes operators and others already handle. Looking back at Flink ML, we started something similar there and it was later abandoned; I'd like to understand why this proposal would be different.

4. The motivation asserts that workloads are rapidly evolving, but there's no evidence for this in the FLIP. I think such a large FLIP requires more evidence before the community commits to this scope.

Best regards,
Martijn

On Thu, Apr 30, 2026 at 9:58 AM Guowei Ma <[email protected]> wrote:
> Hi David, > > Thanks for the careful review — these questions help sharpen the umbrella. > Let me respond point by point. > > 1. Layer 3 dependencies on Layer 1/2 > > I broadly agree. The pure CPU-side checkpoint enhancements in Layer 3 > (Pipeline Region independent checkpoint, Unaligned Checkpoint improvements) > can indeed be advanced independently. The parts of Layer 3 related to GPU > elastic scaling do depend on RpcOperator from Layer 1, but this is a local > dependency and doesn't affect the parallelism of Layer 3 as a whole. I'll > make the dependency relationships more explicit in the next revision of the > umbrella. > > 2. "Could Flink become like R or SPSS?" — and how this compares to existing > solutions > > This is a question worth answering directly, because it goes to Flink's > positioning in AI data processing. > > Direct answer: no.
R and SPSS are single-machine interactive statistical > analysis tools, targeted at statisticians and researchers. Flink is > distributed data infrastructure, targeted at engineers building production > data pipelines. These aren't in the same lane. > > The more relevant comparisons are Daft and Ray Data — these are the most > active distributed systems in AI data processing today. Flink's > differentiation shows up on two levels. > > On the engine side: Flink's core strength is its streaming + checkpoint > machinery, which matters especially in inference scenarios. A single > inference can take seconds to minutes, so the cost of failover is far > higher than in traditional batch data processing, and fine-grained fault > tolerance directly determines production viability. We've seen users put > significant additional engineering effort into fault tolerance for > inference workloads on other systems, and even so the result is less > systematic and less complete than what Flink already provides — this is the > accumulation of more than a decade of streaming engine work, and not > something easily caught up with in the short term. > > On the ecosystem side: Flink is already data infrastructure inside many > enterprises, widely used for ETL, CDC, and real-time analytics. Integrating > AI inference directly into existing Flink pipelines is far cheaper than > spinning up a separate stack on Ray or Daft — especially when multimodal > data is already flowing through Flink. The AI-Native evolution makes this > integration a natural extension, rather than asking users to switch to a > different technology stack. > > 3. RpcOperator: where it's deployed, how remote works > > Let me first clarify one point (this also addresses the same confusion in > Robert's email): RpcOperator is not a specialization of async I/O; it's a > deployment primitive — it splits GPU compute out of the data-plane topology > so it can be independently scheduled, scaled, and recovered as a service.
> CPU operators can still use async semantics when calling it. > > In terms of deployment shape, the mechanics for bringing up an RpcOperator, > managing its lifecycle, and scheduling its resources are similar to how > Flink launches a TaskManager today — reusing existing Flink runtime > capabilities rather than introducing a new infrastructure layer. The > concrete protocols and implementation details will be developed in the > sub-FLIP. > > Typical use cases are GPU inference services such as LLM inference and > multimodal model inference — compute units that have their own resource > profile and scaling curve, distinct from those of the main data flow, and > therefore well-suited to being deployed as independent services. > > 4. Multimodal types — deser is the core > > You said "the interesting part will be the deser" — that's exactly right, > and deser is indeed the core of this sub-FLIP's design. The Object > Reference mechanism is a partial answer here — large objects are serialized > only once and passed through the pipeline by reference, which significantly > reduces the SerDes burden in multimodal scenarios. The detailed SerDes > design will be developed in the sub-FLIP. > > 5. Built-in operators: library-by-library integration > > This is a very pragmatic suggestion, and it's exactly the direction we're > going in. Our goal is not to reinvent the wheel, but to use the new > mechanisms to bring existing AI libraries smoothly into Flink, rather than > forcing users to write call-by-call wrappers. The layering is: > > - Built-in operators cover the most common operations that can benefit > from framework-level optimization (model sharing, batching, GPU resource > pooling, etc.); these will be built on top of existing libraries rather > than reimplemented from scratch. > - The UDF system lets users import any Python library directly, without > needing a dedicated operator per call. 
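To make the Object Reference idea from point 4 a bit more tangible, here is a deliberately simplified, non-normative Python sketch. Everything in it (the ObjectStore name, the integer refs, the in-memory dict) is hypothetical illustration only; the actual reference format, storage backend, and lifecycle management are exactly what the sub-FLIP will have to design:

```python
# Hypothetical sketch of the Object Reference mechanism: a large
# multimodal payload is serialized into a store exactly once, and
# records flowing between operators carry only a small reference.

class ObjectStore:
    """Toy in-memory store; names and layout are illustrative only."""

    def __init__(self):
        self._objects = {}
        self._next_ref = 0

    def put(self, payload: bytes) -> int:
        # The payload is stored (serialized) a single time.
        ref = self._next_ref
        self._next_ref += 1
        self._objects[ref] = payload
        return ref

    def get(self, ref: int) -> bytes:
        # Downstream operators resolve the reference lazily.
        return self._objects[ref]


store = ObjectStore()
image = b"\x89PNG" + b"\x00" * 1_000_000  # stand-in for a large blob

ref = store.put(image)

# Ten records share the same payload by reference; only the small
# integer ref travels with each record through the pipeline.
records = [{"id": i, "image_ref": ref} for i in range(10)]

# An operator that actually needs the bytes resolves the reference.
resolved = store.get(records[0]["image_ref"])
assert resolved == image
```

The point is only the shape of the mechanism: the megabyte-scale payload crosses the SerDes boundary once, while each record carries a few bytes of reference.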
> > The two are complementary: common operations that benefit from framework > optimization go through built-ins; long-tail and custom needs go through > UDFs. The concrete set of built-in operators, the library integration > approach, and the specific framework-level optimization points will be > addressed in "FLIP-XXX: Built-in Multimodal Operators and AI Functions". > > 6. Columnar: whether it's exposed in SQL > > Your reading is correct — in the first phase we don't plan to expose the > columnar format at the SQL layer. Columnar execution is an internal engine > optimization, and SQL remains row-based at the logical level. > > Whether UDFs should directly produce/consume columnar data (Arrow / NumPy) > is something we're keeping open and can discuss further down the line — it > depends on how strong the vectorization need is in concrete scenarios, and > on the impact on user programming model complexity. > > Thanks again — your feedback meaningfully improves the clarity of the > umbrella in several places. > > > Best, Guowei > > > On Wed, Apr 29, 2026 at 12:02 AM David Radley <[email protected]> > wrote: > > > Hi Guowei, > > This is an interesting proposal. I second Robert's questions. Some > > thoughts. > > > > Layer 3 does not depend on layers 1 and 2 I think. At the high level I > > wonder, is the idea that Flink could become like an R ML pipeline or SPSS? > > It would be good to compare existing technology solutions and what benefits > > Flink will bring to these scenarios. > > FLIP-XXX: Supporting RpcOperator — Independently Deployed and Scaled RPC > > Service Operators - see Robert's comment. I assume this is a specialization > > of the async io operator for RPC. When you say deploying RPC services that > > are fully managed by the Flink runtime, where would these be deployed? If > > it is remote how would this work? It would be interesting to see some use > > cases where Flink would be deploying RPC services that it has created.
> > FLIP-XXX: Multimodal Data Type System and Object Reference Mechanism > > > > I like the idea of adding these types - the interesting part will be the > > deser. > > > > FLIP-XXX: A More Pythonic DataFrame API for Python Users - this makes > > sense > > FLIP-XXX: Connector API for Multimodal Data Source/Sink - I assume this > > will be renamed to new multimodal formats. Are there existing registries that > > these could be looked up in - similar to schema registry - so we can bring in > > artifacts via metadata? > > FLIP-XXX: Built-in Multimodal Operators and AI Functions - I wonder if we > > could bring in existing implementation libraries and the new work would > > allow us to call them from Flink, i.e. not having to do them call by > > call but library by library. > > FLIP-XXX: Columnar Data Transport and Processing Optimization - this seems > > a big change, events as columns rather than events as rows or CDC > > sequences. I assume this would not be exposed in SQL? > > > > kind regards, David. > > > > From: Robert Metzger <[email protected]> > > Date: Tuesday, 28 April 2026 at 07:38 > > To: [email protected] <[email protected]> > > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-577: AI-Native Flink — An Umbrella > > Proposal for Multimodal Data Processing > > > > Hey Guowei, > > > > Thanks for the proposal. I just took a brief look, here are some high level > > questions: > > > > Regarding the RPC Operator: What is the difference to the async io operator > > we have already? > > > > "Connector API for Multimodal Data Source/Sink": Why do we need to touch > > the connector API for supporting multimodal data? Isn't this more of a > > formats concern? > > > > "Non-Disruptive Scaling for CPU Operators": How do you want to guarantee > > exactly-once on that kind of scaling? E.g.
you need to somehow make a > > handover between the old and new pipelines > > > > Overall, I find the proposal has some things which seem related to making > > Flink more AI native, but other changes seem orthogonal to that. For > > example the checkpoint or scaling changes are actually unrelated to AI, and > > just engine improvements. > > > > > > On Tue, Apr 28, 2026 at 5:48 AM Guowei Ma <[email protected]> wrote: > > > > > Hi everyone, > > > > > > I'd like to start a discussion on an umbrella FLIP[1] that lays out a > > > direction for evolving Flink into a data engine that natively supports AI > > > workloads. > > > > > > The short version: user workloads are shifting from BI analytics to > > > multimodal data processing centered on model inference, and this triggers > > > cascading changes across the stack — multimodal data flowing through > > > pipelines, heterogeneous CPU/GPU resources, vectorized execution, and > > > inference tasks that run for seconds to minutes on Spot instances. The > > > proposal sketches an evolution along five directions (development paradigm, > > > data model, heterogeneous resources, execution engine, fault tolerance), > > > decomposed into 11 sub-FLIPs organized into three layers: core runtime > > > primitives, AI workload expression and execution, and production-grade > > > operational guarantees. Most sub-FLIPs have no hard dependencies on each > > > other and can be advanced in parallel. > > > > > > A note on scope, since it's an umbrella: > > > > > > - In scope here: whether the evolution directions are reasonable, whether > > > each sub-FLIP's motivation and proposed approach are well-founded, and > > > whether the boundaries and dependencies between sub-FLIPs are clear. > > > - Out of scope here: detailed designs, API specifics, and implementation > > > plans of individual sub-FLIPs — those will go through their own FLIPs.
> > > - Consensus criteria: agreement on the overall direction is sufficient > > > for the umbrella to pass; passing it does not lock in any sub-FLIP's design — > > > sub-FLIPs may still be adjusted, deferred, or withdrawn as they progress. > > > > > > All proposed changes are incremental — no existing API or behavior is > > > removed or altered. Compatibility details are covered at the end of the > > > document. > > > > > > Looking forward to your feedback on the overall direction and the > > > layering. > > > > > > [1] > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275 > > > > > > Thanks, > > > Guowei > > > > > > > Unless otherwise stated above: > > > > IBM United Kingdom Limited > > Registered in England and Wales with number 741598 > > Registered office: Building C, IBM Hursley Office, Hursley Park Road, > > Winchester, Hampshire SO21 2JN > > >
