Hello Guowei,

Thanks for this proposal. The runtime direction looks like a good
foundation, and I think many of these pieces will also help non-AI
workloads.
One part I would like to see in more detail is the API side, especially for
Java DataStream and SQL/Table users. The FLIP is Python-DataFrame focused,
which makes sense for the AI ecosystem, but a few questions remain open
when it comes to aligning this with the overall project:

- SQL/Table: What is the plan here? How will the new multimodal types
(Tensor, Image, Embedding) work in the type system, codegen, and
plan/savepoint compatibility? Is there a plan for SQL-level model
inference beyond the current ML_PREDICT shape, for example vector
similarity or multimodal predicates? Today this is still very
vendor-specific across the industry, so it would be good to know whether
Flink wants to take a clear position here and how this FLIP fits with
the SQL/Table vision.

- DataStream (v1 and v2): Will RpcOperator and the Arrow-batch primitives
be exposed as first-class building blocks for Java users, or only as
internal pieces behind the Python DataFrame? Many streaming inference
use cases (real-time enrichment, CDC + model scoring) fit very well with
DataStream and would benefit from clear guidance.

This is not a blocker for the overall direction. I just think the API
roadmap deserves the same level of detail as the runtime one, so that the
current user base has a clear picture of what "AI-native" means for them.

Kind regards,
Gustavo

On Tue, 28 Apr 2026 at 10:33, zl z <[email protected]> wrote:

> Hey Guowei,
>
> Thanks for the proposal, and I think this is very valuable. I have some
> questions about it:
>
> 1. What are our expected throughput and latency targets? Do we have any
> forward-looking tests for this?
>
> 2. AI involves a very large number of operators. Besides allowing users
> to use them through UDFs, will we also provide commonly used built-in
> operators?
>
> 3. Each of the 11 sub-FLIPs is a major project involving a significant
> amount of change. What is our plan for this?
>
> 4. GPU scheduling is extremely complex. What is our current roadmap for
> this?
>
> This is a very high-quality and exciting proposal. Making Flink an
> AI-native data processing engine will make it far more valuable in the
> AI era. Looking forward to seeing it land and come to fruition soon.
>
> Robert Metzger <[email protected]> wrote on Tue, 28 Apr 2026 at 14:38:
>
> > Hey Guowei,
> >
> > Thanks for the proposal. I just took a brief look; here are some
> > high-level questions:
> >
> > Regarding the RPC Operator: what is the difference to the async I/O
> > operator we already have?
> >
> > "Connector API for Multimodal Data Source/Sink": Why do we need to
> > touch the connector API to support multimodal data? Isn't this more
> > of a formats concern?
> >
> > "Non-Disruptive Scaling for CPU Operators": How do you want to
> > guarantee exactly-once with that kind of scaling? E.g. you need to
> > somehow make a handover between the old and the new pipeline.
> >
> > Overall, I find the proposal has some things which seem related to
> > making Flink more AI-native, but other changes seem orthogonal to
> > that. For example, the checkpoint and scaling changes are actually
> > unrelated to AI and are just engine improvements.
> >
> >
> > On Tue, Apr 28, 2026 at 5:48 AM Guowei Ma <[email protected]> wrote:
> >
> > > Hi everyone,
> > >
> > > I'd like to start a discussion on an umbrella FLIP [1] that lays
> > > out a direction for evolving Flink into a data engine that natively
> > > supports AI workloads.
> > >
> > > The short version: user workloads are shifting from BI analytics to
> > > multimodal data processing centered on model inference, and this
> > > triggers cascading changes across the stack — multimodal data
> > > flowing through pipelines, heterogeneous CPU/GPU resources,
> > > vectorized execution, and inference tasks that run for seconds to
> > > minutes on Spot instances.
> > > The proposal sketches an evolution along five directions
> > > (development paradigm, data model, heterogeneous resources,
> > > execution engine, fault tolerance), decomposed into 11 sub-FLIPs
> > > organized into three layers: core runtime primitives, AI workload
> > > expression and execution, and production-grade operational
> > > guarantees. Most sub-FLIPs have no hard dependencies on each other
> > > and can be advanced in parallel.
> > >
> > > A note on scope, since it's an umbrella:
> > >
> > > - In scope here: whether the evolution directions are reasonable,
> > > whether each sub-FLIP's motivation and proposed approach are
> > > well-founded, and whether the boundaries and dependencies between
> > > sub-FLIPs are clear.
> > > - Out of scope here: detailed designs, API specifics, and
> > > implementation plans of individual sub-FLIPs — those will go
> > > through their own FLIPs.
> > > - Consensus criteria: agreement on the overall direction is
> > > sufficient for the umbrella to pass; passing it does not lock in
> > > any sub-FLIP's design — sub-FLIPs may still be adjusted, deferred,
> > > or withdrawn as they progress.
> > >
> > > All proposed changes are incremental — no existing API or behavior
> > > is removed or altered. Compatibility details are covered at the end
> > > of the document.
> > >
> > > Looking forward to your feedback on the overall direction and the
> > > layering.
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275
> > >
> > > Thanks,
> > > Guowei
> >
