Hi Guowei,

Thank you for writing this proposal.
I may be in the minority here, but I hope my voice will be heard. I disagree with turning Flink into an "AI-Native" engine.

Regarding your "Data processing is entering the AI era, and Flink needs to evolve from a traditional BI compute engine into a data engine that natively supports AI workloads" claim:

- How exactly do you define "AI"? I don't believe there is a standard definition. For example, Machine Learning has been around for more than a decade, but there were no proposals (or need, in my opinion) to turn Flink into an "ML-Native" engine. Flink, in its current state, has been successfully used in many systems alongside dedicated ML technologies, like feature stores. Based on the context of your proposal, it looks like you mostly mean LLMs, so could you make the wording more specific?
- I wouldn't call Flink "a traditional BI compute engine". Flink is a general data processing technology which can be used for a variety of use cases without any BI involvement.
- Do you have any proof that "Users' core workloads are rapidly evolving" and that they require your proposed changes? Case studies, user surveys, or submitted issues about the lack of support? Big changes like that require extensive validation.
- And even if there is a real need to adopt some LLM-driven changes, why now? The LLM-related tooling has been changing so rapidly that it's hard to predict what will be needed tomorrow. Why does it make sense to introduce changes now, rather than wait for more standardization and consolidation?

To summarize, I think there are a lot of great ideas in the proposal, but in my mind, they need to be addressed as tactical, focused changes, not under the "AI-Native" umbrella.

I also wanted to address a few more specific points:

- RpcOperator: why does it need to be managed by Flink? I see absolutely no need to introduce the additional complexity of orchestrating standalone components into the core Flink engine.
I can imagine a separate sub-project for an RpcOperator, which could potentially be managed by the Kubernetes Operator.

- You make the case for the vectorized batch processing, but only on the Python side. Why stop there? Native columnar vectorized execution will require end-to-end changes, including connectors, data format support, type system support, runtime changes, etc. It seems logical to me to support this execution mode for Java and SQL as well.
- Supporting many more data types natively (images, video, audio, tensors) will make connector serializers and deserializers (SerDes) much more challenging to implement. Even today, many SerDes in officially supported connectors don't fully implement types like arrays and structs.

Thank you.

On Wed, Apr 29, 2026 at 1:18 AM Guowei Ma <[email protected]> wrote:

> Hi Z
>
> Thanks for the kind words and the thoughtful questions. Let me take them
> one by one.
>
> 1. Throughput and latency targets
>
> To be honest, I don't have concrete numbers to share yet. What I can say is
> that our internal testing has already surfaced several directions where
> Flink can be improved, and at the same time we want to fully leverage
> Flink's existing streaming shuffle capabilities. As the multimodal operator
> library matures, we'll progressively publish benchmark results.
>
> 2. Built-in operators
>
> You're absolutely right. From what I've seen, our internal users already
> rely on a fairly large set of multimodal operators, potentially 100+. The
> exact set the community should provide is best discussed in FLIP-XXX:
> Built-in Multimodal Operators and AI Functions, and contributions from the
> community are very welcome there.
>
> 3. Plan for the 11 sub-FLIPs
>
> The sequencing follows the layering in the umbrella:
>
> - Layer 1 (Core Primitives) should be discussed and aligned first, since
> the second and third layers build on it.
> - Layer 2 (API + compilation + single-node execution) starts with
> getting the API discussion right (the Python API, how UDFs declare
> resources, etc.), after which the single-node execution work can build on
> top.
> - Layer 3 (distributed scheduling and checkpointing) can largely proceed
> independently in parallel.
>
> So while each sub-FLIP is indeed a substantial piece of work, most of them
> can be advanced in parallel by different contributors once the Layer 1
> primitives are settled.
>
> 4. GPU scheduling roadmap
>
> Could you expand a bit on which aspect of GPU scheduling you have in mind
> as the complex one? "GPU scheduling" covers a fairly wide surface area
> (resource declaration, operator-level deployment, elastic scaling,
> heterogeneous GPU types, fine-grained partitioning, etc.), and the answer
> differs quite a bit depending on which dimension we're discussing. Once I
> understand your specific concern I can give a more useful response.
>
> Thanks again for the support. Looking forward to the continued discussion.
>
> Best,
> Guowei
>
>
> On Tue, Apr 28, 2026 at 4:34 PM zl z <[email protected]> wrote:
>
> > Hey Guowei,
> >
> > Thanks for the proposal, and I think this is very valuable. I have some
> > questions about it:
> >
> > 1. What are our expected throughput and latency targets? Do we have any
> > forward-looking tests for this?
> >
> > 2. AI involves a very large number of operators. Besides allowing users to
> > use them through UDFs, will we also provide commonly used built-in
> > operators?
> >
> > 3. Each of the 11 sub-FLIPs is a major project involving a significant
> > amount of changes. What is our plan for this?
> >
> > 4. GPU scheduling is extremely complex. What is our current roadmap for
> > this?
> >
> > This is a very high-quality and exciting proposal. Making Flink an
> > AI-native data processing engine will make it far more valuable in the AI
> > era. Look forward to seeing it land and come to fruition soon.
> >
> > Robert Metzger <[email protected]> wrote on Tue, Apr 28, 2026 at 14:38:
> >
> > > Hey Guowei,
> > >
> > > Thanks for the proposal. I just took a brief look; here are some
> > > high-level questions:
> > >
> > > Regarding the RPC Operator: What is the difference to the async I/O
> > > operator we have already?
> > >
> > > "Connector API for Multimodal Data Source/Sink": Why do we need to touch
> > > the connector API for supporting multimodal data? Isn't this more of a
> > > formats concern?
> > >
> > > "Non-Disruptive Scaling for CPU Operators": How do you want to guarantee
> > > exactly-once on that kind of scaling? E.g. you need to somehow make a
> > > handover between the old and new pipeline.
> > >
> > > Overall, I find the proposal has some things which seem related to making
> > > Flink more AI native, but other changes seem orthogonal to that. For
> > > example, the checkpoint or scaling changes are actually unrelated to AI,
> > > and just engine improvements.
> > >
> > >
> > > On Tue, Apr 28, 2026 at 5:48 AM Guowei Ma <[email protected]> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I'd like to start a discussion on an umbrella FLIP[1] that lays out a
> > > > direction for evolving Flink into a data engine that natively supports
> > > > AI workloads.
> > > >
> > > > The short version: user workloads are shifting from BI analytics to
> > > > multimodal data processing centered on model inference, and this
> > > > triggers cascading changes across the stack (multimodal data flowing
> > > > through pipelines, heterogeneous CPU/GPU resources, vectorized
> > > > execution, and inference tasks that run for seconds to minutes on Spot
> > > > instances).
> > > > The proposal sketches an evolution along five directions (development
> > > > paradigm, data model, heterogeneous resources, execution engine, fault
> > > > tolerance), decomposed into 11 sub-FLIPs organized into three layers:
> > > > core runtime primitives, AI workload expression and execution, and
> > > > production-grade operational guarantees. Most sub-FLIPs have no hard
> > > > dependencies on each other and can be advanced in parallel.
> > > >
> > > > A note on scope, since it's an umbrella:
> > > >
> > > > - In scope here: whether the evolution directions are reasonable,
> > > > whether each sub-FLIP's motivation and proposed approach are
> > > > well-founded, and whether the boundaries and dependencies between
> > > > sub-FLIPs are clear.
> > > > - Out of scope here: detailed designs, API specifics, and implementation
> > > > plans of individual sub-FLIPs; those will go through their own FLIPs.
> > > > - Consensus criteria: agreement on the overall direction is sufficient
> > > > for the umbrella to pass; passing it does not lock in any sub-FLIP's
> > > > design. Sub-FLIPs may still be adjusted, deferred, or withdrawn as they
> > > > progress.
> > > >
> > > > All proposed changes are incremental: no existing API or behavior is
> > > > removed or altered. Compatibility details are covered at the end of the
> > > > document.
> > > >
> > > > Looking forward to your feedback on the overall direction and the
> > > > layering.
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275
> > > >
> > > > Thanks,
> > > > Guowei
