Hi Shekhar, Thanks for the question — this is a very important point to clarify.
I think the key distinction here is between *flink-agents as an orchestration layer* and the *streaming inference runtime as an execution layer inside Flink core*. 1. What flink-agents typically covers >From my understanding, flink-agents focuses on: - high-level AI agent orchestration - tool/function calling patterns - workflow composition of LLM / AI tasks - user-defined inference logic and chaining In other words, it is primarily an *application / orchestration framework built on top of Flink*. ------------------------------ 2. What this FLIP is targeting (runtime layer) This proposal is focused on *execution-level capabilities inside Flink runtime*, which are typically not handled by flink-agents: 2.1 Streaming-aware execution semantics - adaptive batching under backpressure - latency-aware scheduling of inference requests - concurrency control per operator instance ------------------------------ 2.2 Fault tolerance at runtime level - standardized retry / fallback policies - circuit breaker integration - timeout isolation at operator level These are currently implemented inconsistently in user code or async functions. ------------------------------ 2.3 Backend abstraction and lifecycle management - unified inference backend abstraction (HTTP / Triton / custom) - connection pooling and lifecycle management - resource-aware execution coordination ------------------------------ 2.4 Integration with Flink execution model - checkpoint-aware inference execution - operator lifecycle integration - coordination with task scheduling and backpressure ------------------------------ 3. Why this cannot be fully handled in flink-agents The main limitation is that flink-agents operates at a *logical orchestration level*, while these requirements are deeply tied to: - task scheduling - operator execution semantics - backpressure model - runtime fault tolerance These are *core streaming engine concerns*, not orchestration logic. ------------------------------ 4. Relationship between the two To be clear, I see these as complementary: - flink-agents → AI workflow orchestration layer - inference runtime (this FLIP) → execution layer for AI workloads inside Flink A natural integration point would be: flink-agents generating workflows that execute on top of this runtime abstraction. ------------------------------ 5. Open question It would be great to align on: - whether we agree on this layering separation - whether existing flink-agents scope already includes any runtime-level execution concerns that I might have missed Happy to adjust the proposal if there is overlap. Thanks, featzhang Shekhar Rajak <[email protected]> 于2026年4月19日周日 00:17写道: > Hi, > I’m trying to understand - what are the features we are talking about > which can not be implemented in flink-agents sub project ? > > Shekhar > > On Saturday, April 18, 2026, 21:39, FeatZhang <[email protected]> > wrote: > > Hi Flink devs, > > I would like to start a discussion on a missing piece in Flink’s current > AI/ML inference capabilities and propose a FLIP for a *streaming-native AI > inference runtime layer*. > Motivation > > Apache Flink currently provides basic AI inference capabilities through > SQL-level constructs such as ML_PREDICT and related functions. These are > useful for integrating external models into batch and streaming pipelines. > > However, in production AI workloads (especially real-time inference and LLM > serving), we observe several gaps: > > - No unified runtime abstraction for inference execution > - No streaming-native batching or latency-aware scheduling > - Limited support for backpressure-aware inference control > - No built-in retry, fallback, or circuit breaker mechanisms > - Fragmented integration with external inference systems (e.g., HTTP > services, Triton, LLM endpoints) > > As a result, users often re-implement these capabilities in user-defined > functions, leading to inconsistent behavior and duplicated complexity. > ------------------------------ > Proposal (High-level) > > This FLIP proposes introducing a *Streaming-native AI Inference Runtime > Layer* in Flink, providing: > > - A unified inference operator abstraction > - Adaptive batching and concurrency control > - Backpressure-aware request scheduling > - Pluggable inference backends (HTTP / Triton / custom services) > - Built-in reliability mechanisms (retry, timeout, circuit breaker) > - Standard metrics and observability hooks > > ------------------------------ > Design Overview > > The high-level architecture would look like: > > DataStream / Table API > ↓ > Inference Operator Layer > ↓ > Inference Execution Engine > ↓ > Pluggable Inference Backend > > This layer would integrate with Flink’s existing streaming runtime and > remain fully compatible with current SQL/Table APIs. > ------------------------------ > Non-goals > > - This does NOT replace ML_PREDICT or existing SQL semantics > - This does NOT introduce a new ML training framework > - This is not tied to any specific inference engine > > ------------------------------ > Why now > > We see increasing adoption of Flink for real-time AI workloads, including: > > - streaming inference > - LLM-based pipelines > - hybrid AI + data processing workflows > > However, the lack of a standardized runtime abstraction makes production > deployments complex and inconsistent. > ------------------------------ > Request for feedback > > I would like feedback on: > > 1. Whether a dedicated inference runtime layer fits within Flink’s > architectural direction > 2. Preferred integration approach (Table API, DataStream, or both) > 3. Scope of built-in features vs user-defined extensibility > 4. Any existing efforts or ongoing work in this direction > > If there is agreement on direction, I will follow up with a more detailed > FLIP design document. > ------------------------------ > > Thanks, > featzhang > > > >
