Re: [DISCUSS] FLIP: Streaming-native AI Inference Runtime for Flink

FeatZhang Sat, 18 Apr 2026 09:44:09 -0700

Hi Shekhar,

Thanks for the question — this is a very important point to clarify.


I think the key distinction here is between *flink-agents as an
orchestration layer* and the *streaming inference runtime as an execution
layer inside Flink core*.
1. What flink-agents typically covers

>From my understanding, flink-agents focuses on:

   - high-level AI agent orchestration
   - tool/function calling patterns
   - workflow composition of LLM / AI tasks
   - user-defined inference logic and chaining

In other words, it is primarily an *application / orchestration framework
built on top of Flink*.
------------------------------
2. What this FLIP is targeting (runtime layer)

This proposal is focused on *execution-level capabilities inside Flink
runtime*, which are typically not handled by flink-agents:
2.1 Streaming-aware execution semantics

   - adaptive batching under backpressure
   - latency-aware scheduling of inference requests
   - concurrency control per operator instance

------------------------------
2.2 Fault tolerance at runtime level

   - standardized retry / fallback policies
   - circuit breaker integration
   - timeout isolation at operator level

These are currently implemented inconsistently in user code or async
functions.
------------------------------
2.3 Backend abstraction and lifecycle management

   - unified inference backend abstraction (HTTP / Triton / custom)
   - connection pooling and lifecycle management
   - resource-aware execution coordination

------------------------------
2.4 Integration with Flink execution model

   - checkpoint-aware inference execution
   - operator lifecycle integration
   - coordination with task scheduling and backpressure

------------------------------
3. Why this cannot be fully handled in flink-agents

The main limitation is that flink-agents operates at a *logical
orchestration level*, while these requirements are deeply tied to:

   - task scheduling
   - operator execution semantics
   - backpressure model
   - runtime fault tolerance

These are *core streaming engine concerns*, not orchestration logic.
------------------------------
4. Relationship between the two

To be clear, I see these as complementary:

   - flink-agents → AI workflow orchestration layer
   - inference runtime (this FLIP) → execution layer for AI workloads
   inside Flink

A natural integration point would be:
flink-agents generating workflows that execute on top of this runtime
abstraction.
------------------------------
5. Open question

It would be great to align on:

   - whether we agree on this layering separation
   - whether existing flink-agents scope already includes any runtime-level
   execution concerns that I might have missed

Happy to adjust the proposal if there is overlap.

Thanks,
featzhang

Shekhar Rajak <[email protected]> 于2026年4月19日周日 00:17写道：

> Hi,
> I’m trying to understand - what are the features we are talking about
> which can not be implemented in flink-agents sub project ?
>
> Shekhar
>
> On Saturday, April 18, 2026, 21:39, FeatZhang <[email protected]>
> wrote:
>
> Hi Flink devs,
>
> I would like to start a discussion on a missing piece in Flink’s current
> AI/ML inference capabilities and propose a FLIP for a *streaming-native AI
> inference runtime layer*.
> Motivation
>
> Apache Flink currently provides basic AI inference capabilities through
> SQL-level constructs such as ML_PREDICT and related functions. These are
> useful for integrating external models into batch and streaming pipelines.
>
> However, in production AI workloads (especially real-time inference and LLM
> serving), we observe several gaps:
>
>   - No unified runtime abstraction for inference execution
>   - No streaming-native batching or latency-aware scheduling
>   - Limited support for backpressure-aware inference control
>   - No built-in retry, fallback, or circuit breaker mechanisms
>   - Fragmented integration with external inference systems (e.g., HTTP
>   services, Triton, LLM endpoints)
>
> As a result, users often re-implement these capabilities in user-defined
> functions, leading to inconsistent behavior and duplicated complexity.
> ------------------------------
> Proposal (High-level)
>
> This FLIP proposes introducing a *Streaming-native AI Inference Runtime
> Layer* in Flink, providing:
>
>   - A unified inference operator abstraction
>   - Adaptive batching and concurrency control
>   - Backpressure-aware request scheduling
>   - Pluggable inference backends (HTTP / Triton / custom services)
>   - Built-in reliability mechanisms (retry, timeout, circuit breaker)
>   - Standard metrics and observability hooks
>
> ------------------------------
> Design Overview
>
> The high-level architecture would look like:
>
> DataStream / Table API
>         ↓
> Inference Operator Layer
>         ↓
> Inference Execution Engine
>         ↓
> Pluggable Inference Backend
>
> This layer would integrate with Flink’s existing streaming runtime and
> remain fully compatible with current SQL/Table APIs.
> ------------------------------
> Non-goals
>
>   - This does NOT replace ML_PREDICT or existing SQL semantics
>   - This does NOT introduce a new ML training framework
>   - This is not tied to any specific inference engine
>
> ------------------------------
> Why now
>
> We see increasing adoption of Flink for real-time AI workloads, including:
>
>   - streaming inference
>   - LLM-based pipelines
>   - hybrid AI + data processing workflows
>
> However, the lack of a standardized runtime abstraction makes production
> deployments complex and inconsistent.
> ------------------------------
> Request for feedback
>
> I would like feedback on:
>
>   1. Whether a dedicated inference runtime layer fits within Flink’s
>   architectural direction
>   2. Preferred integration approach (Table API, DataStream, or both)
>   3. Scope of built-in features vs user-defined extensibility
>   4. Any existing efforts or ongoing work in this direction
>
> If there is agreement on direction, I will follow up with a more detailed
> FLIP design document.
> ------------------------------
>
> Thanks,
> featzhang
>
>
>
>

Re: [DISCUSS] FLIP: Streaming-native AI Inference Runtime for Flink

Reply via email to