featzhang created FLINK-38857:
---------------------------------
Summary: Introduce a Triton inference module under flink-models
for batch-oriented AI inference
Key: FLINK-38857
URL: https://issues.apache.org/jira/browse/FLINK-38857
Project: Flink
Issue Type: Improvement
Components: Table SQL / API
Reporter: featzhang
h2. Background
Modern AI inference workloads are increasingly served by dedicated inference
servers such as {*}NVIDIA Triton Inference Server{*}, which provide
high-performance, batch-oriented inference APIs over HTTP / gRPC.
Typical characteristics of such workloads include:
* High per-request latency
* Strong batching efficiency (especially for GPU-based inference)
* Stateless or externally managed model lifecycle
While Apache Flink already provides asynchronous I/O primitives, there is
currently *no reusable, model-oriented runtime abstraction* for integrating
external inference servers in a structured way.
This makes inference integrations:
* Ad-hoc and application-specific
* Difficult to standardize across projects
* Hard to evolve towards SQL- or planner-level integration later
----
h2. Proposal
This issue proposes introducing a {*}Triton-specific inference module under
{{flink-models}}{*}, focusing on {*}runtime-level integration only{*}, without
introducing any SQL or Table API changes.
The goal is to provide a *clean, reusable building block* for Triton-based
inference that can be used directly from the DataStream API and serve as a
foundation for future higher-level integrations.
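As a starting point for discussion, the following is a minimal sketch of what such a building block could look like. All names and signatures below are illustrative placeholders, not a final API.
{code:java}
import java.util.List;
import java.util.concurrent.CompletableFuture;

/**
 * Illustrative sketch of a batch-oriented, asynchronous Triton client.
 * Names and signatures are placeholders for discussion only.
 */
public interface TritonModelClient<IN, OUT> extends AutoCloseable {

    /**
     * Sends one batch of records to the Triton model and completes with the
     * per-record results. The returned list is expected to preserve the order
     * of the input batch.
     */
    CompletableFuture<List<OUT>> inferBatch(List<IN> batch);

    /** Releases the underlying HTTP / gRPC resources. */
    @Override
    void close() throws Exception;
}
{code}
The mapping between Flink records and Triton tensor formats would live behind the {{IN}} / {{OUT}} type parameters (for example via a pluggable converter) and is intentionally left out of this sketch.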
----
h2. Scope
h3. In scope
* Introduce a new module:
** {{flink-models-triton}}
* Define a Triton model client abstraction, for example:
** {{TritonModelClient}}
* Support:
** Batch-oriented inference requests
** Asynchronous execution
** Mapping between Flink records and Triton request / response formats
* Enable integration with existing async primitives (see the usage sketch after this list), such as:
** {{AsyncBatchFunction}}
** {{AsyncBatchWaitOperator}}
* Provide basic examples and tests demonstrating:
** Batched inference
** Error propagation
** Graceful shutdown
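For illustration, the sketch below shows how such a client could be invoked from the DataStream API using Flink's existing async I/O primitives ({{AsyncDataStream}} / {{RichAsyncFunction}}); the actual contribution would integrate with the batch-oriented primitives listed above. {{TritonInferenceFunction}}, {{TritonClients.create}} and the endpoint are hypothetical, and lifecycle method signatures may differ between Flink versions.
{code:java}
import java.util.Collections;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

/** Illustrative async function delegating records to a (hypothetical) Triton client. */
public class TritonInferenceFunction extends RichAsyncFunction<float[], float[]> {

    private transient TritonModelClient<float[], float[]> client;

    @Override
    public void open(Configuration parameters) {
        // Hypothetical factory; a real implementation would read the endpoint,
        // model name and batching settings from the job configuration.
        client = TritonClients.create("http://triton:8000", "example-model");
    }

    @Override
    public void asyncInvoke(float[] input, ResultFuture<float[]> resultFuture) {
        // For illustration each record is sent as a batch of one; the proposed
        // batch-oriented operators would group several records per request.
        client.inferBatch(Collections.singletonList(input))
              .whenComplete((results, error) -> {
                  if (error != null) {
                      resultFuture.completeExceptionally(error); // error propagation
                  } else {
                      resultFuture.complete(results);            // one result per input
                  }
              });
    }

    @Override
    public void close() throws Exception {
        if (client != null) {
            client.close(); // graceful shutdown of HTTP / gRPC resources
        }
    }
}
{code}
It could then be applied with {{AsyncDataStream.unorderedWait(features, new TritonInferenceFunction(), 30, TimeUnit.SECONDS, 100)}}, where the last two arguments bound the per-request timeout and the number of in-flight requests.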
----
h3. Out of scope (for this issue)
The following topics are *explicitly excluded* from this proposal and can be
addressed incrementally in follow-up issues:
* SQL / Table API integration (e.g. {{CREATE MODEL}})
* Planner- or optimizer-level inference support
* Model lifecycle management or versioning
* Retry, fallback, or advanced timeout strategies
* Inference-specific metrics or observability extensions
----
h2. Motivation and Benefits
* Provides a *standardized Triton integration* for Flink users running AI
inference workloads
* Avoids treating inference as a black-box UDF
* Keeps the initial contribution *small, focused, and reviewable*
* Establishes a clear separation between:
** Runtime inference execution ({{flink-models}})
** Higher-level API and planner integration (future work)
----
h2. Compatibility and Migration
This change is fully additive:
* No existing APIs are modified
* No behavior changes to existing async operators
* No impact on SQL or Table API users
----
h2. Future Work
Possible follow-up work includes:
* SQL-level model definitions and invocation
* Planner-aware inference batching
* Cost-based inference optimization
* Support for additional inference backends beyond Triton
----
h2. Summary
This proposal introduces a minimal, runtime-focused Triton inference module
under {{flink-models}}, enabling efficient batch-oriented AI inference in
Flink while keeping the core system stable and backward-compatible.