featzhang created FLINK-38857:
---------------------------------

             Summary: Introduce a Triton inference module under flink-models 
for batch-oriented AI inference
                 Key: FLINK-38857
                 URL: https://issues.apache.org/jira/browse/FLINK-38857
             Project: Flink
          Issue Type: Improvement
          Components: Table SQL / API
            Reporter: featzhang


h2. Background

Modern AI inference workloads are increasingly served by dedicated inference 
servers such as {*}NVIDIA Triton Inference Server{*}, which provide 
high-performance, batch-oriented inference APIs over HTTP / gRPC.

Typical characteristics of such workloads include:
 * High per-request latency
 * Strong batching efficiency (especially for GPU-based inference)
 * Stateless or externally managed model lifecycle

While Apache Flink already provides asynchronous I/O primitives, there is 
currently *no reusable, model-oriented runtime abstraction* for integrating 
external inference servers in a structured way.

This makes inference integrations:
 * Ad-hoc and application-specific
 * Difficult to standardize across projects
 * Hard to evolve towards SQL- or planner-level integration later

----
h2. Proposal

This issue proposes introducing a {*}Triton-specific inference module under 
{{flink-models}}{*}, focused on {*}runtime-level integration only{*}, with no 
SQL or Table API changes.

The goal is to provide a *clean, reusable building block* for Triton-based 
inference that can be used directly from the DataStream API and serve as a 
foundation for future higher-level integrations.
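
To make the intended shape of this building block concrete, below is a minimal sketch of what the client abstraction could look like. All names here ({{TritonModelClient}}, {{inferBatch}}) are placeholders used only for illustration; the actual API would be defined during implementation.

{code:java}
import java.util.List;
import java.util.concurrent.CompletableFuture;

/**
 * Illustrative sketch only; none of these names exist in Flink today.
 * One call sends a whole batch of records to the Triton server and the
 * returned future completes with the outputs in the same order.
 */
public interface TritonModelClient<IN, OUT> extends AutoCloseable {

    /** Asynchronously runs inference for one batch of records. */
    CompletableFuture<List<OUT>> inferBatch(List<IN> batch);
}
{code}

Keeping the abstraction batch-first mirrors Triton's batch-oriented HTTP / gRPC APIs and leaves room for planner-driven batching in later follow-ups.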
----
h2. Scope
h3. In scope
 * Introduce a new module:
 ** {{flink-models-triton}}
 * Define a Triton model client abstraction, for example:
 ** {{TritonModelClient}}
 * Support:
 ** Batch-oriented inference requests
 ** Asynchronous execution
 ** Mapping between Flink records and Triton request / response formats
 * Enable integration with existing async primitives (see the sketch after this list), such as:
 ** {{AsyncBatchFunction}}
 ** {{AsyncBatchWaitOperator}}
 * Provide basic examples and tests demonstrating:
 ** Batched inference
 ** Error propagation
 ** Graceful shutdown
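
As a rough illustration of the integration point, the sketch below wires such a client into Flink's long-standing async I/O primitives ({{AsyncDataStream}} / {{RichAsyncFunction}}); the batch-oriented primitives listed above ({{AsyncBatchFunction}}, {{AsyncBatchWaitOperator}}) would be plugged in the same way, only at batch granularity. The {{TritonModelClient}} type is the hypothetical abstraction sketched in the Proposal section, not an existing API.

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

/** Sketch: bridges the hypothetical TritonModelClient into Flink's async I/O. */
public class TritonScoringJob {

    public static DataStream<Float> score(DataStream<String> sentences) {
        // Keep up to 100 requests in flight; time out each request after 30 seconds.
        return AsyncDataStream.unorderedWait(
                sentences, new TritonAsyncFunction(), 30, TimeUnit.SECONDS, 100);
    }

    public static class TritonAsyncFunction extends RichAsyncFunction<String, Float> {

        // Hypothetical client from this proposal; it would be created in the
        // function's open() lifecycle method and released in close().
        private transient TritonModelClient<String, Float> client;

        @Override
        public void asyncInvoke(String input, ResultFuture<Float> resultFuture) {
            // Single-record request for simplicity; real batching would come from
            // the batch-oriented operator or from client-side request coalescing.
            client.inferBatch(Collections.singletonList(input))
                    .whenComplete((List<Float> outputs, Throwable error) -> {
                        if (error != null) {
                            resultFuture.completeExceptionally(error);
                        } else {
                            resultFuture.complete(Collections.singletonList(outputs.get(0)));
                        }
                    });
        }
    }
}
{code}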

----
h3. Out of scope (for this issue)

The following topics are *explicitly excluded* from this proposal and can be 
addressed incrementally in follow-up issues:
 * SQL / Table API integration (e.g. {{CREATE MODEL}})
 * Planner- or optimizer-level inference support
 * Model lifecycle management or versioning
 * Retry, fallback, or advanced timeout strategies
 * Inference-specific metrics or observability extensions

----
h2. Motivation and Benefits
 * Provides a *standardized Triton integration* for Flink users running AI 
inference workloads
 * Avoids treating inference as a black-box UDF
 * Keeps the initial contribution *small, focused, and reviewable*
 * Establishes a clear separation between:
 ** Runtime inference execution ({{flink-models}})
 ** Higher-level API and planner integration (future work)

----
h2. Compatibility and Migration

This change is fully additive:
 * No existing APIs are modified
 * No behavior changes to existing async operators
 * No impact on SQL or Table API users

----
h2. Future Work

Possible follow-up work includes:
 * SQL-level model definitions and invocation
 * Planner-aware inference batching
 * Cost-based inference optimization
 * Support for additional inference backends beyond Triton

----
h2. Summary

This proposal introduces a minimal, runtime-focused Triton inference module 
under {{flink-models}}, enabling efficient batch-oriented AI inference in 
Flink while keeping the core system stable and backward-compatible.


