[ https://issues.apache.org/jira/browse/FLINK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-38857:
-----------------------------------
Labels: pull-request-available (was: )
> Introduce a Triton inference module under flink-models for batch-oriented AI
> inference
> --------------------------------------------------------------------------------------
>
> Key: FLINK-38857
> URL: https://issues.apache.org/jira/browse/FLINK-38857
> Project: Flink
> Issue Type: Improvement
> Components: Table SQL / API
> Reporter: featzhang
> Priority: Major
> Labels: pull-request-available
>
> h2. Background
> Modern AI inference workloads are increasingly served by dedicated inference
> servers such as {*}NVIDIA Triton Inference Server{*}, which provide
> high-performance, batch-oriented inference APIs over HTTP / gRPC.
> Typical characteristics of such workloads include:
> * High per-request latency
> * Strong batching efficiency (especially for GPU-based inference)
> * Stateless or externally managed model lifecycle
> While Apache Flink already provides asynchronous I/O primitives, there is
> currently *no reusable, model-oriented runtime abstraction* for integrating
> external inference servers in a structured way.
> This makes inference integrations:
> * Ad-hoc and application-specific
> * Difficult to standardize across projects
> * Hard to evolve towards SQL- or planner-level integration later
> ----
> h2. Proposal
> This issue proposes a {*}Triton-specific inference module under
> {{flink-models}}{*}, focused on {*}runtime-level integration only{*},
> with no SQL or Table API changes.
> The goal is to provide a *clean, reusable building block* for Triton-based
> inference that can be used directly from the DataStream API and serve as a
> foundation for future higher-level integrations.
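> As a rough illustration only, the client abstraction could look roughly like the
> sketch below. None of these names exist yet; {{TritonModelClient}},
> {{InferenceRequest}}, and {{InferenceResponse}} are placeholders to be refined
> during design and review.
> {code:java}
> import java.util.List;
> import java.util.concurrent.CompletableFuture;
>
> /** Placeholder request type: a batch of already-encoded input tensors. */
> final class InferenceRequest {
>     final List<float[]> inputs;
>
>     InferenceRequest(List<float[]> inputs) {
>         this.inputs = inputs;
>     }
> }
>
> /** Placeholder response type: one output tensor per input in the batch. */
> final class InferenceResponse {
>     final List<float[]> outputs;
>
>     InferenceResponse(List<float[]> outputs) {
>         this.outputs = outputs;
>     }
> }
>
> /**
>  * Hypothetical client abstraction. The real implementation would translate a
>  * batch into Triton's HTTP/gRPC inference protocol and complete the future
>  * when the server responds.
>  */
> public interface TritonModelClient extends AutoCloseable {
>
>     /** Sends one batched inference request and completes asynchronously. */
>     CompletableFuture<InferenceResponse> inferBatch(InferenceRequest batch);
> }
> {code}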
> ----
> h2. Scope
> h3. In scope
> * Introduce a new module:
> ** {{flink-models-triton}}
> * Define a Triton model client abstraction, for example:
> ** {{TritonModelClient}}
> * Support:
> ** Batch-oriented inference requests
> ** Asynchronous execution
> ** Mapping between Flink records and Triton request / response formats
> * Enable integration with existing async primitives (see the usage sketch after this list), such as:
> ** {{AsyncBatchFunction}}
> ** {{AsyncBatchWaitOperator}}
> * Provide basic examples and tests demonstrating:
> ** Batched inference
> ** Error propagation
> ** Graceful shutdown
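> To make the intended integration concrete, a minimal usage sketch follows. It is
> purely illustrative: it builds on the placeholder {{TritonModelClient}} sketched
> above, {{TritonModelClients.createHttpClient}} and its endpoint are assumed
> factory placeholders, and Flink's standard {{RichAsyncFunction}} /
> {{ResultFuture}} async I/O API is used here instead of the batch-oriented
> primitives listed above, only to keep the sketch self-contained.
> {code:java}
> import java.util.Collections;
>
> import org.apache.flink.streaming.api.functions.async.ResultFuture;
> import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
>
> /** Illustrative async function mapping Flink records to Triton requests via the sketched client. */
> public class TritonAsyncFunction extends RichAsyncFunction<float[], float[]> {
>
>     private transient TritonModelClient client;
>
>     @Override
>     public void asyncInvoke(float[] input, ResultFuture<float[]> resultFuture) {
>         if (client == null) {
>             // Assumed factory method and endpoint; in the real module the client
>             // would be created in the function's open() lifecycle method.
>             client = TritonModelClients.createHttpClient("http://triton:8000", "my_model");
>         }
>         // Map the Flink record to a (single-element) Triton batch and back.
>         client.inferBatch(new InferenceRequest(Collections.singletonList(input)))
>                 .whenComplete((response, error) -> {
>                     if (error != null) {
>                         // Error propagation: surface the failure instead of swallowing it.
>                         resultFuture.completeExceptionally(error);
>                     } else {
>                         resultFuture.complete(Collections.singleton(response.outputs.get(0)));
>                     }
>                 });
>     }
>
>     @Override
>     public void close() throws Exception {
>         // Graceful shutdown: release HTTP/gRPC resources when the operator stops.
>         if (client != null) {
>             client.close();
>         }
>     }
> }
> {code}
> Such a function could then be wired into a {{DataStream}} pipeline with
> {{AsyncDataStream.unorderedWait(...)}}, with timeout and capacity settings left
> to the deployment.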
> ----
> h3. Out of scope (for this issue)
> The following topics are *explicitly excluded* from this proposal and can be
> addressed incrementally in follow-up issues:
> * SQL / Table API integration (e.g. {{CREATE MODEL}})
> * Planner- or optimizer-level inference support
> * Model lifecycle management or versioning
> * Retry, fallback, or advanced timeout strategies
> * Inference-specific metrics or observability extensions
> ----
> h2. Motivation and Benefits
> * Provides a *standardized Triton integration* for Flink users running AI
> inference workloads
> * Avoids treating inference as a black-box UDF
> * Keeps the initial contribution *small, focused, and reviewable*
> * Establishes a clear separation between:
> ** Runtime inference execution ({{flink-models}})
> ** Higher-level API and planner integration (future work)
> ----
> h2. Compatibility and Migration
> This change is fully additive:
> * No existing APIs are modified
> * No behavior changes to existing async operators
> * No impact on SQL or Table API users
> ----
> h2. Future Work
> Possible follow-up work includes:
> * SQL-level model definitions and invocation
> * Planner-aware inference batching
> * Cost-based inference optimization
> * Support for additional inference backends beyond Triton
> ----
> h2. Summary
> This proposal introduces a minimal, runtime-focused Triton inference module
> under {{flink-models}}, enabling efficient batch-oriented AI inference in
> Flink while keeping the core system stable and backward-compatible.