[ 
https://issues.apache.org/jira/browse/FLINK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-38857:
-----------------------------------
    Labels: pull-request-available  (was: )

> Introduce a Triton inference module under flink-models for batch-oriented AI 
> inference
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-38857
>                 URL: https://issues.apache.org/jira/browse/FLINK-38857
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table SQL / API
>            Reporter: featzhang
>            Priority: Major
>              Labels: pull-request-available
>
> h2. Background
> Modern AI inference workloads are increasingly served by dedicated inference 
> servers such as {*}NVIDIA Triton Inference Server{*}, which provide 
> high-performance, batch-oriented inference APIs over HTTP / gRPC.
> Typical characteristics of such workloads include:
>  * High per-request latency
>  * Strong batching efficiency (especially for GPU-based inference)
>  * Stateless or externally managed model lifecycle
> While Apache Flink already provides asynchronous I/O primitives, there is
> currently *no reusable, model-oriented runtime abstraction* for integrating
> external inference servers in a structured way.
> This makes inference integrations:
>  * Ad-hoc and application-specific
>  * Difficult to standardize across projects
>  * Hard to evolve towards SQL- or planner-level integration later
> ----
> h2. Proposal
> This issue proposes a {*}Triton-specific inference module under
> {{flink-models}}{*}, focusing on {*}runtime-level integration only{*},
> without any SQL or Table API changes.
> The goal is to provide a *clean, reusable building block* for Triton-based 
> inference that can be used directly from the DataStream API and serve as a 
> foundation for future higher-level integrations.
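> As a rough illustration, such a building block could take the shape of a
> small, batch-oriented, asynchronous client interface. This is only a sketch:
> apart from the name {{TritonModelClient}} (listed in the Scope section
> below), the generic typing and method names here are hypothetical.
> {code:java}
> import java.util.List;
> import java.util.concurrent.CompletableFuture;
> 
> /**
>  * Minimal sketch of a batch-oriented, asynchronous Triton client.
>  * IN / OUT are the user-side record types; mapping them to Triton's
>  * v2 inference protocol (e.g. POST /v2/models/<name>/infer over HTTP)
>  * would live inside the implementation.
>  */
> public interface TritonModelClient<IN, OUT> extends AutoCloseable {
> 
>     /** Sends one batch of records to the Triton server asynchronously. */
>     CompletableFuture<List<OUT>> inferBatch(List<IN> batch);
> 
>     /** Releases HTTP / gRPC resources, enabling graceful shutdown. */
>     @Override
>     void close() throws Exception;
> }
> {code}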
> ----
> h2. Scope
> h3. In scope
>  * Introduce a new module:
>  ** {{flink-models-triton}}
>  * Define a Triton model client abstraction, for example:
>  ** {{TritonModelClient}}
>  * Support:
>  ** Batch-oriented inference requests
>  ** Asynchronous execution
>  ** Mapping between Flink records and Triton request / response formats
>  * Enable integration with existing async primitives (illustrated in the
> sketch after this list), such as:
>  ** {{AsyncBatchFunction}}
>  ** {{AsyncBatchWaitOperator}}
>  * Provide basic examples and tests demonstrating:
>  ** Batched inference
>  ** Error propagation
>  ** Graceful shutdown
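> To make the intended wiring concrete, the following sketch connects such a
> client to Flink's standard async I/O API ({{RichAsyncFunction}}). The
> batch-oriented operators listed above are not used here because their
> signatures are not fixed by this issue, so the sketch issues per-record calls
> and leaves batching to the client. The factory {{TritonModelClients.create}}
> and the String-to-Float typing are hypothetical.
> {code:java}
> import java.util.Collections;
> 
> import org.apache.flink.configuration.Configuration;
> import org.apache.flink.streaming.api.functions.async.ResultFuture;
> import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
> 
> /** Hypothetical async function delegating records to a TritonModelClient. */
> public class TritonInferenceFunction extends RichAsyncFunction<String, Float> {
> 
>     private transient TritonModelClient<String, Float> client;
> 
>     @Override
>     public void open(Configuration parameters) {
>         // Classic RichFunction lifecycle hook (newer Flink versions use
>         // open(OpenContext)). TritonModelClients is a hypothetical factory;
>         // endpoint and model name would normally come from configuration.
>         client = TritonModelClients.create("http://triton:8000", "my-model");
>     }
> 
>     @Override
>     public void asyncInvoke(String record, ResultFuture<Float> resultFuture) {
>         client.inferBatch(Collections.singletonList(record))
>                 .whenComplete((responses, error) -> {
>                     if (error != null) {
>                         // Error propagation: fail the record, let Flink react.
>                         resultFuture.completeExceptionally(error);
>                     } else {
>                         resultFuture.complete(responses);
>                     }
>                 });
>     }
> 
>     @Override
>     public void close() throws Exception {
>         if (client != null) {
>             client.close(); // graceful shutdown of HTTP / gRPC resources
>         }
>     }
> }
> {code}
> A stream could then be wired up with, for example,
> {{AsyncDataStream.orderedWait(input, new TritonInferenceFunction(), 30, TimeUnit.SECONDS, 100)}}.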
> ----
> h3. Out of scope (for this issue)
> The following topics are *explicitly excluded* from this proposal and can be 
> addressed incrementally in follow-up issues:
>  * SQL / Table API integration (e.g. {{CREATE MODEL}})
>  * Planner- or optimizer-level inference support
>  * Model lifecycle management or versioning
>  * Retry, fallback, or advanced timeout strategies
>  * Inference-specific metrics or observability extensions
> ----
> h2. Motivation and Benefits
>  * Provides a *standardized Triton integration* for Flink users running AI 
> inference workloads
>  * Avoids treating inference as a black-box UDF
>  * Keeps the initial contribution *small, focused, and reviewable*
>  * Establishes a clear separation between:
>  ** Runtime inference execution ({{flink-models}})
>  ** Higher-level API and planner integration (future work)
> ----
> h2. Compatibility and Migration
> This change is fully additive:
>  * No existing APIs are modified
>  * No behavior changes to existing async operators
>  * No impact on SQL or Table API users
> ----
> h2. Future Work
> Possible follow-up work includes:
>  * SQL-level model definitions and invocation
>  * Planner-aware inference batching
>  * Cost-based inference optimization
>  * Support for additional inference backends beyond Triton
> ----
> h2. Summary
> This proposal introduces a minimal, runtime-focused Triton inference module 
> under {{flink-models}}, enabling efficient batch-oriented AI inference in
> Flink while keeping the core system stable and backward-compatible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
