[
https://issues.apache.org/jira/browse/FLINK-39625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
featzhang updated FLINK-39625:
------------------------------
Attachment: image-2026-05-08-14-41-15-033.png
> Support GPU-based model inference via sidecar/actor pattern
> -----------------------------------------------------------
>
> Key: FLINK-39625
> URL: https://issues.apache.org/jira/browse/FLINK-39625
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Coordination
> Reporter: featzhang
> Priority: Major
> Labels: flip-proposal, gpu, model-inference
> Attachments: image-2026-05-08-14-41-15-033.png
>
>
> !image-2026-05-08-14-41-15-033.png!
> h2. Motivation
> Real-time stream processing increasingly incorporates machine-learning
> inference directly into streaming pipelines, for use cases such as feature
> enrichment, anomaly detection, content ranking, and fraud detection. Today,
> users who need GPU-accelerated model inference inside Flink typically embed
> the model weights inside the UDF or operator, which has several drawbacks:
> * Every task slot holding the operator loads its own copy of the model into
> GPU memory, wasting VRAM and lengthening slot initialisation.
> * The model's lifecycle is coupled to Flink's task lifecycle: any restart,
> rescale, or failover forces the model to be reloaded.
> * Request batching must be implemented ad-hoc inside each UDF, and cross-task
> batching is impossible because each task only sees its own local requests.
> * Flink has no first-class notion of GPU resources in ResourceProfile, so
> GPU assignment relies on external schedulers or manual pinning.
> Inspired by the actor pattern used in frameworks such as Ray, this proposal
> introduces a long-lived "GPU sidecar" process, co-located with each
> GPU-equipped TaskManager. Flink operators invoke the sidecar over an RPC
> channel; the sidecar owns model loading, GPU memory management, and
> cross-operator request batching.
> h2. Goals
> * Add a standardised way to declare GPU resources on TaskManagers and
> request them through ResourceProfile.
> * Provide a first-party GPU sidecar service that hosts one or more models on
> each GPU node, loads them once, and serves asynchronous inference
> requests.
> * Allow Flink operators (both DataStream async I/O and Table/SQL functions)
> to invoke the sidecar efficiently, with bounded-queue back-pressure and
> cross-request batching.
> * Schedule GPU-affinity operators close to a live sidecar via existing
> ResourceManager mechanisms.
> * Expose metrics for GPU utilisation, queue depth, batch size, and
> per-request latency.
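> The bounded-queue back-pressure goal could be approximated on the client
> side roughly as follows. This is a minimal sketch, not the proposed
> implementation: {{BoundedInFlightClient}} and its {{transport}} function
> (a stand-in for the real sidecar RPC) are hypothetical names, and the
> limit is enforced with a plain semaphore.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Hypothetical client-side limiter; not part of any existing Flink API. */
final class BoundedInFlightClient<I, O> {
    private final Semaphore permits;
    // Stand-in for the real RPC call to the sidecar (assumption).
    private final Function<I, CompletableFuture<O>> transport;

    BoundedInFlightClient(int maxInFlight, Function<I, CompletableFuture<O>> transport) {
        this.permits = new Semaphore(maxInFlight);
        this.transport = transport;
    }

    /** Blocks the caller while maxInFlight requests are already outstanding. */
    CompletableFuture<O> infer(I input) {
        try {
            permits.acquire();
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report the failure through the future.
            Thread.currentThread().interrupt();
            CompletableFuture<O> failed = new CompletableFuture<>();
            failed.completeExceptionally(e);
            return failed;
        }
        CompletableFuture<O> result;
        try {
            result = transport.apply(input);
        } catch (RuntimeException e) {
            permits.release(); // the request never started, so free the slot
            throw e;
        }
        // Free the slot when the request finishes, successfully or not.
        return result.whenComplete((value, error) -> permits.release());
    }

    /** Number of requests that can still be issued without blocking. */
    int freeSlots() {
        return permits.availablePermits();
    }
}
```

> Blocking the caller is what propagates back-pressure upstream; a real
> operator would more likely rely on the capacity limit of Async I/O rather
> than block a task thread directly.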
> h2. Non-goals
> * This proposal does not introduce a new model-training runtime; training
> remains the responsibility of external systems.
> * It does not prescribe a specific inference engine (TensorRT, ONNX
> Runtime, PyTorch, and plain CUDA kernels are all acceptable backends).
> * It does not replace the existing external-async patterns (Async I/O,
> Lookup Join); those remain valid for non-GPU remote calls.
> h2. Proposed design overview
> Three cooperating components:
> # *TaskManager-side resource declaration* - extend {{ResourceProfile}} with a
> GPU dimension, advertised by TaskManagers that opt in.
> # *GPU sidecar process* - a standalone service, one per GPU node, that loads
> models on startup, exposes an RPC endpoint (initial implementation: gRPC
> over UDS / TCP), and batches incoming inference requests.
> # *Flink-side client* - an Async I/O based operator and a Table/SQL UDF that
> translate records into sidecar RPC calls, enforce ordering and timeouts,
> and surface metrics.
> The sidecar is co-located with the TaskManager but runs in its own process
> so that GPU memory is isolated from JVM heap and survives TaskManager
> restarts when possible.
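> To illustrate the batching role of the sidecar, the following is a minimal
> sketch of a size- and time-bounded micro-batcher. All names
> ({{MicroBatcher}}, {{submit}}, {{runOneBatch}}) are hypothetical, the
> {{model}} function stands in for whatever inference engine is plugged in,
> and the RPC layer is omitted entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical sidecar-side batcher; a sketch, not the proposed implementation. */
final class MicroBatcher<I, O> {
    private static final class Pending<I, O> {
        final I input;
        final CompletableFuture<O> result = new CompletableFuture<>();
        Pending(I input) { this.input = input; }
    }

    private final BlockingQueue<Pending<I, O>> queue;
    private final int maxBatchSize;
    private final long maxWaitMillis;
    private final Function<List<I>, List<O>> model; // stand-in for the inference engine

    MicroBatcher(int capacity, int maxBatchSize, long maxWaitMillis,
                 Function<List<I>, List<O>> model) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.maxBatchSize = maxBatchSize;
        this.maxWaitMillis = maxWaitMillis;
        this.model = model;
    }

    /** Enqueues a request; fails the future immediately if the queue is full. */
    CompletableFuture<O> submit(I input) {
        Pending<I, O> p = new Pending<>(input);
        if (!queue.offer(p)) {
            p.result.completeExceptionally(new IllegalStateException("sidecar queue full"));
        }
        return p.result;
    }

    /** Collects up to maxBatchSize requests within maxWaitMillis, runs them, returns the batch size. */
    int runOneBatch() {
        List<Pending<I, O>> batch = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        while (batch.size() < maxBatchSize) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) break;
            Pending<I, O> next;
            try {
                next = queue.poll(remaining, TimeUnit.NANOSECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            if (next == null) break; // deadline reached with no further requests
            batch.add(next);
        }
        if (batch.isEmpty()) return 0;
        List<I> inputs = new ArrayList<>(batch.size());
        for (Pending<I, O> p : batch) inputs.add(p.input);
        try {
            List<O> outputs = model.apply(inputs); // one call for the whole batch
            for (int i = 0; i < batch.size(); i++) batch.get(i).result.complete(outputs.get(i));
        } catch (RuntimeException e) {
            for (Pending<I, O> p : batch) p.result.completeExceptionally(e);
        }
        return batch.size();
    }
}
```

> Because requests from all local operators land in the same queue, a single
> model invocation can serve records from several tasks, which is not
> possible when each UDF holds its own model copy.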
> h2. Compatibility, deprecation, and migration plan
> * Entirely additive. No existing API changes.
> * Feature-flagged via TaskManager configuration; clusters without GPUs are
> unaffected.
> * The initial release is marked {{@Experimental}}.
> h2. Testing plan
> * Unit tests for the new ResourceProfile dimension and scheduling hints.
> * End-to-end tests using a stubbed CPU "mock sidecar" so the pipeline can
> run on non-GPU CI agents.
> * Nightly GPU-enabled tests in a dedicated lane (tracked separately).
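> A CPU-only mock backend for the end-to-end tests might look like the
> sketch below. The {{InferenceBackend}} interface and {{MockCpuBackend}}
> class are assumptions made for illustration; the proposal does not define
> the backend interface. The mock returns a deterministic "score" (the sum of
> the input features) so that test assertions do not depend on a real model.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical backend abstraction; the real interface is not defined in this issue. */
interface InferenceBackend {
    List<float[]> inferBatch(List<float[]> inputs);
}

/** Deterministic CPU stand-in for a GPU model, intended only for tests. */
final class MockCpuBackend implements InferenceBackend {
    private int batchesServed = 0;

    @Override
    public List<float[]> inferBatch(List<float[]> inputs) {
        batchesServed++;
        List<float[]> outputs = new ArrayList<>(inputs.size());
        for (float[] features : inputs) {
            float sum = 0f;
            for (float f : features) sum += f;
            outputs.add(new float[] {sum}); // one-element "prediction"
        }
        return outputs;
    }

    /** Lets a test verify how many batches reached the backend. */
    int batchesServed() {
        return batchesServed;
    }
}
```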
> h2. Implementation plan
> Work is split into six sequentially mergeable sub-tasks, each producing one
> or two reviewable pull requests. The sub-tasks are linked from this
> umbrella issue.
> # Extend {{ResourceProfile}} to declare GPU resources on TaskManager.
> # Introduce a new module {{flink-gpu-sidecar}} containing the service
> skeleton (process lifecycle, configuration, health checks, empty RPC
> surface).
> # Implement asynchronous batched inference RPC inside the sidecar,
> including queue, batcher, and metrics.
> # Add an Async I/O operator that invokes the sidecar from DataStream API.
> # Route GPU-affinity operators to slots backed by a live sidecar via
> ResourceManager.
> # Integrate with Table/SQL (UDF + DDL), ship an end-to-end example, and
> document user-facing configuration.
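> Sub-tasks 1 and 5 together amount to a slot-matching rule: a request that
> needs GPUs should only be placed on a slot that has enough free GPUs and a
> live sidecar. The sketch below shows one way to express that rule. It does
> not use Flink's actual {{ResourceProfile}} or ResourceManager classes;
> {{SlotOffer}} and {{GpuSlotMatcher}} are hypothetical, and GPUs are counted
> as whole integers for simplicity.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical view of a slot as advertised by a TaskManager. */
final class SlotOffer {
    final String taskManagerId;
    final int freeGpus;
    final boolean sidecarAlive;

    SlotOffer(String taskManagerId, int freeGpus, boolean sidecarAlive) {
        this.taskManagerId = taskManagerId;
        this.freeGpus = freeGpus;
        this.sidecarAlive = sidecarAlive;
    }
}

/** Hypothetical matcher; a sketch of the rule, not the proposed scheduler change. */
final class GpuSlotMatcher {
    private GpuSlotMatcher() {}

    /** CPU-only requests match any slot; GPU requests also need capacity and a live sidecar. */
    static boolean matches(SlotOffer offer, int requiredGpus) {
        if (requiredGpus == 0) {
            return true;
        }
        return offer.sidecarAlive && offer.freeGpus >= requiredGpus;
    }

    /** Returns the first matching offer in list order, if any. */
    static Optional<SlotOffer> pick(List<SlotOffer> offers, int requiredGpus) {
        for (SlotOffer offer : offers) {
            if (matches(offer, requiredGpus)) {
                return Optional.of(offer);
            }
        }
        return Optional.empty();
    }
}
```

> Treating a dead sidecar as "no GPU capacity" keeps GPU-affinity operators
> off nodes where inference calls would fail, while leaving CPU-only
> scheduling unchanged, in line with the additive compatibility plan.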
> h2. Rejected alternatives
> * *Loading the model directly inside the UDF.* Wastes GPU memory, prevents
> cross-task batching, and couples the model lifecycle to the task lifecycle.
> * *Running an external model-serving system (for example Triton) as the
> only integration point.* Still valuable and complementary, but does not
> give Flink a first-class notion of GPU resources or tight scheduling
> affinity; this proposal provides that foundation and is compatible with
> external servers as alternative backends.