[jira] [Updated] (FLINK-39625) Support GPU-based model inference via sidecar/actor pattern

featzhang (Jira) Thu, 07 May 2026 23:42:10 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-39625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


featzhang updated FLINK-39625:
------------------------------
    Attachment: image-2026-05-08-14-41-15-033.png

> Support GPU-based model inference via sidecar/actor pattern
> -----------------------------------------------------------
>
>                 Key: FLINK-39625
>                 URL: https://issues.apache.org/jira/browse/FLINK-39625
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>            Reporter: featzhang
>            Priority: Major
>              Labels: flip-proposal, gpu, model-inference
>         Attachments: image-2026-05-08-14-41-15-033.png
>
>
> h2. !image-2026-05-08-14-41-15-033.png!
> h2. Motivation
> Real-time stream processing increasingly incorporates machine-learning 
> inference
> directly into streaming pipelines, for use cases such as feature enrichment,
> anomaly detection, content ranking, and fraud detection. Today, users who
> need GPU-accelerated model inference inside Flink typically embed the model
> weights inside the UDF or operator, which has several drawbacks:
>  * Every task slot holding the operator loads its own copy of the model into
> GPU memory, wasting VRAM and slot initialisation time.
>  * The model's lifecycle is coupled to Flink's task lifecycle: any restart,
> rescale, or failover causes model reload.
>  * Request batching must be implemented ad-hoc inside each UDF, and cross-task
> batching is impossible because each task only sees its own local requests.
>  * Flink has no first-class notion of GPU resources in ResourceProfile, so
> GPU assignment relies on external schedulers or manual pinning.
> Inspired by the actor pattern used in frameworks such as Ray, this proposal
> introduces a long-lived "GPU sidecar" process, co-located with each GPU-
> equipped TaskManager. Flink operators invoke the sidecar over an RPC
> channel; the sidecar owns model loading, GPU memory management, and
> cross-operator request batching.
> h2. Goals
>  * Add a standardised way to declare GPU resources on TaskManager and request
> them through ResourceProfile.
>  * Provide a first-party GPU sidecar service that hosts one or more models on
> each GPU node, loads them once, and serves asynchronous inference
> requests.
>  * Allow Flink operators (both DataStream async I/O and Table/SQL functions)
> to invoke the sidecar efficiently, with bounded-queue back-pressure and
> cross-request batching.
>  * Schedule GPU-affinity operators close to a live sidecar via existing
> ResourceManager mechanisms.
>  * Expose metrics for GPU utilisation, queue depth, batch size, and
> per-request latency.
> h2. Non-goals
>  * This proposal does not introduce a new model-training runtime; training
> remains the responsibility of external systems.
>  * It does not prescribe a specific inference engine (TensorRT, ONNX
> Runtime, PyTorch, and plain CUDA kernels are all acceptable backends).
>  * It does not replace the existing external-async patterns (Async I/O,
> Lookup Join); those remain valid for non-GPU remote calls.
> h2. Proposed design overview
> Three cooperating components:
>  # *TaskManager-side resource declaration* - extend {{ResourceProfile}} with a
> GPU dimension, advertised by TaskManagers that opt in.
>  # *GPU sidecar process* - a standalone service, one per GPU node, that loads
> models on startup, exposes an RPC endpoint (initial implementation: gRPC
> over UDS / TCP), and batches incoming inference requests.
>  # *Flink-side client* - an Async I/O based operator and a Table/SQL UDF that
> translate records into sidecar RPC calls, enforce ordering and timeouts,
> and surface metrics.
> The sidecar is co-located with the TaskManager but runs in its own process
> so that GPU memory is isolated from JVM heap and survives TaskManager
> restarts when possible.
> h2. Compatibility, deprecation, and migration plan
>  * Entirely additive. No existing API changes.
>  * Feature-flagged via TaskManager configuration; clusters without GPUs are
> unaffected.
>  * The initial release is marked {{{}@Experimental{}}}.
> h2. Testing plan
>  * Unit tests for the new ResourceProfile dimension and scheduling hints.
>  * End-to-end tests using a stubbed CPU "mock sidecar" so the pipeline can
> run on non-GPU CI agents.
>  * Nightly GPU-enabled tests in a dedicated lane (tracked separately).
> h2. Implementation plan
> Work is split into six sequentially mergeable sub-tasks, each producing one
> or two reviewable pull requests. The sub-tasks are linked from this
> umbrella issue.
>  # Extend {{ResourceProfile}} to declare GPU resources on TaskManager.
>  # Introduce a new module {{flink-gpu-sidecar}} containing the service
> skeleton (process lifecycle, configuration, health checks, empty RPC
> surface).
>  # Implement asynchronous batched inference RPC inside the sidecar,
> including queue, batcher, and metrics.
>  # Add an Async I/O operator that invokes the sidecar from DataStream API.
>  # Route GPU-affinity operators to slots backed by a live sidecar via
> ResourceManager.
>  # Integrate with Table/SQL (UDF + DDL), ship an end-to-end example, and
> document user-facing configuration.
> h2. Rejected alternatives
>  * *Loading the model directly inside the UDF.* Wastes GPU memory, prevents
> cross-task batching, couples model lifecycle to task lifecycle.
>  * *Running an external model-serving system (for example Triton) as the
> only integration point.* Still valuable and complementary, but does not
> give Flink a first-class notion of GPU resources or tight scheduling
> affinity; this proposal provides that foundation and is compatible with
> external servers as alternative backends.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (FLINK-39625) Support GPU-based model inference via sidecar/actor pattern

Reply via email to