[jira] [Commented] (FLINK-39628) Implement asynchronous batched inference RPC in GPU sidecar

featzhang (Jira) Thu, 07 May 2026 17:28:07 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-39628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079275#comment-18079275
 ]


featzhang commented on FLINK-39628:
-----------------------------------

I would like to work on this sub-task under the FLINK-39625 umbrella. Could a 
committer please assign it to me (Jira username: featzhang)? Thanks!

> Implement asynchronous batched inference RPC in GPU sidecar
> -----------------------------------------------------------
>
>                 Key: FLINK-39628
>                 URL: https://issues.apache.org/jira/browse/FLINK-39628
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Task
>            Reporter: featzhang
>            Priority: Major
>              Labels: gpu, model-inference
>
> h2. Background
> With the sidecar process and its empty RPC surface in place, this sub-task
> turns the sidecar into a real inference server: it accepts concurrent
> inference requests, batches them to make full use of the GPU, and returns
> results asynchronously.
> h2. Scope of this sub-task
> * Add an {{Infer}} RPC to the sidecar proto:
> ** Bidirectional streaming, so that the client can pipeline requests and
>  the server can interleave responses.
> ** Request carries opaque tensor bytes plus a request id; response carries
>  the same request id plus result tensor bytes or an error.
> * Implement a bounded, backpressure-aware request queue inside the sidecar:
> ** Maximum queue length and maximum wait time are both configurable.
> ** Once the queue is full, the server returns a
>  {{RESOURCE_EXHAUSTED}}-equivalent status so the client can apply its
>  own back-pressure.
> * Implement a batcher that aggregates queued requests by time window and
>  maximum batch size, then submits a single batched call to the inference
>  backend.
> * Wire a pluggable backend interface so that the first concrete backend
>  (a mock / CPU stub for tests) can be replaced with TensorRT, ONNX
>  Runtime, or PyTorch in follow-up work.
> * Publish the following metrics through the existing Flink metrics
>  reporter abstraction:
> ** Queue depth.
> ** Batch size (histogram).
> ** Inference latency (histogram, end-to-end and per-stage).
> ** Inflight requests.
> h2. Out of scope
> * A specific model format (tracked with the concrete backend work).
> * Authentication / authorisation on the RPC boundary (tracked separately).
> h2. Acceptance criteria
> * Throughput and latency benchmarks using the mock backend match the
>  documented expectations on a reference machine.
> * Queue saturation returns a structured error rather than hanging.
> * Metrics are visible via the in-process metric reporter and match the
>  counts observed at the client.
> * No memory leak across a 30-minute soak test.
> h2. Affected modules
> * {{flink-gpu-sidecar}}
> h2. Links
> Parent: see umbrella issue linked to this sub-task.
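
To make the queue and batcher items in the scope above more concrete, here is a minimal Java sketch of how a bounded queue, a time-window/size batcher, and a pluggable backend might fit together. All names ({{InferenceBackend}}, {{EchoBackend}}, {{BatchingQueue}}, {{QueueFullException}}) are hypothetical and not taken from the issue or from any existing Flink module. The gRPC streaming layer and metrics are left out, and the wait window is measured from the start of each batching call, which is only one possible reading of "time window".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical pluggable backend: one call per batch, outputs in input order. */
interface InferenceBackend {
    List<byte[]> inferBatch(List<byte[]> inputs) throws Exception;
}

/** Mock backend for tests: returns each input tensor unchanged. */
final class EchoBackend implements InferenceBackend {
    @Override
    public List<byte[]> inferBatch(List<byte[]> inputs) {
        return new ArrayList<>(inputs);
    }
}

/** Hypothetical bounded queue plus batcher (assumption, not the Flink design). */
final class BatchingQueue {
    /** Raised when the queue is full; a server could map this to RESOURCE_EXHAUSTED. */
    static final class QueueFullException extends RuntimeException {
        QueueFullException(String message) { super(message); }
    }

    /** One queued request: id, opaque tensor bytes, and the future for its result. */
    private static final class Pending {
        final long requestId;
        final byte[] tensor;
        final CompletableFuture<byte[]> result = new CompletableFuture<>();
        Pending(long requestId, byte[] tensor) {
            this.requestId = requestId;
            this.tensor = tensor;
        }
    }

    private final BlockingQueue<Pending> queue;
    private final int maxBatchSize;
    private final long maxWaitMillis;
    private final InferenceBackend backend;

    BatchingQueue(int maxQueueLength, int maxBatchSize, long maxWaitMillis,
                  InferenceBackend backend) {
        this.queue = new ArrayBlockingQueue<>(maxQueueLength);
        this.maxBatchSize = maxBatchSize;
        this.maxWaitMillis = maxWaitMillis;
        this.backend = backend;
    }

    /** Non-blocking enqueue: a full queue fails the future at once instead of hanging. */
    CompletableFuture<byte[]> submit(long requestId, byte[] tensor) {
        Pending p = new Pending(requestId, tensor);
        if (!queue.offer(p)) {
            p.result.completeExceptionally(
                new QueueFullException("queue full, request " + requestId + " rejected"));
        }
        return p.result;
    }

    int queueDepth() {
        return queue.size();
    }

    /**
     * Collects requests until maxBatchSize is reached or maxWaitMillis has elapsed
     * since this call started, then issues a single backend call for the batch.
     * Returns the number of requests in the batch (0 if none arrived in time).
     */
    int runOneBatch() {
        List<Pending> batch = new ArrayList<>(maxBatchSize);
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        try {
            while (batch.size() < maxBatchSize) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) break;
                Pending p = queue.poll(remaining, TimeUnit.NANOSECONDS);
                if (p == null) break;          // window expired with nothing more queued
                batch.add(p);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the flag; process what we have
        }
        if (batch.isEmpty()) return 0;

        List<byte[]> inputs = new ArrayList<>(batch.size());
        for (Pending p : batch) inputs.add(p.tensor);
        try {
            List<byte[]> outputs = backend.inferBatch(inputs);
            for (int i = 0; i < batch.size(); i++) {
                batch.get(i).result.complete(outputs.get(i));
            }
        } catch (Exception e) {
            // A backend failure fails every request in the batch that is still pending.
            for (Pending p : batch) p.result.completeExceptionally(e);
        }
        return batch.size();
    }
}
```

In this sketch a dedicated thread would call {{runOneBatch()}} in a loop, while RPC handler threads call {{submit(...)}} and attach the returned future to the response stream. Swapping {{EchoBackend}} for a TensorRT, ONNX Runtime, or PyTorch implementation would only require another {{InferenceBackend}}.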



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
