featzhang created FLINK-39628:
---------------------------------

             Summary: Implement asynchronous batched inference RPC in GPU sidecar
                 Key: FLINK-39628
                 URL: https://issues.apache.org/jira/browse/FLINK-39628
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Task
            Reporter: featzhang


h2. Background

With the sidecar process and its empty RPC surface in place, this sub-task
turns the sidecar into a working inference server: it accepts concurrent
inference requests, batches them to keep the GPU fully utilised, and returns
results asynchronously.

h2. Scope of this sub-task

* Add an {{Infer}} RPC to the sidecar proto:
** Bidirectional streaming, so that the client can pipeline requests and
 the server can interleave responses.
** The request carries opaque tensor bytes plus a request id; the response
 echoes the request id and carries either result tensor bytes or an error.
* Implement a bounded, backpressure-aware request queue inside the sidecar:
** Maximum queue length and maximum wait time are both configurable.
** When the queue is full the server returns a
 {{RESOURCE_EXHAUSTED}}-equivalent status so the client can apply its
 own backpressure.
* Implement a batcher that aggregates queued requests by time window and
 maximum batch size, then submits a single batched call to the inference
 backend.
* Wire a pluggable backend interface so that the first concrete backend
 (a mock / CPU stub for tests) can be replaced with TensorRT, ONNX
 Runtime, or PyTorch in follow-up work.
* Publish the following metrics through the existing Flink metrics
 reporter abstraction:
** Queue depth.
** Batch size (histogram).
** Inference latency (histogram, end-to-end and per-stage).
** Inflight requests.
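The {{Infer}} RPC described above could take roughly the following shape. This is an illustrative sketch only; the service, message, and field names are placeholders to be settled during implementation, not final proto definitions.

```protobuf
syntax = "proto3";

// Hypothetical service name; the final name is decided in the implementation.
service GpuSidecar {
  // Bidirectional stream: the client pipelines requests and the
  // server interleaves responses as batches complete.
  rpc Infer (stream InferRequest) returns (stream InferResponse);
}

message InferRequest {
  uint64 request_id = 1;  // correlates a response with its request
  bytes tensor = 2;       // opaque tensor bytes; format is owned by the backend
}

message InferResponse {
  uint64 request_id = 1;  // same id as the originating request
  oneof result {
    bytes tensor = 2;     // result tensor bytes on success
    string error = 3;     // structured error, e.g. queue saturation
  }
}
```

Because the tensor payload is opaque bytes, the proto stays stable when the concrete backend (TensorRT, ONNX Runtime, PyTorch) changes.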
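The bounded queue, the time/size-window batcher, and the pluggable backend interface can be sketched together as below. This is a minimal illustration of the intended behaviour, not the actual implementation; all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Pluggable backend; the first concrete implementation is a mock / CPU stub. */
interface InferenceBackend {
    List<byte[]> inferBatch(List<byte[]> inputs);
}

/** Bounded, backpressure-aware queue feeding a time/size-window batcher. */
final class InferenceBatcher {
    private final BlockingQueue<byte[]> queue;
    private final int maxBatchSize;
    private final long maxWaitMillis;
    private final InferenceBackend backend;

    InferenceBatcher(int maxQueueLength, int maxBatchSize, long maxWaitMillis,
                     InferenceBackend backend) {
        this.queue = new ArrayBlockingQueue<>(maxQueueLength);
        this.maxBatchSize = maxBatchSize;
        this.maxWaitMillis = maxWaitMillis;
        this.backend = backend;
    }

    /** Returns false when the queue is full, so the caller can map the
     *  rejection to a RESOURCE_EXHAUSTED-equivalent status. */
    boolean offer(byte[] request) {
        return queue.offer(request);
    }

    /** Drains up to maxBatchSize requests, waiting at most maxWaitMillis for
     *  the window to fill, then submits one batched call to the backend. */
    List<byte[]> runOneBatch() throws InterruptedException {
        List<byte[]> batch = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        while (batch.size() < maxBatchSize) {
            long remaining = deadline - System.nanoTime();
            byte[] next = (remaining > 0)
                    ? queue.poll(remaining, TimeUnit.NANOSECONDS)
                    : queue.poll();                 // window elapsed: drain without waiting
            if (next == null) {
                break;                              // window closed and queue is empty
            }
            batch.add(next);
        }
        return batch.isEmpty() ? List.of() : backend.inferBatch(batch);
    }
}
```

Rejecting at {{offer}} time rather than blocking is what keeps the saturation behaviour structured: the client sees an immediate status instead of a hang, matching the acceptance criteria.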

h2. Out of scope

* A specific model format (tracked with the concrete backend work).
* Authentication / authorisation on the RPC boundary (tracked separately).

h2. Acceptance criteria

* Throughput and latency benchmarks against the mock backend meet the
 documented expectations on a reference machine.
* Queue saturation returns a structured error rather than hanging.
* Metrics are visible via the in-process metric reporter and match the
 counts observed at the client.
* No memory leak across a 30-minute soak test.

h2. Affected modules

* {{flink-gpu-sidecar}}

h2. Links

Parent: see umbrella issue linked to this sub-task.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
