featzhang created FLINK-39629:
---------------------------------
Summary: Add Async I/O operator that invokes GPU sidecar from
DataStream API
Key: FLINK-39629
URL: https://issues.apache.org/jira/browse/FLINK-39629
Project: Flink
Issue Type: Sub-task
Components: API / DataStream
Reporter: featzhang
h2. Background
With a functional sidecar that batches inference requests, user programs
written against the DataStream API need a convenient, correctly ordered,
back-pressure-friendly way to invoke the sidecar. Flink's existing Async
I/O subsystem ({{AsyncDataStream}}, {{AsyncFunction}},
{{RichAsyncFunction}}) already provides the required machinery for
ordered / unordered result emission, timeouts, and capacity limits.
This sub-task adds a first-party async function that delegates record-by-
record inference to the sidecar.
h2. Scope of this sub-task
* Add {{GpuSidecarAsyncFunction<IN, OUT>}} in a new package under
{{flink-streaming-java}} or a dedicated {{flink-gpu-client}} module.
Constructor parameters:
** Sidecar endpoint (UDS or TCP).
** Request timeout.
** Maximum inflight requests (capacity).
** A pluggable {{TensorCodec<IN, OUT>}} that converts records to / from
the sidecar's tensor wire format.
* Manage one RPC channel per parallel subtask; reconnect with exponential
backoff.
* Honour Flink checkpointing: inflight requests are drained on checkpoint
barriers when ordered emission is requested.
* Surface the sidecar's structured error codes as
{{AsyncFunction}}-level failures so the user program can apply Flink's
standard restart strategies.
h2. Out of scope
* Table / SQL integration (tracked in a later sub-task).
* Scheduling operators onto nodes with a live sidecar (tracked in a later
sub-task).
* Model-hot-reload UX on the DataStream side.
h2. Acceptance criteria
* End-to-end test: a job with a source, a
{{GpuSidecarAsyncFunction}}, and a sink runs against the mock sidecar
on CI and produces deterministic output under ordered emission.
* Timeout and capacity settings behave as documented.
* Clean shutdown on job cancellation; no connection leak.
h2. Affected modules
* {{flink-streaming-java}} (or new {{flink-gpu-client}})
* {{flink-tests}} (integration test against mock sidecar)
h2. Links
Parent: see umbrella issue linked to this sub-task.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)