featzhang created FLINK-39630:
---------------------------------

             Summary: Schedule GPU-affinity operators via ResourceManager
                 Key: FLINK-39630
                 URL: https://issues.apache.org/jira/browse/FLINK-39630
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Coordination
            Reporter: featzhang


h2. Background

The GPU sidecar is a per-node resource: it loads the model once and
serves every local operator on the {{TaskManager}} that hosts it.
To benefit from this, operators whose execution depends on the sidecar
must be scheduled onto slots backed by a node that actually runs a live
sidecar process.

This sub-task adds the scheduling hint and resource-matching logic, and
plugs them into the existing ResourceManager flow. It depends on the
{{GPUResource}} work already completed in the resource-profile sub-task.

h2. Scope of this sub-task

* Mark the GPU client operator from the async-operator sub-task with a
 {{ResourceSpec}} containing a {{GPUResource}} requirement.
* Extend the slot matcher so that slots advertised by non-GPU
 TaskManagers are rejected for such operators.
* Add a lightweight liveness probe in ResourceManager that verifies the
 sidecar's {{/health}} endpoint before a slot is handed out; slots with
 a not-ready sidecar are temporarily withheld.
* Expose a metric counting slot rejections caused by failed sidecar
 liveness checks, to aid diagnosis.
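
The slot-matcher rule above can be sketched as follows. This is only an illustration under stated assumptions: {{GpuSlotMatchingSketch}}, {{SlotOffer}} and the resource name "gpu" are hypothetical stand-ins, not Flink's actual {{ResourceProfile}} or slot-matching classes, and the real matcher would also compare CPU and memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in types; not part of flink-runtime. */
final class GpuSlotMatchingSketch {

    /** Simplified view of an advertised slot: owning TaskManager plus named extended resources. */
    static final class SlotOffer {
        final String taskManagerId;
        final Map<String, Double> extendedResources;

        SlotOffer(String taskManagerId, Map<String, Double> extendedResources) {
            this.taskManagerId = taskManagerId;
            this.extendedResources = extendedResources;
        }
    }

    /** A slot matches only if it advertises at least the required amount of every named resource. */
    static boolean matches(SlotOffer offer, Map<String, Double> required) {
        for (Map.Entry<String, Double> entry : required.entrySet()) {
            double available = offer.extendedResources.getOrDefault(entry.getKey(), 0.0);
            if (available < entry.getValue()) {
                return false; // e.g. a non-GPU TaskManager asked for "gpu"
            }
        }
        return true;
    }

    /** Returns the ids of TaskManagers whose slots satisfy the requirement, in input order. */
    static List<String> eligibleTaskManagers(List<SlotOffer> offers, Map<String, Double> required) {
        List<String> result = new ArrayList<>();
        for (SlotOffer offer : offers) {
            if (matches(offer, required)) {
                result.add(offer.taskManagerId);
            }
        }
        return result;
    }
}
```

An operator with no extended-resource requirement matches every slot in this sketch, so non-GPU jobs would be unaffected by the extra check.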

h2. Out of scope

* Global GPU placement across multiple clusters.
* Re-scheduling on model-weight hot reload (the sidecar handles that
 internally).

h2. Acceptance criteria

* Unit tests covering the slot matcher with mixed GPU and non-GPU
 TaskManagers.
* Integration test: deploying the async operator on a two-node standalone
 cluster (one GPU node with mock sidecar, one plain node) schedules all
 subtasks onto the GPU node.
* Liveness probe failures are reflected in the new metric and in logs.
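
One possible shape for the liveness gate and its rejection counter, so that the last criterion is testable without a real sidecar. All names here ({{SidecarLivenessGateSketch}}, {{admit}}, {{httpProbe}}) are hypothetical, the counter is a plain {{AtomicLong}} rather than a Flink metric, and treating HTTP 200 on {{/health}} as "ready" is an assumption about the sidecar contract.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;

/** Hypothetical sketch; none of these names exist in flink-runtime. */
final class SidecarLivenessGateSketch {

    private final Predicate<URI> healthCheck;
    private final AtomicLong rejections = new AtomicLong();

    /** The probe is injected so tests can replace the HTTP call with a stub. */
    SidecarLivenessGateSketch(Predicate<URI> healthCheck) {
        this.healthCheck = healthCheck;
    }

    /** Returns true if the slot may be handed out; otherwise counts and logs a rejection. */
    boolean admit(URI sidecarBaseUri) {
        boolean healthy;
        try {
            healthy = healthCheck.test(sidecarBaseUri.resolve("/health"));
        } catch (RuntimeException e) {
            healthy = false; // a failing probe is treated like a not-ready sidecar
        }
        if (!healthy) {
            rejections.incrementAndGet();
            System.err.println("Withholding slot: sidecar not ready at " + sidecarBaseUri);
        }
        return healthy;
    }

    long rejectionCount() {
        return rejections.get();
    }

    /** HTTP probe: ready only if GET returns 200 within the timeout (assumed contract). */
    static Predicate<URI> httpProbe(Duration timeout) {
        HttpClient client = HttpClient.newBuilder().connectTimeout(timeout).build();
        return uri -> {
            HttpRequest request = HttpRequest.newBuilder(uri).timeout(timeout).GET().build();
            try {
                return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode() == 200;
            } catch (IOException e) {
                return false;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        };
    }
}
```

In a test, the gate can be built with a stub such as {{uri -> false}} to simulate a sidecar that is not ready, then checked via {{rejectionCount()}}.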

h2. Affected modules

* {{flink-runtime}}
* {{flink-runtime-web}} (surface the new metric)

h2. Links

Parent: see the umbrella issue linked to this sub-task.



