featzhang created FLINK-39626:
---------------------------------

             Summary: Extend ResourceProfile to declare GPU resources on TaskManager
                 Key: FLINK-39626
                 URL: https://issues.apache.org/jira/browse/FLINK-39626
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Coordination, Runtime / Task
            Reporter: featzhang


h2. Background

Flink currently expresses slot requirements in {{ResourceProfile}} with CPU
cores, managed memory, task heap memory, and a generic
{{Map<String, Resource> extendedResources}}. The extended-resource map is
intended for pluggable resources such as GPUs, but there is no first-party
support for declaring, advertising, or matching GPU resources.

This sub-task adds the concrete definitions and plumbing required so that
subsequent sub-tasks can schedule operators that depend on a GPU sidecar.

h2. Scope of this sub-task

* Add a {{GPUResource}} subclass of {{Resource}} under
 {{flink-core}} or {{flink-runtime}}, carrying at least a logical GPU
 count.
* Let TaskManagers advertise {{GPUResource}} in the resource profile they
 report to the ResourceManager, gated by a configuration option such as
 {{taskmanager.resources.gpu.count}}.
* Ensure {{ResourceProfile#merge}}, {{#subtract}}, and
 {{#isMatching}} handle the new resource correctly.
* No scheduling-policy change in this sub-task; scheduling with GPU
 affinity is covered in a separate sub-task.
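
The semantics expected from the bullets above can be sketched with a small standalone model. This is not Flink's {{ResourceProfile}}: the class {{GpuProfileSketch}}, the resource name {{gpu}}, and the helper methods are hypothetical, and the option key is only the example named in this issue. The sketch assumes that {{isMatching}} means "this profile can satisfy the required one" and that {{subtract}} rejects a request larger than what is available.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Standalone model for illustration only; this is not Flink's ResourceProfile. */
final class GpuProfileSketch {
    /** Hypothetical resource name used as the key in the extended-resource map. */
    static final String GPU = "gpu";
    /** Option name taken from the issue's example; the final key may differ. */
    static final String GPU_COUNT_KEY = "taskmanager.resources.gpu.count";

    final long cpuCores;
    final Map<String, Long> extended; // resource name -> amount

    GpuProfileSketch(long cpuCores, Map<String, Long> extended) {
        if (cpuCores < 0) throw new IllegalArgumentException("negative CPU cores");
        for (long amount : extended.values()) {
            if (amount < 0) throw new IllegalArgumentException("negative resource amount");
        }
        this.cpuCores = cpuCores;
        this.extended = Collections.unmodifiableMap(new HashMap<>(extended));
    }

    /** Builds the profile a TaskManager would advertise; an absent option means no GPUs. */
    static GpuProfileSketch advertised(long cpuCores, Map<String, String> conf) {
        long gpus = Long.parseLong(conf.getOrDefault(GPU_COUNT_KEY, "0"));
        if (gpus < 0) throw new IllegalArgumentException("negative GPU count");
        Map<String, Long> ext = new HashMap<>();
        if (gpus > 0) ext.put(GPU, gpus);
        return new GpuProfileSketch(cpuCores, ext);
    }

    /** Convenience factory for a profile with the given CPU cores and GPU count. */
    static GpuProfileSketch of(long cpuCores, long gpus) {
        return advertised(cpuCores, Collections.singletonMap(GPU_COUNT_KEY, Long.toString(gpus)));
    }

    long gpuCount() {
        return extended.getOrDefault(GPU, 0L);
    }

    /** Adds every field, including each extended resource, of both profiles. */
    GpuProfileSketch merge(GpuProfileSketch other) {
        Map<String, Long> sum = new HashMap<>(extended);
        other.extended.forEach((name, amount) -> sum.merge(name, amount, Long::sum));
        return new GpuProfileSketch(cpuCores + other.cpuCores, sum);
    }

    /** Removes the other profile's resources; fails if this profile cannot cover them. */
    GpuProfileSketch subtract(GpuProfileSketch other) {
        if (!isMatching(other)) {
            throw new IllegalArgumentException("cannot subtract a larger profile");
        }
        Map<String, Long> rest = new HashMap<>(extended);
        other.extended.forEach((name, amount) -> rest.merge(name, -amount, Long::sum));
        rest.values().removeIf(amount -> amount == 0L); // drop exhausted resources
        return new GpuProfileSketch(cpuCores - other.cpuCores, rest);
    }

    /** True if this profile offers at least what the required profile asks for. */
    boolean isMatching(GpuProfileSketch required) {
        if (cpuCores < required.cpuCores) return false;
        for (Map.Entry<String, Long> e : required.extended.entrySet()) {
            if (extended.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(GPU_COUNT_KEY, "2");
        GpuProfileSketch taskManager = advertised(4, conf);
        GpuProfileSketch request = of(1, 1);
        System.out.println("matches: " + taskManager.isMatching(request));
        System.out.println("GPUs left: " + taskManager.subtract(request).gpuCount());
    }
}
```

A profile without the option set carries no {{gpu}} entry at all, so a non-GPU cluster in this model is indistinguishable from today's behaviour; whether the real change should represent "zero GPUs" as an absent entry or an explicit zero is a design choice left to the implementation.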

h2. Out of scope

* No model loading, no RPC, no operator changes.
* No vendor-specific attributes (device UUID, memory per device). Those can
 be added later in a backward-compatible way using the existing
 extended-resource mechanism.

h2. Acceptance criteria

* {{ResourceProfile}} round-trips correctly through serialization with a
 {{GPUResource}} set.
* The TaskManager exposes the configured GPU count to the ResourceManager.
* Unit tests cover {{merge}}, {{subtract}}, and {{isMatching}} interactions
 with {{GPUResource}}.
* No regression in non-GPU cluster startup or existing resource tests.
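
The serialization criterion could be checked along the following lines. This is a minimal sketch using plain Java serialization; {{GpuAmount}} is a hypothetical stand-in for {{GPUResource}}, and the real test would exercise {{ResourceProfile}} itself, most likely through Flink's existing test utilities rather than a hand-written helper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Objects;

/** Standalone illustration; GpuAmount is a hypothetical stand-in for GPUResource. */
final class GpuRoundTripSketch {

    static final class GpuAmount implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final long count;

        GpuAmount(String name, long count) {
            this.name = Objects.requireNonNull(name);
            this.count = count;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof GpuAmount)) return false;
            GpuAmount other = (GpuAmount) o;
            return count == other.count && name.equals(other.name);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, count);
        }
    }

    /** Serializes a value to bytes and reads it back as a fresh copy. */
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        GpuAmount before = new GpuAmount("gpu", 2);
        GpuAmount after = roundTrip(before);
        System.out.println("equal after round trip: " + before.equals(after));
    }
}
```

The check relies on value-based {{equals}} and {{hashCode}}, so the new resource type would need both for the round-trip assertion to be meaningful.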

h2. Affected modules

* {{flink-core}}
* {{flink-runtime}}
* {{flink-runtime-web}} (if the resource is surfaced in the dashboard in a
 follow-up)

h2. Links

Parent: see the umbrella issue linked to this sub-task.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
