featzhang created FLINK-39626:
---------------------------------
Summary: Extend ResourceProfile to declare GPU resources on
TaskManager
Key: FLINK-39626
URL: https://issues.apache.org/jira/browse/FLINK-39626
Project: Flink
Issue Type: Sub-task
Components: Runtime / Coordination, Runtime / Task
Reporter: featzhang
h2. Background
Flink currently expresses slot requirements in {{ResourceProfile}} as CPU
cores, managed memory, task heap memory, and a generic
{{Map<String, Resource> extendedResources}}. The extended-resources map is
intended for pluggable resources such as GPUs, but there is no first-party
support for declaring, advertising, or matching GPU resources.
This sub-task adds the concrete definitions and plumbing required so that
subsequent sub-tasks can schedule operators that depend on a GPU sidecar.
h2. Scope of this sub-task
* Add a {{GPUResource}} subclass of {{Resource}} under
{{flink-core}} or {{flink-runtime}}, carrying at least a logical GPU
count.
* Let TaskManagers advertise {{GPUResource}} in the resource profile they
report to the ResourceManager, gated by a configuration option such as
{{taskmanager.resources.gpu.count}}.
* Ensure {{ResourceProfile#merge}}, {{#subtract}}, and
{{#isMatching}} handle the new resource correctly.
* No scheduling-policy change in this sub-task; scheduling with GPU
affinity is covered in a separate sub-task.
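As a rough illustration of the semantics the scope items above ask for, the following self-contained Java sketch models a GPU count inside an extended-resources map. The class {{GpuProfileSketch}}, the key {{"gpu"}}, and the use of plain {{long}} amounts are assumptions made for illustration only; this is not Flink's actual {{ResourceProfile}} or {{Resource}} API, and the real implementation may differ.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a resource profile; NOT Flink's ResourceProfile. */
final class GpuProfileSketch {

    /** Assumed map key for the GPU resource; the real key name is not defined in this issue. */
    static final String GPU = "gpu";

    /** Extended resources by name, e.g. {"gpu" -> 2}. Amounts are never negative. */
    private final Map<String, Long> extended;

    GpuProfileSketch(Map<String, Long> extended) {
        for (Map.Entry<String, Long> e : extended.entrySet()) {
            if (e.getValue() < 0) {
                throw new IllegalArgumentException("Negative amount for " + e.getKey());
            }
        }
        this.extended = Collections.unmodifiableMap(new HashMap<>(extended));
    }

    /** Convenience factory: a profile that declares only a logical GPU count. */
    static GpuProfileSketch withGpus(long count) {
        return new GpuProfileSketch(Collections.singletonMap(GPU, count));
    }

    /** Logical GPU count, or 0 if the profile declares no GPUs. */
    long gpus() {
        return extended.getOrDefault(GPU, 0L);
    }

    /** Sums amounts per resource name; names missing on one side count as 0. */
    GpuProfileSketch merge(GpuProfileSketch other) {
        Map<String, Long> result = new HashMap<>(extended);
        other.extended.forEach((name, amount) -> result.merge(name, amount, Long::sum));
        return new GpuProfileSketch(result);
    }

    /** Removes the other profile's amounts; rejects requests larger than what is available. */
    GpuProfileSketch subtract(GpuProfileSketch other) {
        if (!isMatching(other)) {
            throw new IllegalArgumentException("Cannot subtract more than is available");
        }
        Map<String, Long> result = new HashMap<>(extended);
        other.extended.forEach((name, amount) -> result.merge(name, -amount, Long::sum));
        return new GpuProfileSketch(result);
    }

    /** True if this (available) profile covers every amount in the required profile. */
    boolean isMatching(GpuProfileSketch required) {
        for (Map.Entry<String, Long> e : required.extended.entrySet()) {
            if (extended.getOrDefault(e.getKey(), 0L) < e.getValue()) {
                return false;
            }
        }
        return true;
    }
}
```

Under this model, a TaskManager profile with 4 GPUs matches a request for 1 GPU, and subtracting that request leaves 3. A profile that declares no GPUs still matches any request that asks for none, which is the behavior a non-GPU cluster would rely on.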
h2. Out of scope
* No model loading, no RPC, no operator changes.
* No vendor-specific attributes (device UUID, memory per device). Those can
be added later in a backward-compatible way using the existing
extended-resource mechanism.
h2. Acceptance criteria
* {{ResourceProfile}} round-trips correctly through serialization with a
{{GPUResource}} set.
* TaskManager exposes the configured GPU count to the ResourceManager.
* Unit tests cover {{merge}}, {{subtract}}, and {{isMatching}} interactions
with {{GPUResource}}.
* No regression in non-GPU cluster startup or existing resource tests.
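The serialization criterion above could be checked along the following lines. This is a minimal sketch that assumes plain Java serialization; {{GpuSerializationSketch}}, {{GpuAmount}}, and {{roundTrip}} are hypothetical names introduced here, and the real test would exercise {{ResourceProfile}} with a {{GPUResource}} set, using whatever test utilities the project already provides.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class GpuSerializationSketch {

    /** Hypothetical stand-in for the proposed GPUResource; carries only a logical count. */
    static final class GpuAmount implements Serializable {
        private static final long serialVersionUID = 1L;

        private final long count;

        GpuAmount(long count) {
            this.count = count;
        }

        long getCount() {
            return count;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof GpuAmount && ((GpuAmount) o).count == count;
        }

        @Override
        public int hashCode() {
            return Long.hashCode(count);
        }
    }

    /** Serializes the value to bytes and reads it back as a new object. */
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            // Wrapped so callers in a test do not need to declare checked exceptions.
            throw new IllegalStateException("Round trip failed", e);
        }
    }
}
```

A test would then assert that the copy is a distinct object that is equal to the original, which also exercises {{equals}} and {{hashCode}} on the new resource type.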
h2. Affected modules
* {{flink-core}}
* {{flink-runtime}}
* {{flink-runtime-web}} (if the resource is surfaced in the dashboard in a
follow-up)
h2. Links
Parent: see the umbrella issue linked to this sub-task.