[prometheus-users] Horizontal Pod Autoscaling using Nvidia GPU Metrics

Ashraf Guitouni Sat, 01 May 2021 06:36:44 -0700

Hi all.

I'm trying to implement HPA based on GPU utilization metrics.

My initial approach is to use DCGM Exporter, a DaemonSet that runs a pod on
every GPU node and exports GPU metrics.

By setting an additional scrape config when installing
kube-prometheus-community and a custom rule when installing
prometheus-adapter, I'm able to query the Prometheus API and get
dcgm_gpu_utilization for each node:
dcgm_gpu_utilization{Hostname="dcgm-exporter-dmrff",
UUID="GPU-e26f8adc-c4aa-4a46-b3d3-ff4599da50a3", device="nvidia0", gpu="0",
instance="10.28.0.50:9400", job="gpu-metrics",
kubernetes_node="gke-test-hpa-gpu-nodes-0f879509-qth8"} 3
dcgm_gpu_utilization{Hostname="dcgm-exporter-rxjfm",
UUID="GPU-0446c63e-3843-62fa-56db-423958021f5c", device="nvidia0", gpu="0",
instance="10.28.1.27:9400", job="gpu-metrics",
kubernetes_node="gke-test-hpa-gpu-nodes-0f879509-8bgb"} 0

What I'd like to ask is this: Is it possible to configure HPA for a
deployment based on this metric (even though it's being exported for each
node through dcgm-exporter pods and not the pods corresponding to the
deployment we want to autoscale)?

Perhaps there's a way to generate a metric like mydeploy_gpu_avg which is
equal to avg(dcgm_gpu_utilization) over all nodes that have a replica of
the deployment mydeploy? That would make it possible to configure HPA with
a custom object that targets this mydeploy_gpu_avg metric of mydeploy.
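
A minimal Python sketch of the aggregation described above, only to make the
intended semantics concrete. It is not an existing prometheus-adapter or HPA
feature. The function name deployment_gpu_avg and the replica placement are
hypothetical; the node names and utilization values are the ones from the
query output above.

```python
from statistics import mean


def deployment_gpu_avg(gpu_util_by_node, replica_nodes):
    """Average GPU utilization over the nodes hosting at least one replica.

    gpu_util_by_node: mapping of node name -> dcgm_gpu_utilization value.
    replica_nodes: node names on which the deployment's pods are scheduled.
    Returns None when no replica runs on a node that reports the metric.
    """
    nodes = set(replica_nodes)
    values = [util for node, util in gpu_util_by_node.items() if node in nodes]
    return mean(values) if values else None


# Samples copied from the dcgm_gpu_utilization output above.
gpu_util_by_node = {
    "gke-test-hpa-gpu-nodes-0f879509-qth8": 3,
    "gke-test-hpa-gpu-nodes-0f879509-8bgb": 0,
}

# Hypothetical placement: assume mydeploy has a replica on both GPU nodes.
replica_nodes = list(gpu_util_by_node)

mydeploy_gpu_avg = deployment_gpu_avg(gpu_util_by_node, replica_nodes)
print(mydeploy_gpu_avg)
```

In a real setup the placement would have to come from the cluster (which pods
of mydeploy run on which node), and the result would need to be exposed as a
metric that HPA can read.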

I hope this makes sense so far. To my surprise, this seems to be a rare
scenario. In case it helps, our use case is autoscaling GPU-based machine
learning inference servers.

I would really appreciate any advice on this. I tried to document my
current progress in a GitHub repo: https://github.com/ashrafgt/k8s-gpu-hpa

Thank you.

