Hi all.

I'm trying to implement HPA based on GPU utilization metrics. 

My initial approach is to use DCGM Exporter, a DaemonSet that runs a pod 
on every GPU node and exports GPU metrics. 

By setting an additional scrape config when installing 
kube-prometheus-community and a custom rule when installing 
prometheus-adapter, I can query the Prometheus API and get 
dcgm_gpu_utilization for each node:
dcgm_gpu_utilization{Hostname="dcgm-exporter-dmrff", UUID="GPU-e26f8adc-c4aa-4a46-b3d3-ff4599da50a3", device="nvidia0", gpu="0", instance="10.28.0.50:9400", job="gpu-metrics", kubernetes_node="gke-test-hpa-gpu-nodes-0f879509-qth8"} 3
dcgm_gpu_utilization{Hostname="dcgm-exporter-rxjfm", UUID="GPU-0446c63e-3843-62fa-56db-423958021f5c", device="nvidia0", gpu="0", instance="10.28.1.27:9400", job="gpu-metrics", kubernetes_node="gke-test-hpa-gpu-nodes-0f879509-8bgb"} 0
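
For illustration, here is a minimal Python sketch of reading these samples through the Prometheus HTTP API (the /api/v1/query endpoint). The function names and the base URL in the usage note are hypothetical. The sketch also assumes one GPU per node, as in the output above; with several GPUs per node, later series would overwrite earlier ones.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict


def fetch_instant_query(base_url: str, query: str) -> dict:
    """Run an instant query against the Prometheus HTTP API and decode the JSON reply."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def utilization_by_node(payload: dict) -> Dict[str, float]:
    """Map each kubernetes_node label to its sample value.

    Assumes an instant-vector result in which every series carries a
    kubernetes_node label, and one GPU (one series) per node.
    """
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result: Dict[str, float] = {}
    for series in payload["data"]["result"]:
        node = series["metric"]["kubernetes_node"]
        _timestamp, value = series["value"]  # the value arrives as a string, e.g. "3"
        result[node] = float(value)
    return result
```

With Prometheus reachable at a hypothetical address such as http://localhost:9090 (for example through a port-forward), utilization_by_node(fetch_instant_query("http://localhost:9090", "dcgm_gpu_utilization")) would return a dict keyed by node name.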

My question is this: is it possible to configure HPA for a deployment 
based on this metric, even though it is exported per node by the 
dcgm-exporter pods rather than by the pods of the deployment we want to 
autoscale?

Perhaps there's a way to generate a metric like mydeploy_gpu_avg, equal to 
avg(dcgm_gpu_utilization) over all nodes that run a replica of the 
deployment mydeploy? That would make it possible to configure HPA with a 
custom Object-type metric that targets mydeploy_gpu_avg for mydeploy.
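
To make the intended aggregation concrete, here is a minimal Python sketch of the arithmetic only, not of how Prometheus or prometheus-adapter would compute it. The function name deployment_gpu_avg and its inputs are hypothetical: a mapping from node name to GPU utilization, and the names of the nodes that currently run a replica of mydeploy.

```python
from typing import Dict, Iterable, Optional


def deployment_gpu_avg(
    gpu_util_by_node: Dict[str, float],
    replica_nodes: Iterable[str],
) -> Optional[float]:
    """Average GPU utilization over the nodes that host a replica.

    Each node counts once, even if it hosts several replicas.
    Returns None when no replica node has a utilization sample.
    """
    nodes = set(replica_nodes)
    samples = [gpu_util_by_node[n] for n in nodes if n in gpu_util_by_node]
    if not samples:
        return None
    return sum(samples) / len(samples)


# Values taken from the two samples shown earlier in this message.
util = {
    "gke-test-hpa-gpu-nodes-0f879509-qth8": 3.0,
    "gke-test-hpa-gpu-nodes-0f879509-8bgb": 0.0,
}
both_nodes = deployment_gpu_avg(util, util.keys())
one_node = deployment_gpu_avg(util, ["gke-test-hpa-gpu-nodes-0f879509-qth8"])
```

With replicas on both nodes the average is 1.5; with a replica only on the qth8 node it is 3.0, so nodes without a replica do not dilute the value.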


I hope I'm making sense so far. To my surprise, this seems to be a rare 
scenario. In case it helps to know, our use case is autoscaling GPU-based 
machine learning inference servers.


I would really appreciate any advice on this. I have documented my 
current progress in a GitHub repo: https://github.com/ashrafgt/k8s-gpu-hpa

Thank you.
