Hi all. I'm trying to implement HPA based on GPU utilization metrics.
My initial approach is to use DCGM Exporter, a daemonset that runs a pod on every GPU node and exports GPU metrics. By setting an additional scrape config when installing kube-prometheus-community and a custom rule when installing prometheus-adapter, I'm able to query the Prometheus API and get dcgm_gpu_utilization for each node:

dcgm_gpu_utilization{Hostname="dcgm-exporter-dmrff", UUID="GPU-e26f8adc-c4aa-4a46-b3d3-ff4599da50a3", device="nvidia0", gpu="0", instance="10.28.0.50:9400", job="gpu-metrics", kubernetes_node="gke-test-hpa-gpu-nodes-0f879509-qth8"} 3
dcgm_gpu_utilization{Hostname="dcgm-exporter-rxjfm", UUID="GPU-0446c63e-3843-62fa-56db-423958021f5c", device="nvidia0", gpu="0", instance="10.28.1.27:9400", job="gpu-metrics", kubernetes_node="gke-test-hpa-gpu-nodes-0f879509-8bgb"} 0

What I'd like to ask is this: is it possible to configure HPA for a deployment based on this metric, even though it is exported per node by the dcgm-exporter pods and not by the pods of the deployment we want to autoscale?

Perhaps there's a way to generate a metric like mydeploy_gpu_avg, equal to avg(dcgm_gpu_utilization) over all nodes that run a replica of the deployment mydeploy? That would make it possible to configure HPA with a custom object metric that targets mydeploy_gpu_avg for mydeploy.

I hope I'm making sense so far. To my surprise, this seems to be a rare scenario. In case it helps to know, our use case is autoscaling GPU-based machine learning inference servers.

I would really appreciate any advice on this. I tried to document my current progress in a GitHub repo: https://github.com/ashrafgt/k8s-gpu-hpa

Thank you.
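To make the mydeploy_gpu_avg idea concrete, here is a minimal Python sketch of the arithmetic only: average the per-node dcgm_gpu_utilization values over the nodes that host at least one mydeploy replica. The pod names, the pod-to-node placement, and the helper deployment_gpu_avg are all hypothetical, made up for illustration; only the two node names and their values (3 and 0) come from the query output above. In Prometheus itself, the pod-to-node mapping would presumably have to come from a join with another metric (for example kube-state-metrics' kube_pod_info, assuming it is being scraped), which this sketch does not attempt.

```python
from statistics import mean

# Samples taken from the query output in the post: node name -> dcgm_gpu_utilization.
gpu_utilization_by_node = {
    "gke-test-hpa-gpu-nodes-0f879509-qth8": 3.0,
    "gke-test-hpa-gpu-nodes-0f879509-8bgb": 0.0,
}

# Hypothetical pod placement (assumption, not from the post): pod name -> node name.
pod_to_node = {
    "mydeploy-aaaaa": "gke-test-hpa-gpu-nodes-0f879509-qth8",
    "mydeploy-bbbbb": "gke-test-hpa-gpu-nodes-0f879509-8bgb",
    "otherapp-ccccc": "gke-test-hpa-gpu-nodes-0f879509-8bgb",
}


def deployment_gpu_avg(pod_prefix, pod_to_node, utilization_by_node):
    """Average GPU utilization over the distinct nodes running matching pods.

    Pods are matched by name prefix, a simplification that stands in for a
    real deployment-to-pod lookup. Returns None when no matching node has data.
    """
    # Each node counts once, even if it hosts several replicas.
    nodes = {node for pod, node in pod_to_node.items() if pod.startswith(pod_prefix)}
    values = [utilization_by_node[n] for n in sorted(nodes) if n in utilization_by_node]
    return mean(values) if values else None


mydeploy_gpu_avg = deployment_gpu_avg("mydeploy-", pod_to_node, gpu_utilization_by_node)
print(mydeploy_gpu_avg)
```

Note that this averages per node, not per replica, so a node shared with another GPU workload would contribute that workload's utilization as well.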