Re: [prometheus-users] Horizontal Pod Autoscaling using Nvidia GPU Metrics

2021-05-01 Thread Matthias Rampke
Looks good! On Sat, May 1, 2021, 23:43 Ashraf Guitouni wrote: > Understood. Thank you for the explanation! > > I tried the expression and I got the following error: > Error executing query: multiple matches for labels: grouping labels must > ensure unique matches > > The output of the first
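That error usually means the "one" side of a many-to-one match still has more than one series per set of matching labels. A rough sketch of the usual fix, with metric and label names that are assumptions rather than the exact expression from this thread, is to aggregate that side first, for example in a recording rule:

  groups:
    - name: gpu-occupancy-sketch
      rules:
        # Hypothetical join of per-GPU utilization onto a per-pod "occupied" flag.
        # Aggregating the right-hand side with max by (pod) leaves exactly one
        # series per pod, which is what "grouping labels must ensure unique
        # matches" asks for. model_pod_occupied is an assumed metric name.
        - record: pod:gpu_util_if_occupied:sketch
          expr: |
            DCGM_FI_DEV_GPU_UTIL
              * on (pod) group_left()
            max by (pod) (model_pod_occupied)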

Re: [prometheus-users] Horizontal Pod Autoscaling using Nvidia GPU Metrics

2021-05-01 Thread Matthias Rampke
I realized that using the request metrics may not work because they can only be updated once a request is complete. Ideally you'd have a direct "is this pod occupied" 1/0 metric from each model pod, but I don't know if that's possible with the framework. For the GPU metrics, we need to match the
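If the serving framework can export a gauge of requests currently being handled, a 1/0 occupancy signal can be derived from it; model_requests_in_flight below is a hypothetical metric name, not something the thread confirms exists:

  groups:
    - name: model-occupancy-sketch
      rules:
        # 1 while the pod is handling at least one request, 0 otherwise.
        - record: pod:model_occupied:bool
          expr: max by (pod) (model_requests_in_flight) > bool 0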

[prometheus-users] Re: Horizontal Pod Autoscaling using Nvidia GPU Metrics

2021-05-01 Thread Ashraf Guitouni
Thank you for your reply. There's no GPU sharing between pods at the moment (that's how it is in general for k8s, except for Nvidia MIG). The goal is to have the HPA increase/decrease the replicas of a deployment, which will cause the cluster autoscaler to provision a new node if needed.
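For reference, a minimal sketch of the HPA side once a per-pod GPU metric is exposed through a metrics adapter such as prometheus-adapter; the resource names and target value are assumptions:

  apiVersion: autoscaling/v2beta2
  kind: HorizontalPodAutoscaler
  metadata:
    name: model-server              # placeholder name
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: model-server
    minReplicas: 1
    maxReplicas: 8
    metrics:
      - type: Pods
        pods:
          metric:
            name: gpu_utilization   # custom metric served by the adapter
          target:
            type: AverageValue
            averageValue: "80"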

[prometheus-users] Re: Horizontal Pod Autoscaling using Nvidia GPU Metrics

2021-05-01 Thread sayf.eddi...@gmail.com
Hi, it depends on how the pods on the same node share the GPU, but I think it is doable if you configure the HPA to spawn new pods and the pods to `request` GPU resources; this will force the GKE cluster autoscaler into creating new nodes to place the new pods. Are you using KubeFlow
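A minimal sketch of the `request` side, assuming the NVIDIA device plugin is installed; the deployment and image names are placeholders. Extended resources such as nvidia.com/gpu are specified under limits:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: model-server                         # placeholder name
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: model-server
    template:
      metadata:
        labels:
          app: model-server
      spec:
        containers:
          - name: model
            image: example/model-server:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 1   # a replica that cannot be scheduled triggers the cluster autoscaler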

Re: [prometheus-users] How-To debug prometheus_rule_evaluation_failures_total? Prometheus is failing rule evaluations

2021-05-01 Thread Matthias Rampke
That looks good; I think the issue is which target(s) you discover for these jobs. If you scrape Prometheus directly you may have to change the TLS settings depending on your configuration. /MR On Sat, Apr 24, 2021, 08:58 'Evelyn Pereira Souza' via Prometheus Users <
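For example, if Prometheus itself is served over TLS, the self-scrape job needs a scheme and tls_config along these lines; the hostname and CA path are assumptions:

  scrape_configs:
    - job_name: prometheus
      scheme: https                          # only if Prometheus is served over TLS
      tls_config:
        ca_file: /etc/prometheus/ca.crt      # hypothetical CA path
        # insecure_skip_verify: true         # only for testing
      static_configs:
        - targets: ['prometheus.example.com:9090']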

[prometheus-users] Horizontal Pod Autoscaling using Nvidia GPU Metrics

2021-05-01 Thread Ashraf Guitouni
Hi all. I'm trying to implement HPA based on GPU utilization metrics. My initial approach is to use DCGM Exporter, which is a DaemonSet that runs a pod on every GPU node and exports GPU metrics. By setting an additional scrape config when installing kube-prometheus-community and a custom
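A sketch of what such an additional scrape config can look like in the kube-prometheus-stack Helm values; the pod label used to select the exporter is an assumption about how the DaemonSet is labelled in a given cluster:

  prometheus:
    prometheusSpec:
      additionalScrapeConfigs:
        - job_name: dcgm-exporter
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Keep only the dcgm-exporter pods (label key/value are assumptions).
            - source_labels: [__meta_kubernetes_pod_label_app]
              regex: dcgm-exporter
              action: keep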

Re: [prometheus-users] Prometheus scrape targets regex

2021-05-01 Thread Ben Kochie
Please see the documented list of available service discovery methods: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config On Sat, May 1, 2021 at 9:57 AM nbada...@gmail.com wrote: > Hi Guys, > > I have process exporter installed on some of the nodes, and i
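Of those, file-based service discovery is a common fit for a plain host list like this; a minimal sketch, with the file path as an assumption:

  scrape_configs:
    - job_name: process
      file_sd_configs:
        - files:
            - /etc/prometheus/targets/process.yml   # hypothetical path
          refresh_interval: 5m

  # /etc/prometheus/targets/process.yml, maintained outside prometheus.yml:
  - targets:
      - host1:9256
      - host2:9256

New hosts then only require editing the targets file; Prometheus picks up changes to file_sd files without a reload.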

[prometheus-users] Usage of fail_if_body_not_matches_regexp and fail_if_body_matches_regexp.

2021-05-01 Thread yagyans...@gmail.com
Hi. I am using Blackbox Exporter version 0.18.0. I want to know which field will be considered if fail_if_body_not_matches_regexp and fail_if_body_matches_regexp contradict each other. For example: fail_if_body_not_matches_regexp: ['OK'] fail_if_body_matches_regexp: ['NOK']
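For context, the two fields sit side by side in the HTTP prober section of a module; a minimal module sketch, with the module name as an example:

  modules:
    http_body_check:                 # example module name
      prober: http
      timeout: 5s
      http:
        fail_if_body_not_matches_regexp:
          - "OK"
        fail_if_body_matches_regexp:
          - "NOK"

Both are independent checks applied to the same response body.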

[prometheus-users] Prometheus scrape targets regex

2021-05-01 Thread nbada...@gmail.com
Hi guys, I have process exporter installed on some of the nodes, and I have the below snippet set up in prometheus.yml: - job_name: 'process' static_configs: - targets: [host1:9256, host2:9256, host3:9256.host10:9256] The problem with the above setup is that every time I onboard process
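For readability, that snippet corresponds to roughly the following YAML; the hosts between host3 and host10 are elided in the original message and are not guessed here:

  - job_name: 'process'
    static_configs:
      - targets:
          - host1:9256
          - host2:9256
          - host3:9256
          # ...intermediate hosts elided in the original message
          - host10:9256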