[prometheus-users] Re: [prometheus-developers] Java Prometheus Exporter for Traffic Metrics

2023-10-20 Thread Matthias Rampke
Hi, I think this discussion is better suited for the -users mailing list, moving it there. Metric systems, like Prometheus, offer you a specific tradeoff: they allow you to count a *large number* of events by *limited dimensions*. Fundamentally, for each combination of dimensions, it tracks a
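
To make the tradeoff concrete, here is a rough sketch in Python (not Prometheus code; the metric name, label names, and events are made up for illustration). The point is that state is kept per combination of label values, not per event, so cost grows with the number of combinations.

```python
from collections import Counter

# Hypothetical events; a real service could produce millions of these.
events = [
    {"method": "GET", "status": "200"},
    {"method": "GET", "status": "200"},
    {"method": "POST", "status": "500"},
    {"method": "GET", "status": "404"},
    {"method": "GET", "status": "200"},
]

# One running count per combination of dimensions, not one record per event.
requests_total = Counter((e["method"], e["status"]) for e in events)

for (method, status), value in sorted(requests_total.items()):
    print(f'requests_total{{method="{method}",status="{status}"}} {value}')
```

Adding a dimension with many distinct values (a user ID, say) multiplies the number of combinations, which is where this model stops being cheap.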

Re: [prometheus-users] substring of label values from recording rules

2023-09-29 Thread Matthias Rampke
You can do this by first using `label_replace` to create a new(!) label with the path prefix. Something like count by(host) (label_replace(up, "host", "$1", "instance", "([^:]+)(:[0-9]+)")) > link
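
For readers who want to see what that expression does step by step, a small Python stand-in follows. It is only a sketch: the `label_replace` helper here is hypothetical, handles just the "$1" replacement used above, and the instance values are invented.

```python
import re
from collections import Counter

def label_replace(labels, dst, replacement, src, regex):
    """Simplified stand-in for PromQL's label_replace on a single label set."""
    # Prometheus anchors the regex, so it has to match the whole source value.
    m = re.fullmatch(regex, labels.get(src, ""))
    out = dict(labels)
    if m is not None:
        # Only "$1" (first capture group) is supported in this sketch.
        out[dst] = replacement.replace("$1", m.group(1))
    return out

# Hypothetical targets, two exporters on web-1 and one on db-1.
series = [
    {"instance": "web-1:9100"},
    {"instance": "web-1:9104"},
    {"instance": "db-1:9100"},
]

relabelled = [
    label_replace(s, "host", "$1", "instance", r"([^:]+)(:[0-9]+)") for s in series
]

# Equivalent of "count by(host)": number of series per value of the new label.
count_by_host = Counter(s["host"] for s in relabelled)
print(dict(count_by_host))
```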

Re: [prometheus-users] MySQL scraper for shared-instance database

2023-02-22 Thread Matthias Rampke
What per-database metrics are you interested in? How does MySQL itself make them available? The mysqld_exporter has collectors that are off by default, but that you can enable. Some include per-schema (database) and per-table statistics, but the data still comes from the server-wide

Re: [prometheus-users] kubeprometheus container login details to delete WAL

2023-02-22 Thread Matthias Rampke
You can't, your Prometheus is managed by Google (and in reality, it's probably not actually Prometheus behind the scenes). We do not have login details for Google's infrastructure. You will need to contact GCP support to resolve this issue. Best of luck, Matthias On Tue, 21 Feb 2023, 22:48

[prometheus-users] haproxy_exporter FINAL Release 0.15.0

2023-02-15 Thread Matthias Rampke
Hi all, I have rounded up all the pending changes, and made a final release of the standalone HAProxy exporter. As all supported versions of HAProxy have built-in support for Prometheus metrics, this exporter is no longer needed, so we are retiring it. Best, Matthias 0.15.0 / 2023-02-15

Re: [prometheus-users] Null value in alerts

2022-12-11 Thread Matthias Rampke
When you say "the value is missing", what condition exactly do you want to alert on? To detect that there is *no* metric matching your selector, you can use the absent(…) function. It returns 1 when … is nothing. It gets more complicated and difficult if you want to detect that a single series
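
The behaviour of absent(…) can be pictured with a few lines of Python. This is a sketch under simplifying assumptions: `select` is a hypothetical helper that only supports equality matchers, and the series are made up.

```python
def select(series, **matchers):
    """Keep the series whose labels equal every given matcher (equality only)."""
    return [s for s in series if all(s.get(k) == v for k, v in matchers.items())]

def absent(selected):
    """Mimic absent(): a single value 1 if nothing was selected, otherwise nothing."""
    return [] if selected else [1]

series = [{"__name__": "up", "job": "api"}]

print(absent(select(series, job="api")))    # selector matched something
print(absent(select(series, job="batch")))  # selector matched nothing
```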

Re: [prometheus-users] Metrics vs log level

2022-10-07 Thread Matthias Rampke
> Say, If we write a wrapper on top of prometheus java client API, its going to be messy You can make it relatively clean by creating (and incrementing) all the metrics, but only calling .register() on those that you want to expose in the given environment. Even more elaborately, you could have

[prometheus-users] Re: [prometheus-developers] Return specific value if label regex not match

2022-08-12 Thread Matthias Rampke
Hi, this mailing list is for development of Prometheus and related projects. Since your question is about usage, I'm moving the thread to the prometheus-users mailing list. To answer your question, in general a regular expression can have an unbounded number of matches, so Prometheus cannot

Re: [prometheus-users] Re: Does Prometheus recommend exposing 2M timeseries per scrape endpoint?

2022-06-18 Thread Matthias Rampke
One place where time series of this magnitude from a single target are unfortunately common is kube-state-metrics (KSM). On a large cluster, I see almost 1M metrics. Those are relatively cheap because they are nearly constant and compress well,

Re: [prometheus-users] implement smtp connection with blackbox exporter

2022-05-18 Thread Matthias Rampke
It's really just TCP with some scripted interactions. Look at the steps in your configuration to see exactly what they are. /MR On Wed, May 18, 2022, 14:11 nina guo wrote: > > still have 2 questions: > - which module can be used to implement Nagios check_ldap_startTLS? > - Is there a way to

Re: [prometheus-users] implement smtp connection with blackbox exporter

2022-05-18 Thread Matthias Rampke
Yes, that should work. If you have trouble, open(curl) the scrape URL but add =true to see what happens. /MR On Wed, May 18, 2022, 12:34 nina guo wrote: > Hi, > > Can I use the following mode to implement Nagios check_smtp? > > smtp_starttls: > > prober: tcp > > timeout: 5s > > tcp: > >

Re: [prometheus-users] Forced to use the Pushgateway as a workaround?

2022-04-17 Thread Matthias Rampke
If you can, deploy (a) Prometheus into the cluster itself. The easiest way to manage that is using the Prometheus operator, but if that is not possible, you can configure it directly using relabeling, as in this example[0]. This Prometheus can scrape the various targets. You have a few options

Re: [prometheus-users] Any common exporter for Cloud Monitoring, Cloudwatch and Azure Monitor

2022-03-24 Thread Matthias Rampke
rometheus recommended exporter for GCP? > > Thanks, > > On Thu, Mar 24, 2022 at 7:45 PM Matthias Rampke > wrote: > >> No – exporters are typically more focused, and you would run however many >> you need. There are multiple options for some of these (CloudWatch Exporter >&

Re: [prometheus-users] Any common exporter for Cloud Monitoring, Cloudwatch and Azure Monitor

2022-03-24 Thread Matthias Rampke
No – exporters are typically more focused, and you would run however many you need. There are multiple options for some of these (CloudWatch Exporter or Yet Another CloudWatch Exporter; I see multiple results when searching for "azure monitor exporter"), so this approach lets you mix and match.

Re: [prometheus-users] Best option for short-lived jobs instead pushgateway?

2022-03-12 Thread Matthias Rampke
Prometheus does not really deal in single points. Many queries won't work. You can record the finished crawl as an event, in a system of your choice that handles events (any database, or log aggregators). Or, if your crawlers live for a while, treat them as "long" running. Make them expose

Re: [prometheus-users] Messages are dropping because too many are queued in AlertManager

2022-01-13 Thread Matthias Rampke
ller=notifier.go:528 > component=notifier alertmanager=http://127.0.0.1:9093/api/v1/alerts > count=0 msg="Error sending alert" err="Post > http://127.0.0.1:9093/api/v1/alerts: context deadline exceeded" > level=error ts=2021-09-07T23:37:56.967Z caller=notifier.go:52

Re: [prometheus-users] Re: Prometheus mysqld_exporter failing with errors.

2022-01-13 Thread Matthias Rampke
This may be caused by the mysqld exporter invalidly conflating multiple variables. Can you please share what your MySQL version is, and what you get from SHOW GLOBAL VARIABLES LIKE 'validate_password%'; From a quick read, it seems that there is both a plugin, using configuration variables like

Re: [prometheus-users] Re: Difference between groups and list of rules

2022-01-13 Thread Matthias Rampke
Just to check my understanding: I think rules within one group are also evaluated in order, while groups (even at the same interval) can be evaluated concurrently. If you have multiple rules that build on top of one another (using the output of one rule in the next) put them in one group, with

Re: [prometheus-users] Prometheus crash due to OOM

2022-01-06 Thread Matthias Rampke
This is odd indeed. The only thing I can think of is that aside from the samples loaded into the query engine, a good amount of data may need to be paged in. The TSDB engine makes heavy use of mmap, so the actual data from disk is not accounted for as process memory. In some circumstances,

Re: [prometheus-users] Consumer lag using JMX exporter

2022-01-06 Thread Matthias Rampke
I don't think you can, Kafka itself is not tracking that information or exposing it over JMX. We use https://github.com/linkedin/Burrow for this. /MR On Wed, Jan 5, 2022, 18:56 Somnath Pandey wrote: > Hi Team, > > Could you please help to know the way of monitor Kafka consumer lag using > JMX

Re: [prometheus-users] Messages are dropping because too many are queued in AlertManager

2022-01-06 Thread Matthias Rampke
What is your webhook receiver? Are any of the resolve messages getting through? Are the requests succeeding? I think Alertmanager will retry failed webhooks, not sure for how long. This would keep them in the queue, leading to what you observe in Alertmanager. /MR On Thu, Jan 6, 2022, 07:14

Re: [prometheus-users] Question regarding Loadbalanced Alertmanager Clusters

2021-12-04 Thread Matthias Rampke
The technical reason for this admonition is in how the Prometheus-Alertmanager complex implements high availability notifications. The design goal is to send a notification in all possible circumstances, and *if possible* only send one. By spraying alerts to the list of all Alertmanager

Re: [prometheus-users] Alertmanager Gossip Cluster and Silences

2021-11-30 Thread Matthias Rampke
Alertmanager clustering works best if you send all alerts to all Alertmanagers – in effect, have a single, flat, global Alertmanager cluster where all members do the same work. In that sense, > Is it a good idea to [keep] only 2 Prom connected to each Alertmanager? In my experience, no. When

Re: [prometheus-users] duplicate metrics with prometheus+grafana and docker-compose

2021-11-26 Thread Matthias Rampke
Node exporter does not provide per container metrics, it looks at the whole node. This is sometimes difficult from inside a container, thus the warning. For metrics about containers, check out cAdvisor: https://github.com/google/cadvisor In Prometheus it is common to have metrics broken out by

Re: [prometheus-users] Re: Go client for parsing Alertmanager config file?

2021-11-23 Thread Matthias Rampke
The alertmanager's own structs are here: https://pkg.go.dev/github.com/prometheus/alertmanager/config but keep in mind that this isn't primarily intended as a library, so you may or may not be better off vendoring them. /MR On Tue, Nov 23, 2021 at 8:02 AM Brian Candler wrote: > It's just YAML,

Re: [prometheus-users] limiting permissions for the prometheus ClusterRole?

2021-11-08 Thread Matthias Rampke
I think it should work with just get/list/watch on pods. Try it and see what happens? /MR On Mon, Nov 8, 2021, 06:38 Victor Sudakov wrote: > Dear Colleagues, > > There is a good working example of RBAC setup in > >

Re: [prometheus-users] Send resolved without slack

2021-05-14 Thread Matthias Rampke
ons.description }}" >> >> I'm right or something wrong? What if I want to configure send resolved >> for all alerts? I have many recivers based on severity in recivers and i >> want only to see mail with "[RESOLVED]" in subject. >> >> I think there's

Re: [prometheus-users] Horizontal Pod Autoscaling using Nvidia GPU Metrics

2021-05-01 Thread Matthias Rampke
> avg by(node) (dcgm_gpu_utilization) > * on(node) group_right(pod, namespace) > max by(node, pod, namespace) (kube_pod_info{pod=~"cuda-test-.*"}) > > And it seems to give the right value: > {node="gke-test-hpa-gpu-nodes-0f879509-jrx0"} 4 > > Please let me

Re: [prometheus-users] Horizontal Pod Autoscaling using Nvidia GPU Metrics

2021-05-01 Thread Matthias Rampke
I realized that using the request metrics may not work because they can only be updated once a request is complete. Ideally you'd have a direct "is this pod occupied" 1/0 metric from each model pod, but I don't know if that's possible with the framework. For the GPU metrics, we need to match the

Re: [prometheus-users] How-To debug prometheus_rule_evaluation_failures_total? Prometheus is failing rule evaluations

2021-05-01 Thread Matthias Rampke
ers@googlegroups.com> wrote: > On 23.04.21 20:35, Matthias Rampke wrote: > > It seems like you are federating through an ingress or load balancer > > that balances over multiple Prometheus server replicas. Either federate > > > from each separately, or make sure that y

Re: [prometheus-users] How-To debug prometheus_rule_evaluation_failures_total? Prometheus is failing rule evaluations

2021-04-23 Thread Matthias Rampke
further and handles this situation out of the box. /MR On Fri, Apr 23, 2021, 06:08 'Evelyn Pereira Souza' via Prometheus Users < prometheus-users@googlegroups.com> wrote: > On 22.04.21 20:20, Matthias Rampke wrote: > > Your best starting point is the rules page of the Prometheu

Re: [prometheus-users] How-To debug prometheus_rule_evaluation_failures_total? Prometheus is failing rule evaluations

2021-04-22 Thread Matthias Rampke
Your best starting point is the rules page of the Prometheus UI (:9090/rules). It will show the error. You can also evaluate the rule expression yourself, using the UI, or maybe using PromLens to help debug expression issues. /MR On Thu, Apr 22, 2021, 19:06 'Evelyn Pereira Souza' via Prometheus

Re: [prometheus-users] PromQL Regex and the left bracket [ character

2021-04-22 Thread Matthias Rampke
Try using [[] (match any of the following characters: [): https://play.golang.org/p/ynvSW3lIHDY /MR On Thu, Apr 22, 2021 at 3:02 AM Patrick Mackey < patrick.mac...@deniedaccess.org> wrote: > Hi, all. > > I'm trying to match the left bracket character '[' with a regex and seem > to be hitting

Re: [prometheus-users] Send resolved without slack

2021-03-31 Thread Matthias Rampke
cret: '**' > auth_identity:'**' > html: '{{ template "email" .}}' > headers: > subject: "[CUK]{{ .CommonLabels.severity }} {{ > .CommonLabels.instance }} {{ .CommonLabels.alertname }} | {{ > .CommonAnnotations.description }}" > > - name:

Re: [prometheus-users] Send resolved without slack

2021-03-30 Thread Matthias Rampke
I *think* you can have one email config, and toggle whether to show [RESOLVED] as part of the template. Looking at the default templates, it seems like the way to do that is to check for the length of .Alerts.Firing and

Re: [prometheus-users] Timestamp of Duration Metric of a Periodic Task

2021-03-18 Thread Matthias Rampke
The statsd exporter works for this but has the downside that you are mapping through a different metric model. It's okay but can be annoying. There is also an aggregating pushgateway that may be useful if this is the route you want to go: https://github.com/weaveworks/prom-aggregation-gateway

Re: [prometheus-users] Probe_sucess=0 for more than 24hrs promql

2021-03-17 Thread Matthias Rampke
Try the query interactively (in the expression browser or Grafana's Explore). Strip out parts like the == 0 filter, and the max_over_time, to understand the shape of the data. You may have to vary how exactly you query. My main point is that you can use the fact that the metric value is a number,

Re: [prometheus-users] Timestamp of Duration Metric of a Periodic Task

2021-03-16 Thread Matthias Rampke
in memory. Ideally, you could get them from the long-running process that starts these individual runs. If that is not possible, the third party aggregating pushgateway may be useful to you. I hope this helps clarify how Prometheus sees the world! /MR On Wed, Mar 17, 2021, 00:06 Matthias Rampke

Re: [prometheus-users] Timestamp of Duration Metric of a Periodic Task

2021-03-16 Thread Matthias Rampke
There is a mismatch of models here. You are asking about plotting a set of (x,y) points; Prometheus fundamentally thinks in terms of continuous time series that happen to be sampled at the scrape interval. One way to resolve this is to consider the continuous time series of "how long did the last

Re: [prometheus-users] Node Exporter Metrics Documentation

2021-03-16 Thread Matthias Rampke
At the moment, no. Most metrics are passed through from the kernel; which ones exactly are available or what they mean depends on kernel semantics. I usually look at one node's /metrics endpoint to get an overview of what is available, along with the help texts. For more tricky cases (such as

Re: [prometheus-users] Probe_sucess=0 for more than 24hrs promql

2021-03-16 Thread Matthias Rampke
Use the fact that it's a number and you can take the maximum of all values in the last 24 hours: max_over_time(probe_success{job="abc"}[24h]) == 0 No for clause needed in this case. /MR On Tue, Mar 16, 2021, 20:59 Amit Das wrote: > Hi, > I would like to get alerts if my blackbox exporter
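
Why this works without a for clause can be seen in a tiny Python sketch (the sample values are invented; 1 means the probe succeeded, 0 that it failed): the maximum over the window is 0 only if every sample in it was 0.

```python
def max_over_time(samples):
    """Maximum of the sample values in the window, like PromQL's max_over_time."""
    return max(samples)

def should_alert(samples):
    # True only if every sample in the window was 0.
    return max_over_time(samples) == 0

# Hypothetical probe_success samples covering the last 24 hours.
always_down = [0, 0, 0, 0]
one_success = [0, 0, 1, 0]

print(should_alert(always_down))
print(should_alert(one_success))
```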

Re: [prometheus-users] Delete old data in prometheus and grafana.

2021-03-16 Thread Matthias Rampke
Foiled again by mobile Chrome, here is the link to the right section: https://prometheus.io/docs/prometheus/latest/querying/api/#delete-series On Tue, Mar 16, 2021, 06:59 Matthias Rampke wrote: > To delete specific sets of time series, enable administrative APIs and use > the Delete

Re: [prometheus-users] Delete old data in prometheus and grafana.

2021-03-16 Thread Matthias Rampke
To delete specific sets of time series, enable administrative APIs and use the DeleteSeries endpoint: https://prometheus.io/docs/prometheus/latest/querying/api/ To match all metrics from a given instance, use a matcher like {instance="xxx:1234"} without a metric name. /MR On Mon, Mar 15,
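
As a sketch of what such a request looks like, the Python below only builds the URL for the delete_series admin endpoint; it does not send anything. The base URL is an assumption (a local Prometheus with the administrative APIs enabled), and the matcher is the placeholder from above.

```python
from urllib.parse import urlencode

def delete_series_url(base_url, matcher):
    """Build the URL for the TSDB admin delete_series endpoint."""
    query = urlencode({"match[]": matcher})
    return f"{base_url}/api/v1/admin/tsdb/delete_series?{query}"

# Assumed base URL; the matcher has no metric name, so it covers all
# metrics from that instance.
url = delete_series_url("http://localhost:9090", '{instance="xxx:1234"}')
print(url)  # to be sent as an HTTP POST
```

Building the URL separately makes it easy to double-check the matcher before deleting anything.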

Re: [prometheus-users] Wal directory

2021-03-12 Thread Matthias Rampke
Try deleting /prometheus/wal altogether. You will lose some data but at this point the gap from Prometheus not starting is worse. Generally, Prometheus 2.12 is quite old, consider upgrading to the latest version – there have been lots of WAL related improvements since then. /MR On Wed, Mar 10,

Re: [prometheus-users] node exporter

2021-03-10 Thread Matthias Rampke
Node exporter has flags to filter out certain file systems and mountpoints, by default these include tmpfs. Check node_exporter --help for details, and set the flags according to your needs. /MR On Mon, Mar 8, 2021, 20:07 tarun@gmail.com wrote: > I've a file system > > Filesystem

Re: [prometheus-users] Alert for stopped containers

2021-03-10 Thread Matthias Rampke
The fundamental problem is how Prometheus can know which containers should be there. Considering your regex, there is an infinite number of containers that are "absent": 0dev-4, 1dev-4, … dev-4, …fjdhrhfksnhdev-4 etc. To solve this, you need a list of concretely expected containers somewhere.
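
In other words, "absent" only has a meaning relative to an inventory. A minimal Python illustration, with both sets invented for the example:

```python
# Hypothetical inventory of containers that are supposed to exist.
expected = {"dev-1", "dev-2", "dev-3", "dev-4"}

# Hypothetical names currently seen in the container metrics.
running = {"dev-1", "dev-2", "dev-3"}

# Only with the expected list can "missing" be computed at all.
missing = sorted(expected - running)
print(missing)
```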

Re: [prometheus-users] Calculate the percentage for process process exporter

2021-03-10 Thread Matthias Rampke
Ah, I see. You are on the right track, but you need to apply the "> bool 0" directly to the metric and only then take the average: avg_over_time( ( namedprocess_namegroup_num_procs{job=~"Appops_.*"} > bool 0 )[$__interval:]) or make a recording rule for
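
The order of operations matters here, which a short Python sketch can show (sample values are made up; each is the number of processes seen at one scrape). Converting each sample to 0/1 first and averaging afterwards gives the fraction of time the process group was up.

```python
def fraction_up(num_procs_samples):
    """Apply '> bool 0' per sample, then average, as in the query above."""
    as_bool = [1 if v > 0 else 0 for v in num_procs_samples]
    return sum(as_bool) / len(as_bool)

# Hypothetical samples over the interval.
samples = [3, 3, 0, 2, 0, 0, 1, 4]

# Averaging first and comparing afterwards only says "was it ever up".
naive = 1 if sum(samples) / len(samples) > 0 else 0

print(fraction_up(samples))
print(naive)
```

Multiply the first result by 100 to get a percentage.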

Re: [prometheus-users] Using groups in prometheus alerts

2021-03-09 Thread Matthias Rampke
No, this won't work – the | in a label value is not special, and there are no "multi valued labels" in the data model – the way to have a collection is to have multiple time series with different label values. /MR On Tue, Mar 9, 2021, 17:37 'Olaf K' via Prometheus Users <

Re: [prometheus-users] JMX Exporter converting real numbers into 64 bit floats

2021-03-08 Thread Matthias Rampke
No, the Prometheus exposition format supports only floats, so there is no way to force the client libraries to integers. If you are parsing it independently, you will need to accept floating point values. /MR On Mon, Mar 8, 2021, 12:05 Rip Jal wrote: > Checking on JMX exporter localhost end

Re: [prometheus-users] Re: prometheus - kubernetes - eureka discovery

2021-03-08 Thread Matthias Rampke
Prometheus works best when it is deployed in each cluster. For global visibility there are a few options besides federation (which is indeed limiting for this use case). If you don't need queries or graphs across clusters, Grafana with multiple data sources works well, especially if you give them

Re: [prometheus-users] How to get rid of context deadline exceeded without changing the scrape_timeout and scrape_interval

2021-03-06 Thread Matthias Rampke
Try removing the proxy, within a cluster Prometheus should be able to connect to the pods directly. Consider using Kubernetes SD to discover the daemonset pods dynamically. /MR On Sat, Mar 6, 2021, 10:02 suyog kulkarni wrote: > The current node-exporter is setup as daemon-set in kubernetes

Re: [prometheus-users] Re: Sending mail from via office365 with alternative from address

2021-03-06 Thread Matthias Rampke
The most thorough way to do this is to set the environment as an external label on your Prometheus servers. This will be passed to Alertmanager. If you are using one AM cluster for all environments, make sure to group_by: [ 'environment' ] in your routing configuration. Then each alert group (the

Re: [prometheus-users] Excluding labels from alertmanager email alerts

2021-03-06 Thread Matthias Rampke
The best way to exclude certain labels for certain alerts is to aggregate them away in the alert expression. Which aggregation (sum, avg, max, min etc) makes sense depends on the semantics of the metric. Yes, you can change the alert email template, there are examples of how to do that here:

Re: [prometheus-users] prometheus - kubernetes - eureka discovery

2021-03-06 Thread Matthias Rampke
Whether this is possible at all depends on your Kubernetes network setup. What are you using there? Can you ping and curl a pod by hand, using its IP, from the Prometheus box? If so, you can use relabeling to set __address__ based on __meta_eureka_app_instance_ip_addr and

Re: [prometheus-users] Re: node exporter picks up non existent devices

2021-03-06 Thread Matthias Rampke
Which metrics are being reported (which collector)? Keep in mind that some metrics apply to the physical device, so they are applicable even if the filesystem on that device is not mounted; the device might be used raw by some application. Conversely, some metrics that apply to physical devices

Re: [prometheus-users] Calculate the percentage for process process exporter

2021-03-06 Thread Matthias Rampke
I am not sure I understand what percentage you mean … can you explain more what you want to measure? /MR On Fri, Mar 5, 2021, 15:13 saipradeep bojja wrote: > Hi > We have installed the process exporter in Linux machines to monitor the > processes in Prometheus. We are monitoring the

Re: [prometheus-users] String attributenames are ignored and not exported

2021-03-06 Thread Matthias Rampke
The Prometheus data model only has float values (this makes the highly compact storage possible), so the exporter has no way to represent "Good", "Bad", or "lily-of-the-valley" without further help from you. Can you write a pair of mapping rules for the exporter that match on the possible

Re: [prometheus-users] How to get rid of context deadline exceeded without changing the scrape_timeout and scrape_interval

2021-03-06 Thread Matthias Rampke
How long does it take when you request the metrics directly from node exporter? Can you exclude one collector at a time to see which one is the slow one? Generally, node exporter doesn't "do" much other than collect metrics from the kernel; if it is slow, most likely something else is too.

Re: [prometheus-users] Using groups in prometheus alerts

2021-03-05 Thread Matthias Rampke
This approach is not directly possible in Prometheus itself. Our recommendation for this case is to use some form of external templating to reduce repetition in such cases. However, there is a way using recording rules. You can add a rule for each route (again, you may want to template that) with

[prometheus-users] Re: [prometheus-developers] Mount Point missing alarm

2021-03-05 Thread Matthias Rampke
Moving this to prometheus-users where it fits better. Try using the "unless" operator to compare against a metric that is present for all instances that should have this mountpoint. Assuming that is the case for all targets under this job: up{job="XX"} unless on(instance)

Re: [prometheus-users] Answer from prometheus contains real measurements?

2021-02-03 Thread Matthias Rampke
The query range API returns aligned and interpolated values. I believe you can get "raw" samples by using the instant query API and querying for a matrix like memstats_memory_usage_percent{job=~"67ing-at-home.*"}[10m] /MR On Wed, 3 Feb 2021, 17:44 Rodolphe Ghio, wrote: > Hello, > I have a
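
A sketch of the request this suggests, in Python: it builds an instant-query URL whose expression is a range selector. The base URL is an assumption and nothing is sent; the selector is the one quoted above.

```python
from urllib.parse import urlencode

def raw_samples_url(base_url, selector, window):
    """Instant query (/api/v1/query) for a range selector, i.e. a matrix result."""
    expr = f"{selector}[{window}]"
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"

# Assumed local Prometheus.
url = raw_samples_url(
    "http://localhost:9090",
    'memstats_memory_usage_percent{job=~"67ing-at-home.*"}',
    "10m",
)
print(url)
```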

Re: [prometheus-users] Re: Alert on no alerts?

2021-02-03 Thread Matthias Rampke
This time, it was Prometheus that was misconfigured. In our case, it was an iptables rule that prevented Alertmanager from reaching PagerDuty. By putting the final check all the way after PagerDuty, we know when anything in the chain stops working, not just the thing that broke last time. /MR On

Re: [prometheus-users] PromQL prolong alert 30 minutes

2021-01-28 Thread Matthias Rampke
Try `min_over_time` on a subquery: min_over_time(( some_traffic_metric / some_traffic_metrics offset 15m)[30m:]) < 0.8 /MR On Thu, Jan 28, 2021 at 7:37 AM fiala...@gmail.com wrote: > Hi, > > I have a simple rule for monitoring traffic drop (by 20% in last 15 > minutes): > >
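
To see why the subquery keeps the alert active for 30 minutes, consider this Python sketch. The ratios are invented and stand for the traffic ratio evaluated at successive steps of the last 30 minutes: the alert condition holds as long as any step in the window was below the threshold.

```python
def min_over_time(samples):
    """Minimum of the sample values in the window, like PromQL's min_over_time."""
    return min(samples)

# Hypothetical ratio (traffic now / traffic 15m ago), oldest to newest.
ratios = [1.0, 0.7, 0.95, 1.0]

instant_alert = ratios[-1] < 0.8              # looks at the latest value only
prolonged_alert = min_over_time(ratios) < 0.8  # remembers the dip for the window

print(instant_alert, prolonged_alert)
```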

Re: [prometheus-users] Re: can not get {{ $labels.instance }} value

2021-01-28 Thread Matthias Rampke
Your expression aggregates away all labels, including instance. As written, this alert would only fire if exactly one instance is down, not if two are down. Try removing the count altogether: up{group="tty",job="test101"} == 0 will tell you per-instance when they cannot be scraped. The

Re: [prometheus-users] Reloading DNS IPs in prometheus

2021-01-17 Thread Matthias Rampke
What is the TTL on these records? Also consider using DNS SD, it gives you slightly more control over the refresh cycle. /MR On Fri, Jan 15, 2021, 14:33 eswar yaga wrote: > > Hi, > > My configuration uses a DNS like this > > *- job_name: 'app-dev'* > *metrics_path: '/federate'* > *params:* >

Re: [prometheus-users] Tracking latency

2021-01-12 Thread Matthias Rampke
If you keep adding up the time the function has taken across all invocations, that is also a counter. You can get the average run time over, say, 5 minutes by tracking rate(function_run_time_sum[5m]) / rate(function_run_time_count[5m]) Using native Prometheus clients, you get both if you use
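
The arithmetic behind that expression, as a simplified Python sketch: the `rate` helper here just takes (last - first) / window and ignores counter resets and extrapolation, and the counter values are invented.

```python
def rate(samples, window_seconds):
    """Simplified per-second rate: (last - first) / window."""
    return (samples[-1] - samples[0]) / window_seconds

window = 300  # 5 minutes

# Hypothetical counter values at the start and end of the window.
function_run_time_sum = [120.0, 150.0]  # cumulative seconds spent in the function
function_run_time_count = [400, 500]    # cumulative number of invocations

average_run_time = rate(function_run_time_sum, window) / rate(
    function_run_time_count, window
)
print(average_run_time)  # seconds per invocation over the window
```

The window length cancels out in the division, so the result is simply 30 extra seconds spread over 100 extra invocations.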

Re: [prometheus-users] What is Rules "Evaluation Time" and "Last Evaluation"? Do I need to invest time into query tuning?

2021-01-06 Thread Matthias Rampke
No, these are still very very fast. You only need to worry when it gets into the range of several seconds. Prometheus evaluates rules on a timer (typically every 10s to 1m, depending on your configuration), the "last evaluation time" is how long ago that last happened. If a rule group is still

Re: [prometheus-users] Too many open files and Promethues reboot is dead slow.

2021-01-02 Thread Matthias Rampke
1. it is likely that writing out blocks did not work because of issue 2. so after you resolve that, the long WAL replay should only happen one time. 2. How did you raise the limit? I often find that systemd has applied a setting that I did not expect. In the systemd unit, what is LimitNOFILE?

Re: [prometheus-users] permission denied

2021-01-02 Thread Matthias Rampke
What are the permissions and ownership on /prometheus itself? I think the problem here is creating the file, which in the Unix permission model is an operation on the directory. /MR On Thu, Dec 24, 2020, 06:15 Mohan Nagandlla wrote: > Hi team > I have persisted the prometheus on last two

Re: [prometheus-users] Create silences based on matchers with OR condition

2021-01-02 Thread Matthias Rampke
There is no way to express this in one silence, but you can get the same effect by creating two separate silences. /MR On Mon, Dec 28, 2020, 13:24 ajai kumar wrote: > Hi Team, > > Currently to silence any alert, the matcher accepts labels with AND > condition (array of conditions - sample JSON

Re: [prometheus-users] Unknown series references

2020-11-26 Thread Matthias Rampke
Prometheus 2.15 is rather old, have you considered upgrading to the latest version? There have been many fixes around the WAL and crash handling since then. /MR On Thu, Nov 26, 2020 at 2:23 AM alexb...@gmail.com wrote: > Prometheus version: v2.15.2 > Problem:the size of prometheus's wal is

Re: [prometheus-users] prometheus statefulset with thanos stop working after two days

2020-11-26 Thread Matthias Rampke
You are using a single PersistentVolumeClaim rather than letting Kubernetes create one per instance of the statefulset. Even though it is of mode ReadWriteOnce, I believe it is being bounced around between the different pods and that is causing problems. Rather than creating the PVC yourself, use

Re: [prometheus-users] Alert goes to Firing --> Resolved --> Firing immediately.

2020-11-25 Thread Matthias Rampke
This could be many things, likely it has to do with the formulation of the alert. What does it look like in Prometheus? Specifically - the ALERTS metric shows what is pending or firing over time - evaluate the alert expression in Prometheus for the given time period. Are there gaps or does e.g. a

Re: [prometheus-users] Re: Keeping specific systemd name labels

2020-11-20 Thread Matthias Rampke
Go regexes unfortunately do not support negative matches. I believe you can get the behavior that you need by having a temporary label, and changing it based on the various conditions: 1. you want to keep all metrics by default (action replace, target __tmp_keep, replacement "yes") 2. you want to

Re: [prometheus-users] Question about "rate" & Performance in Prometheus

2020-11-19 Thread Matthias Rampke
What is "rate_1m" in this case? Is this a metric that has rate in the name, or are you asking about the rate query function? Ingestion is mostly independent from query evaluation. It is in principle possible to overload the

Re: [prometheus-users] Node exporter||Need help reading non-root FS mount point metrics

2020-11-18 Thread Matthias Rampke
; wrote: >> >>> Would you please suggest how can I run node exporter under root user? >>> >>> >>> Regards, >>> Nagarjuna >>> >>> >>> On Wed, Nov 18, 2020 at 2:28 PM Matthias Rampke >>> wrote: >>> >>>

Re: [prometheus-users] Error on ingesting samples with different value but same timestamp

2020-11-18 Thread Matthias Rampke
Actually, I may have misunderstood – do you want to scrape the proxy itself, or something behind it? In the former case, and assuming there is no further load balancing involved, you can use a service discovery mechanism that is appropriate in your environment. In a static config you directly

Re: [prometheus-users] Error on ingesting samples with different value but same timestamp

2020-11-18 Thread Matthias Rampke
Is this proxying to more than one backend? And in dev, there is only one? /MR On Tue, Nov 17, 2020 at 2:07 PM bruno bourdolle wrote: > hi, > I localise the part of the conf that generate the error, but I'm not sure > to understand how to correct it. What is strange is that the same conf on >

Re: [prometheus-users] Node exporter||Need help reading non-root FS mount point metrics

2020-11-18 Thread Matthias Rampke
The directory permissions *only* grant access to the user, not the group, so adding the exporter to the group won't help. Given PostgreSQL is this strict about permissions, you will have to run the exporter either as root or as the postgres user :/ The other metrics are collected in different

[prometheus-users] Re: instant and range query duration

2020-11-18 Thread Matthias Rampke
On Tue, Nov 17, 2020 at 4:16 PM kiran wrote: > Thank you, Matthias. Please see my comments below. > > On Monday, November 16, 2020, Matthias Rampke > wrote: > >> 1. Now >> > I would like to understand the definition of ‘Now’. Is it more recent one > wit

Re: [prometheus-users] Error on ingesting samples with different value but same timestamp

2020-11-16 Thread Matthias Rampke
This can happen in a few ways: 1. whatever exports the metrics, does so with a timestamp, but actually changes the value on you between scrapes without updating the timestamp. This is relatively unlikely unless this is something very specialized. 1.1 or it actually exposes the same metric twice

Re: [prometheus-users] Node exporter||Need help reading non-root FS mount point metrics

2020-11-16 Thread Matthias Rampke
Looking at the statfs man page, it suggests that this error would happen if it is lacking "search" permission (the execute bit) on any of the enclosing directories. I would recommend giving +rx on all directories in that path (/pg-data, /pg-data/postgresql, /pg-data/postgresql/12 and so on) –
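
A quick way to find the directory that blocks access is to walk the path from the root down, as in this Python sketch. Note the assumption: `os.access` checks permissions for the user running the script, so it would have to be run as the same user as the exporter to be meaningful.

```python
import os
from pathlib import PurePosixPath

def enclosing_dirs(path):
    """All directories from / down to and including `path`."""
    p = PurePosixPath(path)
    return [str(a) for a in list(p.parents)[::-1]] + [str(p)]

def dirs_without_rx(path):
    """Directories on the way to `path` that the current user cannot read and search."""
    return [d for d in enclosing_dirs(path) if not os.access(d, os.R_OK | os.X_OK)]

print(enclosing_dirs("/pg-data/postgresql/12"))
print(dirs_without_rx("/pg-data/postgresql/12"))
```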

Re: [prometheus-users] Alert Manager notification by label

2020-11-16 Thread Matthias Rampke
I think this is a misunderstanding about inhibit rules, they will not help with your use case. Use an additional route to match the environment and send dev/qa to a different receiver (which may not have any configs, so go nowhere). Inhibit rules specify how one *alert* can inhibit (~silence)

Re: [prometheus-users] Need help figuring out where(or how) to find logs

2020-11-16 Thread Matthias Rampke
Kubernetes collects and handles the logs already. Can Splunk read the logs directly from Kubernetes, e.g. via a daemonset? With the redirect, something may be going wrong with redirecting stdout *and* stderr? I cannot coherently say why, but I found that command > logfile 2>&1 works in that

Re: [prometheus-users] instant and range query duration

2020-11-16 Thread Matthias Rampke
1. Now 2. Uh, not sure, probably better to specify one 3. Probably not, end might default to "now" implicitly? 4. They are different – there are two different types involved here. `http_request_total` returns an instant vector: a 1-dimensional list of the value at that instant for each label

Re: [prometheus-users] Prometheus query to count unique label values

2020-11-11 Thread Matthias Rampke
You can directly use the count function for this: count by(customer) (customer_alerts) /MR On Wed, Nov 11, 2020, 17:41 Shilpa Akhilesh wrote: > Hi, > am looking for a way to count the unique label values and show each > count for each customer name. Here are my metrics. I need to get the
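As a rough illustration of what `count by(customer) (customer_alerts)` computes, the Python sketch below groups a few made-up series by their `customer` label and counts the series in each group. The label sets are hypothetical and `count_by` is an illustrative helper, not a Prometheus API.

```python
from collections import Counter

# Hypothetical label sets, one per series of the customer_alerts metric.
series = [
    {"customer": "acme", "alertname": "DiskFull"},
    {"customer": "acme", "alertname": "HighLoad"},
    {"customer": "globex", "alertname": "DiskFull"},
]


def count_by(label, series):
    """Count how many series share each value of the given label.

    Series without the label are grouped under the empty string,
    mirroring how PromQL treats a missing label as empty.
    """
    return dict(Counter(s.get(label, "") for s in series))


counts = count_by("customer", series)
print(counts)
```

Each key in the result corresponds to one output series of the PromQL query, and the value is the number of input series that carried that customer name.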

Re: [prometheus-users] Best way to present a pushgateway metric in Grafana

2020-10-13 Thread Matthias Rampke
These visualizations often default to showing the average over the dashboard's time period. This looks a bit different depending on the panel, but under the panel options, look for an option to show the "Current" or "Last" value. /MR On Tue, Oct 13, 2020 at 8:40 AM Tamar wrote: > Hi, > > I am

Re: [prometheus-users] Re: Global Labels in Alerts

2020-09-09 Thread Matthias Rampke
As Brian said, use alert relabel configs, which are global, or set the datacenter as an external label (`external_labels` in the configuration file). /MR On Wed, Sep 9, 2020 at 12:10 PM goel@gmail.com wrote: > In the below code the highlighted part is hard coded. I am not getting >

Re: [prometheus-users] Two different thresholds with condition in one alert rule

2020-09-09 Thread Matthias Rampke
Define another alert rule. It is okay to give it the same name, as long as the labels are definitely distinct. /MR On Wed, Sep 9, 2020 at 10:53 AM neel patel wrote: > Hi Team, > > I have defined below the alert rule which says if disk usage exceeds 80%, > it should raise the critical alert.

[prometheus-users] Kubernetes HPA based on multiple Prometheus servers

2020-09-09 Thread 'Matthias Rampke' via Prometheus Users
Hello, we would like to scale Kubernetes deployments based on custom metrics from namespace-local Prometheus servers, in other words, multiple (tens) Prometheus servers per cluster. Has this been done? As far as I can tell, the custom metrics API

Re: [prometheus-users] mysqld_exporter causing slow running queries

2020-09-07 Thread Matthias Rampke
>> ifnull(ROW_FORMAT, 'NONE') as ROW_FORMAT, >> ifnull(TABLE_ROWS, '0') as TABLE_ROWS, >> ifnull(DATA_LENGTH, '0') as DATA_LENGTH, >> ifnull(INDEX_LENGTH, '0') as INDEX_LENGTH, >> ifnull(DATA_FREE, '0') as DATA_FREE, >> ifnull(CREATE_OPTIONS, 'NONE') as CREATE_OPTIONS >

Re: [prometheus-users] mysqld_exporter causing slow running queries

2020-09-04 Thread Matthias Rampke
Are these queries actually slow (long run time) or are they notifying about full table scans? What is your slow query log related MySQL configuration? The latter happens e.g. when you have many connections open (long `information_schema.processlist`). These MySQL-internal tables are in-memory or

Re: [prometheus-users] mysqld_exporter error after upgrading mysql to 5. 7

2020-09-04 Thread Matthias Rampke
The exporter collects multiple pieces of information in parallel, and thus (I think) opens multiple connections. I don't know if there is an upper bound. It appears that your MySQL setup is limiting how many connections are allowed per user? /MR On Thu, Sep 3, 2020 at 11:06 AM vijay kumar

Re: [prometheus-users] any best practice on using limited le's for a given histogram

2020-09-03 Thread Matthias Rampke
For a (conservative) guide see this article. You have the right intuition – ~10 buckets is a good number. You *can* go higher if you use high-powered machines (mostly: lots of RAM) for Prometheus, but you will run into increasing problems as

Re: [prometheus-users] Grafana keeps logging 1 lin eper minute i dont understand.

2020-09-03 Thread Matthias Rampke
What is 10.251.100.81 in your environment? This looks like some health checking? /MR On Thu, Sep 3, 2020 at 5:11 AM Big dong wrote: > not sure what kind of action you have done, status 302 means resource > moved. > What kind of query are you doing? > > On Thu, Sep 3, 2020 at 12:51 PM Danny de

Re: [prometheus-users] Pod level cpu usage metrics

2020-09-01 Thread Matthias Rampke
CPU usage counters represent the CPU time spent in seconds. You can get the CPU usage (time spent in seconds, per second) by using the rate function: rate(container_cpu_usage_seconds_total[1m]) /MR On Tue, Sep 1, 2020 at 11:32 AM Manish G wrote: > Hi All, > > I want to display pod cpu usage
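To make the arithmetic behind `rate(container_cpu_usage_seconds_total[1m])` concrete, here is a simplified Python sketch that turns cumulative CPU-seconds samples into a per-second usage figure. It is an approximation under stated assumptions: it handles counter resets but does not extrapolate to the window boundaries as PromQL's `rate()` does, and `simple_rate` and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplified: handles counter resets, but does not extrapolate to the
    window boundaries the way PromQL's rate() does.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter was reset; count from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical samples within a 1m window: (unix seconds, cumulative CPU seconds).
samples = [(0, 10.0), (15, 17.5), (30, 25.0), (45, 32.5), (60, 40.0)]
cpu_cores_used = simple_rate(samples)
print(cpu_cores_used)
```

A result of 0.5 here would mean the container used half a CPU core on average over the window, since the counter grew by 30 CPU-seconds in 60 seconds of wall time.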

Re: [prometheus-users] kubernetes SD config for Ingressroute Traefik

2020-08-08 Thread Matthias Rampke
Can you explain more what "metrics exposed by ingressroute" looks like? This only configures traefik; are the metrics you are looking for exposed by traefik or by the service that the ingressroute refers to? /MR On Fri, Aug 7, 2020, 09:27 Ishvar B wrote: > Hi, > > I am trying to use auto

Re: [prometheus-users] help with "expected timestamp or new record, got \"MNAME\"

2020-08-08 Thread Matthias Rampke
I think Prometheus is expecting the TYPE and HELP lines that are part of the exposition format. Which client library are you using? /MR On Fri, Aug 7, 2020, 18:48 Bob wrote: > hello - I am struggling to resolve a problem with a new prometheus custom > scrape target. The following scrape

Re: [prometheus-users] How to add new label for particular vm's

2020-07-02 Thread Matthias Rampke
Relabel configs apply in the service discovery stage, once for each target – it looks like you have one instance, so you probably need `metric_relabel_configs`. Those are applied to the metric after scraping. /MR On Thu, Jul 2, 2020, 12:00 'Владимир organ2' via Prometheus Users <

Re: [prometheus-users] Alertmanager

2020-07-02 Thread Matthias Rampke
What command are you using to check the configuration? /MR On Thu, Jul 2, 2020 at 7:04 AM Bhupendra kumar wrote: > Hi All, > > I am facing this error but I don't know about this error so please help me > out. > > Checking /etc/prometheus/alertmanager.yml > FAILED: parsing YAML file
