Re: [prometheus-users] Wildcard in PromQL "5.+" vs "5\\d{2}"

2023-10-03 Thread Stuart Clark

On 2023-10-03 09:09, 'Jason' via Prometheus Users wrote:

Hi

I would write my query like this (with a wildcard):

sum(http_requests_total{status_code=~"5.+"})

On the internet I found this syntax: \\d{2}

sum(http_requests_total{status_code=~"5\\d{2}"})

What is this? Where can I find more info?
Why should I use the 2nd query and not the first?


In reality both will do the same thing, although the second is 
technically more correct.


The first regular expression matches "5" followed by 1 or more other 
characters, while the second matches "5" followed by exactly 2 digits. 
So the first one would also match "50" or "5frogs", which aren't valid 
status codes, but in reality your application would have to be seriously 
misbehaving to be setting those values anyway.
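For illustration (Prometheus label matchers use RE2 syntax and are fully 
anchored, so the expression must match the whole label value; \\d is the 
RE2 escape for a digit, and {2} means exactly two repetitions):

```
sum(http_requests_total{status_code=~"5.+"})      # matches "500", but also "50" or "5frogs"
sum(http_requests_total{status_code=~"5\\d{2}"})  # matches exactly "5" plus two digits: "500"-"599"
```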


--
Stuart Clark



Re: [prometheus-users] Not able to add remote_write in Prometheus config for k8s monitoring

2023-09-18 Thread Stuart Clark

On 2023-09-18 07:20, Prashant Singh wrote:

Hello,

I am not able to add remote_write details in the k8s Prometheus config file.
kubectl version: 1.27

error :
Error from server (BadRequest): error when creating "config-map.yaml":
ConfigMap in version "v1" cannot be handled as a ConfigMap: strict
decoding error: unknown field "remote_write"

prometheus config file:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
name: prometheus-server-conf
  namespace: monitoring

remote_write:
   - url: "http://x.x.x.x:31000/api/prom/push"



Not really a Prometheus thing, but that isn't a valid ConfigMap. It is 
expecting top-level fields called "metadata" and "data", with a 
filename within the data section that then contains whatever data you 
are wanting (which for Prometheus is a YAML config file).
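As a rough sketch, the ConfigMap from the question would need to look 
something like this (the prometheus.yml file name is an assumption - it 
must match whatever file the Prometheus container is configured to load):

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  namespace: monitoring
  labels:
    name: prometheus-server-conf
data:
  prometheus.yml: |
    remote_write:
      - url: "http://x.x.x.x:31000/api/prom/push"
```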


--
Stuart Clark



Re: [prometheus-users] Re: Prometheus HA different metrics

2023-09-05 Thread Stuart Clark

On 2023-09-05 14:26, Анастасия Зель wrote:

Yeah, I think scraping manually would be useful, but remember that these
are k8s pods :)
I only have the pod IP, and I can't reach it from the Prometheus node
because they are in different subnets. The pods' subnet doesn't have
access to the outside network.
So I don't know how I can manually scrape a particular pod target from
the Prometheus server.



That would explain why it isn't working. You need to have network 
connectivity to all of your scrape targets from the Prometheus server. 
So if you have configured Prometheus to scrape every pod (via the 
Kubernetes SD for example) the Prometheus server will either need to be 
inside the cluster or connected to the same network mechanism as the 
pods.


--
Stuart Clark



Re: [prometheus-users] Re: Prometheus file format

2023-08-08 Thread Stuart Clark

On 08/08/2023 20:31, Matt Doughty wrote:

So you are trying to get discrete metrics for every run of the batch
job? That sounds like an unbounded cardinality problem, as you would
end up with a timeseries for every run of the batch job.
Am I misunderstanding or is this accurate?


You're right, I don't need the exact time when the metric is fetched. I only 
need it to differentiate between iterations within the batch job. Then is 
creating a separate metric the best way to go?

If that is the case then Prometheus isn't the right tool. Having 
distinctly detectable groups of data for a particular job run indicates 
you are talking about events which are quite different to metrics. For 
events you'd want to be looking at tools such as Elasticsearch, Loki or 
a standard SQL database.


Events and metrics can be (and often are) used in parallel. For example, 
Prometheus would tell you that the average job runtime is 5 minutes over 
the past 3 hours, but you'd then use the events system to find the exact 
durations for each run (or the number of events processed, or the error 
message returned, etc.).


--
Stuart Clark



Re: [prometheus-users] Re: Prometheus file format

2023-08-07 Thread Stuart Clark

On 07/08/2023 21:00, Moe wrote:

Thanks Brian that was really helpful,

2. The use case I want this for doesn't need continuous ingestion. In 
that case, is there a way for me to add a timestamp to MetricFamilySample?


That isn't how Prometheus works. It will scrape that metric somewhere 
between every 10 seconds and every 2 minutes. If you need to know when a 
set of metrics was created, a common pattern is to include a metric 
whose value is that timestamp.
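For example, alongside the other metrics the endpoint could expose 
something like this (the metric name and value are illustrative), in the 
text exposition format:

```
# HELP batch_last_run_timestamp_seconds Unix time the batch data was generated.
# TYPE batch_last_run_timestamp_seconds gauge
batch_last_run_timestamp_seconds 1.691404539e+09
```

In PromQL the age of the data is then simply 
time() - batch_last_run_timestamp_seconds.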


--
Stuart Clark



Re: [prometheus-users] Re: Unusual traffic in prometheus nodes.

2023-07-27 Thread Stuart Clark

On 27/07/2023 15:51, Uvais Ibrahim wrote:

Hi Brian,

This is the query that I have used.

sum(scrape_samples_scraped)without(app,app_kubernetes_io_managed_by,clusterName,release,environment,instance,job,k8s_cluster,kubernetes_name,kubernetes_namespace,ou,app_kubernetes_io_component,app_kubernetes_io_name,app_kubernetes_io_version,kustomize_toolkit_fluxcd_io_name,kustomize_toolkit_fluxcd_io_namespace,application,name,role,app_kubernetes_io_instance,app_kubernetes_io_part_of,control_plane,beta_kubernetes_io_arch,beta_kubernetes_io_instance_type, 
beta_kubernetes_io_os, failure_domain_beta_kubernetes_io_region, 
failure_domain_beta_kubernetes_io_zone,kubernetes_io_arch, 
kubernetes_io_hostname, kubernetes_io_os, 
node_kubernetes_io_instance_type, nodegroup, 
topology_kubernetes_io_region, 
topology_kubernetes_io_zone,chart,heritage,revised,transit,component,namespace, 
pod_name, pod_template_hash, security_istio_io_tlsMode, 
service_istio_io_canonical_name, 
service_istio_io_canonical_revision,k8s_app,kubernetes_io_cluster_service,kubernetes_io_name,route_reflector)


This simply excludes every label, but I am still getting a result like 
this:


{}  7525871918

I'm not sure what you are expecting, as that sounds about right. The 
query is adding together all the different variants of the 
scrape_samples_scraped metric (removing all the different labels), so if 
that is indeed a list of every label the query is going to return a 
single value without any associated labels.


Instead you want to graph the raw scrape_samples_scraped metric (no sum 
or without) and see how it varies over time - see the query sketch after 
the list below. Is there a particular job or target which shows a huge 
increase in the graph, or new series appearing? As to why that might 
happen there could be many different reasons, but ideas include:


* new version of software which increases number of exposed metrics (or 
more granular labels)
* bug in software where a label is set to something with high 
cardinality (e.g. there is a "path" label from a web app, which means a 
potentially infinite cardinality, and you could have had a web scan 
producing millions of combinations)
* lots of changes to the targets, such as new instances of software or 
high churn of applications restarting
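As a starting point, something like this (a sketch - adjust the number 
and the grouping label to your environment) will surface the targets or 
jobs producing the most samples:

```
topk(10, scrape_samples_scraped)

sum by (job) (scrape_samples_scraped)
```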


--
Stuart Clark



Re: [prometheus-users] Tracking Metrics Per Session/Request

2023-07-03 Thread Stuart Clark

On 03/07/2023 19:59, KW wrote:

Hello,

I work with an ASP.NET Core API that uses sessions to store the user's 
state. The API calls out to many other different microservices to run 
logic based on the user's request, we update and save the state on the 
server, and return the relevant information in the response. I'm 
looking to be able to troubleshoot performance issues by looking at 
metrics for an entire session. I roughly understand how I can add 
timing metrics, but what is the mechanism I'd use to differentiate 
sessions, and even similar requests within the session? For example, 
I'd like to know that the third identical request that was made on a 
given session took 43 seconds while the others were only 2 seconds. If 
I use a label for session id, I won't see the individual request 
timings... would I need to create an additional "request id" label 
that I can use? Perhaps pull the Activity.Current trace id that is 
used for exemplars? Not sure what the best practices are here.


This sounds like you might actually want traces rather than just 
metrics, which can be added using Tempo or other tools.


--
Stuart Clark



Re: [prometheus-users] Prometheus counter reset

2023-05-25 Thread Stuart Clark

On 25/05/2023 15:59, Yogita Bhardwaj wrote:

How can I apply rate or increase while using the HTTP API?

You just need to include it in the query you are sending to the Prometheus API.
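For example (the metric name is illustrative; the PromQL expression is 
simply the value of the query parameter, URL-encoded as needed, and <t0> 
and <t1> are placeholder Unix timestamps):

```
# Instant query:
GET /api/v1/query?query=rate(http_requests_total[5m])

# Range query, evaluated at 15s resolution between two timestamps:
GET /api/v1/query_range?query=rate(http_requests_total[5m])&start=<t0>&end=<t1>&step=15
```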

--
Stuart Clark



Re: [prometheus-users] Prometheus counter reset

2023-05-25 Thread Stuart Clark

On 25/05/2023 06:56, Yogita Bhardwaj wrote:
I am using Prometheus counters in my project, but the counter value 
resets when the service restarts. How can I prevent counter resets while 
fetching data from Prometheus using the HTTP API?


It is expected that counters will reset when a service restarts, so you 
don't need to do anything. Prometheus handles counter resets 
automatically when you are using functions like rate().
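For example, both of these are reset-aware (the metric name is illustrative):

```
rate(http_requests_total[5m])      # per-second rate over the last 5 minutes
increase(http_requests_total[1h])  # total increase over the last hour, across any resets
```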


--
Stuart Clark



Re: [prometheus-users] Blackbox exporter http module check interval - how to change default 15s

2023-05-18 Thread Stuart Clark

On 2023-05-18 11:27, Paweł Błażejewski wrote:

Hello,

The Blackbox exporter HTTP module checks an HTTPS site every 15s by default.
Can you please tell me, is it possible to change this interval to 1
minute? I added the scrape_interval parameter in the Prometheus config as
you can see below, but it doesn't change anything. Samples are still every 15s.

Can you please tell me how and where to change it?
I use Prometheus 2.37.7, blackbox_exporter 0.23.0

prometheus config:

- job_name: 'blackbox-http-csci-prod'
  scrape_interval: 60s
  scrape_timeout: 50s
  metrics_path: /blackbox/probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://ci-jenkins***/login?from=%2F
        - https://ci-jenkins-***l/login?from=%2F
        - https://ci-**/login?from=%2F
        ...
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__

Samples still every 15s.

2023-05-15 09:45:00
0.0624
2023-05-15 09:45:15



How are you producing that data?

Did you remember to reload/restart Prometheus after making the change to 
your config?


--
Stuart Clark



Re: [prometheus-users] Sum up request duration correctly

2023-04-30 Thread Stuart Clark

On 25/04/2023 10:50, ofir y wrote:
I have a Prometheus server which scrapes data from my API metrics 
endpoint, which is populated using the Prometheus.net library. The scrape 
interval is set to 15 seconds. I'm publishing a request duration summary 
metric to it. This metric is published at random times to the 
endpoint, but the scrape interval makes Prometheus think it is a new 
value every 15 seconds, even if no new data was published. This causes 
the _count & _sum values of the metric to be wrong, as they consider 
every 15 seconds to be a new point.


My goal is to be able to count & sum up all request actions. So if 
I had 3 requests over a period of 2 minutes like so:

00:00 request 1: duration 1 sec
00:30 request 2: duration 1 sec
01:55 request 3: duration 2 sec

the _count will be 3, and the _sum will be 4 seconds. Can I achieve 
this somehow by using labels or something else?


It sounds like you are trying to use Prometheus to store events, which 
won't work as Prometheus is a metric system.


Normally what you would expose from your application are counters giving 
the total number of the event being monitored as well as the total 
duration of all of that event.


Once scraped you can then show things like the number of events over a 
given period of time, as well as the average durations of those events 
over that period. What you cannot do with a metric system is know 
anything specific about an individual event. To do that you need an 
event system, such as Loki, Elasticsearch or a SQL database.


--
Stuart Clark



Re: [prometheus-users] Re: Do we need 2 config.yml for fetching metrics from 2 separate regions ?

2023-04-30 Thread Stuart Clark

On 21/04/2023 13:53, Arunprasadh PM wrote:

Hi,

We are facing the same issue. Do we have a solution to manage multiple 
regions?
One is global/us-east-1 for CloudFront, and the other region is for 
services like RDS.

We are having trouble configuring both in a single config file.


You would need to be running multiple instances of the exporter.
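For example, you could run one CloudWatch exporter configured for 
us-east-1 and another for the second region, and scrape each as its own 
job (the hostnames are illustrative; 9106 is the exporter's usual 
default port):

```
scrape_configs:
  - job_name: cloudwatch-us-east-1
    static_configs:
      - targets: ['cloudwatch-exporter-us-east-1:9106']
  - job_name: cloudwatch-eu-west-1
    static_configs:
      - targets: ['cloudwatch-exporter-eu-west-1:9106']
```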

--
Stuart Clark



Re: [prometheus-users] NLOG target?

2023-04-30 Thread Stuart Clark

On 19/04/2023 15:19, Thomas Coakley wrote:
We are considering converting from application insights to Prometheus. 
We have used nlog to send our trace data.


Does anyone know if it is possible to target prometheus from nlog?


Assuming that is traces (as in spans containing call durations & 
details, etc.) then no, as Prometheus is a metric system. Tempo is the 
tracing system from Grafana that can work alongside Prometheus (and 
there are other open source and commercial offerings too).


--
Stuart Clark



Re: [prometheus-users] Cloudwatch Exporter configuration for existing Grafana Dashboards

2023-04-30 Thread Stuart Clark

On 19/04/2023 14:52, Uğurcan Çaykara wrote:

Hello everyone,
I currently have an EKS cluster running on AWS. I installed a Helm 
chart to set up Prometheus and Grafana. All the metrics I need for 
EKS (deployments, services, pods etc.) are totally fine. So I wanted to 
use that Grafana centrally for other AWS services too, which is why I 
configured config.yml for Lambda metrics as given in the repository and 
deployed the CloudWatch exporter 
(https://github.com/prometheus/cloudwatch_exporter/). I can see the 
related metrics in the Grafana dashboard. When I hit the explore tab 
in the left menu of the Grafana UI and enter "lambda", all the metrics 
given in the config.yml for Lambda are totally fine; I can query them. 
And now I want to use a dashboard for Lambda 
-> https://grafana.com/grafana/dashboards/593-aws-lambda/
However it uses CloudWatch as its datasource, not Prometheus, which is 
why I see no data on that specific dashboard. What's the best way to 
overcome this? Is it something quickly editable and fixable, or is it 
better to create the dashboard from scratch? If someone can 
help me here, I would appreciate it.


If you use dashboards from the Grafana site that aren't designed for 
Prometheus their usefulness is limited. Other datasources (such as 
Cloudwatch or Datadog) use totally different query languages, so what 
you are really gaining is an outline of a design rather than anything 
you can directly use. You would need to rewrite all the queries to 
operate in a similar manner using PromQL.


Whenever I'm having a look for Grafana dashboards I use the datasource 
filter such that only those designed specifically for Prometheus are 
listed - at least then they mostly work out of the box (although often 
still need slight tweaks due to different job names or use of labels).


--
Stuart Clark



Re: [prometheus-users] Can prometheus detect the periodicity of a metric?

2023-04-18 Thread Stuart Clark
That's definitely something that can be done _using_ Prometheus, but not 
something done _within_ it.

You'd have an application which uses the query API (or is sent live data via 
remote write) to fetch metrics and then does whatever calculations are needed 
(for example using machine learning methods). You might then expose back new 
metrics to represent the outcome of those calculations (via scraping, remote 
read or remote write) which can then be used for visualisation and alerting.

Broadly what you are talking about is anomaly detection, for which a lot of 
ideas, PoCs, tools and blog articles have been created - you will find various 
ideas and ways to achieve such things in talks given at PromCon over the past 
years, for example. 

On 17 April 2023 07:51:21 BST, "Jónás Jurásek"  wrote:
>I want to create alerts based on if a periodic signal(not perfectly 
>periodic) has changed significantly from its previous behaviour. For 
>example: A size of a table in a db is growing a certain amount every hour, 
>but then every day it goes back to the size it was a day before. I want to 
>alert if this table grew twice as much, or hasn't grown at all, or if if 
>doesn't go back to it's original size. But rather than me giving the 1 day 
>periodicity, I want prometheus to detect it, because different tables can 
>have different periods. 
>Is there any way to do that with prometheus? Or is the any other way to 
>detect the change of the behaviour?
>


Re: [prometheus-users] Custom metric timestamps

2023-04-18 Thread Stuart Clark
The timestamp feature is only really designed for limited situations, such as 
ingesting metrics from another metric system. Scraping is generally expected to 
return the metric value from "now" (or, when coming via another metric system, 
from within the last few minutes). As Prometheus is primarily a metric system 
for operational systems, you want alerts to appear in a timely manner if issues 
occur.

Ingesting really old data is often needed to backfill metrics, but this is done 
via a different method to scraping.
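For example, recent Prometheus versions can build TSDB blocks from a file in 
the OpenMetrics text format with promtool tsdb create-blocks-from openmetrics 
<input file> <output directory>, after which the blocks are moved into 
Prometheus's data directory. A minimal input sketch (the metric name and 
values are illustrative; timestamps are Unix seconds):

```
batch_result_total{job="nightly"} 42 1649750400
batch_result_total{job="nightly"} 57 1649836800
# EOF
```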

Are you able to give a bit more insight around the ingestion of really old 
data? 

On 12 April 2023 08:55:39 BST, Ahmed Kooli  wrote:
>I am using the prometheus_client library in Python to monitor some metrics 
>and these were created in the past (months ago). When I fetch them with 
>Prometheus they are displayed at collection time and I would like to 
>specify a custom timestamp. I tried to create custom metrics and override 
>built-in functions to specify a timestamp but it seems to only work for a 
>short time shift (= custom timestamp - time at collection), does anyone 
>have an idea ? Thanks.
>


Re: [prometheus-users] mTLS was enabled but failed to access Prometheus via web

2023-04-08 Thread Stuart Clark

On 07/04/2023 10:29, Boyu Du wrote:

Hi Team,
I enabled mTLS on Prometheus server via web-config:
tls_server_config:
  cert_file: 
  key_file: 
  client_auth_type: RequireAndVerifyClientCert
  client_ca_file: 

This worked fine since all my underlying Prometheus Agent and Grafana 
could talk with this server successfully. However, when I tried to 
check the targets it monitors via browser, it says:
"The connection for this site is not secure.  
didn't accept your login certificate, or a login certificate may not 
have been provided."


And from the log file of Prometheus Server:
"caller=stdlib.go:105 level=error component=web caller="http: TLS 
handshake error from " msg="tls: 
client didn't provide a certificate""


The machine I use to access the Prometheus Server URL is a Windows 
machine, and it has a cert imported which is signed by the same CA.


May I know what I missed in the config?

How have you configured the Windows machine? Have you just imported the 
CA into Windows, or did you generate a client certificate and import / 
configure that too?


--
Stuart Clark



Re: [prometheus-users] Immediately pull metrics from target

2023-03-30 Thread Stuart Clark

On 2023-03-30 12:48, Ben Kochie wrote:

No. As Brian says, it's intentional that this is not possible in order
to avoid load spikes.


And as Ben mentioned earlier the normal scrape intervals are usually 
15/30 seconds for normal metrics, or 1/2 minutes for slower use cases. 
Therefore you'd only have to wait a short amount of time before metrics 
start appearing - although often you need to wait for at least a few 
scrapes to be able to do things like looking at counter increase rates, 
etc.


--
Stuart Clark



Re: [prometheus-users] Alerts Description and Summary

2023-03-27 Thread Stuart Clark

On 2023-03-27 14:43, sayf.eddi...@gmail.com wrote:

Hello, I have looked online and I can't find any best practices for
filling in the description and the summary. From the examples I see
that the summary should be the shortest (plus the minimum usage of
labels), but maybe that is observation bias.

I am trying to generate some automatic documentation around alerting,
and having a lot of labels makes it about as user friendly as reading
the yaml file directly.



It really depends how you are wanting to use those. If you are wanting 
to use the summary in an email's subject line then you probably want it 
to be fairly short for example. You can have as many labels/annotations 
as you like, so you don't even have to have one called "summary" if you 
don't want to, and there's nothing stopping you from having much more 
specific labels (e.g. severity, service, environment) which you can then 
include in email/ticket subjects.
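For example, an alerting rule with a short summary plus richer labels 
might look like this (all names and thresholds are purely illustrative):

```
groups:
  - name: example
    rules:
      - alert: CertificateExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 60 * 24 * 3600
        labels:
          severity: warning
          service: web
          environment: production
        annotations:
          summary: "TLS cert for {{ $labels.instance }} expires in under 60 days"
          description: "The certificate served by {{ $labels.instance }} is approaching expiry and should be renewed."
```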


--
Stuart Clark



Re: [prometheus-users] Graph how long a job takes

2023-03-27 Thread Stuart Clark

On 2023-03-24 14:01, Nunni wrote:

Hello.

What I need is to observe how long it takes to complete the execution of
a particular MSSQL query. The query is executed once every hour, and I
want to graph that for, say, a period of ten days.
The query is up and running using sql_exporter, and Prometheus gets the
results and correctly graphs it, but that's not what I need.



Unfortunately it isn't clear what you are asking for. From your 
description it sounds like you are graphing the data, but you also say 
it isn't what you need (but don't say how or what you are hoping for). 
If you could describe what you are doing, what is happening & what you 
are wanting instead someone might be able to suggest something...


--
Stuart Clark



Re: [prometheus-users] Separate endpoint for aggregate metrics?

2023-03-27 Thread Stuart Clark

On 2023-03-25 07:30, Kevin Z wrote:

Hi,

We have a server that has a high cardinality of metrics, mainly due to
a label that is tagged on the majority of the metrics. However, most
of our dashboards/queries don't use this label, and just use aggregate
queries. There are specific scenarios where we would need to debug and
sort based on the label, but this doesn't happen that often.

Is it a common design pattern to separate out two metrics endpoints,
one for aggregates, one for labelled metrics, with different scrape
intervals? This way we could limit the impact of the high cardinality
time series, by scraping the labelled metrics less frequently.

Couple of follow-up questions:
- When a query that uses the aggregate metric comes in, does it matter
that the data is potentially duplicated between the two endpoints? How
do we ensure that it doesn't try loading all the different time series
with the label and then aggregating, and instead directly use the
aggregate metric itself?
- How could we make sure this new setup is more efficient than the old
one? What criteria/metrics would be best (query evaluation time?
amount of data ingested?)



You certainly could split things into two endpoints and scrape at 
different intervals, however it is unlikely to make much, if any, 
difference. From the Prometheus side extra data points within a time 
series are very low impact. So you might scrape your aggregate endpoint 
every 30 seconds and the full data every 2 minutes (the slowest 
available scrape interval), meaning there are 4x fewer data points, which 
has very little memory impact.


You mention that there is a high cardinality - that is the thing which 
you need to fix, as that will be having the impact. You say there is a 
problematic label applied to most of the metrics. Can it be removed? 
What makes it problematic?


--
Stuart Clark



Re: [prometheus-users] Query data is empty when step is 1h in query_range api

2023-03-17 Thread Stuart Clark

On 17/03/2023 10:33, Bo Liu wrote:


I set scrape_interval to 1h. If Prometheus pulls from 00:30:00, then 
the data's timestamps are 00:30:00, 01:30:00, 02:30:00.



What I am trying to do is use the query_range API to query from that 
day's 00:00:00 to 23:59:59 with a step of one hour (because the 
scrape_interval is 1h, I think that's more reasonable).



But I get an empty result. I think when I run this query it checks for 
data at 00:00:00, but the data is at 00:30:00, so it misses every data point.



I didn't add Grafana because I'm not familiar with it, which makes 
this problem more complex.


Currently I use the default TSDB as the Prometheus database because I 
haven't yet decided which database to use.



So is there any way to do this in PromQL or the Prometheus API?


PS: Please don't mind my poor English


The maximum scrape interval is about 2 minutes due to the way staleness 
handling works, so anything longer than that will likely result in what 
you are seeing.


--
Stuart Clark



Re: [prometheus-users] scrape targets on Prometheus

2023-03-02 Thread Stuart Clark

On 02/03/2023 17:29, Pratibha Channamsetty wrote:

Hi,
I have a thousand stateful servers. I am not able to use any of the 
Prometheus service discovery approaches.
Is there a way I can use either an IP CIDR range for target scraping or 
a regex pattern on hostnames?


Any other, better approach would help here.

I would suggest using the file SD option and then having a tool creating 
a JSON/YAML file containing the servers to scrape.
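A minimal sketch of that approach (paths and labels are illustrative); 
Prometheus re-reads the file automatically when it changes:

```
# prometheus.yml (fragment)
scrape_configs:
  - job_name: stateful-servers
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/servers.yaml

# /etc/prometheus/targets/servers.yaml, generated by your tooling
- targets:
    - server-0001:9100
    - server-0002:9100
  labels:
    env: prod
```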


Is there really no list of those servers? For example are you using 
something like Ansible or Chef to provision/maintain them?


--
Stuart Clark



Re: [prometheus-users] Create custom metrics and add it to prometheus

2023-02-28 Thread Stuart Clark

On 28/02/2023 17:12, BHARATH KUMAR wrote:

Hello All,

I want to calculate the age of the pods running on every server. I wrote a 
shell script and added it under the folder /var/lib/node_exporter.


I created a cronjob for this script to run every minute
* * * * * root bash /var/lib/node_exporter/custom_metrics.sh > 
/var/lib/node_exporter/apt1-prom


I store the above cronjob in the /etc/cron.d folder with filename: 
prom-apt1.


But I am not able to see the metrics I created in Prometheus UI.



But similarly, I created another shell script file to fetch some 
metrics. I followed the same procedure as above.


* * * * * root bash /var/lib/node_exporter/custom1.sh > 
/var/lib/node_exporter/apt-prom


I store the above cronjob in the /etc/cron.d folder with filename: 
prom-apt.


The metrics which I mentioned in custom1.sh, I am able to see those 
metrics in Prometheus UI.



Could anyone help me?


What are the contents of those files in /var/lib/node_exporter?

--
Stuart Clark



Re: [prometheus-users] How to keep servers up while computer is off. (Blackbox_Exporter)

2023-02-21 Thread Stuart Clark

On 21/02/2023 21:07, Sean Coh wrote:

Hi Guys,

I came across an issue where, if I turn my computer off, the services 
will go down a few minutes later and will not scrape my targets.

I know the blackbox host and port is 127.0.0.1:9115. Is there a way 
I can keep my targets scraped while my computer is off?


Do you mean you are switching off the machine which is running Blackbox 
Exporter or Prometheus?


--
Stuart Clark



Re: [prometheus-users] server uptime

2023-02-21 Thread Stuart Clark

On 21/02/2023 09:56, sri L wrote:

Thanks for your reply Julius Volz.
Yes, we are monitoring servers with Node Exporter. We are looking for 
Uptime average based on calendar Month instead of last 30 days.
For example, February Month it should give the uptime average for 
28days i.e., 1st Feb to 28th feb


On Saturday, February 18, 2023 at 2:11:33 AM UTC+5:30 Julius Volz wrote:

Assuming you are monitoring your servers via something like the
Node Exporter, and you want the trailing 30-day upness percentage,
you could use the Node Exporters "up" metric like this:

    avg_over_time(up{job="node"}[30d]) * 100


You would need to adjust the query to be

avg_over_time(up{job="node"}[28d]) * 100

--
Stuart Clark



Re: [prometheus-users] metric_relabel_configs not dropping metrics

2023-02-21 Thread Stuart Clark

On 21/02/2023 17:29, Jihui Yang wrote:
I didn't find a way to adjust those. If I append scrape config jobs to 
the end of the config file, they should be able to overwrite existing 
job right?


No. Job configurations are self contained. So metrics scraped by a 
particular job will have any relabelling rules applied for that 
particular job only. The only way you can set relabelling rules for a 
job is by editing the job config. For the Prometheus Operator this is 
done by adjusting the PodMonitor/ServiceMonitor objects.


--
Stuart Clark



Re: [prometheus-users] fading out sample resolution for samples from longer ago possible?

2023-02-20 Thread Stuart Clark

On 21/02/2023 03:29, Christoph Anton Mitterer wrote:

Hey.

I wondered whether one can do with Prometheus something similar to what 
is possible with systems using RRD (e.g. Ganglia).


Depending on the kind of metrics, like those from the node 
exporter, one may want a very high sample resolution (and thus a short 
scrape interval) for, say, the last 2 days... but the further one 
goes back the less interesting that data becomes, at least at that 
resolution (ever looked at how much IO a server had 2 years ago, per 15s?).


What one may however want is a rough overview of these metrics for 
those time periods longer ago, e.g. in order to see some trends.



For other values, e.g. the total used disk space on a shared 
filesystem or maybe a tape library, one may not need such high 
resolution for the last 2 days, but therefore want the data (with low 
sample resolution, e.g. 1 sample per day) going back much longer, like 
the last 10 years.



With Ganglia/RRD one would then simply use multiple RRDs, each for 
different time spans and with different resolutions... and RRD would 
interpolate its samples accordingly.



Can anything like this be done with Prometheus? Or is that completely 
out of scope?



I saw that one can set the retention period, but that seems to affect 
everything.


So even if I have e.g. my low resolution tape library total size, 
which I could scrape only every hour or so, ... it wouldn't really 
help me.
In order to keep data for that like the last 10 years, I'd need to set 
the retention time to that.


But then the high resolution samples like from the node exporter would 
also be kept that long (with full resolution).


Prometheus itself cannot do downsampling, but other related projects 
such as Cortex & Thanos have such features.


--
Stuart Clark



Re: [prometheus-users] metric_relabel_configs not dropping metrics

2023-02-20 Thread Stuart Clark

On 20/02/2023 23:14, Jihui Yang wrote:
I'm using prometheus-operator. It only allows loading 
additionalScrapeConfigs 
<https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/additional-scrape-config.md> to 
append to the end of the config file. The other config jobs were added 
as part of loading prometheus-operator. I'm not sure I can change those.


The other jobs are probably from PodMonitor & ServiceMonitor objects, so 
you'd need to adjust those.


--
Stuart Clark



Re: [prometheus-users] metric_relabel_configs not dropping metrics

2023-02-20 Thread Stuart Clark

On 20/02/2023 22:33, Jihui Yang wrote:
I think these metrics are being scraped from another job. What I want 
is to drop any scraped metrics with names match the regex I provided

Then you need to add the relabel config to that other job.

--
Stuart Clark



Re: [prometheus-users] metric_relabel_configs not dropping metrics

2023-02-20 Thread Stuart Clark

On 20/02/2023 19:10, Jihui Yang wrote:

Hi, so I added this section to match all namespaces:
```
kubernetes_sd_configs:
  - role: endpoints
    kubeconfig_file: ""
    follow_redirects: true
    namespaces:
      names:
        - example1
        - example2
        - example3
```
as well as
```
authorization:
  type: Bearer
  credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
```
I turned on debug logging, and i'm getting
```
ts=2023-02-20T19:08:30.169Z caller=scrape.go:1292 level=debug 
component="scrape manager" scrape_pool=drop_response_metrics 
target=http://10.10.188.252:25672/metrics msg="Scrape failed" err="Get 
\"http://10.10.188.252:25672/metrics\": EOF"
ts=2023-02-20T19:08:30.465Z caller=scrape.go:1292 level=debug 
component="scrape manager" scrape_pool=drop_response_metrics 
target=http://10.10.152.96:10043/metrics msg="Scrape failed" 
err="server returned HTTP status 500 Internal Server Error"
ts=2023-02-20T19:08:30.510Z caller=scrape.go:1292 level=debug 
component="scrape manager" scrape_pool=drop_response_metrics 
target=http://10.10.241.97:9100/metrics msg="Scrape failed" 
err="server returned HTTP status 400 Bad Request"

```

The metrics are still not dropped


I'm not really following exactly what your config is.

Those errors suggest that at least some of the scrapes are failing.

When you say "the metrics are still not dropped" are these metrics that 
are being scraped in this job?


--
Stuart Clark



Re: [prometheus-users] metric_relabel_configs not dropping metrics

2023-02-20 Thread Stuart Clark

On 17/02/2023 22:02, Jihui Yang wrote:
I'm using prometheus-operator's additionalScrapeConfigs 
<https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/additional-scrape-config.md> to 
add metric drop rules. Example:


```
- job_name: drop_response_metrics
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: (response_total|response_latency_ms_count|response_latency_ms_sum)
    replacement: $1
    action: drop
```

The config is successfully loaded to prometheus and I can view it in 
`/config` endpoint. But for some reason I still can see the metrics. 
can you let me know what to do?


Is that the full config? I'm not seeing a Service Discovery section 
(e.g. Kubernetes or file based) to tell Prometheus where to scrape from.


--
Stuart Clark



Re: [prometheus-users] targetmissing alert is fooled by two endpoints on one job

2023-02-17 Thread Stuart Clark

On 17/02/2023 16:16, Mario Cornaccini wrote:

hi,

I have a job 'node-tools-ansible' with two endpoints, one each for the 
node and shell exporters.

In prometheus/targets I see the same set of labels for each endpoint.

For testing I stopped the node exporter, but the alert is based on the 
following expr:

up{} == 0

On the graph I can see that the up{} == 0 expr has value 0 for a few 
seconds, then gaps; when I remove the == 0 I can see it goes from 0 to 1.

So it seems to me that the other (shell) exporter mixes into the up 
metric, and that is because the endpoints have the same labels, right?

So in my scrape definition I need to specify one differing label for 
the shell exporter... I could make an exporter label, setting it to 
'node'/'shell' I guess... or how do you guys handle that?

They can't have identical labels. Even if the job label is the same the 
instance label should be different.


--
Stuart Clark



Re: [prometheus-users] 404 Error as prometheus adds a port 443 to my aws app runner domain

2023-02-13 Thread Stuart Clark

On 13/02/2023 09:16, V P wrote:
We want to scrape a https app runner domain without a port number and 
the Prometheus configuration is according to that. Is there a way to 
prevent the default port number for https(443) to get applied in the 
domain name of app runner when the Prometheus targets the https domain?


As my domain https://app-runner.com/actuator/prometheus gets converted to
https://app-runner.com:443/actuator/prometheus in the targets, and 
gives a 404 error due to the addition of the port number in the URL.


I changed the scheme to https and tried adding port 443 in the target 
field of the yml config file as well, but am still getting the same 404 error.


I'm not quite sure what you are meaning by "without a port number"? All 
connections are to a specific port. If you don't specify a port when 
doing a web request it will default to 80 or 443 (depending on if it is 
HTTP or HTTPS).


--
Stuart Clark



Re: [prometheus-users] Workaround to get the status of a service if port changes each time? (Blackbox_Exporter)

2023-02-03 Thread Stuart Clark

On 03/02/2023 19:07, Sean Coh wrote:

Hi Guys,

I have an issue with some of the services that I'm working with via 
Blackbox_Exporter. The services that I'm trying to scrape change port 
each time. For example, www.example.com:1234 will change to 
www.example.com:5678 each time the service is stopped/started. The 
hostname stays the same.


Is there another exporter that I can use? Or might you know of a 
workaround that would work in this case? Since the URL changes each 
time, the blackbox exporter cannot be used as-is here.



How would you know what port to use?

--
Stuart Clark



Re: [prometheus-users] can node_exporter expose aggregated node_cpu_seconds_total?

2023-02-01 Thread Stuart Clark

On 02/02/2023 06:26, koly li wrote:

Hi,

Currently, node_exporter exposes time series for each CPU core (an 
example below), which generates a lot of data in a large cluster (a 10k 
node cluster). However, we only care about total CPU usage instead of 
usage per core. So is there a way for node_exporter to only 
expose an aggregated node_cpu_seconds_total?


We also noticed there is a discussion here (reduce cardinality of 
node_cpu_seconds_total 
<https://groups.google.com/g/prometheus-developers/c/tvPCYZYHOYc>), 
but it seems it reached no conclusion.


node_cpu_seconds_total{container="node-exporter",cpu="85",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="system",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 
9077.24 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="85",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="user",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 
19298.57 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="86",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="idle",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 
1.060892164e+07 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="86",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="iowait",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 
4.37 1675059665571


You can't remove it as far as I'm aware, but you can use a recording 
rule to aggregate that data to just give you a metric that represents 
the overall CPU usage (not broken down by core/status).
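A sketch of such a rule (the rule name follows the usual 
level:metric:operation convention and is illustrative):

```
groups:
  - name: node-cpu-aggregate
    rules:
      # Overall per-instance CPU utilisation, with the cpu and mode labels dropped:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg without (cpu, mode) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```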


--
Stuart Clark



Re: [prometheus-users] Prometheus same host multiple endpoints

2023-01-23 Thread Stuart Clark

On 2023-01-23 05:52, Kishore Chopne wrote:

Hi,
We have a situation where the same metrics are published on two
different endpoints. Is it possible to pick one endpoint and discard
the other while writing a PromQL query?
Is it possible to configure Prometheus to collect metrics from only
one endpoint?



If they are literally the same metrics available in multiple endpoints 
then I'd suggest only scraping one. That's controlled via your scrape 
configuration. Depending on which mechanism you are using to manage that 
it could mean changes to prometheus.yaml, removing an endpoint from a 
YAML/JSON file or changing AWS/Kubernetes tags.


--
Stuart Clark



Re: [prometheus-users] Time series with change interval much less than scrape interval

2023-01-18 Thread Stuart Clark

On 18/01/2023 00:15, Mark Selby wrote:

I am struggling with PromQL over an issue dealing with a metric that
changes less frequently than the scrape interval. I am trying to use
Prometheus as a pseudo event tracker and hoping to get some advice on
how to best try and accomplish my goal.
I think this is the fundamental issue you are facing. Prometheus isn't 
an event system. It is designed for metrics, which are pretty different 
to events. It sounds like you should look at a system like Loki, 
Elasticsearch or a general purpose SQL or key/value database, as they 
are likely to be a much better fit for you than a timeseries database 
and ecosystem that is designed for handling metrics.


--
Stuart Clark



Re: [prometheus-users] AlertManager rules examples

2023-01-18 Thread Stuart Clark
Grafana does have its own alerting solution, but that's not something to do 
with Prometheus. You'd need to ask on the Grafana lists about how to do it 
with that option. 

On 17 January 2023 21:11:45 GMT, Eulogio Apelin  
wrote:
>Thanks for the info, it helps.
>
>It would be nice if there were examples on web pages or YouTube videos. We also 
>have Grafana, but it sounds like the engineers are trying to pick 
>Alertmanager over Grafana, as it is currently a mix and it's not straightforward 
>for us to configure both. Mainly because we don't have a dedicated person 
>working on alerts. It tends to be in the lower 10-20% of the priority list for 
>us, and other companies I've been with also dealt with this in the same way. 
>Just my 2 cents on this.
>
>The lazy in me just wants to click click click and be done.
>
>
>On Friday, January 13, 2023 at 1:53:14 AM UTC-10 Stuart Clark wrote:
>
>> On 11/01/2023 19:58, Eulogio Apelin wrote:
>> > I'm looking for information, primarily examples, of various ways to 
>> > configure alert rules.
>> >
>> > Specifically, scenarios like:
>> >
>> > In a single rule group:
>> > Regular expression that determined a tls cert expires in 60 days. send 
>> > 1 alert
>> > Regular expression that determined a tls cert expires in 40 days, send 
>> > 1 alert
>> > Regular expression that determined a tls cert expires in 30 days, send 
>> > 1 alert
>> > Regular expression that determined a tls cert expires in 20 days, send 
>> > 1 alert
>> > Regular expression that determined a tls cert expires in 10 days, send 
>> > 1 alert
>> > Regular expression that determined a tls cert expires in 5 days, send 
>> > 1 alert
>> > Regular expression that determined a tls cert expires in 0 days, send 
>> > 1 alert
>> >
>> > Another scenario is to
>> > send an alert once a day to an email address.
>> > send an alert if it's the 3rd day in a row, send the alert to another 
>> > set of addresses, and stop alerting.
>> >
>> > can alertmanager send alerts to teams like it does slack?
>> >
>> > And another other general examples of alert manager rules.
>> >
>> I think it is best not to think of alerts as moment in time events but 
>> as being a time period where a certain condition is true. Separate to 
>> the actual alert firing are then rules (in Alertmanager) of how to route 
>> it (e.g. to Slack, email, etc.), what to send (email body template) and 
>> how often to remind people that the alert is happening.
>>
>> So for example with your TLS expiry example you might have an alert 
>> which starts firing once a certificate is within 60 days of expiry. It 
>> would continue to fire continuously until either the certificate is 
>> renewed (i.e. it is over 60 days again) or stops existing (because 
>> you've reconfigured Prometheus to no longer monitor that certificate). 
>> Then within Alertmanager you can set rules to send you a message every 
>> 10 days that alert is firing, meaning you'd get a message at 60, 50, 40, 
>> etc days until expiry.
>>
>> More complex alerting routing decisions are generally out of scope for 
>> Alertmanager and would be expected to be managed by a more complex 
>> system (such as PagerDuty, OpsGenie, Grafana On-Call, etc.). This would 
>> cover your example of wanting to escalate an alert after a period of 
>> time, but would also cover things like having on-call rotas where 
>> different people would be contacted by looking at a rota calendar.
>>
>> -- 
>> Stuart Clark
>>
>>
>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [prometheus-users] AlertManager rules examples

2023-01-13 Thread Stuart Clark

On 11/01/2023 19:58, Eulogio Apelin wrote:
I'm looking for information, primarily examples, of various ways to 
configure alert rules.


Specifically, scenarios like:

In a single rule group:
Regular expression that determined a tls cert expires in 60 days. send 
1 alert
Regular expression that determined a tls cert expires in 40 days, send 
1 alert
Regular expression that determined a tls cert expires in 30 days, send 
1 alert
Regular expression that determined a tls cert expires in 20 days, send 
1 alert
Regular expression that determined a tls cert expires in 10 days, send 
1 alert
Regular expression that determined a tls cert expires in 5 days, send 
1 alert
Regular expression that determined a tls cert expires in 0 days, send 
1 alert


Another scenario is to
send an alert once a day to an email address.
send an alert if it's the 3rd day in a row, send the alert to another 
set of addresses, and stop alerting.


can alertmanager send alerts to teams like it does slack?

And another other general examples of alert manager rules.

I think it is best not to think of alerts as moment in time events but 
as being a time period where a certain condition is true. Separate to 
the actual alert firing are then rules (in Alertmanager) of how to route 
it (e.g. to Slack, email, etc.), what to send (email body template) and 
how often to remind people that the alert is happening.


So for example with your TLS expiry example you might have an alert 
which starts firing once a certificate is within 60 days of expiry. It 
would continue to fire continuously until either the certificate is 
renewed (i.e. it is over 60 days again) or stops existing (because 
you've reconfigured Prometheus to no longer monitor that certificate). 
Then within Alertmanager you can set rules to send you a message every 
10 days that alert is firing, meaning you'd get a message at 60, 50, 40, 
etc days until expiry.
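
As a hedged sketch of that pattern (the metric comes from the Blackbox Exporter's TLS probe; the receiver name, "for" duration and thresholds are assumptions):

  # Prometheus alerting rule
  groups:
    - name: tls
      rules:
        - alert: CertificateExpiringSoon
          expr: probe_ssl_earliest_cert_expiry - time() < 60 * 86400
          for: 1h

  # Alertmanager route: re-notify every 10 days while the alert still fires
  route:
    receiver: team-email
    repeat_interval: 240h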


More complex alerting routing decisions are generally out of scope for 
Alertmanager and would be expected to be managed by a more complex 
system (such as PagerDuty, OpsGenie, Grafana On-Call, etc.). This would 
cover your example of wanting to escalate an alert after a period of 
time, but would also cover things like having on-call rotas where 
different people would be contacted by looking at a rota calendar.


--
Stuart Clark



Re: [prometheus-users] Prometheus-Thanos-Grafana Integration || error || open /etc/prometheus-shared/prometheus.yaml: no such file or directory

2023-01-06 Thread Stuart Clark

On 05/01/2023 18:12, Gaurav Nagarkoti wrote:

hi everyone,

w.r.t. Prometheus stateful set deployment I'm receiving the following 
error within logs


level=error msg="Error loading config (--config.file=/etc/prometheus-shared/prometheus.yaml)" file=/etc/prometheus-shared/prometheus.yaml err="open /etc/prometheus-shared/prometheus.yaml: no such file or directory"

  * would like to know the probable cause for the same
    - places to check
  * troubleshooting steps for resolving the same
  * any workaround

As the error message says the file prometheus.yaml in the directory 
/etc/prometheus-shared doesn't exist, which is the configuration file 
for Prometheus.


You need to either ensure the file does exist at that location or adjust 
the command line to give the correct location if different.
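
For example, if the file actually lives elsewhere, the flag on the Prometheus container would be adjusted along these lines (the path shown is hypothetical):

  --config.file=/etc/prometheus/prometheus.yaml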


--
Stuart Clark



Re: [prometheus-users] Issue getting data in time range

2022-12-17 Thread Stuart Clark

On 13/12/2022 10:19, Yogita Bhardwaj wrote:
I'm not able to fetch metric counts in a given time range using the 
Prometheus client.


I'm not really clear about exactly what you are trying to do & what is 
happening.


Could you please show what you are running, the query, what you are 
expecting & what you actually see?


--
Stuart Clark



Re: [prometheus-users] Issue while exposing Prometheus metric from Spring Boot application

2022-12-14 Thread Stuart Clark
It is a requirement that counter metric names end with _total, which is likely why it is 
being added for you. 

On 14 December 2022 17:50:32 GMT, "arnav...@gmail.com"  
wrote:
>Hi,
>
>I am exposing two counter metrics from my spring boot application. The 
>metrics are getting generated and the counters increase by 1 when the 
>requests succeed. I am facing 2 issues - 1) The name of the counter is 
>getting appended by "_total" even though I did not add that and I don't 
>want it to be added, 2) The last label ends with an open ','. I want to 
>remove this last ','. Here is my code:
>public class PrometheusUtility {
>  private PrometheusUtility() { }
>
>  static CollectorRegistry registry = CollectorRegistry.defaultRegistry;
>
>  static final Counter counter = Counter.build()
>      .name("http_request").help("help")
>      .labelNames("method", "status", "api", "component")
>      .register(registry);
>
>  static final Counter dsCounter = Counter.build()
>      .name("http_downstream_request")
>      .help("Records the downstream request count")
>      .labelNames("method", "status", "api", "component", "downstream")
>      .register(registry);
>
>  public static void incrementCounter(String method, String status,
>      String apiName, String component) {
>    counter.labels(method, status, apiName, component).inc();
>  }
>
>  public static void incrementDownstreamCounter(String method, String status,
>      String apiName, String component, String downstream) {
>    dsCounter.labels(method, status, apiName, component, downstream).inc();
>  }
>}
>
>I am calling these functions from the business class:
>PrometheusUtility.incrementCounter("Post", 
>String.valueOf(HttpStatus.SC_OK), "newMail", "mx"); 
>PrometheusUtility.incrementDownstreamCounter("Post", 
>String.valueOf(HttpStatus.SC_OK), "newMail", "mx", "mwi"); 
>
>The pom.xml has this dependency added
>  <dependency>
>    <groupId>io.prometheus</groupId>
>    <artifactId>simpleclient_spring_boot</artifactId>
>    <version>0.16.0</version>
>  </dependency>
>
>The output I am checking in the browser where my application is up and 
>running (not in Prometheus):
>http_request_total{method="Post",status="200",api="newMail",component="mx",} 
>1.0 
>http_downstream_request_total{method="Post",status="200",api="newMail",component="mx",downstream="mwi",}
> 
>1.0 
>
>Issue:
>
>   1. Both metric names are getting an additional "_total" appended to them. I
>      don't want this to be added by default. It should be the same as the name I
>      have put in the Utility class.
>   2. Both metrics have an open ',' or comma at the end. Ideally this should
>      not be there. A metric does not have an open comma at the end. Not sure why
>      it's adding this comma. I have specified the correct number of labels at
>      the creation time of the counter and am populating the labels correctly for
>      incrementing. Not sure where this comma is coming from.
>
>Please let me know how I could fix these issues.
>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [prometheus-users] Null value in alerts

2022-12-09 Thread Stuart Clark

On 09/12/2022 08:49, sebagloc...@gmail.com wrote:


Thanks for advice,

So in this case I just need to use absent like this In alert?:

  - alert: Resource group in cluster is down
    expr: absent(windows_mscluster_resourcegroup_state{name!~"Available Storage"}) == 1


You aren't naming a specific series here, as you are using !~. You need to ensure 
you are only using = in any label matchers.
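
As a hedged sketch (the group name here is hypothetical; note that absent() itself returns 1 when nothing matches, so the == 1 comparison is redundant):

  - alert: Resource group in cluster is down
    expr: absent(windows_mscluster_resourcegroup_state{name="Cluster Group"})

with one such rule per resource group you care about.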


--
Stuart Clark



Re: [prometheus-users] How to combine counter data which scrapes every 5 min using node_exporter

2022-12-03 Thread Stuart Clark

On 03/12/2022 09:37, Umut Cokbilir wrote:

Hi All,

I've tried to use "rate" function to combine but it didn't work as 
expected. How to combine the data below.


example; it returns empty query result
rate(ping_rtt_min{Description="1673(1)-Trunk28#PTV1801-3-LDX-RX1/TX1*#SG520_W#@1890606165@||DWDM-1-1||_NNI_ANK"}[10m])


# HELP ping_rtt_min RTT Min
# TYPE ping_rtt_min counter
ping_rtt_min{Description="1673(1)-Trunk28#PTV1801-3-LDX-RX1/TX1*#SG520_W#@1890606165@||DWDM-1-1||_NNI_ANK",Interface="Eth-Trunk28.1",Management_IP="10.85.4.5",NNI_IPAddress="10.145.91.178",NNI_RTT_Min="1.69",NeName="1713(1)-PTN3125574_POLATLI_BSC",Network_Address="10.145.91.176/30",Packet_Loss="0.0"} 
9.906 
ping_rtt_min{Description="1673(2)-Trunk28#PTV1802-3-LDX-RX1/TX1*#SG520_P#@1890606165A@||DWDM-1-1||_NNI_ANK",Interface="Eth-Trunk28.1",Management_IP="10.85.4.6",NNI_IPAddress="10.145.91.182",NNI_RTT_Min="11.02",NeName="1713(2)-PTN3125575_POLATLI_BSC",Network_Address="10.145.91.180/30",Packet_Loss="0.0"} 
19.042 
ping_rtt_min{Description="1713(1)-Trunk28#PSK8803-31-TQX-RX2/TX2*#SG520_W#@1890606165@||DWDM-1-1||_NNI_ANK",Interface="Eth-Trunk28.1",Management_IP="10.85.4.1",NNI_IPAddress="10.145.91.177",NNI_RTT_Min="1.57",NeName="1673(1)-PTN3120590_PURSAKLAR_PLAZA",Network_Address="10.145.91.176/30",Packet_Loss="0.0"} 
8.336 
ping_rtt_min{Description="1713(2)-Trunk28#PSK8804-30-TTX-RX8/TX8*#SG520_P#@1890606165A@||DWDM-1-1||_NNI_ANK",Interface="Eth-Trunk28.1",Management_IP="10.85.4.2",NNI_IPAddress="10.145.91.181",NNI_RTT_Min="11.28",NeName="1673(2)-PTN3120591_PURSAKLAR_PLAZA",Network_Address="10.145.91.180/30",Packet_Loss="0.0"} 
7.773


From that entry it looks like data values are being included within the 
metric as labels. These need splitting out into their own separate 
metrics (e.g. NNI_RTT_Min and Packet_Loss). As it stands you are 
creating new time series for basically every scrape, which results in 
the lines you are seeing in your graph. Also as it is a cardinality 
issue it will cause problems with the data storage side of things too.


It is important that labels are never used for data values, and just 
contain fixed descriptive entries (such as names, etc.).
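
As a hedged sketch of what the exposition could look like after splitting (label set trimmed for brevity, and these values look like gauges rather than counters):

  # TYPE ping_rtt_min gauge
  ping_rtt_min{Interface="Eth-Trunk28.1",Management_IP="10.85.4.5"} 1.69
  # TYPE ping_packet_loss gauge
  ping_packet_loss{Interface="Eth-Trunk28.1",Management_IP="10.85.4.5"} 0.0

Here NNI_RTT_Min and Packet_Loss have become metric values rather than labels, so the series stay stable between scrapes and functions like rate() or avg_over_time() behave as expected.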


If you correct that things should look better.

--
Stuart Clark



Re: [prometheus-users] Prometheus reaction time

2022-11-23 Thread Stuart Clark

On 23/11/2022 15:28, Nenad Rakonjac wrote:

Hello,

Does anyone have a clue how much time is needed for Prometheus metrics to 
go from the application to Alertmanager? Can this time be bigger than one 
minute?


Metrics don't go to Alertmanager. Instead you create alerting rules 
which query metrics to produce alerts.


How long it takes from something changing to an alert being fired depends on lots of 
different factors:


- How often you scrape the application (so Prometheus has the latest metrics)
- The query you are using in your alerting rule (for example you might be alerting based on an average rate over the last few minutes, so a sudden spike wouldn't immediately trigger an alert)
- If you have a "for" clause in the alert rule (which is generally recommended so as not to send an alert for something that goes away very quickly, such as a transient spike - see the sketch below)
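
As a hedged sketch of such a "for" clause (the metric names and threshold are hypothetical):

  - alert: HighErrorRate
    expr: rate(http_requests_errors_total[5m]) / rate(http_requests_total[5m]) > 0.05
    for: 5m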


In general the "speed" of alerting isn't particularly critical. Instead 
what is important is producing useful actionable alerts. Sending an 
alert as soon as a resource goes above 90% isn't particularly useful if 
a second later it drops to 10% - nothing bad happened and there is 
nothing to be addressed. In general it would actually be more useful to 
alert if that threshold is breached for over say 5 minutes, or even more 
usefully when an SLO is failed or is projected to fail within the next 
30 minutes.


--
Stuart Clark



Re: [prometheus-users] prometheus federate transfter alerts stopped to main prometheus

2022-11-21 Thread Stuart Clark

On 22/11/2022 05:53, Prashant Singh wrote:

Dear All,

I am using a federated Prometheus configured to feed a central Prometheus, 
but I am getting alerts from the federated Prometheus as well. How can I 
avoid alerts from the federated one?



  - job_name: 'K8S-Federate'
    #scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    scheme: http

    params:
      'match[]':
        - '{job!=""}'

The alert rules are set in each Prometheus, so if you are not wanting 
them you'd need to look at the configuration and change/remove them.


--
Stuart Clark



Re: [prometheus-users] Prometheus Growth

2022-11-21 Thread Stuart Clark

On 20/11/2022 22:53, Julio Leal wrote:

Hi everyone!
I'm doing a study about how much time we have left in our Prometheus instances.
First of all, I thought that Prometheus memory grew because of the number of 
timeseries ingested. I thought this because timeseries stay in memory to 
retrieve information faster.
But I got the timeseries from the last 3 months and found that the timeseries 
ingested grew only a little, like this graph, yet over the same period I 
needed to increase the memory and CPU of our instances several times.


There was an interesting talk at the recent PromCon EU 
(https://promcon.io/2022-munich/talks/why-is-it-so-big-analysing-the-m/) 
which is available on YouTube, but basically the main memory usage is 
generally due to the number of time series, with only a very minor usage 
due to queries.


--
Stuart Clark



Re: [prometheus-users] Modify the check interval of blackbox exporter HTTP probe

2022-11-21 Thread Stuart Clark

On 18/11/2022 19:09, Lunar Angelo Pajaroja wrote:
Hi, is it possible to modify the check interval of the HTTP probe? For example, 
I just want blackbox (http module) to check "http://example-endpoint" 
just every one minute.

Just change the scrape interval of the job that is checking that 
endpoint with the Blackbox Exporter.
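
A hedged sketch of such a job (the module and target come from the question; the exporter address is hypothetical, and the relabelling is the usual Blackbox Exporter pattern):

  - job_name: blackbox-http
    scrape_interval: 1m
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://example-endpoint
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # hypothetical exporter address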


--
Stuart Clark



Re: [prometheus-users] Prometheus jobs are not showing prometheus UI Page

2022-11-09 Thread Stuart Clark

On 08/11/2022 10:05, Venkatraman Natarajan wrote:

Hi Team,

We have 100+ Prometheus jobs, but the Prometheus UI is not showing all the 
jobs.

Is there a limit on the number of Prometheus jobs?

I would like to add more Prometheus jobs to scrape the metrics.


Could you describe more what you are seeing?

Where are you looking in the UI and what is showing?

--
Stuart Clark



Re: [prometheus-users] Is there any plan for prometheus to support string type metrics

2022-11-01 Thread Stuart Clark

On 2022-11-01 12:09, Prashant Singh wrote:

Currently Prometheus metric types include counter, gauge, histogram,
and summary. They are all numeric types. I would like to see if
Prometheus can support a string type metric, for example a SQL query
whose output is a string or alphanumeric.

I am using the sql_exporter agent for monitoring a Postgres DB where
the query output is string-based.



Prometheus is a metrics system, so everything is based on numeric data - 
time series are collections of values representing the state at a 
particular point in time.


It sounds like you are instead talking about events, which are very 
different and not limited to number (e.g. logs).


Maybe it would be best to describe what you are trying to achieve?

--
Stuart Clark



Re: [prometheus-users] Guaranteed ingestion of metrics with historical timestamps

2022-10-30 Thread Stuart Clark

On 28/10/2022 21:44, Omar Khazamov wrote:

Thank you.

>> Alternatively, if the existing metric system already has
>> extensive historical data which you'd like to be able to query
>> (for dashboards and alerts) take a look at the remote read system.

This is probably a silly question, but is it also true for remote 
write? I may use a Prometheus-compatible remote storage, VictoriaMetrics, 
and it looks like it supports only remote write.


Remote read & remote write are complementary but different.

Remote write will send a copy of the metrics you have just scraped to an 
external system. This could be some sort of metrics storage system, but 
could also be something like a machine learning analytics tool.


Remote read allows Prometheus to query an external system any time a 
PromQL request is made. Whatever data is returned is merged into any 
local data and presented to the requester. Again this could be some sort 
of metrics store, but could also be something different like a 
forecasting system or an event store.


Support for remote read & write are up to the external system. While for 
the use case of an external metrics store (for long term or global 
storage) it makes sense to support both, there are plenty of use cases 
which only require one or the other.
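
As a hedged configuration sketch (the URLs are hypothetical and depend on the external system; the VictoriaMetrics write endpoint shown follows its documented default):

  remote_write:
    - url: http://victoriametrics:8428/api/v1/write

  # only for stores that expose a Prometheus remote read endpoint:
  remote_read:
    - url: http://remote-store:9201/read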


--
Stuart Clark



Re: [prometheus-users] Guaranteed ingestion of metrics with historical timestamps

2022-10-25 Thread Stuart Clark
If you are trying to interface with another metrics system the Push Gateway 
isn't the right tool. The main use case for the Push Gateway is for batch jobs 
that aren't able to be directly scraped, but still have useful metrics. For 
systems which are constantly running you should instead look at direct 
instrumentation or the use of exporters.

Is this a custom metrics system, or something off the shelf and common? If so, 
there might already be an exporter available.

If you do need to make a custom exporter, I'd suggest looking at some of the 
similar existing ones (for example the Cloudwatch exporter) to see how they are 
made - but basically when a scrape request is received API calls would be made 
to your other metrics system to fetch the latest values, converted to 
Prometheus format (including the timestamp of that latest value from the other 
metric system) and returned. Prometheus would regularly scrape that exporter 
and add new values on a regular basis.

Alternatively, if the existing metric system already has extensive historical 
data which you'd like to be able to query (for dashboards and alerts) take a 
look at the remote read system. With this option Prometheus would use the 
remote system as an additional data source, running queries as needed (based on 
the PromQL queries it receives), combining the data with local information as 
needed. There are already remote read integrations available for some data 
stores. 

On 25 October 2022 18:24:56 BST, Omar Khazamov  wrote:
>Thanks, I'm importing metrics from our internal metrics system. Do you have
>any advice on how to push with the explicit timestamps?
>
>Le ven. 7 oct. 2022 à 13:30, Stuart Clark  a
>écrit :
>
>> On 07/10/2022 10:16, Omar Khazamov wrote:
>>
>> Hi Stuart,
>>
>> I can see that support of timestamps has been discontinued around November
>> 21st, 2021. Indeed, when I try
>>
>> C02G74F9Q6LR bat-datapipeline % echo "test_metric_with_timestamp 33 1665623039" | curl --data-binary @- https:///metrics/job/pushgateway-job
>>
>> I get  *"pushed metrics are invalid or inconsistent with existing
>> metrics: pushed metrics must not have timestamps"*
>>
>> Could you please specify how do you use timestamps in metrics? Thanks
>>
>> As mentioned before timestamps in general should not be used.
>>
>> You should always be publishing the "latest" value of any metric when
>> Prometheus scrapes the endpoint (or the push gateway in this case).
>>
>> --
>> Stuart Clark
>>
>>
>
>-- 
>Thanks,
>Omar Khazamov

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [prometheus-users] Use Case Assessment

2022-10-17 Thread Stuart Clark

On 17/10/2022 07:26, Rishabh Arora wrote:

Hello!

I'm currently in the process of implementing Prometheus along with 
Alertmanager as our de facto solution for node health monitoring. We 
have a kubernetes, kafka, mqtt setup and for monitoring our 
infrastructure, prometheus is an obvious good fit.


We have an application / business case, where I'm wondering whether 
Prometheus may be a reasonable solution. Our application needs to meet 
certain SLAs. In case those SLAs are not being, some alerts need to be 
firing. For example, consider the following case which bears close 
resemblance to our real business case:


An Order schema in our system has a payment field which can be one 
of ['COMPLETED','FAILED','PENDING']. In our HA real time system, we 
need to fire alerts for Orders which are in a PENDING state. Rows in 
our Orders collection will be in the order of potentially millions. 
An order also has a paymentEngine field, which represents the entity 
responsible for processing the payment for the order.


Now, with Prometheus, finding the total count of PENDING Orders would 
be a simple metric, but what we're interested in is also the Order 
IDs. For instance, is there a way I could capture the PENDING order 
IDs in the "metadata"(???) or "payload" of the metric? Downstream in 
the alertmanager, I'd also like to group by paymentEngine so I 
could potentially inhibit alerts for an unstable engine.


Can anyone please help me out? Apologies in advance for my naivety :)


What you are asking for isn't really the job of Prometheus.

Having a metric detailing the number of pending orders & alerting on 
that is completely within the normal area for Prometheus & Alertmanager 
- observing the system and alerting if there are issues that need 
investigation. However the next step of dealing with the individual 
events/orders is the job for a different system. If paymentEngine could 
be a small number of options (e.g. PayPal, Swipe, Cash) then it would be 
reasonable to have that as a label to the pending orders metric (which 
then would allow you to alert if one method stops working), but order ID 
isn't something you should ever put in the metrics. Instead once you 
were alerted about a potential issue you might query your order database 
directly or look at log files to dig into the detail and figure out what 
is happening.
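
As a hedged sketch of such a metric (the names and values are hypothetical):

  orders_pending{paymentEngine="paypal"} 12
  orders_pending{paymentEngine="cash"} 3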


--
Stuart Clark



Re: [prometheus-users] Mysql handshake error with blackbox tcp_connect

2022-10-13 Thread Stuart Clark

On 12/10/2022 15:46, Dimitris Tzampanakis wrote:

Hello.
When the blackbox exporter makes a tcp connect to mysql on port 3306 I 
notice handshake errors, and after 100 errors the host is blocked by 
mysql. It was very frustrating and difficult to spot, since other 
successful connects from the same host reset the error_count in mysql and 
nothing was logged. This same setup is running in other environments 
(without problem). But in this env the big traffic hasn't started 
yet, so the blocks happen.

I also found this similar
https://github.com/prometheus/blackbox_exporter/issues/505
Is there anything that i missing  in the config?
Is there any other that user tpc_connect for mysql checks with similar 
problems? 


As you are just doing a TCP connection request it will look to be some 
sort of failure from the MySQL server's perspective - you are just 
opening a connection and then not doing anything. You would need to look 
at the server configuration to whitelist the IP of the Blackbox Exporter 
so it doesn't get blocked.


Alternatively look at using the MySQL Exporter instead of the Blackbox 
Exporter, which will give you more insight beyond just availability too.


--
Stuart Clark



Re: [prometheus-users] Guaranteed ingestion of metrics with historical timestamps

2022-10-07 Thread Stuart Clark

On 07/10/2022 10:16, Omar Khazamov wrote:

Hi Stuart,

I can see that support of timestamps has been discontinued around 
November 21st, 2021. Indeed, when I try


C02G74F9Q6LR bat-datapipeline % echo "test_metric_with_timestamp 33 1665623039" | curl --data-binary @- https:///metrics/job/pushgateway-job


I get *"pushed metrics are invalid or inconsistent with existing 
metrics: pushed metrics must not have timestamps"*

*
*
Could you please specify how do you use timestamps in metrics? Thanks


As mentioned before timestamps in general should not be used.

You should always be publishing the "latest" value of any metric when 
Prometheus scrapes the endpoint (or the push gateway in this case).


--
Stuart Clark



Re: [prometheus-users] Metrics vs log level

2022-10-07 Thread Stuart Clark

On 07/10/2022 04:09, Muthuveerappan Periyakaruppan wrote:
we have a situation where we have 8 to 15 million head series in 
each Prometheus and we have 7 instances of them (federated). Our 
Prometheus servers are constantly flooded handling the incoming 
metrics and backend recording rules. 


8-15 million time series on a single Prometheus instance is pretty high. 
What spec machine/pod are these?


When you say "flooded" what are you meaning?

One thought which came to mind was - do we have something similar to log 
levels for Prometheus metrics? If it's there then... we can benefit 
from it by configuring all targets to run at error level in 
production and at debug/info level in development... This will help 
control the flooding of metrics.


I'm not sure I understand what you are suggesting. What would be 
the difference between setting this hypothetical "error" and "debug" 
levels? Are you meaning some metrics would only be exposed on some 
environments?


--
Stuart Clark



Re: [prometheus-users] Re: Hisorical data

2022-09-26 Thread Stuart Clark

On 26/09/2022 14:33, BHARATH KUMAR wrote:

Thanks for your reply.

If I use the deletion API, will it delete all the stored data, or just 
the data for the instances I removed from the prometheus.yml file?


It will delete whatever you tell it to delete:

https://prometheus.io/docs/prometheus/latest/querying/api/#delete-series
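
For example (a hedged sketch: the admin API has to be enabled with --web.enable-admin-api, and the matcher here is hypothetical):

  curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="10.20.30.40:9100"}'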

--
Stuart Clark



Re: [prometheus-users] Re: Hisorical data

2022-09-26 Thread Stuart Clark

On 26/09/2022 13:10, BHARATH KUMAR wrote:

Query: up{instance=~"$unreachbale_instance"} and up

This query is giving me some instances which were already removed from 
prometheus.yml


That's expected.

The jobs in prometheus.yaml and the targets referenced are just what 
Prometheus should scrape "now". You can add and remove targets whenever 
you want, which will change what new data is collected. However it has 
nothing to do with the ongoing storage of data within the timeseries 
database. The main configuration for local storage is the number of days 
(or total storage) values, which default to storing 14 days of data. So 
by default if you stop scraping something today it would continue to 
exist (with all previous metrics) for another two weeks. Note also that 
any remote read system could also continue to return data depending on 
how that external system was configured.


If you have something stored in the local Prometheus store (rather than 
something you are using remote read with) and don't want to wait until 
it expires (15 days or whatever you configured) you can use the deletion 
API.


--
Stuart Clark



Re: [prometheus-users] is it already a known issue with alertmanager integration to pagerduty

2022-09-26 Thread Stuart Clark

On 26/09/2022 12:02, Brian Candler wrote:
> When I have multiple alerts for the same route with the same severity, 
> all ended up in pagerduty as expected


Yes, because these are separate alerts that are not grouped together.

> But when I have alerts triggered for the same instance with 
> different severity, then I have only one (maybe the first alert) 
> sent to pagerduty


But you said you've told alertmanager to group on "instance", so 
that's what it will do: all alerts for the same instance (i.e. with 
the same value of the 'instance' label) will be delivered in a single 
message.  Whether the *body* of the pagerduty message shows those 
individual alerts, depends on how you've configured things.


You might want to show your alertmanager pagerduty_config 
<https://prometheus.io/docs/alerting/latest/configuration/#pagerduty_config>, 
and also your alert grouping config.


In general if you are sending alerts through to an external incident 
management system (PagerDuty, OpsGenie, JIRA, etc.) you are probably 
best not doing any grouping at all:


group_by: ['...']

(The '...' here is Alertmanager's literal special value meaning "group by all labels", which effectively disables grouping - it is not a placeholder.)

--
Stuart Clark



Re: [prometheus-users] /metrics lifecycle

2022-09-08 Thread Stuart Clark

On 31/08/2022 02:16, Vasiliy B wrote:

Folks,
  Am researching a use case where we collect /metrics data with 
Prometheus only when needed to do some investigation. During normal 
operating hours, we would flush the /metrics endpoint on a time 
schedule which is greater than the scrape_interval ?


From the documentation, it is clear that Prometheus server will honor 
the scrape_interval config setting. But what about on the service 
side?  Do the metrics have the ability to reset to zero after a 
predefined time, i.e. 1 minute?


Looking for feedback if this is a feasible, any gotchas, or 
clarification on how metrics are stored on the client side.


A call to /metrics should always return the "current" value of every 
metric. For counters (which are generally recommended) they always 
increase and therefore only reset to zero if the application itself is 
restarted. For gauges the ideal situation is that the scraping request 
returns the live value of those metrics.

For some metrics (such as where a call to an external system is needed 
to generate them) it can be quite resource intensive to produce, and 
therefore that process wants to be limited in frequency. In that case 
the common method is to have a separate timed process which updates the 
metric, with the call to /metrics just returning the latest values 
calculated. While this can reduce the impact of such "heavy" metric 
generation processes it does mean that you may be getting old values 
(depending on the frequency of the metric generation process) and there 
is a risk of that process breaking without you noticing.


So in general, no, there isn't any notion of "flushing" or resetting metrics.

Would you be able to give a bit more detail about what you are trying to 
do, and why you think resetting metrics is important / needed?


--
Stuart Clark



Re: [prometheus-users] Re: up query

2022-08-16 Thread Stuart Clark

On 2022-08-16 15:08, BHARATH KUMAR wrote:

hello,

max_over_time(up[2d]) == 0 is giving me the info like... if the server
goes down for even 1 minute in the last two days it was displayed
in the graph, which I don't want. I want the information that for the
last "X" days it should be completely in an unreachable state.



So you are only wanting it if every single scrape failed over the past 2 
days?


Try sum_over_time() instead of max_over_time().

--
Stuart Clark



Re: [prometheus-users] How could I trucate data after pull in Java client.

2022-08-12 Thread Stuart Clark

On 09/08/2022 08:56, Hello Wood wrote:
Hi, I'm using Prometheus to collect Spring Boot service metrics in Java. 
But I found a problem: data still exists after a pull, which makes the 
instant data incorrect.


For example, at first there is one label, like 'app_version{version=a} 100'. 
Then the metric is updated and a new label value b is added, so the metrics 
become 'app_version{version=a} 50' and 'app_version{version=b} 50'. 
Then label a is no longer updated, and the metric becomes 
'app_version{version=b} 100'.


When I pull metrics from the Spring Boot service, the metrics are 
'app_version{version=a} 50' and 'app_version{version=b} 100'. But the 
expected data should be 'app_version{version=b} 100' only.


How could I fix this issue? Thanks.


I think possibly you aren't using labels in the way expected.

Labels are used to "slice & dice" the data, so for example to be able to 
see which specific HTTP response code was returned from a web call, etc.


What is the value of the metric app_version supposed to signify?

--
Stuart Clark



Re: [prometheus-users] Prometheus Api to json to use in react.js

2022-08-12 Thread Stuart Clark

On 06/08/2022 22:58, Geet wrote:

Hi ,

What is the best way to collect data from Prometheus metrics which is 
in dev environment and convert the metrics to json and to use in 
react.js for making bar charts?



Take a look at the HTTP query API which returns a JSON response for a 
query you send:


https://prometheus.io/docs/prometheus/latest/querying/api/
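
A minimal sketch (assuming Prometheus is running locally on the default port):

  curl 'http://localhost:9090/api/v1/query?query=up'

This returns a JSON document whose data.result array can be mapped straight into a React charting component.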

--
Stuart Clark



Re: [prometheus-users] Re: evaluation interval cannot be recognized in seperate job

2022-08-12 Thread Stuart Clark

On 12/08/2022 08:29, nina guo wrote:
So with the push gateway or textfile collector, we also need to 
customize our metrics, am I right?


What do you mean?

--
Stuart Clark



Re: [prometheus-users] Re: evaluation interval cannot be recognized in seperate job

2022-08-12 Thread Stuart Clark

On 12/08/2022 08:09, nina guo wrote:
OK. So if I want to scrape the metrics at a 1 day interval, which way 
is better to implement?


Some options would be:

- Scrape it every 2 minutes instead of daily
- Use the textfile collector of the node exporter, with a scheduled job to update the file daily
- Use the push gateway with a scheduled job that updates the API daily

For the second two options you will lose the ability to use the "up" 
metric (as that will now refer to the node exporter/push gateway 
instead), but both add their own additional metrics containing timestamps 
of the last time the metric was updated.


--
Stuart Clark



Re: [prometheus-users] Re: evaluation interval cannot be recognized in seperate job

2022-08-12 Thread Stuart Clark

On 12/08/2022 06:46, nina guo wrote:
what i want to implement is to scrape and evaluate one kind of metrics 
not so frequently, I want to adjust the interval to 1d or 2d, 
something like this.


On Friday, August 12, 2022 at 11:06:15 AM UTC+8 nina guo wrote:

Hi, I received the following error.

      - job_name: TLS Connection
        scrape_interval: 1d
        evaluation_interval: 1d
        metrics_path: /probe
        params:
          module: [smtp_starttls]
        file_sd_configs:
        - files:xxx

 kubectl logs prometheus -c prometheus -n monitoring
level=error ts=2022-08-12T03:03:50.120Z caller=main.go:347 msg="Error loading config (--config.file=/etc/prometheus/prometheus.yml)" err="parsing YAML file /etc/prometheus/prometheus.yml: yaml: unmarshal errors:\n line 54: field evaluation_interval not found in type config.ScrapeConfig"


Two things here:

1. There is no entry called "evaluation_interval" within a scrape 
config, so that needs removing to clear the unmarshal error.


2. The maximum sensible scrape interval is around 2-3 minutes, so 1 day 
is far too long. With a longer interval you will end up with stale time 
series and "gaps" in all your graphs.
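
A hedged sketch of the corrected job (module and file_sd kept from the question; the target file path is hypothetical):

  - job_name: TLS Connection
    scrape_interval: 2m
    metrics_path: /probe
    params:
      module: [smtp_starttls]
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/tls.yml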


--
Stuart Clark



Re: [prometheus-users] node exporter text file collector

2022-08-08 Thread Stuart Clark

On 08/08/2022 09:58, nina guo wrote:
But the following 3 lines should be written to a file first, and the 
next run should then override the old content. But how do I make the old 
content be overridden by the new output?


print (" # HELP ldap_query_success LDAP query command", 
file=open("/var/log/node_exporter/filecollector/ldap_query.prom", "a+"))
 print (" # TYPE ldap_query_success gauge", 
file=open("/var/log/node_exporter/filecollector/ldap_query.prom", "a+"))
print 
('ldap_query_success'+'{'+'ldap_uri'+'='+service+','+'ldap_search_base'+'='+ldap_search_base+','+'} 
'+str(query_check), 
file=open("/var/log/node_exporter/filecollector/ldap_query.prom", "a+"))


File mode "a" will open for appending (so preserve anything already in 
the file). Instead to fully replace the file you'd need to use file mode 
"w".


--
Stuart Clark



Re: [prometheus-users] node exporter text file collector

2022-08-08 Thread Stuart Clark

On 2022-08-08 09:14, nina guo wrote:

Hi,

I used the following way to output the metrics and values to a file,
and let node exporter to scrape it.

I have a question here: how do I let the next metric value override the
previous one?

I checked the .prom file; there are some metrics with the same labels and
same label values but different metric values. That is not correct. The
next value of the metric should override the previous old value. But how
do I implement this?

print (" # HELP ldap_query_success LDAP query command",
file=open("/var/log/node_exporter/filecollector/ldap_query.prom",
"a+"))
 print (" # TYPE ldap_query_success gauge",
file=open("/var/log/node_exporter/filecollector/ldap_query.prom",
"a+"))
print
('ldap_query_success'+'{'+'ldap_uri'+'='+service+','+'ldap_search_base'+'='+ldap_search_base+','+'}
'+str(query_check),
file=open("/var/log/node_exporter/filecollector/ldap_query.prom",
"a+"))



You shouldn't be appending to an existing file, but instead replacing 
the file contents each time you do an update.


Depending on the tooling used it is often better to create a new file 
with a random non *.prom filename and then rename it into the correct 
name - renames are generally atomic whereas file updates often aren't, 
meaning if you modify the file directly you could end up reading part 
finished data.
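
Putting both pieces of advice together, a minimal Python sketch (the helper name is hypothetical; the path and labels follow the earlier snippet, and note that the label values are quoted, which the original print statements were missing):

  import os
  import tempfile

  def write_ldap_metric(path, service, ldap_search_base, query_check):
      # Write the whole file fresh each time, into a temp file in the
      # same directory, then atomically rename it over the old .prom file.
      fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
      with os.fdopen(fd, "w") as f:
          f.write("# HELP ldap_query_success LDAP query command\n")
          f.write("# TYPE ldap_query_success gauge\n")
          f.write('ldap_query_success{ldap_uri="%s",ldap_search_base="%s"} %s\n'
                  % (service, ldap_search_base, query_check))
      os.replace(tmp, path)  # atomic rename on POSIX

  write_ldap_metric("/var/log/node_exporter/filecollector/ldap_query.prom",
                    "ldaps://example", "dc=example,dc=com", 1)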


--
Stuart Clark



Re: [prometheus-users] synthetic histograms in Prometheus

2022-08-07 Thread Stuart Clark

On 07/08/2022 18:14, Johny wrote:
Gauge contains most recent values of a metric, sampled every 1 min or 
so, and exported by a user application, e.g. some latency sampled at 1 
minute intervals by a client application. Lets presume this time 
series (scraped by Prometheus or sent via remote write) is absolute 
containing all the information we need for calculating derived 
statistics. In the most raw form, you can fetch the data points, sort 
them and calculate percentile. Incidentally, legacy backend has 
efficient mechanisms to calculate percentiles by scanning and reducing 
data using map-reduce.


I'm presuming there are more than one request/event every minute or so?

If that is the case it would mean that you can't make a histogram that 
shows what you actually want to know. While in theory you could look at 
the 60 samples per hour and plot those on a histogram it would be pretty 
meaningless. If we assumed 1 request per second, sampling the latest 
latency value every minute would mean that 59/60 events are being 
discarded - so you have no idea what is actually happening from looking 
at that single sampled latency. Your samples could all be returning 
"low" values, which makes you believe that everything is working fine, 
but in actual fact the other 59 events per minute are "high" and you 
would never know.


This is the reason why histograms exist, and why more generally counters 
are more useful than gauges. A gauge can only tell you about "now" which 
may or may not be representative of what has actually been happening 
since the last scrape. A counter however will tell you the absolute 
change since the last scrape (e.g. the total number of requests since 
the previous scrape, or the sum of the latencies of all events since the 
scrape) meaning you never lose information (a counter that represents 
total latency won't let you know if there was one spike or everything 
was slow, but it will give you an average since the last scrape instead 
of losing data).


It would be worth understanding why you aren't able to produce a 
histogram in the application (or externally via processing an event 
feed, such as logs)? By design a simple histogram is pretty low impact, 
being a set of counters for each bucket.
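
For reference, a minimal sketch of client-side instrumentation with the Python client (the metric name and bucket boundaries are assumptions):

  from prometheus_client import Histogram

  REQUEST_LATENCY = Histogram(
      'http_duration_seconds', 'HTTP request latency',
      buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0))

  # Observe every request's latency, rather than sampling once a minute
  REQUEST_LATENCY.observe(0.123)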


--
Stuart Clark



Re: [prometheus-users] synthetic histograms in Prometheus

2022-08-07 Thread Stuart Clark

On 07/08/2022 08:23, Johny wrote:
We are migrating telemetry backend from legacy database to Prometheus 
and require estimating percentiles on gauge metrics published by user 
applications. Estimating percentiles on a gauge metric in Prometheus 
is not feasible and for a number of reasons, client applications will 
be difficult to modify to start publishing histograms.


I am exploring feasibility of creating a histogram in a recording rule 
in Prometheus based on the metrics published by users. The partial 
work put in so far seems inefficient, also illegible. Is there a 
recommended approach to solve this problem? As stated earlier, it will 
be extremely hard to solve the problem on the client side and I am 
looking for a solution within Prometheus.


Current metric is a gauge with values representing request latency.
http_duration_milliseconds_gauge{instance="instance1:port1"}[1h]
1659752188  100
1659752068  120
..
1659751708  150
1659751588  160


I'm not really sure what you are meaning by this metric?

A histogram of request latencies needs access to all the events that 
occur, with details of every single latency value. It can then increment 
the counter for a particular sot of range buckets to map the 
distribution over time. I don't really understand what the single gauge 
represents? Is that the latency of the most recent event? Some average 
over the last hour?


Without access to the underlying events I can't see how this can be 
possible - which is only possible in the application, or if you store 
events elsewhere (e.g. in log files) in a tool that connects to your 
event store system.


--
Stuart Clark



Re: [prometheus-users] Re: blackbox metrics scraping

2022-07-25 Thread Stuart Clark

On 2022-07-25 09:18, nina guo wrote:

Thank you Stuart.

May I ask why the maximum is 2.5 mins?



By default Prometheus will look back for a maximum of 5 minutes to find 
the "most recent" data point. Therefore if there was no data recorded in 
the past 5 minutes a "no value" would be returned, and you'd have gaps 
in your graphs. The recommended maximum of about 2-2.5 minutes is to 
allow for a single scrape failure not to result in gaps as well as all 
the various processing times to actually do the scrape.


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/aeae2e49c46a88f80c555ca4f46a99ec%40Jahingo.com.


Re: [prometheus-users] Re: blackbox metrics scraping

2022-07-25 Thread Stuart Clark

On 25/07/2022 08:28, nina guo wrote:
And one more question pls, I checked the log and the probe is sending 
every 2-3 seconds; can I adjust this frequency to about 1 min?
Yes that's the scrape frequency, so you can adjust the job configuration 
up to a maximum of about 2.5 minutes.
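
For example, the interval is set per scrape job in prometheus.yml - a 
minimal sketch (the job name and target are placeholders, and the usual 
blackbox exporter target relabelling is omitted):

scrape_configs:
  - job_name: 'blackbox'
    scrape_interval: 1m
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://example.com']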


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/4ae19a2c-8e90-bb7a-102d-f16ef8de0c8d%40Jahingo.com.


Re: [prometheus-users] Re: blackbox metrics scraping

2022-07-25 Thread Stuart Clark

On 25/07/2022 01:08, nina guo wrote:
Thank you Brian. "up to T - 5 minutes" - is this 5 mins the scraping 
interval?


No. The scraping interval doesn't matter. Prometheus will by default 
look back at most 5 minutes for a value.


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/ec5af887-abba-b67c-af27-1024aea82736%40Jahingo.com.


Re: [prometheus-users] deactive alert after hook

2022-07-24 Thread Stuart Clark

On 24/07/2022 11:10, Milad Devops wrote:

hi all
I use Prometheus to create alert rules and hook alerts using alertmanager.
My scenario is as follows:
- The log publishing service sends logs to Prometheus Exporter
- Prometheus takes the logs every second and matches them with our rules
- If the log applies to our rules, the alertmanager sends an alert to 
the frontend application. It also saves the alert in the elastic


My problem is that when sending each alert, all the previous alerts 
are also stored in Elastic in the form of a single log and sent to my 
front service as a notification (web hook).


Is there a way I can change the alert status to resolved after the 
hook so that it won't be sent again on subsequent hooks?

Or delete the previous logs completely after the hook from Prometheus
Or any other suggested way you have
Thank you in advance


I'm not sure I really understand what you are asking due to your 
mentioning of logs.


Are you saying that you are using an exporter (for example mtail) which 
is consuming logs and then generating metrics?


When you create an alerting rule in Prometheus it performs the PromQL 
query given, and if there are any results an alert is fired. Once the 
PromQL query stops returning results (or has a different set of time 
series being returned) the alert is resolved.


So for example if you had a simple query that said "alert if the number 
of error logs [stored in a counter metric] increases by 5 or more in the 
last 5 minutes" as soon as the metric returned an increase of at least 5 
over the last 5 minutes it would fire. It would then continue to fire 
until that is no longer true - so if the counter kept recording error 
log lines such that the increase was still over 5 per 5 minutes it would 
keep firing. It would only resolve once there were fewer than 5 new 
log lines recorded over the past 5 minutes.


Alertmanager just routes alerts that are generated within Prometheus to 
other notification/processing systems, such as email or webhooks. It 
would normally fire the webhook once the alert starts firing, and then 
periodically (if it keeps firing, at a configurable interval) and then 
finally (optionally) once it resolves. This is a one-way process - 
nothing about the notification has any impact on the alert firing or 
not. Only the PromQL query controls the alert.
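
As a concrete sketch of that example (the metric name and threshold are 
hypothetical):

groups:
  - name: log-alerts
    rules:
      - alert: ErrorLogRateHigh
        expr: increase(error_log_lines_total[5m]) >= 5
        labels:
          severity: warning
        annotations:
          summary: "5 or more error log lines in the last 5 minutes"

The alert fires while the expression returns results and resolves by 
itself as soon as it stops doing so - nothing the webhook receiver does 
can change that.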


I'm not sure if that helps.

--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/040d084b-4046-6bbf-3691-5c9bedd51343%40Jahingo.com.


Re: [prometheus-users] Prometheus agent mode flag

2022-07-21 Thread Stuart Clark

On 21/07/2022 11:31, ritesh patel wrote:

Hello Team,

I have Prometheus running as a Docker container. I want to 
use Prometheus in agent mode.
So can someone guide me where I need to put this flag 
--enable-feature=agent ?

In prometheus.yml or somewhere else?

On Kubernetes I did the same thing by passing this flag on the 
deployment, but where does it go with Docker?



You need to adjust the command line to add that.

How you do it depends on how you are managing your Docker containers. 
For example using Terraform, Docker Compose, etc. all have different 
methods for setting the command line for a container.
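
For example with Docker Compose it would look something like this (the 
image tag and config path are illustrative; agent mode also needs a 
remote_write destination in the config file):

services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--enable-feature=agent'

With a plain "docker run" the flags are simply appended after the image 
name.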


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/26de563a-bb0d-b3ad-363d-c4a3b0dfa9db%40Jahingo.com.


Re: [prometheus-users] Prom QL

2022-07-20 Thread Stuart Clark

On 20/07/2022 08:49, BHARATH KUMAR wrote:

Hello all,

I installed node exporters on many servers (around 300). A few of the 
servers are unreachable. Because of that, we are unable to get the 
CPU and memory values of those servers.


Now I want to add a filter in the Grafana dashboard to show the servers 
with the least and most CPU used. But due to unreachability, we are not 
getting values for a few servers.


My question is
"*how to compare the output of the Prometheus query is NULL"*

Generally, I am comparing the output of the prom query like
I) if the CPU usage is less than 10% then I am comparing like
query >=0<=10%
ii) if the CPU usage is greater than 10% and less than 30% then I am 
comparing like

query >10<=30
*similarly how to check the null values using the Prometheus query.*

For servers which can't be scraped there will be no metrics, so queries 
won't have any data to return.


However Prometheus itself creates certain metrics for all scrape 
targets, including one called "up" which is either 0 or 1 - where 0 
means the scrape failed. You can therefore create dashboards and alerts 
that list the servers which aren't accessible (up == 0).
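
For example, a minimal alerting rule along these lines (the "for" 
duration and annotation are just a sketch):

groups:
  - name: availability
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"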


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/53716e29-ceb2-7304-b003-17eba5262684%40Jahingo.com.


Re: [prometheus-users] Re: Https issue when using prometheus federation

2022-07-19 Thread Stuart Clark

On 19/07/2022 14:51, Shi Yan wrote:

Thanks, Brian for helping look into it.

Yes, in our setup, `another_prom_server` is deployed on the k8s 
cluster and it is behind an F5 ingress proxy, which terminates the TLS 
protocol. So we use HTTPS here.
And I've tried to add port 443 explicitly in the targets config, but 
the error is still the same.


msg="Scrape failed" err="Get 
\"https://example.com:443/federate?match%5B%5D=%7Bjob%3D%22jobname%22%7D\": 
read tcp x.x.x.x:58342->y.y.y.y:443: read: connection reset by peer"


While I can manually curl it with either
 > curl https://example.com
 Found

or the one with the exact URL parameters from the error msg.
 > curl 
'https://example.com:443/federate?match%5B%5D=%7Bjob%3D%22jobname%22%7D'

 .# can get all the metrics correctly


How long does it take curl to respond with all the metrics?

Could it be that it takes a while and your load balancer is configured 
with a shorter timeout?
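
An easy way to check is to time the request, e.g.:

curl -s -o /dev/null -w 'total: %{time_total}s\n' \
  'https://example.com:443/federate?match%5B%5D=%7Bjob%3D%22jobname%22%7D'

and compare the total against the idle/response timeout configured on 
the F5.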


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/e0c701db-9c73-becb-c9a6-743f6c7384b3%40Jahingo.com.


Re: [prometheus-users] Use remote-write instead of federation

2022-07-19 Thread Stuart Clark

On 19/07/2022 13:24, tejaswini vadlamudi wrote:
@Ben: Makes a point, but getting Thanos or Cortex into the picture 
could be a way forward after some time. For now, do you think it is 
good enough to use remote-write instead of federation?  From a 
performance and resource consumption POV, do you see remote-write as 
the way-forward?


With remote write you could use agent mode, so you don't have to have 
local storage other than for the destination instance.


However again it depends what you are trying to achieve and why you have 
suggested having four instances. Are you wanting to query all four 
instances or only the "global" one? Are you wanting to copy all data to 
the "global" instance or only some metrics? Every data point, or only at 
a lower frequency?


If you are intending to copy all data (both metrics & data points) that 
leans towards remote write as federation works differently. But in that 
case there doesn't seem to be any advantage in having the extra three 
instances at all (unless you are intending on doing local querying, 
alerting or recording rules) - so I'd just have a single instance that 
scrapes all namespaces.


Alternatively if you are needing to have separate instances with local 
storage/querying then I'd probably not look to copy all the data to the 
"global" instance (which just doubles storage and memory usage) and 
either use remote write for a much smaller subset of metrics, federation 
with a slower scrape rate/reduced set of metrics, or as Ben suggested 
something like Thanos (other options exist as well) to do away with the 
fourth instance entirely and distribute the queries to the individual 
instances instead.
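
As a sketch of the "smaller subset" option, remote write can filter 
with write_relabel_configs (the URL and the set of metric names to keep 
are placeholders):

remote_write:
  - url: "http://global-prom:9090/api/v1/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'job:.*|up'
        action: keep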


Maybe if you could explain a bit about what the design is hoping to 
achieve it would help us advise better?


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/a98eb9c0-21ce-ecac-7bb6-100b28d50986%40Jahingo.com.


Re: [prometheus-users] Best way to export status

2022-07-19 Thread Stuart Clark

On 19/07/2022 10:41, Roman Baeriswyl wrote:

Why not both:

idrac_amperage_probe_status{index="1",statusName="other",statusNumber="1"} 
0
idrac_amperage_probe_status{index="1",statusName="unknown",statusNumber="2"} 
0

idrac_amperage_probe_status{index="1",statusName="ok",statusNumber="3"} 1
idrac_amperage_probe_status{index="1",statusName="nonCriticalUpper",statusNumber="4"} 
0
idrac_amperage_probe_status{index="1",statusName="criticalUpper",statusNumber="5"} 
0
idrac_amperage_probe_status{index="1",statusName="nonRecoverableUpper",statusNumber="6"} 
0
idrac_amperage_probe_status{index="1",statusName="nonCriticalLower",statusNumber="7"} 
0
idrac_amperage_probe_status{index="1",statusName="criticalLower",statusNumber="8"} 
0
idrac_amperage_probe_status{index="1",statusName="nonRecoverableLower",statusNumber="9"} 
0
idrac_amperage_probe_status{index="1",statusName="failed",statusNumber="10"} 
0


This way, one can use the name or the number if that would be easier 
(for < or > checks).


The downside with numeric statuses is that you need more knowledge to 
use them compared with the label method. I have to know that 7 = unknown 
or 5 = too hot, etc.


That suggestion wouldn't actually help BTW as the statusNumber is a 
label so you could only use regex matches rather than >/<. If you wanted 
that as well you'd need a separate metric 
(idrac_amperage_probe_status_number or something) that has no labels and 
just the 1-10 value.
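
So the exposition might look something like this (values illustrative, 
keeping the index label so individual probes can still be told apart):

idrac_amperage_probe_status{index="1",statusName="ok",statusNumber="3"} 1
idrac_amperage_probe_status_number{index="1"} 3

which allows both label-based matching and numeric comparisons such as 
idrac_amperage_probe_status_number >= 4.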


The value of that purely numeric status metric also depends on what the 
status values actually are. It might be more useful for things which 
"progress" (good, poor, bad, broken) but probably not for statuses which 
are unrelated (network error, disk error, hardware fault, temperature 
error) as you are unlikely to use >/< comparisons.


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/ad64d7e2-5bb5-ab98-c968-67015a38ee1d%40Jahingo.com.


Re: [prometheus-users] Use remote-write instead of federation

2022-07-18 Thread Stuart Clark

On 18/07/2022 18:00, tejaswini vadlamudi wrote:

Hello Stuart,

I have the 4 Prometheus instances in the same cluster.

  * Instance-1, monitoring k8s & cadvisor
  * Instance-2, monitoring workload-1 in namespace-1
  * Instance-3, monitoring workload-2 in namespace-2
  * Instance-4 is the central one collecting metrics from all 3
instances (for global querying and alerting). not sure if the
federation is a good fit for this sort of deployment pattern.

What's the reason for having all the different instances? Are these all 
full instances of Prometheus (with local storage) or using agent mode?


If you are just going to copy everything to the "central" instance on 
the same cluster, why not just do without the extra three instances and 
have just the one instance that monitors everything?


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/d047d5f6-ad6b-a334-699d-8d7a4399e26a%40Jahingo.com.


Re: [prometheus-users] Use remote-write instead of federation

2022-07-18 Thread Stuart Clark

On 18/07/2022 17:21, tejaswini vadlamudi wrote:
Can someone point me to the advantages of using remote-write over 
federation?
I understand that remote-write is more of a standard interface in the 
monitoring domain.

Are there any handy performance measurements that were observed/recorded?


They are really quite different.

Federation is a way of pulling data from a remote Prometheus into 
(generally) a local one. The puller gets to choose how often to pull 
data and what data to fetch. If the puller can't fetch the data for any 
reason (local/remote outage, network issues, etc.) there will be gaps.
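
A typical federation job on the pulling side looks something like this 
(the match expression and target are placeholders):

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'
    static_configs:
      - targets: ['remote-prom:9090']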


Remote write is a way of pushing data from a Prometheus server to 
"something else", which could be another Prometheus or one of the many 
things which implement the API (e.g. various databases, Thanos, custom 
analytics tools, etc.). For these you get all the data (basically as 
soon as it has been scraped) with the ability to do filtering via 
relabelling. If there is an outage/disconnect, data will be queued for 
a while (too long an outage and data will be lost), so small issues can 
be handled transparently.


So you have a difference in what data you get - either all (filtered) 
data or data on a schedule (so in effect a form of built-in 
downsampling), and who controls that - either the data source Prometheus 
or the destination.


Which is "better" depends on what you are trying to achieve and the 
constraints you might have (for example difficulties with accepting 
network connections or data storage/transfer limits). Don't forget the 
organisation differences too - for remote write adding/changing a 
destination (or filter rules) needs changes to every data source 
Prometheus where federation is purely controlled at the other end, which 
might be a good or bad thing depending on team responsibilities/timings.


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/0a925bd8-4cbb-99ff-c372-311488751943%40Jahingo.com.


Re: [prometheus-users] Change response of Alert

2022-07-11 Thread Stuart Clark

On 2022-07-11 11:23, Test Kumar wrote:

Hi Team,

Is there a way I can change the response of the webhook alert? I need 
to work according to the response.



How do you mean?

Webhooks are connected to Alertmanager as a destination for alerts 
(depending on the routing rules). It is a one-way process - the webhook 
is triggered when an alert fires (and optionally clears). I'm not really 
sure what you are hoping for?


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6cf523898e331fbb2ffc39ff10b176f0%40Jahingo.com.


Re: [prometheus-users] Re: Alerts are getting auto resolved automatically

2022-07-05 Thread Stuart Clark
Two alerts suggest that the two instances aren't talking to each other. How 
have you configured them? Does the UI show the "other" instance? 

On 5 July 2022 08:34:45 BST, Venkatraman Natarajan  wrote:
>Thanks Brian. I have used last_over_time query in our expression instead of
>turning off auto-resolved.
>
>Also, we have two alert managers in our environment. Both are up and
>running. But Nowadays, we are getting two alerts from two alert managers.
>Could you please help me to sort this issue as well.?
>
>Please find the alert manager configuration.
>
>  alertmanager0:
>image: prom/alertmanager
>container_name: alertmanager0
>user: rootuser
>volumes:
>  - ../data:/data
>  - ../config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
>command:
>  - '--config.file=/etc/alertmanager/alertmanager.yml'
>  - '--storage.path=/data/alert0'
>  - '--cluster.listen-address=0.0.0.0:6783'
>  - '--cluster.peer={{ IP Address }}:6783'
>  - '--cluster.peer={{ IP Address }}:6783'
>restart: unless-stopped
>logging:
>  driver: "json-file"
>  options:
>max-size: "10m"
>max-file: "2"
>ports:
>  - 9093:9093
>  - 6783:6783
>networks:
>  - network
>
>Regards,
>Venkatraman N
>
>
>
>On Sat, Jun 25, 2022 at 9:05 PM Brian Candler  wrote:
>
>> If probe_success becomes non-zero, even for a single evaluation interval,
>> then the alert will be immediately resolved.  There is no delay on
>> resolving, like there is for pending->firing ("for: 5m").
>>
>> I suggest you enter the alerting expression, e.g. "probe_success == 0",
>> into the PromQL web interface (query browser), and switch to Graph view,
>> and zoom in.  If you see any gaps in the graph, then the alert was resolved
>> at that instant.
>>
>> Conversely, use the query
>> probe_success{instance="xxx"} != 0
>> to look at a particular timeseries, as identified by the label(s), and see
>> if there are any dots shown where the label is non-zero.
>>
>> To make your alerts more robust you may need to use queries with range
>> vectors, e.g. min_over_time(foo[5m]) or max_over_time(foo[5m]) or whatever.
>>
>> As a general rule though: you should consider carefully whether you want
>> to send *any* notification for resolved alerts.  Personally, I have
>> switched to send_resolved = false.  There are some good explanations here:
>>
>> https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped
>>
>> https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/
>>
>> You don't want to build a culture where people ignore alerts because the
>> alert cleared itself - or is expected to clear itself.
>>
>> You want the alert condition to trigger a *process*, which is an
>> investigation of *why* the alert happened, *what* caused it, whether the
>> underlying cause has been fixed, and whether the alerting rule itself was
>> wrong.  When all that has been investigated, manually close the ticket.
>> The fact that the alert has gone below threshold doesn't mean that this
>> work no longer needs to be done.
>>
>> On Saturday, 25 June 2022 at 13:27:22 UTC+1 v.ra...@gmail.com wrote:
>>
>>> Hi Team,
>>>
>>> We are having two prometheus and two alert managers in separate VMs as
>>> containers.
>>>
>>> Alerts are getting auto resolved even though the issues are there as per
>>> threshold.
>>>
>>> For example, if we have an alert rule called probe_success == 0 means it
>>> is triggering an alert but after sometime the alert gets auto-resolved
>>> because we have enabled send_resolved = true. But probe_success == 0 still
>>> there so we don't want to auto resolve the alerts.
>>>
>>> Could you please help us on this.?
>>>
>>> Thanks,
>>> Venkatraman N
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to prometheus-users+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/prometheus-users/68bff458-ee79-42ce-bafb-facd239e26aen%40googlegroups.com
>> 
>> .
>>
>
>-- 
>You received this message because you are subscribed to the Google Groups 
>"Prometheus Users" group.
>To unsubscribe from this group and stop receiving emails from it, send an 
>email to prometheus-users+unsubscr...@googlegroups.com.
>To view this discussion on the web visit 
>https://groups.google.com/d/msgid/prometheus-users/CANSgTEbTrr7Jjf_XwD0J8wgMAdiLg9g_MmWDK%3DpgkTjwMA5YZA%40mail.gmail.com.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an 

Re: [prometheus-users] AM Executor on Kubernetes Cluster

2022-07-05 Thread Stuart Clark
What do you mean by "AM executor"? Are you referring to the Alertmanager 
application?

If so, there are both community Helm charts and a Docker image available, so 
running within Kubernetes should be no problem. 

On 5 July 2022 08:03:45 BST, Test Kumar  wrote:
>Hi Team,
>
>I installed the AM executor as a stand-alone application.
>But I want to install it as a Kubernetes pod service same as Kubernetes 
>Prometheus.
>So is there a way to install AM executor on Kubernetes?
>
>Thanks & Regards,
>Test Kumar 
>
>-- 
>You received this message because you are subscribed to the Google Groups 
>"Prometheus Users" group.
>To unsubscribe from this group and stop receiving emails from it, send an 
>email to prometheus-users+unsubscr...@googlegroups.com.
>To view this discussion on the web visit 
>https://groups.google.com/d/msgid/prometheus-users/a8b46c22-160d-4c37-b600-7511b7433849n%40googlegroups.com.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/26D5B237-5E46-42AD-ACEE-414B71ACFF25%40Jahingo.com.


Re: [prometheus-users] Guaranteed ingestion of metrics with historical timestamps

2022-06-18 Thread Stuart Clark

On 14/06/2022 18:32, Jeremy Collette wrote:

Hello,

We have written a custom exporter that exposes metrics with explicit 
timestamps, which Prometheus periodically scrapes. In the case where 
Prometheus becomes temporarily unavailable, these metric samples will 
be cached in the exporter until they are scraped, causing affected 
metrics to age.


I understand that if a metric is older than a certain threshold, it 
will be rejected by Prometheus with the message: "Error on ingesting 
samples that are too old or are too far into the future".


I'm trying to understand if there are any guarantees surrounding the 
ingestion of historical metrics. Is there some metric sample age that 
is guaranteed to be recent enough to be ingested? For example, are 
samples with timestamps within the last hour always going to be 
considered recent? Within the last five minutes?


According to this previous thread: Error on ingesting samples that are 
too old 
<https://groups.google.com/g/prometheus-users/c/rKJYm6naEow/m/zylud_J4AAAJ>, 
MR seems to indicate that metrics as old as 1 second can be dropped 
due to being too old. Is this interpretation correct? If so, is there 
any way to ensure metrics with timestamps won't be dropped for being 
too old?


Timestamps in metrics are not something that should be used except in 
some very specific cases. The main use case for adding a 
timestamp is when you are scraping metrics into Prometheus that have 
been sourced from another existing metrics system (for example things 
like the Cloudwatch Exporter). You also mention something about your 
exporter caching things until they are scraped, which also sounds like 
something that is not advisable. The action of the exporter shouldn't 
really be changing depending on the requests being received (or not 
received).


An exporter is expected to return the various metrics that reflect 
"now", in the same way that a directly instrumented application would be 
expected to return the current state of the metrics being maintained in 
memory. For a simple exporter the normal mechanism is for a request to 
be received which then triggers some mechanism to generate the metrics. 
For example with something like the MySQL Exporter a request would 
trigger a query on the connected database which then returns various 
information that is converted into Prometheus metrics and returned. In 
some situations the process to fetch information from the underlying 
system can be quite resource intensive or slow. In that case a common 
design is to decouple the information fetching process from the request 
handling process. One example is to perform the information fetching 
process on a periodic timer, with the information fetched then stored in 
memory. The request process then reads and returns that information - 
returning the same values for every request until the next cycle of the 
information fetching process. In none of these standard scenarios would 
you expect timestamps to be attached to the returned metrics.


It would be good to hear a bit more about what you are trying to do, as 
it is highly likely that the use of timestamps in your use case is 
probably not the right option and they should just be dropped.


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/007cf462-d87e-d4c7-316e-4007567c74a1%40Jahingo.com.


Re: [prometheus-users] Does Prometheus recommend exposing 2M timeseries per scrape endpoint?

2022-06-14 Thread Stuart Clark

On 14/06/2022 12:32, tejaswini vadlamudi wrote:
Thanks Stuart, this expectation is coming as a legacy requirement for 
a Telco cloud-native application that has huge cardinality (3000 
values for a label-1) and little dimensionality (2 labels) for 300 
metrics.


Is there any recommendation like not more than 1k or 10k series per 
endpoint?


The general expectation is that each Prometheus server would have no 
more than a few million time series in total. The normal use case is to 
have 10s/100s of jobs & targets, each exposing 100s/1000s of time 
series, rather than a single target exposing a significant number of 
time series itself.


If this is a single application what is the need for that level of 
cardinality? For example something like a high volume e-commerce system 
with millions of users might only need labels with say a cardinality of 
a few 10s/100s each (for example HTTP status code, section of website, 
etc.). What is it about this system that you think it needs very high 
cardinality labels?


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/fba029fa-6fb0-4b26-daae-8fe4377ed3e6%40Jahingo.com.


Re: [prometheus-users] Does Prometheus recommend exposing 2M timeseries per scrape endpoint?

2022-06-14 Thread Stuart Clark

On 14/06/2022 12:13, tejaswini vadlamudi wrote:
I have a use case where a particular service (that can be horizontally 
scaled to desired replica count) exposes a 2 Million time series. 
Prometheus might expect huge resources to scrape such service (this is 
normal). But I'm not sure if there is a recommendation from the 
community on instrumentation best practices and maximum count to expose.



Two million time series returned from a single scrape request?

That's way out of the expected ballpark for Prometheus, and also sounds 
totally outside what I'd expect from any metrics system.


Would you be able to explain a bit more about what you are wanting to be 
able to achieve and we can suggest an alternative to Prometheus/metrics 
that would help?


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/001c5333-a851-c781-83a4-05391824323e%40Jahingo.com.


Re: [prometheus-users] Does Prometheus support Netapp Trident NFS storage backend?

2022-05-31 Thread Stuart Clark

On 2022-05-31 15:45, tejaswini vadlamudi wrote:

Hi Stuart,

I forgot to ask the most important question on this topic :-)
Could you explain the reason for not supporting NFS based storage in
Prometheus?



At its core Prometheus contains a high performance time-series database. 
Network filesystems just don't have the same performance characteristics 
or features as a direct local disk. It is similar to not expecting 
support for storing a MySQL/Oracle/MSSQL/etc. database on a network 
storage system - it might technically work, but you are likely to 
encounter problems at some point, and you are unlikely to receive any 
free/paid support for that architecture.


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/83d885d70a47b03fb24f22374c3d6618%40Jahingo.com.


Re: [prometheus-users] Does Prometheus support Netapp Trident NFS storage backend?

2022-05-03 Thread Stuart Clark

On 03/05/2022 01:34, tejaswini vadlamudi wrote:
Prometheus documentation recommends avoiding NFS storage backends. But 
a few users over internet claim support for NFS based storage from 
Netapp in Prometheus. Just checking for opinions and feedback if this 
storage backend is compatible enough for regular Prometheus operations.


NFS isn't a supported option for Prometheus. Prometheus can do a lot of 
I/O and therefore network storage often doesn't work all that well. You 
can also have issues if the filesystem isn't fully POSIX compliant.


Just because it isn't supported doesn't mean it won't work. However 
there are issues that people do find, and if you hit those the 
recommendation would always be to switch to a local filesystem. So 
basically you'd be largely on your own if you did have problems.


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/91e130f1-1a94-b31e-59e6-960aa2dfae37%40Jahingo.com.


Re: [prometheus-users] Running a prometheus on kubernetes in an offline env

2022-04-28 Thread Stuart Clark

On 2022-04-28 13:28, shiran vaturi wrote:

Hi guys, as the title suggests, I'm running Prometheus in an offline env
over Kubernetes.
I can't get metrics to show in Prometheus.

Your assistance will be much appreciated.



When you say offline I'm assuming you just mean that the environment 
has no Internet access? There is nothing within Prometheus which should 
require such access. What isn't working? What does the targets page 
show regarding the things you are scraping?


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9424de070068318323706b7da2411187%40Jahingo.com.


Re: [prometheus-users] Re: Run two node exporters on same server

2022-04-21 Thread Stuart Clark

On 2022-04-21 16:03, BHARATH KUMAR wrote:

thanks for your reply. I think we fixed some firewall issues and it is
now working fine for most servers. But we are still facing a new error:

Get "http://some_ip:port_number/metrics": dial tcp
some_ip:port_number: connect: connection refused

what could be the reason for this error?



That generally means that the connection is passing through the 
firewalls ok but the end server is then rejecting it. Usually because 
the port number is wrong or the service attached to that port isn't 
running. For containers it could mean the port hasn't been exposed to 
the outside host.


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/bbd74dc8fa03688e6714b86ae1f79100%40Jahingo.com.


Re: [prometheus-users] Facing 5m staleness issue even with 2.x

2022-04-19 Thread Stuart Clark

On 2022-04-19 08:58, Aniket Kulkarni wrote:

Hi,

I have referred to the links below:

I understand this was a problem with 1.x
https://github.com/prometheus/prometheus/issues/398

I also got this link as a solution
https://promcon.io/2017-munich/talks/staleness-in-prometheus-2-0/

No doubt it's a great session. But I am still not clear as to what
change I have to make and where?

I also couldn't find the prometheus docs useful for this.

I am using following tech stack:
Gatling -> graphite-exporter -> prometheus-> grafana.

I am still facing staleness issue. Please guide me on the solution or
any extra configuration needed?

I am using the default storage system by prometheus and not any
external one.



Could you describe a bit more of the problem you are seeing and what you 
are wanting to do?


All time series will be marked as stale if they have not been scraped 
for a while, which causes data to stop being returned by queries. This 
is important as things like labels will change over time (especially 
for things like Kubernetes which include pod names). It is expected 
that targets will be regularly scraped, so things shouldn't otherwise 
disappear (unless there is an error, which should be visible via 
something like the "up" metric).


As the standard staleness interval is 5 minutes it is recommended that 
the maximum scrape period should be no more than 2 minutes (to allow 
for a failed scrape without the time series being marked as stale).


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/a97e8140ebdc538c0553192e3dacaf71%40Jahingo.com.


Re: [prometheus-users] Backfilling data into Prometheus

2022-04-17 Thread Stuart Clark

On 12/04/2022 23:06, John Grieb wrote:
I am backfilling a month's worth (March 1st to 31st, 2022) of Zabbix 
trend data (hourly avg values) for a single metric (gauge) with a 
single label (Hostname). There are 746 datapoints in my OpenMetrics 
file which I'm converting to TSDB format using the command:


promtool tsdb create-blocks-from openmetrics 30030360463_history.txt

When I move the data into the Prometheus storage directory the first 
15 day and 17 hours of data are removed for some reason. Can anyone 
tell me why and what I have to do to keep all the data?


What have you set your Prometheus retention period to? By default it is 
15 days.
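
If that is the cause, raising it would look something like this (the 
value is illustrative - it needs to cover however far back the 
backfilled data goes):

./prometheus --storage.tsdb.retention.time=60d

A 15 day default would also explain why only roughly the most recent 15 
days of the backfilled month survived.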


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/30023ad3-bfcb-2c5a-07b8-0e9f2b574174%40Jahingo.com.


Re: [prometheus-users] Counter metric resets

2022-04-07 Thread Stuart Clark

On 07/04/2022 14:04, Yaron B wrote:

Hello,
we have a counter metric that counts each time a pod is doing a 
specific action.
I need to count how many times the pod (actually sum of all the pods 
from a certain deployment) did the action over 24 hours.
problem is, the pod is on spot, and when it gets restarted, the 
counter resets, so the metric might be 20 at 1:00, but at 2:00 it 
might be 3, so when I try to do delta, or sum over time, I am getting 
wrong results..

any ideas how can I get the real delta for the action in a 24 hours range?


Look at using rate() which handles counter resets. If you multiply the 
value produced by the time period it is over you would get the number of 
actions that occurred. Note that this will only ever be an estimate (for 
example you might not scrape a pod before it is destroyed, missing the 
detection of some actions) and will most likely not be an integer (due 
to the way interpolation happens).
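
For example, something along these lines (the metric name is 
hypothetical):

sum(increase(pod_action_total[24h]))

increase() is rate() multiplied by the range and handles counter resets 
in the same way, so it is usually the clearer way of expressing "how 
many over 24 hours".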


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2cf67da0-1566-3cdf-f467-8eda19ac7b9f%40Jahingo.com.


Re: [prometheus-users] Prometheus storage wrt Pushgateway metrics

2022-03-18 Thread Stuart Clark

On 18/03/2022 03:38, anuj tyagi wrote:

Hi All,

I have a question about a use case where we are pushing batch job 
metrics to Pushgateway.


Some of the job groups of those metrics are getting pushed to 
pushgateway every 24 hrs. So, metric values are updating once in a day.


There are other job groups pushing to pushgateway every 15 seconds, 
and updating metrics values every 15 seconds in Pushgateway.


Eg.
Backup_timestamp: x
Backup_files_count: 
so values are getting updated for the same metrics. All these requests 
are overwriting the metric values, so there is not much increase in 
storage over time.



Now, Prometheus is scraping all the jobs every 30 seconds. Even job 
groups whose metrics are pushed to Pushgateway once every 24 hrs are 
being scraped every 30 seconds.


Do you think scraping Pushgateway at such a short interval adds storage 
even though the metric values stay the same for 24 hrs?


For this reason, one way is to clean up Pushgateway jobs which are 
older than maybe a few seconds (like 50 seconds), so Prometheus will 
not scrape the job at all. This way I can save Prometheus storage and 
scraping effort?


Consider I'm pushing 10k metrics in total as part of different job 
groups. Half of those are getting pushed/updated to Pushgateway only 
once a day.


So, the question is how much it impacts Prometheus storage if 
Prometheus is scraping metrics from Pushgateway every 30 seconds with 
no change in value for 1 day.


The storage usage for a metric that isn't changing is next to nothing, 
so I wouldn't worry about it. What you describe would be exactly how I'd 
expect Pushgateway to be behaving - some metrics are updated more 
frequently and others less, but they are always there and being scraped 
at the same frequency.


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/93475233-e826-ffa9-67a2-828a1e202c97%40Jahingo.com.


Re: [prometheus-users] Scrape_interval

2022-02-28 Thread Stuart Clark

On 28/02/2022 11:41, BHARATH KUMAR wrote:

Hello all,

Q1) What will happen if we set the scrape interval to 5 minutes or 10 minutes?
Q2) I am observing some misbehavior when we set different scrape 
intervals. What is the reason behind that?


The standard maximum scrape interval is about 2 minutes, due to staleness 
- less frequent scrapes would likely cause series to be marked as stale 
and therefore disappear from alerts/dashboards.


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/0c4fa543-07ab-a119-5344-c10d514055a2%40Jahingo.com.


Re: [prometheus-users] Alertmanager Sendgrid WEB API Integration

2022-02-23 Thread Stuart Clark

On 22/02/2022 19:06, Dennis Naranjo wrote:


Does anyone here know if there is a way to integrate Alertmanager 
with the Sendgrid Web API?


Currently I'm using the SMTP API integration, but I'd like to take 
advantage of the WEB API instead


If you have the API details for the Sendgrid Web API then yes. There is 
the webhook mechanism for Alertmanager which allows you to send alerts 
to anything you want, via a small service/lambda that translates from 
the webhook API to the destination API. There are numerous such services 
that are available already (e.g. for sending alerts to MS Teams), but 
I'm not sure if a Sendgrid one has already been written.


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/521feaf9-eba6-2317-a93a-382c01dc37c1%40Jahingo.com.


Re: [prometheus-users] Target Server - Which Prometheus Server Is Scraping

2022-02-16 Thread Stuart Clark

On 16/02/2022 01:11, kekr...@gmail.com wrote:
Stuart, I am not sure I understand the log files question.  I am not 
aware of any log files related to the scrape itself.  We do have log 
files related to the exporters running on the server but they do not 
capture the scrapes.  I am trying to get details of what is going on 
on the target server itself, not so concerned about what the 
Prometheus server has log wise.



I was talking about logs from the application being scraped.

--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/bbc21e3c-3d86-d3d1-4573-871f019201a2%40Jahingo.com.


Re: [prometheus-users] Target Server - Which Prometheus Server Is Scraping

2022-02-15 Thread Stuart Clark

On 15/02/2022 22:29, kekr...@gmail.com wrote:
I am not seeing the frequency is more often than I expect.  I am being 
told a log file is being created by the scrapes in a temp directory 
every minute.  I am saying it is not Prometheus. So now i have to 
prove it is not Prometheus.


As an alternate solution, I am trying to use the Prometheus timestamp 
function on the metric being created by the scrape in Grafana to get 
the time history of the metric as proof.  The thought being the time 
difference between the metric history is 3 minutes.  But I am having 
trouble getting the value of the timestamp function to act as an epoch 
date. If I use the value returned in a web epoch translator, it 
translates to the correct date. If I multiply the value by 1000, as 
you do with every epoch date in Grafana, it actually multiplies the 
value rather than putting it in human-readable date format.
I'm not clear if you are getting logs from these requests or not? I'd 
expect any request logs to include the path being requested, time & 
source IP. What do you see?


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/e8c647e2-e295-9c59-01a7-9321b34d9962%40Jahingo.com.

