[prometheus-users] Prometheus AlertManager Alert Grouping

2020-06-17 Thread Zhang Zhao
Hi, I have a question about alert grouping in AlertManager. I integrated 
Prometheus alerts with ServiceNow via a webhook. I can see the events being 
captured on the ServiceNow side as below; however, each of those events 
contains multiple alerts. Is there a way to split them up so that one alert 
from Prometheus corresponds to one event in ServiceNow? I tried grouping by 
alertname and status, but it didn't work as expected, so it seems some other 
condition is needed in the group_by setting (my current config, and a sketch 
of what I am considering, are below). Thanks.
[image: image.png]


global:
  resolve_timeout: 5m
receivers:
- name: prometheus-snow
  webhook_configs:
  - url: "https://"
    http_config:
      basic_auth:
        username:
        password:
route:
  group_by: ['alertname','status']
  group_interval: 10m
  group_wait: 5m
  repeat_interval: 5m
  receiver: prometheus-snow
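
For reference, here is a sketch of what I am considering instead. As far as I 
can tell, group_by only looks at alert labels, so 'status' probably has no 
effect; and recent Alertmanager versions (0.16+, if I read the docs right) 
accept the special value '...' to group by all labels, which effectively 
gives one notification per alert:

route:
  receiver: prometheus-snow
  # '...' means "group by all labels": every distinct alert becomes its own
  # group, so each webhook call to ServiceNow carries a single alert.
  group_by: ['...']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

The timings above are only placeholders; I have not settled on them yet.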

{
  "receiver": "prometheus-snow",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "Critical_TEST",
        "cluster": "espr-aksepme-dev-westus-cluster-01",
        "endpoint": "https-metrics",
        "geo": "us",
        "instance": "172.25.33.132:10250",
        "job": "kubelet",
        "metrics_path": "/metrics",
        "namespace": "kube-system",
        "node": "aks-esprepmedv01-44274363-vmss00",
        "prometheus": "espr-prometheus-nonprod/prometheus-prometheus-oper-prometheus",
        "region": "westus",
        "service": "prometheus-operator-kubelet",
        "severity": "critical"
      },
      "annotations": {
        "message": "This is for ServiceNow integration testing."
      },
      "startsAt": "2020-06-14T17:42:40.558Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus-prometheus-oper-prometheus.espr-prometheus-nonprod:9090/graph?g0.expr=up+%3D%3D+0=1",
      "fingerprint": "52169c17bfa388eb"
    },
    {
      "status": "firing",
      "labels": {
        "alertname": "Critical_TEST",
        "cluster": "espr-aksepme-dev-westus-cluster-01",
        "endpoint": "https-metrics",
        "geo": "us",
        "instance": "172.25.33.132:10250",
        "job": "kubelet",
        "metrics_path": "/metrics",
        "namespace": "kube-system",
        "node": "aks-esprepmedv01-44274363-vmss00",
        "prometheus": "espr-prometheus-nonprod/prometheus-prometheus-oper-prometheus",
        "region": "westus",
        "service": "prometheus-prometheus-oper-kubelet",
        "severity": "critical"
      },
      "annotations": {
        "message": "This is for ServiceNow integration testing."
      },
      "startsAt": "2020-06-14T17:42:40.558Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus-prometheus-oper-prometheus.espr-prometheus-nonprod:9090/graph?g0.expr=up+%3D%3D+0=1",
      "fingerprint": "0f7d36efbba0e03c"
    },
    {
      "status": "firing",
      "labels": {
        "alertname": "Critical_TEST",
        "cluster": "espr-aksepme-dev-westus-cluster-01",
        "endpoint": "https-metrics",
        "geo": "us",
        "instance": "172.25.33.132:10250",
        "job": "kubelet",
        "metrics_path": "/metrics/cadvisor",
        "namespace": "kube-system",
        "node": "aks-esprepmedv01-44274363-vmss00",
        "prometheus": "espr-prometheus-nonprod/prometheus-prometheus-oper-prometheus",
        "region": "westus",
        "service": "prometheus-operator-kubelet",
        "severity": "critical"
      },
      "annotations": {
        "message": "This is for ServiceNow integration testing."
      },
      "startsAt": "2020-06-14T17:42:40.558Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus-prometheus-oper-prometheus.espr-prometheus-nonprod:9090/graph?g0.expr=up+%3D%3D+0=1",
      "fingerprint": "4f6c2a8be6e9985d"
    },
    {
      "status": "firing",
      "labels": {
        "alertname": "Critical_TEST",
        "cluster": "espr-aksepme-dev-westus-cluster-01",
        "endpoint": "https-metrics",
        "geo": "us",
        "instance": "172.25.33.132:10250",
        "job": "kubelet",
        "metrics_path": "/metrics/cadvisor",
        "namespace": "kube-system",
        "node": "aks-esprepmedv01-44274363-vmss00",
        "prometheus": "espr-prometheus-nonprod/prometheus-prometheus-oper-prometheus",
        "region": "westus",
        "service": "prometheus-prometheus-oper-kubelet",
        "severity": "critical"
      },

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 

[prometheus-users] prometheus snmp pre-collector

2020-06-17 Thread Виталий Ковалев
Hello. I use Prometheus + snmp_exporter to monitor network switches (D-Link 
and Huawei) over SNMP.
I have an issue with the Huawei switches: some of them cannot report port 
utilization directly, so I have to calculate it myself.
As far as I know, Prometheus cannot set per-job data retention (is that 
right?), so I decided to run a second Prometheus server that collects 
IfInOctets/IfOutOctets every 5 seconds and calculates the utilization (a 
rough sketch of what I mean is below the questions), with the first server 
then scraping the second one.
I don't need to keep IfInOctets/IfOutOctets for long, so I decided to set 
storage.tsdb.min-block-duration to 30m and storage.tsdb.retention.time to 
1d.
So, the questions are:
Is this the right way to do it?
What is the difference between storage.tsdb.min-block-duration and 
storage.tsdb.max-block-duration?
Will Prometheus calculate the utilization for metrics that are still in memory?
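
Here is the rough sketch of the utilization calculation I have in mind for 
the second server, assuming the ifHCInOctets and ifHighSpeed metric names 
that snmp_exporter's if_mib module produces (names and label matching may 
differ with a custom generator.yml):

groups:
- name: interface_utilization
  rules:
  # Inbound utilization as a percentage of interface speed.
  # rate() gives bytes/s, *8 converts to bits/s; ifHighSpeed is in Mbit/s.
  # An explicit on()/group_left match may be needed depending on the labels.
  - record: interface:if_in_utilization:percent
    expr: 100 * (8 * rate(ifHCInOctets[1m])) / (ifHighSpeed * 1e6)

The first server would then only scrape the recorded 
interface:if_in_utilization:percent series from the second one (for example 
via /federate).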

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/08d2ca92-8f2a-4bcc-a98a-3e6d4530e3d4o%40googlegroups.com.


Re: [prometheus-users] consul_exporter - network connections

2020-06-17 Thread Dennis Kelly
I have not yet; that was next on my list. I was more curious why 
consul_exporter would need so many connections to the same three servers 
for only 1,000 services (i.e. why not reuse a connection? why don't they 
close when done? most are in TIME_WAIT). 


On Tuesday, June 16, 2020 at 11:39:05 PM UTC-7 sup...@gmail.com wrote:

> Have you tried setting the --consul.request-limit to limit the number of 
> concurrent connections?
>
> On Wed, Jun 17, 2020 at 6:37 AM Dennis Kelly  wrote:
>
>> We have a consul cluster of 3 members and about 1k services. 
>> consul_exporter has been using significantly more CPU and is also logging 
>> this:
>>
>> level=error ts=2020-06-16T23:56:46.593Z caller=consul_exporter.go:400 
>> msg="Failed to query service health" err="Get 
>> \"http://consul.service:8500/v1/health/service/[service 
>> name]?stale= 
>> \":
>>  
>> context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
>>
>> It is running as a docker container in Nomad. I bumped the CPU resource 
>> from the default to 900 MHz and also the consul.timeout to 2s. This has 
>> improved things, but we still sporadically receive this error. I haven't 
>> had a chance to dig through the entire source yet, but am wondering why 
>> consul_exporter has so many open connections to the same 3 consul servers:
>>
>> $ netstat | grep :8500 | wc -l
>>
>> 13653
>>
>> Why would the connections remain, and also if they do remain, not reused? 
>> I suspect we may be hitting up against this issue, but hoping for further 
>> clarification:
>>
>> https://github.com/prometheus/consul_exporter/issues/102
>>
>> Thanks!
>>
>> Dennis
>>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to prometheus-use...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/prometheus-users/ece427fb-99ea-4deb-a99c-60707f2c807dn%40googlegroups.com
>>  
>> 
>> .
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/4927e749-b9d9-4c97-a2fe-d78fca3810fen%40googlegroups.com.


Re: [prometheus-users] Re: Node Exporter

2020-06-17 Thread Ben Kochie
Yes, we use recording rules to represent node memory utilization.

https://gitlab.com/gitlab-com/runbooks/-/blob/1682a5632f3eaf0548dfa8277a421de2aff24245/rules/node.yml#L91-106

On Wed, Jun 17, 2020 at 6:14 PM Christian Hoffmann <
m...@hoffmann-christian.info> wrote:

> Hi,
>
> On 6/17/20 4:44 PM, Yasmine Mbarek wrote:
> > I have a tiny problem with node exporter. If you can help me I will be
> > very grateful .
> > So my node exporter implemented in my parc of machines , for some
> > machine it works fine and returns all metrics values but in other
> > machine it returns everything but "RAM Used" , the difference between
> > the working machines and the others is that : on the machines that RAM
> > used is working fine , the OS is redhat 7
> > the rest of machines Redhat 6 everything is working but the RAM used
> > returns NO DATA
> > Is there any  explanation ??
>
> I assume you are using Grafana with some kind of node_exporter dashboard?
>
> You would have to look into the actual queries to see what's causing this.
>
> My guess: The dashboard most likely uses the MemAvailable metric (which
> comes from /proc/meminfo). This is exposed by the Linux kernel, but only
> after some specific version. RHEL6 does not expose it.
>
> There are ways to calculate something similar for RHEL6 (the "free"
> command line tool has some logic for this).
>
> I think GitLab had a public Prometheus config with a recording rule,
> maybe it was this one:
>
>
> https://gitlab.com/gitlab-org/gitlab-foss/-/commit/e91c7469ad0be5f429548d4142ca93c17ec9e71e
>
> You would have to set up such a recording rule and would have to modify
> your dashboard accordingly.
>
> As an alternative: Try talking your administrators into abandoning
> RHEL6. It'll be out of support at the end of the year anyway, IIRC. :)
>
> Kind regards,
> Christian
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to prometheus-users+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/6fbaa443-48f3-1bd0-962d-490e4cde4b80%40hoffmann-christian.info
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CABbyFmpC_vTZtq5o1gzBeMG%3Dt6H4MVrg%2Bt34%2BzK2S_6OZu4EJg%40mail.gmail.com.


[prometheus-users] Re: Node Exporter

2020-06-17 Thread Yasmine Mbarek
Thank you very much, I will study your solution and I hope it will be 
beneficial :)))

On Wednesday, June 17, 2020 at 3:44:19 PM UTC+1, Yasmine Mbarek wrote:
>
> Hello , 
> I have a tiny problem with node exporter. If you can help me I will be 
> very grateful .
> So my node exporter implemented in my parc of machines , for some machine 
> it works fine and returns all metrics values but in other machine it 
> returns everything but "RAM Used" , the difference between the working 
> machines and the others is that : on the machines that RAM used is working 
> fine , the OS is redhat 7 
> the rest of machines Redhat 6 everything is working but the RAM used 
> returns NO DATA 
> Is there any  explanation ??
> Thank you .
>
> ᐧ
>
> Le mar. 16 juin 2020 à 16:55, Yasmine Mbarek  a 
> écrit :
>
>> Hello , 
>> I have a tiny problem with node exporter. If you can help me I will be 
>> very grateful .
>> So my node exporter implemented in my parc of machines , for some machine 
>> it works fine and returns all metrics values but in other machine it 
>> returns everything but "RAM Used" , the difference between the working 
>> machines and the others is that : on the machines that RAM used is working 
>> fine , the OS is redhat 7 
>> the rest of machines Redhat 6 everything is working but the RAM used 
>> returns NO DATA 
>> Is there any  explanation ??
>> Thank you .
>>
>> -- 
>>
>> أطيب التحيات  / Cordialement / Best regards,
>> *--*
>> *Yasmine MBAREK*
>> *Emails : **yasmine.mba...@tek-up.de *
>>   *bm.yasmi...@gmail.com *
>> *Phone : **+216 52 94 61 38 *
>>
>>
>>
>> ᐧ
>>
>
>
> -- 
>
> أطيب التحيات  / Cordialement / Best regards,
> *--*
> *Yasmine MBAREK*
> *Emails : **yasmine.mba...@tek-up.de *
>   *bm.yasmi...@gmail.com *
> *Phone : **+216 52 94 61 38 *
>
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9dc88640-a29e-4162-b5ab-bffa346c14b9o%40googlegroups.com.


Re: [prometheus-users] Windows Metrics

2020-06-17 Thread Christian Hoffmann
Hi,

On 6/16/20 4:10 PM, Freddy Mack wrote:
> Hello Chris,
> 
> I am looking for the Metrics to show in Grafana like the below example
> which I have executed in Linux, Want the same MEtrics for windows:
> For example I have for Memory
> 100 * (windows_os_physical_memory_free_bytes{instance=~"$instance"}) /
> windows_cs_physical_memory_bytes{instance=~"$instance"}

I suggest either working based on existing examples (such as the
referenced dashboard) or creating the queries yourself.

The windows_exporter collectors seem to be documented just fine:
https://github.com/prometheus-community/windows_exporter/blob/master/docs/collector.cpu.md
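
As a starting point, the other panels usually boil down to queries along 
these lines (a sketch only, based on the cpu, logical_disk and service 
collectors; double-check the metric names against the docs above for your 
windows_exporter version):

CPU utilization (%):
100 - 100 * avg by (instance) (rate(windows_cpu_time_total{mode="idle", instance=~"$instance"}[5m]))

File system utilization (%), per volume:
100 * (1 - windows_logical_disk_free_bytes{instance=~"$instance"} / windows_logical_disk_size_bytes{instance=~"$instance"})

Service up/down (1 = running):
windows_service_state{instance=~"$instance", state="running"}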

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/892206e7-d889-af78-7545-7d4836469278%40hoffmann-christian.info.


Re: [prometheus-users] Re: Node Exporter

2020-06-17 Thread Christian Hoffmann
Hi,

On 6/17/20 4:44 PM, Yasmine Mbarek wrote:
> I have a tiny problem with node exporter. If you can help me I will be
> very grateful .
> So my node exporter implemented in my parc of machines , for some
> machine it works fine and returns all metrics values but in other
> machine it returns everything but "RAM Used" , the difference between
> the working machines and the others is that : on the machines that RAM
> used is working fine , the OS is redhat 7 
> the rest of machines Redhat 6 everything is working but the RAM used
> returns NO DATA 
> Is there any  explanation ??

I assume you are using Grafana with some kind of node_exporter dashboard?

You would have to look into the actual queries to see what's causing this.

My guess: The dashboard most likely uses the MemAvailable metric (which
comes from /proc/meminfo). This is exposed by the Linux kernel, but only
after some specific version. RHEL6 does not expose it.

There are ways to calculate something similar for RHEL6 (the "free"
command line tool has some logic for this).
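
A rough sketch of such a recording rule, using the classic
free + buffers + cached approximation (node_exporter >= 0.16 metric names
assumed; it is only an estimate, not identical to the kernel's own
MemAvailable calculation):

groups:
- name: node-memory
  rules:
  # Approximate "available" memory on kernels that don't expose MemAvailable.
  - record: instance:node_memory_available_approx_bytes
    expr: node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes

You would then point the dashboard's memory panel at that recorded series
for the RHEL6 hosts.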

I think GitLab had a public Prometheus config with a recording rule,
maybe it was this one:

https://gitlab.com/gitlab-org/gitlab-foss/-/commit/e91c7469ad0be5f429548d4142ca93c17ec9e71e

You would have to set up such a recording rule and would have to modify
your dashboard accordingly.

As an alternative: Try talking your administrators into abandoning
RHEL6. It'll be out of support at the end of the year anyway, IIRC. :)

Kind regards,
Christian

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6fbaa443-48f3-1bd0-962d-490e4cde4b80%40hoffmann-christian.info.


[prometheus-users] Metrics Deduplication

2020-06-17 Thread Adso Castro
Hey all,

I have a question out of curiosity:

I have 3 Prometheus replicas running under my Prometheus-Operator (plus 
Thanos) inside a Kubernetes cluster. When I query something within the 
range of 3 or 6+ hours, I get the same metric three times, once per replica.
Is that correct, or should I be getting a single series already deduplicated 
by Thanos? (My current understanding is sketched below.)
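
My understanding is that deduplication happens in Thanos Query via a replica 
label, something like the sketch below (assuming the replicas carry a 
prometheus_replica external label; the store endpoint is a placeholder):

thanos query \
  --http-address=0.0.0.0:10902 \
  --store=<sidecar-or-store-endpoint>:10901 \
  --query.replica-label=prometheus_replica

Is that what I am missing, or should the Prometheus-Operator/Thanos defaults 
already take care of it?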

Thank you.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/e94d30b8-6060-4a2c-94ed-d2cb6e624fc5o%40googlegroups.com.


[prometheus-users] Re: Windows Metrics

2020-06-17 Thread Freddy Mack
Can I get some help, please?

On Monday, June 15, 2020 at 3:44:28 PM UTC-5, Freddy Mack wrote:
>
> Can I have scripts/Metrics for Windows clients in Grafana
> § CPU utilization   
> § File System utilization   
> § System messages (ex. Errors in /var/log/messages) Any critical 
> errors reported in logs
> § Disk INODE utilization   
> § System process monitoring (ex. Set of services on a server if they are 
> running or not) Alarm if services are not running.
> For Memory I got this executed fine :
> 100 * (windows_os_physical_memory_free_bytes{instance=~"$instance"}) / 
> windows_cs_physical_memory_bytes{instance=~"$instance"}
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/332e2670-fc15-4e84-9a75-24def62ef350o%40googlegroups.com.


[prometheus-users] Windows Metrics for Grafana 7.0 - Help Please

2020-06-17 Thread Freddy Mack
Can I have scripts/metrics for Windows clients in Grafana for the following?
§ CPU utilization
§ File system utilization
§ System messages (e.g. errors in /var/log/messages): any critical 
errors reported in logs
§ Disk inode utilization
§ System process monitoring (e.g. a set of services on a server, whether they 
are running or not): alarm if any of the services are not running.
For memory I got this to execute fine:
100 * (windows_os_physical_memory_free_bytes{instance=~"$instance"}) / 
windows_cs_physical_memory_bytes{instance=~"$instance"}

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/305a49a4-5644-4f64-a48a-f454ac809baao%40googlegroups.com.


[prometheus-users] RAM used No DATA

2020-06-17 Thread Yasmine Mbarek
Hello , 
I have a tiny problem with node exporter. If you can help me I will be very 
grateful .
So my node exporter implemented in my parc of machines , for some machine 
it works fine and returns all metrics values but in other machine it 
returns everything but "RAM Used" , the difference between the working 
machines and the others is that : on the machines that RAM used is working 
fine , the OS is redhat 7 
the rest of machines Redhat 6 everything is working but the RAM used 
returns NO DATA 
Is there any  explanation ??
Thank you .

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/4a76cb80-bfab-43ba-a438-4415970ff779o%40googlegroups.com.


[prometheus-users] Re: Node Exporter

2020-06-17 Thread Yasmine Mbarek
Hello ,
I have a tiny problem with node exporter. If you can help me I will be very
grateful .
So my node exporter implemented in my parc of machines , for some machine
it works fine and returns all metrics values but in other machine it
returns everything but "RAM Used" , the difference between the working
machines and the others is that : on the machines that RAM used is working
fine , the OS is redhat 7
the rest of machines Redhat 6 everything is working but the RAM used
returns NO DATA
Is there any  explanation ??
Thank you .

ᐧ

Le mar. 16 juin 2020 à 16:55, Yasmine Mbarek  a
écrit :

> Hello ,
> I have a tiny problem with node exporter. If you can help me I will be
> very grateful .
> So my node exporter implemented in my parc of machines , for some machine
> it works fine and returns all metrics values but in other machine it
> returns everything but "RAM Used" , the difference between the working
> machines and the others is that : on the machines that RAM used is working
> fine , the OS is redhat 7
> the rest of machines Redhat 6 everything is working but the RAM used
> returns NO DATA
> Is there any  explanation ??
> Thank you .
>
> --
>
> أطيب التحيات  / Cordialement / Best regards,
> *--*
> *Yasmine MBAREK*
> *Emails : **yasmine.mba...@tek-up.de *
>   *bm.yasmi...@gmail.com *
> *Phone : **+216 52 94 61 38 *
>
>
>
> ᐧ
>


-- 

أطيب التحيات  / Cordialement / Best regards,
*--*
*Yasmine MBAREK*
*Emails : **yasmine.mba...@tek-up.de *
  *bm.yasmi...@gmail.com *
*Phone : **+216 52 94 61 38 *

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CAC1sCO1sZs-ZngGLMeH2F5uk1E3Z1CzJNWiJ2XhRbmiKuvRd_w%40mail.gmail.com.


[prometheus-users] How to fill the external labels in prometheus.yml

2020-06-17 Thread Hari Yada
Hi Experts

In the prometheus.yaml file I would like to fill the external labels with the 
cluster and pod name. What is the ideal way to achieve this? One idea I have 
is sketched below the snippet.

external_labels:
  cluster: $(CLUSTER_NAME)
  replica: $(POD_NAME)
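
One approach I have seen (sketch only, assuming envsubst is available in the 
image and that CLUSTER_NAME / POD_NAME are set on the pod, e.g. via the 
downward API) is to template the file at startup, since as far as I can tell 
Prometheus does not expand environment variables in prometheus.yml by itself:

# prometheus.yml.tpl
global:
  external_labels:
    cluster: ${CLUSTER_NAME}
    replica: ${POD_NAME}

# container entrypoint / init step
envsubst < /etc/prometheus/prometheus.yml.tpl > /etc/prometheus/prometheus.yml
exec /bin/prometheus --config.file=/etc/prometheus/prometheus.yml

Is that the recommended way, or is there something built in that I am missing?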


-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/bd33c95f-5fa1-4832-adf5-5e469a3e31c5o%40googlegroups.com.


Re: [prometheus-users] block size calculation behaves unintuitively when only using size based retention

2020-06-17 Thread Brian Candler
If you add --storage.tsdb.max-block-duration=1d does this solve your 
problem?

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/60737a1c-acd2-4493-8f65-41b95a2bb434o%40googlegroups.com.


[prometheus-users] Using JMX Exporter to export Cassandra's Metrics.

2020-06-17 Thread Yagyansh S. Kumar
Hi. I need to export metrics from Cassandra, and there have been mixed 
suggestions about whether to use the JMX Exporter or one of the many 
standalone Cassandra exporters. Which is the correct way to go? Also, are 
there any known issues with the JMX Exporter; does it hamper Cassandra's 
performance or the VMs in any way? The agent setup I have in mind is 
sketched below.
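
From what I have read so far, the JMX Exporter route would mean attaching its 
Java agent to Cassandra's JVM, roughly like the line below (sketch only; the 
jar path, port and config path are placeholders I made up):

# added to cassandra-env.sh
JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus_javaagent.jar=7070:/opt/jmx_exporter_cassandra.yml"

Is that the recommended setup, and does the agent add noticeable overhead to 
the JVM?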

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6c04bea4-3f91-48ea-99b2-b0a77e52da7ao%40googlegroups.com.


Re: [prometheus-users] block size calculation behaves unintuitively when only using size based retention

2020-06-17 Thread Dom Prittie
FYI - it appears someone else has run into this issue before:
https://github.com/prometheus/prometheus/issues/6857

On Wed, 17 Jun 2020 at 09:00, Dom Prittie 
wrote:

> The issue is with the size of the blocks being created and how the block
> size calculation appears to behave differently for time based vs size based
> retention settings. Just before we hit our retention limit (2.5T) we can
> query ~30 days of metrics, when we hit the limit the oldest block is
> removed and the size of that block often uncomfortably large. The example
> block I gave in my first email is 1.1T which represents about 14 days of
> data, so when that block is the oldest and we hit our size based retention
> limit 14 days of data will disappear at once, because Prometheus removes
> whole blocks at a time.
>
> The thing that is odd is that I expect the behaviour around retention
> limits to be the same regardless of whether you are using just time based
> retention, just size based retention, or a combination of both. If I had
> used a time based retention of 30 days (currently roughly equivalent to my
> size limit of 2.5T) then instead of losing 14 days worth of data when we
> hit the limit we would lose 3 days of data or ~ 10%.
>
> We don't wish to use time based retention at the moment because the
> amount of samples we ingest daily is in flux. So how can I ensure we get
> reasonably sized blocks (blocks <= 10% of storage.tsdb.retention.size)?
>
> On Wed, 17 Jun 2020 at 08:28, Stuart Clark 
> wrote:
>
>> On 2020-06-17 08:23, Dom Prittie wrote:
>> > Hi,
>> >
>> > I have a Prometheus deployment for which we have only used size based
>> > retention, which we have set to 2.5T. We regularly see unexpectedly
>> > large blocks getting created, so that when we hit
>> > storage.tsdb.retention.size we see a huge drop in the metrics
>> > available for querying.
>> >
>> > For instance we currently have a block which is 1.1T and contains data
>> > from Mon 1 Jun 01:00:00 BST 2020 to Sun 14 Jun 13:00:00 BST 2020,
>> > which is ~45% of our retention size!
>> >
>> > I can see from the docs that "Compaction will create larger blocks up
>> > to 10% of the retention time, or 31 days, whichever is smaller". I
>> > expected this to mean 10% of whatever retention I set, but from what I
>> > have seen it looks like this means the max block size will always be
>> > 31 days if you have storage.tsdb.retention.time=0s and are just using
>> > size based retention.
>> >
>> > We have elected to use sized based retention instead of time because
>> > we are in the process of onboarding application exporters so the
>> > number of days we can retain is decreasing frequently, but the amount
>> > of space we have available is not. In this situation what would be the
>> > best way to configure storage.tsdb.max-block-duration?
>> >
>>
>> What is the exact issue you are seeing, as it sounds like the storage
>> usage is still within the 2.5T limit you set?
>>
>> --
>> Stuart Clark
>>
>

-- 


This e-mail together with any attachments (the "Message") is confidential 
and may contain privileged information. If you are not the intended 
recipient or if you have received this e-mail in error, please notify the 
sender immediately and permanently delete this Message from your system. Do 
not copy, disclose or distribute the information contained in this Message.


_Maven Investment Partners Ltd (No. 07511928), _Maven Investment Partners 
US Ltd (No. _11494299), Maven Europe Ltd (No. 08966), Maven Derivatives 
Asia Limited (No.10361312) & Maven Securities Holding Ltd (No. 07505438) 
are registered as companies in England and Wales and their registered 
address is Level 3, 6 Bevis Marks, London EC3A 7BA, United Kingdom. The 
companies’ VAT No. is 135539016. Maven Asia (Hong Kong) Ltd (No. 2444041) 
is registered in Hong Kong and its registered address is 20/F, Tai Tung 
Building, 8 Fleming Road, Wan Chai, Hong Kong. Maven Europe Ltd is 
authorised and regulated by the Financial Conduct Authority (FRN:770542). 
Maven Asia (Hong Kong) Ltd is registered and regulated by the Securities 
and Futures Commission (CE No: BJF060).___

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CAPJy6TME3UiT%3DVbpOqiZLxACaP0k66fU0eMsmDhtCxQi%3Dy-25Q%40mail.gmail.com.


Re: [prometheus-users] Resources limits

2020-06-17 Thread Ben Kochie
The standard approach for larger setups is to start sharding Prometheus. In
Kubernetes it's common to have a Prometheus-per-namespace.

You may also want to look into how many metrics each of your pods is
exposing. 20GB of memory indicates that you probably have over 1M
prometheus_tsdb_head_series

Changing the scrape interval is probably not going to help as much as
reducing your cardinality per Prometheus.

For example, we have a couple different shards. One is using 33GB of memory
and managing 1.5M series. The other shard is 38GB and managing 2.5M series.
We allocate 64GB memory instances for these servers.

If you don't want to go down the sharding route, you'll likely need some
larger nodes to run Prometheus on.
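
A quick way to check where you stand (these are standard self-metrics and
plain PromQL, so they should work on any recent Prometheus):

Total series currently in the head block:
prometheus_tsdb_head_series

Top 10 metric names by series count (can be expensive on a large server):
topk(10, count by (__name__) ({__name__=~".+"}))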

On Wed, Jun 17, 2020 at 9:48 AM Tomer Leibovich 
wrote:

> Thanks, so if I cannot reduce the amount of pods, it’s better to change
> the scraper interval from default of 30s to 60s?
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to prometheus-users+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/71fc37fc-4e4f-4a14-9fdb-67ef49e5f661o%40googlegroups.com
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CABbyFmq0jPsn3NbVDwQh4iPBAvjMwf9ypvpHs4va_nezTm%3D_jw%40mail.gmail.com.


Re: [prometheus-users] Resources limits

2020-06-17 Thread Brian Candler
There's a calculator here:
https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion

You can see from this how much difference increasing the scrape interval 
would make.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/e608eb59-419b-4aa3-a43e-43007a64d18eo%40googlegroups.com.


Re: [prometheus-users] block size calculation behaves unintuitively when only using size based retention

2020-06-17 Thread Dom Prittie
The issue is with the size of the blocks being created and how the block
size calculation appears to behave differently for time based vs size based
retention settings. Just before we hit our retention limit (2.5T) we can
query ~30 days of metrics; when we hit the limit the oldest block is
removed, and the size of that block is often uncomfortably large. The example
block I gave in my first email is 1.1T which represents about 14 days of
data, so when that block is the oldest and we hit our size based retention
limit 14 days of data will disappear at once, because Prometheus removes
whole blocks at a time.

The thing that is odd is that I expect the behaviour around retention
limits to be the same regardless of whether you are using just time based
retention, just size based retention, or a combination of both. If I had
used a time based retention of 30 days (currently roughly equivalent to my
size limit of 2.5T) then instead of losing 14 days worth of data when we
hit the limit we would lose 3 days of data or ~ 10%.

We don't wish to use time based retention at the moment because the
amount of samples we ingest daily is in flux. So how can I ensure we get
reasonably sized blocks (blocks <= 10% of storage.tsdb.retention.size)?

On Wed, 17 Jun 2020 at 08:28, Stuart Clark  wrote:

> On 2020-06-17 08:23, Dom Prittie wrote:
> > Hi,
> >
> > I have a Prometheus deployment for which we have only used size based
> > retention, which we have set to 2.5T. We regularly see unexpectedly
> > large blocks getting created, so that when we hit
> > storage.tsdb.retention.size we see a huge drop in the metrics
> > available for querying.
> >
> > For instance we currently have a block which is 1.1T and contains data
> > from Mon 1 Jun 01:00:00 BST 2020 to Sun 14 Jun 13:00:00 BST 2020,
> > which is ~45% of our retention size!
> >
> > I can see from the docs that "Compaction will create larger blocks up
> > to 10% of the retention time, or 31 days, whichever is smaller". I
> > expected this to mean 10% of whatever retention I set, but from what I
> > have seen it looks like this means the max block size will always be
> > 31 days if you have storage.tsdb.retention.time=0s and are just using
> > size based retention.
> >
> > We have elected to use sized based retention instead of time because
> > we are in the process of onboarding application exporters so the
> > number of days we can retain is decreasing frequently, but the amount
> > of space we have available is not. In this situation what would be the
> > best way to configure storage.tsdb.max-block-duration?
> >
>
> What is the exact issue you are seeing, as it sounds like the storage
> usage is still within the 2.5T limit you set?
>
> --
> Stuart Clark
>

-- 


This e-mail together with any attachments (the "Message") is confidential 
and may contain privileged information. If you are not the intended 
recipient or if you have received this e-mail in error, please notify the 
sender immediately and permanently delete this Message from your system. Do 
not copy, disclose or distribute the information contained in this Message.


_Maven Investment Partners Ltd (No. 07511928), _Maven Investment Partners 
US Ltd (No. _11494299), Maven Europe Ltd (No. 08966), Maven Derivatives 
Asia Limited (No.10361312) & Maven Securities Holding Ltd (No. 07505438) 
are registered as companies in England and Wales and their registered 
address is Level 3, 6 Bevis Marks, London EC3A 7BA, United Kingdom. The 
companies’ VAT No. is 135539016. Maven Asia (Hong Kong) Ltd (No. 2444041) 
is registered in Hong Kong and its registered address is 20/F, Tai Tung 
Building, 8 Fleming Road, Wan Chai, Hong Kong. Maven Europe Ltd is 
authorised and regulated by the Financial Conduct Authority (FRN:770542). 
Maven Asia (Hong Kong) Ltd is registered and regulated by the Securities 
and Futures Commission (CE No: BJF060).___

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CAPJy6TN2QKD4PnoRqyzG5UKHxP-qEwCX6GaHLP2cD%2BiMdyYqVg%40mail.gmail.com.


Re: [prometheus-users] Resources limits

2020-06-17 Thread Stuart Clark

On 2020-06-17 08:34, Tomer Leibovich wrote:

I'm using Prometheus-Operator in my cluster and encountered an issue
with the Prometheus pod, which consumed 20GB of RAM when my cluster grew to
400 pods; eventually Prometheus choked the server and I had to terminate it.
How much memory should I allocate to the pod in order to keep it
running and avoid letting it grow like that again?


The amount of memory needed depends on the scrape interval, number of 
timeseries being ingested and the query load.


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/4f99afef59ee049ded618fc11ecc9650%40Jahingo.com.


[prometheus-users] Resources limits

2020-06-17 Thread Tomer Leibovich
I'm using Prometheus-Operator in my cluster and encountered an issue with the 
Prometheus pod, which consumed 20GB of RAM when my cluster grew to 400 
pods; eventually Prometheus choked the server and I had to terminate it.
How much memory should I allocate to the pod in order to keep it running and 
avoid letting it grow like that again?

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/5fee94a4-f444-4265-8dca-40b9d805c1f3o%40googlegroups.com.


Re: [prometheus-users] block size calculation behaves unintuitively when only using size based retention

2020-06-17 Thread Stuart Clark

On 2020-06-17 08:23, Dom Prittie wrote:

Hi,

I have a Prometheus deployment for which we have only used size based
retention, which we have set to 2.5T. We regularly see unexpectedly
large blocks getting created, so that when we hit
storage.tsdb.retention.size we see a huge drop in the metrics
available for querying.

For instance we currently have a block which is 1.1T and contains data
from Mon 1 Jun 01:00:00 BST 2020 to Sun 14 Jun 13:00:00 BST 2020,
which is ~45% of our retention size!

I can see from the docs that "Compaction will create larger blocks up
to 10% of the retention time, or 31 days, whichever is smaller". I
expected this to mean 10% of whatever retention I set, but from what I
have seen it looks like this means the max block size will always be
31 days if you have storage.tsdb.retention.time=0s and are just using
size based retention.

We have elected to use sized based retention instead of time because
we are in the process of onboarding application exporters so the
number of days we can retain is decreasing frequently, but the amount
of space we have available is not. In this situation what would be the
best way to configure storage.tsdb.max-block-duration?



What is the exact issue you are seeing, as it sounds like the storage 
usage is still within the 2.5T limit you set?


--
Stuart Clark

--
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/58883d30f3a95acf353ee940396b0bd4%40Jahingo.com.


[prometheus-users] Re: Help with understanding relabel config

2020-06-17 Thread Tomer Leibovich
Nothing?
Any help?

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/8bd7ea3f-157f-4856-85d7-fdd33329b164o%40googlegroups.com.


[prometheus-users] Re: Email notification doesn't Work

2020-06-17 Thread Shivam Soni
Use this; it will definitely work:


route:
  group_by: ['alertname']
  # Send all notifications to me.
  receiver: email-me

receivers:
- name: email-me
  email_configs:
  - to: ReciverMailId
    from: Sendermailid
    smarthost: smtp.gmail.com:587
    auth_username: "sen...@gmail.com"
    auth_identity: "sen...@gmail.com"
    auth_password: "pwd"


On Thursday, June 11, 2020 at 5:19:27 PM UTC+7, Frederic Arnould wrote:
>
> Hello, 
>
> Somebody could check my configuration ?
> My postfix works correctly, but not mails in my prometheus container.
>
> [root@admin-toto ~]# docker exec -it monitor_prometheus sh
>
> /prometheus # cat alertmanager.yml 
> global:
>   resolve_timeout: 5m
>   smtp_smarthost: 10.10.10.10
>   smtp_from: al...@alertmanager.com 
>   smtp_require_tls: false
> route:
>   group_by: ['alertname']
>   group_wait: 10s
>   group_interval: 10s
>   repeat_interval: 4h
>   receiver: 'team-admin'
> receivers:
> - name: 'team-admin'
>   email_configs:
>   - to: 'frederi...@toto.com ' 
> #inhibit_rules:
> #  - source_match:
> #  severity: 'critical'
> #target_match:
> #  severity: 'warning'
> #equal: ['alertname', 'job', 'instance']
>
>
> Thanks by advance 
> Regards
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/5e3ce383-2fcb-4dab-b95e-928edddfb6b5o%40googlegroups.com.


[prometheus-users] block size calculation behaves unintuitively when only using size based retention

2020-06-17 Thread Dom Prittie
Hi,

I have a Prometheus deployment for which we have only used size based 
retention, which we have set to 2.5T. We regularly see unexpectedly large 
blocks getting created, so that when we hit storage.tsdb.retention.size we 
see a huge drop in the metrics available for querying.

For instance we currently have a block which is 1.1T and contains data from 
Mon 1 Jun 01:00:00 BST 2020 to Sun 14 Jun 13:00:00 BST 2020, which is ~45% 
of our retention size!

I can see from the docs that "Compaction will create larger blocks up to 
10% of the retention time, or 31 days, whichever is smaller". I expected 
this to mean 10% of whatever retention I set, but from what I have seen it 
looks like this means the max block size will always be 31 days if you have 
storage.tsdb.retention.time=0s and are just using size based retention.

We have elected to use size-based retention instead of time because we are 
in the process of onboarding application exporters so the number of days we 
can retain is decreasing frequently, but the amount of space we have 
available is not. In this situation what would be the best way to configure 
storage.tsdb.max-block-duration?
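
For context, what I am considering is simply pinning the maximum block 
duration, along the lines of (sketch only; 2d is an arbitrary value meant to 
be roughly 10% of what 2.5T currently holds for us):

prometheus \
  --storage.tsdb.retention.size=2500GB \
  --storage.tsdb.max-block-duration=2d

but I am unsure whether that has other side effects on compaction.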


Thanks,

Dom

-- 


This e-mail together with any attachments (the "Message") is confidential 
and may contain privileged information. If you are not the intended 
recipient or if you have received this e-mail in error, please notify the 
sender immediately and permanently delete this Message from your system. Do 
not copy, disclose or distribute the information contained in this Message.


_Maven Investment Partners Ltd (No. 07511928), _Maven Investment Partners 
US Ltd (No. _11494299), Maven Europe Ltd (No. 08966), Maven Derivatives 
Asia Limited (No.10361312) & Maven Securities Holding Ltd (No. 07505438) 
are registered as companies in England and Wales and their registered 
address is Level 3, 6 Bevis Marks, London EC3A 7BA, United Kingdom. The 
companies’ VAT No. is 135539016. Maven Asia (Hong Kong) Ltd (No. 2444041) 
is registered in Hong Kong and its registered address is 20/F, Tai Tung 
Building, 8 Fleming Road, Wan Chai, Hong Kong. Maven Europe Ltd is 
authorised and regulated by the Financial Conduct Authority (FRN:770542). 
Maven Asia (Hong Kong) Ltd is registered and regulated by the Securities 
and Futures Commission (CE No: BJF060).___

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/281cc4ae-f6f8-4677-8f15-e28e252b8721n%40googlegroups.com.


Re: [prometheus-users] consul_exporter - network connections

2020-06-17 Thread Ben Kochie
Have you tried setting the --consul.request-limit to limit the number of
concurrent connections?

On Wed, Jun 17, 2020 at 6:37 AM Dennis Kelly 
wrote:

> We have a consul cluster of 3 members and about 1k services.
> consul_exporter has been using significantly more CPU and is also logging
> this:
>
> level=error ts=2020-06-16T23:56:46.593Z caller=consul_exporter.go:400
> msg="Failed to query service health" err="Get 
> \"http://consul.service:8500/v1/health/service/[service
> name]?stale=
> \":
> context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
>
> It is running as a docker container in Nomad. I bumped the CPU resource
> from the default to 900 MHz and also the consul.timeout to 2s. This has
> improved things, but we still sporadically receive this error. I haven't
> had a chance to dig through the entire source yet, but am wondering why
> consul_exporter has so many open connections to the same 3 consul servers:
>
> $ netstat | grep :8500 | wc -l
>
> 13653
>
> Why would the connections remain, and also if they do remain, not reused?
> I suspect we may be hitting up against this issue, but hoping for further
> clarification:
>
> https://github.com/prometheus/consul_exporter/issues/102
>
> Thanks!
>
> Dennis
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to prometheus-users+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/ece427fb-99ea-4deb-a99c-60707f2c807dn%40googlegroups.com
> 
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CABbyFmo3BXKDNcfHvxOAKPA8fv_RC88hQVw1Kfj-2qNOgrnZNg%40mail.gmail.com.