Re: [prometheus-users] can node_exporter expose aggregated node_cpu_seconds_total?

2023-02-05 Thread koly li
Thank you, everyone.

We are planning for 10K nodes, and each node has 128 cores. So the 
node_cpu_seconds_total timeseries alone come to 128 * 8 * 10,000 = 10,240,000. 
Meanwhile, there is other timeseries data from kubelet, kube-state-metrics, 
and more (business data). In total, all the data comes to around 30M 
timeseries, and Prometheus then eats 170G of memory (as tested); we think 
there should also be some buffer (maybe 100G). So it makes sense to reduce 
the node-exporter timeseries to 128 * 1 * 10,000 = 1,280,000.

We will try keeping only the idle mode for CPU usage. Some expressions we 
are considering:
1 - avg without(cpu, mode) (rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m]))

avg(1 - avg(rate(node_cpu_seconds_total{origin_prometheus=~"$origin_prometheus",job=~"$job",mode="idle"}[$interval])) by (instance)) * 100
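
For example, the first expression could also be wrapped in a recording rule, 
roughly like this sketch (the group and record names are placeholders we made 
up, and we assume the job label is node-exporter):

groups:
  - name: node-cpu
    rules:
      # Per-instance CPU utilisation (0..1), derived from the idle mode only.
      - record: instance:node_cpu_utilisation:rate5m
        expr: |
          1 - avg without(cpu, mode) (
            rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m])
          )

The second expression uses Grafana template variables ($origin_prometheus, 
$job, $interval), so it belongs in a dashboard panel rather than in a rule 
file.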



On Monday, February 6, 2023 at 4:46:10 AM UTC+8 l.mi...@gmail.com wrote:

> On Sunday, 5 February 2023 at 11:13:23 UTC Brian Candler wrote:
> Timeseries in Prometheus are extremely cheap.  If you're talking 10K nodes 
> and 96 cores per node, that's less than 1m timeseries; compared to the cost 
> of the estate you are managing, it's a drop in the ocean :-)  How many 
> *other* timeseries are you storing from node_exporter?
>
> A single timeseries eats >=4KB on all nodes I touch. Having a lot of 
> labels (or long labels) will make it more expensive.
> So 1M timeseries will eat 4GB of memory.
> Not everyone would call that extremely cheap, especially if that's just to 
> tell what's the cpu usage of each server.
>
> There was a PR that tried to implement scrape time recording rules, which 
> would help here, but it didn't seem to go far - 
> https://github.com/prometheus/prometheus/pull/10529
>



Re: [prometheus-users] can node_exporter expose aggregated node_cpu_seconds_total?

2023-02-05 Thread l.mi...@gmail.com


On Sunday, 5 February 2023 at 11:13:23 UTC Brian Candler wrote:
Timeseries in Prometheus are extremely cheap.  If you're talking 10K nodes 
and 96 cores per node, that's less than 1m timeseries; compared to the cost 
of the estate you are managing, it's a drop in the ocean :-)  How many 
*other* timeseries are you storing from node_exporter?

A single timeseries eats >=4KB on all nodes I touch. Having a lot of labels 
(or long labels) will make it more expensive.
So 1M timeseries will eat 4GB of memory.
Not everyone would call that extremely cheap, especially if that's just to 
tell what's the cpu usage of each server.

There was a PR that tried to implement scrape time recording rules, which 
would help here, but it didn't seem to go far 
- https://github.com/prometheus/prometheus/pull/10529



Re: [prometheus-users] can node_exporter expose aggregated node_cpu_seconds_total?

2023-02-05 Thread Ben Kochie
Well, there are 8 modes per CPU, so around 8M series. But still, that's not
much for such a large infra. Since it's bare metal, you can drop "steal" to
get it down to 7 modes.

If you really only cared about utilization, you could maybe just keep
"idle" and maybe "iowait".

It would probably be a small patch to the node_exporter to expose only
system-wide totals. But it's probably not something we would really want to
maintain upstream.
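
For example, something like this metric_relabel_configs sketch would keep 
only the idle and iowait modes of node_cpu_seconds_total at scrape time (the 
job name and target are placeholders, and the mode list assumes the standard 
Linux modes):

scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets: ['node1:9100']  # placeholder target
    metric_relabel_configs:
      # Drop node_cpu_seconds_total series for every mode except idle and iowait.
      # Other metrics are left untouched because their __name__ doesn't match.
      - source_labels: [__name__, mode]
        regex: 'node_cpu_seconds_total;(user|nice|system|irq|softirq|steal)'
        action: drop

CPU utilisation can then still be derived from the idle series alone.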

On Sun, Feb 5, 2023 at 12:13 PM Brian Candler  wrote:

> On Thursday, 2 February 2023 at 09:05:30 UTC koly li wrote:
> If using a recording rule to aggerate data, then I have to store both the
> per core samples and metric samples in the same prometheus, which costs
> lots of memory.
>
> Timeseries in Prometheus are extremely cheap.  If you're talking 10K nodes
> and 96 cores per node, that's less than 1m timeseries; compared to the cost
> of the estate you are managing, it's a drop in the ocean :-)  How many
> *other* timeseries are you storing from node_exporter?
>
> But if you still want to drop these timeseries, I can see two options:
>
> 1. Scrape into a primary prometheus, use recording rules to aggregate, and
> then either remote_write or federate to a second prometheus to store only
> the timeseries of interest.  This can be done with out-of-the-box
> components.  The primary prometheus needs only a very small retention
> window.
>
> 2. Write a small proxy which makes a node_exporter scrape, does the
> aggregation, and returns only the aggregates.  Then scrape the proxy.  That
> will involve some coding.
>



Re: [prometheus-users] can node_exporter expose aggregated node_cpu_seconds_total?

2023-02-05 Thread Brian Candler
On Thursday, 2 February 2023 at 09:05:30 UTC koly li wrote:
If using a recording rule to aggregate data, then I have to store both the 
per-core samples and the aggregated samples in the same Prometheus, which 
costs lots of memory.

Timeseries in Prometheus are extremely cheap.  If you're talking 10K nodes 
and 96 cores per node, that's less than 1m timeseries; compared to the cost 
of the estate you are managing, it's a drop in the ocean :-)  How many 
*other* timeseries are you storing from node_exporter?

But if you still want to drop these timeseries, I can see two options:

1. Scrape into a primary prometheus, use recording rules to aggregate, and 
then either remote_write or federate to a second prometheus to store only 
the timeseries of interest (a config sketch for the federation variant 
follows after this list).  This can be done with out-of-the-box 
components.  The primary prometheus needs only a very small retention 
window.

2. Write a small proxy which makes a node_exporter scrape, does the 
aggregation, and returns only the aggregates.  Then scrape the proxy.  That 
will involve some coding. 
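
For the federation variant of option 1, a rough sketch of the second 
prometheus's scrape config could look like this (the recorded series name 
instance:node_cpu_utilisation:rate5m and the target are made up for 
illustration; the primary would need a recording rule producing that series):

scrape_configs:
  - job_name: federate-cpu
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - 'instance:node_cpu_utilisation:rate5m'  # pull only the aggregated series
    static_configs:
      - targets: ['primary-prometheus:9090']  # placeholder primary

With remote_write instead, the equivalent filtering would be done with 
write_relabel_configs on the primary, keeping only the recorded series.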



Re: [prometheus-users] can node_exporter expose aggregated node_cpu_seconds_total?

2023-02-02 Thread Ben Kochie
The node_exporter exposes per-cpu metrics because that's what most users
want. Knowing about per-core saturation, single-core IO wait, etc. is
extremely useful and covers common use cases.

Using a recording rule is recommended.

On Thu, Feb 2, 2023 at 10:05 AM koly li  wrote:

> If using a recording rule to aggregate data, then I have to store both the
> per-core samples and the aggregated samples in the same Prometheus, which
> costs lots of memory.
>
> After some investigation of the node_exporter source code, I found:
> 1. The updateStat function (cpu_linux.go) reads the content of /proc/stat
> and generates the node_cpu_seconds_total samples per core.
> 2. updateStat calls c.fs.Stat() to read and parse the content of /proc/stat.
> 3. fs.Stat() parses /proc/stat and stores the CPU totals in Stat.CPUTotal
> (stat.go).
> 4. However, updateStat ignores Stat.CPUTotal; it only uses stats.CPU, which
> contains per-core info.
>
> So the question is: why don't the node_exporter developers use CPUTotal to
> expose total CPU statistics? Should new metrics for the total usage
> statistics be added to node-exporter?
>
>
> On Thursday, February 2, 2023 at 2:40:34 PM UTC+8 Stuart Clark wrote:
> On 02/02/2023 06:26, koly li wrote:
> Hi,
>
> Currently, node_exporter exposes time series for each cpu core (an example
> below), which generates a lot of data in a large cluster (a 10k-node
> cluster). However, we only care about total cpu usage instead of usage per
> core. So is there a way for node_exporter to only
> expose aggregated node_cpu_seconds_total?
>
> We also noticed there is a discussion here (reduce cardinality of
> node_cpu_seconds_total), but it seems it reached no conclusion.
>
>
> node_cpu_seconds_total{container="node-exporter",cpu="85",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="system",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 9077.24 1675059665571
>
> node_cpu_seconds_total{container="node-exporter",cpu="85",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="user",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 19298.57 1675059665571
>
> node_cpu_seconds_total{container="node-exporter",cpu="86",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="idle",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 1.060892164e+07 1675059665571
>
> node_cpu_seconds_total{container="node-exporter",cpu="86",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="iowait",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 4.37 1675059665571
>
> You can't remove it as far as I'm aware, but you can use a recording rule
> to aggregate that data to just give you a metric that represents the
> overall CPU usage (not broken down by core/status).
> -- Stuart Clark
>



Re: [prometheus-users] can node_exporter expose aggregated node_cpu_seconds_total?

2023-02-02 Thread koly li
If using a recording rule to aggregate data, then I have to store both the 
per-core samples and the aggregated samples in the same Prometheus, which 
costs lots of memory.

After some investigation of the node_exporter source code, I found:
1. The updateStat function (cpu_linux.go) reads the content of /proc/stat 
and generates the node_cpu_seconds_total samples per core.
2. updateStat calls c.fs.Stat() to read and parse the content of /proc/stat.
3. fs.Stat() parses /proc/stat and stores the CPU totals in Stat.CPUTotal 
(stat.go).
4. However, updateStat ignores Stat.CPUTotal; it only uses stats.CPU, which 
contains per-core info.

So the question is: why don't the node_exporter developers use CPUTotal to 
expose total CPU statistics? Should new metrics for the total usage 
statistics be added to node-exporter?
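
For reference, the data in Stat.CPUTotal is effectively recoverable at query 
time from the per-core series, e.g. with a recording rule like this sketch 
(the record name is only illustrative):

groups:
  - name: node-cpu-per-mode
    rules:
      # Per-instance, per-mode CPU seconds per second summed across all cores,
      # i.e. roughly what a host-wide CPUTotal-based metric would give after rate().
      - record: instance_mode:node_cpu_seconds:rate5m
        expr: sum without(cpu) (rate(node_cpu_seconds_total{job="node-exporter"}[5m]))

But as noted above, a recording rule still means storing both the per-core 
and the aggregated series.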


On Thursday, February 2, 2023 at 2:40:34 PM UTC+8 Stuart Clark wrote:
On 02/02/2023 06:26, koly li wrote:
Hi, 

Currently, node_exporter exposes time series for each cpu core (an example 
below), which generates a lot of data in a large cluster (a 10k-node 
cluster). However, we only care about total cpu usage instead of usage per 
core. So is there a way for node_exporter to only 
expose aggregated node_cpu_seconds_total?

We also noticed there is a discussion here (reduce cardinality of 
node_cpu_seconds_total), but it seems it reached no conclusion.

node_cpu_seconds_total{container="node-exporter",cpu="85",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="system",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 9077.24 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="85",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="user",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 19298.57 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="86",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="idle",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 1.060892164e+07 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="86",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="iowait",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 4.37 1675059665571

You can't remove it as far as I'm aware, but you can use a recording rule 
to aggregate that data to just give you a metric that represents the 
overall CPU usage (not broken down by core/status).
-- Stuart Clark 



Re: [prometheus-users] can node_exporter expose aggregated node_cpu_seconds_total?

2023-02-01 Thread Stuart Clark

On 02/02/2023 06:26, koly li wrote:

Hi,

Currently, node_exporter exposes time series for each cpu core (an 
example below), which generates a lot of data in a large cluster (a 
10k-node cluster). However, we only care about total cpu usage instead of 
usage per core. So is there a way for node_exporter to only 
expose aggregated node_cpu_seconds_total?


We also noticed there is a discussion here (reduce cardinality of 
node_cpu_seconds_total), but it seems it reached no conclusion.


node_cpu_seconds_total{container="node-exporter",cpu="85",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="system",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 
9077.24 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="85",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="user",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 
19298.57 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="86",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="idle",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 
1.060892164e+07 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="86",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="iowait",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 
4.37 1675059665571


You can't remove it as far as I'm aware, but you can use a recording 
rule to aggregate that data to just give you a metric that represents 
the overall CPU usage (not broken down by core/status).


--
Stuart Clark



[prometheus-users] can node_exporter expose aggregated node_cpu_seconds_total?

2023-02-01 Thread koly li
Hi,

Currently, node_exporter exposes time series for each cpu core (an example 
below), which generates a lot of data in a large cluster (a 10k-node 
cluster). However, we only care about total cpu usage instead of usage per 
core. So is there a way for node_exporter to only 
expose aggregated node_cpu_seconds_total?

We also noticed there is a discussion here (reduce cardinality of 
node_cpu_seconds_total), but it seems it reached no conclusion.

node_cpu_seconds_total{container="node-exporter",cpu="85",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="system",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"}
 
9077.24 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="85",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="user",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"}
 
19298.57 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="86",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="idle",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"}
 
1.060892164e+07 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="86",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="iowait",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"}
 
4.37 1675059665571
