You say you have 2.1 million metrics, but your head series count is 7.6
million. That gap means you have a huge amount of series / label churn, and
churn is what bloats the memory use.

You have several options here:
* Increase your memory allocation
* Reduce the number of metrics per target
* Find whether you have problematic labels causing series churn (see the
  query sketches just below)
* Shard Prometheus
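
For the third point, recent Prometheus versions have a "TSDB Status" page
under Status in the web UI that lists the top-10 metric names and label names
by series count. You can get a similar picture with queries along these lines
(expensive on a head this size, and "instance_id" is just a placeholder for
whatever label you suspect):

    # series count per metric name - the worst offenders float to the top
    topk(10, count by (__name__) ({__name__=~".+"}))

    # how many distinct values a suspect label takes
    count(count by (instance_id) ({instance_id!=""}))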

Handling millions of series requires capacity planning. There is just no
way around this.
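
One way to ground the planning in real numbers is to let Prometheus measure
itself: assuming you have the usual self-scrape job (I'm calling it
job="prometheus" here), this gives the approximate resident bytes per
in-memory series on your instance:

    process_resident_memory_bytes{job="prometheus"}
      / prometheus_tsdb_head_series{job="prometheus"}

Multiply that by the head-series count you expect to carry, and add headroom
for queries and compaction, to get a defensible memory request.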

On Thu, Feb 10, 2022 at 10:17 AM Shubham Shrivastav <
[email protected]> wrote:

> Hey, I'm using Prometheus v2.29.1. My scrape interval is 15 seconds, and
> I'm measuring RAM using "container_memory_working_set_bytes" (the metric
> used to check k8s pod usage).
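>
> For reference, the query I'm using is along these lines (the exact label
> names depend on the kubelet / cAdvisor version, so treat it as a sketch):
>
>     container_memory_working_set_bytes{pod=~"prometheus.*", container!=""}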
>
> Using "Status" in the Prometheus web UI, I see the following Head Stats:
>
> Number of Series  7644889
> Number of Chunks  8266039
> Number of Label Pairs  9968
>
> Like I mentioned above, we're getting *an average of 8257 metrics per
> node*, and we have around 300 targets now, which makes our total metrics
> around 2,100,000.
> *Are you monitoring Kubernetes pods by any chance?* I'm not monitoring
> any pods; I connect to certain nodes that send in custom metrics. And since
> Prometheus itself runs in a pod rather than on a node, the resources
> assigned to that pod are exclusive to it.
>
> On Thursday, 10 February 2022 at 00:20:04 UTC-8 Brian Candler wrote:
>
>> What prometheus version? How often are you polling? How are you measuring
>> the RAM utilisation?
>>
>> Let me give you a comparison.  I have a prometheus instance here which is
>> polling 161 node_exporter targets, 38 snmp_exporter targets, 46
>> blackbox_exporter targets, and a handful of others, with a 1 minute scrape
>> interval. It's running inside an lxd container, and uses a grand total of
>> *2.5GB RAM* (as reported by "free" inside the container, "used" column).  The
>> entire physical server has 16GB of RAM, and is running a bunch of other
>> monitoring tools in other containers as well.  The physical host has 9GB of
>> available RAM (as reported by "free" on the host, "available" column).
>>
>> This is with prometheus-2.33.0, under Ubuntu 18.04, although I haven't
>> noticed significantly higher RAM utilisation with older versions of
>> prometheus.
>>
>> Using "Status" in the Prometheus web UI, I see the following Head Stats:
>>
>> Number of Series  525141
>> Number of Chunks  525141
>> Number of Label Pairs  15305
>>
>> I can use a relatively expensive query to count the individual metrics at
>> the current instant in time (takes a few seconds):
>>     count by (job) ({__name__=~".+"})
>>
>> This shows 391,863 metrics for node(*), 99,175 metrics for snmp, 23,138
>> metrics for haproxy (keepalived), and roughly 10,000 other metrics in total.
>>
>> (*) Given that there are 161 node targets, that's an average of roughly
>> 2,434 metrics per node (from node_exporter).
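>>
>> A cheaper way to get roughly the same breakdown is the per-target sample
>> count that Prometheus already records at scrape time; something like this
>> (a sketch, assuming as above that the node_exporter job is called "node"):
>>
>>     avg by (job) (scrape_samples_scraped)
>>     avg(scrape_samples_scraped{job="node"})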
>>
>> In summary, I find prometheus to be extremely frugal in its use of RAM,
>> and therefore if you're getting OOM problems then there must be something
>> different about your system.
>>
>> Are you monitoring kubernetes pods by any chance?  Is there a lot of
>> churn in those pods (i.e. pods being created and destroyed)?  If you
>> generate large numbers of short-lived timeseries, then that will require a
>> lot more memory.  The Head Stats figures are the place to start.
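>>
>> You can also put a number on the churn by querying Prometheus's own
>> metrics; as a rough sketch (the window is arbitrary):
>>
>>     rate(prometheus_tsdb_head_series_created_total[1h])
>>
>> If that stays high relative to prometheus_tsdb_head_series, you are
>> creating and discarding series faster than the head can drop them.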
>>
>> Aside: a week or two ago, there was an isolated incident where this
>> server started using more CPU and RAM.  Memory usage graphs showed the RAM
>> growing steadily over a period of about 5 hours; at that point, it was
>> under so much memory pressure I couldn't log in to diagnose, and was forced
>> to reboot.  However, since node_exporter is only returning the overall RAM
>> on the host, not per-container, I can't tell which of the many containers
>> running on that host was the culprit.
>>
>> [image: ram.png]
>> This server is also running victoriametrics, nfsen, loki, smokeping,
>> oxidized, netdisco, nagios, and some other bits and bobs - so it could have
>> been any one of those.  In fact, given that Ubuntu runs various daily
>> housecleaning jobs at 06:25am, it could equally have been one of those.
>>