You say you have 2.1 million metrics, but your head series count is 7.6 million. That means you have a huge amount of series/label churn, which will significantly bloat memory use.
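One quick way to confirm that (a sketch, assuming your Prometheus scrapes its own /metrics endpoint, which is the usual default) is to compare the steady-state head series count with the rate at which new series are created:

    # current number of series in the head block
    prometheus_tsdb_head_series

    # rate of new series creation; a persistently high rate indicates churn
    rate(prometheus_tsdb_head_series_created_total[5m])
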
You have several options here:
* Increase your memory allocation
* Reduce the number of metrics per target
* Find out whether you have problematic labels that are causing series churn (see the example queries at the end of this message)
* Shard Prometheus

Handling millions of series requires capacity planning. There is just no way around this.

On Thu, Feb 10, 2022 at 10:17 AM Shubham Shrivastav <[email protected]> wrote:

> Hey, I'm using Prometheus v2.29.1. My scrape interval is 15 seconds and I'm measuring RAM using "container_memory_working_set_bytes" (the metric used to check k8s pod usage).
>
> Using "Status" in the Prometheus web UI, I see the following Head Stats:
>
> Number of Series       7644889
> Number of Chunks       8266039
> Number of Label Pairs  9968
>
> Like I mentioned above, we're getting *the average metrics per node as 8257* and we have around 300 targets now, which makes our total metrics around 2,100,000.
>
> *Are you monitoring Kubernetes pods by any chance?* I'm not monitoring any pods; I connect to certain nodes that send in custom metrics. Since I'm using a pod and not a node, the resources assigned to this pod are exclusive.
>
> On Thursday, 10 February 2022 at 00:20:04 UTC-8 Brian Candler wrote:
>
>> What prometheus version? How often are you polling? How are you measuring the RAM utilisation?
>>
>> Let me give you a comparison. I have a prometheus instance here which is polling 161 node_exporter targets, 38 snmp_exporter targets, 46 blackbox_exporter targets, and a handful of others, with a 1 minute scrape interval. It's running inside an lxd container, and uses a grand total of *2.5GB RAM* (as reported by "free" inside the container, "used" column). The entire physical server has 16GB of RAM, and is running a bunch of other monitoring tools in other containers as well. The physical host has 9GB of available RAM (as reported by "free" on the host, "available" column).
>>
>> This is with prometheus-2.33.0, under Ubuntu 18.04, although I haven't noticed significantly higher RAM utilisation with older versions of prometheus.
>>
>> Using "Status" in the Prometheus web UI, I see the following Head Stats:
>>
>> Number of Series       525141
>> Number of Chunks       525141
>> Number of Label Pairs  15305
>>
>> I can use a relatively expensive query to count the individual metrics at the current instant in time (it takes a few seconds):
>>
>>     count by (job) ({__name__=~".+"})
>>
>> This shows 391,863 metrics for node(*), 99,175 metrics for snmp, 23,138 metrics for haproxy (keepalived), and roughly 10,000 other metrics in total.
>>
>> (*) Given that there are 161 node targets, that's an average of 2433 metrics per node (from node_exporter).
>>
>> In summary, I find prometheus to be extremely frugal in its use of RAM, and therefore if you're getting OOM problems then there must be something different about your system.
>>
>> Are you monitoring kubernetes pods by any chance? Is there a lot of churn in those pods (i.e. pods being created and destroyed)? If you generate large numbers of short-lived timeseries, then that will require a lot more memory. The Head Stats figures are the place to start.
>>
>> Aside: a week or two ago, there was an isolated incident where this server started using more CPU and RAM. Memory usage graphs showed the RAM growing steadily over a period of about 5 hours; at that point, it was under so much memory pressure I couldn't log in to diagnose, and was forced to reboot.
>> However, since node_exporter is only returning the overall RAM on the host, not per-container, I can't tell which of the many containers running on that host was the culprit.
>>
>> [image: ram.png]
>>
>> This server is also running victoriametrics, nfsen, loki, smokeping, oxidized, netdisco, nagios, and some other bits and bobs - so it could have been any one of those. In fact, given that Ubuntu does various daily housecleaning activities at 06:25am, it could have been any of those as well.
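As a starting point for the "problematic labels" option above, here is a rough sketch of queries you can run against your own Prometheus (the instance_id label below is purely a hypothetical placeholder; substitute whichever label you suspect of churning). Like Brian's query, the first one is relatively expensive:

    # ten metric names contributing the most head series
    topk(10, count by (__name__) ({__name__=~".+"}))

    # series count broken down by a suspect label (instance_id is hypothetical)
    count by (instance_id) ({instance_id!=""})

The Status -> TSDB Status page in the web UI also shows head cardinality statistics (highest-cardinality metric names and label names), which is often the quickest way to spot the offender.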

