50 nodes at 64Gi is 3200Gi of memory. Using 30Gi is 0.9% of the cluster. This is a little high, but not out of bounds for a normal deployment.
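As a quick sanity check, the arithmetic above can be sketched as (the node count and sizes are the figures from this thread):

```python
# Cluster-memory arithmetic from the message above.
nodes, gib_per_node = 50, 64
cluster_gib = nodes * gib_per_node      # 3200 GiB of total cluster memory
prometheus_gib = 30                     # observed Prometheus usage
share = prometheus_gib / cluster_gib
print(f"{cluster_gib} GiB cluster, Prometheus uses {share:.1%}")  # -> 0.9%
```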
I would recommend starting to consider sharding by Kubernetes namespace. This is what we're working on to keep single-service namespaces from blowing up the cluster monitoring too badly.

On Mon, Aug 23, 2021 at 4:13 PM Yaron B <[email protected]> wrote:

> We have around 50 nodes with 64 GB of RAM each.
>
> By the way, we found that our backend added a metric that spammed
> Prometheus until it crashed :) They removed the metric and the server
> seems to be stable. It is still using around 30 GB of RAM, but at least
> it is not crashing.
>
> On Monday, August 23, 2021 at 16:25:21 UTC+3, [email protected] wrote:
>
>> Seems about correct for that many series. Kubernetes use includes a lot
>> of label data/cardinality that requires extra memory for tracking.
>>
>> How big is your cluster in terms of total memory for all nodes?
>>
>> On Mon, Aug 23, 2021 at 2:18 PM Yaron B <[email protected]> wrote:
>>
>>> That makes sense, but if I look at the numbers in the URL you gave me:
>>>
>>> Number of Series: 2514033
>>> Number of Chunks: 3098707
>>> Number of Label Pairs: 1088507
>>>
>>> and use them in a memory calculator I found, it shows much less RAM
>>> than what I am using now. Do you see any number here that should be a
>>> red light for me? Something that is not right?
>>>
>>> On Monday, August 23, 2021 at 14:58:36 UTC+3, [email protected] wrote:
>>>
>>>> Prometheus needs memory to buffer incoming data before writing it to
>>>> disk. The more you scrape, the more it needs.
>>>>
>>>> You can see a summary of this information on prometheus:9090/tsdb-status
>>>>
>>>> On Mon, Aug 23, 2021 at 1:55 PM Yaron B <[email protected]> wrote:
>>>>
>>>>> Can anyone understand from this image why the server is using so
>>>>> much?
>>>>>
>>>>> production-prometheus-server-869bffc459-r92nh   1186m   54937Mi
>>>>>
>>>>> That's crazy!
>>>>> On Monday, August 23, 2021 at 13:35:18 UTC+3, Yaron B wrote:
>>>>>
>>>>>> At the moment we did add some scrape jobs that bumped the memory
>>>>>> usage from around 30 GB to 40 GB, but we are not sure why the
>>>>>> self-scraping takes so much RAM. It is not a new installation; we
>>>>>> did notice it was using a lot of memory, but it didn't crash on us,
>>>>>> so we let it run. Today, as you can see in the attached image, it
>>>>>> crashed and the memory usage skyrocketed to 60 GB. We then started
>>>>>> disabling jobs until the server stopped crashing, but it is still
>>>>>> using more than it did over the last 15 days.
>>>>>>
>>>>>> On Monday, August 23, 2021 at 13:29:59 UTC+3, Stuart Clark wrote:
>>>>>>
>>>>>>> On 23/08/2021 11:23, Yaron B wrote:
>>>>>>>
>>>>>>> I am attaching the heap.svg if someone can help me figure out what
>>>>>>> is using the memory.
>>>>>>>
>>>>>>> On Monday, August 23, 2021 at 12:23:33 UTC+3, Yaron B wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We are facing an issue with the Prometheus server's memory usage.
>>>>>>>> When starting the server, it starts with around 30 GB of RAM, even
>>>>>>>> without any jobs configured other than the self-scrape one. In the
>>>>>>>> attached image you can see the heap size usage for the prometheus
>>>>>>>> job. Is there a way to reduce this size? When we add our
>>>>>>>> Kubernetes scrape job, we reach our node limit and get OOMKilled.
>>>>>>>
>>>>>>> So at the moment it isn't scraping anything other than itself via
>>>>>>> the /metrics endpoint?
>>>>>>>
>>>>>>> Is this a brand new service (i.e. no existing data stored on disk)?
>>>>>>>
>>>>>>> Is there anything querying the server (e.g. Grafana dashboards,
>>>>>>> etc.)?
>>>>>>>
>>>>>>> --
>>>>>>> Stuart Clark
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>> send an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/prometheus-users/0659c262-daeb-452e-8dc4-4df8df22021dn%40googlegroups.com
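The back-of-the-envelope estimate discussed in the thread (the "memory calculator" showing much less RAM than observed) can be sketched as follows. The series count is the one quoted above from /tsdb-status; the ~4 KiB per active series is an assumed rule of thumb, not an exact figure — real usage varies with Prometheus version, series churn, label sizes, and query load, which is consistent with the gap Yaron saw:

```python
# Rough Prometheus memory estimate from /tsdb-status counts.
# BYTES_PER_SERIES is an assumption (a commonly used rule of thumb),
# not a documented constant.

NUM_SERIES = 2_514_033        # "Number of Series" from the thread
BYTES_PER_SERIES = 4 * 1024   # assumed ~4 KiB of RAM per active series

est_gib = NUM_SERIES * BYTES_PER_SERIES / 2**30
print(f"~{est_gib:.1f} GiB estimated vs ~30 GiB observed")  # -> ~9.6 GiB
```

The shortfall between such an estimate and the observed 30+ GB is why the thread's advice focuses on cardinality (a single spamming metric) and on sharding scrape targets, rather than on tuning a fixed per-series figure.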

