Hi Aliaksandr,

Thank you! Those numbers look interesting; we will give it a shot as well.

Thanks,
Karthik

On Sat, Oct 17, 2020 at 1:42 PM Aliaksandr Valialkin <valy...@gmail.com>
wrote:

> Hi Karthik,
>
> There is another option - to substitute Prometheus with VictoriaMetrics
> stack <https://victoriametrics.github.io/>, which includes vmagent
> <https://victoriametrics.github.io/vmagent.html> for data scraping and
> vmalert <https://victoriametrics.github.io/vmalert.html> for alerting and
> recording rules. It is optimized for high load, so it should require
> fewer resources than Prometheus. See, for example, this case
> study <https://victoriametrics.github.io/CaseStudies.html#wixcom>.
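>
> Roughly, the moving parts would look like this (host names and file names
> here are placeholders; see the docs linked above for the full set of flags):
>
>   # vmagent scrapes targets from a normal Prometheus scrape config and
>   # forwards the samples to VictoriaMetrics via remote write.
>   ./vmagent -promscrape.config=prometheus.yml \
>     -remoteWrite.url=http://victoria-metrics:8428/api/v1/write
>
>   # vmalert evaluates alerting and recording rules against VictoriaMetrics
>   # and sends firing alerts to Alertmanager.
>   ./vmalert -rule=rules.yml \
>     -datasource.url=http://victoria-metrics:8428 \
>     -notifier.url=http://alertmanager:9093 \
>     -remoteWrite.url=http://victoria-metrics:8428/api/v1/write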
>
> On Tue, Oct 13, 2020 at 2:23 PM Karthik Vijayaraju <
> karthik.vijayar...@gmail.com> wrote:
>
>> Thank you!
>> I will try this out with a newer version and experiment with hashmod.
>>
>> On Mon, Oct 12, 2020 at 3:25 PM Ben Kochie <sup...@gmail.com> wrote:
>>
>>> Thanks, knowing which Prometheus version you're on helps a lot. There are
>>> two things that should help a setup like yours considerably.
>>>
>>> First, Prometheus 2.19 introduced new memory management improvements
>>> that mostly eliminate the memory growth caused by pod churn. It also
>>> greatly improves memory use for high scrape frequencies.
>>>
>>> Second, 2.18.2 was the first official Prometheus version to be built
>>> with Go 1.14. This introduced an issue that affected compression, and hence
>>> the memory use, of Prometheus. See
>>> https://github.com/prometheus/prometheus/pull/7976.
>>>
>>> Once 2.22.0 is out, upgrading would be highly recommended.
>>>
>>> You might want to look at this Prometheus Operator issue about hashmod
>>> sharding:
>>> https://github.com/prometheus-operator/prometheus-operator/issues/2590
>>>
>>> On Sun, Oct 11, 2020 at 10:14 PM kvr <karthik.vijayar...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> There are different services and each could scale to 1000+ pods in a
>>>> given namespace.
>>>> But even then, managing a Prometheus instance pair per set of apps is
>>>> not tenable. The management overhead would be too great when there are
>>>> several such apps.
>>>>
>>>> Version-wise, we are keeping up, but not aggressively.
>>>> We are on 2.18.2 and the instance under test does not have Thanos. It
>>>> only scrapes and does some rule evaluation (the memory usage is the same
>>>> even when rule evaluation is disabled).
>>>> We are using the Prometheus Operator to reload the config.
>>>>
>>>> Yeah, I read that ~2GB of memory is sufficient per million series, so at
>>>> ~15M head series I would expect something like 30GB, and I am surprised
>>>> that it consumes so much more. Would having diverse scrape intervals have
>>>> such an effect?
>>>>
>>>> Our stats at peak:
>>>> ~15M head series
>>>> ~45M head chunks
>>>> ~475K samples/s ingested
>>>> ~7000 pods scraped
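>>>>
>>>> (For reference, these figures roughly correspond to the following
>>>> self-metrics queries; the exact expressions we use may differ slightly:)
>>>>
>>>>   prometheus_tsdb_head_series                              # head series
>>>>   prometheus_tsdb_head_chunks                              # head chunks
>>>>   rate(prometheus_tsdb_head_samples_appended_total[5m])    # samples/s ingested
>>>>   count(up)                                                # scraped targets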
>>>>
>>>> Thanks!
>>>>
>>>> On Sunday, October 11, 2020 at 12:38:27 PM UTC+5:30 sup...@gmail.com
>>>> wrote:
>>>>
>>>>> If all of the 1000s of pods in a namespace are running the same
>>>>> application, you can use the hashmod relabel action to scale horizontally.
>>>>>
>>>>> You can have several Prometheus instances per namespace, each
>>>>> responsible for a fraction of the pods.
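>>>>>
>>>>> A minimal sketch of such a hashmod relabel config (illustrative job name,
>>>>> 3 shards assumed; adjust the modulus and regex to your shard count):
>>>>>
>>>>>   scrape_configs:
>>>>>     - job_name: 'namespace-pods'
>>>>>       kubernetes_sd_configs:
>>>>>         - role: pod
>>>>>       relabel_configs:
>>>>>         # Hash every target address into one of 3 buckets.
>>>>>         - source_labels: [__address__]
>>>>>           modulus: 3
>>>>>           target_label: __tmp_hash
>>>>>           action: hashmod
>>>>>         # This instance keeps bucket 0; the other instances keep "1" and "2".
>>>>>         - source_labels: [__tmp_hash]
>>>>>           regex: "0"
>>>>>           action: keep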
>>>>>
>>>>> Just to be sure, are you keeping up to date on the latest releases?
>>>>> 200G of memory seems like a lot for 15M series.
>>>>>
>>>>> Are you using Thanos or a remote write service?
>>>>>
>>>>> On Sun, Oct 11, 2020, 07:14 kvr <karthik.v...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We are hitting some limits with our current setup of Prometheus. I
>>>>>> have read a lot of posts here as well as blogs and videos but still need
>>>>>> some guidance.
>>>>>>
>>>>>> Our current setup is at its limit. The head series count regularly reaches
>>>>>> around 15M during pod churn. Each app exports between 5,000 and 8,000
>>>>>> metric series, so 1,000 pods create roughly 8M new series in the head block.
>>>>>> Prometheus currently has access to 300GB of memory, but in practice it
>>>>>> cannot usefully go past 200GB; it starts degrading around the 150GB mark.
>>>>>> - The scrape duration for Prometheus scraping itself is 5+ seconds, and
>>>>>> config reloads fail.
>>>>>> - We verified that this is not due to a cardinality explosion from a
>>>>>> misbehaving app, so the degradation is simply due to overall load (see the
>>>>>> checks sketched below).
>>>>>> - We eliminated bad queries as a cause by spinning up an additional
>>>>>> Prometheus that only scrapes targets and does nothing else, so the
>>>>>> bottleneck is ingestion itself.
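>>>>>>
>>>>>> (For illustration, the checks were along these lines; a sketch, not the
>>>>>> exact queries we ran:)
>>>>>>
>>>>>>   # self-scrape duration of the Prometheus job
>>>>>>   scrape_duration_seconds{job="prometheus"}
>>>>>>   # top metric names by series count, to rule out a cardinality explosion
>>>>>>   topk(10, count by (__name__) ({__name__=~".+"}))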
>>>>>>
>>>>>> So the next step for us is to shard and use namespace-level Prometheus
>>>>>> instances. But I expect a similar level of usage at the namespace level
>>>>>> again in about a year, with multiple apps in a single namespace scaling to
>>>>>> 1000s of pods exporting 5K metric series each. And I will not be able to
>>>>>> shard again, because I don't want to go below namespace granularity.
>>>>>>
>>>>>> How have others dealt with this situation, where the bottleneck is going
>>>>>> to be ingestion rather than queries?
>>>>>>
>>>>>> Thanks for your time,
>>>>>> KVR
>>>>>>
>
>
> --
> Best Regards,
>
> Aliaksandr Valialkin, CTO VictoriaMetrics
>
