Re: [prometheus-users] Prometheus not able to scale vertically due to lock contention

Aliaksandr Valialkin Tue, 09 Mar 2021 11:22:42 -0800

On Fri, Mar 5, 2021 at 12:10 AM Dhruv Patel <dhruvpatel5...@gmail.com>
wrote:


> Hi Folks,
>   We are seeing an issue in our current Prometheus setup where we are not
> able to ingest beyond 22 million metrics/min. We have run several load tests
> at 25 million, 29 million and 35 million, but the ingestion rate remains
> constant at around the same 22 million metrics/min. Moreover, we are also
> seeing that our CPU usage is around 70% and that more than 50% of memory is
> still available. Given this, it feels like we are not hitting resource
> limitations but rather some form of lock contention.
>
> *Prometheus Version:* 2.9.1
> *Host Shape:* x7-enclave-104 (It is a bare metal host with 104 processor
> units). More info can be obtained in below screenshots
> *Memory Info:*
>           total    used    free    shared    buff/cache    available
> Mem:       754G     88G    528G       67M          136G         719G
> Swap:      1.0G      0B    1.0G
> Total:     755G     88G    529G
>
> We also ran some profiling during our load tests at 20 million, 22
> million and 25 million and have seen an increase in the time spent in
> runtime.mallocgc, which leads to increased usage of runtime.futex.
> Somehow we are not able to figure out what could be causing the lock
> contention. I have attached our profiling results at the different load
> test levels in case that's useful. Any ideas on what could be causing the
> high time spent in runtime.mallocgc?
>

Prometheus is written in Go. The runtime.mallocgc function is called every
time Prometheus allocates a new object on the heap, so the profiles suggest
that Prometheus 2.9.1 allocates heavily during the load test. The
runtime.futex function is the Go runtime's wrapper around the Linux futex
syscall; the runtime uses it to block and wake threads, including on the
locks taken during object allocation and garbage collection. It looks like
the Go runtime used to build Prometheus 2.9.1 isn't well optimized for
programs that allocate objects frequently on systems with many CPU cores.
This should be improved in Go 1.15, whose release notes say: "Allocation of
small objects now performs much better at high core counts, and has lower
worst-case latency" <https://tip.golang.org/doc/go1.15#runtime> . So I
recommend repeating the load test on the latest available version of
Prometheus, which is hopefully built with at least Go 1.15 - see
https://github.com/prometheus/prometheus/releases .

Additionally, you can run the load test on VictoriaMetrics and compare its
scalability with Prometheus. See
https://victoriametrics.github.io/#how-to-scrape-prometheus-exporters-such-as-node-exporter
.




-- 
Best Regards,

Aliaksandr Valialkin, CTO VictoriaMetrics

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CAPbKnmC5W-Q_Y5krMZNK-tnJsNUbjxcX2Cebqncrzq%3DQy%2BSa_Q%40mail.gmail.com.
