Compared with the samples themselves, series metadata is negligible in size.
How old is the data your queries need? The full 26w? If your queries only care about recent data, you could move older data to remote storage, which is cheap. Remote storage can still be queried, but with worse performance.
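
As a rough sketch of what that could look like in prometheus.yml - the endpoint URLs below are just placeholders for whichever remote storage backend you pick (Thanos, Cortex, VictoriaMetrics, ...):

    # Hypothetical endpoints; substitute your actual backend.
    remote_write:
      - url: "http://remote-storage.example.com/api/v1/write"
    remote_read:
      - url: "http://remote-storage.example.com/api/v1/read"
        read_recent: false  # answer queries over recent ranges from local TSDB only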

On Wednesday, Mar 03, 2021 at 01:07 +0800, Rufus Schäfing wrote:

Hey everyone!

We're running quite a large (at least for me) Prometheus-monitored
environment. At the time of writing,
'prometheus_tsdb_storage_blocks_bytes' reports 6.24 TiB of storage used.

That alone wouldn't be a huge problem, I guess - estimating the storage
needed from our scrape sizes and retention time (26w), that figure seems
to be in the right ballpark.
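
(The usual back-of-the-envelope formula behind such an estimate is
retention time x ingestion rate x bytes per sample, with ~1-2 B/sample
being the commonly quoted average. The ingestion rate can be read off
with a query like:

    # samples ingested per second, averaged over the last hour
    rate(prometheus_tsdb_head_samples_appended_total[1h])

E.g. a hypothetical 500k samples/s over 26w (15,724,800 s) at
1.5 B/sample works out to roughly 10.7 TiB.)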

Our main issue is that storage consumption is ever-increasing, and I
can't figure out exactly why or how to deal with it.

Looking at the 1-week deriv of used blocks I get around 600 KiB/s, which
works out to roughly 48 GiB/d - that seems quite excessive to me. Every
~20 days around 500 GiB are shaved off, but overall growth exceeds
shrinkage.
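
The query behind those numbers looks roughly like this (same
prometheus_tsdb_storage_blocks_bytes gauge as above; the 1w range smooths
out the compaction saw-tooth):

    deriv(prometheus_tsdb_storage_blocks_bytes[1w])          # growth in B/s
    deriv(prometheus_tsdb_storage_blocks_bytes[1w]) * 86400  # growth in B/day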

Our current Prometheus has data from the beginning of December 'til now,
and it looks like it's been this way at least since then.

Now I've been graphing different metrics pertaining to storage use,
TSDB behaviour, target count and churn rate.

In terms of churn I've noticed that the 'Kubernetes cAdvisor' job seems
to be generating new series all the time. I'm not sure how to interpret
the numbers, but compared to other jobs it's definitely noticeable
(around 1%-1.5% of all head series are created by cAdvisor). This also
makes sense, since containers and everything around them keep changing.
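
In case it helps, per-job churn can be approximated from the synthetic
scrape_series_added metric (recorded once per scrape; available since
Prometheus 2.10, if I remember correctly) - something like:

    # new series created over the last hour, summed per job
    # (sum_over_time adds up the per-scrape counts in the window)
    topk(5, sum by (job) (sum_over_time(scrape_series_added[1h])))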

If there is a connection between churn and storage use, I don't quite
see it. From reading the TSDB documentation I assume that more new series
lead to more series metadata in the blocks and worse compression
performance?
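
One way to put a number on the compression side would be the TSDB's
compaction histograms, which give the average bytes per sample after
compaction (from what I've read, ~1-2 B/sample is considered healthy, and
heavy churn pushes it up because short-lived series compress poorly):

      rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[6h])
    /
      rate(prometheus_tsdb_compaction_chunk_samples_sum[6h])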

I'd imagine that a big part of the storage is used for actual sample
data, though, and not so much for labels and such. Please enlighten me if
I seem to be misunderstanding this point!

Over the course of 3 months the number of head series has been quite
stable, between 4.2 and 4.7 million. The last month's numbers are also
lower than January's - seemingly decreasing.

Within a 12-hour window the head-series count currently follows a
saw-tooth pattern between ~4.2 and ~4.5 million.

The one metric I've found to correlate with our storage use is the
number of loaded TSDB blocks, which has definitely been trending upwards
since at least the beginning of December.
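
For reference, the block count is exposed as prometheus_tsdb_blocks_loaded;
graphing the compaction counter next to it might help show whether
compactions are keeping up:

    prometheus_tsdb_blocks_loaded                # blocks currently loaded
    rate(prometheus_tsdb_compactions_total[1d])  # compaction rate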

The inner workings concerning chunks, blocks and compaction are quite
mysterious to me at this point, so I cannot really come up with an
explanation for this either.


I think/hope that's everything from my side. I also hope I've explained
it all in an understandable way - I'm quite new to these kinds of setups
and performance issues, and English is only my second language.

I don't expect anyone to solve my problem (though that would be great,
of course). I'd definitely love to hear some opinions or insights from
someone more experienced. Any good resources or ideas for learning how to
debug/analyze this would be appreciated as well!


I leave my thanks in advance to anyone reading this!

- Rufus
