Right: high cardinality is bad. But what matters is the number of timeseries which have ingested a point in the last ~2 hours. Whether each of those time series ingests 1 data point or 10,000 data points in 2 hours makes almost no difference. Therefore, scraping them less frequently doesn't fix the high cardinality problem at all.
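To put rough numbers on that (the label names and value counts below are invented for illustration): the number of active series is the product of the distinct values of each label, and the scrape interval never enters into it.

```python
# Rough sketch with made-up labels: every distinct combination of label
# values is its own time series, so the active-series count is the product
# of the per-label value counts -- independent of how often you scrape.
methods = ["GET", "POST", "PUT", "DELETE"]        # 4 values
statuses = ["200", "400", "404", "500"]           # 4 values
user_ids = [f"user-{i}" for i in range(50_000)]   # the high-cardinality label

series_with_user_id = len(methods) * len(statuses) * len(user_ids)
series_without = len(methods) * len(statuses)

print(series_with_user_id)  # 800000 active series, however slowly you scrape
print(series_without)       # 16 series once the label is dropped
```

Dropping the one high-cardinality label collapses the series count by four orders of magnitude here; halving the scrape rate changes it not at all.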
You need to avoid those labels in Prometheus, so that the total number of timeseries (unique combinations of metric name + set of labels) stays within a reasonable range.

What I would suggest is that you push the high-cardinality data into a different system like Loki. You can then either use LogQL queries to derive the lower-cardinality metrics to go into Prometheus, or export them separately. (Instead of Loki you could also use Elasticsearch/OpenSearch, or a SQL database, or whatever.) Then you get the best of both worlds: fast timeseries storage and querying from Prometheus, and the full data in Loki for deeper analysis.

Note that Loki streams are defined by labels, and you can use the same low-cardinality labels that you will use for Prometheus. Hence searching across timespans of raw data for a given set of labels still performs better than a "brute force" log scan *à la* grep.

For some use cases you may also find "exemplars" to be useful in Prometheus. These let you store *one* example detailed event which was counted in a bucket, against that bucket. There's a short 5-minute overview here: https://www.youtube.com/watch?v=TzFjweKACMY&t=1644s

On Thursday, 30 March 2023 at 06:19:07 UTC+1 Kevin Z wrote:

> This label scales as users interact with our server and create new
> accounts. It is problematic right now because it currently is added to all
> metrics.
>
> On Monday, March 27, 2023 at 1:39:57 AM UTC-7 Stuart Clark wrote:
>
>> On 2023-03-25 07:30, Kevin Z wrote:
>> > Hi,
>> >
>> > We have a server that has a high cardinality of metrics, mainly due to
>> > a label that is tagged on the majority of the metrics. However, most
>> > of our dashboards/queries don't use this label, and just use aggregate
>> > queries. There are specific scenarios where we would need to debug and
>> > sort based on the label, but this doesn't happen that often.
>> >
>> > Is it a common design pattern to separate out two metrics endpoints,
>> > one for aggregates, one for labelled metrics, with different scrape
>> > intervals? This way we could limit the impact of the high cardinality
>> > time series, by scraping the labelled metrics less frequently.
>> >
>> > Couple of follow-up questions:
>> > - When a query that uses the aggregate metric comes in, does it matter
>> > that the data is potentially duplicated between the two endpoints? How
>> > do we ensure that it doesn't try loading all the different time series
>> > with the label and then aggregating, and instead directly use the
>> > aggregate metric itself?
>> > - How could we make sure this new setup is more efficient than the old
>> > one? What criteria/metrics would be best (query evaluation time?
>> > amount of data ingested?)
>> >
>>
>> You certainly could split things into two endpoints and scrape at
>> different intervals, however it is unlikely to make much, if any,
>> difference. From the Prometheus side additional data points within an
>> existing time series are very low impact. So you might scrape your
>> aggregate endpoint every 30 seconds and the full data every 2 minutes
>> (the slowest available scrape interval), meaning there are 4x fewer data
>> points, which has very little memory impact.
>>
>> You mention that there is a high cardinality - that is the thing which
>> you need to fix, as that will be having the impact. You say there is a
>> problematic label applied to most of the metrics. Can it be removed?
>> What makes it problematic?
>>
>> --
>> Stuart Clark

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/96a0faf1-dfaf-4b76-92da-38befa8bc659n%40googlegroups.com.