On Tuesday, 28 February 2023 at 00:45:36 UTC Christoph Anton Mitterer wrote:
> I want to use Prometheus merely for monitoring a few hundred nodes
> (thus it seems a bit overkill to have something like Cortex, which
> sounds like a system for a really large number of nodes) at the
> university

Thanos may be simpler. Although I've not used it myself, it looks like it can be deployed incrementally, starting with the sidecars.

> ... though as indicated before, we'd need both:
> - detailed data for like the last week or perhaps two
> - far less detailed data for much longer terms (like several years)

I can offer a couple more options:

(1) Use two servers with federation:
    - server 1 does the scraping and keeps the detailed data for 2 weeks
    - server 2 scrapes server 1 at a lower interval, using the federation endpoint

(2) Use recording rules to generate lower-resolution copies of the primary timeseries - but then you'd still have to remote-write them to a second server to get the longer retention, since retention can't be set at the timeseries level.

Either case makes querying more awkward. If you don't want separate dashboards for near-term and long-term data, then it might work to stick promxy in front of them.

Apart from saving disk space (and disks are really, really cheap these days), I suspect the main benefit you're looking for is faster queries when running over long time periods. Indeed, I believe Thanos creates downsampled timeseries for exactly this reason, whilst still continuing to retain all the full-resolution data as well.

> Right now my Prometheus server runs in a medium-sized VM, but when I
> visualise via Grafana and select a time span of a month, it already
> takes considerable time (like 10-15s) to render the graph.

Ah right, then that is indeed your concern.

> Is this expected?

That depends. What PromQL query does your graph use? How many timeseries does it touch? What's your scrape interval? Is your VM backed by SSDs?

For example, I have a very low performance (Celeron N2820, SATA SSD, 8GB RAM) test box at home. I scrape data at 15 second intervals.
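Going back to option (1) for a moment: a minimal sketch of the scrape job that server 2 could run against server 1's /federate endpoint might look like the following. The hostname `server1`, the job name, and the 5m interval are my assumptions, not something from your setup; and in practice you'd want to narrow the `match[]` selector rather than pull every series.

```yaml
# Sketch of a federation scrape job on server 2 (the long-retention server).
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 5m       # lower resolution than server 1's scrapes
    honor_labels: true        # keep the original job/instance labels
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~".+"}'  # assumption: federate everything; narrow this
    static_configs:
      - targets: ['server1:9090']
```

With `honor_labels: true`, the series arrive on server 2 with the same labels they have on server 1, so the same dashboard queries work against either server.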
Prometheus is running in an LXD container, alongside many other LXD containers. The query

    rate(ifHCInOctets{instance="gw2",ifName="pppoe-out2"}[2m])

run over a 30-day range takes less than a second - but that only touches one timeseries. (With 2-hour chunks, I would expect a 30-day period to read 360 chunks for a single timeseries.) It's possible, though, that when I tested it, the relevant data was already cached in RAM.

If you are using something like a Grafana dashboard, then you should determine exactly what queries it's issuing. Enabling the query log <https://prometheus.io/docs/guides/query-log/> can also help you identify the slowest-running queries.

Another suggestion: running netdata <https://github.com/netdata/netdata> within the VM will give you performance metrics at 1-second intervals, which can help identify what's happening during those 10-15 seconds: e.g. are you bottlenecked on CPU, or disk I/O, or something else.

-- 
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/43cea4c0-31a8-4dd6-8d98-3fed327ccf39n%40googlegroups.com.