On Tuesday, 28 February 2023 at 00:45:36 UTC Christoph Anton Mitterer wrote:

> I want to use Prometheus merely for monitoring a few hundred nodes (thus it 
> seems a bit overkill to have something like Cortex, which sounds like a 
> system for a really large number of nodes) at the university


Thanos may be simpler. Although I've not used it myself, it looks like it 
can be deployed incrementally starting with the sidecars.

 

> , though as indicated before, we'd need both: 
> - detailed data for something like the last week or perhaps two 
> - far less detailed data for much longer terms (like several years)


I can offer a couple more options:

(1) Use two servers with federation.
- server 1 does the scraping and keeps the detailed data for 2 weeks
- server 2 scrapes server 1 at a lower interval, using the federation endpoint
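
To make option (1) concrete, the scrape config on server 2 might look 
roughly like the sketch below (the 'server1:9090' target, the 5m interval 
and the catch-all match[] selector are placeholders you'd adjust):

    scrape_configs:
      - job_name: 'federate'
        scrape_interval: 5m
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job=~".+"}'   # pull series from every job; narrow if you can
        static_configs:
          - targets:
            - 'server1:9090'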

(2) Use recording rules to generate lower-resolution copies of the primary 
timeseries - but then you'd still have to remote-write them to a second 
server to get the longer retention, since retention can't be set at the 
timeseries level.
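
As a rough illustration of option (2) - the rule name, interval and URL 
below are made up for the example - a rule group evaluated every 5 minutes 
produces the low-resolution series, and a write_relabel_configs filter 
forwards only those recorded series to the long-term server (which needs to 
accept remote write):

    # rules.yml - evaluate every 5 minutes to create lower-resolution series
    groups:
      - name: downsample
        interval: 5m
        rules:
          - record: instance:node_cpu_busy:rate5m
            expr: rate(node_cpu_seconds_total{mode!="idle"}[5m])

    # prometheus.yml - ship only the recorded series to the long-term server
    remote_write:
      - url: http://longterm-prometheus:9090/api/v1/write
        write_relabel_configs:
          - source_labels: [__name__]
            regex: 'instance:.*'
            action: keep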

Either approach makes querying more awkward.  If you don't want separate 
dashboards for near-term and long-term data, then it might work to stick 
promxy in front of them.
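
Treat the following as an unverified sketch based on the promxy README: a 
single promxy instance lists both Prometheus servers as server groups and 
merges their data, so Grafana only needs one datasource (hostnames are 
placeholders):

    promxy:
      server_groups:
        # short-retention, full-resolution server
        - static_configs:
            - targets:
                - 'server1:9090'
        # long-retention, downsampled server
        - static_configs:
            - targets:
                - 'server2:9090'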

Apart from saving disk space (and disks are really, really cheap these 
days), I suspect the main benefit you're looking for is to get faster 
queries when running over long time periods.  Indeed, I believe Thanos 
creates downsampled timeseries for exactly this reason, whilst still 
continuing to retain all the full-resolution data as well.

> Right now my Prometheus server runs in a medium-sized VM, but when I 
> visualise via Grafana and select a time span of a month, it already 
> takes considerable time (like 10-15s) to render the graph.


Ah right, then that is indeed your concern.
 

> Is this expected?


That depends.  What PromQL query does your graph use? How many timeseries 
does it touch? What's your scrape interval?  Is your VM backed by SSDs?

For example, I have a very low-performance (Celeron N2820, SATA SSD, 8GB 
RAM) test box at home.  I scrape data at 15-second intervals. Prometheus is 
running in an lxd container, alongside many other lxd containers.  The 
query:

    rate(ifHCInOctets{instance="gw2",ifName="pppoe-out2"}[2m])

run over a 30-day range takes less than a second - but that only touches 
one timeseries.  (With 2-hour chunks, I would expect a 30-day period to read 
360 chunks for a single timeseries.)  But it's possible that when I tested 
it, the relevant data was already cached in RAM.

If you are doing something like a Grafana dashboard, then you should 
determine exactly which queries it is running.  Enabling the query log 
<https://prometheus.io/docs/guides/query-log/> can also help you identify 
the slowest-running queries.
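
For reference, the query log only needs one setting in prometheus.yml 
followed by a configuration reload (the file path here is just an example):

    global:
      query_log_file: /var/log/prometheus/query.log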

Another suggestion: running netdata <https://github.com/netdata/netdata> 
within the VM will give you performance metrics at 1-second intervals, 
which can help identify what's happening during those 10-15 seconds: e.g. 
whether you are bottlenecked on CPU, disk I/O, or something else.
