On Sun, 17 May 2020 at 10:53, Rajesh Reddy Nachireddi <
[email protected]> wrote:

> Hi,
>
> Basically, we have a large networking setup with 10k devices. We are hitting
> 1M metrics every second from just 20% of the devices, so we have 5 Prometheus
> instances and one global Prometheus, which uses remote read to handle alert
> rule evaluation, plus Thanos Querier for visualisation in Grafana.
>
> We have assigned devices to each Prometheus instance by device IP
> range.
>
> So, we have one aggregator which uses remote read from all the
> individual Prometheus instances.
>
> 1. Will remote read cause problems by loading large time series
> over the wire every 1 min?
> 2. Is it CPU or memory intensive?
>
> What is the best design strategy to handle this scale and to alert across
> the devices or metrics?
>

Remote read is unlikely to be the best approach here. It pulls large amounts
of raw data over the network on every evaluation, and all of it has to be
buffered in RAM.

What you want to do here is run as much of the alerting and rule evaluation
as possible on the scraping Prometheus servers. For things that you can't do
that way (e.g. alerting when 10% of devices are down globally), use
federation to pass aggregates, e.g. the total number of devices down in each
Prometheus, up to the global Prometheus and alert on those.
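
A minimal sketch of that layout follows, not a tested configuration. The job
name "devices", the rule and alert names, the host names and the 5m/10%
thresholds are all assumptions for illustration, and it assumes `up` reflects
whether each device is reachable:

```yaml
# Rule file on each scraping Prometheus (names are hypothetical).
groups:
  - name: devices
    rules:
      # Per-device alerting stays local, next to the raw data.
      - alert: DeviceDown
        expr: up{job="devices"} == 0
        for: 5m
      # Small aggregates that are cheap to federate.
      - record: job:up:count
        expr: count by (job) (up{job="devices"})
      - record: job:up:sum
        expr: sum by (job) (up{job="devices"})
---
# prometheus.yml fragment on the global Prometheus.
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        # Pull only the aggregated series, never the raw ones.
        - '{__name__=~"job:up:.*"}'
    static_configs:
      - targets:
          - prom-1.example.com:9090   # hypothetical host names
          - prom-2.example.com:9090
---
# Rule file on the global Prometheus.
groups:
  - name: global
    rules:
      - alert: TooManyDevicesDown
        # Fraction down = 1 - (devices up / devices known), across all shards.
        expr: 1 - sum(job:up:sum) / sum(job:up:count) > 0.1
        for: 5m
```

With this split the global Prometheus scrapes two series per job from each
scraping server, rather than reading every raw series on each evaluation.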

-- 
Brian Brazil
www.robustperception.io
