Typically, the recommended way to scale is to shard by failure domain, then by function. For example, we run different Prometheus servers for different networks within our production environment. Within each network we also shard by function: one Prometheus for monitoring databases, one for application metrics, and one catch-all. This allows for better isolation between teams managing different services. We use Thanos as an aggregation layer on top of these distributed Prometheus servers.
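To make the "failure domain first, then function" idea concrete, here is a minimal Python sketch of how targets could be mapped to shards. It is only an illustration, not how we actually generate our configs: the shard naming scheme, the `KNOWN_FUNCTIONS` set, and the example target addresses are all hypothetical.

```python
from collections import defaultdict

# Hypothetical function groups that get a dedicated Prometheus per network.
# Anything not listed here lands on the per-network catch-all server.
KNOWN_FUNCTIONS = {"database", "application"}


def shard_for(network: str, function: str) -> str:
    """Pick a shard name: first by failure domain (network), then by function."""
    group = function if function in KNOWN_FUNCTIONS else "catchall"
    return f"prometheus-{network}-{group}"


def assign_targets(targets):
    """Group (address, network, function) tuples by the shard that scrapes them."""
    shards = defaultdict(list)
    for address, network, function in targets:
        shards[shard_for(network, function)].append(address)
    return dict(shards)


if __name__ == "__main__":
    # Made-up targets, purely for illustration.
    example_targets = [
        ("10.0.1.10:9104", "net-a", "database"),
        ("10.0.1.20:8080", "net-a", "application"),
        ("10.0.1.30:9100", "net-a", "node"),
        ("10.0.2.10:9104", "net-b", "database"),
    ]
    for shard, addresses in sorted(assign_targets(example_targets).items()):
        print(shard, addresses)
```

Each resulting shard name would correspond to one Prometheus server with its own scrape config, and an aggregation layer such as Thanos would sit on top of all of them for querying.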
Without more information on your network/monitoring architecture, I can't say specifically how best to shard your setup.

Yes, the Rule service works just fine; we are using it in production.

On Thu, May 28, 2020 at 6:29 AM Rajesh Reddy Nachireddi <[email protected]> wrote:

> Thanks Aliaksandr and Ben.
>
> Do you have any suggestions for autoscaling Prometheus or VictoriaMetrics?
>
> Currently, we have allocated 200GB of RAM for each Prometheus instance, but we are not sure when it will fill up due to high-cardinality OOM issues.
>
> @Ben Kochie <[email protected]> - Regarding the ruler in Thanos, is it a production-ready component?
>
> @Aliaksandr Valialkin <[email protected]> - Is there any document that discusses the pros and cons of remote_read vs remote_write in Prometheus vs VictoriaMetrics?
>
> Regards,
> Rajesh
>
> On Thu, May 28, 2020 at 12:39 AM Aliaksandr Valialkin <[email protected]> wrote:
>
>> Take a look also at the following projects:
>>
>> * Promxy <https://github.com/jacksontj/promxy> - it allows executing alerts over multiple Prometheus instances. See these docs <https://github.com/jacksontj/promxy/blob/master/README.md#how-do-i-use-alertingrecording-rules-in-promxy> for details.
>> * VictoriaMetrics <https://github.com/VictoriaMetrics/VictoriaMetrics> + vmalert <https://github.com/VictoriaMetrics/VictoriaMetrics/tree/master/app/vmalert>. Multiple Prometheus instances may write data into a centralized VictoriaMetrics via the remote_write API, then vmalert may be used for alerting on top of all the collected metrics in VictoriaMetrics.
>>
>> On Wed, May 27, 2020 at 7:46 PM Rajesh Reddy Nachireddi <[email protected]> wrote:
>>
>>> Hi Ben,
>>>
>>> Does the latest version of Cortex/Thanos support alerting with multiple shards of Prometheus?
>>> Thanos Ruler wasn't ready for production to evaluate expressions across the Prometheus instances. Do we have any document or blog about this?
>>>
>>> Thanks,
>>> Rajesh
>>>
>>> On Tue, May 26, 2020 at 11:37 AM Ben Kochie <[email protected]> wrote:
>>>
>>>> This is probably a case where you would want to look into Thanos or Cortex to provide a larger aggregation layer on top of multiple Prometheus servers.
>>>>
>>>> On Sun, May 17, 2020 at 11:53 AM Rajesh Reddy Nachireddi <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Basically, we have a large networking setup with 10k devices. We are hitting 1M metrics every second from just 20% of the devices, so we have 5 Prometheus instances and one global Prometheus, which uses remote read to handle alert rule evaluations, plus Thanos Querier for visualisation in Grafana.
>>>>>
>>>>> We have segregated the devices by specific device IP ranges, one range per Prometheus instance.
>>>>>
>>>>> So we have one aggregator that uses remote read from all the individual Prometheus instances.
>>>>>
>>>>> 1. Will remote read cause an issue with respect to loading the large time series over the wire every 1 min?
>>>>> 2. Is it CPU or memory intensive?
>>>>>
>>>>> What is the best design strategy to handle this scale and alerting across the devices or metrics?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Rajesh
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CAEyhnp%2BfG8YvciR4-30D%2BzsDzg_kF%2BKkJUavdbyGCxoz-97q_A%40mail.gmail.com <https://groups.google.com/d/msgid/prometheus-users/CAEyhnp%2BfG8YvciR4-30D%2BzsDzg_kF%2BKkJUavdbyGCxoz-97q_A%40mail.gmail.com?utm_medium=email&utm_source=footer>.
>>
>> --
>> Best Regards,
>>
>> Aliaksandr Valialkin, CTO VictoriaMetrics

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CABbyFmqd%3DpvKraD7ikE0-xCRohuYVCgSew6rhfyopJC8VTe4rQ%40mail.gmail.com.

