Re: [prometheus-developers] Introduce the concept of scrape Priority for Targets

Chris Marchbanks Thu, 30 Jul 2020 06:43:25 -0700

I do think we need a better way to detect when we are overloaded,
especially with respect to memory usage, and have defined behavior for how
we handle backpressure in those cases. The current experience of entering
OOM loops is frustrating, and makes it hard to debug as you can't query
anything to see what caused the extra load. HA is also not helpful in this
case as both instances will have similar data and OOM at the same time.


Perhaps after general overloading/backpressure is defined, higher level
ideas such as priority can be introduced, but I also agree that it might be
best to just run multiple instances.
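
To make the flapping problem in Julien's quoted message below concrete, here
is a minimal Python sketch of one common damping technique, hysteresis with
two thresholds. Nothing here is Prometheus code: `PriorityGate`,
`count_flips`, the 0.90/0.60 marks and the normalised `load` signal are all
assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PriorityGate:
    """Decides whether low priority targets may be scraped.

    Hypothetical helper, not part of Prometheus. `load` is assumed to be
    some normalised overload signal in [0, 1], for example memory in use
    divided by the memory limit.
    """

    high_water: float = 0.90  # start shedding at or above this load
    low_water: float = 0.60   # resume only once load is at or below this
    shedding: bool = False

    def update(self, load: float) -> bool:
        """Feed one load sample; return True if low priority scrapes may run."""
        if self.shedding:
            if load <= self.low_water:
                self.shedding = False
        elif load >= self.high_water:
            self.shedding = True
        return not self.shedding


def count_flips(gate: PriorityGate, samples: Sequence[float]) -> int:
    """Count how often the gate changes its decision over the samples."""
    flips = 0
    previous = not gate.shedding
    for load in samples:
        allowed = gate.update(load)
        if allowed != previous:
            flips += 1
        previous = allowed
    return flips
```

With both marks set to 0.90 (a single threshold), a load that alternates
between 0.92 and 0.80 flips the decision on every sample. With marks at 0.90
and 0.60 it flips once and then stays in the shedding state. This only damps
the oscillation. It does not answer when it is actually safe to resume,
because the load observed while shedding is lower precisely because the low
priority targets are not being scraped.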

Chris

On Thu, Jul 30, 2020 at 3:19 AM Julien Pivotto <roidelapl...@prometheus.io>
wrote:

> The problem is not so much priorities themselves; it is all the questions
> and confusion around them:
>
> - When do we decide we are overloaded?
> - What do we do for the low priority targets?
>
> and more importantly:
>
> - When do we decide that we can scrape the low priority targets again?
>
> How to avoid:
>
> High load -> stop low scrapes
> -> Normal load (because we do not scrape low priorities) -> restart low
> scrapes
> -> High load -> stop low scrapes
> -> Normal load (because we do not scrape low priorities) -> restart low
> scrapes
> -> High load -> stop low scrapes
> -> Normal load (because we do not scrape low priorities) -> restart low
> scrapes
>
>
> Overall, these do not seem like easy questions.
>
> On 30 Jul 10:10, Bartłomiej Płotka wrote:
> > Yes, it looks like having many scrapers would solve this, with Thanos on
> > top for query aggregation. However, given the overhead of operating
> > TSDB instances like Prometheus (e.g. maintaining persistent volumes),
> > I would still like to see a longer-term solution: better multi-tenant
> > support (isolation of each tenant's scrapes) within the scrape engine.
> > One alternative is dynamic relabelling configured from outside, as seen
> > here:
> > https://blog.freshtracks.io/bomb-squad-automatic-detection-and-suppression-of-prometheus-cardinality-explosions-62ca8e02fa32
> > I think with good monitoring of Prometheus health we could implement a
> > "sidecar" applying such priorities dynamically as well. That would be
> > good for a start, maybe (:
> >
> > In the meantime, the separate scraper looks like the way to go.
> >
> > Kind Regards,
> > Bartek
> >
> > On Thu, 30 Jul 2020 at 10:01, Lili Cosic <cosicl...@gmail.com> wrote:
> >
> > > Thanks, everyone, for the replies! The official message seems to be to
> > > use a Prometheus instance per tenant/priority if you want to have
> > > multiple tenants in your environment.
> > >
> > > Kind regards,
> > > Lili
> > >
> > > On Thursday, 30 July 2020 10:44:59 UTC+2, Ben Kochie wrote:
> > >>
> > >> I'm with Brian and Julien on this.
> > >>
> > >> Multi-tenancy is not really something we want to solve in Prometheus.
> > >> This is a concern for higher-level systems like Kubernetes. Prometheus
> > >> is designed to be distributed. If you have targets with different
> > >> needs, they need to have separate Prometheus instances.
> > >>
> > >> This is also why we have things like Thanos and Cortex as aggregation
> > >> layers.
> > >>
> > >> Similar to why we have said we don't plan to implement IO limits, this
> > >> is a scheduling concern, out of scope for Prometheus.
> > >>
> > >> On Thu, Jul 30, 2020, 10:31 Frederic Branczyk <fbra...@gmail.com>
> wrote:
> > >>
> > >>> That's only effective in limiting the number of targets. The point
> > >>> here is to selectively scrape those with a higher priority, based on
> > >>> backpressure of the system as a whole.
> > >>>
> > >>> On Wed, 22 Jul 2020 at 17:00, Julien Pivotto <
> roidel...@prometheus.io>
> > >>> wrote:
> > >>>
> > >>>> On 22 Jul 16:47, Frederic Branczyk wrote:
> > >>>> > In practice even that can still be problematic. You only know
> > >>>> > that Prometheus has a problem when everything fails; the point is
> > >>>> > to keep things alive well enough for more critical components.
> > >>>> >
> > >>>> > On Wed, 22 Jul 2020 at 16:38, Julien Pivotto <
> roidel...@prometheus.io
> > >>>> >
> > >>>> > wrote:
> > >>>> >
> > >>>> > > On 22 Jul 16:36, Frederic Branczyk wrote:
> > >>>> > > > It's unclear how that helps, can you help me understand?
> > >>>> > >
> > >>>> > > - job_name: highprio
> > >>>> > >   relabel_configs:
> > >>>> > >   - target_label: job
> > >>>> > >     replacement: pods
> > >>>> > >   - source_labels: [__meta_pod_priority]
> > >>>> > >     regex: high
> > >>>> > >     action: keep
> > >>>>
> > >>>> highprio job will always be scraped.
> > >>>>
> > >>>> > > - job_name: lowprio
> > >>>> > >   relabel_configs:
> > >>>> > >   - target_label: job
> > >>>> > >     replacement: pods
> > >>>> > >   - source_labels: [__meta_pod_priority]
> > >>>> > >     regex: high
> > >>>> > >     action: drop
> > >>>> > >   target_limit: 1000
> > >>>> > >
> > >>>> > > >
> > >>>> > > > On Wed, 22 Jul 2020 at 16:34, Julien Pivotto <
> > >>>> roidel...@prometheus.io
> > >>>> > > >
> > >>>> > > > wrote:
> > >>>> > > >
> > >>>> > > > > On 22 Jul 16:32, Frederic Branczyk wrote:
> > >>>> > > > > > Can you explain what you mean by two jobs? Do you mean
> > >>>> > > > > > two scrape configs?
> > >>>> > > > >
> > >>>> > > > > Yes.
> > >>>> > > > >
> > >>>> > > > > >
> > >>>> > > > > > On Wed, 22 Jul 2020 at 11:40, Julien Pivotto <
> > >>>> > > roidel...@prometheus.io
> > >>>> > > > > >
> > >>>> > > > > > wrote:
> > >>>> > > > > >
> > >>>> > > > > > > On 22 Jul 02:35, Lili Cosic wrote:
> > >>>> > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > > > On Wednesday, 22 July 2020 11:23:00 UTC+2, Brian
> Brazil
> > >>>> wrote:
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > On Wed, 22 Jul 2020 at 10:18, Julien Pivotto <
> > >>>> > > > > roidel...@prometheus.io
> > >>>> > > > > > > > > <javascript:>> wrote:
> > >>>> > > > > > > > >
> > >>>> > > > > > > > >> On 22 Jul 02:14, Lili Cosic wrote:
> > >>>> > > > > > > > >> > Only now have I seen in the docs that I am
> > >>>> > > > > > > > >> > supposed to start any discussion here first,
> > >>>> > > > > > > > >> > before opening an issue. Sorry about that! :)
> > >>>> > > > > > > > >> >
> > >>>> > > > > > > > >> > Currently there is no way for one target to
> > >>>> > > > > > > > >> > have a higher scrape priority than another.
> > >>>> > > > > > > > >> > Even if you set target limits and sample
> > >>>> > > > > > > > >> > limits, you can still overestimate what your
> > >>>> > > > > > > > >> > setup can handle, and in that case you want
> > >>>> > > > > > > > >> > the higher priority targets to keep being
> > >>>> > > > > > > > >> > scraped rather than have the entire
> > >>>> > > > > > > > >> > Prometheus fail. This would need to be based
> > >>>> > > > > > > > >> > on the inability to ingest into the TSDB at
> > >>>> > > > > > > > >> > the current scrape rate: if that is hit, the
> > >>>> > > > > > > > >> > priority class would take effect and only the
> > >>>> > > > > > > > >> > highest priority targets would be scraped, at
> > >>>> > > > > > > > >> > the expense of lower priority ones. Another
> > >>>> > > > > > > > >> > option, which might be simpler, would be to
> > >>>> > > > > > > > >> > have a global limit on how much Prometheus
> > >>>> > > > > > > > >> > can handle, based on perf testing.
> > >>>> > > > > > > > >> >
> > >>>> > > > > > > > >> > This would be treated as a last resort, and
> > >>>> > > > > > > > >> > there would definitely be a need for a high
> > >>>> > > > > > > > >> > severity alert to inform the admin that
> > >>>> > > > > > > > >> > something went terribly wrong. But because we
> > >>>> > > > > > > > >> > would still be able to ingest Prometheus
> > >>>> > > > > > > > >> > metrics, for example, if they are in a higher
> > >>>> > > > > > > > >> > priority class, alerting would still be
> > >>>> > > > > > > > >> > possible.
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >> Hi,
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >> I think that limiting the number of targets you
> > >>>> > > > > > > > >> scrape is already a last resort. I don't think we
> > >>>> > > > > > > > >> would need a second line of defense.
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > I agree with Julien here. If you've gotten to this
> > >>>> > > > > > > > > point you're already seriously overloaded, and
> > >>>> > > > > > > > > prioritising individual targets is just
> > >>>> > > > > > > > > rearranging the deckchairs at that point.
> > >>>> > > > > > > > >
> > >>>> > > > > > > > >
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >> You can achieve this priority by setting up 2
> > >>>> > > > > > > > >> jobs, one which is limited and one which is not,
> > >>>> > > > > > > > >> and using relabeling to decide which target goes
> > >>>> > > > > > > > >> in which job.
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > Or more generally, one Prometheus for the
> > >>>> > > > > > > > > important targets and another for the less
> > >>>> > > > > > > > > important and riskier targets.
> > >>>> > > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > > > I get your point completely, Brian, and agree to some
> > >>>> > > > > > > > degree, but people are still going to be setting up a
> > >>>> > > > > > > > multi-tenant Prometheus, which then causes the
> > >>>> > > > > > > > problems I mentioned above. Even within the riskier
> > >>>> > > > > > > > targets there will be some that are more important
> > >>>> > > > > > > > than others for users. I think we should still strive
> > >>>> > > > > > > > to make a single shared Prometheus as safe as
> > >>>> > > > > > > > possible. If the answer is not the priority class I
> > >>>> > > > > > > > suggested, I am open to other ideas!
> > >>>> > > > > > >
> > >>>> > > > > > > Then 2 jobs are the answer, one unlimited and one
> > >>>> > > > > > > limited.
> > >>>> > > > > > >
> > >>>> > > > > > > The target_limit is already a pretty advanced use case.
> > >>>> > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > Brian
> > >>>> > > > > > > > >
> > >>>> > > > > > > > >
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >> >
> > >>>> > > > > > > > >> > We could model this on something like PriorityClass
> > >>>> > > > > > > > >> > <https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass>
> > >>>> > > > > > > > >> > from Kubernetes, but I am open to other suggestions.
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >> That could be used in relabeling as I said.
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >> >
> > >>>> > > > > > > > >> > I am open to other suggestions, or maybe there is
> > >>>> > > > > > > > >> > something like this but I missed it. The main
> > >>>> > > > > > > > >> > purpose is to ensure there are protection
> > >>>> > > > > > > > >> > mechanisms in place, so any ideas and suggestions
> > >>>> > > > > > > > >> > are welcome!
> > >>>> > > > > > > > >> >
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >> regards,
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >> > Thanks and kind regards,
> > >>>> > > > > > > > >> > Lili
> > >>>> > > > > > > > >> >
> > >>>> > > > > > > > >> > --
> > >>>> > > > > > > > >> > You received this message because you are
> subscribed
> > >>>> to the
> > >>>> > > > > Google
> > >>>> > > > > > > > >> Groups "Prometheus Developers" group.
> > >>>> > > > > > > > >> > To unsubscribe from this group and stop receiving
> > >>>> emails
> > >>>> > > from
> > >>>> > > > > it,
> > >>>> > > > > > > send
> > >>>> > > > > > > > >> an email to
> > >>>> > > prometheus-developers+unsubscr...@googlegroups.com
> > >>>> > > > > > > > >> <javascript:>.
> > >>>> > > > > > > > >> > To view this discussion on the web visit
> > >>>> > > > > > > > >>
> > >>>> > > > > > >
> > >>>> > > > >
> > >>>> > >
> > >>>>
> https://groups.google.com/d/msgid/prometheus-developers/30df615e-5420-4bdf-9cb7-2790ef19d520o%40googlegroups.com
> > >>>> > > > > > > > >> .
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >> --
> > >>>> > > > > > > > >> Julien Pivotto
> > >>>> > > > > > > > >> @roidelapluie
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >>
> > >>>> > > > > > > > >
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > --
> > >>>> > > > > > > > > Brian Brazil
> > >>>> > > > > > > > > www.robustperception.io
> > >>>> > > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > >
> > >>>> > > > > > >
> > >>>> > > > > > > --
> > >>>> > > > > > > Julien Pivotto
> > >>>> > > > > > > @roidelapluie
> > >>>> > > > > > >
> > >>>> > > > > > >
> > >>>> > > > > >
> > >>>> > > > >
> > >>>> > > > > --
> > >>>> > > > > Julien Pivotto
> > >>>> > > > > @roidelapluie
> > >>>> > > > >
> > >>>> > > >
> > >>>> > >
> > >>>> > > --
> > >>>> > > Julien Pivotto
> > >>>> > > @roidelapluie
> > >>>> > >
> > >>>> >
> > >>>>
> > >>>> --
> > >>>> Julien Pivotto
> > >>>> @roidelapluie
> > >>>>
> > >
> >
>
> --
> Julien Pivotto
> @roidelapluie
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-developers+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-developers/CANVFovUBD8v-pWd_aMK_HuheTLCwZ1nzh%2BSRZDrs%2BP1EjBJc1A%40mail.gmail.com.
