>
> I'd like to add a new aspect to the proposal: there might be other
> solutions that would be useful in the case of a large number of topics
> in a Pulsar cluster.
> Rate limiting on the /metrics endpoint doesn't sound like the correct
> approach.
>
> When there's a huge amount of metrics, instead of scraping the metrics,
> it could be more useful to ingest the metrics into Prometheus using the
> "Remote write API".
>

One thing I love about these mailing lists is that you keep learning about
new features - I didn't know Prometheus could now receive metrics via
remote write, very nice.

The main issue raised in this discussion thread is that a very large
response (say 300MB) is sent over the network too fast, thereby "jamming"
the network for the broker's normal produce/consume traffic and creating
latency. Remote write is semantically the same: you push those 300MB from
your server to Prometheus, so you end up with the same problem. So I guess
rate limiting would be needed in both cases, pull and push.

I guess my main question here is: does the problem described happen to
readers of this mailing list in clusters with >10k-100k topics per broker?
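
For reference, from a quick look at the spec and talk linked below, the
push side would roughly look like the following hypothetical sketch. The
encodeWriteRequest() helper is made up here and stands in for the protobuf
WriteRequest serialization the spec defines; the receiver URL assumes
Prometheus runs with its remote-write receiver enabled, and snappy-java is
assumed for compression:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.xerial.snappy.Snappy;

// Hypothetical sketch, not a proposed implementation.
public class RemoteWritePushSketch {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // receiverUrl would be something like http://prometheus:9090/api/v1/write
    public static void push(String receiverUrl, byte[] writeRequestProto) throws Exception {
        // The remote write spec requires the protobuf payload to be snappy-compressed.
        byte[] compressed = Snappy.compress(writeRequestProto);

        HttpRequest request = HttpRequest.newBuilder(URI.create(receiverUrl))
                .header("Content-Encoding", "snappy")
                .header("Content-Type", "application/x-protobuf")
                .header("X-Prometheus-Remote-Write-Version", "0.1.0")
                .POST(HttpRequest.BodyPublishers.ofByteArray(compressed))
                .build();

        HttpResponse<Void> response = CLIENT.send(request, HttpResponse.BodyHandlers.discarding());
        if (response.statusCode() >= 300) {
            throw new IllegalStateException("Remote write failed: HTTP " + response.statusCode());
        }
    }
}
// Usage (encodeWriteRequest() is the hypothetical serializer mentioned above):
// RemoteWritePushSketch.push("http://prometheus:9090/api/v1/write", encodeWriteRequest(samples));

Either way, the same hundreds of megabytes of samples still leave the
broker's network interface, whether they are pulled or pushed.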



On Mon, Aug 29, 2022 at 1:41 PM Lari Hotari <lhot...@apache.org> wrote:

> Good reiteration of the problem and good points, Asaf.
>
> I'd like to add a new aspect to the proposal: there might be other
> solutions that would be useful in the case of a large number of topics
> in a Pulsar cluster.
> Rate limiting on the /metrics endpoint doesn't sound like the correct
> approach.
>
> When there's a huge amount of metrics, instead of scraping the metrics,
> it could be more useful to ingest the metrics into Prometheus using the
> "Remote write API".
> There's a recording of a talk explaining remote write in
> https://www.youtube.com/watch?v=vMeCyX3Y3HY .
> The specification is
> https://docs.google.com/document/d/1LPhVRSFkGNSuU1fBd81ulhsCPR4hkSZyyBj1SZ8fWOM/edit#
> The benefit of this could be that the /metrics endpoint wouldn't be a
> bottleneck and there wouldn't be a need to do any hacks to support a high
> number of metrics.
> There might be a need to route the metrics for different namespaces/topics
> to different destinations. This could be handled in the implementation
> that uses the Remote write API for pushing metrics.
>
> Regards,
>
> -Lari
>
>
> On Mon, Aug 29, 2022 at 1:12 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>
> > Hi Jiuming,
> >
> > I would reiterate the problem statement to make it clear (at least for
> > me):
> >
> > There are cases where a very large number of topics (> 10k per broker)
> > exists and is used in a Pulsar cluster. Those topics usually have
> > multiple producers and multiple consumers.
> > There are metrics at topic granularity and also at topic/producer and
> > topic/consumer granularity.
> > When that happens, the number of unique metrics is extremely high, which
> > causes the response size of the /metrics endpoint (the Prometheus
> > Exposition Format endpoint) to be substantially large - 200MB - 500MB.
> >
> > Every time the metrics are scraped (every 30 sec or 1 min), the network
> > gets saturated by the /metrics response, thereby causing latency for
> > messages produced or consumed from that broker.
> >
> > The solution proposed is to throttle the /metrics response based on a
> > pre-configured rate limit.
> >
> > Points to consider for this discussion from the participants:
> >
> > 1. Did you happen to experience such difficulties in your clusters?
> > 2. When that happened, did you also experience a bottleneck on the TSDB,
> > be it in metrics ingestion or querying?
> >
> > Thanks,
> >
> > Asaf
> >
> >
> > On Thu, Aug 18, 2022 at 7:40 PM Jiuming Tao <jm...@streamnative.io.invalid>
> > wrote:
> >
> > > bump
> > >
> > > Jiuming Tao <jm...@streamnative.io> wrote on Mon, Aug 8, 2022 at 18:19:
> > >
> > > > Hi Pulsar community,
> > > >
> > > > When exposing metrics data that has a very large size, it will lead
> > > > to:
> > > > 1. A sudden increase in network usage
> > > > 2. Rising pub/sub latency
> > > >
> > > > To resolve these problems, I've opened a PR:
> > > > https://github.com/apache/pulsar/pull/16452
> > > >
> > > > Please feel free to help review and discuss it.
> > > >
> > > > Thanks,
> > > > Tao Jiuming
> > > >
> > >
> >
>
