Good reiteration of the problem and good points, Asaf.

I'd like to add a new aspect to the proposal: there might be other
solutions that would be useful when there is a large number of topics in a
Pulsar cluster.
Rate limiting on the /metrics endpoint doesn't sound like the correct
approach.

When there's a huge amount of metrics, instead of scraping them, it
could be more useful to ingest the metrics into Prometheus using the "Remote
write API".
There's a recording of a talk explaining remote write at
https://www.youtube.com/watch?v=vMeCyX3Y3HY .
The specification is
https://docs.google.com/document/d/1LPhVRSFkGNSuU1fBd81ulhsCPR4hkSZyyBj1SZ8fWOM/edit#
.
The benefit of this could be that the /metrics endpoint wouldn't be a
bottleneck, and there wouldn't be a need for any hacks to support a high
number of metrics.
There might be a need to route the metrics for different namespaces/topics to
different destinations. This could be handled in the implementation that
uses the Remote write API for pushing metrics.
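
To make this more concrete, here's a rough, hypothetical sketch of what a
broker-side push exporter could look like. None of the class names, endpoint
URLs or routing rules below exist in Pulsar; the sketch assumes the
protobuf-encoded prometheus.WriteRequest payload is produced elsewhere (e.g.
from generated protobuf classes), and it just shows the remote write
mechanics: a snappy-compressed protobuf body POSTed with the remote write
headers, plus a simple per-namespace routing table.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;

import org.xerial.snappy.Snappy;

/**
 * Hypothetical sketch only: pushes metrics to Prometheus-compatible
 * remote write endpoints instead of exposing one huge /metrics response.
 */
public class RemoteWriteSketch {

    // Hypothetical routing table: namespace prefix -> remote write endpoint.
    private static final Map<String, String> ROUTES = Map.of(
            "tenant-a/", "http://prometheus-a:9090/api/v1/write",
            "tenant-b/", "http://prometheus-b:9090/api/v1/write");

    private final HttpClient client = HttpClient.newHttpClient();

    /**
     * Pushes one batch of samples belonging to a single namespace.
     * serializedWriteRequest is assumed to be a protobuf-encoded
     * prometheus.WriteRequest built elsewhere -- that part is omitted here.
     */
    public void push(String namespace, byte[] serializedWriteRequest)
            throws IOException, InterruptedException {
        // Pick a destination based on the namespace, with a default fallback.
        String endpoint = ROUTES.entrySet().stream()
                .filter(e -> namespace.startsWith(e.getKey()))
                .map(Map.Entry::getValue)
                .findFirst()
                .orElse("http://prometheus-default:9090/api/v1/write");

        // Remote write requires snappy-compressed protobuf bodies.
        byte[] compressed = Snappy.compress(serializedWriteRequest);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .header("Content-Encoding", "snappy")
                .header("Content-Type", "application/x-protobuf")
                .header("X-Prometheus-Remote-Write-Version", "0.1.0")
                .POST(HttpRequest.BodyPublishers.ofByteArray(compressed))
                .build();

        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());
        if (response.statusCode() >= 300) {
            throw new IOException("Remote write failed: " + response.statusCode());
        }
    }
}

The point of the sketch is that the broker would control batching, pacing and
routing of the pushed metrics, instead of scrapers pulling one very large
/metrics response at once.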

Regards,

-Lari


On Mon, Aug 29, 2022 at 1:12 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:

> Hi Jiuming,
>
> I would reiterate the problem statement to make it clear (at least for me):
>
> There are cases where a very large number of topics (> 10k per
> broker) is used in Pulsar. Those topics usually have multiple
> producers and multiple consumers.
> There are metrics at topic granularity, and also at
> topic/producer and topic/consumer granularity.
> When that happens, the number of unique metrics is extremely high, which
> causes the response size of the /metrics endpoint (the Prometheus Exposition
> Format endpoint) to be substantial - 200MB - 500MB.
>
> Every time the metrics are scraped (every 30 sec or 1 min), network usage
> surges due to the /metrics response, thereby adding latency to messages
> produced or consumed on that broker.
>
> The solution proposed is to throttle the /metrics response based on a
> pre-configured rate limit.
>
> Points to consider for this discussion from the participants:
>
> 1. Did you happen to experience such difficulties in your clusters?
> 2. When that happened, did you also experience a bottleneck on the TSDB
> side, be it in metrics ingestion or querying?
>
> Thanks,
>
> Asaf
>
>
> On Thu, Aug 18, 2022 at 7:40 PM Jiuming Tao <jm...@streamnative.io.invalid>
> wrote:
>
> > bump
> >
> > Jiuming Tao <jm...@streamnative.io> wrote on Mon, Aug 8, 2022 at 18:19:
> >
> > > Hi Pulsar community,
> > >
> > > When the exposed metrics data is very large, it will lead to:
> > > 1. A sudden increase in network usage
> > > 2. Increased pub/sub latency
> > >
> > > To resolve these problems, I've opened a PR:
> > > https://github.com/apache/pulsar/pull/16452
> > >
> > > Please feel free to review and discuss it.
> > >
> > > Thanks,
> > > Tao Jiuming
> > >
> >
>
