Re: [DISCUSS] PIP-264: Enhanced OTel-based metric system

Asaf Mesika Sun, 07 May 2023 09:53:21 -0700

On Sun, May 7, 2023 at 4:23 PM Yunze Xu <y...@streamnative.io.invalid>
wrote:


> I'm excited to learn much more about metrics when I started reading
> this proposal. But I became more and more frustrated when I found
> there is still too much content left even if I've already spent much
> time reading this proposal. I'm wondering how much time did you expect
> reviewers to read through this proposal? I just recalled the
> discussion you started before [1]. Did you expect each PMC member that
> gives his/her +1 to read only parts of this proposal?
>

I estimated that a reviewer needs around 2 hours.
I hate that it is so long, but I simply couldn't find a way to shrink it
further. I also consulted my colleagues, including Matteo, and we
couldn't see a way to scope it down.
Why? Because once you begin this journey, you need to know how it's going
to end.
What I ended up doing is putting all the details crucial for review in the
High Level Design section.
It's still a big, hefty section, but I don't think I, or anyone else,
should change Pulsar so invasively without first laying out the full
extent of the change.

I don't think it's wise to read only parts of it.
I made my very best effort to minimize it, but the scope is simply big. I'm
open to suggestions, but making them requires reading the whole PIP :)

Thanks a lot Yunze for dedicating any time to it.




>
> Let's talk back to the proposal, for now, what I mainly learned and
> are concerned about mostly are:
> 1. Pulsar has many ways to expose metrics. It's not unified and confusing.
> 2. The current metrics system cannot support a large amount of topics.
> 3. It's hard for plugin authors to integrate metrics. (For example,
> KoP [2] integrates metrics by implementing the
> PrometheusRawMetricsProvider interface and it indeed needs much work)
>
> Regarding the 1st issue, this proposal chooses OpenTelemetry (OTel).
>
> Regarding the 2nd issue, I scrolled to the "Why OpenTelemetry?"
> section. It's still frustrating to see no answer. Eventually, I found
>

OpenTelemetry isn't the solution for a large number of topics.
The solution is described in the
"Aggregate and Filtering to solve cardinality issues" section.
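
To make the idea concrete, here is a minimal sketch of aggregating a
per-topic metric into topic groups and then filtering what gets exported.
It is only an illustration under stated assumptions: the class, the method
names, and the "strip the trailing -N suffix" grouping rule are all
hypothetical and are not taken from the PIP or from Pulsar code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from the PIP or from Pulsar.
final class TopicGroupAggregation {

    // Sums a per-topic value into one value per group, so the number of
    // exported series follows the group count instead of the topic count.
    static Map<String, Long> aggregate(Map<String, Long> perTopic,
                                       Function<String, String> groupOf) {
        Map<String, Long> perGroup = new LinkedHashMap<>();
        perTopic.forEach((topic, value) ->
                perGroup.merge(groupOf.apply(topic), value, Long::sum));
        return perGroup;
    }

    // Keeps only the series whose key passes the filter.
    static Map<String, Long> filter(Map<String, Long> series,
                                    Predicate<String> keep) {
        Map<String, Long> kept = new LinkedHashMap<>();
        series.forEach((key, value) -> {
            if (keep.test(key)) {
                kept.put(key, value);
            }
        });
        return kept;
    }

    // Assumed grouping rule: everything before the last '-' names the group.
    static String groupOf(String topic) {
        int dash = topic.lastIndexOf('-');
        return dash < 0 ? topic : topic.substring(0, dash);
    }

    public static void main(String[] args) {
        Map<String, Long> messagesIn = new LinkedHashMap<>();
        messagesIn.put("orders-0", 10L);
        messagesIn.put("orders-1", 5L);
        messagesIn.put("billing-0", 7L);

        Map<String, Long> grouped =
                aggregate(messagesIn, TopicGroupAggregation::groupOf);
        System.out.println(grouped);
        System.out.println(filter(grouped, group -> group.equals("orders")));
    }
}
```

Three topics collapse into two group-level series here; with 100k topics
and a few hundred groups, the reduction is what makes export manageable.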



> the explanation in the "What we need to fix in OpenTelemetry -
> Performance" section. It seems that we still need some enhancements in
> OTel. In other words, currently OTel is not ready for resolving all
> these issues listed in the proposal but we believe it will.
>

Let me rephrase "believe": we are working together with the OTel
maintainers to get it there, yes.
I am open to any other suggestions.



>
> As for the 3rd issue, from the "Integrating with Pulsar Plugins"
> section, the plugin authors still need to implement the new OTel
> interfaces. Is it much easier than using the existing ways to expose
> metrics? Could metrics still be easily integrated with Grafana?
>

Yes, it's way easier.
Basically, you get the objects of a full-fledged metrics library: Meter,
Gauge, Histogram, Counter.
No more Raw Metrics Provider and writing UTF-8 bytes in Prometheus format.
You get namespacing for free through the Meter name and version.
It's way better than the current solution and any other library.
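
For contrast, here is a rough sketch of the kind of work that hand-writing
the Prometheus text format involves for a single counter: the TYPE line,
label ordering, label-value escaping, and UTF-8 encoding. It uses only the
JDK; the class name, method names, and the metric name are hypothetical
and are not KoP's or Pulsar's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of manual Prometheus text rendering, JDK only.
final class HandWrittenPrometheus {

    // Renders one counter sample, preceded by its TYPE line, as UTF-8 bytes.
    static byte[] renderCounter(String name, Map<String, String> labels,
                                long value) {
        StringBuilder sb = new StringBuilder();
        sb.append("# TYPE ").append(name).append(" counter\n");
        sb.append(name);
        if (!labels.isEmpty()) {
            sb.append('{');
            boolean first = true;
            // Sort labels so the output is stable between scrapes.
            for (Map.Entry<String, String> e
                    : new TreeMap<>(labels).entrySet()) {
                if (!first) {
                    sb.append(',');
                }
                first = false;
                sb.append(e.getKey())
                  .append("=\"")
                  .append(escape(e.getValue()))
                  .append('"');
            }
            sb.append('}');
        }
        sb.append(' ').append(value).append('\n');
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Label values must escape backslash, double quote, and line feed.
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n");
    }

    public static void main(String[] args) {
        byte[] out = renderCounter("plugin_requests_total",
                Map.of("topic", "t1"), 3);
        System.out.print(new String(out, StandardCharsets.UTF_8));
    }
}
```

With a metrics library, all of this formatting is the exporter's job; the
plugin only creates an instrument and records values on it.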


>
> That's all I am concerned about at the moment. I understand, and
> appreciate that you've spent much time studying and explaining all
> these things. But, this proposal is still too huge.
>

I appreciate your effort a lot!



>
> [1] https://lists.apache.org/thread/04jxqskcwwzdyfghkv4zstxxmzn154kf
> [2]
> https://github.com/streamnative/kop/blob/master/kafka-impl/src/main/java/io/streamnative/pulsar/handlers/kop/stats/PrometheusMetricsProvider.java
>
> Thanks,
> Yunze
>
> On Sun, May 7, 2023 at 5:53 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
> >
> > I'm very appreciative for feedback from multiple pulsar users and devs on
> > this PIP, since it has dramatic changes suggested and quite extensive
> > positive change for the users.
> >
> >
> > On Thu, Apr 27, 2023 at 7:32 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I'm very excited to release a PIP I've been working on in the past 11
> > > months, which I think will be immensely valuable to Pulsar, which I
> > > like so much.
> > >
> > > PIP: https://github.com/apache/pulsar/issues/20197
> > >
> > > I'm quoting here the preface:
> > >
> > > === QUOTE START ===
> > >
> > > Roughly 11 months ago, I started working on solving the biggest issue
> > > with Pulsar metrics: the lack of ability to monitor a pulsar broker
> > > with a large topic count: 10k, 100k, and future support of 1M. This
> > > started by mapping the existing functionality and then enumerating all
> > > the problems I saw (all documented in this doc
> > > <https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing>
> > > ).
> > >
> > > This PIP is a parent PIP. It aims to gradually solve (using sub-PIPs)
> > > all the current metric system's problems and provide the ability to
> > > monitor a broker with a large topic count, which is currently lacking.
> > > As a parent PIP, it will describe each problem and its solution at a
> > > high level, leaving fine-grained details to the sub-PIPs. The parent
> > > PIP ensures all solutions align and does not contradict each other.
> > >
> > > The basic building block to solve the monitoring ability of large topic
> > > count is aggregating internally (to topic groups) and adding
> > > fine-grained filtering. We could have shoe-horned it into the existing
> > > metric system, but we thought adding that to a system already ingrained
> > > with many problems would be wrong and hard to do gradually, as so many
> > > things will break. This is why the second-biggest design decision
> > > presented here is consolidating all existing metric libraries into a
> > > single one - OpenTelemetry <https://opentelemetry.io/>. The parent PIP
> > > will explain why OpenTelemetry was chosen out of existing solutions and
> > > why it far exceeds all other options. I’ve been working closely with
> > > the OpenTelemetry community in the past eight months: brain-storming
> > > this integration, and raising issues, in an effort to remove serious
> > > blockers to make this migration successful.
> > >
> > > I made every effort to summarize this document so that it can be
> > > concise yet clear. I understand it is an effort to read it and, more
> > > so, provide meaningful feedback on such a large document; hence I’m
> > > very grateful for each individual who does so.
> > >
> > > I think this design will help improve the user experience immensely,
> > > so it is worth the time spent reading it.
> > >
> > >
> > > === QUOTE END ===
> > >
> > >
> > > Thanks!
> > >
> > > Asaf Mesika
> > >
>
