Re: [DISCUSS] PIP-264: Enhanced OTel-based metric system

Asaf Mesika Wed, 10 May 2023 00:52:46 -0700

On Tue, May 9, 2023 at 9:33 PM Yunze Xu <y...@streamnative.io.invalid>
wrote:


> > Basically you have a full fledged metrics library objects: Meter, Gauge,
> Histogram, Counter.
>
> It sounds good, but not so attractive. Currently KoP implements its
> own metrics library objects. So after that, we need to leverage the
> similar classes from OTel.
>
Yes, well, that's the cost associated with the benefit, not the benefit
itself :)
You can see we have many goals to solve in this PIP, among them some
serious pain people suffer today as users:
* Inability to observe 10k topics per broker and above (One of the key
advantages Pulsar has is many topics).
* Very expensive metrics. 100 UTS per topic. That's expensive, even for 1k
topics per broker.

So the PIP goal is to solve those (and many more).
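
To put rough numbers on the second point, here is a back-of-the-envelope
sketch in plain Java. It takes the 100 UTS per topic figure above (reading
UTS as unique time series) and shows how aggregating to topic groups
shrinks the series count. The grouping rule (group by everything before
the last '/' in the topic name) and all names below are assumptions for
illustration only, not the PIP's actual design:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class CardinalitySketch {
    // Figure quoted in this thread: about 100 time series per topic.
    static final long SERIES_PER_TOPIC = 100;

    // Series one broker exposes when every topic is reported individually.
    static long seriesPerBroker(long topics) {
        return topics * SERIES_PER_TOPIC;
    }

    // Hypothetical grouping rule: everything before the last '/' in the name.
    static String groupOf(String topic) {
        int idx = topic.lastIndexOf('/');
        return idx < 0 ? topic : topic.substring(0, idx);
    }

    // Sums a per-topic value (for example, messages in) into per-group totals.
    static Map<String, Long> aggregateByGroup(Map<String, Long> perTopic) {
        Map<String, Long> perGroup = new TreeMap<>();
        perTopic.forEach((topic, value) ->
                perGroup.merge(groupOf(topic), value, Long::sum));
        return perGroup;
    }

    public static void main(String[] args) {
        System.out.println(seriesPerBroker(1_000));   // 100000
        System.out.println(seriesPerBroker(10_000));  // 1000000

        Map<String, Long> msgIn = new LinkedHashMap<>();
        msgIn.put("persistent://tenant/ns-a/orders-1", 10L);
        msgIn.put("persistent://tenant/ns-a/orders-2", 20L);
        msgIn.put("persistent://tenant/ns-b/audit", 5L);

        // Three per-topic values collapse into two per-group values.
        System.out.println(aggregateByGroup(msgIn));
    }
}
```

The series count then scales with the number of groups rather than the
number of topics, which is the whole point of aggregating before export.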
The cost is that we need to make some heavy breaking changes; among them,
Pulsar plugin authors such as KoP (you) will need to spend time migrating
their code. You are correct.

The attractive part is solving the user pains I described.
For existing plugin authors, OTel is not attractive, yes.
For future plugin authors, IMO, it is very attractive, since it removes a
lot of the work they need to do for something as basic in today's world as
metrics.
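
As a rough illustration of the kind of work I mean, this is approximately
what hand-rendering a single counter in the Prometheus text exposition
format looks like in plain Java. It is a hypothetical sketch: the class,
method, and metric names are made up, and it is not KoP's or Pulsar's
actual code. With a metrics library, a plugin would just record a value on
a Counter and leave the formatting to the exporter:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

public class RawPrometheusSketch {
    // Escapes a label value per the Prometheus text format:
    // backslash, double quote, and line feed must be escaped.
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n");
    }

    // Renders one counter sample, preceded by its TYPE line.
    static String renderCounter(String name, Map<String, String> labels, long value) {
        StringBuilder sb = new StringBuilder();
        sb.append("# TYPE ").append(name).append(" counter\n");
        sb.append(name);
        if (!labels.isEmpty()) {
            sb.append('{');
            boolean first = true;
            // Sort labels so the output is stable between scrapes.
            for (Map.Entry<String, String> e : new TreeMap<>(labels).entrySet()) {
                if (!first) {
                    sb.append(',');
                }
                sb.append(e.getKey()).append("=\"").append(escape(e.getValue())).append('"');
                first = false;
            }
            sb.append('}');
        }
        sb.append(' ').append(value).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical metric name and label, for illustration only.
        String text = renderCounter("example_requests_total", Map.of("topic", "t1"), 42);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.print(text);
        System.out.println(utf8.length + " bytes");
    }
}
```

Every plugin that exposes metrics this way has to get naming, escaping,
and type lines right on its own.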

Is the cost worth it? That is what I'm trying to figure out from multiple
people's feedback.



> I want to talk a little more beyond that. IIUC, this proposal wants to
> replace the current metrics systems with the OTel. But for most
> developers and maintainers, the most important thing that they cared
> about might be how many changes could it bring? For example, currently
> the Grafana dashboards have been widely used. How many changes could
> it bring? Do users need to learn completely different dashboards? I
> asked this question before but it's not answered. Then I found the
> "Breaking changes" section. So many breaking changes are usually not
> acceptable.
>

The dashboards will not change in how they look or in their semantics.
Each panel remains.
The changes are internal to each panel: the queries will change, since the
metric names will change slightly.

For the users, they will import the new dashboard, if they used it as is.
If someone created a custom dashboard, yes, they will have to invest some
time to upgrade it.
I think it's 1 hour tops to make the fixes.

I will edit the PIP to clarify that.

Dashboards are mostly a user issue, so why do you think it's related to
developers and maintainers?

Regarding "so many breaking changes are usually not acceptable": I'm new
to this community, hence I raise that here.
Do you find the amount of breaking changes not worth the huge benefit to
the users of Pulsar?
Do you have any suggestion for obtaining the same benefit with fewer
breaking changes?

Please bear in mind, all changes are happening in a separate OTel layer,
*co-existing* with the current metric system layer.
I'm not breaking anything until you make the switch.



> I see you listed a lot of problems for the current design. I think
> each of them needs a PIP or at least a PR to resolve if a breaking
> change could be made. Why not solve them one by one in Pulsar?
>
That's precisely what I wrote in the PIP:
* It's a master PIP.
* Many sections will turn into sub PIPs

Meaning, each problem I mentioned would be solved one by one (in Pulsar, of
course).

The reason for introducing this PIP (the master PIP) is to make sure we
first have agreement from the community of developers and users before
we go and spend such a huge amount of work. The second reason is to ensure
all the sub-PIPs align, so that we don't find out after 1 year of work
that we have hit a wall which we can't pass. The master PIP gives you that
guarantee.




> Thanks,
> Yunze
>
> On Mon, May 8, 2023 at 12:53 AM Asaf Mesika <asaf.mes...@gmail.com> wrote:
> >
> > On Sun, May 7, 2023 at 4:23 PM Yunze Xu <y...@streamnative.io.invalid>
> > wrote:
> >
> > > I'm excited to learn much more about metrics when I started reading
> > > this proposal. But I became more and more frustrated when I found
> > > there is still too much content left even if I've already spent much
> > > time reading this proposal. I'm wondering how much time did you expect
> > > reviewers to read through this proposal? I just recalled the
> > > discussion you started before [1]. Did you expect each PMC member that
> > > gives his/her +1 to read only parts of this proposal?
> > >
> >
> > I estimated around 2 hours needed for a reviewer.
> > I hate it being so long, but I simply couldn't find a way to downsize it
> > more. Furthermore, I consulted with my colleagues including Matteo, but
> we
> > couldn't see a way to scope it down.
> > Why? Because once you begin this journey, you need to know how it's going
> > to end.
> > What I ended up doing, is writing all the crucial details for review in
> the
> > High Level Design section.
> > It's still a big, hefty section, but I don't think I can step out or let
> > anyone else change Pulsar so invasively without the full extent of the
> > change.
> >
> > I don't think it's wise to read parts.
> > I did my very best effort to minimize it, but the scope is simply big.
> Open
> > for suggestions, but it requires reading all the PIP :)
> >
> > Thanks a lot Yunze for dedicating any time to it.
> >
> >
> >
> >
> > >
> > > Let's talk back to the proposal, for now, what I mainly learned and
> > > are concerned about mostly are:
> > > 1. Pulsar has many ways to expose metrics. It's not unified and
> confusing.
> > > 2. The current metrics system cannot support a large amount of topics.
> > > 3. It's hard for plugin authors to integrate metrics. (For example,
> > > KoP [2] integrates metrics by implementing the
> > > PrometheusRawMetricsProvider interface and it indeed needs much work)
> > >
> > > Regarding the 1st issue, this proposal chooses OpenTelemetry (OTel).
> > >
> > > Regarding the 2nd issue, I scrolled to the "Why OpenTelemetry?"
> > > section. It's still frustrating to see no answer. Eventually, I found
> > >
> >
> > OpenTelemetry isn't the solution for large amount of topic.
> > The solution is described at
> > "Aggregate and Filtering to solve cardinality issues" section.
> >
> >
> >
> > > the explanation in the "What we need to fix in OpenTelemetry -
> > > Performance" section. It seems that we still need some enhancements in
> > > OTel. In other words, currently OTel is not ready for resolving all
> > > these issues listed in the proposal but we believe it will.
> > >
> >
> > Let me rephrase "believe" --> we work together with the maintainers to do
> > it, yes.
> > I am open for any other suggestion.
> >
> >
> >
> > >
> > > As for the 3rd issue, from the "Integrating with Pulsar Plugins"
> > > section, the plugin authors still need to implement the new OTel
> > > interfaces. Is it much easier than using the existing ways to expose
> > > metrics? Could metrics still be easily integrated with Grafana?
> > >
> >
> > Yes, it's way easier.
> > Basically you have a full fledged metrics library objects: Meter, Gauge,
> > Histogram, Counter.
> > No more Raw Metrics Provider, writing UTF-8 bytes in Prometheus format.
> > You get namespacing for free with Meter name and version.
> > It's way better than current solution and any other library.
> >
> >
> > >
> > > That's all I am concerned about at the moment. I understand, and
> > > appreciate that you've spent much time studying and explaining all
> > > these things. But, this proposal is still too huge.
> > >
> >
> > I appreciate your effort a lot!
> >
> >
> >
> > >
> > > [1] https://lists.apache.org/thread/04jxqskcwwzdyfghkv4zstxxmzn154kf
> > > [2]
> > >
> https://github.com/streamnative/kop/blob/master/kafka-impl/src/main/java/io/streamnative/pulsar/handlers/kop/stats/PrometheusMetricsProvider.java
> > >
> > > Thanks,
> > > Yunze
> > >
> > > On Sun, May 7, 2023 at 5:53 PM Asaf Mesika <asaf.mes...@gmail.com>
> wrote:
> > > >
> > > > I'm very appreciative for feedback from multiple pulsar users and
> devs on
> > > > this PIP, since it has dramatic changes suggested and quite extensive
> > > > positive change for the users.
> > > >
> > > >
> > > > On Thu, Apr 27, 2023 at 7:32 PM Asaf Mesika <asaf.mes...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I'm very excited to release a PIP I've been working on in the past
> 11
> > > > > months, which I think will be immensely valuable to Pulsar, which I
> > > like so
> > > > > much.
> > > > >
> > > > > PIP: https://github.com/apache/pulsar/issues/20197
> > > > >
> > > > > I'm quoting here the preface:
> > > > >
> > > > > === QUOTE START ===
> > > > >
> > > > > Roughly 11 months ago, I started working on solving the biggest
> issue
> > > with
> > > > > Pulsar metrics: the lack of ability to monitor a pulsar broker
> with a
> > > large
> > > > > topic count: 10k, 100k, and future support of 1M. This started by
> > > mapping
> > > > > the existing functionality and then enumerating all the problems I
> saw
> > > (all
> > > > > documented in this doc
> > > > > <
> > >
> https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing
> > > >
> > > > > ).
> > > > >
> > > > > This PIP is a parent PIP. It aims to gradually solve (using
> sub-PIPs)
> > > all
> > > > > the current metric system's problems and provide the ability to
> > > monitor a
> > > > > broker with a large topic count, which is currently lacking. As a
> > > parent
> > > > > PIP, it will describe each problem and its solution at a high
> level,
> > > > > leaving fine-grained details to the sub-PIPs. The parent PIP
> ensures
> > > all
> > > > > solutions align and does not contradict each other.
> > > > >
> > > > > The basic building block to solve the monitoring ability of large
> topic
> > > > > count is aggregating internally (to topic groups) and adding
> > > fine-grained
> > > > > filtering. We could have shoe-horned it into the existing metric
> > > system,
> > > > > but we thought adding that to a system already ingrained with many
> > > problems
> > > > > would be wrong and hard to do gradually, as so many things will
> break.
> > > This
> > > > > is why the second-biggest design decision presented here is
> > > consolidating
> > > > > all existing metric libraries into a single one - OpenTelemetry
> > > > > <https://opentelemetry.io/>. The parent PIP will explain why
> > > > > OpenTelemetry was chosen out of existing solutions and why it far
> > > exceeds
> > > > > all other options. I’ve been working closely with the OpenTelemetry
> > > > > community in the past eight months: brain-storming this
> integration,
> > > and
> > > > > raising issues, in an effort to remove serious blockers to make
> this
> > > > > migration successful.
> > > > >
> > > > > I made every effort to summarize this document so that it can be
> > > concise
> > > > > yet clear. I understand it is an effort to read it and, more so,
> > > provide
> > > > > meaningful feedback on such a large document; hence I’m very
> grateful
> > > for
> > > > > each individual who does so.
> > > > >
> > > > > I think this design will help improve the user experience
> immensely,
> > > so it
> > > > > is worth the time spent reading it.
> > > > >
> > > > >
> > > > > === QUOTE END ===
> > > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Asaf Mesika
> > > > >
> > >
>
