On Sun, May 7, 2023 at 4:23 PM Yunze Xu <y...@streamnative.io.invalid> wrote:
> I'm excited to learn much more about metrics when I started reading
> this proposal. But I became more and more frustrated when I found
> there is still too much content left even if I've already spent much
> time reading this proposal. I'm wondering how much time did you expect
> reviewers to read through this proposal? I just recalled the
> discussion you started before [1]. Did you expect each PMC member that
> gives his/her +1 to read only parts of this proposal?

I estimated that a reviewer needs around 2 hours. I hate that it is so
long, but I simply couldn't find a way to shrink it further. I also
consulted my colleagues, including Matteo, and we couldn't see a way to
scope it down. Why? Because once you begin this journey, you need to know
how it is going to end. What I ended up doing is putting all the details
crucial for review in the High Level Design section. It's still a big,
hefty section, but I don't think I, or anyone else, should change Pulsar
so invasively without presenting the full extent of the change. I don't
think it's wise to read only parts of it. I did my very best to minimize
it, but the scope is simply big. I'm open to suggestions, but they require
reading the whole PIP :)

Thanks a lot, Yunze, for dedicating time to it.

> Let's talk back to the proposal, for now, what I mainly learned and
> are concerned about mostly are:
> 1. Pulsar has many ways to expose metrics. It's not unified and confusing.
> 2. The current metrics system cannot support a large amount of topics.
> 3. It's hard for plugin authors to integrate metrics. (For example,
> KoP [2] integrates metrics by implementing the
> PrometheusRawMetricsProvider interface and it indeed needs much work)
>
> Regarding the 1st issue, this proposal chooses OpenTelemetry (OTel).
>
> Regarding the 2nd issue, I scrolled to the "Why OpenTelemetry?"
> section. It's still frustrating to see no answer. Eventually, I found

OpenTelemetry isn't the solution for a large number of topics.
The solution is described in the "Aggregate and Filtering to solve
cardinality issues" section.

> the explanation in the "What we need to fix in OpenTelemetry -
> Performance" section. It seems that we still need some enhancements in
> OTel. In other words, currently OTel is not ready for resolving all
> these issues listed in the proposal but we believe it will.

Let me rephrase "believe": we are working together with the maintainers
to get it there, yes. I am open to any other suggestions.

> As for the 3rd issue, from the "Integrating with Pulsar Plugins"
> section, the plugin authors still need to implement the new OTel
> interfaces. Is it much easier than using the existing ways to expose
> metrics? Could metrics still be easily integrated with Grafana?

Yes, it's way easier. Basically, you get the objects of a full-fledged
metrics library: Meter, Gauge, Histogram, Counter. No more Raw Metrics
Provider and writing UTF-8 bytes in Prometheus format. You get namespacing
for free through the Meter name and version. It's way better than the
current solution and any other library.

> That's all I am concerned about at the moment. I understand, and
> appreciate that you've spent much time studying and explaining all
> these things. But, this proposal is still too huge.

I appreciate your effort a lot!

> [1] https://lists.apache.org/thread/04jxqskcwwzdyfghkv4zstxxmzn154kf
> [2] https://github.com/streamnative/kop/blob/master/kafka-impl/src/main/java/io/streamnative/pulsar/handlers/kop/stats/PrometheusMetricsProvider.java
>
> Thanks,
> Yunze
>
> On Sun, May 7, 2023 at 5:53 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
> >
> > I'm very appreciative for feedback from multiple pulsar users and devs on
> > this PIP, since it has dramatic changes suggested and quite extensive
> > positive change for the users.
> >
> > On Thu, Apr 27, 2023 at 7:32 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I'm very excited to release a PIP I've been working on in the past 11
> > > months, which I think will be immensely valuable to Pulsar, which I
> > > like so much.
> > >
> > > PIP: https://github.com/apache/pulsar/issues/20197
> > >
> > > I'm quoting here the preface:
> > >
> > > === QUOTE START ===
> > >
> > > Roughly 11 months ago, I started working on solving the biggest issue
> > > with Pulsar metrics: the lack of ability to monitor a pulsar broker
> > > with a large topic count: 10k, 100k, and future support of 1M. This
> > > started by mapping the existing functionality and then enumerating all
> > > the problems I saw (all documented in this doc
> > > <https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing>).
> > >
> > > This PIP is a parent PIP. It aims to gradually solve (using sub-PIPs)
> > > all the current metric system's problems and provide the ability to
> > > monitor a broker with a large topic count, which is currently lacking.
> > > As a parent PIP, it will describe each problem and its solution at a
> > > high level, leaving fine-grained details to the sub-PIPs. The parent
> > > PIP ensures all solutions align and does not contradict each other.
> > >
> > > The basic building block to solve the monitoring ability of large topic
> > > count is aggregating internally (to topic groups) and adding
> > > fine-grained filtering. We could have shoe-horned it into the existing
> > > metric system, but we thought adding that to a system already ingrained
> > > with many problems would be wrong and hard to do gradually, as so many
> > > things will break. This is why the second-biggest design decision
> > > presented here is consolidating all existing metric libraries into a
> > > single one - OpenTelemetry <https://opentelemetry.io/>.
> > > The parent PIP will explain why OpenTelemetry was chosen out of
> > > existing solutions and why it far exceeds all other options. I’ve been
> > > working closely with the OpenTelemetry community in the past eight
> > > months: brain-storming this integration, and raising issues, in an
> > > effort to remove serious blockers to make this migration successful.
> > >
> > > I made every effort to summarize this document so that it can be
> > > concise yet clear. I understand it is an effort to read it and, more
> > > so, provide meaningful feedback on such a large document; hence I’m
> > > very grateful for each individual who does so.
> > >
> > > I think this design will help improve the user experience immensely,
> > > so it is worth the time spent reading it.
> > >
> > > === QUOTE END ===
> > >
> > > Thanks!
> > >
> > > Asaf Mesika