Apurva007 commented on code in PR #21635: URL: https://github.com/apache/pulsar/pull/21635#discussion_r1419708455
##########
pip/pip-320.md:
##########
@@ -0,0 +1,241 @@
+# PIP-320 OpenTelemetry Scaffolding
+
+# Background knowledge
+
+## PIP-264 - parent PIP titled "Enhanced OTel-based metric system"
+[PIP-264](https://github.com/apache/pulsar/pull/21080), which can also be viewed [here](pip-264.md), describes in high
+level a plan to greatly enhance Pulsar metric system by replacing it with [OpenTelemetry](https://opentelemetry.io/).
+You can read in the PIP the numerous existing problems PIP-264 solves. Among them are:
+- Control which metrics to export per topic/group/namespace via the introduction of a metric filter configuration
+- Reduce the immense metrics cardinality due to high topic count (One of Pulsar great features), by introducing
+the concept of Metric Group - a group of topics for metric purposes. Metric reporting will also be done to a
+group granularity. 100k topics can be downsized to 1k groups. The dynamic metric filter configuration would allow
+the user to control which metric group to un-filter.
+- Proper histogram exporting
+- Clean-up codebase clutter, by relying on a single industry standard API, SDK and metrics protocol (OTLP) instead of
+existing mix of home-brew libraries and hard coded Prometheus exporter.
+- any many more
+
+You can [here](pip-264.md#why-opentelemetry) why OpenTelemetry was chosen.
+
+## OpenTelemetry
+Since OpenTelemetry (a.k.a. OTel) is an emerging industry standard, there are plenty of good articles, videos and
+documentation about it. In this very short paragraph I'll describe what you need to know about OTel from this PIP
+perspective.
+
+OpenTelemetry is a project aimed to standardize the way we instrument, collect and ship metrics from applications
+to telemetry backends, be it databases (e.g. Prometheus, Cortex, Thanos) or vendors (e.g. Datadog, Logz.io).
+It is divided into API, SDK and Collector:
+- API: interfaces to use to instrument: define a counter, record values to a histogram, etc.
+- SDK: a library, available in many languages, implementing the API, and other important features such as
+reading the metrics and exporting it out to a telemetry backend or OTel Collector.
+- Collector: a lightweight process (application) which can receive or retrieve telemetry, transform it (e.g.
+filter, drop, aggregate) and export it (e.g. send it to various backends). The SDK supports out-of-the-box
+exporting metrics as Prometheus HTTP endpoint or sending them out using OTLP protocol. Many times companies choose to
+ship to the Collector and there ship to their preferred vendors, since each vendor already published their exporter
+plugin to OTel Collector. This makes the SDK exporters very light-weight as they don't need to support any
+vendor. It's also easier for the DevOps team as they can make OTel Collector their responsibility, and have
+application developers only focus on shipping metrics to that collector.
+
+Just to have some context: Pulsar codebase will use the OTel API to create counters / histograms and records values to
+them. So will the Pulsar plugins and Pulsar Function authors. Pulsar itself will be the one creating the SDK
+and using that to hand over an implementation of the API where ever needed in Pulsar. Collector is up to the choice
+of the user, as OTel provides a way to expose the metrics as `/metrics` endpoint on a configured port, so Prometheus
+compatible scrapers can grab it from it directly. They can also send it via OTLP to OTel collector.
+
+## Telemetry layers
+PIP-264 clearly outlined there will be two layers of metrics, collected and exported, side by side: OpenTelemetry
+and the existing metric system - currently exporting in Prometheus. This PIP will explain in detail how it will work.
+The basic premise is that you will be able to enable or disable OTel metrics, alongside the existing Prometheus

Review Comment:
   If Prometheus and OTel metrics can coexist when OTel is enabled, will that increase memory usage?
   If yes, can you please clarify whether, after an initial verification, it is possible to disable Prometheus while OTel is enabled? There is a config called "exposeBundlesMetricsInPrometheus", but I am not sure whether it disables all metrics collection, irrespective of Prometheus and OTel.