lhotari commented on code in PR #24857: URL: https://github.com/apache/pulsar/pull/24857#discussion_r2435570680
##########
pip/pip-446.md:
##########
@@ -0,0 +1,532 @@
+# PIP-446: Support Native OpenTelemetry Tracing in Pulsar Java Client
+
+# Background knowledge
+
+## OpenTelemetry
+
+OpenTelemetry is a vendor-neutral observability framework that provides APIs, SDKs, and tools for collecting distributed traces, metrics, and logs. It has become the industry standard for observability, adopted by major cloud providers and APM vendors.
+
+## Distributed Tracing
+
+Distributed tracing tracks requests as they flow through distributed systems. A **trace** represents the entire journey of a request, composed of multiple **spans**. Each span represents a single operation (e.g., sending a message, processing a request). Spans form parent-child relationships, creating a trace tree that visualizes request flow across services.
+
+## W3C Trace Context
+
+The W3C Trace Context specification defines a standard way to propagate trace context across service boundaries using HTTP headers or message properties:
+- `traceparent`: Contains trace ID, span ID, and trace flags
+- `tracestate`: Contains vendor-specific trace information
+
+## Pulsar Interceptors
+
+Pulsar client interceptors allow users to intercept and modify messages before sending (producer) or after receiving (consumer). They provide hooks for cross-cutting concerns like tracing, metrics, and security.
+
+## Cumulative Acknowledgment
+
+In Pulsar, cumulative acknowledgment allows consumers to acknowledge all messages up to a specific message ID in one operation. This is only available for Failover and Exclusive subscription types where message order is guaranteed. When a message is cumulatively acknowledged, all previous messages on that partition are implicitly acknowledged.
+
+# Motivation
+
+Currently, the Pulsar Java client lacks native support for distributed tracing with OpenTelemetry. While the OpenTelemetry Java Agent can automatically instrument Pulsar clients, there are several limitations:
+
+1. **Agent-only approach**: Users must use the Java Agent, which may not be suitable for all deployment scenarios (e.g., serverless, embedded applications)
+2. **Limited control**: Users cannot easily customize tracing behavior or selectively enable tracing for specific producers/consumers
+3. **Missing first-class support**: Other Apache projects (Kafka, Camel) provide native OpenTelemetry support, making Pulsar less competitive
+4. **Complex setup**: Users must understand agent configuration and classpath setup
+
+Native OpenTelemetry support would:
+- Provide a programmatic API for tracing configuration
+- Enable selective tracing without agent overhead
+- Improve observability in production systems
+- Align Pulsar with modern observability practices
+- Make it easier to diagnose performance issues and message flow
+
+# Goals
+
+## In Scope
+
+1. **Producer tracing**: Create spans for message send operations with automatic trace context injection
+2. **Consumer tracing**: Create spans for message receive/process operations with automatic trace context extraction
+3. **Trace context propagation**: Inject and extract W3C Trace Context via message properties
+4. **Programmatic API**: Enable tracing via `ClientBuilder` API
+5. **Interceptor-based design**: Implement using Pulsar's existing interceptor mechanism
+6. **Cumulative acknowledgment support**: Properly handle span lifecycle for cumulative acks
+7. **Multi-topic consumer support**: Track spans across multiple topic partitions
+8. **Agent compatibility**: Ensure compatibility with OpenTelemetry Java Agent
+9. **Semantic conventions**: Follow OpenTelemetry messaging semantic conventions
+10. **Zero overhead when disabled**: No performance impact when tracing is not enabled

Review Comment:
One additional point to consider in the design is the behavior when sampling is configured (I believe that's the default) and a specific span isn't recorded at all.
In those cases, an implementation can choose to reduce the overhead by omitting certain calls to the OTel tracing API. This is explained in the [Span IsRecording](https://opentelemetry.io/docs/specs/otel/trace/api/#isrecording) and [Sampling](https://opentelemetry.io/docs/specs/otel/trace/sdk/#sampling) documentation. Even when a span isn't sampled, it's still necessary to create child spans to propagate the span context downstream, so optimizations based on `span.isRecording()` might not be very useful in the end. I also noticed that the current OTel API that we have in Pulsar doesn't contain `isRecording`. We should also make sure to upgrade to the latest stable versions of the OTel libraries when implementing this PIP.
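
To illustrate the propagation point, here is a rough sketch that uses only the JDK rather than the OTel API. `TraceParentSketch` and its methods are hypothetical names, not part of Pulsar or OpenTelemetry, and the sketch assumes version `00` `traceparent` values as defined by W3C Trace Context. It shows that a child context still gets a new span ID and carries the same trace ID and flags even when the sampled bit is not set.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper for illustration only; not part of Pulsar or the OTel API. */
final class TraceParentSketch {
    private TraceParentSketch() {}

    /** Returns true when the sampled bit (0x01) is set in the trace-flags field. */
    static boolean isSampled(String traceparent) {
        String[] parts = split(traceparent);
        return (Integer.parseInt(parts[3], 16) & 0x01) != 0;
    }

    /**
     * Builds the traceparent value for a child span: same version, trace ID and
     * flags as the parent, with a newly generated span ID. This happens whether
     * or not the parent is sampled, so downstream services keep the same trace.
     */
    static String childOf(String traceparent) {
        String[] parts = split(traceparent);
        return parts[0] + "-" + parts[1] + "-" + newSpanId() + "-" + parts[3];
    }

    /** Splits a version-00 traceparent into version, trace ID, span ID and flags. */
    private static String[] split(String traceparent) {
        String[] parts = traceparent.split("-");
        if (parts.length != 4
                || parts[0].length() != 2
                || parts[1].length() != 32
                || parts[2].length() != 16
                || parts[3].length() != 2) {
            throw new IllegalArgumentException("Malformed traceparent: " + traceparent);
        }
        return parts;
    }

    /** Generates a random non-zero 64-bit span ID as 16 lowercase hex characters. */
    private static String newSpanId() {
        long id;
        do {
            id = ThreadLocalRandom.current().nextLong();
        } while (id == 0L); // an all-zero span ID is invalid
        return String.format("%016x", id);
    }
}
```

With the real OTel API the SDK does this work itself; the sketch only makes visible why skipping span creation for unsampled messages would break context propagation.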
