lhotari commented on code in PR #24857:
URL: https://github.com/apache/pulsar/pull/24857#discussion_r2435570680


##########
pip/pip-446.md:
##########
@@ -0,0 +1,532 @@
+# PIP-446: Support Native OpenTelemetry Tracing in Pulsar Java Client
+
+# Background knowledge
+
+## OpenTelemetry
+
+OpenTelemetry is a vendor-neutral observability framework that provides APIs, 
SDKs, and tools for collecting distributed traces, metrics, and logs. It has 
become the industry standard for observability, adopted by major cloud 
providers and APM vendors.
+
+## Distributed Tracing
+
+Distributed tracing tracks requests as they flow through distributed systems. 
A **trace** represents the entire journey of a request, composed of multiple 
**spans**. Each span represents a single operation (e.g., sending a message, 
processing a request). Spans form parent-child relationships, creating a trace 
tree that visualizes request flow across services.
+
+## W3C Trace Context
+
+The W3C Trace Context specification defines a standard way to propagate trace 
context across service boundaries using HTTP headers or message properties:
+- `traceparent`: Contains trace ID, span ID, and trace flags
+- `tracestate`: Contains vendor-specific trace information
+
+## Pulsar Interceptors
+
+Pulsar client interceptors allow users to intercept and modify messages before 
sending (producer) or after receiving (consumer). They provide hooks for 
cross-cutting concerns like tracing, metrics, and security.
+
+## Cumulative Acknowledgment
+
+In Pulsar, cumulative acknowledgment allows consumers to acknowledge all 
messages up to a specific message ID in one operation. This is only available 
for Failover and Exclusive subscription types where message order is 
guaranteed. When a message is cumulatively acknowledged, all previous messages 
on that partition are implicitly acknowledged.
+
+# Motivation
+
+Currently, the Pulsar Java client lacks native support for distributed tracing 
with OpenTelemetry. While the OpenTelemetry Java Agent can automatically 
instrument Pulsar clients, there are several limitations:
+
+1. **Agent-only approach**: Users must use the Java Agent, which may not be 
suitable for all deployment scenarios (e.g., serverless, embedded applications)
+2. **Limited control**: Users cannot easily customize tracing behavior or 
selectively enable tracing for specific producers/consumers
+3. **Missing first-class support**: Other Apache projects (Kafka, Camel) 
provide native OpenTelemetry support, making Pulsar less competitive
+4. **Complex setup**: Users must understand agent configuration and classpath 
setup
+
+Native OpenTelemetry support would:
+- Provide a programmatic API for tracing configuration
+- Enable selective tracing without agent overhead
+- Improve observability in production systems
+- Align Pulsar with modern observability practices
+- Make it easier to diagnose performance issues and message flow
+
+# Goals
+
+## In Scope
+
+1. **Producer tracing**: Create spans for message send operations with 
automatic trace context injection
+2. **Consumer tracing**: Create spans for message receive/process operations 
with automatic trace context extraction
+3. **Trace context propagation**: Inject and extract W3C Trace Context via 
message properties
+4. **Programmatic API**: Enable tracing via `ClientBuilder` API
+5. **Interceptor-based design**: Implement using Pulsar's existing interceptor 
mechanism
+6. **Cumulative acknowledgment support**: Properly handle span lifecycle for 
cumulative acks
+7. **Multi-topic consumer support**: Track spans across multiple topic 
partitions
+8. **Agent compatibility**: Ensure compatibility with OpenTelemetry Java Agent
+9. **Semantic conventions**: Follow OpenTelemetry messaging semantic 
conventions
+10. **Zero overhead when disabled**: No performance impact when tracing is not 
enabled

Review Comment:
   One additional point to consider in the design is the behavior when sampling 
is configured (I believe that's the default) and a specific span isn't recorded 
at all. In those cases, an implementation can reduce overhead by skipping calls 
to the OTel tracing API that only matter for recorded spans, such as setting 
attributes or adding events. 
   This is explained in the [Span 
IsRecording](https://opentelemetry.io/docs/specs/otel/trace/api/#isrecording) 
and [Sampling](https://opentelemetry.io/docs/specs/otel/trace/sdk/#sampling) 
documentation.
   Even when a span isn't sampled, child spans still have to be created so that 
the span context propagates downstream, so optimizations based on 
`span.isRecording()` might not be very useful in the end.
   
   I also noticed that the OTel API that we currently have in Pulsar doesn't 
contain `isRecording`. We should also make sure to upgrade to the latest stable 
versions of the OTel libraries when implementing this PIP.
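
   To illustrate the propagation point, here is a minimal sketch using only the 
JDK. It is not the OTel API and not proposed code for the PIP: `TraceParent` and 
its methods are hypothetical names, and validation is simplified. It shows that 
an unsampled `traceparent` (flags `00`) still yields a child context with the 
same trace ID and a new span ID, which is what gets injected into the message 
properties for downstream consumers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Pulsar or OpenTelemetry. */
final class TraceParent {
    // W3C traceparent, version 00: 32 hex trace id, 16 hex parent id, 2 hex flags.
    // Simplification: all-zero ids, which the spec treats as invalid, are not rejected.
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String spanId;
    final int flags;

    TraceParent(String traceId, String spanId, int flags) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.flags = flags;
    }

    /** Parses a traceparent value, e.g. one read from a message property. */
    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header);
        if (!m.matches()) {
            throw new IllegalArgumentException("invalid traceparent: " + header);
        }
        return new TraceParent(m.group(1), m.group(2), Integer.parseInt(m.group(3), 16));
    }

    /** The lowest flag bit carries the sampling decision. */
    boolean isSampled() {
        return (flags & 0x01) != 0;
    }

    /** Child context: trace id and flags are inherited, only the span id changes. */
    TraceParent child(String newSpanId) {
        return new TraceParent(traceId, newSpanId, flags);
    }

    /** Serializes back to the header form that is propagated downstream. */
    String format() {
        return String.format("00-%s-%s-%02x", traceId, spanId, flags);
    }
}
```

   With the real API the child context would come from the tracer rather than a 
hand-built value, but the consequence is the same: a check on `isRecording()` 
can only skip attribute and event work, not span creation or injection.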



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

