[ 
https://issues.apache.org/jira/browse/RATIS-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056669#comment-18056669
 ] 

Xinyu Tan commented on RATIS-2389:
----------------------------------

[~taklwu] Thank you for the proposal — it looks very promising overall.

At the moment, the tracing seems to focus on two high-level spans (client and 
server). As a possible follow-up improvement, it might be worth considering 
modeling retries on the client side as child spans, and capturing more detailed 
critical paths in the server-side read and write workflows. This could help 
provide deeper insights into potential read/write bottlenecks and guide future 
optimizations.
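
As a rough illustration of the retry idea, the sketch below shows one parent span per client request with one child span per attempt. It uses a toy span class rather than the OpenTelemetry API, and every name in it (RetrySpanSketch, ToySpan, callWithRetries) is hypothetical and not taken from the PoC; it only shows the span shape I have in mind.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Toy model of "retries as child spans". NOT the OpenTelemetry API and NOT PoC code. */
public class RetrySpanSketch {

  /** Hypothetical minimal span: a name, an optional parent, a status, and child spans. */
  public static final class ToySpan {
    public final String name;
    public final ToySpan parent;
    public final List<ToySpan> children = new ArrayList<>();
    public String status = "UNSET";

    public ToySpan(String name, ToySpan parent) {
      this.name = name;
      this.parent = parent;
      if (parent != null) {
        parent.children.add(this); // register under the parent so the tree can be inspected
      }
    }
  }

  /**
   * Runs the call up to maxAttempts times. Each attempt gets its own child span
   * under requestSpan, so a trace shows how many retries occurred and which failed.
   */
  public static <T> T callWithRetries(ToySpan requestSpan, int maxAttempts, Callable<T> call)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      ToySpan attemptSpan = new ToySpan("attempt-" + attempt, requestSpan);
      try {
        T result = call.call();
        attemptSpan.status = "OK";
        requestSpan.status = "OK";
        return result;
      } catch (Exception e) {
        attemptSpan.status = "ERROR"; // the failed attempt stays visible in the trace
        last = e;
      }
    }
    requestSpan.status = "ERROR"; // all attempts failed
    throw last;
  }

  public static void main(String[] args) throws Exception {
    ToySpan request = new ToySpan("client-request", null);
    int[] calls = {0};
    // Simulated RPC that fails twice before succeeding.
    callWithRetries(request, 5, () -> {
      if (++calls[0] < 3) {
        throw new java.io.IOException("simulated failure");
      }
      return "reply";
    });
    for (ToySpan child : request.children) {
      System.out.println(child.name + " -> " + child.status);
    }
  }
}
```

With a real tracer, each attempt span would be started with the request span as its parent context and ended in a finally block; the structure above is only meant to show why per-attempt spans make retry cost visible.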

Regarding trace data delivery and configuration, it may be helpful if the 
documentation could further clarify how these aspects are defined and 
configured, such as whether trace data is sent synchronously or asynchronously, 
the batching strategy and batch size, the target host, and whether RPC 
compression can be enabled.

For the current POC, before moving the PR into a formal review state, would it 
be possible to visualize the tracing results using Jaeger and include some 
screenshots or links (for example, showing the read and write paths) in the PR 
or related documentation? This could help reviewers gain a clearer and more 
concrete understanding of the POC’s behavior and effectiveness.

> Implementing Opentelemetry Tracing in Apache Ratis
> --------------------------------------------------
>
>                 Key: RATIS-2389
>                 URL: https://issues.apache.org/jira/browse/RATIS-2389
>             Project: Ratis
>          Issue Type: New Feature
>          Components: client, server
>    Affects Versions: 3.3.0
>            Reporter: Tak-Lon (Stephen) Wu
>            Assignee: Tak-Lon (Stephen) Wu
>            Priority: Minor
>         Attachments: PoC-result-span-detail.png, PoC-result.png
>
>
> This proposal outlines the addition of OpenTelemetry support to Ratis. By 
> instrumenting the full client-side request path, we can empower users and 
> maintainers with the granular data necessary for both long-term performance 
> optimization and proactive daily monitoring.
>  * 1-pager proposal: 
> [https://docs.google.com/document/d/1UKGVqOzkAXqUAJxOz1RHq6fIiO3xqV57eIqi-f9qdE4/edit?tab=t.0#heading=h.5a3u31wlm0n]
>  * PoC: [https://github.com/taklwu/ratis/tree/opentelemetry0129]
> Subtasks
>  * Define the Metadata Field: Modify RaftRpcMessage.proto to include an 
> optional SpanContext field.
>  * Add TraceUtil: Land the utility class in ratis-common, modeled on the 
> equivalent code in HBase.
>  * Create the client span: Introduce the span supplier and CLIENT span hook.
>  * Instrument gRPC on the Server: Start with the gRPC module as it is the 
> most common transport. Instrument the onNext methods (or within the caller) 
> to start/stop spans.
>  * Write a user guide as part of the release.
> Reference
> 1. HBase Tracing with OpenTelemetry, 
> [HBASE-22120|https://issues.apache.org/jira/browse/HBASE-22120]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
