[ https://issues.apache.org/jira/browse/TIKA-4513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated TIKA-4513:
---------------------------------------
    Description: 
Currently, tika-server lacks standardized observability instrumentation, relying on basic logging or custom metrics. This limits our ability to diagnose performance bottlenecks, track request latencies, or correlate failures across distributed deployments (such as those readily provided via tika-helm).

This initiative will implement [OpenTelemetry Java (OTEL)|https://opentelemetry.io/docs/languages/java/] instrumentation in Apache Tika Server to enable comprehensive collection of traces, metrics, and logs. This will improve system observability, allowing better monitoring of request processing, resource usage, and error rates in production environments.

OpenTelemetry Java is stable across all major components (traces, metrics, and logs), as per the official documentation.

OTEL also integrates with tools like Jaeger (tracing), Prometheus (metrics), and ELK (logs), among many others, and facilitates rich visualizations via tools like Grafana.
h4. Rationale
 * {*}Improved Diagnostics{*}: Traces will capture end-to-end request flows 
(e.g., from HTTP ingestion to parser execution), metrics will track throughput 
and error rates, and structured logs will provide context for debugging.
 * {*}Future-Proofing{*}: OpenTelemetry's semantic conventions ensure 
compatibility with evolving observability backends without vendor lock-in.
 * {*}Low Overhead{*}: We can experiment with [zero-code/auto-instrumentation|https://opentelemetry.io/docs/languages/java/instrumentation/#zero-code-java-agent], which minimizes initial code changes and gives us a baseline. We can then build on this to better observe custom Tika logic (e.g., parser chaining).
 * {*}Community Benefits{*}: Enhances Tika's appeal for microservices 
architectures, where observability is critical.

h4. Goals
 * Instrument core Tika Server endpoints (e.g., [/tika|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-TikaResource], [/detect|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-DetectorResource], [/meta|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-MetadataResource]) to emit telemetry data.
 * Support configurable exporters for traces/metrics/logs to common backends 
(e.g., OTLP to a collector).
 * Ensure instrumentation does not degrade performance (<5% overhead target) 
and handles high-load scenarios gracefully.
 * Document setup for users deploying Tika Server.

h4. Acceptance Criteria
 * Tika Server builds and runs with the OpenTelemetry agent attached (e.g., via -javaagent:opentelemetry-javaagent.jar).
 * Sample requests generate traces with spans for key operations (e.g., 
document parsing, MIME detection); verifiable via a tracer exporter (like 
Jaeger).
 * Metrics expose at least: request count, latency histograms, error rates, and 
resource usage (CPU/memory via JVM metrics).
 * Logs are structured and correlated with traces (e.g., via trace/span IDs).
 * Unit/integration tests cover instrumentation (e.g., assert span attributes 
like http.method and content.type).
 * Configuration options added to tika-server.properties for enabling/disabling 
telemetry and setting exporter endpoints.
 * Documentation updated in the Tika wiki with a setup guide, including Docker integration and, subsequently, Tika Helm.
 * Performance benchmarks show <5% overhead under load (e.g., using JMeter or 
k6).
 * No regressions in existing Tika Server functionality.

h4. Tasks

*1. Research and Setup*
 # Review OpenTelemetry Java getting-started guide and instrumentation registry 
for Tika-relevant libraries (e.g., auto-instrumentation for Jetty HTTP server, 
Apache HttpClient).
 # Set up a local dev environment with Tika Server, OpenTelemetry Java agent 
(latest stable release), and a test collector (e.g., [Grafana 
Alloy|https://grafana.com/docs/alloy/latest/] in Docker).
 # Prototype basic trace export for a sample /tika request (see the sketch below).
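
For example, a first prototype run might look roughly like the following, assuming a locally built tika-server-standard jar, the default port 9998, and an OTLP-capable collector already listening on localhost:4317; paths, the version placeholder, and the test document are illustrative:
{code:bash}
# Point the agent at the local collector (values illustrative).
export OTEL_SERVICE_NAME=tika-server
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_TRACES_EXPORTER=otlp
export OTEL_METRICS_EXPORTER=otlp
export OTEL_LOGS_EXPORTER=otlp

# Start Tika Server with the OpenTelemetry Java agent attached.
java -javaagent:./opentelemetry-javaagent.jar \
     -jar tika-server-standard-<version>.jar

# Exercise the /tika endpoint so the agent emits a trace for the request.
curl -T some-test-document.pdf http://localhost:9998/tika
{code}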

*2. Core Instrumentation*
 # Enable auto-instrumentation for HTTP handling and core Tika parsers.
 # Add manual spans for custom logic (e.g., in TikaResource for request routing, parser chain execution); a combined sketch of this and the metrics item follows this list.
 # Implement metrics using the Meter API (e.g., counters for processed documents, gauges for active parsers).
 # Bridge logs to OpenTelemetry (e.g., via the OpenTelemetry Logback appender, io.opentelemetry.instrumentation:opentelemetry-logback-appender-1.0).
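
A rough sketch of what the manual-span and Meter API items could look like; the instrumentation scope name, span name, metric name, and attribute key below are illustrative placeholders rather than agreed conventions, and TikaTelemetry/ParseStep are hypothetical helpers:
{code:java}
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public final class TikaTelemetry {

    private static final Tracer TRACER =
            GlobalOpenTelemetry.getTracer("org.apache.tika.server");
    private static final LongCounter PARSED_DOCS =
            GlobalOpenTelemetry.getMeter("org.apache.tika.server")
                    .counterBuilder("tika.server.documents.parsed")
                    .setDescription("Number of documents parsed")
                    .setUnit("{document}")
                    .build();
    private static final AttributeKey<String> CONTENT_TYPE =
            AttributeKey.stringKey("content.type");

    /** Minimal functional interface standing in for the real parser call. */
    public interface ParseStep<T> {
        T run() throws Exception;
    }

    /** Wraps an arbitrary parse step in a span and bumps the document counter. */
    public static <T> T traceParse(String contentType, ParseStep<T> step) throws Exception {
        Span span = TRACER.spanBuilder("tika.parse")
                .setAttribute(CONTENT_TYPE, contentType)
                .startSpan();
        try (Scope ignored = span.makeCurrent()) {
            T result = step.run();
            PARSED_DOCS.add(1, Attributes.of(CONTENT_TYPE, contentType));
            return result;
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }

    private TikaTelemetry() {}
}
{code}
When running under the agent, spans created this way should appear as children of the agent's HTTP server spans for the same request.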

*3. Configuration and Exporters*
 # Integrate environment variables or properties for exporter config (e.g., 
OTLP endpoint, sampling rate).
 # Support batching and sampling to handle scale (an example configuration follows this list).
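
For instance, the agent's standard autoconfiguration environment variables could drive exporter, sampling, and batching behaviour; the values below are illustrative starting points, not recommendations:
{code:bash}
# OTLP exporter endpoint and protocol (values illustrative).
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc

# Head sampling: keep 10% of traces, honouring any parent decision.
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1

# Batch span processor tuning and metric export interval (milliseconds).
export OTEL_BSP_MAX_QUEUE_SIZE=4096
export OTEL_BSP_SCHEDULE_DELAY=5000
export OTEL_METRIC_EXPORT_INTERVAL=60000
{code}
Each variable also has a system-property equivalent (e.g., otel.traces.sampler), which may be easier to surface through Tika Server's own configuration.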

*4. Testing and Validation*
 # Write tests using the OpenTelemetry SDK's in-memory exporter to assert telemetry output (see the sketch after this list).
 # Load test with [k6|https://grafana.com/docs/k6/latest/]; measure overhead.
 # Edge case testing: error handling, large files, concurrent requests.
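
A sketch of the in-memory exporter pattern using the JUnit 5 extension from opentelemetry-sdk-testing; the span emitted directly in the test body stands in for whatever the instrumented Tika code path will eventually produce:
{code:java}
import static org.junit.jupiter.api.Assertions.assertEquals;

import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.sdk.testing.junit5.OpenTelemetryExtension;
import io.opentelemetry.sdk.trace.data.SpanData;
import java.util.List;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.RegisterExtension;

class TelemetrySmokeTest {

    // Spins up an in-memory SDK whose recorded spans can be read back in assertions.
    @RegisterExtension
    static final OpenTelemetryExtension otelTesting = OpenTelemetryExtension.create();

    @Test
    void parseSpanCarriesContentType() {
        // In the real test this span would come from the instrumented Tika code path;
        // here it is emitted directly to show the assertion pattern.
        Tracer tracer = otelTesting.getOpenTelemetry().getTracer("test");
        Span span = tracer.spanBuilder("tika.parse")
                .setAttribute(AttributeKey.stringKey("content.type"), "application/pdf")
                .startSpan();
        span.end();

        List<SpanData> spans = otelTesting.getSpans();
        assertEquals(1, spans.size());
        assertEquals("tika.parse", spans.get(0).getName());
        assertEquals("application/pdf",
                spans.get(0).getAttributes().get(AttributeKey.stringKey("content.type")));
    }
}
{code}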

*5. Documentation and Release*
 # Update the tika-server README and wiki with an instrumentation guide.
 # Submit a PR and ensure the existing CI/CD remains stable.

h4. Risks and Dependencies
 * {*}Risk{*}: Instrumentation conflicts with existing Tika logging (e.g., 
SLF4J). {_}Mitigation{_}: Use OpenTelemetry's log appenders for correlation 
without disruption.
 * {*}Dependency{*}: Access to latest OpenTelemetry Java release (check via 
Maven Central).
 * {*}Risk{*}: Performance impact in parser-heavy workloads. {_}Mitigation{_}: 
Profile with async spans and configurable sampling.

Two further points:
 # If anyone is interested and would like to work with me on this, please let me know :)
 # I'll likely create a sub-issue for each task so that we can incrementally prove and deliver this larger observability initiative.

> Instrument tika-server
> ----------------------
>
>                 Key: TIKA-4513
>                 URL: https://issues.apache.org/jira/browse/TIKA-4513
>             Project: Tika
>          Issue Type: Improvement
>          Components: instrumentation, tika-server
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
