[
https://issues.apache.org/jira/browse/TIKA-4513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated TIKA-4513:
---------------------------------------
Description:
Currently, tika-server lacks standardized observability instrumentation,
relying on basic logging or custom metrics, which limits our ability to
diagnose performance bottlenecks, track request latencies, or correlate
failures across distributed deployments (which is readily available via
tika-helm).
This initiative will implement [OpenTelemetry Java
(OTEL)|https://opentelemetry.io/docs/languages/java/] instrumentation in the Apache
Tika Server to enable comprehensive collection of traces, metrics, and logs.
This will improve system observability, allowing for better monitoring of
request processing, resource usage, and error rates in a production environment.
OpenTelemetry Java is stable across all major components (traces, metrics, and
logs), as per the official documentation.
What's also nice about OTEL is that it integrates with tools like Jaeger
(tracing), Prometheus (metrics), or ELK (logs) and loads of others. It would
also facilitate rich visualizations via tools like Grafana.
h4. Rationale
* {*}Improved Diagnostics{*}: Traces will capture end-to-end request flows
(e.g., from HTTP ingestion to parser execution), metrics will track throughput
and error rates, and structured logs will provide context for debugging.
* {*}Future-Proofing{*}: OpenTelemetry's semantic conventions ensure
compatibility with evolving observability backends without vendor lock-in.
* {*}Low Overhead{*}: We can experiment with
[zero-code/auto-instrumentation|https://opentelemetry.io/docs/languages/java/instrumentation/#zero-code-java-agent],
which will initially minimize code changes while we develop a baseline. We can
build on this to better observe custom Tika logic (e.g., parser chaining).
* {*}Community Benefits{*}: Enhances Tika's appeal for microservices
architectures, where observability is critical.
h4. Goals
* Instrument core Tika Server endpoints (e.g.,
[/tika|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-TikaResource],
[/detect|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-DetectorResource],
[/meta|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-MetadataResource])
to emit telemetry data.
* Support configurable exporters for traces/metrics/logs to common backends
(e.g., OTLP to a collector).
* Ensure instrumentation does not degrade performance (<5% overhead target)
and handles high-load scenarios gracefully.
* Document setup for users deploying Tika Server.
h4. Acceptance Criteria
* Tika Server builds and runs with OpenTelemetry agent attached (e.g., via
-javaagent:opentelemetry-javaagent.jar).
* Sample requests generate traces with spans for key operations (e.g.,
document parsing, MIME detection); verifiable via a tracing backend (like
Jaeger).
* Metrics expose at least: request count, latency histograms, error rates, and
resource usage (CPU/memory via JVM metrics).
* Logs are structured and correlated with traces (e.g., via trace/span IDs).
* Unit/integration tests cover instrumentation (e.g., assert span attributes
like http.method and content.type).
* Configuration options added to tika-server.properties for enabling/disabling
telemetry and setting exporter endpoints.
* Documentation updated in Tika wiki with setup guide, including Docker
integration and, subsequently, tika-helm.
* Performance benchmarks show <5% overhead under load (e.g., using JMeter or
k6).
* No regressions in existing Tika Server functionality.
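As a rough illustration of the first criterion, the sketch below assembles a JVM command line that attaches the agent. The class name AgentLaunchCommand, the jar paths, and the service name "tika-server" are assumptions made for the example; otel.service.name and otel.exporter.otlp.endpoint are standard OpenTelemetry Java agent properties. Note that JVM options have to appear before -jar.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds a launch command with the OpenTelemetry Java agent attached. */
final class AgentLaunchCommand {

    private AgentLaunchCommand() {}

    /**
     * @param agentJar     path to opentelemetry-javaagent.jar
     * @param serverJar    path to the Tika Server jar (the actual file name is an assumption)
     * @param otlpEndpoint OTLP endpoint of the collector that receives the telemetry
     */
    static List<String> build(String agentJar, String serverJar, String otlpEndpoint) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        // JVM options, including the agent, must precede -jar.
        cmd.add("-javaagent:" + agentJar);
        cmd.add("-Dotel.service.name=tika-server"); // assumed service name
        cmd.add("-Dotel.exporter.otlp.endpoint=" + otlpEndpoint);
        cmd.add("-jar");
        cmd.add(serverJar);
        return cmd;
    }
}
```

The resulting list could be handed to a ProcessBuilder in an integration test, or simply joined with spaces for a Docker entrypoint.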
h4. Tasks
*1. Research and Setup*
# Review OpenTelemetry Java getting-started guide and instrumentation registry
for Tika-relevant libraries (e.g., auto-instrumentation for Jetty HTTP server,
Apache HttpClient).
# Set up a local dev environment with Tika Server, OpenTelemetry Java agent
(latest stable release), and a test collector (e.g., [Grafana
Alloy|https://grafana.com/docs/alloy/latest/] in Docker).
# Prototype basic trace export for a sample /tika request.
*2. Core Instrumentation*
# Enable auto-instrumentation for HTTP handling and core Tika parsers.
# Add manual spans for custom logic (e.g., in TikaResource for request
routing, Parser chain execution).
# Implement metrics using the Meter API (e.g., counters for processed
documents, gauges for active parsers).
# Bridge logs to OpenTelemetry (e.g., via
io.opentelemetry.instrumentation.logback-appender-otel).
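To make the log/trace correlation idea concrete, here is a minimal sketch (not the OpenTelemetry appender itself) that extracts the IDs from a W3C traceparent header so they could be attached to a structured log record. The class TraceContextFields is hypothetical; the key names trace_id, span_id, and trace_flags are chosen to match what the OpenTelemetry logging MDC instrumentation uses, as far as I know. It only handles the four-field header shape.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Tika or OpenTelemetry. */
final class TraceContextFields {

    // W3C Trace Context header: version-traceid-parentid-flags, lowercase hex.
    private static final Pattern TRACEPARENT = Pattern.compile(
            "([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    private TraceContextFields() {}

    /** Returns log fields for a valid header, or an empty map otherwise. */
    static Map<String, String> fromTraceparent(String header) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (header == null) {
            return fields;
        }
        Matcher m = TRACEPARENT.matcher(header.trim());
        if (!m.matches()) {
            return fields;
        }
        String version = m.group(1);
        String traceId = m.group(2);
        String spanId = m.group(3);
        // The spec treats version "ff" and all-zero ids as invalid.
        if (version.equals("ff") || traceId.matches("0+") || spanId.matches("0+")) {
            return fields;
        }
        fields.put("trace_id", traceId);
        fields.put("span_id", spanId);
        fields.put("trace_flags", m.group(4));
        return fields;
    }
}
```

With the agent or a log appender in place this parsing should not be needed in Tika code; the sketch is mainly useful for asserting in tests that log records carry the expected IDs.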
*3. Configuration and Exporters*
# Integrate environment variables or properties for exporter config (e.g.,
OTLP endpoint, sampling rate).
# Support batching and sampling to handle scale.
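One possible shape for the configuration lookup is sketched below. It mirrors the convention used by OpenTelemetry SDK autoconfiguration, where a property such as otel.exporter.otlp.endpoint maps to the environment variable OTEL_EXPORTER_OTLP_ENDPOINT and system properties take precedence over environment variables. The class OtelSettingResolver is hypothetical, and how this would hook into Tika's own configuration is left open.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical resolver for exporter settings; not an existing Tika or OpenTelemetry class. */
final class OtelSettingResolver {

    private OtelSettingResolver() {}

    /** Maps a property name to its environment-variable form: upper case, '.' and '-' become '_'. */
    static String envName(String propertyName) {
        return propertyName.toUpperCase(Locale.ROOT).replace('.', '_').replace('-', '_');
    }

    /** Looks up system properties first, then the environment, then falls back to the default. */
    static String resolve(String propertyName,
                          Map<String, String> systemProps,
                          Map<String, String> env,
                          String defaultValue) {
        String value = systemProps.get(propertyName);
        if (value != null && !value.isEmpty()) {
            return value;
        }
        value = env.get(envName(propertyName));
        if (value != null && !value.isEmpty()) {
            return value;
        }
        return defaultValue;
    }
}
```

Passing the two maps in explicitly, rather than reading System.getenv() inside the method, keeps the lookup easy to unit test.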
*4. Testing and Validation*
# Write tests using OpenTelemetry SDK's in-memory exporter to assert telemetry
output.
# Load test with [k6|https://grafana.com/docs/k6/latest/]; measure overhead.
# Edge case testing: error handling, large files, concurrent requests.
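For the overhead measurement, something along these lines could turn two sets of latency samples (baseline vs. instrumented) into a pass/fail check against the <5% target. This is only a sketch: the class OverheadCheck is hypothetical, the nearest-rank percentile is one of several reasonable definitions, and the samples are assumed to come from a k6 or JMeter export.

```java
import java.util.Arrays;

/** Hypothetical helper for comparing baseline and instrumented load-test runs. */
final class OverheadCheck {

    private OverheadCheck() {}

    /** Nearest-rank percentile of the samples, for p in (0, 100]. */
    static double percentile(double[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Relative overhead of the instrumented value over the baseline, in percent. */
    static double overheadPercent(double baseline, double instrumented) {
        if (baseline <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (instrumented - baseline) * 100.0 / baseline;
    }

    /** True if the overhead at percentile p stays strictly below the budget. */
    static boolean withinBudget(double[] baselineMs, double[] instrumentedMs,
                                double p, double budgetPercent) {
        double base = percentile(baselineMs, p);
        double inst = percentile(instrumentedMs, p);
        return overheadPercent(base, inst) < budgetPercent;
    }
}
```

Comparing at a high percentile (e.g., p95) rather than the mean is a design choice here; tail latency tends to show instrumentation cost more clearly under load.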
*5. Documentation and Release*
# Update TikaServer README and wiki with instrumentation guide.
# Prepare changelog entry and verify build inclusion (e.g., optional
dependency).
h4. Risks and Dependencies
* {*}Risk{*}: Instrumentation conflicts with existing Tika logging (e.g.,
SLF4J). {_}Mitigation{_}: Use OpenTelemetry's log appenders for correlation
without disruption.
* {*}Dependency{*}: Access to latest OpenTelemetry Java release (check via
Maven Central).
* {*}Risk{*}: Performance impact in parser-heavy workloads. {_}Mitigation{_}:
Profile with async spans and configurable sampling.
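To illustrate what configurable sampling buys us, below is a simplified sketch of a trace-ID ratio decision: the low 64 bits of the trace ID are compared against a bound derived from the ratio, so the same trace gets the same decision everywhere. The class RatioSampler is hypothetical and this is not the OpenTelemetry SDK's sampler; in practice the built-in samplers (e.g., otel.traces.sampler=parentbased_traceidratio with otel.traces.sampler.arg) would be used instead.

```java
/** Simplified, hypothetical ratio sampler; not the OpenTelemetry SDK implementation. */
final class RatioSampler {

    private RatioSampler() {}

    /**
     * @param traceIdHex 32-character hex trace ID
     * @param ratio      fraction of traces to keep, between 0.0 and 1.0
     */
    static boolean shouldSample(String traceIdHex, double ratio) {
        if (traceIdHex == null || traceIdHex.length() != 32) {
            throw new IllegalArgumentException("trace ID must be 32 hex characters");
        }
        if (ratio <= 0.0) {
            return false;
        }
        if (ratio >= 1.0) {
            return true;
        }
        // Low 64 bits of the trace ID, with the sign bit cleared to get a non-negative value.
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        long bound = (long) (ratio * Long.MAX_VALUE);
        return low < bound;
    }
}
```

Because the decision depends only on the trace ID, lowering the ratio reduces span volume (and exporter load) without splitting traces across services.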
Two further points:
# If anyone is interested and would like to work with me on this, please let
me know :)
# I'll likely create a sub-issue for each task; that way we can incrementally
prove and deliver this larger observability initiative.
> Instrument tika-server
> ----------------------
>
> Key: TIKA-4513
> URL: https://issues.apache.org/jira/browse/TIKA-4513
> Project: Tika
> Issue Type: Improvement
> Components: instrumentation, tika-server
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
>