[jira] [Commented] (TIKA-4513) Instrument tika-server

Lewis John McGibbney (Jira) Fri, 17 Oct 2025 07:48:12 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030655#comment-18030655
 ]


Lewis John McGibbney commented on TIKA-4513:
--------------------------------------------

Thanks for the suggestions [~tallison] 
{quote}... It might make sense to hold a virtual meetup/google meet to discuss 
staging the work?
{quote}
 
Great to hear that other work is going on with tika-server :). Yes, a virtual 
meeting sounds great so we can share info and get on the same page. Any 
suggestions [~ndipiazza] on a time and date? Thank you.
 
{quote}Would we actually have to change anything in our code? Maybe more 
logging?{quote}
This is a good question and it's really why I plan on creating a video. Let me 
try to provide some clarity.
 
OpenTelemetry (OTEL) is a comprehensive *_observability_* ecosystem, framework 
and toolkit which facilitates the
 * {*}generation{*}: e.g. application instrumentation, logging (using the 
existing slf4j-over-log4j logging framework), etc.,
 * {*}export{*}: the mechanics of routing egress of the generated telemetry 
data to some external resource (a collector), and
 * {*}collection{*}: essentially a proxy that can receive, process, and export 
telemetry data.

OTEL handles telemetry data including
 * {*}logs{*}: which is what we already have in Tika,
 * {*}metrics{*}: which can represent custom instrumented metrics (counters, 
gauges, histograms, etc.) and/or generated from extracting unstructured or 
structured data from log streams, and finally
 * {*}traces{*}: which give us the big picture of what happens when a request 
is made to an application. I also attached the example screenshots I shared on 
GitHub so we have them here. In this case, we can literally trace a single 
request to [http://localhost:9998/tika] and see the resulting spans of that 
trace. This granularity of telemetry data has the potential to unlock many 
benefits for tika-server users.
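
To make the trace/span relationship a bit more concrete, here is a toy model in plain Java. It is deliberately not the OpenTelemetry API, just a sketch of the shape of the data: one trace ID shared by several spans, each span pointing at its parent. The class name TraceSketch, the span names and the durations are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a trace. NOT the OpenTelemetry API, only the shape of the data. */
class TraceSketch {

    /** One timed operation within a trace; parentId is null for the root span. */
    static final class Span {
        final String traceId;
        final String spanId;
        final String parentId;
        final String name;
        final long durationMillis;

        Span(String traceId, String spanId, String parentId, String name, long durationMillis) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentId = parentId;
            this.name = name;
            this.durationMillis = durationMillis;
        }
    }

    /** Returns the direct children of the span with the given ID, in insertion order. */
    static List<Span> childrenOf(List<Span> trace, String spanId) {
        List<Span> children = new ArrayList<>();
        for (Span span : trace) {
            if (spanId.equals(span.parentId)) {
                children.add(span);
            }
        }
        return children;
    }

    /** Hypothetical spans for a single request to /tika (names and timings invented). */
    static List<Span> exampleTrace() {
        String traceId = "trace-1";
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(traceId, "a", null, "PUT /tika", 120));
        spans.add(new Span(traceId, "b", "a", "detect MIME type", 15));
        spans.add(new Span(traceId, "c", "a", "parse document", 90));
        return spans;
    }

    public static void main(String[] args) {
        // Walk the children of the root span, as a trace viewer would.
        for (Span child : childrenOf(exampleTrace(), "a")) {
            System.out.println(child.name + " took " + child.durationMillis + " ms");
        }
    }
}
```

In a real deployment the agent or SDK would create and export these spans; the point here is only that every span carries the same trace ID, which is what lets a backend stitch a single request back together.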

I think our next step should be to meet virtually and definitely record it. 
Here's a tentative agenda.
 * [~tallison] and [~ndipiazza] educate [~lewismc] on current tika-server 
development initiatives e.g. TIKA-3082 and others.
 * [~lewismc] go through TIKA-4513 at an abstract level. The why, purpose, 
intended outcomes, etc.
 * ... if we have time I would be more than happy to demo the PR I submitted.

> Instrument tika-server
> ----------------------
>
>                 Key: TIKA-4513
>                 URL: https://issues.apache.org/jira/browse/TIKA-4513
>             Project: Tika
>          Issue Type: Improvement
>          Components: instrumentation, tika-server
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 4.0.0
>
>
> Currently, tika-server lacks standardized observability instrumentation, 
> relying on basic logging or custom metrics, which limits our ability to 
> diagnose performance bottlenecks, track request latencies, or correlate 
> failures across distributed deployments (which is readily available via 
> tika-helm).
> This initiative will implement [OpenTelemetry Java 
> (OTEL)|https://opentelemetry.io/docs/languages/java/] instrumentation in the Apache 
> Tika Server to enable comprehensive collection of traces, metrics, and logs. 
> This will improve system observability, allowing for better monitoring of 
> request processing, resource usage, and error rates in a production 
> environment.
> OpenTelemetry Java is stable across all major components (traces, metrics and 
> logs), as per the official documentation.
> What's also nice about OTEL is that it integrates with tools like Jaeger 
> (tracing), Prometheus (metrics), or ELK (logs) and loads of others. It would 
> also facilitate rich visualizations via tools like Grafana.
> h4. Rationale
>  * {*}Improved Diagnostics{*}: Traces will capture end-to-end request flows 
> (e.g., from HTTP ingestion to parser execution), metrics will track 
> throughput and error rates, and structured logs will provide context for 
> debugging.
>  * {*}Future-Proofing{*}: OpenTelemetry's semantic conventions ensure 
> compatibility with evolving observability backends without vendor lock-in.
>  * {*}Low Overhead{*}: We can experiment with 
> [zero-code/auto-instrumentation|https://opentelemetry.io/docs/languages/java/instrumentation/#zero-code-java-agent]
>  which will initially minimize code changes to develop a baseline. We can 
> build on this to better observe custom Tika logic (e.g., parser chaining).
>  * {*}Community Benefits{*}: Enhances Tika's appeal for microservices 
> architectures, where observability is critical.
> h4. Goals
>  * Instrument core Tika Server endpoints (e.g., 
> [/tika|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-TikaResource],
>  
> [/detect|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-DetectorResource],
>  
> [/meta|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-MetadataResource])
>  to emit telemetry data.
>  * Support configurable exporters for traces/metrics/logs to common backends 
> (e.g., OTLP to a collector).
>  * Ensure instrumentation does not degrade performance (<5% overhead target) 
> and handles high-load scenarios gracefully.
>  * Document setup for users deploying Tika Server.
> h4. Acceptance Criteria
>  * Tika Server builds and runs with OpenTelemetry agent attached (e.g., via 
> -javaagent:opentelemetry-javaagent.jar).
>  * Sample requests generate traces with spans for key operations (e.g., 
> document parsing, MIME detection); verifiable via a tracer exporter (like 
> Jaeger).
>  * Metrics expose at least: request count, latency histograms, error rates, 
> and resource usage (CPU/memory via JVM metrics).
>  * Logs are structured and correlated with traces (e.g., via trace/span IDs).
>  * Unit/integration tests cover instrumentation (e.g., assert span attributes 
> like http.method and content.type).
>  * Configuration options added to tika-server.properties for 
> enabling/disabling telemetry and setting exporter endpoints.
>  * Documentation updated in Tika wiki with setup guide, including Docker 
> integration... and then Tika Helm.
>  * Performance benchmarks show <5% overhead under load (e.g., using JMeter or 
> k6).
>  * No regressions in existing Tika Server functionality.
> h4. Tasks
> *1. Research and Setup*
>  # Review OpenTelemetry Java getting-started guide and instrumentation 
> registry for Tika-relevant libraries (e.g., auto-instrumentation for Jetty 
> HTTP server, Apache HttpClient).
>  # Set up a local dev environment with Tika Server, OpenTelemetry Java agent 
> (latest stable release), and a test collector (e.g., [Grafana 
> Alloy|https://grafana.com/docs/alloy/latest/] in Docker).
>  # Prototype basic trace export for a sample /tika request.
> *2. Core Instrumentation*
>  # Enable auto-instrumentation for HTTP handling and core Tika parsers.
>  # Add manual spans for custom logic (e.g., in TikaResource for request 
> routing, Parser chain execution).
>  # Implement metrics using the Meter API (e.g., counters for processed 
> documents, gauges for active parsers).
>  # Bridge logs to OpenTelemetry (e.g., via 
> io.opentelemetry.instrumentation.logback-appender-otel).
> *3. Configuration and Exporters*
>  # Integrate environment variables or properties for exporter config (e.g., 
> OTLP endpoint, sampling rate).
>  # Support batching and sampling to handle scale.
> *4. Testing and Validation*
>  # Write tests using OpenTelemetry SDK's in-memory exporter to assert 
> telemetry output.
>  # Load test with [k6|https://grafana.com/docs/k6/latest/]; measure overhead.
>  # Edge case testing: error handling, large files, concurrent requests.
> *5. Documentation and Release*
>  # Update TikaServer README and wiki with instrumentation guide.
>  # Submit PR and ensure existing CI/CD remains stable.
> h4. Risks and Dependencies
>  * {*}Risk{*}: Instrumentation conflicts with existing Tika logging (e.g., 
> SLF4J). {_}Mitigation{_}: Use OpenTelemetry's log appenders for correlation 
> without disruption.
>  * {*}Dependency{*}: Access to latest OpenTelemetry Java release (check via 
> Maven Central).
>  * {*}Risk{*}: Performance impact in parser-heavy workloads. 
> {_}Mitigation{_}: Profile with async spans and configurable sampling.
> Two further points
>  # If anyone is interested and would like to work with me on this, please let 
> me know :)
>  # I'll likely create a sub issue for each task, that way we can 
> incrementally prove and deliver this larger observability initiative.  

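As a rough sketch of the agent-attached setup and exporter configuration described in the issue, the following plain Java builds (but does not start) a tika-server process with the -javaagent flag and a few of the standard OTEL_* environment variables. The jar file names, the service name, the sampling ratio and the collector endpoint are assumptions for illustration, not values taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch only: assembles a tika-server launch with the OpenTelemetry Java agent attached. */
class TikaServerWithAgent {

    /** Builds, without starting, a process definition for an instrumented tika-server. */
    static ProcessBuilder build(String agentJar, String serverJar, String otlpEndpoint) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-javaagent:" + agentJar); // zero-code instrumentation via the agent
        command.add("-jar");
        command.add(serverJar);

        ProcessBuilder builder = new ProcessBuilder(command);
        Map<String, String> env = builder.environment();
        // Standard OpenTelemetry SDK environment variables; the values are assumptions.
        env.put("OTEL_SERVICE_NAME", "tika-server");
        env.put("OTEL_EXPORTER_OTLP_ENDPOINT", otlpEndpoint);
        env.put("OTEL_TRACES_SAMPLER", "parentbased_traceidratio");
        env.put("OTEL_TRACES_SAMPLER_ARG", "0.25"); // sample roughly a quarter of traces
        return builder.inheritIO();
    }

    public static void main(String[] args) {
        // Hypothetical jar names and a local collector endpoint.
        ProcessBuilder builder = build(
                "opentelemetry-javaagent.jar",
                "tika-server-standard.jar",
                "http://localhost:4318");
        System.out.println(String.join(" ", builder.command()));
        // Call builder.start() to actually launch the server.
    }
}
```

Keeping the exporter settings in environment variables means the same server jar can be pointed at a different collector, or sampled more aggressively, without a rebuild.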

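On the <5% overhead target, one possible way to turn benchmark output into a pass/fail check is sketched below. It assumes we compare a single summary latency (for example a mean or p95 from k6) between an uninstrumented and an instrumented run; the class name and the numbers in main are hypothetical.

```java
/** Sketch: relative overhead of an instrumented run compared with a baseline run. */
class OverheadCheck {

    /** Overhead of the instrumented latency over the baseline latency, in percent. */
    static double overheadPercent(double baselineMillis, double instrumentedMillis) {
        if (baselineMillis <= 0) {
            throw new IllegalArgumentException("baseline latency must be positive");
        }
        return (instrumentedMillis - baselineMillis) * 100.0 / baselineMillis;
    }

    /** True when the measured overhead is strictly below the target percentage. */
    static boolean withinTarget(double baselineMillis, double instrumentedMillis,
                                double targetPercent) {
        return overheadPercent(baselineMillis, instrumentedMillis) < targetPercent;
    }

    public static void main(String[] args) {
        // Hypothetical latencies, not measured results.
        double baseline = 200.0;
        double instrumented = 208.0;
        System.out.println(overheadPercent(baseline, instrumented) + "% overhead, within target: "
                + withinTarget(baseline, instrumented, 5.0));
    }
}
```

A single pair of numbers is noisy, so in practice this comparison would want several runs per configuration before drawing a conclusion.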

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
