[
https://issues.apache.org/jira/browse/TIKA-4513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030655#comment-18030655
]
Lewis John McGibbney commented on TIKA-4513:
--------------------------------------------
Thanks for the suggestions [~tallison]
{quote}... It might make sense to hold a virtual meetup/google meet to discuss
staging the work?
{quote}
Great to hear that other work is going on with tika-server :). Yes, a virtual
meeting sounds great so we can share info and get on the same page. Any
suggestions, [~ndipiazza], on a time and date? Thank you.
{quote}Would we actually have to change anything in our code? Maybe more
logging?{quote}
This is a good question and it's really why I plan on creating a video. Let me
try to provide some clarity.
OpenTelemetry (OTEL) is a comprehensive *_observability_* ecosystem, framework
and toolkit which facilitates the
* {*}generation{*}: e.g. application instrumentation, logging (using the
existing slf4j-over-log4j logging framework), etc.,
* {*}export{*}: the mechanics of routing egress of the generated telemetry
data to some external resource (a collector), and
* {*}collection{*}: essentially a proxy that can receive, process, and export
telemetry data.
OTEL handles telemetry data including
* {*}logs{*}: which is what we already have in Tika,
* {*}metrics{*}: which can represent custom instrumented metrics (counters,
gauges, histograms, etc.) and/or generated from extracting unstructured or
structured data from log streams, and finally
* {*}traces{*}: which give us the big picture of what happens when a request
is made to an application.

I also attached the example screenshots I shared on GitHub so we have them
here. In this case, we can literally trace a single request to
[http://localhost:9998/tika] and see the resulting spans of that trace. This
granularity of telemetry data has the potential to unlock many benefits for
tika-server users.
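To make the "would we actually have to change anything in our code" question concrete: with the zero-code agent we would not have to touch Tika code at all, but if we later add manual spans around Tika-specific logic, the OTEL API surface is small. A minimal sketch only (the class, method, span and attribute names here are purely illustrative, not existing Tika code):
{code:java}
import java.io.InputStream;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class ParseTracingSketch {

    // The javaagent registers a global OpenTelemetry instance when attached,
    // so manually created spans join the same traces as the auto-instrumented
    // HTTP server spans.
    private static final Tracer TRACER =
            GlobalOpenTelemetry.getTracer("org.apache.tika.server");

    public String parse(InputStream stream, String contentType) throws Exception {
        Span span = TRACER.spanBuilder("tika.parse")
                .setAttribute("content.type", contentType)
                .startSpan();
        try (Scope ignored = span.makeCurrent()) {
            return doParse(stream); // stand-in for the real parser call
        } catch (Exception e) {
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }

    private String doParse(InputStream stream) {
        return ""; // placeholder
    }
}
{code}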
I think our next step should be to meet virtually and definitely record it.
Here's a tentative agenda.
* [~tallison] and [~ndipiazza] educate [~lewismc] on current tika-server
development initiatives e.g. TIKA-3082 and others.
* [~lewismc] go through TIKA-4513 at an abstract level. The why, purpose,
intended outcomes, etc.
* ... if we have time I would be more than happy to demo the PR I submitted.
> Instrument tika-server
> ----------------------
>
> Key: TIKA-4513
> URL: https://issues.apache.org/jira/browse/TIKA-4513
> Project: Tika
> Issue Type: Improvement
> Components: instrumentation, tika-server
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 4.0.0
>
>
> Currently, tika-server lacks standardized observability instrumentation,
> relying on basic logging or custom metrics, which limits our ability to
> diagnose performance bottlenecks, track request latencies, or correlate
> failures across distributed deployments (which are readily available via
> tika-helm).
> This initiative will implement [OpenTelemetry Java
> (OTEL)|https://opentelemetry.io/docs/languages/java/] instrumentation in the Apache
> Tika Server to enable comprehensive collection of traces, metrics, and logs.
> This will improve system observability, allowing for better monitoring of
> request processing, resource usage, and error rates in a production
> environment.
> The OpenTelemetry Java SDK is stable across all major components (traces,
> metrics and logs), as per the official documentation.
> What's also nice about OTEL is that it integrates with tools like Jaeger
> (tracing), Prometheus (metrics), or ELK (logs) and loads of others. It would
> also facilitate rich visualizations via tools like Grafana.
> h4. Rationale
> * {*}Improved Diagnostics{*}: Traces will capture end-to-end request flows
> (e.g., from HTTP ingestion to parser execution), metrics will track
> throughput and error rates, and structured logs will provide context for
> debugging.
> * {*}Future-Proofing{*}: OpenTelemetry's semantic conventions ensure
> compatibility with evolving observability backends without vendor lock-in.
> * {*}Low Overhead{*}: We can experiment with
> [zero-code/auto-instrumentation|https://opentelemetry.io/docs/languages/java/instrumentation/#zero-code-java-agent]
> which will initially minimize code changes to develop a baseline. We can
> build on this to better observe custom Tika logic (e.g., parser chaining).
> * {*}Community Benefits{*}: Enhances Tika's appeal for microservices
> architectures, where observability is critical.
> h4. Goals
> * Instrument core Tika Server endpoints (e.g.,
> [/tika|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-TikaResource],
> [/detect|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-DetectorResource],
> [/meta|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-MetadataResource])
> to emit telemetry data.
> * Support configurable exporters for traces/metrics/logs to common backends
> (e.g., OTLP to a collector).
> * Ensure instrumentation does not degrade performance (<5% overhead target)
> and handles high-load scenarios gracefully.
> * Document setup for users deploying Tika Server.
> h4. Acceptance Criteria
> * Tika Server builds and runs with OpenTelemetry agent attached (e.g., via
> -javaagent:opentelemetry-javaagent.jar).
> * Sample requests generate traces with spans for key operations (e.g.,
> document parsing, MIME detection); verifiable via a tracing backend (like
> Jaeger).
> * Metrics expose at least: request count, latency histograms, error rates,
> and resource usage (CPU/memory via JVM metrics).
> * Logs are structured and correlated with traces (e.g., via trace/span IDs).
> * Unit/integration tests cover instrumentation (e.g., assert span attributes
> like http.method and content.type).
> * Configuration options added to tika-server.properties for
> enabling/disabling telemetry and setting exporter endpoints.
> * Documentation updated in Tika wiki with setup guide, including Docker
> integration... and then Tika Helm.
> * Performance benchmarks show <5% overhead under load (e.g., using JMeter or
> k6).
> * No regressions in existing Tika Server functionality.
> h4. Tasks
> *1. Research and Setup*
> # Review OpenTelemetry Java getting-started guide and instrumentation
> registry for Tika-relevant libraries (e.g., auto-instrumentation for Jetty
> HTTP server, Apache HttpClient).
> # Set up a local dev environment with Tika Server, OpenTelemetry Java agent
> (latest stable release), and a test collector (e.g., [Grafana
> Alloy|https://grafana.com/docs/alloy/latest/] in Docker).
> # Prototype basic trace export for a sample /tika request.
> *2. Core Instrumentation*
> # Enable auto-instrumentation for HTTP handling and core Tika parsers.
> # Add manual spans for custom logic (e.g., in TikaResource for request
> routing, Parser chain execution).
> # Implement metrics using the Meter API (e.g., counters for processed
> documents, gauges for active parsers).
> # Bridge logs to OpenTelemetry (e.g., via
> io.opentelemetry.instrumentation.logback-appender-otel).
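> To make the Meter API item above concrete, a minimal sketch of what such metrics could look like (metric names, units and attributes are purely illustrative, not an existing Tika API):
> {code:java}
> import java.util.concurrent.atomic.AtomicInteger;
> 
> import io.opentelemetry.api.GlobalOpenTelemetry;
> import io.opentelemetry.api.common.AttributeKey;
> import io.opentelemetry.api.common.Attributes;
> import io.opentelemetry.api.metrics.LongCounter;
> import io.opentelemetry.api.metrics.Meter;
> 
> public class TikaMetricsSketch {
> 
>     private static final Meter METER =
>             GlobalOpenTelemetry.getMeter("org.apache.tika.server");
> 
>     // Counter: total documents processed, tagged with the detected media type.
>     private static final LongCounter DOCUMENTS = METER
>             .counterBuilder("tika.documents.processed")
>             .setUnit("{document}")
>             .build();
> 
>     // Gauge: parser invocations currently in flight, observed via callback.
>     private static final AtomicInteger ACTIVE_PARSERS = new AtomicInteger();
>     static {
>         METER.gaugeBuilder("tika.parsers.active")
>                 .ofLongs()
>                 .buildWithCallback(measurement -> measurement.record(ACTIVE_PARSERS.get()));
>     }
> 
>     public static void recordParsed(String mediaType) {
>         DOCUMENTS.add(1, Attributes.of(AttributeKey.stringKey("content.type"), mediaType));
>     }
> }
> {code}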
> *3. Configuration and Exporters*
> # Integrate environment variables or properties for exporter config (e.g.,
> OTLP endpoint, sampling rate).
> # Support batching and sampling to handle scale.
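> A rough sketch of the SDK-side wiring for the batching and sampling items above; whether we configure this programmatically or leave it entirely to the agent's standard environment variables is an open question, and the endpoint and sampling ratio below are placeholders:
> {code:java}
> import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
> import io.opentelemetry.sdk.OpenTelemetrySdk;
> import io.opentelemetry.sdk.trace.SdkTracerProvider;
> import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
> import io.opentelemetry.sdk.trace.samplers.Sampler;
> 
> public class OtelExporterConfigSketch {
> 
>     public static OpenTelemetrySdk init() {
>         // Endpoint would come from configuration; the standard OTLP env var is
>         // used here only as an illustrative default.
>         String endpoint = System.getenv().getOrDefault(
>                 "OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317");
> 
>         OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
>                 .setEndpoint(endpoint)
>                 .build();
> 
>         SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
>                 // Batch spans before export to keep per-request overhead low.
>                 .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
>                 // Head sampling: keep 10% of traces; the ratio should be configurable.
>                 .setSampler(Sampler.traceIdRatioBased(0.1))
>                 .build();
> 
>         return OpenTelemetrySdk.builder()
>                 .setTracerProvider(tracerProvider)
>                 .buildAndRegisterGlobal();
>     }
> }
> {code}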
> *4. Testing and Validation*
> # Write tests using OpenTelemetry SDK's in-memory exporter to assert
> telemetry output.
> # Load test with [k6|https://grafana.com/docs/k6/latest/]; measure overhead.
> # Edge case testing: error handling, large files, concurrent requests.
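> For the in-memory exporter tests mentioned in item 1 above, something along these lines might work (a sketch only; the span under test and its attributes are created directly rather than produced by real Tika instrumentation):
> {code:java}
> import static org.junit.jupiter.api.Assertions.assertEquals;
> 
> import java.util.List;
> 
> import io.opentelemetry.api.common.AttributeKey;
> import io.opentelemetry.api.trace.Tracer;
> import io.opentelemetry.sdk.OpenTelemetrySdk;
> import io.opentelemetry.sdk.testing.exporter.InMemorySpanExporter;
> import io.opentelemetry.sdk.trace.SdkTracerProvider;
> import io.opentelemetry.sdk.trace.data.SpanData;
> import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor;
> import org.junit.jupiter.api.Test;
> 
> class TelemetrySketchTest {
> 
>     @Test
>     void spanCarriesExpectedAttributes() {
>         InMemorySpanExporter exporter = InMemorySpanExporter.create();
>         OpenTelemetrySdk otel = OpenTelemetrySdk.builder()
>                 .setTracerProvider(SdkTracerProvider.builder()
>                         .addSpanProcessor(SimpleSpanProcessor.create(exporter))
>                         .build())
>                 .build();
>         Tracer tracer = otel.getTracer("test");
> 
>         // Stand-in for the instrumented code path under test.
>         tracer.spanBuilder("tika.parse")
>                 .setAttribute("content.type", "application/pdf")
>                 .startSpan()
>                 .end();
> 
>         List<SpanData> spans = exporter.getFinishedSpanItems();
>         assertEquals(1, spans.size());
>         assertEquals("tika.parse", spans.get(0).getName());
>         assertEquals("application/pdf",
>                 spans.get(0).getAttributes().get(AttributeKey.stringKey("content.type")));
>     }
> }
> {code}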
> *5. Documentation and Release*
> # Update TikaServer README and wiki with instrumentation guide.
> # Submit PR and ensure existing CI/CD remains stable.
> h4. Risks and Dependencies
> * {*}Risk{*}: Instrumentation conflicts with existing Tika logging (e.g.,
> SLF4J). {_}Mitigation{_}: Use OpenTelemetry's log appenders for correlation
> without disruption.
> * {*}Dependency{*}: Access to latest OpenTelemetry Java release (check via
> Maven Central).
> * {*}Risk{*}: Performance impact in parser-heavy workloads.
> {_}Mitigation{_}: Profile with async spans and configurable sampling.
> Two further points:
> # If anyone is interested and would like to work with me on this, please let
> me know :)
> # I'll likely create a sub-issue for each task, so that we can incrementally
> prove and deliver this larger observability initiative.