This is an automated email from the ASF dual-hosted git repository.
tallison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git
The following commit(s) were added to refs/heads/main by this push:
new 0f51fcc6da TIKA-4706 (#2732)
0f51fcc6da is described below
commit 0f51fcc6da8286a6f1dce818bb3d4ccb4eb0a131
Author: Tim Allison <[email protected]>
AuthorDate: Fri Apr 3 17:24:22 2026 -0400
TIKA-4706 (#2732)
---
.skills/dev.md | 106 +++++++++++++++++++++++++
.skills/tika-eval-compare.md | 185 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 291 insertions(+)
diff --git a/.skills/dev.md b/.skills/dev.md
new file mode 100644
index 0000000000..fff9b748e1
--- /dev/null
+++ b/.skills/dev.md
@@ -0,0 +1,106 @@
+# Tika Development Skill
+
+Guidelines and checklist for developing against the Apache Tika codebase.
+
+## Git Policy
+
+Unless otherwise directed, the user wants to commit and push changes
+themselves. Do not run `git commit` or `git push`. Stage files and
+provide the suggested commit message for the user to execute.
+
+## Session Start Checklist
+
+1. **Local Maven repo** — Ask the user if they want to use an in-repo
+ `.local_m2_repo` (via `-Dmaven.repo.local=$(pwd)/.local_m2_repo`).
+ This isolates builds from the system `~/.m2/repository` and avoids
+ polluting or being affected by other projects.
+
+2. **Maven wrapper** — Use `./mvnw` (or the fallback
+ `/apache/apache-maven-3.9.12/bin/mvn` if the wrapper is absent).
+
+3. **Merge conflicts** — Check `git status` for `UU` files and resolve
+ before building.
+
+## Maven Rules
+
+- **Always include `clean`** in every `./mvnw` invocation.
+ Stale classes in `target/` cause hard-to-debug failures.
+ ```bash
+ ./mvnw clean compile -pl <module> ... # not just: mvnw compile
+ ./mvnw clean test -pl <module> ... # not just: mvnw test
+ ./mvnw clean install -pl <module> ... # not just: mvnw install
+ ```
+
+- **Always use an absolute path for the local repo**:
+ ```bash
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+ ```
+
+- **Fast builds with `-Pfast`** — use the `fast` profile to skip
+ tests, checkstyle, and spotless in one flag. Prefer this over
+ individual `-D` skip flags when you want a quick build (e.g.,
+ installing for downstream consumers or eval runs):
+ ```bash
+ ./mvnw clean install -pl <module> -am -Pfast \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+ ```
+ Run **without** `-Pfast` before final commit to catch formatting
+ and style issues.
+
+- **Forked JVM tests** — Integration tests in `tika-pipes` fork new
+ JVMs that load classes from the local Maven repo, not from
+ `target/classes`. Run `./mvnw clean install -Pfast` on the
+ changed modules before running integration tests that fork.
+
+## Building Specific Modules
+
+```bash
+# Single module (with dependencies)
+./mvnw clean compile -pl <module> -am \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+
+# Run a single test class
+./mvnw clean test -pl <module> -Dtest=<TestClass> \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo -Dcheckstyle.skip=true
+
+# Install for downstream consumers (tika-app, integration tests)
+./mvnw clean install -pl <module> -am -Pfast \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+```
+
+## Common Module Paths
+
+| Module | Path |
+|--------|------|
+| tika-core | `tika-core` |
+| tika-app | `tika-app` |
+| tika-server | `tika-server/tika-server-core` |
+| tika-eval | `tika-eval/tika-eval-app` |
+| Pipes core | `tika-pipes/tika-pipes-core` |
+| Pipes API | `tika-pipes/tika-pipes-api` |
+| Async CLI | `tika-pipes/tika-async-cli` |
+
+## Code Conventions
+
+- ASF License 2.0 header on all Java files
+- Spotless formatter runs during build — don't fight it
+- Tests use `@TempDir Path tmp` for temp directories
+- No emojis in code or comments
+
+## Testing an End-to-End Change
+
+When a change affects parsing output (e.g., new parser behavior,
+encoding fix), run a before/after comparison using tika-eval.
+See `.skills/tika-eval-compare.md` for the full procedure.
+
+## Pre-Commit Checks
+
+```bash
+# Full compile with checkstyle (catches formatting issues)
+./mvnw clean compile -pl <module> -am \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+
+# Run module tests
+./mvnw clean test -pl <module> \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+```
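A minimal shell sketch of steps 2 and 3 of the Session Start Checklist in `.skills/dev.md`, offered as an illustration and not as part of the commit. The function names `pick_maven` and `unmerged_files` are hypothetical; the fallback Maven path is the one quoted in the checklist, and the `UU` prefix is the status code the checklist asks to look for.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical and not part of the Tika repo.

# Print the Maven command to use: the wrapper if it exists and is
# executable in the given repo root, otherwise the documented fallback.
pick_maven() {
  local repo_root="${1:-.}"
  if [ -x "$repo_root/mvnw" ]; then
    echo "$repo_root/mvnw"
  else
    echo "/apache/apache-maven-3.9.12/bin/mvn"
  fi
}

# Read `git status --porcelain` output on stdin and print the paths
# of files marked UU (unmerged, both sides modified).
unmerged_files() {
  grep '^UU ' | cut -c4-
}
```

Typical use would be `git status --porcelain | unmerged_files` before building, and `"$(pick_maven)" clean compile ...` for the build itself.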
diff --git a/.skills/tika-eval-compare.md b/.skills/tika-eval-compare.md
new file mode 100644
index 0000000000..4fc628cc06
--- /dev/null
+++ b/.skills/tika-eval-compare.md
@@ -0,0 +1,185 @@
+# tika-eval: Compare Before/After Extracts
+
+Compare the output of two versions of Tika against a corpus of files
+to detect regressions in content extraction, encoding, exceptions, and
+embedded document handling.
+
+## Before You Start
+
+Ask the user for:
+
+1. **Working directory** — where to put builds, extracts, eval db, and
+ reports (e.g., `~/data/commoncrawl/my-eval`). All artifacts go here.
+2. **Number of threads** (`-n`) — default is 2. Use `-n 6` for faster
+ runs when parse time comparison is not needed. When comparing parse
+ times between A and B, use the same `-n` for both.
+3. **Run reports?** — whether to auto-generate the HTML/Excel reports
+ and `summary.md` at the end (the `-r` flag on tika-eval Compare).
+
+## Prerequisites
+
+- Two tika-app builds (a "before" and an "after"), each unzipped from
+ its zip archive and containing `tika-app-*.jar`, `lib/`, and `plugins/`.
+- A corpus of input files (a directory tree).
+- tika-eval-app, built from `tika-eval/tika-eval-app` (use the zip).
+- **Enable MD5 digesting** in both configs so tika-eval can match
+ embedded documents by content hash (not just index position).
+ Add to the config JSON:
+ ```json
+ "parse-context": {
+ "commons-digester-factory": {
+ "digests": [
+ { "algorithm": "MD5" }
+ ]
+ }
+ }
+ ```
+ Note: `parse-context` is a JSON **object**, not an array.
+
+## Step 1 — Generate Extracts
+
+Run each tika-app version against the same input corpus. The batch
+mode is triggered automatically when the first positional argument is
+a directory:
+
+```bash
+java -jar <before>/tika-app-*.jar <input-dir> <extracts-a-dir>
+java -jar <after>/tika-app-*.jar <input-dir> <extracts-b-dir>
+```
+
+To use a custom config (e.g., SAX vs DOM parsers, deleted content,
+macros), pass `--config=<file.json>` **before** the input/output dirs:
+
+```bash
+java -jar <tika-app>/tika-app-*.jar --config=dom-config.json <input-dir> <extracts-a-dir>
+java -jar <tika-app>/tika-app-*.jar --config=sax-config.json <input-dir> <extracts-b-dir>
+```
+
+Each run walks the input directory recursively and writes one
+`.json` file per input file (recursive metadata + XHTML content,
+equivalent to `tika-app -J`). The directory structure mirrors
+the input.
+
+### Notes
+
+- Do NOT pass `-n <N>` as a trailing argument — it confuses the
+ async mode auto-detection. If you need to control parallelism,
+ use `-i <input-dir> -o <output-dir> -n <N>` with explicit flags.
+- Default parallelism is 2 forked JVM clients. Use `-n 6` for faster
+ runs when parse time comparison is not needed. When comparing parse
+ times, keep `-n` the same for both A and B.
+- Default timeout is 30 000 ms per file.
+
+## Step 2 — Run tika-eval Compare
+
+Unzip the tika-eval-app zip, then run:
+
+```bash
+java -jar <tika-eval>/tika-eval-app-*.jar Compare \
+ -a <extracts-a-dir> \
+ -b <extracts-b-dir> \
+ -d <db-path> \
+ -r \
+ -rd <reports-dir>
+```
+
+| Flag | Description |
+|------|-------------|
+| `-a` | Directory of "before" extracts (required) |
+| `-b` | Directory of "after" extracts (required) |
+| `-d` | H2 database path (temp file if omitted) |
+| `-r` | Auto-run Report + tar.gz after Compare |
+| `-rd` | Reports output directory (default: `reports`) |
+| `-n` | Number of worker threads |
+
+## Step 3 — Review Results
+
+Reports are written as Excel `.xlsx` files under the reports
+directory, plus a `summary.md` with key metrics:
+
+- **Content Quality (Dice Coefficient)** — similarity between A and B
+ per mime type. Mean dice < 0.95 warrants investigation.
+- **OOV / Languageness Changes** — increased out-of-vocabulary rate
+ or decreased languageness z-score may indicate encoding regressions.
+- **Content Length Ratio Outliers** — files where B is >2× or <0.5×
+ the length of A.
+- **Exception Changes** — new exceptions in B or fixed exceptions.
+- **Embedded Document Count Changes** — gained/lost attachments.
+- **Content Regressions** — lowest-dice files with token-level diffs.
+- **Content Lost / Gained** — files that went empty↔non-empty.
+
+### Interpreting Results
+
+| Metric | Good | Investigate |
+|--------|------|-------------|
+| Mean dice (same mime) | ≥ 0.95 | < 0.90 |
+| New exceptions in B | 0 | > 0 — every one needs explanation |
+| Embedded doc count losses | 0 | > 0 — investigate by mime type |
+| OOV delta | < 0.05 | > 0.10 |
+| Content length ratio | 0.5–2.0 | > 5× or < 0.2× |
+| Exception count | ≤ A | > A |
+| Total files (B) vs (A) | equal or higher | lower — missing embedded docs |
+
+### CRITICAL: Review Checklist
+
+The purpose of tika-eval is to find regressions BEFORE a release. After
+reading summary.md, report each of these to the user:
+
+1. **New exceptions**: Exact count. If > 0, investigate stack traces.
+ Every new exception is a bug.
+2. **Total files delta**: "Total files (A)" vs "(B)". If B < A, embedded
+ documents are being lost — aggregate the losses by child mime type.
+3. **Embedded doc count changes**: Report the entries that lost embedded docs, per the summary table.
+4. **Dice scores**: Flag any mean < 0.99 for OOXML or < 0.95 for others.
+5. **Content length outliers**: Flag ratio > 3x or < 0.3x.
+6. **Fixed exceptions**: Report the count.
+
+**Do not summarize results as "looks good" based on dice scores alone.**
+Dice measures text similarity, not attachment completeness. Always check
+the Total files delta.
+
+**After fixing regressions, re-run the full eval.** A fix for one format
+may not cover another. Verify the numbers moved and no new issues appeared.
+
+## Building from Source
+
+```bash
+# tika-app
+./mvnw clean install -pl tika-app -am -Pfast \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+
+# tika-eval-app
+./mvnw clean install -pl tika-eval/tika-eval-app -am -Pfast \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+```
+
+The zip artifacts are in `<module-path>/target/<artifact-name>-*.zip`.
+
+## Example: Comparing MSG parsing changes
+
+```bash
+# Download the "before" snapshot
+curl -o /tmp/tika-app-before.zip <snapshot-url>
+unzip -qo /tmp/tika-app-before.zip -d /tmp/tika-app-before
+
+# Build the "after" from local changes
+./mvnw clean install -pl tika-app -am -Pfast \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -qo tika-app/target/tika-app-*.zip -d /tmp/tika-app-after
+
+# Generate extracts
+java -jar /tmp/tika-app-before/tika-app-*.jar ~/data/msgs /tmp/extracts-a
+java -jar /tmp/tika-app-after/tika-app-*.jar ~/data/msgs /tmp/extracts-b
+
+# Build and run tika-eval
+./mvnw clean install -pl tika-eval/tika-eval-app -am -Pfast \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -qo tika-eval/tika-eval-app/target/tika-eval-app-*.zip -d /tmp/tika-eval
+
+java -jar /tmp/tika-eval/tika-eval-app-*.jar Compare \
+ -a /tmp/extracts-a -b /tmp/extracts-b \
+ -d /tmp/eval-db -r -rd /tmp/eval-reports
+
+# Review
+cat /tmp/eval-reports/summary.md
+```
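A rough shell sketch of two of the checks described in `.skills/tika-eval-compare.md`, added for illustration and not part of the commit. Both helpers are hypothetical and are not tika-eval commands. `dice` computes a set-based Dice coefficient over whitespace-separated tokens, which is an assumption: tika-eval's own tokenization and counting may differ. `length_ratio_flag` defaults to the review-checklist thresholds (ratio > 3x or < 0.3x) and accepts other bounds as optional arguments.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not part of tika-eval.

# Set-based Dice coefficient of the whitespace tokens in two text files:
#   2 * |A intersect B| / (|A| + |B|)
# The side=a / side=b operands tell awk which file is being read.
dice() {
  awk '
    { for (i = 1; i <= NF; i++) { if (side == "a") a[$i] = 1; else b[$i] = 1 } }
    END {
      for (t in a) { na++; if (t in b) common++ }
      for (t in b) nb++
      if (na + nb == 0) { print "1.00"; exit }  # two empty inputs count as identical
      printf "%.2f\n", 2 * common / (na + nb)
    }
  ' side=a "$1" side=b "$2"
}

# Classify the content length of B relative to A.
# Args: length_a length_b [upper_ratio] [lower_ratio]
length_ratio_flag() {
  awk -v a="$1" -v b="$2" -v hi="${3:-3}" -v lo="${4:-0.3}" 'BEGIN {
    if (a == 0 && b == 0) { print "ok"; exit }
    if (a == 0)           { print "gained"; exit }  # empty -> non-empty
    if (b == 0)           { print "lost"; exit }    # non-empty -> empty
    r = b / a
    if (r > hi || r < lo) print "outlier"; else print "ok"
  }'
}
```

For example, `length_ratio_flag 100 350` reports an outlier under the default bounds, while `length_ratio_flag 100 350 5 0.2` applies the looser bounds from the Interpreting Results table instead.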