This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch TIKA-4706
in repository https://gitbox.apache.org/repos/asf/tika.git

commit adce55dc29b66d8010e02debc8dac4aca9477878
Author: tallison <[email protected]>
AuthorDate: Fri Apr 3 16:42:52 2026 -0400

    TIKA-4706
---
 .skills/dev.md               | 106 +++++++++++++++++++++++++
 .skills/tika-eval-compare.md | 185 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 291 insertions(+)

diff --git a/.skills/dev.md b/.skills/dev.md
new file mode 100644
index 0000000000..fff9b748e1
--- /dev/null
+++ b/.skills/dev.md
@@ -0,0 +1,106 @@
+# Tika Development Skill
+
+Guidelines and checklist for developing against the Apache Tika codebase.
+
+## Git Policy
+
+Unless otherwise directed, the user wants to commit and push changes
+themselves. Do not run `git commit` or `git push`. Stage files and
+provide the suggested commit message for the user to execute.
+
+## Session Start Checklist
+
+1. **Local Maven repo** — Ask the user if they want to use an in-repo
+   `.local_m2_repo` (via `-Dmaven.repo.local=$(pwd)/.local_m2_repo`).
+   This isolates builds from the system `~/.m2/repository` and avoids
+   polluting or being affected by other projects.
+
+2. **Maven wrapper** — Use `./mvnw` (or the fallback
+   `/apache/apache-maven-3.9.12/bin/mvn` if the wrapper is absent).
+
+3. **Merge conflicts** — Check `git status` for `UU` files and resolve
+   before building.
+
+## Maven Rules
+
+- **Always include `clean`** in every `./mvnw` invocation.
+  Stale classes in `target/` cause hard-to-debug failures.
+  ```bash
+  ./mvnw clean compile -pl <module> ... # not just: mvnw compile
+  ./mvnw clean test -pl <module> ... # not just: mvnw test
+  ./mvnw clean install -pl <module> ... # not just: mvnw install
+  ```
+
+- **Always use absolute path for local repo**:
+  ```bash
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+  ```
+
+- **Fast builds with `-Pfast`** — use the `fast` profile to skip
+  tests, checkstyle, and spotless in one flag. Prefer this over
+  individual `-D` skip flags when you want a quick build (e.g.,
+  installing for downstream consumers or eval runs):
+  ```bash
+  ./mvnw clean install -pl <module> -am -Pfast \
+    -Dmaven.repo.local=$(pwd)/.local_m2_repo
+  ```
+  Run **without** `-Pfast` before final commit to catch formatting
+  and style issues.
+
+- **Forked JVM tests** — Integration tests in `tika-pipes` fork new
+  JVMs that load classes from the local Maven repo, not from
+  `target/classes`. You must `./mvnw clean install -Pfast` the
+  changed modules before running integration tests that fork.
+
+## Building Specific Modules
+
+```bash
+# Single module (with dependencies)
+./mvnw clean compile -pl <module> -am \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+
+# Run a single test class
+./mvnw clean test -pl <module> -Dtest=<TestClass> \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo -Dcheckstyle.skip=true
+
+# Install for downstream consumers (tika-app, integration tests)
+./mvnw clean install -pl <module> -am -Pfast \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+```
+
+## Common Module Paths
+
+| Module | Path |
+|--------|------|
+| tika-core | `tika-core` |
+| tika-app | `tika-app` |
+| tika-server | `tika-server/tika-server-core` |
+| tika-eval | `tika-eval/tika-eval-app` |
+| Pipes core | `tika-pipes/tika-pipes-core` |
+| Pipes API | `tika-pipes/tika-pipes-api` |
+| Async CLI | `tika-pipes/tika-async-cli` |
+
+## Code Conventions
+
+- ASF License 2.0 header on all Java files
+- Spotless formatter runs during build — don't fight it
+- Tests use `@TempDir Path tmp` for temp directories
+- No emojis in code or comments
+
+## Testing an End-to-End Change
+
+When a change affects parsing output (e.g., new parser behavior,
+encoding fix), run a before/after comparison using tika-eval.
+See `.skills/tika-eval-compare.md` for the full procedure.
+
+## Pre-Commit Checks
+
+```bash
+# Full compile with checkstyle (catches formatting issues)
+./mvnw clean compile -pl <module> -am \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+
+# Run module tests
+./mvnw clean test -pl <module> \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+```
diff --git a/.skills/tika-eval-compare.md b/.skills/tika-eval-compare.md
new file mode 100644
index 0000000000..4fc628cc06
--- /dev/null
+++ b/.skills/tika-eval-compare.md
@@ -0,0 +1,185 @@
+# tika-eval: Compare Before/After Extracts
+
+Compare the output of two versions of Tika against a corpus of files
+to detect regressions in content extraction, encoding, exceptions, and
+embedded document handling.
+
+## Before You Start
+
+Ask the user for:
+
+1. **Working directory** — where to put builds, extracts, eval db, and
+   reports (e.g., `~/data/commoncrawl/my-eval`). All artifacts go here.
+2. **Number of threads** (`-n`) — default is 2. Use `-n 6` for faster
+   runs when parse time comparison is not needed. When comparing parse
+   times between A and B, use the same `-n` for both.
+3. **Run reports?** — whether to auto-generate the HTML/Excel reports
+   and `summary.md` at the end (the `-r` flag on tika-eval Compare).
+
+## Prerequisites
+
+- Two tika-app builds (a "before" and an "after"), each as an unzipped
+  zip archive containing `tika-app-*.jar`, `lib/`, and `plugins/`.
+- A corpus of input files (a directory tree).
+- tika-eval-app, built from `tika-eval/tika-eval-app` (use the zip).
+- **Enable MD5 digesting** in both configs so tika-eval can match
+  embedded documents by content hash (not just index position).
+  Add to the config JSON:
+  ```json
+  "parse-context": {
+    "commons-digester-factory": {
+      "digests": [
+        { "algorithm": "MD5" }
+      ]
+    }
+  }
+  ```
+  Note: `parse-context` is a JSON **object**, not an array.
+
+## Step 1 — Generate Extracts
+
+Run each tika-app version against the same input corpus. The batch
+mode is triggered automatically when the first positional argument is
+a directory:
+
+```bash
+java -jar <before>/tika-app-*.jar <input-dir> <extracts-a-dir>
+java -jar <after>/tika-app-*.jar <input-dir> <extracts-b-dir>
+```
+
+To use a custom config (e.g., SAX vs DOM parsers, deleted content,
+macros), pass `--config=<file.json>` **before** the input/output dirs:
+
+```bash
+java -jar <tika-app>/tika-app-*.jar --config=dom-config.json <input-dir> <extracts-a-dir>
+java -jar <tika-app>/tika-app-*.jar --config=sax-config.json <input-dir> <extracts-b-dir>
+```
+
+Each run walks the input directory recursively and writes one
+`.json` file per input file (recursive metadata + XHTML content,
+equivalent to `tika-app -J`). The directory structure mirrors
+the input.
+
+### Notes
+
+- Do NOT pass `-n <N>` as a trailing argument — it confuses the
+  async mode auto-detection. If you need to control parallelism,
+  use `-i <input-dir> -o <output-dir> -n <N>` with explicit flags.
+- Default parallelism is 2 forked JVM clients. Use `-n 6` for faster
+  runs when parse time comparison is not needed. When comparing parse
+  times, keep `-n` the same for both A and B.
+- Default timeout is 30 000 ms per file.
+
+## Step 2 — Run tika-eval Compare
+
+Unzip the tika-eval-app zip, then run:
+
+```bash
+java -jar <tika-eval>/tika-eval-app-*.jar Compare \
+  -a <extracts-a-dir> \
+  -b <extracts-b-dir> \
+  -d <db-path> \
+  -r \
+  -rd <reports-dir>
+```
+
+| Flag | Description |
+|------|-------------|
+| `-a` | Directory of "before" extracts (required) |
+| `-b` | Directory of "after" extracts (required) |
+| `-d` | H2 database path (temp file if omitted) |
+| `-r` | Auto-run Report + tar.gz after Compare |
+| `-rd` | Reports output directory (default: `reports`) |
+| `-n` | Number of worker threads |
+
+## Step 3 — Review Results
+
+Reports are written as Excel `.xlsx` files under the reports
+directory, plus a `summary.md` with key metrics:
+
+- **Content Quality (Dice Coefficient)** — similarity between A and B
+  per mime type. Mean dice < 0.95 warrants investigation.
+- **OOV / Languageness Changes** — increased out-of-vocabulary rate
+  or decreased languageness z-score may indicate encoding regressions.
+- **Content Length Ratio Outliers** — files where B is >2× or <0.5×
+  the length of A.
+- **Exception Changes** — new exceptions in B or fixed exceptions.
+- **Embedded Document Count Changes** — gained/lost attachments.
+- **Content Regressions** — lowest-dice files with token-level diffs.
+- **Content Lost / Gained** — files that went empty↔non-empty.
+
+### Interpreting Results
+
+| Metric | Good | Investigate |
+|--------|------|-------------|
+| Mean dice (same mime) | ≥ 0.95 | < 0.90 |
+| New exceptions in B | 0 | > 0 — every one needs explanation |
+| Embedded doc count losses | 0 | > 0 — investigate by mime type |
+| OOV delta | < 0.05 | > 0.10 |
+| Content length ratio | 0.5–2.0 | > 5× or < 0.2× |
+| Exception count | ≤ A | > A |
+| Total files (B) vs (A) | equal or higher | lower — missing embedded docs |
+
+### CRITICAL: Review Checklist
+
+The purpose of tika-eval is to find regressions BEFORE a release. After
+reading summary.md, report each of these to the user:
+
+1. **New exceptions**: Exact count. If > 0, investigate stack traces.
+   Every new exception is a bug.
+2. **Total files delta**: "Total files (A)" vs "(B)". If B < A, embedded
+   documents are being lost — aggregate the losses by child mime type.
+3. **Embedded doc count changes**: Report losers from the summary table.
+4. **Dice scores**: Flag any mean < 0.99 for OOXML or < 0.95 for others.
+5. **Content length outliers**: Flag ratio > 3x or < 0.3x.
+6. **Fixed exceptions**: Report the count.
+
+**Do not summarize results as "looks good" based on dice scores alone.**
+Dice measures text similarity, not attachment completeness. Always check
+the Total files delta.
+
+**After fixing regressions, re-run the full eval.** A fix for one format
+may not cover another. Verify the numbers moved and no new issues appeared.
+
+## Building from Source
+
+```bash
+# tika-app
+./mvnw clean install -pl tika-app -am -Pfast \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+
+# tika-eval-app
+./mvnw clean install -pl tika-eval/tika-eval-app -am -Pfast \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+```
+
+The zip artifacts are in `<module>/target/<module>-*.zip`.
+
+## Example: Comparing MSG parsing changes
+
+```bash
+# Download the "before" snapshot
+curl -o /tmp/tika-app-before.zip <snapshot-url>
+unzip -qo /tmp/tika-app-before.zip -d /tmp/tika-app-before
+
+# Build the "after" from local changes
+./mvnw clean install -pl tika-app -am -Pfast \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -qo tika-app/target/tika-app-*.zip -d /tmp/tika-app-after
+
+# Generate extracts
+java -jar /tmp/tika-app-before/tika-app-*.jar ~/data/msgs /tmp/extracts-a
+java -jar /tmp/tika-app-after/tika-app-*.jar ~/data/msgs /tmp/extracts-b
+
+# Build and run tika-eval
+./mvnw clean install -pl tika-eval/tika-eval-app -am -Pfast \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -qo tika-eval/tika-eval-app/target/tika-eval-app-*.zip -d /tmp/tika-eval
+
+java -jar /tmp/tika-eval/tika-eval-app-*.jar Compare \
+  -a /tmp/extracts-a -b /tmp/extracts-b \
+  -d /tmp/eval-db -r -rd /tmp/eval-reports
+
+# Review
+cat /tmp/eval-reports/summary.md
+```
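
Not part of the commit: a possible pre-check to run between generating extracts and running Compare. The committed text says each run writes one `.json` file per input file, so the two extract trees would normally hold the same number of files, and a mismatch may point at an interrupted or misconfigured run. This is a minimal sketch under that assumption. The helper names `count_extracts` and `same_extract_count` are hypothetical, and the check says nothing about embedded documents, which are recorded inside each extract rather than as separate files.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): count the .json extracts
# anywhere under the directory given as $1.
count_extracts() {
  find "$1" -type f -name '*.json' | wc -l | tr -d '[:space:]'
}

# Hypothetical helper: print both counts, then succeed only when the two
# extract trees hold the same number of .json files.
same_extract_count() {
  local a b
  a=$(count_extracts "$1")
  b=$(count_extracts "$2")
  echo "A=${a} B=${b}"
  [ "$a" -eq "$b" ]
}
```

With the paths from the example, `same_extract_count /tmp/extracts-a /tmp/extracts-b || echo "extract counts differ"` would flag a mismatch before the eval database is built.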
