This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/main by this push:
     new 0f51fcc6da TIKA-4706 (#2732)
0f51fcc6da is described below

commit 0f51fcc6da8286a6f1dce818bb3d4ccb4eb0a131
Author: Tim Allison <[email protected]>
AuthorDate: Fri Apr 3 17:24:22 2026 -0400

    TIKA-4706 (#2732)
---
 .skills/dev.md               | 106 +++++++++++++++++++++++++
 .skills/tika-eval-compare.md | 185 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 291 insertions(+)

diff --git a/.skills/dev.md b/.skills/dev.md
new file mode 100644
index 0000000000..fff9b748e1
--- /dev/null
+++ b/.skills/dev.md
@@ -0,0 +1,106 @@
+# Tika Development Skill
+
+Guidelines and checklist for developing against the Apache Tika codebase.
+
+## Git Policy
+
+Unless otherwise directed, the user wants to commit and push changes
+themselves.  Do not run `git commit` or `git push`.  Stage files and
+provide the suggested commit message for the user to execute.
+
+## Session Start Checklist
+
+1. **Local Maven repo** — Ask the user if they want to use an in-repo
+   `.local_m2_repo` (via `-Dmaven.repo.local=$(pwd)/.local_m2_repo`).
+   This isolates builds from the system `~/.m2/repository` and avoids
+   polluting or being affected by other projects.
+
+2. **Maven wrapper** — Use `./mvnw` (or the fallback
+   `/apache/apache-maven-3.9.12/bin/mvn` if the wrapper is absent).
+
+3. **Merge conflicts** — Check `git status --short` for `UU` (unmerged)
+   files and resolve them before building.
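+
+Checklist items 2 and 3 can be sketched as two small shell helpers.
+The function names `pick_mvn` and `unmerged_from_porcelain` are
+hypothetical (they are not part of the Tika repo); the fallback path
+is the one given above.
+
+```bash
+# Hypothetical helper: prefer the wrapper, else the fallback Maven install.
+pick_mvn() {
+  if [ -x ./mvnw ]; then
+    echo ./mvnw
+  else
+    echo /apache/apache-maven-3.9.12/bin/mvn
+  fi
+}
+
+# Hypothetical helper: read `git status --porcelain` output on stdin and
+# print the paths whose two-letter status is UU (both sides modified).
+unmerged_from_porcelain() {
+  awk 'substr($0, 1, 2) == "UU" { print substr($0, 4) }'
+}
+```
+
+Usage: `git status --porcelain | unmerged_from_porcelain` should print
+nothing before you build with `$(pick_mvn)`.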
+
+## Maven Rules
+
+- **Always include `clean`** in every `./mvnw` invocation.
+  Stale classes in `target/` cause hard-to-debug failures.
+  ```bash
+  ./mvnw clean compile -pl <module> ...   # not just: mvnw compile
+  ./mvnw clean test -pl <module> ...      # not just: mvnw test
+  ./mvnw clean install -pl <module> ...   # not just: mvnw install
+  ```
+
+- **Always use an absolute path for the local repo**:
+  ```bash
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+  ```
+
+- **Fast builds with `-Pfast`** — use the `fast` profile to skip
+  tests, checkstyle, and spotless in one flag.  Prefer this over
+  individual `-D` skip flags when you want a quick build (e.g.,
+  installing for downstream consumers or eval runs):
+  ```bash
+  ./mvnw clean install -pl <module> -am -Pfast \
+    -Dmaven.repo.local=$(pwd)/.local_m2_repo
+  ```
+  Run **without** `-Pfast` before the final commit to catch formatting
+  and style issues.
+
+- **Forked JVM tests** — Integration tests in `tika-pipes` fork new
+  JVMs that load classes from the local Maven repo, not from
+  `target/classes`.  You must `./mvnw clean install -Pfast` the
+  changed modules before running integration tests that fork.
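+
+One way to make the first two rules hard to forget is a thin wrapper.
+This is a minimal sketch: `tika_mvn` and the `TIKA_MVN` override are
+hypothetical names, not something the repo provides.
+
+```bash
+# Hypothetical wrapper: always runs `clean` and always passes an
+# absolute local-repo path. Set TIKA_MVN to use a different launcher.
+tika_mvn() {
+  "${TIKA_MVN:-./mvnw}" clean "$@" \
+    "-Dmaven.repo.local=$(pwd)/.local_m2_repo"
+}
+```
+
+For example, `tika_mvn install -pl tika-core -am -Pfast` runs the
+fast-build command shown above for the `tika-core` module.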
+
+## Building Specific Modules
+
+```bash
+# Single module (with dependencies)
+./mvnw clean compile -pl <module> -am \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+
+# Run a single test class
+./mvnw clean test -pl <module> -Dtest=<TestClass> \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo -Dcheckstyle.skip=true
+
+# Install for downstream consumers (tika-app, integration tests)
+./mvnw clean install -pl <module> -am -Pfast \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+```
+
+## Common Module Paths
+
+| Module | Path |
+|--------|------|
+| tika-core | `tika-core` |
+| tika-app | `tika-app` |
+| tika-server | `tika-server/tika-server-core` |
+| tika-eval | `tika-eval/tika-eval-app` |
+| Pipes core | `tika-pipes/tika-pipes-core` |
+| Pipes API | `tika-pipes/tika-pipes-api` |
+| Async CLI | `tika-pipes/tika-async-cli` |
+
+## Code Conventions
+
+- ASF License 2.0 header on all Java files
+- Spotless formatter runs during build — don't fight it
+- Tests use `@TempDir Path tmp` for temp directories
+- No emojis in code or comments
+
+## Testing an End-to-End Change
+
+When a change affects parsing output (e.g., new parser behavior,
+encoding fix), run a before/after comparison using tika-eval.
+See `.skills/tika-eval-compare.md` for the full procedure.
+
+## Pre-Commit Checks
+
+```bash
+# Full compile with checkstyle (catches formatting issues)
+./mvnw clean compile -pl <module> -am \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+
+# Run module tests
+./mvnw clean test -pl <module> \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+```
diff --git a/.skills/tika-eval-compare.md b/.skills/tika-eval-compare.md
new file mode 100644
index 0000000000..4fc628cc06
--- /dev/null
+++ b/.skills/tika-eval-compare.md
@@ -0,0 +1,185 @@
+# tika-eval: Compare Before/After Extracts
+
+Compare the output of two versions of Tika against a corpus of files
+to detect regressions in content extraction, encoding, exceptions, and
+embedded document handling.
+
+## Before You Start
+
+Ask the user for:
+
+1. **Working directory** — where to put builds, extracts, eval db, and
+   reports (e.g., `~/data/commoncrawl/my-eval`).  All artifacts go here.
+2. **Number of threads** (`-n`) — default is 2.  Use `-n 6` for faster
+   runs when parse time comparison is not needed.  When comparing parse
+   times between A and B, use the same `-n` for both.
+3. **Run reports?** — whether to auto-generate the HTML/Excel reports
+   and `summary.md` at the end (the `-r` flag on tika-eval Compare).
+
+## Prerequisites
+
+- Two tika-app builds (a "before" and an "after"), each as an unzipped
+  zip archive containing `tika-app-*.jar`, `lib/`, and `plugins/`.
+- A corpus of input files (a directory tree).
+- tika-eval-app, built from `tika-eval/tika-eval-app` (use the zip).
+- **Enable MD5 digesting** in both configs so tika-eval can match
+  embedded documents by content hash (not just index position).
+  Add to the config JSON:
+  ```json
+  "parse-context": {
+    "commons-digester-factory": {
+      "digests": [
+        { "algorithm": "MD5" }
+      ]
+    }
+  }
+  ```
+  Note: `parse-context` is a JSON **object**, not an array.
+
+## Step 1 — Generate Extracts
+
+Run each tika-app version against the same input corpus.  The batch
+mode is triggered automatically when the first positional argument is
+a directory:
+
+```bash
+java -jar <before>/tika-app-*.jar <input-dir> <extracts-a-dir>
+java -jar <after>/tika-app-*.jar  <input-dir> <extracts-b-dir>
+```
+
+To use a custom config (e.g., SAX vs DOM parsers, deleted content,
+macros), pass `--config=<file.json>` **before** the input/output dirs:
+
+```bash
+java -jar <tika-app>/tika-app-*.jar --config=dom-config.json <input-dir> <extracts-a-dir>
+java -jar <tika-app>/tika-app-*.jar --config=sax-config.json <input-dir> <extracts-b-dir>
+```
+
+Each run walks the input directory recursively and writes one
+`.json` file per input file (recursive metadata + XHTML content,
+equivalent to `tika-app -J`).  The directory structure mirrors
+the input.
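+
+Because each input file should yield one `.json` extract, a quick count
+is a cheap sanity check before running the comparison.  A minimal
+sketch; `count_files` and `count_extracts` are hypothetical helper
+names, and it assumes every extract file name ends in `.json`:
+
+```bash
+# Count regular files under a directory tree.
+count_files() {
+  find "$1" -type f | wc -l | tr -d ' '
+}
+
+# Count the .json extracts under a directory tree.
+count_extracts() {
+  find "$1" -type f -name '*.json' | wc -l | tr -d ' '
+}
+```
+
+If `count_files <input-dir>` and `count_extracts <extracts-a-dir>`
+differ, the gap may point at inputs that produced no extract.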
+
+### Notes
+
+- Do NOT pass `-n <N>` as a trailing argument — it confuses the
+  async mode auto-detection.  If you need to control parallelism,
+  use `-i <input-dir> -o <output-dir> -n <N>` with explicit flags.
+- Default parallelism is 2 forked JVM clients.  Use `-n 6` for faster
+  runs when parse time comparison is not needed.  When comparing parse
+  times, keep `-n` the same for both A and B.
+- Default timeout is 30 000 ms per file.
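+
+The explicit-flag form from the first note can be wrapped so that A and
+B always get the same `-n`.  A minimal sketch; `extract_cmd` is a
+hypothetical helper that only prints the command line built from the
+`-i`, `-o`, and `-n` flags named above:
+
+```bash
+# Hypothetical helper: print (do not run) the tika-app batch command.
+# Args: <tika-app-jar> <input-dir> <output-dir> [threads, default 2]
+extract_cmd() {
+  echo "java -jar $1 -i $2 -o $3 -n ${4:-2}"
+}
+```
+
+Printing the command first makes it easy to confirm that both runs use
+identical flags before starting a long batch.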
+
+## Step 2 — Run tika-eval Compare
+
+Unzip the tika-eval-app zip, then run:
+
+```bash
+java -jar <tika-eval>/tika-eval-app-*.jar Compare \
+  -a <extracts-a-dir> \
+  -b <extracts-b-dir> \
+  -d <db-path> \
+  -r \
+  -rd <reports-dir>
+```
+
+| Flag | Description |
+|------|-------------|
+| `-a` | Directory of "before" extracts (required) |
+| `-b` | Directory of "after" extracts (required) |
+| `-d` | H2 database path (temp file if omitted) |
+| `-r` | Auto-run Report + tar.gz after Compare |
+| `-rd` | Reports output directory (default: `reports`) |
+| `-n` | Number of worker threads |
+
+## Step 3 — Review Results
+
+Reports are written as Excel `.xlsx` files under the reports
+directory, plus a `summary.md` with key metrics:
+
+- **Content Quality (Dice Coefficient)** — similarity between A and B
+  per mime type.  Mean dice < 0.95 warrants investigation.
+- **OOV / Languageness Changes** — increased out-of-vocabulary rate
+  or decreased languageness z-score may indicate encoding regressions.
+- **Content Length Ratio Outliers** — files where B is >2× or <0.5×
+  the length of A.
+- **Exception Changes** — new exceptions in B or fixed exceptions.
+- **Embedded Document Count Changes** — gained/lost attachments.
+- **Content Regressions** — lowest-dice files with token-level diffs.
+- **Content Lost / Gained** — files that went empty↔non-empty.
+
+### Interpreting Results
+
+| Metric | Good | Investigate |
+|--------|------|-------------|
+| Mean dice (same mime) | ≥ 0.95 | < 0.90 |
+| New exceptions in B | 0 | > 0 — every one needs explanation |
+| Embedded doc count losses | 0 | > 0 — investigate by mime type |
+| OOV delta | < 0.05 | > 0.10 |
+| Content length ratio | 0.5–2.0 | > 5× or < 0.2× |
+| Exception count | ≤ A | > A |
+| Total files (B) vs (A) | equal or higher | lower — missing embedded docs |
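+
+The mean dice and content length ratio rows can be applied
+mechanically.  A minimal sketch; `dice_status` and `ratio_status` are
+hypothetical helpers, and the `borderline` label for values that fall
+between the two columns is an assumption, since the table does not
+name that band:
+
+```bash
+# Classify a mean dice value: good (>= 0.95), investigate (< 0.90).
+dice_status() {
+  awk -v d="$1" 'BEGIN {
+    if (d >= 0.95) print "good"
+    else if (d < 0.90) print "investigate"
+    else print "borderline"
+  }'
+}
+
+# Classify a B/A content length ratio: good (0.5 to 2.0),
+# investigate (> 5 or < 0.2).
+ratio_status() {
+  awk -v r="$1" 'BEGIN {
+    if (r >= 0.5 && r <= 2.0) print "good"
+    else if (r > 5 || r < 0.2) print "investigate"
+    else print "borderline"
+  }'
+}
+```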
+
+### CRITICAL: Review Checklist
+
+The purpose of tika-eval is to find regressions BEFORE a release. After
+reading `summary.md`, report each of these to the user:
+
+1. **New exceptions**: Exact count. If > 0, investigate stack traces.
+   Every new exception is a bug.
+2. **Total files delta**: "Total files (A)" vs "(B)". If B < A, embedded
+   documents are being lost — aggregate the losses by child mime type.
+3. **Embedded doc count changes**: Report losers from the summary table.
+4. **Dice scores**: Flag any mean < 0.99 for OOXML or < 0.95 for others.
+5. **Content length outliers**: Flag ratio > 3x or < 0.3x.
+6. **Fixed exceptions**: Report the count.
+
+**Do not summarize results as "looks good" based on dice scores alone.**
+Dice measures text similarity, not attachment completeness. Always check
+the Total files delta.
+
+**After fixing regressions, re-run the full eval.** A fix for one format
+may not cover another. Verify the numbers moved and no new issues appeared.
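+
+Checklist items 2 and 4 can be sketched as shell checks.  The helper
+names are hypothetical, and treating any mime type that starts with
+`application/vnd.openxmlformats-officedocument` as OOXML is an
+assumption about how the report labels those files:
+
+```bash
+# Report lost files when B has fewer total files than A.
+# Args: <total-files-A> <total-files-B>
+files_delta() {
+  if [ "$2" -lt "$1" ]; then
+    echo "lost $(( $1 - $2 ))"
+  else
+    echo "ok"
+  fi
+}
+
+# Flag a mean dice below 0.99 for OOXML or below 0.95 otherwise.
+# Args: <mime-type> <mean-dice>
+flag_dice() {
+  case "$1" in
+    application/vnd.openxmlformats-officedocument*) limit=0.99 ;;
+    *) limit=0.95 ;;
+  esac
+  awk -v d="$2" -v l="$limit" 'BEGIN { print ((d < l) ? "flag" : "ok") }'
+}
+```
+
+Both helpers take numbers copied by hand from `summary.md`; they do not
+parse the report themselves.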
+
+## Building from Source
+
+```bash
+# tika-app
+./mvnw clean install -pl tika-app -am -Pfast \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+
+# tika-eval-app
+./mvnw clean install -pl tika-eval/tika-eval-app -am -Pfast \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+```
+
+The zip artifacts are in `<module-path>/target/<artifact>-*.zip`, for
+example `tika-app/target/tika-app-*.zip`.
+
+## Example: Comparing MSG parsing changes
+
+```bash
+# Download the "before" snapshot
+curl -o /tmp/tika-app-before.zip <snapshot-url>
+unzip -qo /tmp/tika-app-before.zip -d /tmp/tika-app-before
+
+# Build the "after" from local changes
+./mvnw clean install -pl tika-app -am -Pfast \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -qo tika-app/target/tika-app-*.zip -d /tmp/tika-app-after
+
+# Generate extracts
+java -jar /tmp/tika-app-before/tika-app-*.jar ~/data/msgs /tmp/extracts-a
+java -jar /tmp/tika-app-after/tika-app-*.jar  ~/data/msgs /tmp/extracts-b
+
+# Build and run tika-eval
+./mvnw clean install -pl tika-eval/tika-eval-app -am -Pfast \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -qo tika-eval/tika-eval-app/target/tika-eval-app-*.zip -d /tmp/tika-eval
+
+java -jar /tmp/tika-eval/tika-eval-app-*.jar Compare \
+  -a /tmp/extracts-a -b /tmp/extracts-b \
+  -d /tmp/eval-db -r -rd /tmp/eval-reports
+
+# Review
+cat /tmp/eval-reports/summary.md
+```
