zamf opened a new pull request, #2769: URL: https://github.com/apache/tika/pull/2769
## Summary

Add a new parser module (`tika-parser-ocr-encode-module`) that base64-encodes image content instead of performing OCR text extraction. This is useful when image data needs to be preserved in the parsed output for downstream processing by an external OCR/vision service (e.g., a cloud-based OCR API or vision LLM).

The module:

- Handles the same media types as `TesseractOCRParser` (`ocr-png`, `ocr-jpeg`, `ocr-tiff`, `ocr-bmp`, `ocr-gif`, `jp2`, `jpx`, `x-portable-pixmap`)
- Wraps base64 output in `<<<---IMAGE-BASE64-ENCODED-BEGIN--->>>` / `<<<---IMAGE-BASE64-ENCODED-END--->>>` markers within a `<div class="ocr">` element
- Supports configurable file size limits (`minFileSizeToOcr`, `maxFileSizeToOcr`) and per-parse image count limits (`maxImagesToOcr`) via `EncodeOCRConfig`
- Supports `skipOCR` to disable at runtime
- Supports inline content mode for embedded images

## Changes

- **New module**: `tika-parser-ocr-encode-module` under `tika-parsers-standard-modules`
- **Module registration**: Added to `tika-parsers-standard-modules/pom.xml`, `tika-bom/pom.xml`, and `tika-parsers-standard-package/pom.xml`
- **27 unit tests** covering: encoding (PNG, JPEG), skip-OCR, file size filtering, image limits, supported types, config clone-and-update, base64 round-trip validation

## Test plan

- [x] `mvn test -pl tika-parsers/.../tika-parser-ocr-encode-module` - 27 tests pass
- [x] `mvn clean install -am -DskipTests` - full build succeeds
- [x] Checkstyle passes (no violations with project config)
- [ ] CI pipeline
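
## Illustrative sketch

For reviewers who want a feel for the output format, here is a minimal stand-alone sketch of the marker wrapping and the base64 round-trip described in the summary. It is not the code in this PR: the class `ImageBase64Sketch`, its methods, and the inclusive size bounds are hypothetical, and it uses only `java.util.Base64` rather than any Tika API. Only the two marker strings are taken from the description above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.Optional;

// Hypothetical illustration only; not the parser class added by this PR.
final class ImageBase64Sketch {
    // Marker strings as given in the PR description.
    static final String BEGIN = "<<<---IMAGE-BASE64-ENCODED-BEGIN--->>>";
    static final String END = "<<<---IMAGE-BASE64-ENCODED-END--->>>";

    // Wraps the base64 form of the image bytes in the markers.
    // Returns empty when the size falls outside [minSize, maxSize];
    // treating both bounds as inclusive is an assumption of this sketch.
    static Optional<String> encode(byte[] image, long minSize, long maxSize) {
        if (image.length < minSize || image.length > maxSize) {
            return Optional.empty();
        }
        String b64 = Base64.getEncoder().encodeToString(image);
        return Optional.of(BEGIN + b64 + END);
    }

    // What a downstream consumer might do: locate the markers and
    // decode the payload between them back to the original bytes.
    static byte[] decode(String wrapped) {
        int start = wrapped.indexOf(BEGIN);
        if (start < 0) {
            throw new IllegalArgumentException("begin marker not found");
        }
        int payloadStart = start + BEGIN.length();
        int end = wrapped.indexOf(END, payloadStart);
        if (end < 0) {
            throw new IllegalArgumentException("end marker not found");
        }
        return Base64.getDecoder().decode(wrapped.substring(payloadStart, end));
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for real image content.
        byte[] fakeImage = "not a real image".getBytes(StandardCharsets.UTF_8);
        String wrapped = encode(fakeImage, 1, 1024).orElseThrow(IllegalStateException::new);
        System.out.println(wrapped);
        System.out.println(Arrays.equals(fakeImage, decode(wrapped)));
    }
}
```

The markers never collide with the payload because the standard base64 alphabet contains none of `<`, `>`, or `-`, so a consumer can search for them as plain strings.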
