zamf opened a new pull request, #2769:
URL: https://github.com/apache/tika/pull/2769

   ## Summary
   
   Add a new parser module (`tika-parser-ocr-encode-module`) that 
base64-encodes image content instead of performing OCR text extraction. This is 
useful when image data needs to be preserved in the parsed output for 
downstream processing by an external OCR/vision service (e.g., a cloud-based 
OCR API or vision LLM).
   
   The module:
   - Handles the same media types as `TesseractOCRParser` (`ocr-png`, 
`ocr-jpeg`, `ocr-tiff`, `ocr-bmp`, `ocr-gif`, `jp2`, `jpx`, `x-portable-pixmap`)
   - Wraps base64 output in `<<<---IMAGE-BASE64-ENCODED-BEGIN--->>>` / 
`<<<---IMAGE-BASE64-ENCODED-END--->>>` markers within a `<div class="ocr">` 
element
   - Supports configurable file size limits (`minFileSizeToOcr`, 
`maxFileSizeToOcr`) and per-parse image count limits (`maxImagesToOcr`) via 
`EncodeOCRConfig`
   - Supports `skipOCR` to disable encoding at runtime
   - Supports inline content mode for embedded images
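
To illustrate the output shape described in the list above, here is a minimal sketch using only the JDK. It is not the code in this PR: the class name `ImageEncodeSketch`, the helper names, and the choice of inclusive size bounds are assumptions made for illustration. Only the marker strings and the `<div class="ocr">` wrapper come from the description above.

```java
import java.util.Base64;

/** Illustrative sketch only; names and bounds semantics are assumptions, not the PR's code. */
final class ImageEncodeSketch {
    // Marker strings as given in the PR description.
    static final String BEGIN = "<<<---IMAGE-BASE64-ENCODED-BEGIN--->>>";
    static final String END = "<<<---IMAGE-BASE64-ENCODED-END--->>>";

    private ImageEncodeSketch() {
    }

    /**
     * Hypothetical size gate in the spirit of minFileSizeToOcr / maxFileSizeToOcr.
     * Treating both bounds as inclusive is an assumption.
     */
    static boolean withinSizeLimits(long sizeBytes, long minBytes, long maxBytes) {
        return sizeBytes >= minBytes && sizeBytes <= maxBytes;
    }

    /** Base64-encodes the image bytes and wraps them in the markers and the div element. */
    static String wrap(byte[] imageBytes) {
        String payload = Base64.getEncoder().encodeToString(imageBytes);
        return "<div class=\"ocr\">" + BEGIN + payload + END + "</div>";
    }
}
```

Because the standard base64 alphabet contains no `<`, `>`, or `-` characters, the markers cannot occur inside the payload, so a downstream consumer can locate the payload with a plain substring search.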
   
   ## Changes
   
   - **New module**: `tika-parser-ocr-encode-module` under 
`tika-parsers-standard-modules`
   - **Module registration**: Added to `tika-parsers-standard-modules/pom.xml`, 
`tika-bom/pom.xml`, and `tika-parsers-standard-package/pom.xml`
   - **27 unit tests** covering: encoding (PNG, JPEG), skip-OCR, file size 
filtering, image limits, supported types, config clone-and-update, base64 
round-trip validation
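
As a rough idea of what a base64 round-trip check involves, the sketch below extracts the payload between the two markers, decodes it, and compares the result with the original bytes. It is a hypothetical illustration rather than one of the 27 tests: the class `RoundTripSketch` and its methods are invented for this example, and it assumes the payload is emitted without embedded line breaks.

```java
import java.util.Arrays;
import java.util.Base64;

/** Illustrative sketch only; not taken from the PR's test suite. */
final class RoundTripSketch {
    // Marker strings as given in the PR description.
    static final String BEGIN = "<<<---IMAGE-BASE64-ENCODED-BEGIN--->>>";
    static final String END = "<<<---IMAGE-BASE64-ENCODED-END--->>>";

    private RoundTripSketch() {
    }

    /** Finds the text between the markers and base64-decodes it. */
    static byte[] decodePayload(String text) {
        int start = text.indexOf(BEGIN);
        if (start < 0) {
            throw new IllegalArgumentException("begin marker not found");
        }
        start += BEGIN.length();
        int end = text.indexOf(END, start);
        if (end < 0) {
            throw new IllegalArgumentException("end marker not found");
        }
        // Assumes no line breaks inside the payload; only surrounding whitespace is trimmed.
        return Base64.getDecoder().decode(text.substring(start, end).trim());
    }

    /** Returns true if encoding then decoding reproduces the original bytes. */
    static boolean roundTrips(byte[] original) {
        String emitted = BEGIN + Base64.getEncoder().encodeToString(original) + END;
        return Arrays.equals(original, decodePayload(emitted));
    }
}
```

If the emitted payload did contain line breaks, `Base64.getMimeDecoder()` would be the more tolerant choice, since the basic decoder rejects characters outside the base64 alphabet.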
   
   ## Test plan
   
   - [x] `mvn test -pl tika-parsers/.../tika-parser-ocr-encode-module` - 27 
tests pass
   - [x] `mvn clean install -am -DskipTests` - full build succeeds
   - [x] Checkstyle passes (no violations with project config)
   - [ ] CI pipeline

