On Thu, 3 Aug 2023, Cristian Zamfir wrote:
I am interested in trying out Tika with a different OCR engine and wondering how Tesseract is integrated.

Largely as "just another parser", but IIRC with a bit of logic to allow the "normal" image parsers to also have a go at the file to grab metadata.

It's all in tika-parser-ocr-module:
https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module

Is it possible to write a plugin to call a different engine?

It would largely be a case of writing your own parser, registering it for the appropriate mime types, and disabling the Tesseract one if you have the tesseract binary on your path.
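
As a rough illustration of that idea (register a parser per mime type, then switch the Tesseract one off), here is a toy registry in plain Java. It is not Tika's API: OcrRegistrySketch, SimpleParser and the parser names are all made up for this sketch. In Tika itself you would implement org.apache.tika.parser.Parser and adjust the parser configuration instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OcrRegistrySketch {

    /** Hypothetical stand-in for a parser: a name plus the mime types it handles. */
    public static final class SimpleParser {
        public final String name;
        public final Set<String> types;

        public SimpleParser(String name, String... types) {
            this.name = name;
            this.types = new HashSet<>(Arrays.asList(types));
        }
    }

    // Parsers in registration order; the first enabled match wins.
    private final List<SimpleParser> parsers = new ArrayList<>();
    private final Set<String> disabled = new HashSet<>();

    public void register(SimpleParser parser) {
        parsers.add(parser);
    }

    public void disable(String name) {
        disabled.add(name);
    }

    /** Returns the first enabled parser registered for the mime type, or null. */
    public SimpleParser parserFor(String mimeType) {
        for (SimpleParser p : parsers) {
            if (!disabled.contains(p.name) && p.types.contains(mimeType)) {
                return p;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        OcrRegistrySketch registry = new OcrRegistrySketch();
        registry.register(new SimpleParser("tesseract", "image/png", "image/jpeg"));
        registry.register(new SimpleParser("my-ocr", "image/png", "image/jpeg"));
        // Without this, "tesseract" would still be picked for images.
        registry.disable("tesseract");
        System.out.println(registry.parserFor("image/png").name);
    }
}
```

The point of the sketch is only that two parsers claiming the same mime types compete, so the one you don't want has to be taken out of the running explicitly.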

For scanned PDFs, I assume there is some bi-directional communication between Tika and Tesseract to detect inline images. Is that correct?

Nope, the PDF parser will detect any embedded resources (e.g. images) and, if enabled, will call the appropriate parser for each one.
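
To make the one-way flow concrete, here is a toy model in plain Java. Again, this is not Tika's code: EmbeddedDispatchSketch, Resource and the parseEmbedded flag are invented for the sketch. The container parser walks its own embedded resources and hands each one to whichever parser is registered for its mime type; nothing calls back into the container.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class EmbeddedDispatchSketch {

    /** Hypothetical embedded resource: a mime type plus raw bytes. */
    public static final class Resource {
        public final String mimeType;
        public final byte[] data;

        public Resource(String mimeType, byte[] data) {
            this.mimeType = mimeType;
            this.data = data;
        }
    }

    // mime type -> parser; a stand-in for however parsers get looked up.
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();
    // Models the "if enabled" switch for handling embedded resources.
    private final boolean parseEmbedded;

    public EmbeddedDispatchSketch(boolean parseEmbedded) {
        this.parseEmbedded = parseEmbedded;
    }

    public void register(String mimeType, Function<byte[], String> parser) {
        parsers.put(mimeType, parser);
    }

    /** Delegates each embedded resource to its parser and collects the text. */
    public List<String> parseContainer(List<Resource> embedded) {
        List<String> out = new ArrayList<>();
        if (!parseEmbedded) {
            return out;
        }
        for (Resource r : embedded) {
            Function<byte[], String> parser = parsers.get(r.mimeType);
            if (parser != null) {
                out.add(parser.apply(r.data));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        EmbeddedDispatchSketch pdf = new EmbeddedDispatchSketch(true);
        // The OCR parser only ever sees image bytes; it knows nothing about PDFs.
        pdf.register("image/png", bytes -> "ocr text from " + bytes.length + " bytes");
        List<Resource> pages = new ArrayList<>();
        pages.add(new Resource("image/png", new byte[] {1, 2, 3}));
        System.out.println(pdf.parseContainer(pages));
    }
}
```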

Nick
