Re: Using Tika with another OCR engine

Nick Burch Tue, 08 Aug 2023 08:17:05 -0700

On Thu, 3 Aug 2023, Cristian Zamfir wrote:

I am interested in trying out Tika with a different OCR engine and wondering how Tesseract is integrated.

Largely as "just another parser", but IIRC with a bit of logic to allow the "normal" image parsers to also have a go at the file to grab metadata.


It's all in tika-parser-ocr-module:
https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module

Is it possible to write a plugin to call a different engine?

Largely would be a case of writing your own parser, registering it for the appropriate mime types, and disabling the Tesseract one if you have the tesseract binary on your path.
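
As a rough sketch of the "disabling the Tesseract one" part, a tika-config.xml along these lines should exclude the Tesseract parser from the default set. The class name com.example.MyOcrParser is a made-up placeholder for your own parser, not something that ships with Tika, and the explicit entry is probably only needed if the parser is not already picked up through the usual service registration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep all the default parsers, except the Tesseract-based OCR one -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <!-- Hypothetical custom OCR parser; replace with your own class -->
    <parser class="com.example.MyOcrParser"/>
  </parsers>
</properties>
```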

For scanned PDFs, I assume there is some bi-directional communication between Tika and Tesseract to detect inline images. Is that correct?

Nope, the PDF parser will detect any embedded resources (e.g. images), and if enabled will call the appropriate parser for each one.
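
To illustrate that one-way flow, here is a toy model in plain Java. None of these types are Tika classes: Resource, register and parseContainer are invented for the illustration. It only shows the shape of the idea, assuming a container parser that hands each embedded resource to whichever parser is registered for its mime type, with nothing flowing back from the OCR side:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model only: these are NOT Tika classes or Tika's actual API.
class EmbeddedDispatchSketch {

    /** Hypothetical embedded resource: a mime type plus its raw bytes. */
    static final class Resource {
        final String mimeType;
        final byte[] data;

        Resource(String mimeType, byte[] data) {
            this.mimeType = mimeType;
            this.data = data;
        }
    }

    /** Parsers keyed by the mime type they are registered for. */
    private final Map<String, Function<byte[], String>> parsers = new LinkedHashMap<>();

    /** Mirrors the "if enabled" condition: whether embedded resources get parsed. */
    private final boolean extractEmbedded;

    EmbeddedDispatchSketch(boolean extractEmbedded) {
        this.extractEmbedded = extractEmbedded;
    }

    void register(String mimeType, Function<byte[], String> parser) {
        parsers.put(mimeType, parser);
    }

    /** The container finds the resources; each goes to its registered parser. */
    List<String> parseContainer(List<Resource> embedded) {
        List<String> results = new ArrayList<>();
        if (!extractEmbedded) {
            return results; // embedded extraction disabled, nothing is delegated
        }
        for (Resource resource : embedded) {
            Function<byte[], String> parser = parsers.get(resource.mimeType);
            if (parser != null) {
                // One-way call: the parser never talks back to the container
                results.add(parser.apply(resource.data));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        EmbeddedDispatchSketch pdf = new EmbeddedDispatchSketch(true);
        // Stand-in for an OCR parser registered for PNG images
        pdf.register("image/png", bytes -> "ocr text from " + bytes.length + " bytes");
        List<Resource> found = List.of(new Resource("image/png", new byte[4]));
        System.out.println(pdf.parseContainer(found));
    }
}
```

Swapping the OCR engine in this picture just means registering a different function for the image types; the container side does not change.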


Nick
