OCR PDF files but not image files

Gregory Lepore via user Thu, 21 Mar 2024 07:13:50 -0700

Is there a way, using tika-config.xml, to allow PDF files to be OCR'ed (with
extractInlineImages=true) while either skipping OCR for specific image formats
(JPG, GIF) or disallowing OCR for all image/* MIME types?


I tried
<parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <mime-exclude>image/*</mime-exclude>
</parser>

but no luck.

Thanks.

-- 
Greg Lepore
Information Technology Specialist
National Archives at College Park
