[jira] [Created] (TIKA-4202) Add page count of OCR'd pages in PDF's metadata

Tim Allison (Jira) Fri, 23 Feb 2024 09:24:30 -0800

Tim Allison created TIKA-4202:
---------------------------------

             Summary: Add page count of OCR'd pages in PDF's metadata
                 Key: TIKA-4202
                 URL: https://issues.apache.org/jira/browse/TIKA-4202
             Project: Tika
          Issue Type: New Feature
            Reporter: Tim Allison



It would be useful to record, in a PDF's metadata, the number of pages that 
triggered OCR. 

PDFs are treated differently from other files because the default is to render 
each page and then run OCR "inline", whereas for other file formats we run OCR 
on embedded images, which are treated as embedded files. For regular files we 
can count how often Tesseract appears as the parser for embedded images, but we 
can't do that with PDFs ... yet.
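
To make the contrast concrete, here is a minimal sketch of the counting that is already possible for non-PDF files. It assumes a recursive metadata list (one record per container or embedded file, for example as JSON from tika-server's /rmeta endpoint) in which each record lists its parsers under "X-TIKA:Parsed-By". The helper name count_ocr_embedded and the sample records are hypothetical and hand-written for illustration; they are not output from a real Tika run.

```python
# Fully qualified class name of Tika's Tesseract parser, as it would appear
# in the "X-TIKA:Parsed-By" metadata values.
TESSERACT = "org.apache.tika.parser.ocr.TesseractOCRParser"
PARSED_BY = "X-TIKA:Parsed-By"


def count_ocr_embedded(metadata_list):
    """Count embedded files whose parser chain includes Tesseract.

    metadata_list[0] is assumed to be the container document; every later
    record is assumed to be an embedded file.
    """
    count = 0
    for record in metadata_list[1:]:
        parsed_by = record.get(PARSED_BY, [])
        # A single-valued key may be serialized as a plain string.
        if isinstance(parsed_by, str):
            parsed_by = [parsed_by]
        if TESSERACT in parsed_by:
            count += 1
    return count


# Hand-written sample: a .docx with two embedded images that went through OCR
# and one embedded file that did not.
docx_records = [
    {PARSED_BY: ["org.apache.tika.parser.DefaultParser"]},
    {PARSED_BY: ["org.apache.tika.parser.DefaultParser", TESSERACT]},
    {PARSED_BY: TESSERACT},
    {PARSED_BY: ["org.apache.tika.parser.DefaultParser"]},
]

# Hand-written sample: a PDF whose pages were rendered and OCR'd inline.
# The rendered pages are not embedded files, so there is nothing to count.
pdf_records = [
    {PARSED_BY: ["org.apache.tika.parser.pdf.PDFParser"]},
]

print(count_ocr_embedded(docx_records))
print(count_ocr_embedded(pdf_records))
```

Because inline OCR leaves no per-page records behind, this approach yields nothing for PDFs, which is the gap a dedicated page-count metadata key would fill.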



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
