[jira] [Commented] (TIKA-4202) Add page count of OCR'd pages in metadata for PDF files

Tim Allison (Jira) Wed, 28 Feb 2024 07:06:46 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-4202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821703#comment-17821703
 ]


Tim Allison commented on TIKA-4202:
-----------------------------------

The most recent commit actually increments the counter. I've also moved the 
counter higher in the logic so that it triggers even if tesseract is not 
available. In other words, in AUTO mode, if tesseract is available, it counts 
the number of pages actually OCR'd; if tesseract is not available, it counts 
the number of pages that would have been OCR'd. The idea is that this helps 
users who want to run a two-pass parse: a first pass without OCR, followed by 
an update parse for only those PDFs that require OCR.
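
To illustrate the behavior described above, here is a minimal sketch in plain 
Java. It is not the code from the commit and does not use Tika's API: the 
class name, both method names, the per-page "needs OCR" flags, and the map of 
first-pass counts are all hypothetical stand-ins. It only shows why 
incrementing the counter before the tesseract availability check yields the 
same count in both cases, and how a first-pass count could be used to select 
PDFs for a follow-up OCR pass.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OcrPageCountSketch {

    /**
     * Hypothetical stand-in for the per-page AUTO-mode logic.
     * pageNeedsOcr[i] is an assumed flag meaning "page i would trigger OCR".
     */
    public static int countOcrPages(boolean[] pageNeedsOcr, boolean tesseractAvailable) {
        int ocrPageCount = 0;
        for (boolean needsOcr : pageNeedsOcr) {
            if (!needsOcr) {
                continue; // page has usable text, no OCR required
            }
            // Increment before checking for tesseract, so the count reflects
            // pages that were OCR'd or that would have been OCR'd.
            ocrPageCount++;
            if (tesseractAvailable) {
                // A real parser would render the page and run OCR here.
            }
        }
        return ocrPageCount;
    }

    /**
     * Second half of a two-pass workflow: given counts recorded during a
     * first pass without OCR, pick the files worth re-parsing with OCR.
     */
    public static List<String> selectForSecondPass(Map<String, Integer> firstPassCounts) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : firstPassCounts.entrySet()) {
            if (entry.getValue() > 0) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // First pass: tesseract not available, counts are still recorded.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("text-only.pdf", countOcrPages(new boolean[] {false, false}, false));
        counts.put("scanned.pdf", countOcrPages(new boolean[] {true, true, false}, false));

        // Second pass: only files with a non-zero count need OCR.
        System.out.println(selectForSecondPass(counts));
    }
}
```

In this sketch the first pass is cheap because no OCR runs, and only the 
files returned by selectForSecondPass would be sent through the slower 
OCR-enabled parse.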

> Add page count of OCR'd pages in metadata for PDF files
> -------------------------------------------------------
>
>                 Key: TIKA-4202
>                 URL: https://issues.apache.org/jira/browse/TIKA-4202
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Tim Allison
>            Priority: Minor
>
> It would be useful to store the number of pages that triggered OCR in PDFs. 
> PDFs are treated differently than other files because the default is to 
> render the page and then run OCR "inline", whereas for other file formats, we 
> run OCR on embedded images, which are treated as embedded files. We can count 
> tesseract as the parser for embedded images in regular files, but we can't do 
> that with PDFs ... yet.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
