[ https://issues.apache.org/jira/browse/TIKA-4202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821706#comment-17821706 ]
ASF GitHub Bot commented on TIKA-4202:
--------------------------------------

tballison merged PR #1630:
URL: https://github.com/apache/tika/pull/1630

> Add page count of OCR'd pages in metadata for PDF files
> -------------------------------------------------------
>
>                 Key: TIKA-4202
>                 URL: https://issues.apache.org/jira/browse/TIKA-4202
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Tim Allison
>            Priority: Minor
>
> It would be useful to store the number of pages that triggered OCR in PDFs.
> PDFs are treated differently than other files because the default is to
> render the page and then run OCR "inline", whereas for other file formats, we
> run OCR on embedded images, which are treated as embedded files. We can count
> tesseract as the parser for embedded images in regular files, but we can't do
> that with PDFs ... yet.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)