[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809855#comment-16809855 ]
Tim Allison edited comment on TIKA-2749 at 4/4/19 5:12 PM:
-----------------------------------------------------------

There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.

||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file|See for example PDFBOX-2475's [rotation.pdf|https://issues.apache.org/jira/secure/attachment/12933778/rotation.pdf]. If we render the page, '2222', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan); what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |

was (Author: talli...@mitre.org):

There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file|See for example [^rotation.pdf]. If we render the page, '2222', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan); what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on
> inline images within PDFs. The user has to 1) understand that these are
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid
> strategy between the 2 options. Users should still be allowed to configure
> as they wish, of course.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
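
For discussion, here is a minimal sketch of how the diagnostics in the table might be combined into a per-page decision. None of this is existing Tika or PDFBox API: `OcrHeuristic`, `PageStats`, its fields, and the reason labels are hypothetical. The thresholds are placeholders too. The table leaves "x%" open, so the 50% image coverage below is an assumption, as are the character and OOV cutoffs; only the 10% figure for missing unicode mappings comes from the table, and even that is marked there as tentative.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: maps per-page statistics to reasons for running OCR. */
final class OcrHeuristic {

    // Placeholder thresholds; only UNMAPPED_RATIO echoes the (tentative) 10% in the table.
    static final int MIN_CHARS = 10;             // "zero or only a few characters" (assumed cutoff)
    static final double IMAGE_COVERAGE = 0.50;   // stands in for the unspecified "x%"
    static final double UNMAPPED_RATIO = 0.10;   // "anything over 10%???"
    static final double OOV_RATIO = 0.50;        // assumed cutoff for out-of-vocabulary tokens

    /** Per-page measurements; every field is an assumption about what a parser could report. */
    static final class PageStats {
        final int extractedChars;          // characters returned by text extraction
        final double imageCoverage;        // fraction of page area covered by inline images, 0..1
        final double unmappedCharRatio;    // fraction of characters lacking a unicode mapping
        final double oovRatio;             // fraction of tokens not found in a vocabulary
        final boolean hasVectorGraphics;   // page contains drawn paths

        PageStats(int extractedChars, double imageCoverage, double unmappedCharRatio,
                  double oovRatio, boolean hasVectorGraphics) {
            this.extractedChars = extractedChars;
            this.imageCoverage = imageCoverage;
            this.unmappedCharRatio = unmappedCharRatio;
            this.oovRatio = oovRatio;
            this.hasVectorGraphics = hasVectorGraphics;
        }
    }

    /** Returns one label per matching table row; an empty list means no reason to OCR was found. */
    static List<String> reasonsToOcr(PageStats s) {
        List<String> reasons = new ArrayList<>();
        boolean fewChars = s.extractedChars < MIN_CHARS;
        boolean mostlyImage = s.imageCoverage >= IMAGE_COVERAGE;

        if (fewChars && mostlyImage) {
            reasons.add("image-only");
        } else if (fewChars && s.hasVectorGraphics) {
            // No inline images to extract, so only rendering the page could trigger OCR.
            reasons.add("vector-graphics");
        } else if (!fewChars && mostlyImage) {
            reasons.add("scanned");
        }
        if (s.unmappedCharRatio > UNMAPPED_RATIO) {
            reasons.add("missing-unicode-mappings");
        }
        if (s.oovRatio > OOV_RATIO) {
            reasons.add("incorrect-mappings");
        }
        return reasons;
    }
}
```

A caller could treat a non-empty result as the trigger for OCR and use the label to choose between rendering the whole page and extracting inline images, which is roughly the hybrid strategy the issue description asks about.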