[
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16807661#comment-16807661
]
Tim Allison commented on TIKA-2749:
-----------------------------------
A recent question on the user list has me returning to something I looked
into/asked about a few years ago...
One idea I had is to run tika-eval's out-of-vocabulary (OOV) calculation on a
page and, if the OOV ratio is high, trigger OCR. If we could also access font
information to determine whether fonts lack a Unicode mapping, we could skip
the OOV calculation entirely.
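The OOV check could look roughly like the sketch below. This is only a minimal illustration of the idea, not tika-eval's actual implementation; the `OovCheck` class name, the tiny hard-coded word list, and the tokenization are all invented here for demonstration (tika-eval builds its vocabularies from real language profiles):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OovCheck {
    // Hypothetical tiny vocabulary -- tika-eval uses per-language models.
    private static final Set<String> VOCAB = new HashSet<>(Arrays.asList(
            "the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"));

    /** Fraction of tokens on the page that are not in the vocabulary. */
    public static double oovRatio(String pageText) {
        String[] tokens = pageText.toLowerCase().split("\\W+");
        int total = 0, oov = 0;
        for (String t : tokens) {
            if (t.isEmpty()) continue;
            total++;
            if (!VOCAB.contains(t)) oov++;
        }
        // Treat an empty page as fully OOV so it would trigger OCR.
        return total == 0 ? 1.0 : (double) oov / total;
    }

    public static void main(String[] args) {
        // Clean extraction scores low; mojibake scores high and
        // would trip an OCR threshold.
        System.out.println(oovRatio("the quick brown fox"));
        System.out.println(oovRatio("th3 qu1ck br0wn f0x"));
    }
}
```

A caller would compute this per page and compare against some configurable threshold before deciding to OCR.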
It looks like we might consider making the {{noUnicode}} set retrievable in
{{PDType0Font}} and {{PDSimpleFont}}, and we should catch {{TrueTypeFont}}'s
{{IOException}}?
I _think_ we could do this by grabbing the font on each {{TextPosition}} in our
subclass of TextStripper...perhaps override {{startPage}} and iterate through
{{charactersByArticle}}...or maybe just override {{processTextPosition}}?
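Overriding {{processTextPosition}} might look something like the untested sketch below. The class name, counter fields, and ratio method are invented for illustration; it assumes PDFBox 2.x, where {{PDFont#toUnicode(int)}} returns null when no Unicode mapping is available:

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Sketch only: count glyphs whose font gives us no Unicode mapping,
// as a per-document "bad text" signal.
public class UnmappedGlyphStripper extends PDFTextStripper {
    private int glyphs = 0;
    private int unmapped = 0;

    public UnmappedGlyphStripper() throws IOException {
        super();
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        glyphs++;
        PDFont font = text.getFont();
        try {
            // toUnicode(...) returns null when the font has no
            // usable ToUnicode/encoding mapping for this code.
            if (font.toUnicode(text.getCharacterCodes()[0]) == null) {
                unmapped++;
            }
        } catch (IOException e) {
            // e.g. a broken embedded TrueTypeFont -- count it as unmapped.
            unmapped++;
        }
        super.processTextPosition(text);
    }

    /** Fraction of processed glyphs that had no Unicode mapping. */
    public double unmappedRatio() {
        return glyphs == 0 ? 0.0 : (double) unmapped / glyphs;
    }
}
```

A high {{unmappedRatio()}} after stripping a page could then trigger OCR without running the OOV calculation at all.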
Is this basically reasonable? What is the best way to figure out if a font is
broken without cluttering PDFBox's API?
Are there other kinds of font brokenness, or other signals from fonts or
anything else during the parse, that we could use as a heuristic to judge
"bad text"?
> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
> Key: TIKA-2749
> URL: https://issues.apache.org/jira/browse/TIKA-2749
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on
> inline images within PDFs. The user has to 1) understand that these are
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid
> strategy between the 2 options. Users should still be allowed to configure
> as they wish, of course.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)