[jira] [Commented] (TIKA-3258) Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0

Tim Allison (Jira) Mon, 04 Jan 2021 11:00:43 -0800


    [ https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258428#comment-17258428 ]


Tim Allison commented on TIKA-3258:
-----------------------------------

[~tilman] and others...this issue is intended for discussion before abandonment 
or implementation.  Thank you for the feedback!

As I see it...

In favor:
a) users have to have tesseract installed and runnable as "tesseract" (on their 
path).  This means there's a certain bar to entry...users without tesseract 
won't notice a thing.
b) auto mode will not trigger tesseract unless fewer than 10 words 
(characters?) are extracted from a page
c) auto mode will render the page and then run tesseract once...rendering is 
expensive, but this is better than running tesseract on potentially thousands 
of images per page.
d) this will treat PDFs like all other complex file formats, e.g. ppt, pptx, 
xls, etc.  If you have tesseract on your path, it will be run.  We'll still 
need documentation on the other options, but this will make PDF processing much 
more like the other file formats.
e) Tika 2.0.0 is a natural break point to introduce a change that could 
potentially have a large effect.
f) making this less fiddly means more folks will get more text (and, as a 
bonus, more users on AUTO mode means more feedback for improving it).  Today, 
if users _think_ they are OCR'ing PDFs but haven't configured something 
correctly, they can silently get no text.
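
A rough sketch of how (a) through (c) fit together, in case it helps the 
discussion.  This is not the actual PDFParser code: the class, the method 
names, and the hard-coded threshold of 10 words are all made up for 
illustration, and the PATH check ignores Windows details such as the ".exe" 
suffix.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: hypothetical names, not Tika's API.
class AutoOcrSketch {

    // Assumed cutoff, taken from the "fewer than 10 words" figure in (b).
    static final int MIN_WORDS_PER_PAGE = 10;

    // (a) True if an executable file with this name sits in a PATH directory.
    static boolean isOnPath(String executable, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return false;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Paths.get(dir, executable);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return true;
            }
        }
        return false;
    }

    // (b) OCR a page only if tesseract is available and little text came out.
    static boolean shouldOcrPage(int wordsExtracted, boolean tesseractAvailable) {
        return tesseractAvailable && wordsExtracted < MIN_WORDS_PER_PAGE;
    }

    // (c) The page is rendered to one image, so there is at most one
    // tesseract call per page, however many inline images the page holds.
    static int tesseractCallsForPage(int wordsExtracted, int inlineImages,
                                     boolean tesseractAvailable) {
        return shouldOcrPage(wordsExtracted, tesseractAvailable) ? 1 : 0;
    }

    public static void main(String[] args) {
        boolean available = isOnPath("tesseract", System.getenv("PATH"));
        int[] wordsPerPage = {250, 3, 0};   // made-up extraction results
        for (int words : wordsPerPage) {
            System.out.println(words + " words -> tesseract calls: "
                    + tesseractCallsForPage(words, 1000, available));
        }
    }
}
```

Without tesseract on the path, every page reports zero calls, which is the 
"users without tesseract won't notice a thing" case.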

Against:
a) higher resource consumption...potentially really, really bad for some users.

The good news is that higher resource usage is at least visible...what many 
people are probably not noticing right now is that OCR is silently not being 
run on PDFs that require it.

Are there other points for the against list?  Does the against list outweigh 
the in-favor list?

> Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0
> ---------------------------------------------------------
>
>                 Key: TIKA-3258
>                 URL: https://issues.apache.org/jira/browse/TIKA-3258
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>
> In Tika 1.x we currently have the fiddly mess that users have to configure 
> OCR of PDFs...it doesn't just work out of the box.  We did this initially 
> because of concerns (well, reality) of crazy resource consumption for some 
> PDFs that can have thousands of images per page that are stitched together to 
> make a reasonable composite.
> Since then, we've added option 2, which renders each page and then runs OCR 
> on that composite image rather than running OCR on each inline image...so 
> we'll only call tesseract once per page.  Second, we've added an 'auto' mode 
> that runs OCR only on pages that didn't have much text extracted.  While 
> there is plenty of room for improvement in the 'auto' heuristic, I think we 
> should move to running OCR automatically on PDFs as default in 2.0.0. 
> Under this proposal, users will now have to disable OCR if they have 
> tesseract installed but don't want to run it on PDFs.
> This will be a breaking change, and we'll make sure to document it early and 
> often in the "Breaking Changes" sections of the readme.txt.



