[jira] [Commented] (TIKA-3258) Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0

Annie Didier (Jira) Wed, 06 Jan 2021 07:30:08 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259807#comment-17259807
 ]


Annie Didier commented on TIKA-3258:
------------------------------------

Other than resource consumption, I don't see major downsides to this. If a user
really cares about resource consumption, they can turn it off. I like the idea
that auto mode is triggered at a certain threshold and doesn't run for every
page, but I don't think the proposed threshold will be sufficient. A page can
have a figure with a very long caption, or be half text and half images, and
still benefit from OCR. It's been a while since I've used Tika or looked at the
source code, so my naive question is: would it make sense to trigger Tesseract
based on whether or not there is an image on the page?
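
To make the comparison concrete, here is a minimal sketch of the two triggers
discussed in this comment: one based on how much text was extracted, one based
on whether the page contains an image. Everything in it is an assumption made
for illustration: PageStats, ocr_if_little_text, ocr_if_any_image, and the
min_chars value are hypothetical names and numbers, not Tika classes, Tika
settings, or the threshold actually proposed in the issue.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page summary; not a Tika class."""
    extracted_chars: int  # characters found by ordinary text extraction
    image_count: int      # inline images on the page


def ocr_if_little_text(page: PageStats, min_chars: int = 10) -> bool:
    """Text-threshold trigger: OCR only pages that yielded little text.

    min_chars is an illustrative value, not the threshold from the issue.
    """
    return page.extracted_chars < min_chars


def ocr_if_any_image(page: PageStats) -> bool:
    """Image-presence trigger: OCR any page that contains an image."""
    return page.image_count > 0


# A figure with a very long caption: plenty of extracted text, one image.
figure_page = PageStats(extracted_chars=600, image_count=1)
# A scanned page: no extractable text, one full-page image.
scanned_page = PageStats(extracted_chars=0, image_count=1)
# A plain text page: lots of text, no images.
text_page = PageStats(extracted_chars=2500, image_count=0)

for name, page in [("figure", figure_page),
                   ("scanned", scanned_page),
                   ("text", text_page)]:
    print(name, ocr_if_little_text(page), ocr_if_any_image(page))
```

In this sketch both triggers agree on the scanned page and the plain text
page. They differ on the figure page: the text-threshold trigger skips it
because the caption alone pushes it over the limit, while the image-presence
trigger sends it to OCR. The cost of the image-presence trigger is that it
would also fire on pages whose only image is decorative, which brings back
some of the resource consumption concern.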

> Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0
> ---------------------------------------------------------
>
>                 Key: TIKA-3258
>                 URL: https://issues.apache.org/jira/browse/TIKA-3258
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> In Tika 1.x, OCR of PDFs is a fiddly mess that users have to configure
> themselves...it doesn't just work out of the box.  We did this initially
> because of concerns (well, the reality) of crazy resource consumption for
> some PDFs, which can have thousands of images per page that are stitched
> together to make a reasonable composite.
> Since then, we've made two changes.  First, we've added option 2, which
> renders each page and then runs OCR on that composite image rather than
> running OCR on each inline image...so we'll only call tesseract once per
> page.  Second, we've added an 'auto' mode that runs OCR only on pages that
> didn't have much text extracted.  While there is plenty of room for
> improvement in the 'auto' heuristic, I think we should move to running OCR
> automatically on PDFs as the default in 2.0.0.
> Under this proposal, users will now have to disable OCR if they have 
> tesseract installed but don't want to run it on PDFs.
> This will be a breaking change, and we'll make sure to document it early and 
> often in the "Breaking Changes" sections of the readme.txt.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
