[https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260545#comment-17260545]
Tim Allison edited comment on TIKA-3258 at 1/7/21, 3:20 PM:
------------------------------------------------------------
Reports on 10k random PDFs from our corpus are available here:
[https://corpora.tika.apache.org/base/reports/reports-10kpdfs.tgz]
I ran extraction with 10 threads in tika batch mode on our regression server
with {{-Xmx12g}}, the default timeout of 5 minutes per file, and no image
preprocessing.
Thank you, [~lewismc]! The numbers support [~tilman]'s point rather
dramatically:
* OCR was triggered on at least one page in 1,575 of the 10k files [0]
* only a 2% increase in "common" tokens with auto OCR
* roughly 160x the processing time (6.5 minutes without OCR vs 17 hours with auto OCR)
* timeouts on 162 files in the auto OCR group
[0] Not available in the reports, but (now) greppable:
{{grep -l -R "pdf:page-ocrd" /data1/extracts/tika_1_25_10k_pdfs_auto_ocr/ | wc -l}}
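For anyone who wants to reproduce the count in [0] without grep, or to check the slowdown figure, here is a minimal Python sketch. The function names {{count_files_containing}} and {{slowdown}} are made up for illustration; the only things taken from the comment above are the {{pdf:page-ocrd}} marker, the extracts directory, and the two timings.

```python
from pathlib import Path


def count_files_containing(root, marker="pdf:page-ocrd"):
    """Count regular files under root whose raw bytes contain marker.

    Rough equivalent of: grep -l -R "<marker>" <root> | wc -l
    """
    needle = marker.encode("utf-8")
    count = 0
    for path in Path(root).rglob("*"):
        # Skip directories; read each file as bytes to avoid encoding errors.
        if path.is_file() and needle in path.read_bytes():
            count += 1
    return count


def slowdown(baseline_minutes, ocr_hours):
    """Ratio of OCR run time to baseline run time, both converted to minutes."""
    return ocr_hours * 60 / baseline_minutes


if __name__ == "__main__":
    # 17 hours vs 6.5 minutes, as reported in the comment.
    print(f"slowdown: {slowdown(6.5, 17):.0f}x")
    # Path from footnote [0]; it only exists on the regression server.
    # print(count_files_containing("/data1/extracts/tika_1_25_10k_pdfs_auto_ocr/"))
```

The exact ratio works out to about 157, which the bullet above rounds to 160x.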
> Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0
> ---------------------------------------------------------
>
> Key: TIKA-3258
> URL: https://issues.apache.org/jira/browse/TIKA-3258
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
>
> In Tika 1.x, users currently have to deal with the fiddly mess of configuring
> OCR for PDFs; it doesn't just work out of the box. We did this initially
> because of concerns (well, reality) of crazy resource consumption for some
> PDFs that can have thousands of images per page that are stitched together to
> make a reasonable composite.
> Since then, we've added option 2, which renders each page and then runs OCR
> on that composite image rather than running OCR on each inline image...so
> we'll only call tesseract once per page. Second, we've added an 'auto' mode
> that runs OCR only on pages that didn't have much text extracted. While
> there is plenty of room for improvement in the 'auto' heuristic, I think we
> should move to running OCR automatically on PDFs as default in 2.0.0.
> Under this proposal, users will now have to disable OCR if they have
> tesseract installed but don't want to run it on PDFs.
> This will be a breaking change, and we'll make sure to document it early and
> often in the "Breaking Changes" sections of the readme.txt.
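The 'auto' mode described in the issue runs OCR only on pages that didn't have much text extracted. The idea can be sketched as below. This is not Tika's actual heuristic: the function name {{pages_needing_ocr}} and the {{min_chars}} threshold are assumptions chosen purely for illustration.

```python
def pages_needing_ocr(page_texts, min_chars=10):
    """Return the 0-based indices of pages whose extracted text is too short.

    page_texts: list of strings, one per page, from ordinary text extraction.
    min_chars:  hypothetical cutoff; pages with fewer non-whitespace-trimmed
                characters than this are candidates for OCR.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


if __name__ == "__main__":
    pages = ["A full page of extracted text.", "", "   \n", "More real text here."]
    # Pages 1 and 2 have (almost) no text, so only they would be sent to OCR.
    print(pages_needing_ocr(pages))
```

A per-page decision like this is what keeps the cost bounded: pages that already yield text are never rendered or passed to tesseract.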