[ 
https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-3270:
-------------------------------

Have to rework the logic a bit.  The default rendering strategy is "render 
with no text and then run OCR".  However, we should make the default a bit 
smarter...an AUTO rendering mode.

If you're in AUTO (OCR mode) and OCR is triggered because of missing Unicode 
code points, then you'd want to run OCR on everything. If it is triggered 
because of too few characters, then you'd still want to run OCR on everything.

If you're in OCR_ONLY mode, you'd want to run OCR on everything (or maybe 
_only_ the text?)

If you're in TEXT_AND_OCR mode, you'd want OCR on the non-text bits.
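
A rough sketch of the mode-to-rendering decision described above, assuming 
OCR has already been triggered.  All of the names here (RenderTargetSketch, 
OcrMode, Trigger, RenderTarget, choose) are hypothetical and for illustration 
only; they are not Tika's actual classes or config values.

```java
// Hypothetical sketch; none of these names come from Tika itself.
class RenderTargetSketch {

    // OCR modes as named in the comment above.
    enum OcrMode { AUTO, OCR_ONLY, TEXT_AND_OCR }

    // Why OCR was triggered in AUTO mode (assumed names).
    enum Trigger { MISSING_UNICODE, TOO_FEW_CHARS }

    // What gets rendered before OCR runs (assumed names).
    enum RenderTarget { WHOLE_PAGE, NON_TEXT_ONLY }

    // Pick what to render, given that OCR is going to run.
    // The trigger only matters in AUTO mode, and both triggers
    // currently lead to the same answer.
    static RenderTarget choose(OcrMode mode, Trigger trigger) {
        switch (mode) {
            case AUTO:
                // Missing Unicode code points or too few characters:
                // either way the extracted text is unreliable, so OCR everything.
                return RenderTarget.WHOLE_PAGE;
            case OCR_ONLY:
                // Open question above: everything, or maybe only the text?
                // This sketch goes with "everything".
                return RenderTarget.WHOLE_PAGE;
            case TEXT_AND_OCR:
                // Text is extracted normally, so OCR only the non-text bits.
                return RenderTarget.NON_TEXT_ONLY;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

The trigger is passed in even though both values give the same result today, 
so the two AUTO cases stay visible if they ever need to diverge.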



> Render non-text in PDFs for OCR
> -------------------------------
>
>                 Key: TIKA-3270
>                 URL: https://issues.apache.org/jira/browse/TIKA-3270
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: test-no-text.png, test.png, tiger-no-text.png, tiger.pdf
>
>
> When we render a PDF page for OCR, we are relying on PDFBox to render all of 
> the contents of the page, including text that may be available via regular 
> extraction methods.
> The result of this is that if a user selects ocr_and_text, there can be 
> duplicate text -- text as stored in PDFs and the text generated via OCR.  In 
> the xhtml output, we do mark a separate "div" for OCR so that users can 
> distinguish, but still, it might be useful not to have to run OCR on text 
> that was reliably extracted.
> One solution to this was proposed by [~lfcnassif] on TIKA-3258, with a 
> technical/implementation recommendation by [~tilman] to subclass PDFRenderer 
> and PageDrawer to render only the image components of a page.
> This would be a new, non-breaking feature.  This is not a blocker on 2.0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
