Re: OCRing extracted inline images vs. fully rendered pages?

John Hewson Tue, 17 May 2016 09:27:32 -0700

> On 17 May 2016, at 05:25, Allison, Timothy B. <[email protected]> wrote:
> 
> All,
>  On Tika, users can choose to run OCR on inline images (and attached images, 
> of course).  Would it be better for us to render each full page and then run 
> OCR on that?


We have an experimental integration with Tesseract, created a while ago by a
GSoC student. Because it requires building C++ code, we have not integrated it
into trunk, but it is on the to-do list for 2.1. The advantage of this approach
is that we can keep any embedded text in the PDF and supplement it with the OCR
output.

https://github.com/DImuthuUpe/OCR-Plugin
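
To make the "keep the embedded text and add the OCR output" idea concrete, here
is a minimal sketch in plain Java. It is not the plugin's code and uses no
PDFBox, Tika, or Tesseract API: the class PageTextMerger, the method
mergePageText, and the idea of passing the OCR engine in as a function are all
assumptions made for illustration. Images are represented as raw byte arrays.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: these names do not come from PDFBox, Tika, or the OCR plugin.
final class PageTextMerger {

    private PageTextMerger() {
    }

    /**
     * Keeps the text already embedded in a page and appends the OCR result of
     * each image after it, one result per line. Blank OCR results are skipped.
     *
     * @param embeddedText text extracted from the page's content stream (may be null)
     * @param images       raw image data found on the page (assumed representation)
     * @param ocr          stand-in for a real OCR engine, supplied by the caller
     */
    static String mergePageText(String embeddedText,
                                List<byte[]> images,
                                Function<byte[], String> ocr) {
        StringBuilder out = new StringBuilder();

        // Embedded text comes first and is never replaced by OCR output.
        if (embeddedText != null && !embeddedText.trim().isEmpty()) {
            out.append(embeddedText.trim());
        }

        for (byte[] image : images) {
            String recognized = ocr.apply(image);
            if (recognized == null || recognized.trim().isEmpty()) {
                continue; // nothing recognized in this image
            }
            if (out.length() > 0) {
                out.append('\n');
            }
            out.append(recognized.trim());
        }
        return out.toString();
    }
}
```

A caller would pass whatever OCR wrapper it has as the third argument, for
example a lambda that hands the bytes to Tesseract. A real integration would
also need to place the recognized text at the image's position on the page,
which this sketch does not attempt.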

— John

>         Best,
> 
>                  Tim
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 


