> On 17 May 2016, at 05:25, Allison, Timothy B. <[email protected]> wrote:
>
> All,
> On Tika, users can choose to run OCR on inline images (and attached images,
> of course). Would it be better for us to render each full page and then run
> OCR on that?

We have an experimental integration with Tesseract, created a while ago by a GSoC student. Because it requires building C++ code, we have not integrated it into trunk, but it is on the to-do list for 2.1. The advantage of this approach is that we can keep any embedded text in the PDF and augment it with the OCR output.

https://github.com/DImuthuUpe/OCR-Plugin

John

> Best,
>
> Tim
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]

