Hi,

> Tesseract is by itself a project that is written in C/C++ and should be
> compiled differently for each platform.

Good point! We should figure out a way to fail gracefully when Tesseract
isn't installed, right? Unless there is, in fact, some pure Java OCR
implementation.

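One way to fail gracefully could be to probe for the binary before trying
to run OCR. A minimal sketch, assuming the executable is called "tesseract"
and is on the PATH; the class name OcrAvailability, the method canRun, and
the "-v" argument are made up for illustration and are not part of Tika.
The check relies only on the fact that ProcessBuilder.start() throws an
IOException when the program cannot be launched.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: probes whether an external
// OCR command can be launched at all.
class OcrAvailability {

    /**
     * Returns true if the command could be started. ProcessBuilder.start()
     * throws IOException when the executable cannot be found or run.
     */
    static boolean canRun(String command) {
        Process process = null;
        try {
            // "-v" is assumed to be a harmless argument; only a
            // successful launch matters here, not the exit code.
            process = new ProcessBuilder(command, "-v")
                    .redirectErrorStream(true)
                    .start();
            // Give the process a few seconds to exit on its own.
            process.waitFor(5, TimeUnit.SECONDS);
            return true;
        } catch (IOException e) {
            // Executable missing or not runnable: report "not available".
            return false;
        } catch (InterruptedException e) {
            // Restore the interrupt flag and treat the probe as failed.
            Thread.currentThread().interrupt();
            return false;
        } finally {
            if (process != null) {
                process.destroy();
            }
        }
    }

    public static void main(String[] args) {
        if (canRun("tesseract")) {
            System.out.println("tesseract found, OCR can be enabled");
        } else {
            System.out.println("tesseract not found, skipping OCR");
        }
    }
}
```

A parser could call a check like this once and simply skip OCR when it
returns false, instead of throwing in the middle of a parse.
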
Another thought: we should add OCR as a command line option. One option for
extracting images, and one for running OCR (which always enables image
extraction).

Tyler

On Thu, May 29, 2014 at 1:26 PM, Oleg Tikhonov <[email protected]> wrote:

> Guys,
> Tesseract is by itself a project that is written in C/C++ and should be
> compiled differently for each platform.
> Personally, I would put a requirement for those who want to work with
> tesseract. Not sure that putting Tesseract in the sources is the right way
> to go.
>
> >>How good tesseract is - depends on trained data at least + quality of
> the input images. No simple answer exists.
>
> BR,
> Oleg
>
>
> On Thu, May 29, 2014 at 11:07 PM, Luis Filipe Nassif (JIRA) <
> [email protected]> wrote:
>
> >
> > [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012810#comment-14012810 ]
> >
> > Luis Filipe Nassif commented on TIKA-93:
> > ----------------------------------------
> >
> > Thank you very much [~tpalsulich] for including unit tests! We could also
> > include tests for normal images (not embedded).
> >
> > There is a simple timeout control that throws a TikaException with a
> > specific message if it happens. The idea to force setting a
> > TesseractOCRConfig object in parseContext to run OCR is to not affect
> > users that do not want OCR, exactly because it could take seconds, even
> > minutes. So TesseractOCRParser can be included in Tika Parser list by
> > default with no problem. We also could include a warning about OCR
> > slowness in the class description.
> >
> > I have no idea how to include Tesseract in the sources. Maybe Tika
> > committers can help with this?
> >
> > > OCR support
> > > -----------
> > >
> > >                 Key: TIKA-93
> > >                 URL: https://issues.apache.org/jira/browse/TIKA-93
> > >             Project: Tika
> > >          Issue Type: New Feature
> > >          Components: parser
> > >            Reporter: Jukka Zitting
> > >            Assignee: Chris A. Mattmann
> > >            Priority: Minor
> > >             Fix For: 1.6
> > >
> > >         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
> > > TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch,
> > > TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
> > >
> > >
> > > I don't know of any decent open source pure Java OCR libraries, but
> > > there are command line OCR tools like Tesseract (
> > > http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika
> > > to extract text content (where available) from image files.
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v6.2#6252)
> >
