Guys, Tesseract is itself a project written in C/C++, and it has to be compiled separately for each platform. Personally, I would make an installed Tesseract a prerequisite for those who want to work with it. I am not sure that putting Tesseract in the sources is the right way to go.
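
To illustrate what treating Tesseract as an external requirement could look like, here is a minimal sketch that checks whether a Tesseract executable can be launched before any OCR is attempted. The class name TesseractCheck and the method isAvailable are hypothetical and are not part of the TIKA-93 patch. The sketch assumes the executable is on the PATH under the given name, that "-v" prints the version and exits with status 0, and that Java 9 or later is used (for Redirect.DISCARD).

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: probes for an external OCR binary.
public class TesseractCheck {

    /**
     * Returns true if the command can be started and exits with status 0
     * within the timeout. Assumes "-v" makes the tool print its version.
     */
    public static boolean isAvailable(String command, long timeoutSeconds) {
        Process process;
        try {
            process = new ProcessBuilder(command, "-v")
                    .redirectErrorStream(true)
                    // Discard output so the child never blocks on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
        } catch (IOException e) {
            // Typically means the executable was not found or is not runnable.
            return false;
        }
        try {
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                // Did not finish in time: kill it and report it as unavailable.
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (InterruptedException e) {
            // Restore the interrupt flag and clean up the child process.
            Thread.currentThread().interrupt();
            process.destroyForcibly();
            return false;
        }
    }

    public static void main(String[] args) {
        // "tesseract" is the assumed executable name on the PATH.
        boolean found = isAvailable("tesseract", 5);
        System.out.println("Tesseract available: " + found);
    }
}
```

With a check like this, a parser could skip OCR quietly when Tesseract is not installed instead of failing, which keeps the native dependency optional.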

As for how good Tesseract is: that depends at least on the trained data and on the quality of the input images. There is no simple answer.

BR,
Oleg

On Thu, May 29, 2014 at 11:07 PM, Luis Filipe Nassif (JIRA) <[email protected]> wrote:

> [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012810#comment-14012810 ]
>
> Luis Filipe Nassif commented on TIKA-93:
> ----------------------------------------
>
> Thank you very much [~tpalsulich] for including unit tests! We could also
> include tests for normal images (not embedded).
>
> There is a simple timeout control that throws a TikaException with a
> specific message if it happens. The idea to force setting a
> TesseractOCRConfig object in parseContext to run OCR is to not affect users
> that do not want OCR, exactly because it could take seconds, even minutes.
> So TesseractOCRParser can be included in Tika Parser list by default with
> no problem. We also could include a warning about OCR slowness in the class
> description.
>
> I have no idea how to include Tesseract in the sources. Maybe Tika
> committers can help with this?
>
> > OCR support
> > -----------
> >
> >         Key: TIKA-93
> >         URL: https://issues.apache.org/jira/browse/TIKA-93
> >     Project: Tika
> >  Issue Type: New Feature
> >  Components: parser
> >    Reporter: Jukka Zitting
> >    Assignee: Chris A. Mattmann
> >    Priority: Minor
> >     Fix For: 1.6
> >
> > Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
> > TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch,
> > TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
> >
> >
> > I don't know of any decent open source pure Java OCR libraries, but
> > there are command line OCR tools like Tesseract
> > (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> > extract text content (where available) from image files.
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
