[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012810#comment-14012810 ]
Luis Filipe Nassif commented on TIKA-93:
----------------------------------------

Thank you very much [~tpalsulich] for including unit tests! We could also include tests for normal (non-embedded) images.

There is a simple timeout control that throws a TikaException with a specific message if the timeout is reached.

The idea behind requiring a TesseractOCRConfig object to be set in the ParseContext before OCR runs is to avoid affecting users who do not want OCR, precisely because it can take seconds or even minutes. That way, TesseractOCRParser can be included in the Tika parser list by default without any problem. We could also include a warning about OCR slowness in the class description.

I have no idea how to include Tesseract in the sources. Maybe Tika committers can help with this?

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch,
> TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are
> command line OCR tools like Tesseract
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.

--
This message was sent by Atlassian JIRA
(v6.2#6252)