[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012810#comment-14012810
 ] 

Luis Filipe Nassif commented on TIKA-93:
----------------------------------------

Thank you very much, [~tpalsulich], for including unit tests! We could also 
include tests for standalone (non-embedded) images.

There is a simple timeout control that throws a TikaException with a specific 
message when the timeout is exceeded. OCR only runs if a TesseractOCRConfig 
object has been set in the parseContext. The point of requiring this is to avoid 
affecting users who do not want OCR, precisely because it can take seconds or 
even minutes. So TesseractOCRParser can be included in Tika's default parser 
list with no problem. We could also include a warning about OCR slowness in 
the class description.
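
To make the opt-in and timeout behaviour described above concrete, here is a 
minimal, self-contained sketch. It does not use the Tika API: OcrConfig, 
OcrException and Context are hypothetical stand-ins for TesseractOCRConfig, 
TikaException and ParseContext, and the timeout is enforced with a Future, 
which is an assumption for illustration and not necessarily how the patch 
does it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class OcrOptInSketch {

    /** Hypothetical stand-in for TesseractOCRConfig; only a timeout is modeled. */
    static final class OcrConfig {
        final long timeoutMillis;
        OcrConfig(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }
    }

    /** Hypothetical stand-in for TikaException. */
    static final class OcrException extends Exception {
        OcrException(String message, Throwable cause) { super(message, cause); }
    }

    /** Hypothetical stand-in for ParseContext: a bag of settings keyed by type. */
    static final class Context {
        private final Map<Class<?>, Object> values = new HashMap<>();
        <T> void set(Class<T> key, T value) { values.put(key, value); }
        <T> T get(Class<T> key) { return key.cast(values.get(key)); }
    }

    /**
     * Runs the (possibly slow) OCR job only if the caller opted in by placing
     * an OcrConfig in the context; otherwise returns an empty string at once.
     */
    static String parse(Context context, Callable<String> ocrJob) throws OcrException {
        OcrConfig config = context.get(OcrConfig.class);
        if (config == null) {
            return ""; // no config set: OCR is skipped, so other users pay no cost
        }
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> result = executor.submit(ocrJob);
        try {
            return result.get(config.timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the job that is still running
            throw new OcrException(
                    "OCR timed out after " + config.timeoutMillis + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new OcrException("OCR interrupted", e);
        } catch (ExecutionException e) {
            throw new OcrException("OCR failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Context context = new Context();
        // Without a config the job is never started.
        System.out.println("[" + parse(context, () -> "recognized text") + "]");
        // With a config the job runs under the configured timeout.
        context.set(OcrConfig.class, new OcrConfig(2000));
        System.out.println("[" + parse(context, () -> "recognized text") + "]");
    }
}
```

In this sketch a caller who never sets the config never starts the job, which 
is the property that would make it safe to register the parser by default.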

I have no idea how to include Tesseract in the sources. Maybe Tika committers 
can help with this?

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
> TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
