[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908877#comment-13908877
 ] 

Luis Filipe Nassif commented on TIKA-93:
----------------------------------------

Another approach would be to add the image and PDF types to the 
supportedTypes of OCRParser and have OCRParser call the respective parsers 
itself, instead of modifying the code of the existing parsers. 

As for enabling and configuring the OCRParser, it could be registered in 
tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser 
and be passed an OCRConfig object via the ParseContext. If OCR is not enabled, 
OCRParser could simply call the existing image or PDF parser.
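
A rough sketch of the delegation idea described above, in plain Java with no 
Tika dependency. Every type here (SimpleParser, OcrConfig, ExistingParser, the 
Map used as a context) is a hypothetical stand-in invented for illustration 
and is not Tika's actual API; runOcr is a placeholder where a call to 
Tesseract would go.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class OcrDelegationSketch {

    /** Hypothetical stand-in for a parser interface (not Tika's Parser). */
    public interface SimpleParser {
        Set<String> supportedTypes();
        String parse(byte[] data, String mediaType, Map<Class<?>, Object> context);
    }

    /** Hypothetical config object; the single "enabled" flag is an assumption. */
    public static final class OcrConfig {
        public final boolean enabled;
        public OcrConfig(boolean enabled) { this.enabled = enabled; }
    }

    /** Stand-in for the existing image/PDF parser, left unmodified. */
    public static final class ExistingParser implements SimpleParser {
        public Set<String> supportedTypes() {
            return Set.of("image/tiff", "application/pdf");
        }
        public String parse(byte[] data, String mediaType, Map<Class<?>, Object> context) {
            return "metadata-only:" + mediaType;
        }
    }

    /** Claims the same types as the wrapped parser and delegates to it. */
    public static final class OcrParser implements SimpleParser {
        private final SimpleParser delegate;

        public OcrParser(SimpleParser delegate) { this.delegate = delegate; }

        public Set<String> supportedTypes() { return delegate.supportedTypes(); }

        public String parse(byte[] data, String mediaType, Map<Class<?>, Object> context) {
            // Always run the existing parser first.
            String base = delegate.parse(data, mediaType, context);
            // Look up the config by class, the way a context object might hand it over.
            Object cfg = context.get(OcrConfig.class);
            if (!(cfg instanceof OcrConfig) || !((OcrConfig) cfg).enabled) {
                return base; // OCR not enabled: behave exactly like the existing parser
            }
            return base + "\n" + runOcr(data);
        }

        // Placeholder only: a real version would invoke an OCR engine here.
        String runOcr(byte[] data) {
            return "ocr-text(" + data.length + " bytes)";
        }
    }

    public static void main(String[] args) {
        SimpleParser parser = new OcrParser(new ExistingParser());
        Map<Class<?>, Object> context = new HashMap<>();
        System.out.println(parser.parse(new byte[3], "image/tiff", context));
        context.put(OcrConfig.class, new OcrConfig(true));
        System.out.println(parser.parse(new byte[3], "image/tiff", context));
    }
}
```

With no OcrConfig in the context the wrapper returns exactly what the 
existing parser returns, so callers that never opt in should see no change.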

I agree with Timo that it would be better to render the PDF pages to images 
rather than iterate over the PDF's objects.

Finally, Tesseract already supports TIFF files.

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
> TIKA-93.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



