[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908877#comment-13908877 ]
Luis Filipe Nassif commented on TIKA-93:
----------------------------------------

Another approach would be to add the image and PDF types to the supportedTypes of OCRParser and call their respective parsers from within the OCRParser, instead of modifying the code of the existing parsers.

As for enabling and configuring the OCRParser, it could be registered in tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser and could be passed an OCRConfig object via the ParseContext. If not enabled, OCRParser could simply call the existing image or PDF parser.

I agree with Timo that it would be better to render the PDF pages to images rather than iterate over its objects.

Finally, Tesseract already includes support for TIFF files.

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
> TIKA-93.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are
> command line OCR tools like Tesseract
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.
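
To make the delegation idea in the comment above concrete, here is a minimal sketch. It deliberately does not use the real Tika classes: SimpleParser and SimpleContext are simplified stand-ins for Tika's Parser and ParseContext, and OcrConfig, its fields, the OcrParser constructor, and the fakeOcr function (a placeholder for a call out to Tesseract) are all hypothetical names chosen for illustration. It also assumes that, when OCR is enabled, the recognised text is appended to whatever the existing parser produced; the comment does not specify this.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

/** Entry point; can be run with `java OcrDelegationSketch.java` on Java 11+. */
class OcrDelegationSketch {
    public static void main(String[] args) {
        // Stub standing in for the existing image parser (yields no text).
        SimpleParser imageParser = (type, data, ctx) -> "";
        // Stub standing in for an external OCR tool such as Tesseract.
        BiFunction<byte[], OcrConfig, String> fakeOcr =
                (data, cfg) -> "text recognised in " + cfg.language;

        Map<String, SimpleParser> delegates = new HashMap<>();
        delegates.put("image/tiff", imageParser);
        OcrParser parser = new OcrParser(delegates, fakeOcr);

        // No OcrConfig in the context: behaves like the existing parser.
        SimpleContext plain = new SimpleContext();
        System.out.println("[" + parser.parse("image/tiff", new byte[0], plain) + "]");

        // OcrConfig present and enabled: OCR output is added.
        SimpleContext withOcr = new SimpleContext();
        withOcr.set(OcrConfig.class, new OcrConfig(true, "eng"));
        System.out.println("[" + parser.parse("image/tiff", new byte[0], withOcr) + "]");
    }
}

/** Simplified stand-in for a parser; NOT Tika's real Parser interface. */
interface SimpleParser {
    String parse(String mediaType, byte[] data, SimpleContext context);
}

/** Simplified stand-in for a typed context object; NOT Tika's ParseContext. */
final class SimpleContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        return key.cast(entries.get(key)); // null when nothing was set
    }
}

/** Hypothetical OCR settings carried in the context. */
final class OcrConfig {
    final boolean enabled;
    final String language;

    OcrConfig(boolean enabled, String language) {
        this.enabled = enabled;
        this.language = language;
    }
}

/** Wraps existing parsers and optionally adds OCR text to their output. */
final class OcrParser implements SimpleParser {
    private final Map<String, SimpleParser> delegates;
    private final BiFunction<byte[], OcrConfig, String> ocrEngine;

    OcrParser(Map<String, SimpleParser> delegates,
              BiFunction<byte[], OcrConfig, String> ocrEngine) {
        this.delegates = delegates;
        this.ocrEngine = ocrEngine;
    }

    /** The supported types are exactly those of the wrapped parsers. */
    Set<String> supportedTypes() {
        return delegates.keySet();
    }

    @Override
    public String parse(String mediaType, byte[] data, SimpleContext context) {
        SimpleParser delegate = delegates.get(mediaType);
        if (delegate == null) {
            throw new IllegalArgumentException("Unsupported type: " + mediaType);
        }
        // Always run the existing parser first, leaving its code untouched.
        String text = delegate.parse(mediaType, data, context);

        OcrConfig config = context.get(OcrConfig.class);
        if (config == null || !config.enabled) {
            return text; // OCR not enabled: plain delegation only
        }
        String recognised = ocrEngine.apply(data, config);
        return text.isEmpty() ? recognised : text + "\n" + recognised;
    }
}
```

The point of this shape is that the existing image and PDF parsers need no changes: callers that never put an OcrConfig into the context get the same output as before, and OCR becomes an opt-in layer on top.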