[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895212#comment-13895212 ]
Chris A. Mattmann commented on TIKA-93:
---------------------------------------

Grant, no problem at all, and happy to bear with ya. It's been a while since I delved deep into the code myself :-)

Parsers are composable; there is a CompositeParser here:

http://tika.apache.org/1.4/api/org/apache/tika/parser/CompositeParser.html

So yeah, you could have an OCRBaseParser that extends CompositeParser and then calls super with the List<Parser> of parsers to call, along with a specific MIME registry, etc. And yep, one of those parsers could be Tesseract or JavaOCR, etc.

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are
> command line OCR tools like Tesseract
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.
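
For illustration, the composition idea in the comment above (an OCRBaseParser that hands a list of child parsers to its composite superclass) could be sketched roughly as follows. This is only a sketch under stated assumptions: it uses small stand-in types (SimpleParser, SimpleCompositeParser, StubOcrEngine, all hypothetical) instead of Tika's real Parser and CompositeParser classes, so that it runs without tika-core on the classpath. The "first parser listed for a type wins" rule is an assumption of the sketch, not a statement about how Tika resolves overlaps, and no actual OCR is performed.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Stand-in for the role played by Tika's Parser interface (not the real API).
interface SimpleParser {
    Set<String> supportedTypes();
    String parse(byte[] data);
}

// Stand-in for CompositeParser: maps each media type to one child parser.
class SimpleCompositeParser {
    private final Map<String, SimpleParser> byType = new LinkedHashMap<>();

    SimpleCompositeParser(List<SimpleParser> parsers) {
        for (SimpleParser p : parsers) {
            for (String type : p.supportedTypes()) {
                // Assumption for this sketch: the first parser listed for a type wins.
                byType.putIfAbsent(type, p);
            }
        }
    }

    String parse(String mediaType, byte[] data) {
        SimpleParser p = byType.get(mediaType);
        if (p == null) {
            throw new IllegalArgumentException("No parser for " + mediaType);
        }
        return p.parse(data);
    }
}

// Hypothetical engine wrapper; a real one would invoke Tesseract or JavaOCR.
class StubOcrEngine implements SimpleParser {
    private final String name;
    private final Set<String> types;

    StubOcrEngine(String name, String... types) {
        this.name = name;
        this.types = new HashSet<>(Arrays.asList(types));
    }

    public Set<String> supportedTypes() {
        return Collections.unmodifiableSet(types);
    }

    public String parse(byte[] data) {
        // Placeholder output; no OCR is performed here.
        return name + " saw " + data.length + " bytes";
    }
}

// The shape suggested in the comment: a base parser that passes its children to super.
class OcrBaseParser extends SimpleCompositeParser {
    OcrBaseParser(List<SimpleParser> engines) {
        super(engines);
    }
}

class OcrSketch {
    static String demo() {
        OcrBaseParser ocr = new OcrBaseParser(Arrays.<SimpleParser>asList(
                new StubOcrEngine("tesseract", "image/tiff", "image/png"),
                new StubOcrEngine("javaocr", "image/png", "image/jpeg")));
        return ocr.parse("image/png", new byte[3])
                + "; " + ocr.parse("image/jpeg", new byte[5]);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real Tika-based version, the stand-ins would presumably be replaced by Tika's own Parser and CompositeParser types, with the media type registry passed to super alongside the parser list, as the comment suggests.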