[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895897#comment-13895897 ]
Nick Burch commented on TIKA-93:
--------------------------------
Generally speaking, when a parser finds an embedded resource, it calls out to
the Parser set on the context to have that resource processed. You could
therefore set your OCR Parser there, and it'd be called for all kinds of
embedded resources. It can then OCR any suitable images it finds, and pass
everything else on to another parser (e.g. DefaultParser) to have the
non-OCR-able embedded parts handled (if required).
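
The hand-off for embedded resources can be sketched roughly as below. This is a simplified, self-contained model, not Tika code: `SimpleParser`, `FallbackParser`, `OcrParser` and `parseContainer` are hypothetical stand-ins, and the "OCR" step is a placeholder string. In Tika itself the hand-off goes through the `ParseContext` (e.g. `context.set(Parser.class, yourParser)`).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration only; these are NOT Tika's real interfaces.
final class EmbeddedOcrSketch {

    /** Hypothetical minimal parser contract: turn a typed resource into text. */
    interface SimpleParser {
        String parse(String mediaType, String content);
    }

    /** Stand-in for the fallback (the role DefaultParser plays in the comment). */
    static final class FallbackParser implements SimpleParser {
        public String parse(String mediaType, String content) {
            return "text(" + content + ")";
        }
    }

    /** OCRs images itself and passes everything else on to the fallback. */
    static final class OcrParser implements SimpleParser {
        private final SimpleParser fallback;

        OcrParser(SimpleParser fallback) {
            this.fallback = fallback;
        }

        public String parse(String mediaType, String content) {
            if (mediaType.startsWith("image/")) {
                return "ocr(" + content + ")"; // placeholder for a real OCR call
            }
            return fallback.parse(mediaType, content);
        }
    }

    /** A container parser hands each embedded resource to the parser on the context. */
    static List<String> parseContainer(SimpleParser contextParser, String[][] embedded) {
        List<String> out = new ArrayList<>();
        for (String[] resource : embedded) {
            out.add(contextParser.parse(resource[0], resource[1]));
        }
        return out;
    }
}
```

The container parser never needs to know about OCR; it only sees whatever parser was placed on the context, which is why swapping that one entry is enough to cover every kind of embedded resource.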
To handle OCRing of top-level content, e.g. images, you'd need to register your
OCR parser as the parser for those types, in place of (or possibly even
wrapping) the default parser.
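
For the top-level case, the "wrapping" option might look something like the sketch below. Again this is a self-contained model under stated assumptions rather than Tika's API: `Registry`, `withOcr` and the string-based `SimpleParser` are hypothetical, and the media types passed in are only examples. In Tika a parser declares the types it handles via `getSupportedTypes(ParseContext)`.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model for illustration only; NOT Tika's real registration mechanism.
final class TopLevelOcrSketch {

    /** Hypothetical minimal parser contract. */
    interface SimpleParser {
        String parse(String mediaType, String content);
    }

    /** Hypothetical registry: per-type parsers, with a default for everything else. */
    static final class Registry implements SimpleParser {
        private final Map<String, SimpleParser> byType = new HashMap<>();
        private final SimpleParser defaultParser;

        Registry(SimpleParser defaultParser) {
            this.defaultParser = defaultParser;
        }

        void register(String mediaType, SimpleParser parser) {
            byType.put(mediaType, parser);
        }

        public String parse(String mediaType, String content) {
            return byType.getOrDefault(mediaType, defaultParser).parse(mediaType, content);
        }
    }

    /** Registers an OCR parser that wraps the default parser for the given types. */
    static Registry withOcr(SimpleParser defaultParser, String... imageTypes) {
        Registry registry = new Registry(defaultParser);
        // Wrapping: keep the default parser's output, then append the OCR text.
        SimpleParser ocr = (type, content) ->
                defaultParser.parse(type, content) + "+ocr(" + content + ")";
        for (String type : imageTypes) {
            registry.register(type, ocr);
        }
        return registry;
    }
}
```

Wrapping rather than replacing keeps whatever the default parser already extracts for those types and adds the OCR text on top; replacing would simply register an OCR-only parser for the same types.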
> OCR support
> -----------
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Chris A. Mattmann
> Priority: Minor
> Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are
> command line OCR tools like Tesseract
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.