[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895241#comment-13895241
 ] 

Grant Ingersoll commented on TIKA-93:
-------------------------------------

Food for thought:

We introduce an OCRParser that extends Parser (and we'd likely have a base 
class too).
In the Context, we set the instance, just like we do with Parser.class:
{code}context.set(Parser.class, parser);{code}
i.e.
{code}context.set(OCRParser.class, ocrParser);{code}

Then, over time, we can add to the various parsers the ability to apply the 
OCRParser, within the context of the current parser, whenever they detect 
image data.  So, for instance, the PDFParser could optionally extract text 
from any images it finds.  The other benefit, of course, is that the OCRParser 
implementation will also work on its own on anything that is an image.
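
A rough sketch of how that lookup might work. This uses simplified stand-in types rather than the real Tika interfaces: Context, Parser and parseDocument below are placeholders for illustration only, and OCRParser is the proposed type, which does not exist yet.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OcrContextSketch {

    /** Stand-in for a parser; the real Tika interface has a richer signature. */
    public interface Parser {
        String parse(byte[] data, Context context);
    }

    /** Proposed (hypothetical) type: a Parser that extracts text from image bytes. */
    public interface OCRParser extends Parser {
    }

    /** Minimal type-keyed registry, modelled on the context.set(Class, instance) calls above. */
    public static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        public <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        public <T> T get(Class<T> key) {
            // Returns null when nothing was registered for this type.
            return key.cast(entries.get(key));
        }
    }

    /**
     * What a container parser (e.g. for PDF) could do: emit its own text and,
     * only if an OCRParser was placed in the context, the text of each image.
     */
    public static String parseDocument(String bodyText, List<byte[]> embeddedImages, Context context) {
        StringBuilder out = new StringBuilder(bodyText);
        OCRParser ocr = context.get(OCRParser.class);
        if (ocr != null) {
            for (byte[] image : embeddedImages) {
                out.append('\n').append(ocr.parse(image, context));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<byte[]> images = Arrays.asList(new byte[3]);
        Context context = new Context();

        // No OCRParser registered: images are skipped, as parsers behave today.
        System.out.println(parseDocument("page text", images, context));

        // Fake OCR engine for the demo; a real one might shell out to Tesseract.
        OCRParser fakeOcr = (image, ctx) -> "[ocr:" + image.length + " bytes]";
        context.set(OCRParser.class, fakeOcr);
        System.out.println(parseDocument("page text", images, context));
    }
}
```

The point of keying on OCRParser.class is that OCR stays opt-in: a parser that finds nothing in the context simply behaves as it does now.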

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
