[jira] [Commented] (TIKA-93) OCR support

Chris A. Mattmann (JIRA) Fri, 07 Feb 2014 13:43:34 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895083#comment-13895083
 ]


Chris A. Mattmann commented on TIKA-93:
---------------------------------------

Thanks Grant, obtaining glory is win. 
It still sounds like a Parser to me, though I'll be interested to see whether 
you whip out some patches and what they would look like. The nice thing about 
Parsers is that they emit XHTML as SAX events, which you can then transform 
with ContentHandlers, and that is where Tika's real pipeline capabilities are. 
So moving into Parser-ville gets you a pipeline effect downstream at least.
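
To make the pipeline point concrete, here is a rough sketch using only the JDK's SAX classes. It is not Tika code: a real Tika Parser would receive the ContentHandler through its parse(InputStream, ContentHandler, Metadata, ParseContext) method, while emitXhtml and UpperCaseTextHandler below are made-up names standing in for an OCR parser and a downstream transform.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

import java.util.Locale;

public class OcrPipelineSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for an OCR parser: it takes already-recognized
    // text and reports it as XHTML SAX events, one <p> per input line.
    static void emitXhtml(String recognizedText, ContentHandler handler)
            throws SAXException {
        Attributes none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        for (String line : recognizedText.split("\n")) {
            handler.startElement(XHTML, "p", "p", none);
            char[] chars = line.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    // Hypothetical downstream handler: upper-cases the text it receives and
    // ends each paragraph with a newline. Any ContentHandler could sit here.
    static class UpperCaseTextHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(new String(ch, start, length).toUpperCase(Locale.ROOT));
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(localName)) {
                out.append('\n');
            }
        }

        String text() {
            return out.toString();
        }
    }

    public static void main(String[] args) throws SAXException {
        UpperCaseTextHandler handler = new UpperCaseTextHandler();
        emitXhtml("scanned page\nsecond line", handler);
        System.out.print(handler.text());
    }
}
```

The emitting side knows nothing about what the handler does with the events, which is the property that would let an OCR Parser plug into existing ContentHandler chains.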

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.
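
For illustration only, invoking a command-line OCR tool from Java might look roughly like the sketch below. It assumes a tesseract binary on the PATH and the usual "tesseract <image> <outputbase>" invocation, which writes the recognized text to <outputbase>.txt. The class and method names are hypothetical and this is not how Tika itself does (or would do) it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class TesseractCliSketch {

    // Builds the command line; assumes "tesseract" is resolvable via PATH.
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    // Runs the external tool and returns the text it wrote to <outputBase>.txt.
    static String ocr(Path image) throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("ocr");
        Path outputBase = workDir.resolve("out");

        Process process = new ProcessBuilder(buildCommand(image, outputBase))
                .inheritIO() // let the tool's own messages reach the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }

        Path textFile = workDir.resolve("out.txt");
        return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            System.out.print(ocr(Paths.get(args[0])));
        } else {
            // Without an image argument, only show the command that would run.
            System.out.println(buildCommand(Paths.get("scan.png"), Paths.get("out")));
        }
    }
}
```

Shelling out like this keeps the OCR engine outside the JVM, at the cost of requiring the tool to be installed wherever the code runs.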



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
