[jira] [Commented] (TIKA-93) OCR support

Chris A. Mattmann (JIRA) Sun, 16 Mar 2014 20:50:23 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937431#comment-13937431
 ]


Chris A. Mattmann commented on TIKA-93:
---------------------------------------

It's looking good, Luis! This seems like a good case, though, for using Tika's 
external parser package:

http://tika.apache.org/1.5/api/org/apache/tika/parser/external/package-summary.html

I noticed that the patch creates processes itself; it would probably be 
simpler to have it leverage ExternalParser instead.
I'm happy to work through an update to the patch to do that. Give me a day or 
so.
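
To illustrate the idea of describing an external tool as a command template
rather than spawning processes by hand, here is a minimal standard-library
sketch. It is not Tika's code and does not use the ExternalParser API: the
class name, the method name, the placeholder token strings and the file paths
are all assumptions made for this example. It only builds the argument list
and does not run Tesseract.

```java
import java.util.ArrayList;
import java.util.List;

public class ExternalCommandSketch {
    // Placeholder tokens for the input and output paths.
    // These strings are assumptions chosen for this sketch.
    static final String INPUT_TOKEN = "${INPUT}";
    static final String OUTPUT_TOKEN = "${OUTPUT}";

    // Returns a copy of the template with every placeholder replaced
    // by the corresponding path. The template itself is not modified.
    static String[] expand(String[] template, String inputPath, String outputPath) {
        List<String> cmd = new ArrayList<>();
        for (String arg : template) {
            cmd.add(arg.replace(INPUT_TOKEN, inputPath)
                       .replace(OUTPUT_TOKEN, outputPath));
        }
        return cmd.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Tesseract's basic invocation is: tesseract <image> <output base name>
        String[] template = {"tesseract", INPUT_TOKEN, OUTPUT_TOKEN};
        // Hypothetical paths, used only for illustration.
        String[] cmd = expand(template, "/tmp/scan.png", "/tmp/scan-out");
        System.out.println(String.join(" ", cmd));
    }
}
```

With a template like this, the OCR parser would only need to declare the
command and the media types it handles, and a shared component could own
temporary files, process start-up and output capture.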

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
> testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
