[jira] [Updated] (TIKA-93) OCR support

Tyler Palsulich (JIRA) Thu, 29 May 2014 11:09:22 -0700

     [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tyler Palsulich updated TIKA-93:
--------------------------------

    Attachment: TesseractOCR_Tyler.patch

Awesome! I attached another patch, which includes TesseractOCRParser.patch 
along with unit tests for the parser (PDF, PPTX, and DOCX files with embedded 
images that contain text). We could use more tests for images with no text, 
blurry text, and so on, but I don't know how good Tesseract is.

Steps to apply this patch: install Tesseract [1], apply the patch, and move 
the test files into tika-parsers/src/test/resources/test-documents/ocr. Then 
run the tests with {{mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest 
-DfailIfNoTests=false}}.

What needs to happen from here? How should we include Tesseract in the sources? 
How should we handle timeouts (for example, warn the user that OCR can be slow 
or that it timed out)?
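
On the timeout question, here is a minimal sketch of one possible approach, 
not what the attached patch does. It assumes Java 8 or later (for 
Process.waitFor with a timeout), and the class name OcrProcessRunner and its 
run method are hypothetical names used only for illustration. The idea is to 
start the external command, wait a bounded time, kill it if it overruns, and 
surface that to the caller as a TimeoutException so the caller can log a 
warning or skip OCR for that image.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika: runs an external command with a time limit.
public class OcrProcessRunner {

    /**
     * Runs the given command and returns its combined stdout/stderr as text.
     * Throws TimeoutException if the process does not finish in time.
     */
    public static String run(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException, TimeoutException {
        // Send output to a temp file so a chatty process cannot block on a full pipe.
        File output = File.createTempFile("ocr-output", ".txt");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(output)
                    .start();

            // Wait a bounded time; waitFor returns false if the process is still running.
            if (!process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                process.destroyForcibly();
                throw new TimeoutException(
                        "Command did not finish within " + timeoutMillis + " ms: " + command);
            }
            return new String(Files.readAllBytes(output.toPath()), StandardCharsets.UTF_8);
        } finally {
            output.delete();
        }
    }
}
```

A parser could pass the Tesseract command line as the list (the executable 
name, the image path, and an output base name), catch TimeoutException, and 
decide whether to warn the user or return the document without OCR text. The 
timeout value would probably need to be configurable.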

[1] https://code.google.com/p/tesseract-ocr/wiki/ReadMe

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
> TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
