[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tyler Palsulich updated TIKA-93:
--------------------------------
Attachment: TesseractOCR_Tyler.patch
Awesome! I attached another patch that includes TesseractOCRParser.patch plus
unit tests for the parser (PDF, PPTX, and DOCX files with embedded images that
contain text). We could use more tests for images with no text, blurry text,
and so on, but I don't know how good Tesseract is.
Steps to apply this patch: install Tesseract \[1\], apply the patch, and move
the test files into tika-parsers/src/test/resources/test-documents/ocr. Then
run the tests with {{mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest
-DfailIfNoTests=false}}.
What needs to happen from here? How should we include Tesseract in the sources?
How should we handle timeouts (for example, by warning the user that OCR can be
slow or that it timed out)?
\[1\] - [https://code.google.com/p/tesseract-ocr/wiki/ReadMe]
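On the timeout question, here is a minimal sketch of one option. It is not what any of the attached patches do. It assumes Java 8 or later (for {{Process.waitFor(long, TimeUnit)}} and {{destroyForcibly()}}) and a {{tesseract}} binary on the PATH, invoked in the usual {{tesseract <image> <outputbase> -l <lang>}} form. The class and method names ({{OcrTimeoutSketch}}, {{buildCommand}}, {{runWithTimeout}}) are hypothetical and exist only for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika or of the attached patches.
final class OcrTimeoutSketch {

    // Builds "tesseract <image> <outputbase> -l <lang>".
    // Tesseract itself appends ".txt" to the output base name.
    static List<String> buildCommand(String imagePath, String outputBase, String language) {
        return Arrays.asList("tesseract", imagePath, outputBase, "-l", language);
    }

    // Runs the command and returns true only if it exits with status 0
    // within the timeout. On timeout the process is killed, a warning is
    // printed, and false is returned.
    static boolean runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Send the child's output to a temp file so it cannot block on a full pipe.
        File log = File.createTempFile("ocr-sketch", ".log");
        log.deleteOnExit();

        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);
        builder.redirectOutput(log);

        Process process = builder.start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            System.err.println("Warning: OCR timed out after " + timeoutSeconds
                    + "s; no OCR text was extracted.");
            return false;
        }
        return process.exitValue() == 0;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: OcrTimeoutSketch <image> <outputbase>");
            return;
        }
        // The 120 second limit is an arbitrary value chosen for this sketch.
        boolean ok = runWithTimeout(buildCommand(args[0], args[1], "eng"), 120);
        System.out.println(ok ? "OCR finished" : "OCR failed or timed out");
    }
}
```

A parser could treat a {{false}} result as "no OCR text available" and carry on with the rest of the document, so a slow image never stalls the whole parse. Whether the limit should be configurable, and where the warning should be reported, is still open.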
> OCR support
> -----------
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.6
>
> Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch,
> TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are
> command line OCR tools like Tesseract
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.