[
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895718#comment-13895718
]
Grant Ingersoll commented on TIKA-93:
-------------------------------------
bq. what is the dependency on jacoco in tika-parent? That stuff seems
orthogonal to the patch.
I put that in so that I can measure test coverage and see whether I am testing
sufficiently. I can split it out into a separate patch.
bq. dependency on custom external Maven repo – myGrid – any way to get the jar
from the Central repo somewhere? we have made an effort in Tika to remove any
specific deps on external repositories
We could make that one optional. All it does is add support for TIFF and a few
other file formats that aren't part of the standard ImageIO.
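If the dependency becomes optional, the calling code would need to cope with
the extra formats being absent. A minimal sketch of one way to check that at
runtime, using only the standard javax.imageio API; the class name
ImageFormatCheck is hypothetical and is not part of the patch:

```java
import javax.imageio.ImageIO;

// Hypothetical helper, not part of the TIKA-93 patch.
final class ImageFormatCheck {

    /** Returns true if any registered ImageIO reader claims the given format name. */
    static boolean canRead(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // Whether "tiff" is available depends on the JRE and on any plugins on the classpath.
        for (String format : new String[] {"png", "jpeg", "tiff"}) {
            System.out.println(format + " reader available: " + canRead(format));
        }
    }
}
```

With a check like this, an image that has no registered reader could simply be
skipped rather than failing the parse.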
bq. in my CS572 class on Search Engines where we look at FBI Vault PDF files!
http://www-scf.usc.edu/~csci572/
I read your abstract for your talk and checked out the Vault and thought it
would be cool, too. The main issue is that JavaOCR needs to be trained in
order to work with that data set. Tesseract, on the other hand, works on it
but, alas, still needs to be implemented as an OCRParser. Since Tess4J has some
bad deps, the only ways I can see to do this are to exec the tesseract process
or to write my own JNI integration for Tesseract. The latter isn't likely to
happen. The former feels less than desirable, but would work.
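To make the exec option concrete, here is a rough sketch of what shelling out
to Tesseract could look like. It assumes a tesseract binary on the PATH and
the usual "tesseract <image> <outputbase> -l <lang>" command line, which
writes <outputbase>.txt. The class and method names are hypothetical, and this
is not wired into the OCRParser from the patch:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not part of the TIKA-93 patch.
final class TesseractExec {

    /** Builds "tesseract <image> <outputBase> -l <language>"; tesseract appends ".txt". */
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    /** Runs tesseract on one image and returns the recognized text. */
    static String ocr(Path image, String language, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("tika-ocr");
        try {
            Path outputBase = workDir.resolve("out");
            // Send the process output to a file so a full pipe buffer cannot block it.
            Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                    .redirectErrorStream(true)
                    .redirectOutput(workDir.resolve("tesseract.log").toFile())
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("tesseract timed out after " + timeoutSeconds + "s");
            }
            if (process.exitValue() != 0) {
                throw new IOException("tesseract exited with code " + process.exitValue());
            }
            return new String(Files.readAllBytes(workDir.resolve("out.txt")), StandardCharsets.UTF_8);
        } finally {
            // Best-effort cleanup of the temporary working directory.
            File[] leftovers = workDir.toFile().listFiles();
            if (leftovers != null) {
                for (File f : leftovers) {
                    f.delete();
                }
            }
            workDir.toFile().delete();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: TesseractExec <image>");
            return;
        }
        System.out.println(ocr(Paths.get(args[0]), "eng", 60));
    }
}
```

The timeout and the temp directory are the parts that make this feel less than
desirable: every image costs a process launch plus a round trip through the
file system.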
> OCR support
> -----------
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Chris A. Mattmann
> Priority: Minor
> Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are
> command line OCR tools like Tesseract
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.