[jira] [Commented] (TIKA-93) OCR support

Grant Ingersoll (JIRA) Sat, 08 Feb 2014 12:17:56 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895718#comment-13895718
 ]


Grant Ingersoll commented on TIKA-93:
-------------------------------------

bq. what is the dependency on jacoco in tika-parent? That stuff seems 
orthogonal to the patch.

I put that in so that I can measure whether I am testing sufficiently.  I can 
separate it out to a different patch.

bq. dependency on custom external Maven repo – myGrid – any way to get the jar 
from the Central repo somewhere? we have made an effort in Tika to remove any 
specific deps on external repositories

We could make that one optional.  All it does is add support for TIFF and a few 
other file formats that aren't part of the standard ImageIO.
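
For illustration only, here is a minimal sketch of how a parser could treat that extra format support as optional: ask ImageIO at runtime whether any reader is registered for a format, and skip the format if none is. The class and method names (ImageFormatSupportSketch, canRead) are hypothetical and are not taken from the patch.

```java
import javax.imageio.ImageIO;

// Hypothetical helper, not part of Tika or of the attached patch.
class ImageFormatSupportSketch {

    // Returns true if any ImageIO reader (built in or supplied by a plugin
    // jar on the classpath) is registered for the given format name.
    static boolean canRead(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // Whether "tiff" is readable depends on the JDK version and on which
        // plugin jars are present, so no particular output is assumed here.
        for (String format : new String[] {"png", "jpeg", "tiff"}) {
            System.out.println(format + " readable: " + canRead(format));
        }
    }
}
```

With a check like this, the external jar would only widen the set of formats that get OCR'd; its absence would not break the build or the parser.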

bq.  in my CS572 class on Search Engines where we look at FBI Vault PDF files!  
http://www-scf.usc.edu/~csci572/

I read the abstract for your talk and checked out the Vault, and I thought it 
would be cool, too.  The main issue is that JavaOCR needs to be trained before 
it will work with that data set.  Tesseract, on the other hand, works for it, 
but alas, it still needs to be implemented as an OCRParser.  Since Tess4J has 
some bad deps, the only ways I can see to do this are to exec the Tesseract 
process or to write my own JNI integration for Tesseract.  The latter isn't 
likely to happen.  The former feels less than desirable, but it would work.
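
A rough sketch of the exec approach, for discussion only.  It assumes a tesseract binary is on the PATH and uses its usual command-line form (tesseract <image> <outputbase> -l <lang>), which writes the recognized text to <outputbase>.txt.  The class and method names (TesseractExecSketch, buildCommand, ocr) are hypothetical; this is not the OCRParser from the patch and it does not use any Tika API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika or of the attached patch.
class TesseractExecSketch {

    // Builds the argument list: tesseract <image> <outputBase> -l <lang>
    static List<String> buildCommand(Path image, Path outputBase, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");          // assumption: binary is on the PATH
        cmd.add(image.toString());
        cmd.add(outputBase.toString());
        cmd.add("-l");
        cmd.add(lang);
        return cmd;
    }

    // Runs tesseract on one image and returns the recognized text.
    // Throws IOException if the binary cannot be started or the run fails.
    static String ocr(Path image, String lang, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("tika-ocr", "");
        Path textFile = Paths.get(outputBase.toString() + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, lang))
                    .redirectErrorStream(true)
                    // Inherit output so a full pipe buffer cannot block the child.
                    .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("tesseract timed out after " + timeoutSeconds + "s");
            }
            if (process.exitValue() != 0) {
                throw new IOException("tesseract exited with code " + process.exitValue());
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Clean up both the placeholder temp file and tesseract's output.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: TesseractExecSketch <image> [lang]");
            return;
        }
        String lang = args.length > 1 ? args[1] : "eng";
        System.out.print(ocr(Paths.get(args[0]), lang, 60));
    }
}
```

The timeout and the temp-file cleanup are the parts that make exec feel less than desirable: the parser has to guard against a hung or missing external process, which an in-process library would not require.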

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
