Re: [jira] [Commented] (TIKA-93) OCR support

Oleg Tikhonov Thu, 29 May 2014 13:27:35 -0700

Guys,
Tesseract is a separate project, written in C/C++, and it has to be
compiled separately for each platform.
Personally, I would make Tesseract a prerequisite for those who want to
work with it. I am not sure that putting Tesseract in the sources is the
right way to go.
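
To illustrate the "prerequisite" idea, here is a minimal sketch of how the
Java side could check that a Tesseract binary is installed before trying to
use it. This is only an assumption about how it might look, not code from
any of the attached patches: ExternalToolCheck and isAvailable are
hypothetical names, and the `--version` flag and zero exit status are
assumptions about the installed Tesseract build (older builds may only
accept `-v`).

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of Tika: probes for an external command. */
final class ExternalToolCheck {

    /**
     * Returns true if the command can be started and exits with status 0
     * within five seconds; returns false in every other case.
     */
    static boolean isAvailable(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so the child never blocks on a full pipe (Java 9+).
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(5, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // the probe hung, treat as unavailable
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            return false; // binary not found on the PATH, or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the installed binary is called "tesseract" and accepts --version.
        System.out.println("tesseract available: " + isAvailable("tesseract", "--version"));
    }
}
```

With something like this, OCR could be skipped quietly on machines where
Tesseract is not installed, instead of shipping native code in the sources.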


As for how good Tesseract is: that depends at least on the trained data and
on the quality of the input images. There is no simple answer.

BR,
Oleg


On Thu, May 29, 2014 at 11:07 PM, Luis Filipe Nassif (JIRA) <[email protected]
> wrote:

>
>     [
> https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012810#comment-14012810]
>
> Luis Filipe Nassif commented on TIKA-93:
> ----------------------------------------
>
> Thank you very much [~tpalsulich] for including unit tests! We could also
> include tests for normal images (not embedded).
>
> There is a simple timeout control that throws a TikaException with a
> specific message if the timeout is reached. The idea behind requiring a
> TesseractOCRConfig object in parseContext to run OCR is to avoid affecting
> users who do not want OCR, precisely because it can take seconds or even
> minutes. So TesseractOCRParser can be included in the Tika parser list by
> default with no problem. We could also include a warning about OCR slowness
> in the class description.
>
> I have no idea how to include Tesseract in the sources. Maybe Tika
> committers can help with this?
>
> > OCR support
> > -----------
> >
> >                 Key: TIKA-93
> >                 URL: https://issues.apache.org/jira/browse/TIKA-93
> >             Project: Tika
> >          Issue Type: New Feature
> >          Components: parser
> >            Reporter: Jukka Zitting
> >            Assignee: Chris A. Mattmann
> >            Priority: Minor
> >             Fix For: 1.6
> >
> >         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch,
> TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
> >
> >
> > I don't know of any decent open source pure Java OCR libraries, but
> there are command line OCR tools like Tesseract (
> http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>
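
For readers of the quoted comment: the opt-in and timeout behaviour Luis
describes could be sketched roughly as below. This is a self-contained
illustration, not the code in the attached patches. OcrConfig, Context and
OcrTimeoutException are hypothetical stand-ins for TesseractOCRConfig,
ParseContext and TikaException, and the OCR work itself is passed in as a
plain Callable rather than a real Tesseract invocation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class OcrOptInSketch {

    /** Hypothetical stand-in for TesseractOCRConfig; it only carries a timeout. */
    static final class OcrConfig {
        final long timeoutMillis;
        OcrConfig(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }
    }

    /** Hypothetical stand-in for ParseContext: a type-keyed bag of settings. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();
        <T> void set(Class<T> key, T value) { entries.put(key, value); }
        <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
    }

    /** Unchecked stand-in for the TikaException mentioned in the comment. */
    static final class OcrTimeoutException extends RuntimeException {
        OcrTimeoutException(String message) { super(message); }
    }

    /** Runs ocrTask only if the caller opted in by putting an OcrConfig in the context. */
    static String extractText(Context context, Callable<String> ocrTask) {
        OcrConfig config = context.get(OcrConfig.class);
        if (config == null) {
            return ""; // no config in the context: OCR is skipped entirely
        }
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> result = executor.submit(ocrTask);
            return result.get(config.timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new OcrTimeoutException(
                    "OCR timed out after " + config.timeoutMillis + " ms");
        } catch (ExecutionException e) {
            throw new IllegalStateException("OCR task failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting for OCR", e);
        } finally {
            executor.shutdownNow(); // interrupts a task that is still running
        }
    }
}
```

The point of the design is that a user who never sets the config object pays
nothing for OCR, while a user who does set it gets a bounded wait and a
clear error message instead of a parse that hangs for minutes.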
