Re: [jira] [Commented] (TIKA-93) OCR support

Oleg Tikhonov Mon, 10 Feb 2014 02:43:46 -0800

@Timo,
On the other hand, this Parser can serve as a Composite for more
complicated parsers.
For example, with DejaVu you can "extract" the images, parse them one by one,
and then simply append the extracted text.
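
A rough sketch of that composite idea, in Python for brevity. The names
parse_composite and ocr are hypothetical and are not part of Tika or of the
attached patch; the OCR step is passed in as a plain callable, so any engine
(for instance a wrapper around the Tesseract command line) could be plugged in.

```python
from typing import Callable, Iterable


def parse_composite(images: Iterable[bytes], ocr: Callable[[bytes], str]) -> str:
    """Run OCR on each extracted image, in order, and append the results.

    `images` are the raw images already pulled out of the container
    document; `ocr` is a hypothetical callable that turns one image into text.
    """
    parts = []
    for image in images:
        text = ocr(image).strip()
        if text:  # skip images that produced no text at all
            parts.append(text)
    return "\n".join(parts)
```

The container parser would only be responsible for extracting the images;
the per-image parsing and the final concatenation stay the same for every
container format.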



BR,
Oleg


On Mon, Feb 10, 2014 at 11:09 AM, Timo Boehme (JIRA) <[email protected]> wrote:

>
>     [
> https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13896339#comment-13896339]
>
> Timo Boehme commented on TIKA-93:
> ---------------------------------
>
> I would like to comment on the detection and handling of image-based
> PDFs, because the proposed solution will only work with a subset of these
> kinds of documents. First, one could classify image-based PDFs into 3
> classes:
> # image only (one image per page)
> # image with text overlay/underlay already produced by an OCR process
> # multiple images per page (instead of one full page image there are
> images per word/line/paragraph)
>
> Thus, from only testing for a page-size image one does not know whether we
> nevertheless have parseable text or whether we have a class 3 document (in
> the case of e.g. journals we might even have a full-page background image).
> For an automatic classification one would need to first try to parse text
> in the standard way for a few pages. One should not expect image-only PDFs
> to contain no text: in some cases header/footer/page numbers are added as
> text whereas the other content is only an image. A heuristic threshold is
> 60-80 characters per page, below which we can assume we have an image PDF.
> If a PDF is assumed to be an image PDF, the pages should be 'printed' into
> an image (in order to also handle class 3 documents and to keep mixed data
> (image + text)) and this image should be processed by OCR.
>
> Best,
> Timo
>
> > OCR support
> > -----------
> >
> >                 Key: TIKA-93
> >                 URL: https://issues.apache.org/jira/browse/TIKA-93
> >             Project: Tika
> >          Issue Type: New Feature
> >          Components: parser
> >            Reporter: Jukka Zitting
> >            Assignee: Chris A. Mattmann
> >            Priority: Minor
> >         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
> TIKA-93.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
> >
> >
> > I don't know of any decent open source pure Java OCR libraries, but
> there are command line OCR tools like Tesseract (
> http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.1.5#6160)
>
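
The classification heuristic Timo describes could be sketched roughly as
follows, in Python for brevity. The function name looks_like_image_pdf is
hypothetical, and averaging the non-whitespace character count over the
sampled pages is an assumption; the comment only gives the 60-80 characters
per page threshold, not how to combine several pages.

```python
from typing import Sequence


def looks_like_image_pdf(page_texts: Sequence[str],
                         min_chars_per_page: int = 60) -> bool:
    """Guess whether a PDF is image-based from text parsed off a few pages.

    `page_texts` holds the text extracted "in the standard way" from a
    sample of pages. Averaging across the sample is an assumption.
    """
    if not page_texts:
        raise ValueError("need the text of at least one sampled page")
    # Count non-whitespace characters only, so that layout padding does not
    # make a nearly empty page look like it contains real text.
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page
```

A page that only carries a header or a page number stays well below the
threshold, so such a document would still be rendered to an image and sent
to OCR.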
