[jira] [Commented] (TIKA-93) OCR support

Timo Boehme (JIRA) Fri, 28 Mar 2014 01:28:30 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950487#comment-13950487
 ]


Timo Boehme commented on TIKA-93:
---------------------------------

Hi Anurag, which PDF are you referring to? Without knowing the size, page 
count, and structure of the pages, it is hard to say what is going wrong. For 
instance, as I wrote in my last comment, the pages could contain a large number 
of images (e.g. one per word or chunk) instead of a single image per page. Try 
printing the PDF to images (one per page) and running those through Tesseract.
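
A rough sketch of that check, in case it helps. It assumes Poppler's pdftoppm 
is used for the page rendering (any tool that writes one image per page would 
do) and that the tesseract binary is on the PATH. The function names, the 300 
DPI setting and the "eng" language are illustrative choices, not anything taken 
from the Tika patch.

```python
import subprocess
from pathlib import Path


def render_command(pdf_path, out_prefix, dpi=300):
    # Assumption: pdftoppm (Poppler) does the rendering. It writes one PNG
    # per page, named <out_prefix>-<page number>.png.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_command(image_path, out_base, lang="eng"):
    # tesseract <image> <outputbase> -l <lang> writes <outputbase>.txt
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def ocr_pdf_by_page(pdf_path, work_dir, dpi=300, lang="eng"):
    """Render each page to a single image, OCR it, return one text per page."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / "page"
    subprocess.run(render_command(pdf_path, prefix, dpi), check=True)

    texts = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(work_dir.glob("page-*.png")):
        out_base = image.with_suffix("")
        subprocess.run(ocr_command(image, out_base, lang), check=True)
        texts.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
    return texts
```

If the per-page run returns quickly and gives sensible text, the problem is 
more likely in how the images are embedded in the PDF than in Tesseract itself.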

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
> testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
