[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012525#comment-14012525
 ] 

Luis Filipe Nassif commented on TIKA-93:
----------------------------------------

That was not intentional: the patch should contain only one copy of each 
class. I will fix it, thank you. You can use an AutoDetectParser to process 
the PDF automatically, but you must tell Tika which parser to use for 
embedded files (e.g. images). If you only want to run OCR on embedded images:
{code}
parseContext.set(Parser.class, new TesseractOCRParser());
{code}
If you want to process any kind of embedded file:
{code}
parseContext.set(Parser.class, new AutoDetectParser());
{code}
Note that trunk currently does not extract images from PDF files by default 
(see [TIKA-1294]). Try turning extraction on with this code:
{code}
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
parseContext.set(PDFParserConfig.class, pdfConfig);
{code}
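Putting the pieces together, a minimal end-to-end sketch could look like the following. It assumes a Tika build that includes the TesseractOCRParser from this issue and a working Tesseract installation on the machine. The class name PdfOcrExample, the buildContext helper and its ocrOnly flag are hypothetical names used only for illustration.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRParser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; not part of Tika or of the TIKA-93 patch.
public class PdfOcrExample {

    // Builds a ParseContext that tells Tika how to handle embedded files.
    // ocrOnly = true  -> embedded files go straight to the OCR parser
    // ocrOnly = false -> embedded files are auto-detected and parsed
    static ParseContext buildContext(boolean ocrOnly) {
        ParseContext parseContext = new ParseContext();

        Parser embeddedParser = ocrOnly ? new TesseractOCRParser() : new AutoDetectParser();
        parseContext.set(Parser.class, embeddedParser);

        // Inline image extraction is off by default, so enable it explicitly.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        parseContext.set(PDFParserConfig.class, pdfConfig);

        return parseContext;
    }

    public static void main(String[] args) throws Exception {
        // The outer parser detects the container type (here, a PDF).
        Parser parser = new AutoDetectParser();
        // -1 disables the default write limit on extracted text.
        BodyContentHandler handler = new BodyContentHandler(-1);

        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, handler, new Metadata(), buildContext(false));
        }
        System.out.println(handler.toString());
    }
}
```

Run it with the path to a PDF as the only argument, for example the attached testOCR.pdf. Whether any OCR text appears depends on Tesseract being found on the system.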
Let me know if this helps.

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
> testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
