Re: Extracting Text from embedded images in PDF docs

Chris Mattmann Fri, 19 May 2017 07:20:59 -0700

Great work, David!


Thank you. If you get a chance please add the below to the wiki for TikaOCR [1].

 

Thanks,

Chris

 

[1] http://wiki.apache.org/tika/TikaOCR 

 

 

 

From: David Pilato <da...@pilato.fr>
Reply-To: "user@tika.apache.org" <user@tika.apache.org>
Date: Friday, May 19, 2017 at 2:55 AM
To: "user@tika.apache.org" <user@tika.apache.org>
Subject: Re: Extracting Text from embedded images in PDF docs

 

Got it working. In case someone else hits the same issue, here is my config 
file, which sets ocrStrategy to ocr_and_text on the PDFParser. Well... that was obvious :D 

 

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="ocrStrategy" type="string">ocr_and_text</param>
            </params>
        </parser>
    </parsers>
</properties>
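
For anyone who suspects their own tika-config.xml is not declaring what they think it is, here is a minimal sketch that lists the params declared for a given parser class. It uses only the JDK XML parser and does not call Tika at all, so it only shows what the file says, not what Tika ends up applying. The class name TikaConfigCheck and the method paramsFor are hypothetical names made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: it only reads the XML text.
final class TikaConfigCheck {

    // Returns name -> value for every <param> found under a <parser>
    // element whose class attribute equals parserClass.
    static Map<String, String> paramsFor(String xml, String parserClass) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> params = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue; // some other parser entry, skip it
            }
            NodeList paramNodes = parser.getElementsByTagName("param");
            for (int j = 0; j < paramNodes.getLength(); j++) {
                Element param = (Element) paramNodes.item(j);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        // Same content as the config above, inlined so the sketch is self-contained.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\"/>"
                + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\">"
                + "<params><param name=\"ocrStrategy\" type=\"string\">ocr_and_text</param></params>"
                + "</parser>"
                + "</parsers></properties>";
        // prints {ocrStrategy=ocr_and_text}
        System.out.println(paramsFor(xml, "org.apache.tika.parser.pdf.PDFParser"));
    }
}
```

An empty map for the PDFParser class would mean the param never made it into the file, which is worth ruling out before debugging on the Tika side.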
 


David 

 

On 19 May 2017 at 10:59, David Pilato <da...@pilato.fr> wrote:

 

In debug mode I saw that config.getExtractInlineImages() is indeed false, so 
I'm going to check my config. 

 

:D


David 

 

On 18 May 2017 at 22:18, David Pilato <da...@pilato.fr> wrote:

 

Hey guys 

 

 

First post here ;)

 

I'm trying to play with OCR with Tika. I installed Tesseract and I can extract 
text from a PNG image.

I created a PDF document with this image embedded and I'm trying now to extract 
the text out of it.

 

I added this configuration but I guess I'm doing it wrong:

 

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
 

I'm creating my Tika instance with something like:

 

TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
detector = config.getDetector();
parser = new AutoDetectParser(config);
tika = new Tika(detector, parser);
 

Any ideas? I suspect my XML config is wrong, but I can't find the right 
syntax.

 

Thanks for your help guys!
David
