RE: Extracting Text from embedded images in PDF docs

Allison, Timothy B. Fri, 19 May 2017 08:17:20 -0700

Yes, well, sorry.  I’m thrilled someone is using it!

I tried to document that here:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29


See the OCR section.

And there’s a link to that page from https://wiki.apache.org/tika/TikaOCR (See 
OCR on PDFs)

How can we improve the documentation so that you don’t waste an hour?

From: David Pilato [mailto:da...@pilato.fr]
Sent: Friday, May 19, 2017 5:55 AM
To: user@tika.apache.org
Subject: Re: Extracting Text from embedded images in PDF docs

Got it working. In case someone else hits the same issue, here is my config 
file... Well... That was obvious :D


<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="ocrStrategy" type="string">ocr_and_text</param>
            </params>
        </parser>
    </parsers>
</properties>
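
For anyone who wants to double-check which params a config file like the one above declares before handing it to TikaConfig, here is a minimal sketch that uses only the JDK's XML parser. The class name TikaConfigCheck and the method paramsFor are made up for illustration. It only reads the XML; it cannot tell you whether Tika actually applied the settings.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: lists the <param> entries
// declared for one <parser class="..."> element in a tika-config.xml.
public class TikaConfigCheck {

    public static Map<String, String> paramsFor(String xml, String parserClass) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Map<String, String> params = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            // Skip parser entries for other classes (e.g. DefaultParser).
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue;
            }
            NodeList paramNodes = parser.getElementsByTagName("param");
            for (int j = 0; j < paramNodes.getLength(); j++) {
                Element param = (Element) paramNodes.item(j);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        // Same content as the config file above, inlined to keep the sketch self-contained.
        String xml =
              "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<properties>\n"
            + "  <parsers>\n"
            + "    <parser class=\"org.apache.tika.parser.DefaultParser\"/>\n"
            + "    <parser class=\"org.apache.tika.parser.pdf.PDFParser\">\n"
            + "      <params>\n"
            + "        <param name=\"ocrStrategy\" type=\"string\">ocr_and_text</param>\n"
            + "      </params>\n"
            + "    </parser>\n"
            + "  </parsers>\n"
            + "</properties>\n";

        // Prints {ocrStrategy=ocr_and_text}
        System.out.println(paramsFor(xml, "org.apache.tika.parser.pdf.PDFParser"));
    }
}
```

Fed the first config from this thread instead, the same call would list extractInlineImages rather than ocrStrategy, which makes the difference between the two files easy to spot.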


David

On 19 May 2017, at 10:59, David Pilato <da...@pilato.fr> wrote:

So I saw in debug mode that indeed config.getExtractInlineImages() is false so 
I'm going to check my config.

:D

David

On 18 May 2017, at 22:18, David Pilato <da...@pilato.fr> wrote:

Hey guys


First post here ;)

I'm trying to play with OCR with Tika. I installed Tesseract and I can extract 
text from a PNG image.
I created a PDF document with this image embedded and I'm trying now to extract 
the text out of it.

I added this configuration but I guess I'm doing it wrong:


<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>

I'm creating my Tika instance with something like:


TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
detector = config.getDetector();
parser = new AutoDetectParser(config);

tika = new Tika(detector, parser);

Any ideas? I have a feeling that my XML config is wrong, but I can't find the 
right syntax.

Thanks for your help guys!
David
