RE: Extracting Text from embedded images in PDF docs

Allison, Timothy B. Fri, 19 May 2017 08:21:54 -0700

Note that there are two entirely different methods for PDFs.


1)      Extract inline images and let Tika’s usual image parsers handle them 
(including Tesseract).

2)      Render each page and then run Tesseract on that page.  (This has the 
overhead of generating a single image per page, _but_ it will correctly stitch 
together the potentially hundreds of images that are sometimes used to render a 
single page…that is very rare and only happens in edge cases.)

It looks like you started with 1 and didn’t have luck…that troubles me.  And 
then you went with 2.

Did you ever get 1) working?

As may be obvious, we are only at the very beginning of integrating OCR with 
PDFs.  We’d like to add a strategy that applies OCR on a given page if, say, 
fewer than 10 words are extracted from the text…WDYT?
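
To make the proposed fallback concrete, here is a minimal sketch of the per-page decision only. It is not Tika code: the class OcrFallbackSketch, its method names, and the way page text is passed in are all hypothetical, and the threshold of 10 is just the figure suggested above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating the proposed strategy; not part of Tika. */
final class OcrFallbackSketch {

    /** Assumed threshold, taken from the "10 words" figure suggested in the thread. */
    static final int MIN_WORDS = 10;

    /** Counts whitespace-separated tokens in the text extracted from one page. */
    static int countWords(String pageText) {
        if (pageText == null) {
            return 0;
        }
        String trimmed = pageText.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** True when the page yielded fewer than minWords words, so OCR is worth trying. */
    static boolean needsOcr(String pageText, int minWords) {
        return countWords(pageText) < minWords;
    }

    /** Returns the zero-based indexes of the pages that would be sent to OCR. */
    static List<Integer> pagesToOcr(List<String> pageTexts, int minWords) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (needsOcr(pageTexts.get(i), minWords)) {
                result.add(i);
            }
        }
        return result;
    }
}
```

With something like this, a page of ordinary extracted text would be left alone, while a scanned page that yields little or no text would be rendered and handed to Tesseract.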

From: David Pilato [mailto:da...@pilato.fr]
Sent: Friday, May 19, 2017 5:55 AM
To: user@tika.apache.org
Subject: Re: Extracting Text from embedded images in PDF docs

Got it working. In case someone else hits the same issue, here is my config 
file... Well... That was obvious :D


<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="ocrStrategy" type="string">ocr_and_text</param>
            </params>
        </parser>
    </parsers>
</properties>


David

On 19 May 2017 at 10:59, David Pilato <da...@pilato.fr> wrote:

So I saw in debug mode that indeed config.getExtractInlineImages() is false so 
I'm going to check my config.

:D

David

On 18 May 2017 at 22:18, David Pilato <da...@pilato.fr> wrote:

Hey guys


First post here ;)

I'm trying to play with OCR with Tika. I installed Tesseract and I can extract 
text from a PNG image.
I created a PDF document with this image embedded and I'm trying now to extract 
the text out of it.

I added this configuration but I guess I'm doing it wrong:


<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>

I'm creating my Tika instance with something like:


TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
detector = config.getDetector();
parser = new AutoDetectParser(config);

tika = new Tika(detector, parser);

Any ideas? I have a feeling that my XML config is wrong, but I can't find what 
the right syntax should be.

Thanks for your help guys!
David
