Note that there are two entirely different methods for PDFs:

1) Extract inline images and let Tika’s usual image parsers handle them (including Tesseract).
2) Render each page and then run Tesseract on that page. This has the overhead of generating a single image per page, _but_ it will correctly stitch together the potentially hundreds of images that are used to render a page (very rare, only in edge cases).

It looks like you started with 1) and didn’t have luck…that troubles me. And then you went with 2). Did you ever get 1) working?

As may be obvious, we are only at the very beginning of integrating OCR with PDFs. We’d like to add a strategy that applies OCR on a given page if, say, < 10 words are extracted from the text…WDYT?

From: David Pilato [mailto:da...@pilato.fr]
Sent: Friday, May 19, 2017 5:55 AM
To: user@tika.apache.org
Subject: Re: Extracting Text from embedded images in PDF docs

Got it working. In case someone else hits the same issue, here is my config file... Well... That was obvious :D

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>
  </parsers>
</properties>

David

On 19 May 2017 at 10:59, David Pilato <da...@pilato.fr> wrote:

So I saw in debug mode that indeed config.getExtractInlineImages() is false, so I'm going to check my config. :D

David

On 18 May 2017 at 22:18, David Pilato <da...@pilato.fr> wrote:

Hey guys

First post here ;)

I'm trying to play with OCR with Tika. I installed Tesseract and I can extract text from a PNG image. I created a PDF document with this image embedded and I'm now trying to extract the text out of it.
I added this configuration but I guess I'm doing it wrong:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>

I'm creating my Tika instance with something like:

TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
detector = config.getDetector();
parser = new AutoDetectParser(config);
tika = new Tika(detector, parser);

Any idea? I'm feeling that my xml config is wrong but can't find what should be the right syntax.

Thanks for your help guys!

David
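
The page-level fallback floated earlier in this thread (apply OCR to a page when fewer than about 10 words come out of normal text extraction) could be sketched roughly as below. This only illustrates the decision rule and is not how Tika implements anything: `OcrFallbackSketch`, `countWords`, `needsOcr` and `pagesNeedingOcr` are hypothetical names, the per-page text is assumed to have been extracted already by some other means, and counting words as whitespace-separated tokens is an assumption too.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper illustrating the "OCR if too few words" idea; not a Tika API. */
final class OcrFallbackSketch {

    /** Threshold taken from the "< 10 words" figure suggested in the thread. */
    static final int DEFAULT_MIN_WORDS = 10;

    /** Counts whitespace-separated tokens in the text extracted from one page. */
    static int countWords(String pageText) {
        if (pageText == null) {
            return 0;
        }
        String trimmed = pageText.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    /** True when the page yielded fewer than minWords words, so OCR is worth trying. */
    static boolean needsOcr(String pageText, int minWords) {
        return countWords(pageText) < minWords;
    }

    /** Returns the zero-based indexes of the pages that would be sent to OCR. */
    static List<Integer> pagesNeedingOcr(List<String> pageTexts, int minWords) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (needsOcr(pageTexts.get(i), minWords)) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up page texts: an empty scanned page and a page with real text.
        List<String> pages = Arrays.asList(
                "",
                "This page already carries a perfectly ordinary text layer with many words.");
        System.out.println(pagesNeedingOcr(pages, DEFAULT_MIN_WORDS));
    }
}
```

A real implementation would also have to decide what to do with pages that mix a short caption with a large scanned image, which a plain word count cannot distinguish from a nearly blank page.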