Yes, well, sorry. I’m thrilled someone is using it! I tried to document that here: https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29
See the OCR section. And there’s a link to that page from https://wiki.apache.org/tika/TikaOCR (see "OCR on PDFs").

How can we improve the documentation so that you don’t waste an hour?

From: David Pilato [mailto:da...@pilato.fr]
Sent: Friday, May 19, 2017 5:55 AM
To: user@tika.apache.org
Subject: Re: Extracting Text from embedded images in PDF docs

Got it working. In case someone else hits the same issue, here is my config file... Well... That was obvious :D

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
      <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
          <params>
            <param name="ocrStrategy" type="string">ocr_and_text</param>
          </params>
        </parser>
      </parsers>
    </properties>

David

On May 19, 2017, at 10:59, David Pilato <da...@pilato.fr> wrote:

So I saw in debug mode that config.getExtractInlineImages() is indeed false, so I'm going to check my config. :D

David

On May 18, 2017, at 22:18, David Pilato <da...@pilato.fr> wrote:

Hey guys

First post here ;)

I'm trying to play with OCR with Tika. I installed Tesseract and I can extract text from a PNG image. I created a PDF document with this image embedded and I'm now trying to extract the text out of it.

I added this configuration but I guess I'm doing it wrong:

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
      <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
          <params>
            <param name="extractInlineImages" type="bool">true</param>
          </params>
        </parser>
      </parsers>
    </properties>

I'm creating my Tika instance with something like:

    TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
    detector = config.getDetector();
    parser = new AutoDetectParser(config);
    tika = new Tika(detector, parser);

Any idea? I have a feeling that my XML config is wrong, but I can't find the right syntax.
Thanks for your help, guys!

David
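
Since the problem in this thread came down to which <param> entries the PDFParser element actually carried, a quick sanity check on the config file can help before involving Tika at all. The following is only a sketch using the JDK's XML parser: the class TikaConfigCheck and the method parserParams are hypothetical names invented for illustration and are not part of Tika. It does not validate the file the way Tika would; it only lists what is declared for a given parser class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TikaConfigCheck {

    // Hypothetical helper (not a Tika API): collects the name/value pairs of
    // every <param> declared under <parser class="parserClass"> in the XML.
    public static Map<String, String> parserParams(String xml, String parserClass) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Map<String, String> params = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue; // skip other parsers such as DefaultParser
            }
            NodeList declared = parser.getElementsByTagName("param");
            for (int j = 0; j < declared.getLength(); j++) {
                Element param = (Element) declared.item(j);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        // The working config from this thread, inlined so the sketch is self-contained.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\"/>"
                + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\"><params>"
                + "<param name=\"ocrStrategy\" type=\"string\">ocr_and_text</param>"
                + "</params></parser>"
                + "</parsers></properties>";

        Map<String, String> params = parserParams(xml, "org.apache.tika.parser.pdf.PDFParser");
        System.out.println("PDFParser params: " + params);
        if (!params.containsKey("ocrStrategy")) {
            System.out.println("No ocrStrategy declared for PDFParser");
        }
    }
}
```

Run against the first config in this thread, the same helper would report only extractInlineImages for PDFParser, which makes the difference between the two files easy to spot.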