Great work, David!
Thank you. If you get a chance, please add the below to the wiki for TikaOCR [1].

Thanks,
Chris

[1] http://wiki.apache.org/tika/TikaOCR

From: David Pilato <da...@pilato.fr>
Reply-To: "user@tika.apache.org" <user@tika.apache.org>
Date: Friday, May 19, 2017 at 2:55 AM
To: "user@tika.apache.org" <user@tika.apache.org>
Subject: Re: Extracting Text from embedded images in PDF docs

Got it working. In case someone else hits the same issue, here is my config file... Well... That was obvious :D

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
      <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
          <params>
            <param name="ocrStrategy" type="string">ocr_and_text</param>
          </params>
        </parser>
      </parsers>
    </properties>

David

On May 19, 2017, at 10:59, David Pilato <da...@pilato.fr> wrote:

So I saw in debug mode that config.getExtractInlineImages() is indeed false, so I'm going to check my config. :D

David

On May 18, 2017, at 22:18, David Pilato <da...@pilato.fr> wrote:

Hey guys,

First post here ;)

I'm trying to play with OCR with Tika. I installed Tesseract and I can extract text from a PNG image. I created a PDF document with this image embedded, and I'm now trying to extract the text out of it.

I added this configuration, but I guess I'm doing it wrong:

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
      <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
          <params>
            <param name="extractInlineImages" type="bool">true</param>
          </params>
        </parser>
      </parsers>
    </properties>

I'm creating my Tika instance with something like:

    TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
    detector = config.getDetector();
    parser = new AutoDetectParser(config);
    tika = new Tika(detector, parser);

Any idea? I have a feeling that my XML config is wrong, but I can't find the right syntax.

Thanks for your help guys!

David
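
For anyone adapting the working config from this thread, a quick sanity check before handing the file to Tika is to confirm that it is well-formed and that the PDFParser entry carries the parameter you expect. The following is a minimal sketch using only the JDK's XML parser. The class name TikaConfigCheck and the helper parserParams are hypothetical names made up for this illustration, not part of Tika, and the check only reads the XML; it does not call Tika or prove that Tika honours the parameter.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a Tika class): inspects a tika-config.xml
// with the JDK DOM parser and lists the <param> entries of one parser.
class TikaConfigCheck {

    // The working configuration quoted in this thread.
    static final String WORKING_CONFIG =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<properties>\n"
        + "  <parsers>\n"
        + "    <parser class=\"org.apache.tika.parser.DefaultParser\"/>\n"
        + "    <parser class=\"org.apache.tika.parser.pdf.PDFParser\">\n"
        + "      <params>\n"
        + "        <param name=\"ocrStrategy\" type=\"string\">ocr_and_text</param>\n"
        + "      </params>\n"
        + "    </parser>\n"
        + "  </parsers>\n"
        + "</properties>\n";

    // Returns name -> value for every <param> under the <parser> element
    // whose class attribute equals parserClass (empty map if none).
    static Map<String, String> parserParams(String xml, String parserClass) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Map<String, String> params = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue; // a different parser entry
            }
            NodeList paramNodes = parser.getElementsByTagName("param");
            for (int j = 0; j < paramNodes.getLength(); j++) {
                Element param = (Element) paramNodes.item(j);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        // Prints: {ocrStrategy=ocr_and_text}
        System.out.println(
            parserParams(WORKING_CONFIG, "org.apache.tika.parser.pdf.PDFParser"));
    }
}
```

Running the same helper against the earlier, non-working file would show extractInlineImages instead of ocrStrategy, which makes the difference between the two configs easy to spot.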