RE: Extract PDF inline images
Andrea,

I’m about to commit an example (see TIKA-1674). In about 10 minutes, look for org.apache.tika.example.ExtractEmbeddedFiles in the tika-examples module.

I’m still a bit stumped though on why my example isn’t working recursively. It is only pulling out the children of the input document. Stay tuned to TIKA-1674 for follow-up on that.

Best,

Tim

From: Andrea Asta [mailto:asta.and...@gmail.com]
Sent: Tuesday, July 07, 2015 6:22 AM
To: user@tika.apache.org
Subject: Re: Extract PDF inline images

Hi Tim,

thanks for your response, but I can't find a complete solution. I've created a class using the same FileEmbeddedDocumentExtractor from TikaCLI, and now I'm trying to do a sample main program with a PDF containing some images.

This is my code, but I can't have any image stored and the methods of DocumentExtractor are never called using debugger.

Thanks
Andrea

    RecursiveParserWrapper parser = new RecursiveParserWrapper(
            new AutoDetectParser(),
            new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
    );
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    FileEmbeddedDocumentExtractor extractor = new FileEmbeddedDocumentExtractor();
    context.set(FileEmbeddedDocumentExtractor.class, extractor);
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    config.setExtractUniqueInlineImagesOnly(true);
    context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
    context.set(org.apache.tika.parser.Parser.class, new AutoDetectParser());

    InputStream is = PdfRecursiveExample.class.getResourceAsStream("/my.PDF");
    ToXMLContentHandler handler = new ToXMLContentHandler(
            new FileOutputStream(new File("out.txt")), "UTF-8");
    parser.parse(is, handler, metadata, context);
Re: Extract PDF inline images
Hi Tim,

thanks for your response, but I can't find a complete solution. I've created a class using the same FileEmbeddedDocumentExtractor from TikaCLI, and now I'm trying to do a sample main program with a PDF containing some images.

This is my code, but I can't have any image stored and the methods of DocumentExtractor are never called using debugger.

Thanks
Andrea

    RecursiveParserWrapper parser = new RecursiveParserWrapper(
            new AutoDetectParser(),
            new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
    );
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    FileEmbeddedDocumentExtractor extractor = new FileEmbeddedDocumentExtractor();
    context.set(FileEmbeddedDocumentExtractor.class, extractor);
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    config.setExtractUniqueInlineImagesOnly(true);
    context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
    context.set(org.apache.tika.parser.Parser.class, new AutoDetectParser());

    InputStream is = PdfRecursiveExample.class.getResourceAsStream("/my.PDF");
    ToXMLContentHandler handler = new ToXMLContentHandler(
            new FileOutputStream(new File("out.txt")), "UTF-8");
    parser.parse(is, handler, metadata, context);

2015-07-06 12:59 GMT+02:00 Allison, Timothy B.:

> Hi Andrea,
>
> The RecursiveParserWrapper, as you found, is only for extracted content and
> metadata. It was designed to cache metadata and content from embedded
> documents so that you can easily keep those two things together for each
> embedded document.
>
> To extract the raw bytes from embedded files, try implementing an
> EmbeddedDocumentExtractor and passing that into the ParseContext. Take a
> look at
> http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
> and specifically the inner class MyEmbeddedDocumentExtractor for an
> example. As another example, look at
> http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java,
> and specifically the inner class FileEmbeddedDocumentExtractor.
>
> Basically, in parseEmbedded, just copy the InputStream to a
> FileOutputStream, and you should be good to go.
>
>     public boolean shouldParseEmbedded(Metadata metadata) {
>         return true;
>     }
>
>     public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler,
>             Metadata metadata, boolean b) throws SAXException, IOException {
>
> Best,
>
> Tim
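One possible reason the extractor methods are never called in the code above is the key used in context.set(FileEmbeddedDocumentExtractor.class, extractor). This assumes that ParseContext stores one object per exact class key and that parsers ask it for EmbeddedDocumentExtractor.class, so an entry registered under the concrete class would not be found. The sketch below imitates that lookup with stand-in types; Extractor, FileExtractor and MiniContext are hypothetical names for illustration, not Tika classes.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in types for illustration only; these are NOT Tika classes.
interface Extractor { }

class FileExtractor implements Extractor { }

/** Minimal imitation of a context that stores one object per class key. */
class MiniContext {
    private final Map<String, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        // The key is the class name, so only an identical class matches later.
        entries.put(key.getName(), value);
    }

    <T> T get(Class<T> key) {
        return key.cast(entries.get(key.getName()));
    }
}

class ContextKeyDemo {
    public static void main(String[] args) {
        MiniContext context = new MiniContext();
        FileExtractor extractor = new FileExtractor();

        // Registered under the concrete class: a lookup by interface finds nothing.
        context.set(FileExtractor.class, extractor);
        System.out.println(context.get(Extractor.class)); // prints "null"

        // Registered under the interface: the lookup now returns the extractor.
        context.set(Extractor.class, extractor);
        System.out.println(context.get(Extractor.class) == extractor); // prints "true"
    }
}
```

If the assumption holds, the corresponding change in the code above would be to register the extractor with context.set(EmbeddedDocumentExtractor.class, extractor).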
RE: Extract PDF inline images
Hi Andrea,

The RecursiveParserWrapper, as you found, is only for extracted content and metadata. It was designed to cache metadata and content from embedded documents so that you can easily keep those two things together for each embedded document.

To extract the raw bytes from embedded files, try implementing an EmbeddedDocumentExtractor and passing that into the ParseContext. Take a look at http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java and specifically the inner class MyEmbeddedDocumentExtractor for an example. As another example, look at http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java, and specifically the inner class FileEmbeddedDocumentExtractor.

Basically, in parseEmbedded, just copy the InputStream to a FileOutputStream, and you should be good to go.

    public boolean shouldParseEmbedded(Metadata metadata) {
        return true;
    }

    public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler,
            Metadata metadata, boolean b) throws SAXException, IOException {

Best,

Tim

From: Andrea Asta [mailto:asta.and...@gmail.com]
Sent: Monday, July 06, 2015 6:11 AM
To: user@tika.apache.org
Subject: Extract PDF inline images

Hello,

I'm trying to store the inline images from a PDF to a local folder, but can't find any valid example. I can only use the RecursiveParserWrapper to get all the available metadata, but not the binary image content.

This is my code:

    RecursiveParserWrapper parser = new RecursiveParserWrapper(
            new AutoDetectParser(),
            new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
    );
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    PDFParserConfig config = new PDFParserConfig();
    PDFParser p;
    config.setExtractInlineImages(true);
    config.setExtractUniqueInlineImagesOnly(false);
    context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
    context.set(org.apache.tika.parser.Parser.class, parser);

    InputStream is = PdfRecursiveExample.class.getResourceAsStream("/BA200PDE.PDF");
    //parsing the file
    ToXMLContentHandler handler = new ToXMLContentHandler(
            new FileOutputStream(new File("out.txt")), "UTF-8");
    parser.parse(is, handler, metadata, context);

How can I store each image file to a folder?

Thanks

Andrea
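The "copy the InputStream to a FileOutputStream" step that Tim describes could be sketched roughly as follows. This is a JDK-only illustration, not the code from TikaCLI or UnpackerResource: the class EmbeddedFileSaver, its save method, and the naming scheme are all hypothetical, and the Tika wiring is left out. The assumption is that a parseEmbedded implementation would pass its inputStream, plus whatever file name the metadata offers (possibly none for inline images), to a helper like this.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: writes each embedded stream to its own file in one folder. */
class EmbeddedFileSaver {
    private final Path outputDir;
    private int count = 0; // numbers the files so that names stay unique

    EmbeddedFileSaver(Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        this.outputDir = outputDir;
    }

    /**
     * Copies the stream into the output folder and returns the file written.
     * resourceName may be null or blank; a numbered fallback name is used then.
     */
    Path save(InputStream stream, String resourceName) throws IOException {
        count++;
        String name;
        if (resourceName == null || resourceName.trim().isEmpty()) {
            name = "embedded-" + count + ".bin";
        } else {
            // Replace path separators and other unusual characters so the
            // file cannot end up outside the output folder.
            name = count + "-" + resourceName.replaceAll("[^A-Za-z0-9._-]", "_");
        }
        Path target = outputDir.resolve(name);
        // The stream is read to its end but not closed here.
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("embedded-files");
        EmbeddedFileSaver saver = new EmbeddedFileSaver(dir);
        byte[] fakeImage = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF}; // placeholder bytes
        Path written = saver.save(new ByteArrayInputStream(fakeImage), "image0.jpg");
        System.out.println("Wrote " + Files.size(written) + " bytes to " + written);
    }
}
```

Numbering the files is one simple way to avoid collisions when several embedded resources share a name or have none; a real extractor might instead derive the extension from the detected content type.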