Hi Nick,

It took some time, but I glued it all together, so it now works without modifying the Tika sources, using only a custom handler, extractor and parser. It works with WordExtractor, although it looks like a dirty hack: since I could not override the behavior of WordExtractor, my handler ignores <img> elements whose src is "embedded:xxx" and lets through only images whose src is a data URI.
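To illustrate the kind of filtering described above, here is a minimal sketch using only the JDK's SAX classes. It is not the code from the attachment: the class name EmbeddedImgFilter is hypothetical, and it assumes the rule is simply "drop any <img> whose src starts with embedded:, forward everything else".

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/**
 * Hypothetical wrapper (not part of Tika): forwards all SAX events to the
 * downstream handler, except <img> elements whose src starts with "embedded:".
 */
class EmbeddedImgFilter extends XMLFilterImpl {
    // Nesting depth inside a suppressed <img>; while > 0, events are dropped.
    private int skipDepth = 0;

    EmbeddedImgFilter(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0) {
            skipDepth++;            // child of a suppressed element
            return;
        }
        if ("img".equals(localName) || "img".equals(qName)) {
            String src = atts.getValue("src");
            if (src != null && src.startsWith("embedded:")) {
                skipDepth = 1;      // start suppressing this element
                return;
            }
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;            // closing a suppressed element
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }
}
```

In a Tika setup, a wrapper like this would sit around the ContentHandler passed to the parser, so the data URI images emitted by the custom extractor pass through while the "embedded:" placeholders are dropped.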
The problem is that it does not work at all with OOXMLParser, PDFParser, and probably others. In the code of these parsers I could not find any recursive handling of embedded images similar to the call to handleEmbeddedResource in WordExtractor.handlePictureCharacterRun.

So my questions are:

1. Do my handler, parser and extractor do what you meant?
2. Did I miss the call to ParsingEmbeddedDocumentExtractor in OOXMLParser? I found the img-generating code in XWPFWordExtractorDecorator, but it is deep in a call tree of private functions, and XWPFWordExtractorDecorator is pretty much hardwired to OOXMLParser via OOXMLExtractorFactory, so I did not see an easy way to inject my code.

Thank you very much.
Andrew.

On Wed, Jun 25, 2014 at 12:39 PM, Nick Burch <apa...@gagravarr.org> wrote:
> On Wed, 25 Jun 2014, Andrew Skiba wrote:
>
>> Let me check I understand you right. WordExtractor will continue to create
>> <img src="embedded:filename.jpg"/>
>
> Yes, as will (should..) the other parsers which find embedded resources
>
>> and call the ImageParser once for every file name.
>
> No. It'll call your code, as you'll have registered your code as the
> EmbeddedDocumentExtractor to call for embedded resources like images.
>
> (If there isn't one, then a ParsingEmbeddedDocumentExtractor is used,
> which calls the default parser, which is how it ends up in ImageParser if
> you're recursing)
>
> Nick
dataUri.tar.gz
Description: GNU Zip compressed data