Nobody has replied in 4 days. I see the context of the message was lost: it is about https://issues.apache.org/jira/browse/TIKA-1344
On Thu, Jul 10, 2014 at 6:13 PM, Andrew Skiba <and...@tikalk.com> wrote:

> Hi Nick,
>
> It took some time, but I glued it all together, so now it works without
> modifying Tika sources, only by using a custom handler, extractor and parser.
> It works with WordExtractor, although it looks like a dirty hack. As I
> could not override the behavior of WordExtractor, in the handler I ignore
> <img> elements if the src is "embedded:xxx", and let through only images
> whose src is a data URI.
>
> The problem is that it does not work at all with OOXMLParser, PDFParser, and
> probably others. I could not find in the code of these parsers any recursive
> handling of the embedded images similar to the call to
> handleEmbeddedResource in WordExtractor.handlePictureCharacterRun
>
> So my questions are:
>
> 1. Do my handler, parser and extractor do what you meant?
> 2. Did I miss the call to ParsingEmbeddedDocumentExtractor in OOXMLParser?
> I found the img-generating code in XWPFWordExtractorDecorator, but the code is
> deep in a call tree of private functions, and XWPFWordExtractorDecorator is
> pretty much hardwired to OOXMLParser via OOXMLExtractorFactory, so I did
> not see an easy way to inject my code.
>
> Thank you very much.
>
> Andrew.
>
>
> On Wed, Jun 25, 2014 at 12:39 PM, Nick Burch <apa...@gagravarr.org> wrote:
>
>> On Wed, 25 Jun 2014, Andrew Skiba wrote:
>>
>>> Let me check I understand you right. WordExtractor will continue to create
>>> <img src="embedded:filename.jpg"/>
>>
>> Yes, as will (should..) the other parsers which find embedded resources
>>
>>> and call the ImageParser once for every file name.
>>
>> No. It'll call your code, as you'll have registered your code as the
>> EmbeddedDocumentExtractor to call for embedded resources like images.
>>
>> (If there isn't one, then a ParsingEmbeddedDocumentExtractor is used,
>> which calls the default parser, which is how it ends up in ImageParser if
>> you're recursing)
>>
>> Nick
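
For anyone reading without the attachment, the handler-side filtering described in the quoted message can be sketched roughly as below. This is only an illustration using the plain SAX classes from the JDK, not the code from dataUri.tar.gz: the class name DataUriImageFilter and the helper keptSources are made up for this example, and it assumes the images arrive as <img> elements with a plain "src" attribute.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical SAX filter: drops <img src="embedded:..."> and forwards every other event. */
class DataUriImageFilter extends XMLFilterImpl {
    // True while we are inside an <img> whose start tag was dropped,
    // so that the matching end tag is dropped as well.
    private boolean droppedImg = false;

    private static boolean isImg(String localName, String qName) {
        // Fall back to qName when the event source is not namespace-aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return "img".equals(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (isImg(localName, qName)) {
            String src = atts.getValue("src");
            if (src != null && src.startsWith("embedded:")) {
                droppedImg = true;
                return; // swallow the placeholder image
            }
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (droppedImg && isImg(localName, qName)) {
            droppedImg = false;
            return; // swallow the end tag of the dropped image
        }
        super.endElement(uri, localName, qName);
    }

    /** Illustration helper: feeds one <img> per src through the filter, returns the srcs that survive. */
    static List<String> keptSources(String... srcs) {
        List<String> kept = new ArrayList<>();
        DataUriImageFilter filter = new DataUriImageFilter();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                kept.add(a.getValue("src"));
            }
        });
        try {
            for (String src : srcs) {
                AttributesImpl atts = new AttributesImpl();
                atts.addAttribute("", "src", "src", "CDATA", src);
                filter.startElement("", "img", "img", atts);
                filter.endElement("", "img", "img");
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return kept;
    }
}
```

Calling DataUriImageFilter.keptSources("embedded:image1.jpg", "data:image/png;base64,AAAA") returns only the data URI entry. In a real setup the filter would sit in front of whatever ContentHandler is passed to the parser; letting through strictly data: URIs and nothing else would be a small change to the startsWith check.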
dataUri.tar.gz
Description: GNU Zip compressed data