Nobody has replied in 4 days. I see the context of the message was lost - it is

On Thu, Jul 10, 2014 at 6:13 PM, Andrew Skiba wrote:

> Hi Nick,
> It took some time, but I glued it all together, so now it works without
> modifying the Tika sources, using only a custom handler, extractor and parser.
> It works with WordExtractor, although it looks like a dirty hack. As I
> could not override the behavior of WordExtractor, in the handler I ignore
> <img> elements whose src is "embedded:xxx", and let through only images
> whose src is a data URI.
> The problem is that it does not work at all with OOXMLParser, PDFParser, and
> probably others. In the code of these parsers I could not find any recursive
> handling of embedded images similar to the call to
> handleEmbeddedResource in WordExtractor.handlePictureCharacterRun.
> So my questions are:
> 1. Do my handler, parser and extractor do what you meant?
> 2. Did I miss the call to ParsingEmbeddedDocumentExtractor in OOXMLParser?
> I found the img-generating code in XWPFWordExtractorDecorator, but it is
> deep in a call tree of private methods, and XWPFWordExtractorDecorator is
> pretty much hardwired to OOXMLParser via OOXMLExtractorFactory, so I did
> not see an easy way to inject my code.
> Thank you very much.
> Andrew.
> On Wed, Jun 25, 2014 at 12:39 PM, Nick Burch wrote:
>> On Wed, 25 Jun 2014, Andrew Skiba wrote:
>>> Let me check I understand you right. WordExtractor will continue to create
>>> <img src="embedded:filename.jpg"/>
>> Yes, as will (should..) the other parsers which find embedded resources
>>> and call the ImageParser once for every file name.
>> No. It'll call your code, as you'll have registered your code as the
>> EmbeddedDocumentExtractor to call for embedded resources like images.
>> (If there isn't one, then a ParsingEmbeddedDocumentExtractor is used,
>> which calls the default parser, which is how it ends up in ImageParser if
>> you're recursing)
>> Nick
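
The handler-side filtering described in the quoted message (ignore <img>
elements whose src is "embedded:xxx", pass data URI images through) could be
sketched roughly as below. This is a minimal JDK-only illustration and not the
code from the attachment: the class name EmbeddedImgFilter and the helper
filter(String) are hypothetical, and it uses the standard SAX XMLFilterImpl,
whereas in a real Tika setup the same logic would presumably wrap the
ContentHandler passed to the parser (for instance via Tika's
ContentHandlerDecorator).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical SAX filter: drops <img src="embedded:..."> and forwards all other events. */
class EmbeddedImgFilter extends XMLFilterImpl {
    // Greater than zero while we are inside an <img> element that is being dropped.
    private int skipDepth = 0;

    private static boolean isImg(String localName, String qName) {
        return "img".equals(localName) || "img".equals(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0) {          // nested inside a dropped element
            skipDepth++;
            return;
        }
        if (isImg(localName, qName)) {
            String src = atts.getValue("src");
            if (src != null && src.startsWith("embedded:")) {
                skipDepth = 1;        // swallow this element and anything inside it
                return;
            }
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    /** Demo helper (an assumption for this sketch): parse XHTML text, filter it, serialize it. */
    static String filter(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = tf.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        EmbeddedImgFilter filter = new EmbeddedImgFilter();
        filter.setParent(reader);             // events come from the XML parser...
        filter.setContentHandler(serializer); // ...and surviving events go to the serializer
        filter.parse(new InputSource(new StringReader(xhtml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String in = "<p>text<img src=\"embedded:image1.jpg\"/>"
                  + "<img src=\"data:image/png;base64,iVBORw0KGgo=\"/></p>";
        System.out.println(filter(in));
    }
}
```

Note that this only hides the "embedded:" placeholders on the output side. It
does not make parsers such as OOXMLParser or PDFParser emit data URI images,
which is the open question in the message above.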

Attachment: dataUri.tar.gz
Description: GNU Zip compressed data
