On Thu, 10 Jul 2014, Andrew Skiba wrote:
> It took some time, but I glued it all together, so now it works without modifying the Tika sources, using only a custom handler, extractor and parser. It works with WordExtractor, although it looks like a dirty hack. As I could not override the behavior of WordExtractor, in the handler I ignore <img> elements if the src is "embedded:xxx", and let through only images whose src is a data URI.

Hmm, I would've expected that you'd only want to change the embedded: ones, since those are the only ones where you'll be getting the image through as an embedded part, no?

(That's largely what the Alfresco example I pointed you at does: it catches the embedded URLs and rewrites them, while storing the image data somewhere it knows about, and keeps a record so it knows what to rewrite each embedded one to)
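
As a rough sketch of that rewrite idea (this is not the Alfresco code, and it uses only the JDK's SAX classes rather than anything Tika-specific), a filtering handler could look something like the following. The class name EmbeddedImageRewriter and the storedLocations map are made up for illustration; the map is assumed to be filled in by whatever stores the embedded image data.

```java
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper: wraps another ContentHandler and rewrites
// <img src="embedded:NAME"> to a location recorded for NAME.
class EmbeddedImageRewriter extends XMLFilterImpl {
    private static final String EMBEDDED_PREFIX = "embedded:";

    // Assumption: embedded resource name -> place the image data was stored.
    private final Map<String, String> storedLocations;

    EmbeddedImageRewriter(ContentHandler delegate, Map<String, String> storedLocations) {
        setContentHandler(delegate);
        this.storedLocations = storedLocations;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("img".equals(localName)) {
            // Work on a copy so the caller's attributes are left untouched.
            AttributesImpl copy = new AttributesImpl(atts);
            for (int i = 0; i < copy.getLength(); i++) {
                String value = copy.getValue(i);
                if ("src".equals(copy.getLocalName(i)) && value.startsWith(EMBEDDED_PREFIX)) {
                    String name = value.substring(EMBEDDED_PREFIX.length());
                    String target = storedLocations.get(name);
                    if (target != null) {
                        copy.setValue(i, target);
                    }
                    // Unknown names are passed through unchanged.
                }
            }
            atts = copy;
        }
        // XMLFilterImpl forwards the event to the wrapped handler.
        super.startElement(uri, localName, qName, atts);
    }
}
```

All other SAX events pass straight through to the wrapped handler, so only the embedded: image references are affected and any images that already carry a usable src are left alone.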

> The problem is that it does not work at all with OOXMLParser, PDFParser, and probably others. In the code of these parsers I could not find any recursive handling of embedded images similar to the call to handleEmbeddedResource in WordExtractor.handlePictureCharacterRun.

Hmm, I would've expected it to work for .docx in the same way as .doc. We have some "matching" .doc and .docx test files, are you getting different behaviour for them?

Some of the other parsers may need updating to match the pattern from .doc, especially if you're the first person to try to work with embedded images from them...

Nick
