Re: Patch: self-contained HTML using Data URI

Nick Burch Mon, 14 Jul 2014 11:37:21 -0700

On Thu, 10 Jul 2014, Andrew Skiba wrote:

Took some time, but I glued it all together, so now it works without modifying Tika sources, only by using a custom handler, extractor and parser. It works with WordExtractor, although it looks like a dirty hack. As I could not override the behavior of WordExtractor, in the handler I ignore <img> elements if the src is "embedded:xxx", and let through only images whose src is a data URI.

Hmm, I would've expected that you'd only want to change the embedded: ones, since those are the only ones where you'll be getting the image through as an embedded part, no?

(That's largely what the Alfresco example I pointed you at does - it catches the embedded urls and re-writes them, while storing the image data somewhere it knows about + recording what to re-write each embedded one to)
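
As a rough illustration of that catch-and-rewrite idea, here is a minimal sketch using only the JDK's SAX classes. The class name EmbeddedImageRewriter and its addImage/toDataUri methods are made up for this example and are not part of Tika or Alfresco. It also assumes the image bytes have already been registered by the time the matching <img> element arrives; if the embedded part shows up later, the output would need buffering instead.

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper, not part of Tika: rewrites <img src="embedded:NAME">
// to a data URI before passing the event on to the wrapped handler.
class EmbeddedImageRewriter extends XMLFilterImpl {
    private static final String PREFIX = "embedded:";

    // Resource name -> ready-made data URI for that image.
    private final Map<String, String> dataUris = new HashMap<>();

    // Builds "data:<mime>;base64,<payload>" from raw image bytes.
    static String toDataUri(String mimeType, byte[] data) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(data);
    }

    // Records an image so a later embedded: reference can be rewritten.
    void addImage(String name, String mimeType, byte[] data) {
        dataUris.put(name, toDataUri(mimeType, data));
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        String src = atts.getValue("src");
        if ("img".equals(localName) && src != null && src.startsWith(PREFIX)) {
            String replacement = dataUris.get(src.substring(PREFIX.length()));
            if (replacement != null) {
                // Copy the attributes and swap only the src value.
                AttributesImpl copy = new AttributesImpl(atts);
                copy.setValue(copy.getIndex("src"), replacement);
                atts = copy;
            }
        }
        // Unknown or non-embedded images pass through unchanged.
        super.startElement(uri, localName, qName, atts);
    }
}
```

The downstream handler would be attached with setContentHandler(...), and whatever receives the embedded parts would call addImage(...) with the resource name and bytes. Images with any other src are left alone, which matches the point about only changing the embedded: ones.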

The problem is – it does not work at all with OOXMLParser, PDFParser, and probably others. I could not find in the code of these parsers recursive handling of the embedded images, similar to the call to handleEmbeddedResource in WordExtractor.handlePictureCharacterRun

Hmm, I would've expected it to work for .docx in the same way as .doc. We have some "matching" .doc and .docx test files, are you getting different behaviour for them?

Some of the other parsers may need updating to match the pattern from .doc, especially if you're the first person to try to work with embedded images from them...


Nick
