On Thu, 14 Jan 2016, Andreas Beeker wrote:
> POI will have a WMF module (org.apache.poi.hwmf.*) in the next beta. Looking over the govdocs collection, those embedded WMFs might contain interesting information for Tika.
>
> Should the output be part of the embedding document, e.g. ppt, or does it make sense to crawl over various extensions and extract that metadata separately?

I'd suggest a two-step process. First, update the current office parsers (especially HSLF) as needed to expose the embedded WMF files as embedded resources, much as they already do for embedded JPEGs, PNGs, etc.

Next, add a WMF parser that uses HWMF to expose any useful metadata you can find.

Tika will then call the WMF parser for embedded WMFs where requested.
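
To make the flow concrete, here is a rough plain-Java sketch of the two steps. It deliberately uses neither the Tika nor the HWMF API: EmbeddedWmfSketch, parseWmf, dispatch and sampleHeader are hypothetical names, the "image/wmf" key is an assumed media type, and the only metadata read is the bounding box from the optional 22-byte "placeable" header. A real parser would hand the stream to HWMF and implement Tika's Parser interface instead.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class EmbeddedWmfSketch {

    // Magic number that starts a "placeable" WMF header (stored little-endian).
    static final int PLACEABLE_KEY = 0x9AC6CDD7;

    // Step 2 stand-in: a "WMF parser" that only reads the placeable header.
    // Returns an empty map if the data is too short or lacks the header.
    static Map<String, String> parseWmf(byte[] data) {
        Map<String, String> meta = new LinkedHashMap<>();
        if (data.length < 22) {
            return meta;
        }
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt() != PLACEABLE_KEY) {
            return meta;
        }
        buf.getShort();                          // handle, unused here
        short left = buf.getShort();
        short top = buf.getShort();
        short right = buf.getShort();
        short bottom = buf.getShort();
        int unitsPerInch = buf.getShort() & 0xFFFF;
        meta.put("width", String.valueOf(right - left));
        meta.put("height", String.valueOf(bottom - top));
        meta.put("unitsPerInch", String.valueOf(unitsPerInch));
        return meta;
    }

    // Step 1 stand-in: the container parser hands over each embedded resource
    // with its media type, and the matching parser (if any) is called.
    static Map<String, String> dispatch(
            String mediaType,
            byte[] data,
            Map<String, Function<byte[], Map<String, String>>> parsers) {
        Function<byte[], Map<String, String>> parser = parsers.get(mediaType);
        return parser == null ? new LinkedHashMap<>() : parser.apply(data);
    }

    // Builds a synthetic placeable header: 1440 x 720 units at 1440 units/inch.
    // The checksum field is left at zero because this sketch never checks it.
    static byte[] sampleHeader() {
        return ByteBuffer.allocate(22).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(PLACEABLE_KEY)
                .putShort((short) 0)      // handle
                .putShort((short) 0)      // left
                .putShort((short) 0)      // top
                .putShort((short) 1440)   // right
                .putShort((short) 720)    // bottom
                .putShort((short) 1440)   // units per inch
                .putInt(0)                // reserved
                .putShort((short) 0)      // checksum (not validated)
                .array();
    }

    public static void main(String[] args) {
        Map<String, Function<byte[], Map<String, String>>> parsers = new HashMap<>();
        parsers.put("image/wmf", EmbeddedWmfSketch::parseWmf); // assumed media type
        System.out.println(dispatch("image/wmf", sampleHeader(), parsers));
    }
}
```

The point of the split is that the office parsers never need to know anything about WMF: they only surface the bytes, and whatever is registered for that type does the rest.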

Nick
