On Thu, 14 Jan 2016, Andreas Beeker wrote:
POI will have a WMF module (org.apache.poi.hwmf.*) in the next beta.
Looking over the govdocs collection, the embedded WMFs might contain
interesting information for Tika.
Should the output be part of the embedding document, e.g. ppt, or does
it make sense to crawl over various extensions and extract that
metadata separately?
I'd suggest a two-step process. First, update the current office
parsers (especially HSLF) as needed to expose the embedded WMF files as
embedded resources, much as they do for embedded JPEGs, PNGs etc.
Next, add a WMF parser that uses HWMF to expose any useful metadata you
can find.
Tika will then call the WMF parser for embedded WMFs where requested.
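To make the two steps concrete, here is a rough, JDK-only sketch of the flow. It deliberately does not use the Tika or HWMF APIs: EmbeddedResource, parseEmbedded and extractWmfMetadata are hypothetical names made up for illustration. As a stand-in for HWMF it reads the optional 22-byte "placeable" WMF header, which not every WMF file carries, so treat it as an illustration of the dispatch idea rather than a real parser.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these names exist in Tika or POI.
class EmbeddedWmfSketch {

    // Magic number that opens the optional placeable WMF header.
    static final int PLACEABLE_MAGIC = 0x9AC6CDD7;
    static final int PLACEABLE_HEADER_LENGTH = 22;

    /** Step 1 stand-in: an embedded resource as a container parser might expose it. */
    static final class EmbeddedResource {
        final String name;
        final byte[] data;

        EmbeddedResource(String name, byte[] data) {
            this.name = name;
            this.data = data;
        }
    }

    /** True if the bytes start with a complete placeable WMF header. */
    static boolean isPlaceableWmf(byte[] data) {
        if (data.length < PLACEABLE_HEADER_LENGTH) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        return buf.getInt(0) == PLACEABLE_MAGIC;
    }

    /** Step 2 stand-in: read the bounding box and units per inch from the header. */
    static Map<String, Integer> extractWmfMetadata(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        Map<String, Integer> meta = new LinkedHashMap<>();
        // Layout: key (4 bytes), handle (2), left, top, right, bottom (2 each), inch (2).
        meta.put("left", (int) buf.getShort(6));
        meta.put("top", (int) buf.getShort(8));
        meta.put("right", (int) buf.getShort(10));
        meta.put("bottom", (int) buf.getShort(12));
        meta.put("unitsPerInch", buf.getShort(14) & 0xFFFF);
        return meta;
    }

    /** Dispatch: hand each embedded resource that looks like a WMF to the WMF step. */
    static Map<String, Map<String, Integer>> parseEmbedded(List<EmbeddedResource> resources) {
        Map<String, Map<String, Integer>> results = new LinkedHashMap<>();
        for (EmbeddedResource resource : resources) {
            if (isPlaceableWmf(resource.data)) {
                results.put(resource.name, extractWmfMetadata(resource.data));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Build a fake placeable header: 0,0 to 1440,720 at 1440 units per inch.
        ByteBuffer header = ByteBuffer.allocate(PLACEABLE_HEADER_LENGTH)
                .order(ByteOrder.LITTLE_ENDIAN);
        header.putInt(PLACEABLE_MAGIC).putShort((short) 0)
                .putShort((short) 0).putShort((short) 0)
                .putShort((short) 1440).putShort((short) 720)
                .putShort((short) 1440);

        List<EmbeddedResource> embedded = new ArrayList<>();
        embedded.add(new EmbeddedResource("chart.wmf", header.array()));
        embedded.add(new EmbeddedResource("photo.png", new byte[] {1, 2, 3}));

        System.out.println(parseEmbedded(embedded));
    }
}
```

In Tika itself the container parser would hand each embedded stream back to the framework, and the WMF parser would be picked by detected type rather than by a hard-coded check like this one.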
Nick